edit upload recent print


* WikiDocs

Main » Grammars

Context Free Grammars (CFGs)

Grammar comes from the Greek syntaxis, meaning 'setting out together, or arrangement', and refers to the way items are arranged together. Notice we say 'items' instead of 'words' though most dictionary definitions will make some reference to written or spoken language. For our purposes however, we can imagine grammars pertaining to words, paragraphs, musical notes, recorded samples, drawing instructions, images, or even algorithmic processes (programming languages are good examples of grammars, specifically 'context-free' grammars which we'll be talk more about below). We can even use grammars as a means of procedurally defining other grammars, say a 'story' grammar which generates a paragraph grammar, which generates a sentence-level grammar, which generates a word-level grammar, and so on...

So what is a 'context-free' grammar or CFG? In essence, its simply a set of rules (or 'productions') which defines a valid sequence of items. For example purposes I'll refer to this sequence as a 'sentence', but remember that it can be a combination of nearly any type of items. If we are given such a 'sentence', a CFG can tell us whether it is grammatical (that is, syntactically correct) or not -- similar to how a compiler tells us whether a program is 'correct' syntactically. Perhaps more important for our purposes, once we have a valid CFG (we'll discuss what this means in a moment), we can use it to generate new 'sentences' accorded to its rules. Here's a simple context-free grammar for a small fragment of English:

s -> np vp
np -> det n
vp -> v
vp -> v np
det -> 'a'
det -> 'the'
n -> 'woman'
n -> 'man'
v -> 'shoots'

(Its important to note that while we can represent portions of natural languages (like English) using CFGs, they are generally too complex at larger scales. Such 'natural' languages are often called 'context-sensitive'.)

So, what are the ingredients of the little grammar above? Well, first note that it contains three types of symbols:

  • First, there's the -> symbol, which is often used to define the rules.
  • Next there are symbols like these: s, np, vp, det, n, v. These symbols are called non-terminal symbols; we'll soon learn why. Here, s is short for sentence, np is short for noun phrase, vp is short for verb phrase, and det is short for determiner. Each of these symbols is shorthand for a grammatical category, though (again) they can stand for anything you want.
  • Finally there are the symbols in quotations: 'a', 'the', 'woman', 'man', and 'shoots'. A computer scientist might call these terminal symbols, and linguists might call them 'lexical items'; we'll generally just call these words or strings.

Now, the grammar above contains nine rules. A context-free rule (by definition) consists of a single non-terminal symbol, followed by '->', followed by a finite sequence made up of terminal (strings) and/or non-terminal symbols. All nine items listed above have this form, so they are all legitimate context-free rules. What do these rules mean? They tell us how different grammatical categories can be built up.

Read -> as 'can consist of', or 'is built from'. For example, the first rule (generally an 's' for either 'start' or 'sentence') tells us that a sentence (s) can consist of a noun phrase (np) followed by a verb phrase (vp). The third rule tells us that a verb phrase (vp) can consist of a verb (v) followed by a noun phrase (np), while the fourth rule tells us that there is another way to build a verb phrase (vp); simply use a verb (v). The last five rules tell us that 'a' and 'the' are determiners, that 'man' and 'woman' are nouns, and that 'shoots' is a verb.

Here is the same grammar (with one change):

s -> np vp
np -> det n
vp -> v | v np
det -> 'a'
det -> 'the'
n -> 'woman'
n -> 'man'
v -> 'shoots'

Note that where before we had 2 rules that started with 'vp' on the left-side, we now have one. On the right-side of this one 'vp' rule, we find: v np | v, which simply means a verb (v) followed by a noun-phrase (np) OR (shown by a '|' character) a verb (v). So we combined 2 rules into one using the OR operator. This grammar is more compact (8 rules vs. 9 above), but otherwise fully equivalent to the one above. Notice we can also do the same for the 'det' and 'n' rules.

s -> np vp
np -> det n
vp -> v | v np
det -> 'a' | 'the'
n -> 'woman' | 'man'
v -> 'shoots'

So, how can we use such a grammar to generate text?

Well, we simply start from the top (or start-symbol) and follow the rules, make substitutions as we go. Any time we come to an OR symbol, we randomly pick one of the choices (here I'll just take the first choice). So, for example, we might proceed as follows, starting from symbol s.


                       np   vp 

                    det   n      v 

                  'a'  'woman' 'shoots'

Of course, we can repeat this process as many times as we want, making other 'random choices' and getting other variations.

Now lets look at how at the RiGrammar object in RiTa. Here's an example of a RiTa grammar in action... and here's the grammar file.

Ok, so how would we translate the grammar above into the RiTa/RiGrammar format? Give it a try...

our example grammar in RiTa format

Other Types of Grammars

When Noam_Chomsky first formalized generative grammars in 1956, he classified them into types now known as the Chomsky_hierarchy. The difference between these types is that they have increasingly strict production rules and can express fewer formal languages. Some important grammar 'types' include 'context-sensitive' grammars, 'context-free' grammars (see above) and 'regular' grammars. The languages that can be described with such a grammar are called context-sensitive, context-free, and regular languages, respectively. The latter two are the types of grammars most often used in natural_language_processing because they, as opposed to context-sensitive grammars like English or Chinese, can be efficiently implemented in software (see RiTa's RiGrammar object).

Context-free vs. Context-sensitive

Context-free grammars those in which the left hand side of a production rule is formed by a single non-terminal symbol. The following are context-free grammar rules:

    * (A → B); which means replace A with B
    * (A → B); (B → AB); which means replace A with B, then replace B with AB

Context-sensitive grammars are those in which production rules can depend on their neighbors. The following are context-sensitive rules:

    * (A <B> A → A); which says if B is surrounded by two A's, replace B with A
    * (A <B> B → B); which says if B is preceded by A and followed by B, leave B alone

Deterministic vs. Probabilistic

Deterministic grammars (according to the strict definition) have one production rule for each symbol. Probabilistic (or stochastic) grammars may contain several production rules for a symbol, each of which may be selected according to its probability values. For instance, the rules

    * (A → B) with P = 0.75 or (A → AB) with P = 0.25

means that each time A is obtained during the iterative process, a random choice must be made as whether to replace A with B or with AB based on probability P, so that the 1st rule will be selected 3 times as frequently (on average) as the 2nd.

RiTa's RiGrammar object also supports probabilistic grammars, allowing for 'multiplier' tags to be added to any valid grammar with no other changes. An example from the grammar above:

    <v>  | 
    [3] <v> <np> 

This specifies that the second rule, vp → v np is assigned 3 times the probability of the first; that is, a 75% probability vs. 25%.

Another useful tool is the RiGrammarEditor, which allows you to edit your grammar while your program is running... Here is an an example.

Instructor: Daniel C. Howe



Page last modified on March 21, 2010, at 07:56 PM EST