Context-free grammar
In formal language theory, a context-free grammar (CFG), sometimes also called a phrase structure grammar, is a grammar which naturally generates a formal language in which clauses can be nested inside clauses arbitrarily deeply, but where grammatical structures are not allowed to overlap.
The canonical example is matching parentheses: parentheses of different types must open and close correctly inside each other, like this:
 ([ [ [ ()() [ ][ ] ] ]([ ]) ])
As in any context-free grammar, the logical units (the contents of corresponding matched parentheses) nest cleanly.
A language which is not context-free is that of two different types of parentheses, each separately balanced disregarding the other, but where the two types don't nest inside one another, for example:
 [ [ [ [((((( ] ] ] ]))))(([ ))(([ ))([ )( ])( ])( ])
The logical units in this language overlap.
In terms of production rules, every production of a context-free grammar is of the form
 V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). These rewriting rules applied successively produce a parse tree, where the nonterminal symbols are nodes, the leaves are the terminal symbols, and each node expands by the production into the next level of the tree. The tree describes the nesting structure of the expression.
In a context free grammar the left hand side of a production rule is always a single nonterminal symbol. In a general grammar, it could be a string of terminal and/or nonterminal symbols. The term "contextfree" means that the rules by which nonterminals are rewritten do not depend on the surrounding symbols, the context.
Contextfree languages are exactly those which can be understood by a finite state computer with a single infinite stack. In order to keep track of nested units, one pushes the current parsing state at the start of the unit, and one recovers it at the end.
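The stack discipline just described can be sketched directly. Below is a minimal illustration in Python (the function name and the handling of non-bracket characters are our own choices, not part of the formalism), recognizing the matched-parentheses language from the introduction:

```python
def balanced(s: str) -> bool:
    """Recognize well-nested strings of ( ) and [ ] with a single stack,
    the way a pushdown automaton keeps track of nested units.
    Non-bracket characters are ignored; the empty string counts as balanced."""
    pairs = {')': '(', ']': '['}
    stack = []
    for ch in s:
        if ch in '([':
            stack.append(ch)              # entering a nested unit: push state
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False              # close without a matching open
    return not stack                      # every opened unit must be closed

print(balanced('([[[()()[][]]]([])])'))   # the well-nested example above
print(balanced('[[[[((((( ]]]])))))'))    # overlapping units are rejected
```

The push on `(` or `[` corresponds to saving the parsing state at the start of a unit, and the pop to recovering it at the end.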
Context-free grammars play a central role in the description and design of programming languages and compilers. They are also used for analyzing the syntax of natural languages. Noam Chomsky has posited that all human languages are based on context-free grammars at their core, with additional processes that can manipulate the output of the context-free component (the transformations of early Chomskyan theory).
Background
Since the time of Pāṇini, at least, linguists have described the grammars of languages in terms of their block structure, and described how sentences are recursively built up from smaller phrases, and eventually individual words or word elements.
An essential property of these block structures is that logical units never overlap. For example, the sentence:
John, whose blue car was in the garage, walked to the green store.
can be logically parenthesized as follows:
 (John, (whose (blue car) (was (in the garage))), (walked to the (green store))).
Natural human languages never allow for overlapping constructions, like:
 [ John walked to (the store ] is green)
which hypothetically could mean (John walked to the store) and [the store is green] in a Martian version of English. This sentence is ungrammatical to human beings, because it does not parenthesize into non-overlapping units.
The formalism of context-free grammars was developed in the mid-1950s by Noam Chomsky (he called them phrase-structure grammars).^{[1]} Chomsky showed that the non-overlapping recursive constructions with which linguists had described grammar were of a very special type; general grammars could be imagined which would be much more complex. Chomsky believed that this revealed a special propensity of the human mind: that it prefers certain context-free grammars.
A context-free grammar provides a simple and precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity makes the formalism amenable to rigorous mathematical study. Important features of natural language syntax such as agreement and reference are not part of the context-free grammar, but the basic recursive structure of sentences, the way in which clauses nest inside other clauses, and the way in which lists of adjectives and adverbs are swallowed by nouns and verbs, is described exactly.
Block structure was introduced into computer programming languages by the Algol project, which, as a consequence, also featured a context-free grammar to describe the resulting Algol syntax. This became a standard feature of computer languages, and the notation for grammars used in concrete descriptions of computer languages came to be known as Backus-Naur Form, after two members of the Algol language design committee.
The "block structure" aspect that contextfree grammars capture is so fundamental to grammar that the terms syntax and grammar are often identified with contextfree grammar rules, especially in computer science. Formal constraints not captured by the grammar are then considered to be part of the "semantics" of the language.
Contextfree grammars are simple enough to allow the construction of efficient parsing algorithms which, for a given string, determine whether and how it can be generated from the grammar. An Earley parser is an example of such an algorithm, while the widely used LR and LL parsers are more efficient algorithms that deal only with more restrictive subsets of contextfree grammars.
Formal definitions
A context-free grammar G is a 4-tuple:
<math>G = (V\,, \Sigma\,, R\,, S\,)</math> where
1. <math>V\,</math> is a finite set of nonterminal characters or variables. They represent different types of phrase or clause in the sentence. They are sometimes called syntactic categories. Each variable represents a language.
2. <math>\Sigma\,</math> is a finite set of terminals, disjoint from <math>V\,</math>, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar.
3. <math>R\,</math> is a relation from <math>V\,</math> to <math>(V\cup\Sigma)^{*}</math> such that <math>\exists\, w\in (V\cup\Sigma)^{*}: (S,w)\in R</math>. These pairs are called productions or rewrite rules.
4. <math>S\,</math> is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of <math>V\,</math>.
<math>R\,</math> is a finite set. The members of <math>R\,</math> are called the rules or productions of the grammar. The asterisk represents the Kleene star operation.
Additional Definition 1
For any strings <math>u, v\in (V\cup\Sigma)^{*}</math>, we say <math>u\,</math> yields <math>v\,</math>, written as <math>u\Rightarrow v\,</math>, if <math>\exists (\alpha, \beta)\in R, u_{1}, u_{2}\in (V\cup\Sigma)^{*}</math> such that <math>u\,=u_{1}\alpha u_{2}</math> and <math>v\,=u_{1}\beta u_{2}</math>. Thus, <math>\! v</math> is the result of applying the rule <math>\! (\alpha, \beta)</math> to <math>\! u</math>.
Additional Definition 2
For any <math>u, v\in (V\cup\Sigma)^{*}, u\stackrel{*}{\Rightarrow} v</math> (or <math>u\Rightarrow\Rightarrow v\,</math> in some textbooks) if <math>\exists u_{1}, u_{2}, \cdots u_{k}\in (V\cup\Sigma)^{*}, k\geq 0</math> such that <math>u\Rightarrow u_{1}\Rightarrow u_{2}\cdots\Rightarrow u_{k}\Rightarrow v</math>
Additional Definition 3
The language of a grammar <math>G = (V\,, \Sigma\,, R\,, S\,)</math> is the set
 <math>L(G) = \{ w\in\Sigma^{*} : S\stackrel{*}{\Rightarrow} w\}</math>
Additional Definition 4
A language <math>L\,</math> is said to be a context-free language (CFL) if there exists a CFG, <math>G\,</math>, such that <math>L\,=\,L(G)</math>.
Additional Definition 5
A context-free grammar is said to be proper if it has
 no inaccessible symbols: <math>\forall N \in V: \exists \alpha,\beta \in (V\cup\Sigma)^*: S \stackrel{*}{\Rightarrow} \alpha{N}\beta</math>
 no unproductive symbols: <math>\forall N \in V: \exists w \in \Sigma^*: N \stackrel{*}{\Rightarrow} w</math>
 no ε-productions: <math>\forall N \in V, w \in \Sigma^*: (N, w) \in R \Rightarrow w \neq \varepsilon</math>
 no cycles: <math>\neg\exists N \in V: N \stackrel{+}{\Rightarrow} N</math>
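The 4-tuple definition can be made concrete in code. A minimal sketch in Python (the representation below is an illustrative assumption, not standard notation), using the parenthesis-matching grammar from the introduction together with the one-step yields relation of Additional Definition 1:

```python
# A CFG written out as the 4-tuple (V, Sigma, R, S).
V = {'S'}
Sigma = {'(', ')'}
R = [('S', 'SS'), ('S', '(S)'), ('S', '()')]   # pairs (alpha, beta), alpha in V
S = 'S'

def yields(u: str, v: str) -> bool:
    """u => v (Additional Definition 1): v is obtained from u by
    applying one rule of R at some position in u."""
    for alpha, beta in R:
        for i in range(len(u)):
            if u.startswith(alpha, i) and u[:i] + beta + u[i + len(alpha):] == v:
                return True
    return False

print(yields('S', 'SS'))       # one application of S -> SS
print(yields('(S)', '(())'))   # S -> () applied inside the parentheses
print(yields('S', '(())'))     # False: this needs two steps, i.e. =>* not =>
```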
Examples
Example 0
The canonical example of a context-free grammar is parentheses matching, which is representative of the general case. There are two terminal symbols ( and ) and one nonterminal symbol S. The production rules are
 S → SS
 S → (S)
 S → ()
The first rule allows the symbol S to multiply; the second rule allows S to become enclosed by matching parentheses; and the third rule terminates the recursion.
Starting with S, applying the rules, one can construct:
 S → SS → SSS → (S)SS → ((S))SS → ((SS))SS → ((SS))S(S)
 → ((()S))S(S) → ((()()))S(S) → ((()()))()(S)
 → ((()()))()(())
Example 1
A second canonical example is two different kinds of matching nested parentheses, described by the productions:
 S → SS
 S → ()
 S → (S)
 S → []
 S → [S]
with terminal symbols [ ] ( ) and nonterminal S.
Example 2
 S → a
 S → aS
 S → bS
The terminals here are a and b, while the only nonterminal is S. The language described is all nonempty strings of <math>a</math>s and <math>b</math>s that end in <math>a</math>.
This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side.
Every regular grammar corresponds directly to a nondeterministic finite automaton, so we know that this is a regular language.
It is common to list all right-hand sides for the same left-hand side on the same line, using | (the vertical bar) to separate them, like this:
 S → a | aS | bS
This is the same grammar, described more tersely.
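Because the grammar is regular, membership can also be tested with an ordinary regular expression; a small Python sketch (the pattern `[ab]*a` is our own rendering of the rules, not from the text):

```python
import re

def in_language(s: str) -> bool:
    """Membership test for the language of S -> a | aS | bS:
    the nonempty strings over {a, b} that end in a."""
    return re.fullmatch(r'[ab]*a', s) is not None

print(in_language('ba'))   # ends in a
print(in_language('ab'))   # ends in b: rejected
print(in_language(''))     # the empty string is not generated
```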
Example 3
In a context-free grammar, we can pair up characters the way we do with brackets. The simplest example:
 S → aSb
 S → ab
This grammar generates the language <math> \{ a^n b^n : n \ge 1 \} </math>, which is not regular (by the pumping lemma for regular languages).
The special character ε stands for the empty string. By changing the above grammar to
 S → aSb | ε
we obtain a grammar generating the language <math> \{ a^n b^n : n \ge 0 \} </math> instead. This differs only in that it contains the empty string while the original grammar did not.
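A recognizer for this language needs only a single counter, which is the role the stack plays in a pushdown automaton; a hypothetical Python checker (the function name is our own):

```python
def is_anbn(s: str) -> bool:
    """Recognize { a^n b^n : n >= 0 }, the epsilon-variant of the grammar.
    One counter suffices; no finite automaton can check this language."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == 'a' * n + 'b' * n

print(is_anbn('aabb'))   # n = 2
print(is_anbn('aab'))    # unbalanced: rejected
print(is_anbn(''))       # n = 0, generated by S -> ε
```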
Example 4
Here is a context-free grammar for syntactically correct infix algebraic expressions in the variables x, y and z:
 S → x
 S → y
 S → z
 S → S + S
 S → S - S
 S → S * S
 S → S / S
 S → ( S )
This grammar can, for example, generate the string
 ( x + y ) * x - z * y / ( x + x )
as follows:
 S (the start symbol)
 → S - S (by rule 5)
 → S * S - S (by rule 6, applied to the leftmost S)
 → S * S - S / S (by rule 7, applied to the rightmost S)
 → ( S ) * S - S / S (by rule 8, applied to the leftmost S)
 → ( S ) * S - S / ( S ) (by rule 8, applied to the rightmost S)
 → ( S + S ) * S - S / ( S ) (etc.)
 → ( S + S ) * S - S * S / ( S )
 → ( S + S ) * S - S * S / ( S + S )
 → ( x + S ) * S - S * S / ( S + S )
 → ( x + y ) * S - S * S / ( S + S )
 → ( x + y ) * x - S * S / ( S + S )
 → ( x + y ) * x - S * y / ( S + S )
 → ( x + y ) * x - S * y / ( x + S )
 → ( x + y ) * x - z * y / ( x + S )
 → ( x + y ) * x - z * y / ( x + x )
Note that many choices were made along the way as to which S was going to be rewritten next. These choices look quite arbitrary, and as a matter of fact, they are. Also, many choices were made as to which rule to apply to the selected S. These do not look so arbitrary: they usually affect which terminal string comes out at the end.
To see this we can look at the parse tree of this derivation:
                S
                |
               /|\
              S - S
             /     \
           /|\     /|\
          S * S   S / S
         /   |    |    \
       /|\   x   /|\   /|\
      ( S )     S * S ( S )
        |       |   |   |
       /|\      z   y  /|\
      S + S           S + S
      |   |           |   |
      x   y           x   x
Starting at the top, an S in the tree is expanded step by step, until no more unexpanded S's remain. Picking a different order of expansion will produce a different derivation, but the same parse tree. The parse tree will only change if we pick a different rule to apply at some position in the tree. But can a different parse tree still produce the same terminal string, which is ( x + y ) * x - z * y / ( x + x ) in this case? Yes, for this grammar, this is possible. Grammars with this property are called ambiguous.
For example, x + y * z can be produced with these two different parse trees:
        S                S
       /|\              /|\
      S * S            S + S
     /     \          /     \
   /|\      z        x      /|\
  x + y                    y * z
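The two trees matter in practice: substituting numeric values for x, y and z (the values below are arbitrary assumptions for illustration) and evaluating each tree shows that the two groupings yield different results.

```python
# The two parse trees for "x + y * z", written as nested tuples
# (operator, left, right), with x = 2, y = 3, z = 4.
tree_mul_at_root = ('*', ('+', 2, 3), 4)   # groups as (x + y) * z
tree_add_at_root = ('+', 2, ('*', 3, 4))   # groups as x + (y * z)

def evaluate(t):
    """Evaluate a tuple-encoded parse tree bottom-up."""
    if not isinstance(t, tuple):
        return t
    op, left, right = t
    l, r = evaluate(left), evaluate(right)
    return l + r if op == '+' else l * r

print(evaluate(tree_mul_at_root))   # 20
print(evaluate(tree_add_at_root))   # 14
```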
However, the language described by this grammar is not ambiguous: an alternative, unambiguous grammar can be given for the language, for example:
 T → x
 T → y
 T → z
 S → S + T
 S → S - T
 S → S * T
 S → S / T
 T → ( S )
 S → T
(once again picking S as the start symbol).
Example 5
A context-free grammar for the language consisting of all strings over {a,b} containing an unequal number of a's and b's:
 S → U | V
 U → TaU | TaT
 V → TbV | TbT
 T → aTbT | bTaT | ε
Here, the nonterminal T can generate all strings with the same number of a's as b's, the nonterminal U generates all strings with more a's than b's and the nonterminal V generates all strings with fewer a's than b's.
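The claim can be spot-checked by exhaustively enumerating every short string the grammar derives; a brute-force Python sketch (the length cap and the breadth-first strategy are implementation choices, not part of the grammar):

```python
from collections import deque

# Productions of the grammar above; '' stands for ε.
rules = {
    'S': ['U', 'V'],
    'U': ['TaU', 'TaT'],
    'V': ['TbV', 'TbT'],
    'T': ['aTbT', 'bTaT', ''],
}

def derive(max_len):
    """Collect every terminal string of length <= max_len derivable from S,
    by breadth-first search over leftmost rewriting."""
    seen, out, queue = {'S'}, set(), deque(['S'])
    while queue:
        form = queue.popleft()
        nt = next((c for c in form if c in rules), None)
        if nt is None:               # no nonterminal left: a terminal string
            out.add(form)
            continue
        i = form.index(nt)
        for beta in rules[nt]:
            new = form[:i] + beta + form[i + 1:]
            # T may erase, so ignore T's when bounding the final length
            if len(new.replace('T', '')) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return out

strings = derive(6)
print(all(s.count('a') != s.count('b') for s in strings))   # True
```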
Example 6
Another example of a non-regular language is <math> \{ b^n a^m b^{2n} : n \ge 0, m \ge 0 \} </math>. It is context-free as it can be generated by the following context-free grammar:
 S → bSbb | A
 A → aA | ε
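A direct membership test for this language only has to compare the lengths of the two b-blocks; a hypothetical Python checker (the function name and the handling of the m = 0 case are our own choices):

```python
import re

def in_lang(s: str) -> bool:
    """Membership test for { b^n a^m b^(2n) : n >= 0, m >= 0 }."""
    if set(s) - {'a', 'b'}:
        return False
    if 'a' not in s:                 # b^n b^(2n) collapses to b^(3n)
        return len(s) % 3 == 0
    m = re.fullmatch(r'(b*)(a+)(b*)', s)
    return m is not None and 2 * len(m.group(1)) == len(m.group(3))

print(in_lang('bbabbbb'))   # n = 2, m = 1
print(in_lang('bab'))       # one trailing b, but 2n = 2 are required
print(in_lang('aaa'))       # n = 0, m = 3
```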
Other examples
The formation rules for the terms and formulas of formal logic fit the definition of context-free grammar, except that the set of symbols may be infinite and there may be more than one start symbol.
Context-free grammars are not limited in application to mathematical ("formal") languages. For example, it has been suggested that a class of Tamil poetry called Venpa is described by a context-free grammar.^{[2]}
Derivations and syntax trees
There are two common ways to describe how a given string can be derived from the start symbol of a given grammar. The simplest way is to list the consecutive strings of symbols, beginning with the start symbol and ending with the string, and the rules that have been applied. If we introduce a strategy such as "always replace the leftmost nonterminal first", then for context-free grammars the list of applied grammar rules is by itself sufficient. This is called the leftmost derivation of a string. For example, if we take the following grammar:
 (1) S → S + S
 (2) S → 1
 (3) S → a
and the string "1 + 1 + a" then a left derivation of this string is the list [ (1), (1), (2), (2), (3) ]. Analogously the rightmost derivation is defined as the list that we get if we always replace the rightmost nonterminal first. In this case this could be the list [ (1), (3), (1), (2), (2)].
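A leftmost derivation is easy to replay mechanically; a small Python sketch for the grammar above (the dict encoding of the rules is an illustrative assumption; spaces are omitted from the strings):

```python
# Rules: (1) S -> S + S, (2) S -> 1, (3) S -> a.
rules = {1: 'S+S', 2: '1', 3: 'a'}

def leftmost_derivation(rule_sequence, start='S'):
    """Apply each rule in turn to the leftmost nonterminal S."""
    form, steps = start, []
    for r in rule_sequence:
        i = form.index('S')                      # leftmost occurrence
        form = form[:i] + rules[r] + form[i + 1:]
        steps.append(form)
    return steps

for step in leftmost_derivation([1, 1, 2, 2, 3]):
    print(step)   # S+S, S+S+S, 1+S+S, 1+1+S, 1+1+a
```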
The distinction between leftmost derivation and rightmost derivation is important because in most parsers the transformation of the input is defined by giving a piece of code for every grammar rule that is executed whenever the rule is applied. Therefore it is important to know whether the parser determines a leftmost or a rightmost derivation, because this determines the order in which the pieces of code will be executed. See LL parsers and LR parsers for examples.
A derivation also imposes in some sense a hierarchical structure on the string that is derived. For example, if the string "1 + 1 + a" is derived according to the leftmost derivation:
 S → S + S (1)
 → S + S + S (1)
 → 1 + S + S (2)
 → 1 + 1 + S (2)
 → 1 + 1 + a (3)
the structure of the string would be:
 { { { 1 }_{S} + { 1 }_{S} }_{S} + { a }_{S} }_{S}
where { ... }_{S} indicates a substring recognized as belonging to S. This hierarchy can also be seen as a tree:
         S
        /|\
       S '+' S
      /|\     \
     S '+' S  'a'
     |     |
    '1'   '1'
This tree is called a concrete syntax tree (see also abstract syntax tree) of the string. In this case the presented leftmost and the rightmost derivations define the same syntax tree; however, there is another (leftmost) derivation of the same string
 S → S + S (1)
 → 1 + S (2)
 → 1 + S + S (1)
 → 1 + 1 + S (2)
 → 1 + 1 + a (3)
and this defines the following syntax tree:
        S
       /|\
      S '+' S
      |     \
     '1'    /|\
           S '+' S
           |     |
          '1'   'a'
If, for certain strings in the language of the grammar, there is more than one parse tree, then the grammar is said to be an ambiguous grammar. Such grammars are usually hard to parse because the parser cannot always decide which grammar rule it has to apply. Usually, ambiguity is a feature of the grammar, not the language, and an unambiguous grammar can be found which generates the same context-free language. However, there are certain languages which can only be generated by ambiguous grammars; such languages are called inherently ambiguous.
Normal forms
Every context-free grammar that does not generate the empty string can be transformed into one in which no rule has the empty string as a product (a rule with ε as a product is called an ε-production). If it does generate the empty string, it will be necessary to include the rule <math>S \rightarrow \varepsilon</math>, but there need be no other ε-production. Every context-free grammar with no ε-production has an equivalent grammar in Chomsky normal form or Greibach normal form. "Equivalent" here means that the two grammars generate the same language.
Because of the especially simple form of production rules in Chomsky normal form grammars, this normal form has both theoretical and practical implications. For instance, given a context-free grammar, one can use the Chomsky normal form to construct a polynomial-time algorithm which decides whether a given string is in the language represented by that grammar or not (the CYK algorithm).
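The CYK algorithm itself fits in a few lines once the grammar is in Chomsky normal form; a Python sketch (the CNF grammar below, generating { a^n b^n : n ≥ 1 } via S → AB | AC, C → SB, A → a, B → b, is our own illustrative choice):

```python
from itertools import product

# A CNF grammar split into unit productions (X -> terminal) and
# binary productions (X -> Y Z).
unit = {'a': {'A'}, 'b': {'B'}}
binary = {('A', 'B'): {'S'}, ('A', 'C'): {'S'}, ('S', 'B'): {'C'}}

def cyk(s: str) -> bool:
    n = len(s)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals deriving the substring s[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][1] = set(unit.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for X, Y in product(table[i][split], table[i + split][length - split]):
                    table[i][length] |= binary.get((X, Y), set())
    return 'S' in table[0][n]

print(cyk('aabb'))   # True
print(cyk('aab'))    # False
```

The triple loop over span length, start position, and split point is what gives the algorithm its polynomial (cubic) running time.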
Undecidable problems
Some questions that are undecidable for wider classes of grammars become decidable for context-free grammars; e.g. the emptiness problem (whether the grammar generates any terminal strings at all) is undecidable for context-sensitive grammars, but decidable for context-free grammars.
Still, many problems remain undecidable. Examples:
Universality
Given a CFG, does it generate the language of all strings over the alphabet of terminal symbols used in its rules?
A reduction can be demonstrated to this problem from the well-known undecidable problem of determining whether a Turing machine accepts a particular input (the halting problem). The reduction uses the concept of a computation history, a string describing an entire computation of a Turing machine. We can construct a CFG that generates all strings that are not accepting computation histories for a particular Turing machine on a particular input; thus this CFG generates all strings if and only if the machine does not accept that input.
Language equality
Given two CFGs, do they generate the same language?
The undecidability of this problem is a direct consequence of the previous: we cannot even decide whether a CFG is equivalent to the trivial CFG defining the language of all strings.
Language inclusion
Given two CFGs, can the first generate all strings that the second can generate?
Being in a lower level of the Chomsky hierarchy
Given a context-sensitive grammar, does it describe a context-free language? Given a context-free grammar, does it describe a regular language?
Each of these problems is undecidable.
Extensions
An obvious way to extend the context-free grammar formalism is to allow nonterminals to have arguments, the values of which are passed along within the rules. This allows natural language features such as agreement and reference, and programming language analogs such as the correct use and definition of identifiers, to be expressed in a natural way. For example, we can now easily express that in English sentences, the subject and verb must agree in number.
In computer science, examples of this approach include affix grammars, attribute grammars, indexed grammars, and Van Wijngaarden two-level grammars.
Similar extensions exist in linguistics.
Another extension is to allow additional terminal symbols to appear at the left-hand side of rules, constraining their application. This produces the formalism of context-sensitive grammars. Allowing additional nonterminal symbols there as well produces arbitrary (formal) grammars.
Restrictions
Some important subclasses of the context-free grammars are:
 Deterministic grammars, including the LR(k), LALR(k) and LL(k) grammars. These classes are important in parsing: they allow string recognition to proceed deterministically, i.e. without backtracking.
 Simple grammars, a subclass of the LL(1) grammars, mostly interesting for the theoretical property that language equality of simple grammars is decidable, while language inclusion is not.
 Grammars whose terminal symbols are divided into left and right bracket pairs that always match up in rules.
 Linear grammars, in which every right-hand side of a rule has at most one nonterminal.
 Regular grammars, a subclass of the linear grammars which describes the regular languages, i.e. they correspond to finite automata and regular expressions.
Linguistic applications
Chomsky initially hoped to overcome the limitations of context-free grammars by adding transformation rules.^{[1]}
Such rules are another standard device in traditional linguistics; e.g. passivization in English. Much of generative grammar has been devoted to finding ways of refining the descriptive mechanisms of phrase-structure grammar and transformation rules such that exactly the kinds of things can be expressed that natural language actually allows. Allowing arbitrary transformations doesn't meet that goal: they are much too powerful, being Turing complete unless significant restrictions are added (e.g. no transformations that introduce and then rewrite symbols in a context-free fashion).
Chomsky's general position regarding the non-context-freeness of natural language has held up since then,^{[3]} although his specific examples regarding the inadequacy of context-free grammars (CFGs) in terms of their weak generative capacity were later disproved.^{[4]} Gerald Gazdar and Geoffrey Pullum have argued that despite a few non-context-free constructions in natural language (such as cross-serial dependencies in Swiss German^{[3]} and reduplication in Bambara^{[5]}), the vast majority of forms in natural language are indeed context-free.^{[4]}
See also
 Context-sensitive grammar
 Formal grammar
 Parsing
 Parsing expression grammar
 Stochastic context-free grammar
 Algorithms for context-free grammar generation
Notes
 ↑ ^{1.0} ^{1.1} Chomsky, Noam (Sep 1956). "Three models for the description of language". IRE Transactions on Information Theory 2 (3): 113–124. http://ieeexplore.ieee.org/iel5/18/22738/01056813.pdf. Retrieved 2007-06-18.
 ↑ L, BalaSundaraRaman; Ishwar.S, Sanjeeth Kumar Ravindranath (2003-08-22). "Context Free Grammar for Natural Language Constructs - An implementation for Venpa Class of Tamil Poetry". Proceedings of Tamil Internet, Chennai, 2003. International Forum for Information Technology in Tamil. pp. 128–136. http://citeseer.ist.psu.edu/balasundararaman03context.html. Retrieved 2006-08-24.
 ↑ ^{3.0} ^{3.1} Shieber, Stuart (1985). "Evidence against the context-freeness of natural language". Linguistics and Philosophy 8: 333–343. http://www.eecs.harvard.edu/~shieber/Biblio/Papers/shieber85.pdf.
 ↑ ^{4.0} ^{4.1} Pullum, Geoffrey K.; Gerald Gazdar (1982). "Natural languages and context-free languages". Linguistics and Philosophy 4: 471–504.
 ↑ Culy, Christopher (1985). "The Complexity of the Vocabulary of Bambara". Linguistics and Philosophy 8: 345–351.
References
 Chomsky, Noam (Sept. 1956). "Three models for the description of language". Information Theory, IEEE Transactions 2 (3).
Further reading
 Michael Sipser (1997). Introduction to the Theory of Computation. PWS Publishing. ISBN 0-534-94728-X. Section 2.1: Context-Free Grammars, pp. 91–101. Section 4.1.2: Decidable problems concerning context-free languages, pp. 156–159. Section 5.1.1: Reductions via computation histories, pp. 176–183.