The class of deterministic context-free grammars has proven to be very useful in common computer applications. These include compilers, as well as any other application in which one language needs to be translated into another language. The utility of the class of deterministic context-free grammars stems from the fact that they are expressive enough for a wide variety of applications, yet they are also easily and efficiently parsed. Some of the properties of this class of grammars and its common subclasses are considered here.

There are two major categories of context-free grammar that are of interest. These are the LL and the LR classes of grammars. These classes of grammars are distinguished by the methods that are used to parse them. The LL grammars are sometimes referred to as top-down grammars because they are parsed from the top to the bottom. The LR grammars are parsed from the bottom to the top, and so are called bottom-up grammars. Each of these classes of grammar will be described primarily from the perspective of the parsing methods that are applied to them.

A context-free grammar consists of a set of nonterminals, a set of terminals, a set of productions, and a single start symbol from the set of nonterminals. The nonterminals in a grammar can be thought of as variables. This is because during the course of parsing a grammar, the nonterminals are subject to change. The terminal symbols can be thought of as being the constants in a context-free grammar. The terminal symbols cannot be changed because they constitute the vocabulary of the language that is derived from the grammar. This language consists of all strings of terminal symbols that can be formed using the rules of the grammar. The language of a grammar G is referred to as L(G). A definition of a context-free grammar follows.

Definition: Acontext-free grammarG is described as G = ( N, T, P, S ) where N is a finite set of nonterminal symbols. T is a finite set of terminal symbols, disjoint from N. P is a finite set of productions of the form A -->alphawhere AinNalphain( NunionT )^{*}S is a unique start symbol from N.

The set of productions constitutes the rules of the grammar. These are sometimes referred to as rewriting rules because they specify how a nonterminal can be rewritten, or expanded into a string of symbols from the set

( NConsider the following grammar:unionT )^{*}

A --> B c D Grammar 2.1 B --> D --> d

The first production in grammar 2.1 describes a way of rewriting nonterminal A. The rule says that nonterminal A can be rewritten in the form "BcD". Several conventions are implied in this grammar. The capital letters are taken to be nonterminal symbols, and the lowercase letters represent terminal symbols in the grammar. By convention, the left hand side nonterminal of the first production in the grammar is assumed to be the start symbol of the grammar. These conventions are useful because they make it possible to completely specify a grammar simply by listing its productions. The following can be inferred from the listing of grammar 2.1 above:

N = { A, B, D } T = { c, d } S = A

A reduced grammar is one in which all of the nonterminals are able to participate in the derivation of some string in the language of the grammar. This can be represented symbolically as:

S ==>for all nonterminals "A" in the grammar. The double arrow represents the application of a production in the grammar. The star has its usual meaning of zero or more occurrences of something, in this case productions. The statement is read "S derives^{*}alphaAbeta==>^{*}w

Symbols in N: A, B, C,... Symbols in T: a, b, c,... Symbols in NunionT: U, V, W,... Strings in (NunionT)^{*}:alpha,beta,gamma,... Strings in T^{*}: u, v, w,...

Figure 2.1 Symbolic Conventions for Context-free Grammars

Definition: Aderivationin a grammar is a sequence of applications of the productions of the grammar. If A -->deltais a production in a grammar, thenalphaAbeta==>alphadeltabetais a derivation. The symbols ==>^{*}and ==>^{+}indicate a derivation using zero or more and one or more productions, respectively.

Derivations of strings in the language of a grammar can also be represented graphically. These are referred to as parse trees. A definition of a parse tree follows.

Definition: Aparse treeis a tree that illustrates a derivation in a grammar G = ( N, T, P, S ). The nodes of the parse tree are described as follows. 1. The root is labeled with the start symbol of the grammar, S. 2. Interior nodes are labeled with a nonterminal. 3. The children of a node are related to their parent by some production in P. The parent is the left hand side of the production. The children are the right hand side symbols of the production, and follow the left to right ordering of the production. 4. Leaves are labeled with a terminal orepsilonA leftmost derivation corresponds to a preorder traversal of a parse tree. A rightmost derivation corresponds to a postorder traversal of a parse tree, in reverse order.

A context-free grammar is said to be ambiguous if a string in the language of the grammar can be represented by two or more different parse trees. In many cases, symmetry in a production results in ambiguity in the grammar. This point is illustrated in the following grammar:

A --> A B A Grammar 2.2 A --> a B --> b

Grammar 2.2 derives strings of the form "ababa..." and is ambiguous. This can more easily be seen by examining the graphical representation of two derivations in the grammar. Two different parse trees for the same string in the language of Grammar 2.2 are shown in figure 2.2. The corresponding production sequences for the derivations for the two parse trees in figure 2.2 are listed in figure 2.3.

A | ----------------------- | | | A B A | | | --------------- | | | | | | | A B A | | | | | | | a b a b a A | ------------------------ | | | A B A | | | | | --------------- | | | | | | | A B A | | | | | a b a b a

Figure 2.2 Two Parse Trees for "ababa"

Derivation 1: A --> A B A A --> A B A A --> a B --> b A --> a B --> b A --> a Derivation 2: A --> A B A A --> a B --> b A --> A B A A --> a B --> b A --> a

Figure 2.3 Production Sequences For Two Derivations of "ababa"

Ambiguity is generally considered to be an undesirable property of a context-free grammar. The possibility of having to consider more than one derivation for the same string in a language greatly complicates the parsing process. However, a certain amount of controlled ambiguity can be useful in a grammar. The IF-ELSE ambiguity of some programming languages is a good example of this. Here the parser is made to be deterministic by augmenting it with a rule that specifies that an ELSE token is always associated with the closest enclosing IF statement. All of the classes of grammars that are considered here are deterministic, and therefore cannot be ambiguous.

The methods for deterministic parsing that are commonly used today for applications such as compilers were first introduced in the 1960's. Top-down, or predictive parsing is based on the properties of a class of grammars known as LL(k). There are several variants of the LL(k) class of grammars. One such class is known as the simple LL(1) grammars. The primary distinguishing feature of the simple LL(1) grammars is that they do not include null productions. Korenjak and Hopcroft (1966) were the first to investigate the simple LL(1) grammars. This work was expanded by Lewis and Stearns (1968) who defined the more general class of LL(k) grammars, and their properties. Rosenkrantz and Stearns (1970) further developed the theory of LL(k) grammars. Other early contributors to the theory of deterministic top-down grammars are Knuth (1967), Culik (1968), Kurki-Suonio (1969), and Wood (1969, 1970). Lewis and Rosenkrantz (1971) describe an ALGOL compiler that is based on an LL(1) grammar. Aho and Ullman (1972) is an early monograph on the theory of parsing that provides extensive coverage of LL(k) parsing. An updated treatment can be found in the two volume work on parsing theory by Sippu and Soisalon-Soininen (1988, 1990).

The terminology of top-down grammars is strongly tied to the methods by which they are analyzed and parsed. In the term LL(k), the first "L" means that the input is scanned from left to right. The second "L" implies that a leftmost parse derivation is constructed. The term top-down comes from the fact that the nodes of the parse tree are discovered from the root downwards toward the leaves. Another way of viewing this process is to say that a top-down parse corresponds to a preorder traversal of the parse tree. The "k" in the term LL(k) means that only the next k input symbols are used to make each parsing decision. A parsing decision in the case of top-down grammars is the act of choosing which alternate production will be used to expand the current nonterminal symbol that has been encountered in the course of the parse. This choice can be viewed as a prediction, and so the term predictive parsing is often applied to top-down grammars. That is, given the current state of the parse, the top-down grammars have the property that the next parsing action can be predicted without the delay that is sometimes associated with other methods.

It is often necessary to specify that a particular derivation in a grammar is a leftmost derivation. This is done by adding an "lm" subscript to the derivation arrow as follows.

S ==>This represents a partial specification of the second derivation for Grammar 2.2 that is given in figure 2.2 and figure 2.3. The specification states that the string abABA is an intermediate stage in a leftmost derivation that results in the string ababa. figure 2.4 shows graphically how the parse tree for this particular derivation is constructed one stage at a time._{lm}^{*}abABA ==>_{lm}^{*}ababa

A --> A | ------------------------ | | | A B A A --> A | | ------------------------ ------------------------ | | | | | | A B A A B A | | | a a b A --> A | | ------------------------ ------------------------ | | | | | | A B A A B A | | | | | | | | --------------- | | --------------- | | | | | | | | | | | | A B A | | A B A | | | | | a b a b a A --> A | | ------------------------ ------------------------ | | | | | | A B A A B A | | | | | | | | --------------- | | --------------- | | | | | | | | | | | | A B A | | A B A | | | | | | | | | | a b a b a b a b a

Figure 2.4 Top-down Parse Derivation for "ababa"

The preorder parse tree traversal of top-down parsing lends itself well to a recursive implementation. The recursion can be embedded in the programming language, or the recursion can be explicit. When the recursive feature of a programming language is used, the method is referred to as recursive descent parsing. This implementation of top-down parsing usually defines a procedure in the programming language for each nonterminal in the grammar. These procedures call each other as specified by the production hierarchy of the grammar. The parse begins with a call to the start symbol procedure, and continues until all of the input string has been processed. Programming language recursion comes into play when a procedure calls itself. This will happen any time the left hand side nonterminal of a production also appears on the right hand side of the production. Indirect recursion occurs when a nonterminal can derive itself through a series of productions in some derivation in the grammar.

Top-down parsing can also be implemented using an explicit stack to keep track of the recursion. In this case, the method is referred to as table-driven top-down parsing. Here a parse table guides the actions of the parsing routine. The parse begins by pushing the start symbol of the grammar onto the parse stack. Then the stack symbols are popped one at a time until the stack is empty, or the input string has been exhausted. As each symbol is popped from the stack, its type is determined. If the symbol is a nonterminal, the parse table is consulted to see what production should be pushed back onto the stack. If the symbol is a terminal, it is matched against the input, and the parse continues. Errors are detected when the stack symbol does not match the input, or when the parse table does not contain a valid production to be used in the current parsing context.

The theory of bottom-up parsing is based on the LR(k) class of grammars. These were first defined by Knuth (1965). Improvements to the LR parsing techniques were made by DeRemer (1969), Korenjak (1969), and Aho and Ullman (1971). The theory of LR(k) parsing was further developed by Aho and Ullman (1972, 1973). Much of the terminology of LR(k) parsing that is used today can be attributed to them. Extensive treatments of the theory of LR(k) parsing can also be found in Aho, Sethi, and Ullman (1985) and in Sippu and Soisalon-Soininen (1990).

In the term LR(k), the "L" means that the input is scanned from left to right. The "R" means that a rightmost parse is produced. The term bottom-up comes from the fact that the parse tree is constructed from the bottom to the top. That is, the leaves of the tree are recognized first, then their direct parents are recognized, and so on up the hierarchy of the parse tree. Another way of looking at bottom-up parsing is to say that it follows a postorder traversal of the parse tree. In many cases, the same string can be analyzed from the top down using the LL methods, and from the bottom up using the LR methods of parsing. For example, the second derivation of string ababa that is given in figure 2.3 would be as follows if considered from the bottom up:

A --> a B --> b A --> a A --> A B A B --> b A --> a A --> A B AThis bottom-up parse is shown graphically in figure 2.5.

A --> B A | | | a b a A | --------------- | | | A B A --> A B A | | | | | | a b a a b a B A --> A B A | | | | | | --------------- | | --------------- | | | | | | | | | | A B A | | A B A | | | | | | | | | b a b a a b a b a A | ------------------------ | | | A B A | | | | | --------------- | | | | | | | A B A | | | | | a b a b a

Figure 2.5 Bottom-up Parse of "ababa"

The technique for accomplishing bottom-up parsing is known as shift-reduce. Shift-reduce parsing uses a finite state machine along with a stack. A state encodes the current symbol being parsed and all of the possible valid parse continuations, given the current parsing context. In shift-reduce parsing, state numbers are shifted onto a stack until the right hand side of a production in the grammar is recognized. Those state numbers are then reduced to a single state that represents the left hand side symbol of the production in the current parsing context. This process continues until the input is exhausted, and the start symbol of the grammar is recognized. The shift-reduce parsing algorithm is shown in figure 2.6.

ShiftReduce() { State state; Token token; Production production; token = Scan(); Push ( StartState ); loop { state = Top(); switch ( Action ( state, token, &production ) ) { case SHIFT: Push ( GoTo ( state, token ) ); token = Scan(); break; case REDUCE: Pop ( production.length ); state = Top(); Push ( GoTo ( state, production.lhs ) ); break; case ACCEPT: return; default: ProcessError(); } } }

Figure 2.6 Shift-reduce Parsing Algorithm

In the shift-reduce algorithm, the GoTo function gives the next state that is to be pushed onto the stack. The Action function tells the parser what action to take given the current state and the next input token. If the action is to shift, the GoTo function is invoked using the current state and the input token as arguments. It returns the next state to be pushed back onto the stack. In this case the input must be scanned to get the next token. If the action is to reduce, the Action function must also specify the production that is being reduced. The parser then pops the states corresponding to that production from the stack. The next state is then obtained from the GoTo function using the new current state that is now at the top of the stack. In the case of a reduce action, the second argument to the GoTo function is the left hand side nonterminal of the production that is being reduced, rather than the input token. The third type of action that can be specified is to accept the input, meaning that a successful parse has been completed.

The LR(k) class of grammars is of particular note because it generates exactly the class of deterministic context-free languages. This class of languages has proven to be the most useful for common computer science applications, such as compilers. This means that the bottom-up parsing technique is the most powerful one that is in general usage. The additional power that is possessed by the bottom-up parsing method stems from its ability to postpone parsing decisions. This is done by accumulating information on the stack until the correct action to take can be determined. In bottom-up parsing, the parse decision is the reduction of a production. This corresponds to the top-down parsing decision of predicting the next production from the alternates of a given nonterminal. The difference is in the direction of parsing. In the top-down method, the left hand side of the production is expanded into the right hand side. In bottom-up parsing, the right hand side of the same production is condensed into its left hand side nonterminal. In essence, the bottom-up method gets to see the entire production before choosing it, whereas the top-down method must predict which production to use given only a much more limited amount of information.

A detailed comparison of the bottom-up and the top-down parsing methodologies is difficult without choosing specific variants of the two techniques. The most widely used variant of top-down parsing has traditionally been the LL(1) method. The LL(1) class of grammar is an easily-parsable subclass of the LL(k) grammars in which the amount of lookahead that is used to make parsing decisions is a single terminal symbol. The most commonly used class of bottom-up grammars is the LALR(1) variant of the LR(k) class of grammars. As with LL(1), only a single token of lookahead is used at a time in LALR(1) parsing. The main advantage of the LALR(1) technique over the more general LR(1) technique is that it requires much less space on average. The space reduction is achieved by coalescing some of the states of the LR(1) parser. The space saving is considered to be well worth the relatively minor loss of expressivity in the resulting parser. Some points of comparison between the LL(1) and the LALR(1) parsing methods as summarized in Fischer and LeBlanc (1991) are:

1. Simplicity 2. Generality 3. Action Symbols 4. Error Repair 5. Table Sizes

The parsing algorithms that are used in both the top-down and the bottom-up techniques are fairly simple, and easy to understand. However, the bottom-up method of analyzing a grammar for the purpose of constructing a parser is much more complicated than that of the top-down methods. In the bottom-up grammar analysis, a large number of state sets must be constructed and manipulated. This process can be automated, but its relationship to the actual parsing process is somewhat obscured in the details of the construction. This can become a problem even when using an automated tool if the grammar needs to be debugged, as is often the case. On the other hand, the top-down analysis of a grammar generally follows the more familiar concept of recursion. This tends to make it easier to trace a derivation in the grammar in order to see where a problem may lie. The top-down grammars have the advantage over the bottom-up grammars in the area of simplicity.

If the most powerful classes of top-down and bottom-up grammars are compared for their expressive power, the results are clear. The LL(k) grammars are well known to be a proper subset of the LR(k) grammars. Correspondingly, the LL(k) languages are a proper subset of the LR(k) languages. This gives a decided advantage to the bottom-up parsing techniques in terms of expressive power. In addition, most programming language definitions come complete with a reference grammar that is in LALR(1) form, or very nearly so. The process of translating this grammar to LL(1) form is not trivial. Producing an LL(1) grammar from an LALR(1) specification also can result in a less readable grammar due to the restrictions placed on LL(1) grammars that do not apply to LALR(1) grammars. The bottom-up grammars have the advantage over the top-down grammars in the area of generality.

The placement of action symbols is much more flexible in top-down grammars than in bottom-up grammars. Action symbols can be placed at any point on the right hand side of a production in a top-down grammar. Bottom-up grammars require that actions be placed only at the end of a production. The reason for this is that an action cannot be taken until a production is reduced in bottom-up parsing. Up until that point, the bottom-up algorithm is considering several different productions in parallel. Since this is the case, it is not possible to implement an intermediate action because the final production that will be chosen is unknown until it is actually recognized, and reduced. Some parser generators circumvent this problem by creating new productions whenever an action is encountered. However, the advantage still belongs to top-down parsers in terms of action placement.

Error detection and repair are areas in which the top-down parsers have a clear advantage over the bottom-up parsers. The predictive nature of top-down parsing allows errors to be detected at the earliest possible time. The generality of the bottom-up method is a hindrance in this respect. The postponement of parsing decisions also can tend to delay the detection of errors. The top-down parse stack can be used as an aid in error repair because it contains information on what is expected to be seen. In contrast, the bottom-up parse stack only contains information pertaining to what has already been encountered in the parsing process. This makes it more difficult to decide how to proceed if an error is detected.

In the worst case, the size of an LALR(1) parser can be exponential in the size of the grammar. This is not the case with LL(1) parsers whose size is always bounded by the square of the size of the grammar. However, for common programming languages, the LALR(1) grammars do not display exponential size blowup. A more useful comparison would be between the size of the LL(1) and the LALR(1) parsers for a typical programming language such as Pascal. Fischer and LeBlanc (1991) provide just such a comparison. It indicates that the size of the bottom-up parser is roughly twice the size of the top-down parser. So the advantage goes to top-down parsing in terms of the space required for the parser.

To sum up, the top-down parsing methods seem to be superior to the bottom-up methods in all respects but one. That is generality. This can however, be an overriding point that makes all the others moot. If a top-down grammar cannot be found for a language, then the other advantages of top-down parsing make little difference. This is not considered to be a significant problem in the area of common programming languages because top-down grammars can usually be found for most programming language constructs that are of interest. It should also be noted that the generality advantage enjoyed by the bottom-up methods is weakened in most parsing applications. This is because of the need for actions in the grammar. As more and more actions are added to a bottom-up grammar, the resulting splitting of productions tends to reduce the ability of the grammar to be able to postpone parsing decisions. In this case, the bottom-up grammar begins to resemble a top-down grammar in expressive power.

In the worst-case scenario of action placement, the LL(k) and the LR(k) grammars are equivalent. This was shown indirectly by Broscol (1974). He showed that if an LR(k) grammar contains a reference to a unique null production on the left edge of every non-null production in the grammar, then that LR(k) grammar must also be LL(k). An LR(k) grammar that contains an action as the first symbol on the right hand side of every production would have this property because of the splitting of productions that is required in order to move the actions to the end of a production. Parr (1993) extended this idea for the case of variable-length lookaheads. He showed that an LL(k) grammar must be stronger than any deterministic grammar that is restricted to k-1 lookahead, as long as arbitrary action placement is taken into account.

Some of the notation and properties associated with the deterministic context-free grammars was introduced. The early development of the LL(k) and the LR(k) classes of grammars was traced in order to show the origins of the parsing techniques that are associated with them. The top-down method that is used to parse the LL(k) grammars and their common variants was illustrated by example. The shift-reduce method that is used to parse the LR(k) grammars was also demonstrated, as an example of a bottom-up parsing algorithm. The terms top-down and bottom-up refer generically to parsing methods that work by constructing a parse tree from the root downwards, or from the leaves upwards, respectively. Finally, the top-down and bottom-up parsing methods were compared for their relative merits and weaknesses.

Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman (1985). *Compilers: Principles, Techniques, and Tools.* Addison-Wesley, Reading, Mass.

Aho, Alfred V., and Jeffrey D. Ullman (1971). "The care and feeding of LR(k) grammars," Proc. of 3rd ACM Conf. on Theory of Computing, pp. 159-170.

Aho, Alfred V., and Jeffrey D. Ullman (1972). *The Theory of Parsing, translation, and compiling. Vol. 1: Parsing.* Prentice-Hall, Englewood Cliffs, N.J.

Aho, Alfred V., and Jeffrey D. Ullman (1973). *The Theory of Parsing, translation, and compiling. Vol. 2: Compiling.* Prentice-Hall, Englewood Cliffs, N.J.

Broscol, B.M. (1974). "Deterministic Translation Grammars," Thesis and TR 3-74, Center for Research in Computer Technology, Harvard University, Cambridge, Mass.

Culik, K. II (1968). "Contribution to deterministic top-down analysis of context-free languages," Kybernetika 4:5, pp. 422-431.

DeRemer, F.L. (1969). "Practical translators for LR(k) languages," Massachusetts Institute of Technology, Cambridge, Mass.

Fischer, Charles N. and Richard J. LeBlanc, Jr. (1991). *Crafting a Compiler with C.* Benjamin/Cummings, Menlo Park, Ca.

Knuth, D.E. (1965). "On the translation of languages from left to right," Information and Control, 8:6, 607-639.

Knuth, D.E. (1967). "Top-down syntax analysis," Lecture notes. International Summer School on Computer Programming, Copenhagen, Denmark.

Korenjak, A.J. (1969). "A practical method for constructing LR(k) processors," Comm. ACM 12:11, pp. 613-623.

Korenjak, A.J. and J.E. Hopcroft (1966). "Simple deterministic languages," IEEE Conf. Record of 7th Annual Symposium on Switching and Automata Therory, pp. 36-46.

Kurki-Suonio, R. (1969). "Note on top-down languages." BIT 9, pp. 225-238.

Lewis, P.M. and D.J. Rosenkrantz (1971). "An ALGOL compiler designed using automata theory," Proc. Polytechnic Institute of Brooklyn Symposium on Computers and Automata.

Lewis, P.M. and R.E. Stearns (1968). "Syntax-Directed Transduction," JACM 15 (3), pp 465-468.

Parr, T.J. (1993). "Obtaining practical variants of LL(k) and LR(k) for k>1 by splitting the atomic k-tuple," Ph.D. Thesis, Purdue University.

Rosenkrantz, D.J. and R.E. Stearns (1970). "Properties of Deterministic Top-Down Grammars," Inf. and Control, 17 (3), pp 226-256.

Sippu, S. and E. Soisalon-Soininen (1988). *Parsing Theory Volume 1: Languages and parsing.* Springer-Verlag, New York, N.Y.

Sippu, S. and E. Soisalon-Soininen (1990). *Parsing Theory Volume 2: LR(k) and LL(k) parsing.* Springer-Verlag, New York, N.Y.

Wood, D. (1969). "The theory of left factored languages," Computer J. 12:4, pp. 349-356, and 13:1, pp. 55-62.

Wood, D. (1970). "Bibliography 23: Formal language theory and automata theory." Computing Reviews 11:7, pp. 417-430.

Home

Copyright © 2001-2009 SLK Systems