The LL(k) languages comprise the largest class of context-free languages that can be parsed deterministically from the top down. The LL(k) grammars generate the LL(k) languages, so the LL(k) class of grammars is of at least theoretical interest in any consideration of top-down parsing methods. In practice, LL(k) parsing techniques for k greater than one have generally not been used, because both the size of the parser and the complexity of the grammar analysis grow exponentially with k. Even in the LL(1) parsing method, the size of the tables, and hence of the parser, is a matter of some concern, which is why much attention has been focused on table compaction methods for LL(1) as well as for other parsing methods. Given that the LL(1) parse table is considered uncomfortably large, it is not surprising that the parse tables for LL(k) grammars with k greater than one have been considered prohibitively large.
Even so, there has been a small body of work devoted to the parsing of LL(k) grammars. Aho and Ullman (1972) give a completely general method of analyzing and parsing LL(k) grammars. The price paid for this generality is impracticality in many cases. The computational complexity of the method is too great for most grammars of useful size. Yoshida and Takeuchi (1992) present a more practical method for parsing a subset of the LL(k) grammars relatively efficiently. Most current parser generators use an approximate lookahead method that has linear complexity, but very weak recognition capability.
The LL(k) methods produce a leftmost parse by examining the input string from left to right. Parsing decisions are made based on the two criteria that follow:
1. The already-parsed prefix of a sentence in the language
2. The next k input symbols
That is, the production used to expand the nonterminal currently being parsed is chosen based on the string of terminal symbols produced thus far by the parse, together with the next k tokens to be encountered in the input string. This can be expressed symbolically as in the following definition:
Definition: A grammar G = ( N, T, P, S ) is said to be LL(k), for some fixed natural number k, when for any two leftmost derivations in the grammar of the form

    1. S ==>lm* x A delta1 ==>lm x omega1 delta1 ==>lm* x y1
    2. S ==>lm* x A delta2 ==>lm x omega2 delta2 ==>lm* x y2

the equality FIRSTk ( y1 ) = FIRSTk ( y2 ) implies that omega1 = omega2.
In the general case, the LL(k) grammars are quite difficult to parse directly, because the left context of the parse must somehow be remembered. As the definition shows, each parsing decision depends both on what is to come and on what has already been seen of the input. How, then, is it that the LL(1) grammars are so easily parsed? The answer is that every LL(1) grammar is strong. The strong LL(k) grammars are a subset of the LL(k) grammars that can be parsed without knowledge of the left context of the parse; that is, each parsing decision is based only on the next k tokens of the input for the nonterminal currently being expanded. A definition of the strong LL(k) grammars follows.
Definition: A grammar G = ( N, T, P, S ) is said to be strong LL(k), for some fixed natural number k, if for every nonterminal A and for any two distinct A-productions in the grammar

    A --> alpha
    A --> beta

the following holds:

    FIRSTk ( alpha FOLLOWk (A) ) intersection FIRSTk ( beta FOLLOWk (A) ) = Ø
As can be seen from the definition, the strong LL(k) grammars are defined in terms of FIRSTk and FOLLOWk sets only. These sets depend entirely on the parts of a derivation that are still to come, not on any part of the derivation that has already been completed. Definitions of the FIRSTk and FOLLOWk sets follow.
Definition: The FIRSTk set of a string of symbols in a grammar is the set of terminal strings of length at most k that may begin a sentential form derivable from that string of symbols. More specifically, for a grammar G = ( N, T, P, S ):

    FIRSTk ( alpha ) = { w | alpha ==>* w beta }  union  { epsilon if alpha ==>* epsilon }

    where
        alpha in ( N union T )+
        w in T+
        beta in ( N union T )*

For the case of the empty string, FIRSTk ( epsilon ) = { epsilon }. Note that w has length k unless alpha derives a terminal string shorter than k, in which case w is that entire string and beta = epsilon.
Definition: The FOLLOWk set of a string of symbols in a grammar is the set of terminal strings of length at most k that may follow that string of symbols in some sentential form derivable in the grammar. More specifically, for a grammar G = ( N, T, P, S ):

    FOLLOWk ( beta ) = { w | S ==>* alpha beta gamma and w in FIRSTk ( gamma ) }
                       union  { epsilon if S ==>* alpha beta }

    where
        alpha, gamma in ( N union T )*
        beta in ( N union T )+
        w in T+
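Both sets can be computed mechanically. As a concrete illustration, the following minimal Python sketch computes the FIRSTk sets of a grammar by iterating to a fixed point over the productions, using a k-bounded concatenation of sets of terminal strings (the FOLLOWk computation is similar in spirit). The grammar encoding, the names concat_k and first_k, and the small example grammar are illustrative assumptions, not constructs taken from the text.

    # k-bounded concatenation: concatenate every pair of strings and
    # truncate each result to length k.  Terminal strings are tuples.
    def concat_k(left, right, k):
        return {(l + r)[:k] for l in left for r in right}

    # Fixed-point computation of FIRSTk for every symbol of the grammar.
    # `grammar` maps each nonterminal to a list of right-hand sides,
    # each right-hand side being a tuple of grammar symbols.
    def first_k(grammar, terminals, k):
        first = {t: {(t,)} for t in terminals}       # FIRSTk(a) = { a } for a terminal
        first.update({nt: set() for nt in grammar})  # nonterminal sets start empty
        changed = True
        while changed:                               # iterate until nothing new is added
            changed = False
            for nt, rhss in grammar.items():
                for rhs in rhss:
                    strings = {()}                   # start from { epsilon }
                    for sym in rhs:                  # fold concat_k over the right-hand side
                        strings = concat_k(strings, first[sym], k)
                    if not strings <= first[nt]:
                        first[nt] |= strings
                        changed = True
        return first

    # Illustrative grammar: S --> a S b | epsilon.  The empty tuple () plays the
    # role of epsilon, so FIRST2(S) comes out as { (), ('a','b'), ('a','a') }.
    example = {'S': [('a', 'S', 'b'), ()]}
    print(first_k(example, {'a', 'b'}, 2)['S'])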
Why do the LL(1) grammars have the strong property, while LL(k) grammars for k greater than one may or may not be strong? Briefly, the problem lies in the nature of the FOLLOWk sets. A string is included in a FOLLOWk set if it can follow another string in some context. With a lookahead string of more than one symbol, the same lookahead string may occur in more than one context, when a null production is applied in one case and not in the other; the string may be a valid FOLLOWk string in one context but not in another. This problem cannot occur with FOLLOW strings of length one, so all LL(1) grammars are strong LL(1). This makes the LL(1) class of grammars particularly easy to parse. Consider the following grammar as an example:
    1. A --> aBaa        Grammar 4.1
    2. A --> bBba
    3. B --> b
    4. B -->
Grammar 4.1 is LL(2), but not strong LL(2). It can be seen by inspection of the grammar that:
FOLLOW2 ( B ) = { aa, ba }
If the grammar were strong LL(2), the lookahead "ba" would have to uniquely predict one or the other of the two B-productions in the grammar (the FOLLOW2 set enters into the prediction because nonterminal B is nullable). In fact, the lookahead "ba" predicts both of the last two productions. If production 3 is used in production 1 to expand nonterminal B, the lookahead is "ba". If production 4 is used in production 2 to expand nonterminal B, again the lookahead is "ba". The problem is that in the first case, the lookahead string is constructed by concatenating terminal "b" from production 3 with terminal "a" from production 1. Thus a lookahead string that does not apply to production 1 is constructed; the lookahead string "ba" cannot follow nonterminal B in production 1. The only lookahead that can follow B in production 1 is the string "aa". This imprecision in the FOLLOW2 set is what causes Grammar 4.1 not to be strong LL(2).
The solution to this problem is to use context information to decide which production to use. If production 1 is being parsed and the lookahead is "ba", then production 3 is used to expand B. If the parsing context is production 2 and the lookahead is "ba", then production 4 must be used to expand nonterminal B. The definition of the LL(k) grammars clearly handles this type of construction by using the information available from the left context of the parse. In this case, only the single terminal symbol that precedes nonterminal B in the first two productions is needed to determine the correct prediction; in the general case, the entire prefix of the parse may be required in order to know which production to use.
The standard LL(1) method of parsing can be applied to any strong LL(k) grammar with only slight modification. This is because the strong LL(k) grammars for k greater than one are predictive in the same way that the LL(1) grammars are predictive, without regard to the left context of the parse. This means that, given a lookahead string of length k, the choice of the next production to be used in the parse can always be determined uniquely. Strong LL(k) grammars for any value of k can therefore be analyzed and parsed in the same way, except that the length of the lookahead string is equal to the k-value of the grammar. In the strong LL(k) method, FIRSTk and FOLLOWk sets are used to construct a parse table of the form:
N x { w | w in T* and |w| = k }
The rows of the parse table are the same as in the LL(1) construction, but the columns consist of all strings of length k over the terminal set. The method used to construct the strong LL(k) parse table is identical to the method used to construct the LL(1) parse table, except that the FIRSTk, FOLLOWk, and PREDICTk sets are used instead of the FIRST, FOLLOW, and PREDICT sets. The PREDICTk set is defined in terms of the FIRSTk and FOLLOWk sets in much the same way that the PREDICT set is defined in terms of the FIRST and FOLLOW sets. At this point, it should be apparent that the FIRST, FOLLOW, and PREDICT sets are just the FIRSTk, FOLLOWk, and PREDICTk sets for the case k = 1. The definition of the PREDICTk set follows.
Definition: The PREDICTk set for a production in a grammar is a set of terminal strings defined as follows:

    PREDICTk ( A --> alpha ) =
        ( FIRSTk ( alpha ) - { epsilon } ) union FOLLOWk ( A )    if epsilon in FIRSTk ( alpha )
        FIRSTk ( alpha )                                          otherwise
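Translated directly into code, the definition reads as follows. This snippet assumes the FIRSTk and FOLLOWk sets have already been computed (for example with the first_k sketch given earlier), with the empty tuple standing for epsilon; the function name is illustrative.

    # PREDICTk(A --> alpha) computed from FIRSTk(alpha) and FOLLOWk(A),
    # exactly as in the definition above.
    def predict_k(first_of_alpha, follow_of_A):
        if () in first_of_alpha:                    # epsilon in FIRSTk(alpha)
            return (first_of_alpha - {()}) | follow_of_A
        return set(first_of_alpha)

    # For production B --> epsilon of Grammar 4.1 this yields FOLLOW2(B):
    print(predict_k({()}, {('a', 'a'), ('b', 'a')}))   # {('a','a'), ('b','a')}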
The LL(1) parsing algorithm needs only slight modification to make it applicable to strong LL(k) parsing. The lookahead becomes a k-length string of tokens rather than a single terminal symbol, and the next production used to expand the current nonterminal symbol is obtained from a strong LL(k) parse table. The table lookup parameters are the nonterminal symbol and the current k-length lookahead string. Even though a k-length lookahead string is used to make the parsing decisions, the input string is still scanned only a single token at a time, as in the LL(1) parsing method. The strong LL(k) parsing algorithm is shown in figure 4.1.
    Get the first k input tokens as a lookahead string
    Push the start symbol of the grammar onto the stack
    While the stack is not empty and input remains
        Pop a grammar symbol from the stack
        If the symbol is a terminal
            Match it against the first lookahead token
            Get the next input token and append it to the lookahead string
            Advance the beginning of the lookahead string by one token
        Else if the symbol is a nonterminal
            Get the next production number from the parse table
            Push that production onto the stack
        Else if the symbol is an action
            Perform the action
    End while
Figure 4.1 The strong LL(k) Parsing Algorithm
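One possible Python rendering of this driver is sketched below. It assumes a parse table indexed by (nonterminal, k-length lookahead tuple) whose entries are the right-hand sides to push, pads the input with an end marker so that a k-length lookahead always exists, and omits the action-symbol case of the figure; all of the names are illustrative, and this is a sketch rather than a definitive implementation.

    END = '$'   # end-of-input padding symbol (an assumption of this sketch)

    def parse_strong_llk(table, nonterminals, start, tokens, k):
        tokens = list(tokens) + [END] * k        # pad so a k-length lookahead always exists
        pos = 0
        lookahead = tuple(tokens[pos:pos + k])   # get the first k input tokens
        stack = [start]                          # push the start symbol

        while stack:
            symbol = stack.pop()
            if symbol in nonterminals:
                # choose the next production from the parse table
                rhs = table.get((symbol, lookahead))
                if rhs is None:
                    raise SyntaxError('no prediction for %s on %s' % (symbol, lookahead))
                stack.extend(reversed(rhs))      # push the right-hand side
            else:
                # terminal: match it against the first lookahead token
                if symbol != lookahead[0]:
                    raise SyntaxError('expected %s, saw %s' % (symbol, lookahead[0]))
                pos += 1                         # advance the lookahead window by one token
                lookahead = tuple(tokens[pos:pos + k])

        return pos == len(tokens) - k            # accept only if all real input was consumed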
The complexity of the strong LL(k) parsing algorithm is within a constant factor of the complexity of the LL(1) parsing algorithm. The factor depends on k, but since k is not an input to the parsing routine itself, k can be considered a constant in this case. The additional cost is due to the fact that the lookahead is a string rather than a single terminal symbol: the lookahead string must be analyzed in order to convert it into an index into the columns of the parse table, and the complexity of this analysis is proportional to the length of the lookahead string, k. An efficient numbering of all of the k-length strings over the terminal set is not difficult to imagine, and given a k-length input string, it is straightforward to map that string to the proper index by analyzing its component terminal symbols one by one. However, this is still work that must be done in addition to the work done by the LL(1) parsing routine, so the complexity of strong LL(k) parsing is somewhat greater than that of LL(1) parsing.
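One simple numbering of the k-length lookahead strings, consistent with the linear-in-k analysis above, is to treat each string as a k-digit number in base |T|. The terminal ordering and the function name below are assumptions made for illustration.

    # Map a k-length tuple of terminals to a column index in 0 .. |T|**k - 1,
    # treating the lookahead as a k-digit base-|T| number.
    def lookahead_index(lookahead, terminal_order, k):
        assert len(lookahead) == k
        base = len(terminal_order)
        index = 0
        for token in lookahead:                  # one "digit" per lookahead token
            index = index * base + terminal_order[token]
        return index

    # With terminals ordered a=0, b=1 and k=2, the columns aa, ab, ba, bb
    # receive the indices 0, 1, 2, 3.
    print(lookahead_index(('b', 'a'), {'a': 0, 'b': 1}, 2))   # 2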
The strong LL(k) parsing algorithm examines k symbols of the input string at a time. Since the input pointer is advanced only one symbol at a time, each input symbol is examined approximately k times. However, since k is a constant in this case, the complexity of the strong LL(k) parsing algorithm is still linear in the length n of the input:

    W(n) = O(n)
The complexity of analyzing a strong LL(k) grammar for k greater than one is significantly greater than the complexity of analyzing an LL(1) grammar. The number of possible k-length lookahead strings is exponential in k:
    |T|^k
This means that the computation of the FIRSTk and FOLLOWk sets will also be exponential in k. Since k is one of the inputs to the problem of analyzing a strong LL(k) grammar, it follows that strong LL(k) analysis is exponential in the size of the input. This can also be seen by inspection of the size of the strong LL(k) parse table, which is
    |N| · |T|^k
The problem with objects of exponential size is that they grow very rapidly as the exponent increases, and this is especially true of the strong LL(k) parse table. For many typical applications, the parse table required for strong LL(k) is essentially unmanageable for any k greater than one, because the number of columns in the table is the number of terminals in the grammar raised to the power k. For a typical programming language having about 80 terminal symbols, the strong LL(2) parse table is about 80 times the size of the LL(1) parse table. Even with compaction methods applied, this size is generally considered too large to be practical for most compilers.
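The arithmetic behind that claim is straightforward. Taking |T| = 80 from the text and, purely for illustration, assuming a grammar with |N| = 100 nonterminals:

    # |N| * |T|**k table entries: k = 1 versus k = 2 for an 80-terminal grammar.
    N, T = 100, 80                    # |N| = 100 is an illustrative assumption
    print(N * T)                      # LL(1) table:        8,000 entries
    print(N * T**2)                   # strong LL(2) table: 640,000 entries
    print((N * T**2) // (N * T))      # growth factor = |T| = 80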
Aho and Ullman (1972) present an algorithm for constructing parse tables for general LL(2) grammars. This algorithm is mostly of theoretical interest, since it could not be feasibly implemented for a grammar of any practical size. Yoshida and Takeuchi (1992) have shown that the space requirements for the algorithm are too great to make it useful for a typical programming language like Pascal. The method is actually applicable to all LL(k) grammars, but in order to keep things simple, only the case of k = 2 will be considered here.
The approach taken by Aho and Ullman is to transform the original LL(2) grammar into a strong LL(2) grammar that covers it. The resulting strong LL(2) grammar can then be analyzed and parsed using the strong LL(k) method. The Aho and Ullman method thus consists of two phases. The first phase transforms each nonterminal symbol of the grammar, where necessary, into a set of new nonterminal symbols, each associated with a different possible follow string of length two. This effectively eliminates the ambiguity that can be caused by the FOLLOW2 sets of LL(2) grammars. The second phase then constructs a strong LL(2) parse table from the resulting unique PREDICT2 sets. The strong LL(2) parse table consists of the following:
{ new nonterminals } x { all terminal strings of length 2 }
The grammar that results from the conversion is strong LL(2), so the strong LL(2) parsing method applies: given a new nonterminal symbol and a lookahead string of length two, a unique transition can be obtained from the constructed parse table. Consider again Grammar 4.1. This grammar is not strong LL(2) because the strings that can follow nonterminal B differ for its two occurrences in the first two productions of the grammar. The solution is to create two new nonterminals from the original nonterminal. The new nonterminals must have duplicate productions so that any action that could have been taken in parsing B can also be taken in parsing the new nonterminals. Grammar 4.2, which follows, is Grammar 4.1 so transformed:
    1. A --> a <Baa> a a        Grammar 4.2
    2. A --> b <Bba> b a
    3. <Baa> --> b
    4. <Baa> -->
    5. <Bba> --> b
    6. <Bba> -->
The FIRST2, FOLLOW2, and PREDICT2 sets for the nonterminals of Grammar 4.2 are given in figure 4.2.
    FIRST2 ( A ) = { aa, ab, bb }
    FIRST2 ( <Baa> ) = { b, epsilon }
    FIRST2 ( <Bba> ) = { b, epsilon }
    FOLLOW2 ( <Baa> ) = { aa }
    FOLLOW2 ( <Bba> ) = { ba }
    PREDICT2 ( A --> a<Baa>aa ) = { aa, ab }
    PREDICT2 ( A --> b<Bba>ba ) = { bb }
    PREDICT2 ( <Baa> --> b ) = { ba }
    PREDICT2 ( <Baa> --> ) = { aa }
    PREDICT2 ( <Bba> --> b ) = { bb }
    PREDICT2 ( <Bba> --> ) = { ba }
Figure 4.2 The Grammar 4.2 Sets
As the figure shows, the offending lookahead "ba" has been divided between the two new B-nonterminals so that the resulting PREDICT2 sets are disjoint for the productions of each new nonterminal. The lookahead "ba" predicts that nonterminal <Baa> expands to b and that nonterminal <Bba> expands to the empty string, as it should. The parse table for Grammar 4.2 is given in figure 4.3.
                                  Lookahead2
                          |   aa     ab     ba     bb
           ---------------+---------------------------
                   A      |    1      1             2
    Nonterminal  <Baa>    |    4             3
                 <Bba>    |                  6      5
Figure 4.3 Parse Table For Grammar 4.2
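For concreteness, the table of figure 4.3 can be written in the dictionary form assumed by the parse_strong_llk sketch given earlier, storing each production's right-hand side instead of its number (an equivalent encoding); the names Baa and Bba stand for the nonterminals <Baa> and <Bba>.

    g42_table = {
        ('A',   ('a', 'a')): ('a', 'Baa', 'a', 'a'),   # production 1
        ('A',   ('a', 'b')): ('a', 'Baa', 'a', 'a'),   # production 1
        ('A',   ('b', 'b')): ('b', 'Bba', 'b', 'a'),   # production 2
        ('Baa', ('b', 'a')): ('b',),                   # production 3
        ('Baa', ('a', 'a')): (),                       # production 4
        ('Bba', ('b', 'b')): ('b',),                   # production 5
        ('Bba', ('b', 'a')): (),                       # production 6
    }

    # The sentence "abaa" (productions 1 and 3) is accepted:
    print(parse_strong_llk(g42_table, {'A', 'Baa', 'Bba'}, 'A',
                           ['a', 'b', 'a', 'a'], 2))   # True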
The Aho and Ullman method for converting an arbitrary LL(k) grammar to a strong LL(k) grammar is useful in cases where the grammar is small, or where only a few of the nonterminals of the grammar lack the strong property. In the latter case, a simple transformation of part of the grammar may yield a new grammar that can be parsed efficiently from the top down using the strong LL(k) method. Another benefit of the method is that it demonstrates an interesting property of the LL(k) grammars: for every LL(k) grammar that is not strong, there exists a covering strong LL(k) grammar.
Yoshida and Takeuchi (1992) present a relatively efficient algorithm for constructing parse tables for semi-LL(2) grammars. A semi-LL(2) grammar is a restricted form of LL(2) grammar that lies somewhere between strong LL(2) and general LL(2). As with the Aho and Ullman method, this algorithm requires that the grammar be modified: it must first be put into semi-LL(2) form before the method can be applied. The authors believe that most LL(2) grammars can be converted to semi-LL(2) grammars. However, not all LL(2) grammars can be so modified, since the class of semi-LL(2) grammars is a proper subset of the class of LL(2) grammars.
The method uses FIRST2 and FOLLOW2 sets to compute the PREDICT2 function, as in the standard strong LL(k) method. However, in place of the usual FOLLOW2 set there are two follow sets, called END-FOLLOW and PF (partial-FOLLOW). The END-FOLLOW set is almost the same as the standard FOLLOW set, except that the symbol being followed must occur at the rightmost position of a production for the follows of the left-hand-side nonterminal to be included in the set. The position is also considered rightmost if everything that follows it is nullable.
The partial-FOLLOW set is somewhat more complex. Given any two adjacent symbols A X in a valid derivation in the language, where the first symbol is a nonterminal, PF(A, X) is the union, over all such occurrences, of FIRST2 of X concatenated with whatever may follow it. The definition of a semi-LL(k) grammar is quite similar to that of an LL(k) grammar, except that the PF set is used in place of the FOLLOW2 set. An outline of the parse table construction method is given in figure 4.4.
    1. Initialize the FIRST-table.
    2. Compute the closure of the FIRST-table.
    3. Initialize the PF-table.
    4. Initialize the END-FOLLOW-table.
    5. Compute the closure of the END-FOLLOW-table.
    6. Complete the PF-table using the END-FOLLOW-table.
    7. Compute several special grammar cases.
    8. Construct the parse table.
Figure 4.4 Semi-LL(2) Parse Table Construction
Of particular interest are the empirical results reported. The authors compared their method with the Aho and Ullman method on a grammar for the Pascal programming language. Limiting the class of grammars from general LL(2) to semi-LL(2) made it possible to achieve significant performance improvements over the more general Aho and Ullman method: table construction time was reduced approximately tenfold, and the space required for the parse tables and the production rules was reduced by a factor of 120 to 400. No comparison of the performance of the resulting parsers was given. However, the parsing algorithms for the two methods are quite similar, so the speed of the parsers is expected to differ little, except as it may be affected by the very large difference in table size. The Aho and Ullman method produced parse tables so large that they probably would not fit into the real memory of most computers, which could cause frequent page faults that would slow the parser down.
A new class of grammars called SLL¹(k) was introduced as a linear approximation to the strong LL(k) grammars. This class removes the exponential complexity of the strong LL(k) method by ignoring the sequence information present in the lookahead strings: the effect is to convert a set of strings into a string of sets. This is illustrated in figure 4.5.
    strong LL(k) set of lookahead strings:

        { a b c d e,
          a b d c f,
          a b c d g,
          a b c d h,
          a b c d i,
          a b c d j }

    SLL¹(k) string of sets lookahead:

        { a }  { b }  { c, d }  { c, d }  { e, f, g, h, i, j }
Figure 4.5 The SLL¹(k) method lookahead (spaces separate the individual terminal symbols)
In the SLL¹(k) method, the lookahead tokens at each depth are compared against each other to see whether the parsing decision can be made uniquely, based on the lookahead sets at some depth being disjoint for the two productions. If so, the construct is said to be SLL¹(k) for the lowest depth k at which the lookahead sets are disjoint. The weakness of the method is that all sequence information in the lookahead strings is lost, so new lookahead strings that are not valid strong LL(k) lookahead strings are introduced. This can be seen most easily by taking the cross product of the string of sets in figure 4.5: among the many new strings that were not in the original set would be the string "abcce", for instance. The SLL¹(k) method would accept this as a valid lookahead string, even though it is not valid for the strong LL(k) grammar. If the added strings do not cause a conflict in the grammar being analyzed, then that grammar is SLL¹(k). The SLL¹(k) grammars are a proper subset of the strong LL(k) grammars.
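The conversion that figure 4.5 illustrates is easy to state in code: the lookahead set at each depth is simply the union of the symbols occurring at that position across all of the lookahead strings. The function name below is an assumption made for illustration.

    # Collapse a set of k-length lookahead strings into k positional sets.
    def string_of_sets(lookahead_strings, k):
        return [{s[i] for s in lookahead_strings} for i in range(k)]

    strings = {'abcde', 'abdcf', 'abcdg', 'abcdh', 'abcdi', 'abcdj'}
    print(string_of_sets(strings, 5))
    # [{'a'}, {'b'}, {'c','d'}, {'c','d'}, {'e','f','g','h','i','j'}]  (set order may vary)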
The methods available for parsing LL(k) grammars with k greater than one have been considered. The difference between the strong LL(k) and the non-strong LL(k) grammars was demonstrated. The strong LL(k) grammars are much more easily parsed than the more general LL(k) grammars because, for strong LL(k) grammars, the left context of the parse does not need to be known. The method used to parse the strong LL(k) grammars is quite similar to the LL(1) method, except that the grammar analysis and the parse table are far more costly. Three specific methods for LL(k) parsing were summarized. The Aho and Ullman method works by converting a grammar to strong LL(k) and then using the strong LL(k) parsing method. The Yoshida and Takeuchi method works on the semi-LL(2) grammars, which fall between the strong LL(2) and general LL(2) classes. The SLL¹(k) linear approximation works on a subset of the strong LL(k) grammars by using approximate lookahead to make parsing decisions.
Aho, Alfred V., and Jeffrey D. Ullman (1972). The Theory of Parsing, Translation, and Compiling. Vol. 1: Parsing. Prentice-Hall, Englewood Cliffs, N.J.

Yoshida, K., and Y. Takeuchi (1992). "An algorithm for constructing a semi-LL(2) grammar's parsing table," J. Inf. Process. (Japan), Vol. 15, No. 1, pp. 93-107.