A parser generator is a tool that automates the process of creating parse tables from a grammar. It may also generate parsing code, or even compiling code, from the tables. In top-down parsing, this can take the form of parse tables and a driver, or of a recursive-descent program. Bottom-up parsers must be table-driven since there is no bottom-up corollary to programming recursion. An advantage of using a parser generator is that it verifies that the grammar is suitable for the parsing method being used. Programmers rarely take the time to fully analyze the grammar when creating a recursive-descent parser. This can lead to subtle errors that only appear on certain inputs to the program.
SLK is a top-down parser generator. As such, it is very different from the YACC parser generator, which is bottom-up. SLK is also designed to be much easier to use than other parser generators, so much so that it can be confusing to those familiar with other tools. For example, one might wonder how to create tokens, usually the first thing done in most parser generators. In YACC, for every token a line like
%token DIVIDE

must be written. In SLK, you just write the grammar. The tokens are extracted and named automatically by SLK, so for the token DIVIDE just put the symbol (probably /) in the grammar. Note that quotation marks are not needed as in some other tools. This makes the grammar cleaner and easier to follow.
The first thing to do is to obtain a reference grammar for the language you are implementing. An internet search will usually turn up several grammars. The grammar from a standards committee is usually the best one to use. It will be in a generic BNF or EBNF format that is suitable for use by SLK. Minor editing changes will probably be needed to get it into exactly the SLK syntax described below.
Next, run the grammar through SLK and capture the message output to a file. The "equivalent to" messages can usually be ignored, but everything else must be corrected, usually in the order given. The exception is that left factoring and left recursion elimination should be done last where possible. See the FAQ below for a specific example of this. Clearly, things like "duplicate production" should be handled first. These are errors in the grammar, not things that are required to put the grammar into LL(k) format. Questions about context-free grammar formatting should be referred to a compiler or parsing textbook.
Once the grammar is correct, the scanner should be developed. This can be done by hand, or by using any of the available scanner generator tools. A scanner is a simple automaton that assembles characters in the input into tokens that are used by the parser, which itself is a more complicated pushdown automaton. It is usually a good idea to develop a symbol table manager in conjunction with the scanner, since it can make the scanner much simpler by differentiating keywords from identifiers.
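As a sketch of how a symbol table can differentiate keywords from identifiers (the token names and values below are hypothetical, not part of the SLK distribution; real values come from the generated token defines):

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token values for illustration only.
enum Token { IDENTIFIER_ = 300, IF_ = 301, WHILE_ = 302, RETURN_ = 303 };

// The symbol table is pre-loaded with the keywords, so the scanner can
// treat every word the same way and let the lookup decide its token type.
static const std::unordered_map<std::string, Token> keywords = {
    { "if", IF_ }, { "while", WHILE_ }, { "return", RETURN_ }
};

Token classify_word(const std::string &word) {
    auto it = keywords.find(word);
    return it != keywords.end() ? it->second : IDENTIFIER_;
}
```

With this approach the scanner itself needs only one rule for words; the keyword distinction is a table lookup rather than a pattern per keyword.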
At this point, you have what is referred to as a recognizer or parser. It should be able to read language input files and determine if they are syntactically correct. The error reporting can be as simple or as detailed as the application warrants. The SLK parser will call your error routine appropriately, giving information that can be used to customize the error message. But to start it can be as simple as the dreaded "syntax error."
Converting the recognizer to something more useful, like a compiler or translator, is accomplished by adding action code to the program. This is the hard part. Producing the parser is trivial compared to the work required to transform it into even a simple translator. But a large part of the reason for this is that the parser generator does a lot of the work for you. By contrast, a hand-coded recursive-descent parser is quite a lot of work to produce, and even more so to maintain.
The action code consists of lots of usually small routines that are called in callback fashion by the parser at the points in the grammar that you specify. Two general strategies for action code are prevalent. One is to have it just create an abstract syntax tree (AST) that is then processed again after the input file is completely parsed. The other strategy is to emit target code or an intermediate form directly. The best choice depends on the application at hand as well as personal preference.
All tokens must be separated by white space. Production alternates are on separate lines. Blank lines separate productions instead of the semi-colon. Tokens are assumed to be terminals unless they are seen as the left-hand-side of a production. This is handy during development, but it means that the token declarations must be checked to be sure that all nonterminals were declared. A nonterminal can be converted to a terminal by commenting out its productions. This makes it easier to trace conflicts in the grammar.
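As an illustrative sketch of this layout (the symbols are hypothetical, not from any particular grammar): the two alternates of statement are on separate lines, a blank line ends each production, and id, NUMBER, ;, and = are treated as terminals because they never appear as a left-hand side.

```
statement : ;
            id = expr ;

expr : id
       NUMBER
```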
The left-hand-side separator ( : ::= --> := -> = :? :! ) is the only operator. However, the algorithm can usually detect the left-hand-side separator from its position, so these operators are not reserved. The :! separator directs SLK to not analyze the conflict. The first production alternate will be used as the parse table entry. This is useful in cases like the dangling-else ambiguity where the resolution is to use the first production.
The :? separator indicates a conditional nonterminal. It means that if the input is syntactically well-formed for the first production, then use it. Otherwise, try each of the other productions in turn and use the first one that matches. This is the SLK method of backtracking. Actions are not executed during this process. Note that in the C++ grammar, statements should be separated as follows to avoid unnecessary backtracking.
statement: ambiguous-statements
           unambiguous-statements

ambiguous-statements :? declaration-statement
                        expression-statement

unambiguous-statements: labeled-statement
                        compound-statement
                        selection-statement
                        iteration-statement
                        jump-statement
                        try-block
Single or double quotes are never needed. The backslash can be used as an escape to continue a line, or to specify an ASCII character in case the algorithm does not work. The usual C escapes are recognized ( \\ \n \b \t ... ). The C comment styles are used. A line that contains tokens cannot begin with a comment. Since white space is used as a separator, almost any combination of characters can be used as a name. For example, ;_opt and ,_id_* are valid symbols. The following names are reserved.
A limited EBNF grammar notation is also supported as follows.
An action that is specified just after the end brace/bracket is only executed when the EBNF construct goes to null. An action that is specified just before the end brace/bracket is only executed when the EBNF construct does not go to null. Of course this also applies to all other actions within the brace/brackets. The following production is an example of a case where this is useful.
list -> _action_initialize { item _action_item } _action_finalize
Care needs to be taken to avoid unexpected results. For example in
list -> { item } _action_always_do
the action is executed exactly once whether or not there is an item because the braces must eventually go to null either way. However, in
list -> [ item ] _action_always_do
the action is only executed if no item is present. The following three methods can be used to make this action always fire exactly one time:
list -> [ item _action_always_do ] _action_always_do

list -> [ item ] _action_dummy_nop _action_always_do

list -> [ item ] always_do

always_do -> _action_always_do
So, for brackets, if the same action is desired for either case, specify the same action twice, both before and after the bracket. Or insert an action that does nothing after the bracket. Or make a nonterminal that just executes the action.
SLK will also accept grammars in a yacc-like syntactic format. If the grammar is not LL(k), then the SLK grammar transformation options can be used to convert it. The main syntactic differences from yacc are
Existing yacc grammars can be syntactically converted to SLK format by using the -ys option with the -f=filename option. The resulting grammar will probably need further processing to get it to LL(k) format. The SLK left recursion and left factoring options can be used to aid in doing this. See the FAQ below for an example of how to convert an LR grammar to LL form. The following sample grammar illustrates the general layout for an SLK grammar. Note that the actions are just names in the grammar, not code fragments as in YACC. For example, _action_Add in the grammar causes the parser to call an action routine named Add() in the action code. SLK handles the details of making this linkage. You just need to write the action routine and include it in the SlkAction class.
expression: additive_expression _action_Finish

additive_expression: multiplicitive_expression { add_op _action_Push multiplicitive_expression \
                     _action_Add }

add_op: +
        -

multiplicitive_expression: exponential_expression { mul_op _action_Push exponential_expression \
                           _action_Multiply }

mul_op: *
        /

exponential_expression: primary_expression [ ^ exponential_expression _action_Exponent ]

primary_expression: ( additive_expression )
                    - primary_expression _action_Negate
                    NUMBER _action_Push
The SLK syntax is designed to look as much like a plain text reference grammar as possible. In many cases, some simple text editing can produce an SLK grammar from a reference grammar. However, there are many variations on the basic syntax of published grammars. The -En SLK option enables the automatic conversion of some of the more popular reference formats, as outlined in the succeeding sections. If a manual conversion is needed, some basic rules are outlined in the FAQ below.
The conversion is done in two stages as follows:
slk ieee_grammar -E1 -f=out_grammar
slk out_grammar -f=slk_grammar
The second stage generates OR productions and puts the grammar in a cleaner format. The input grammar syntax is as follows:
Symbol   Meaning

::=      The definition operator. This is used in a production rule to
         separate the element defined by the rule from its definition.
         The element being defined appears to the left of the operator
         and the formula that defines the element appears to the right.

[ ]      Square brackets indicate optional elements in a formula. The
         portion of the formula within the brackets may be explicitly
         specified or may be omitted.

{ }      Braces indicate repeating elements in a formula. This is the
         zero-or-more operator as in the SLK syntax.

|        The alternative operator. The vertical bar indicates that the
         portion of the formula following the bar is an alternative to
         the portion preceding the bar. If the vertical bar appears at a
         position where it is not enclosed in braces or square brackets,
         it specifies a complete alternative for the element defined by
         the production rule. If the vertical bar appears in a portion of
         a formula enclosed in braces or square brackets, it specifies
         alternatives for the contents of the innermost pair of such
         braces or brackets.
The conversion is done in two stages as follows:
slk ieee_grammar -E2 -f=out_grammar
slk out_grammar -f=slk_grammar
The second stage generates OR productions and puts the grammar in a cleaner format. The input grammar syntax is as follows:
Symbol   Meaning

< >      Angle brackets delimit character strings that are the names of
         syntactic elements, the non-terminal symbols of the language.
         Spaces are allowed.

::=      The definition operator. This is used in a production rule to
         separate the element defined by the rule from its definition.
         The element being defined appears to the left of the operator
         and the formula that defines the element appears to the right.

[ ]      Square brackets indicate optional elements in a formula. The
         portion of the formula within the brackets may be explicitly
         specified or may be omitted.

{ }      Braces group elements in a formula. The portion of the formula
         within the braces shall be explicitly specified.

|        The alternative operator. The vertical bar indicates that the
         portion of the formula following the bar is an alternative to
         the portion preceding the bar. If the vertical bar appears at a
         position where it is not enclosed in braces or square brackets,
         it specifies a complete alternative for the element defined by
         the production rule. If the vertical bar appears in a portion of
         a formula enclosed in braces or square brackets, it specifies
         alternatives for the contents of the innermost pair of such
         braces or brackets.

...      The ellipsis indicates that the element to which it applies in a
         formula may be repeated any number of times. If the ellipsis
         appears immediately after a closing brace "}", then it applies
         to the portion of the formula enclosed between that closing
         brace and the corresponding opening brace "{". If an ellipsis
         appears after any other element, then it applies only to that
         element. This is the one-or-more operator that corresponds to
         the SLK {}+ usage.
As noted above, existing yacc grammars can be syntactically converted to SLK format by using the -ys option with the -f=filename option. The resulting grammar will probably need further processing to get it to LL(k) format because the LL(k) requirements are different from those of the LALR(1) grammars used by yacc. The SLK left recursion and left factoring options can be used to aid in doing this.
The following classes are referenced by the parser. They must be supplied by the user. Note that the name "Slk" can be changed by a command line option to the parser generator. The change occurs in all instances, so for example SlkToken could be My_Token. This enables the use of more than one grammar/parser in a single program. Of course, any number of parsers can be instantiated and run concurrently.
Error recovery is tunable according to the returned token from mismatch() or no_entry().
Changing the input token can cause the parser to enter an infinite loop of calling the error routine if the change does not repair the original error, and the error routine continues to instruct the parser to retry. The mismatch and no_entry methods need to guard against this by keeping a counter or other means to detect that the parse is stuck. Simply advancing the input will break the loop.
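A minimal sketch of such a guard follows. The class, method signature, and return convention here are hypothetical, invented for illustration; the real SlkError methods are listed in the interface sections. The only point being demonstrated is the retry counter that prevents the infinite loop.

```cpp
// Hypothetical error handler sketch: returns the token the parser should
// retry with, or the input token to stop repairing and advance the input.
struct ErrorGuard {
    int retries = 0;
    static const int MAX_RETRIES = 3;

    int mismatch(int expected_token, int input_token) {
        if (++retries > MAX_RETRIES) {
            retries = 0;              // give up on this repair attempt
            return input_token;       // advancing the input breaks the loop
        }
        return expected_token;        // try substituting the expected token
    }
};
```

Real recovery code would examine the tokens and possibly the parse stack; the counter is the essential safety net.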
void Parse ( SlkAction action, SlkToken tokens, SlkError error, slk_size_t start_symbol )
slk_size_t GetProduction ( slk_size_t conflict_number, SlkToken tokens )
int GetSymbolType ( slk_size_t symbol )
slk_size_t *SlkGetProductionArray ( slk_size_t production_number )
slk_size_t *SlkGetState ( slk_size_t state_number )
boolean IsNonterminal ( slk_size_t symbol )
boolean IsTerminal ( slk_size_t symbol )
boolean IsAction ( slk_size_t symbol )
string GetSymbolName ( short symbol )
string GetProductionName ( short production_number )
A token file consists of token definitions on separate lines. Each token string is the token exactly as it appears in the grammar. The token value is a number. Selected parts of a token file follow.
auto 257
typedef 261
TYPEDEF_NAME_ 271
+= 294
END_OF_SLK_INPUT_ 318
The code that is generated converts the token strings to a form that is a valid identifier. Each one ends in an underscore to try to avoid name collisions. The special token END_OF_SLK_INPUT_ is returned by a scanner at the end of the input token stream. It does not appear in a grammar. The following definitions correspond to the tokens given above.
public const short AUTO_ = 257;
public const short TYPEDEF_ = 261;
public const short TYPEDEF_NAME_ = 271;
public const short PLUS_EQUAL_ = 294;
public const short END_OF_SLK_INPUT_ = 318;
Using SLK with a C/C++ generated scanner is fairly simple. Just include the header file SlkParse.h in the scanner specification, and then code the scanner actions to return the values defined when the token is recognized. So, for example, the action would be "return PLUS_EQUAL_" when "+=" is seen by the scanner. At EOF, return END_OF_SLK_INPUT_. Java and C# work similarly, except that the token values are found in the SlkConstants class.
void SlkParse ( SlkAction &action, SlkToken &tokens, SlkError &error, short start_symbol )  // zero for default symbol
short SlkGetProduction ( short conflict_number, SlkToken Slktokens )
int SlkGetSymbolType ( short symbol )
int SlkIsNonterminal ( short symbol )
int SlkIsTerminal ( short symbol )
int SlkIsAction ( short symbol )
Another difference from Java is that header files are used in C++ to provide the public declarations. Also, SLK is able to fully define an action call table in C++. Since SlkAction is a class, it is necessary to initialize the action table in the constructor. This is done by making a call to the supplied method initialize_table. Note that SlkParse itself is not a class. It is just a reentrant subroutine that drives the parsing process. A file summary for an SLK C++ parser follows.
The C interface is almost the same as the C++ interface. The main difference is that the arguments to SlkParse are C structures rather than C++ classes. A file summary for an SLK C parser follows.
These methods produce the same parse tree. The difference is in the order of construction. In top-down parsing the parse tree is constructed from the top down. In bottom-up parsing the parse tree is constructed from the bottom up. The bottom-up method has more powerful recognition capability because it delays parsing decisions as long as possible. However, since the semantic actions must also be delayed in the bottom-up method, translation/compiling is generally easier with top-down parsing.
Parsing is recursive by nature. Nested constructs like ((x(y))) usually need to be evaluated from the inside out. Recursion handles this process easily by saving partially evaluated constructs on a stack until the innermost parts have been evaluated. If the stack is explicit, the parsing method is referred to as table-driven. If programming language recursion is used, the parsing method is called recursive-descent. In recursive-descent parsing, a subroutine is written for every nonterminal in the grammar. These subroutines call each other in the order specified by the grammar to simulate traversing a parse tree. Most grammars are hierarchical in nature, so descending the grammar from the top down is the parsing procedure used in top-down parsing. Most grammars have repeating constructs, so recursion is necessary. Hence the term recursive-descent parsing.
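A toy recursive-descent sketch of this idea follows, for the hypothetical one-nonterminal grammar item -> ( item ) | x (invented for illustration). The single subroutine corresponds to the single nonterminal, and the call stack holds the partially parsed outer constructs while the nested inner ones are evaluated.

```cpp
#include <string>

// Toy grammar (hypothetical): item -> "(" item ")" | "x"
struct Parser {
    std::string input;
    size_t pos = 0;

    // One function per nonterminal, as in any recursive-descent parser.
    bool item() {
        if (pos < input.size() && input[pos] == 'x') { ++pos; return true; }
        if (pos < input.size() && input[pos] == '(') {
            ++pos;                       // consume '('
            if (!item()) return false;   // recursive call for the nested item
            if (pos < input.size() && input[pos] == ')') { ++pos; return true; }
        }
        return false;
    }

    bool parse() { pos = 0; return item() && pos == input.size(); }
};
```

Each level of nesting in an input like ((x)) produces one level of recursion, which is exactly the explicit stack a table-driven parser would maintain.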
No. Highly ambiguous languages are difficult to parse using techniques that are primarily deterministic. The GLR parsing method is better in these cases because it was designed specifically to handle ambiguity.
A parser generator is a tool that automates the process of creating parse tables from a grammar. It may also generate parsing code, or even compiling code, from the tables. In top-down parsing, this can take the form of parse tables and a driver, or of a recursive-descent program. Bottom-up parsers must be table-driven since there is no bottom-up corollary to programming recursion. An advantage of using a parser generator is that it verifies that the grammar is suitable for the parsing method being used. Programmers rarely take the time to fully analyze the grammar when creating a recursive-descent parser. This can lead to subtle errors that only appear on certain inputs to the program.
Extended BNF adds a zero-or-one grouping operator, [], and a zero-or-more, {}, grouping operator to the BNF productions. This shorthand notation makes a grammar more compact. For example,
S -> a A a
A -> b
A ->

can be expressed in EBNF notation as
S -> a [b] a
The use of EBNF notation also tends to avoid the occurrence of some types of grammar conflicts. The only top-down conflicts that can be addressed by EBNF are those that result from the conflict of a null and a non-null production. These conflicts are avoided by substitution in the grammar. The EBNF construct is in effect a new and unique nonterminal that occurs in only one place in the grammar. This removes the possibility of a conflict, since only one continuation is present in the single production. If the same EBNF construct occurs elsewhere, it is a different nonterminal, so no conflict is possible. However, this does not apply to left-recursion and left-factoring conflicts. These cannot be avoided by using EBNF. (EBNF direct left-factoring can be done if grouping and or are supported, but not indirect left-factoring).
Clearly, the same conflict avoidance can be achieved with a fully substituted BNF grammar, so EBNF is not more powerful than BNF notation, at least for LL(k) grammars. Not all conflicts are the result of a null/non-null construct. In the C language, of the nine deterministic LL(1) conflicts, only two are the result of null/non-null productions, and even these two cannot be resolved by using EBNF. All of the conflicts can be resolved using an LL(2) grammar, suggesting that the LL(k) method is preferable to EBNF in the area of conflict resolution.
An interesting consequence of the substitution property of EBNF is that it can make an LL(k) grammar strong LL(k) under the covers. Consider the classic example of a grammar that is LL(2), but not strong LL(2).

S -> a A a
S -> b A b a
A -> b
A ->

The problem is that "ba" predicts both of the S productions. The left context before the A is needed to resolve the conflict. This grammar is made strong LL(2) by adding a new, duplicate nonterminal A2 in the following way.
S -> a A a
S -> b A2 b a
A -> b
A ->
A2 -> b
A2 ->

Now the left context is not needed because A and A2 are different nonterminals. The equivalent EBNF grammar is also strong LL(2) in its natural expression, without the need to convert the grammar. The LL(2)/strong LL(2) issue never arises at all.
S -> a [b] a
S -> b [b] b a
The advantage of EBNF in the case of this grammar is clear.
LL(k) parsing uses all of the left context of the sentence being parsed. This makes it the most powerful top-down deterministic parsing method. The strong LL(k) parsing method does not use any of the left context of the parse. This is somewhat less powerful, but uses much smaller grammars and parsing tables. As noted above, EBNF SLK grammars are likely to be fully LL(k).
It depends on the language to be parsed. If the language is highly ambiguous, a GLR parser is a good choice because the GLR method was designed specifically to handle ambiguity. If the language is very small, and you have not already used a parser generator, hand-coded recursive-descent is probably a good choice. It is possible that you could be finished with the parser in less time than it would take to master the intricacies of a complicated tool like YACC. For larger projects, the time invested in learning to use a parser generator will probably be offset by the time saved in coding and debugging a recursive-descent parser. This is because recursive-descent parsers require considerably more hand-written code than when a parser generator is used.
If a suitable grammar for the language can be found or constructed, a top-down parser generator is probably a better choice than a bottom-up parser generator like YACC. Top-down parser generators are usually easier to use, and also easier to learn to use. They follow the familiar paradigm of recursion. Bottom-up parsing is a little like recursion that is turned inside out, or maybe upside down. It takes some getting used to before it becomes comfortable.
Lexical analysis is easier and lower-level than parsing. The need for a tool is questionable, especially if a template is available. An efficient scanner is necessary to prevent a bottleneck in the overall program. The scanner processes characters one at a time and assembles them into larger tokens for the parser. If this process is not efficient, the effect can be noticeable even though the compiler often does more actual processing. Most tools do produce fairly efficient scanners, so this is usually not a problem. FLEX, FLEX++, JFLEX, and RE2C are some popular scanner generators.
Another reason to hand code the scanner is for flexibility. Often things can be done in the scanner to make an otherwise difficult parsing problem much easier. Some of these techniques cannot be implemented using a generator. One approach to creating a scanner is to use a lex tool for the prototype, and then to hand code a scanner later if necessary.
yacc                    SLK

item | item2            item
                        item2
The syntactic conversion can be done automatically by SLK:
slk -ys -f=new_grammar yacc_grammar

The SLK grammar from yacc_grammar will now be in new_grammar. The new grammar will probably need grammar conversions to put it in top-down form. SLK can also help with these conversions. Use the SLK options to remove left recursion and to do left factoring where needed.
E : id | & id | ( E ) | E [ int ] | E = E ;
This grammar is ambiguous, so it is not LL(k) or LR(k). The problem is that the grammar does not specify operator precedence and associativity, so more than one parse tree can be constructed for some inputs. It is easy to see that symmetrical constructs like E=E can have more than one parse derivation. A bit more subtle is the ambiguity caused by the interaction of E[int] and E=E. The grammar does not specify whether E=E[int] should be parsed as E=(E[int]) or as (E=E)[int]. Since either interpretation is possible, the grammar is ambiguous. Many grammars can be made LL(1) by first removing ambiguity, then removing left-recursion, and finally left-factoring the grammar. One reasonable and unambiguous form of this grammar would be
E -> F = E
E -> F
F -> F [ int ]
F -> G
G -> id
G -> & id
G -> ( E )

This gives low precedence to = and higher to []. The grammar makes = right associative, as is usually the case. The first production would be the following if left associativity were required.
E -> E = F

Removing left-recursion results in the following grammar.
E -> F = E
E -> F
F -> G more_F
more_F -> [ int ] more_F
more_F ->
G -> id
G -> & id
G -> ( E )

Left-factoring results in the following LL(1) grammar.
E -> F E_tail
E_tail -> = E
E_tail ->
F -> G more_F
more_F -> [ int ] more_F
more_F ->
G -> id
G -> & id
G -> ( E )
Note that once the ambiguity is removed from the grammar, the other transformations to make the grammar LL(1) could be done using a tool like SLK. In the general case, removing ambiguity from a grammar cannot be done deterministically because no algorithm is capable of discerning the intent of the grammar.
LL(k) parsing uses the next k input tokens to make parsing decisions. LL(1) is LL(k) for the case where k=1. LL(k) parsing is considerably more powerful than LL(1) parsing because of the proper superset containment relationship for different k values. For all k>0, the LL(k) languages are a proper superset of the LL(k-1) languages. Note that this is a stronger statement than a similar grammar containment would be. For example, the C programming language is LL(2) but not LL(1), so an LL(1) grammar does not exist for the language.
Permutations are usually handled semantically. Use a broad grammar that accepts a superset of what you want, and check for invalid input in the action code. Were both D and E present? Are any phrases repeated? ...
s : A { phrase }+

phrase : B = "1"
         C = "2"
         D = "3"
         E = "4"
         F = "5"
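A sketch of the semantic side of this approach follows; the class and the phrase identifiers are hypothetical, invented to mirror the B..F phrases of the grammar above. The action code records each phrase as it is parsed, rejecting repeats, and a final check verifies that the required phrases were present.

```cpp
#include <set>

// Hypothetical identifiers for the phrases B..F in the grammar above.
enum Phrase { B, C, D, E, F };

struct PhraseChecker {
    std::set<Phrase> seen;

    // Called from the phrase action: returns false on a repeated phrase.
    bool add(Phrase p) { return seen.insert(p).second; }

    // Called when the repetition ends: were both D and E present?
    bool d_and_e_present() const { return seen.count(D) && seen.count(E); }
};
```

The grammar stays broad and LL(1)-friendly; the permutation constraints live entirely in the action code.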
Yes and no. It is generally much easier to code local parsing exceptions in recursive-descent. However, maintaining the parser is difficult, and it is nearly impossible to make the parser correctly handle all error inputs. It should be noted that for modern programming languages, i.e. post C++, hand-written recursive descent may be a good choice. Its weakness is actually an advantage in this case. That is, the language ambiguities that cause so much trouble for a parser generator are simply ignored by the programmer. This works because the vast bulk of code to be compiled is correct, or nearly so, with most types of errors being well known. If a particular parse derivation does not occur in practice, there is no need to account for it in the parser. This can cause incorrect programs to be silently accepted by the compiler, but good testing can usually catch much of this.
Yes and no. The parser generator itself only deals with integer tokens, not with any kind of text. The scanner converts the input text into a stream of tokens, so it is responsible for handling Unicode. The sample C/C++ scanners are not written to handle Unicode, though they could be modified to do so. The sample Java/C# parsers do handle Unicode using the languages' built-in support.
Any compiler text book is a good source of information.
It is a form of top-down short-circuit backtracking parsing. SLK implements a limited, controlled version of this and calls it predictive backtracking. Backtracking is always inefficient, so it should be used with care. The short-circuit algorithm picks the first syntactically valid parse that it encounters, not trying any others at all. In the case of the C++ expression/declaration ambiguity, this is clearly correct because the language defines it to be so. In other cases, the programmer should be cautious, and understand what he is doing.
It is a limited version of LL(inf) that only backtracks over repeating constructs to resolve conflicts. It helps reduce the need for left factoring in these cases.
Yes, quite a few from very small to very large. They are not listed here mainly because of nondisclosure concerns.
Traditional strong LL(k) grammar analysis is usually intractable for common grammars that need a lookahead of 2 or more. This is because the number of lookahead strings is an exponential function of the size of the terminal set of the grammar. For an LL(k) language with an alphabet, or terminal set, T, this size is |T|^k. For example, a grammar with 300 terminals and k = 3 has 300^3, or 27 million, possible lookahead strings.
The SLK algorithms analyze a grammar without having to construct all of the possible lookahead strings, only those that are relevant to the conflict being resolved. So the amount of time that it takes for SLK to analyze a grammar and construct a parser for it is a linear function of the number of conflicts in the grammar. Most grammars have only a linear number of conflicts, so the SLK analysis is linear for them.
A production compiler for a small LL(2) knowledge representation language was written using SLK. At one point, input files for this compiler were about 5 megabytes, and increasing steadily over time as new definitions were added. The 5 megabyte files took less than two seconds to compile on an ancient 133 MHz personal computer many years ago. This was a gross observation that included the time to load the program as well as to read in the file.
The following is the SLK output for a database language parser. It illustrates how effective the compaction algorithms are for medium to large size languages.
This grammar is strong LL(7) except that
- conflicts were ignored by request on 2 nonterminal(s)

total nonterminals   = 504
ignored conflicts    = 2
total terminals      = 321
total productions    = 1371
parse table size     = 161784
total base conflicts = 65
total conflicts      = 297
conflict table size  = 95337

generating the parser
compacting parse table from 161784 to 6019
compacting conflict table from 95337 to 6804
This is the command line and options summary that SLK prints if no parameters are given.
Usage: slk input_filename [-options]

-a      append parser name to header file token defines (see -n below)
-b      enable aggressive heuristic ambiguity checking
-c      disable table compacting
-cc     generate instance model C code (C with Classes)
-cs     generate C# code
-C++    generate C++ code
-Cr     -Gc with a single conflict specified as 2 __dot_ configs in grammar
-d      display parse derivation of the input grammar for debugging
-dot    use ^G for dot in configs instead of $$, prints as dot but beeps in console
-Da     display all -D values
-Dc     display all conflict traces for debugging only
-Dd[=symbol name]  display nonterminals that can left derive symbol name
-De     display nonterminals that can derive the empty string, epsilon
-Df     display first and follow sets
-Dl     display lookahead strings for resolved conflicts
-Dp     display productions
-Dr     display LR states
-Ds     display symbol table
-Dt     display parse tables
-E n    convert EBNF grammar to SLK format, n=1 is IEEE, n=2 is ISO
-e      disable EBNF grammar support
-f=filename  name of the file for the output grammar
-Fa     force _action_ naming so __ can be for user tokens
-g      generate the parser even if the grammar is ambiguous
-Ga     do not generate anything
-Gc     generate conflict resolver only, no parser
-Gd     generate parse derivation code in get_predicted_entry for debugging
-Gn     do not generate nonterminal strings used for displaying
-Gp     do not generate the production strings used for displaying
-Gs     do not generate action strings used for displaying
-Gt     do not generate terminal strings used for displaying
-Gu     do not generate unique names for duplicate EBNF nonterminals
-i      ignore non-fatal warnings
-j      generate java code
-js     generate javascript code
-k=n    maximum k value to test for in the grammar, default k=7
-l=nonterminal name  indirect left factor nonterminal the old way
-Ld[=nonterminal name]  direct left factor nonterminal or the grammar
-Li[=nonterminal name]  indirect left factor nonterminal or the grammar
-[LA]LR  LALR or LR grammar (LRS for SLR grammar)
-m      conserve memory by reusing internal structures
-n=name  name of the parser, default is Slk, must be length 3
-ns=name  namespace or package name, default is no namespace
-O[=n]  only analyze conflict number n ( -1 = don't show conflict numbers )
-o[=n]  only analyze the nth conflict at each level above level 1
-Ps=parse_stack_size  size in the generated parse routine, default=512
-q      quiet mode operation
-Rd     remove duplicate productions
-Rr[=nonterminal name]  remove left recursion
-Ru[=nonterminal name]  remove unused nonterminals
-Sn[=nonterminal name]  substitute nonterminals for equivalent ones
-St[=nonterminal name]  substitute terminals for equivalent nonterminals
-Tl=filename  load token strings from a file
-Tw=filename  write token strings to a file
-t[=n]  set grammar transformations trace level to n for debugging only ( no =n sets t=1, ( 2, 3, 4 ) )
-v[=n]  set verbose output level to n ( no =n sets v=1, ( 2, 3 ) )
-Xo     eXpand opt/_opt symbols in an LR grammar to reduce conflicts
-y[s]   yacc style scanner is used, s=strip out all actions
The order of the options does not matter; they can appear in any position. Each option must be listed separately. The input file name is required.
The 3 letter parser name can be used to prevent name collisions.
Makes the parse tables easier to read when debugging. There is no performance advantage; table access is the same.
This is the new and recommended coding style.
The SLK parser is not generated; only the code to resolve conflicts for k>1 is generated. This is the get_production method along with its conflict resolution tables, useful for adding LL(k) or LR(k) capability to any parser. The LL(k) analysis of the grammar is not done. Instead, the user specifies conflicts in the grammar by inserting __dot_[n] at the appropriate point in each of the conflicting productions. If there are only 2 productions and one conflict, the conflict number n does not need to be specified. Otherwise it is needed to tell SLK which configurations are associated with which conflict. Conflicts are numbered 1..n for n conflicts in the grammar. For LL grammars, the __dot_ is always at the left corner of the productions of a single nonterminal. For LR grammars, the __dot_ can appear anywhere on the right-hand side of any production in the grammar. The returned production number is sufficient to know whether to shift or reduce, or which reduce to perform.
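As a hypothetical illustration (the grammar is invented, not taken from the SLK distribution), an LL conflict between the two productions of an if statement could be marked like this. Since there are only 2 conflicting productions and one conflict, no conflict number is needed:

```
statement : __dot_ IF ( expression ) statement
            __dot_ IF ( expression ) statement ELSE statement
```

Both __dot_ markers sit at the left corner of the productions of the single nonterminal, as required for LL grammars.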
This gives a flat file version of the entire parse tree.
The display values all go to stdout.
Shows the conflicts in a trace format. The first line is
nonterminal: conflict_terminal conflict_number
The next rows are
conflict_terminal conflict_number
for each successive lookahead level. The levels are indented to show where the next level starts.
The symbol can be a nonterminal or a terminal. Useful in debugging. For all options, a colon can be used instead of the equal sign. White space is not allowed anywhere within an option.
Standard grammar sets used in the analysis.
Shows the conflict lookahead strings for each production. These strings can be up to k terminals long.
Lists the grammar productions in string and numeric form.
Prints out the entire symbol table for the grammar.
The parse table format is
nonterminal: terminal production_number terminal production_number ...
The conflict table format is
conflict number: terminal production_number terminal production_number ...
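A parse table in this row-per-nonterminal form is consumed by a stack-based LL(1) driver. The following is a minimal sketch of such a driver, not SLK's generated code; the two-production grammar, symbol codes, and table layout are all invented for illustration:

```c
#include <assert.h>

/* Hypothetical 2-production grammar, not from the SLK manual:
 *   S : a S      (production 1)
 *   S : b        (production 2)
 */
enum { END = 0, T_A = 1, T_B = 2, NT_S = 3 };

/* parse_table[terminal] = production number for nonterminal S, 0 = error */
static const int parse_table[3] = { 0, 1, 2 };

/* Right-hand sides, stored reversed so they can be pushed onto the stack. */
static const int rhs1[] = { NT_S, T_A };   /* S : a S  */
static const int rhs2[] = { T_B };         /* S : b    */

static int parse(const int *input, int len)
{
    int stack[64];
    int sp = 0;
    int pos = 0;
    stack[sp++] = NT_S;                    /* push the start symbol */
    while (sp > 0) {
        int top = stack[--sp];
        int tok = pos < len ? input[pos] : END;
        if (top == NT_S) {
            int prod = parse_table[tok];   /* predict a production */
            if (prod == 1) { stack[sp++] = rhs1[0]; stack[sp++] = rhs1[1]; }
            else if (prod == 2) { stack[sp++] = rhs2[0]; }
            else return 0;                 /* no table entry: syntax error */
        } else {
            if (top != tok) return 0;      /* terminal mismatch */
            pos++;                         /* match and advance the input */
        }
    }
    return pos == len;                     /* accept only if all input used */
}
```

The explicit stack replaces the call stack of a recursive-descent parser: predicting a production pushes its right-hand side, and a terminal on top of the stack must match the current input token.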
This option can be used to convert certain grammars to SLK syntactic format. See the grammar conversion sections above for details.
This option can be used for grammars that have no EBNF constructs. In this case, the EBNF operators do not need to be escaped. ( [ can be used instead of \[ ) The default is to support the following EBNF notation.
An action that is specified just after the end brace/bracket is only executed when the EBNF construct goes to null. An action that is specified just before the end brace/bracket is only executed when the EBNF construct does not go to null. Of course this also applies to all other actions within the brackets.
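As a hypothetical sketch of this placement rule (the nonterminal and action names are invented; see the -Fa option for action naming), an optional initializer could be written:

```
declaration : type-name identifier [ = expression _action_have_init ] _action_no_init
```

Here _action_have_init, just before the closing bracket, runs only when the optional initializer is actually matched, while _action_no_init, just after the closing bracket, runs only when the bracketed construct goes to null.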
This can be used even if no grammar transformation options are selected. Comments are stripped from the grammar. The output grammar is in EBNF or BNF format, depending on the type of the original grammar.
If this option is specified with the -e option (assumes a BNF grammar), SLK attempts to generate productions for undefined regular expression nonterminals.
The generated productions need to be confirmed by inspection. Mistakes are possible. For example, a terminal that ends in "opt" could be mistaken for an optional nonterminal.
SLK also attempts to generate productions for OR-nonterminals in the grammar. Since SLK does not use the | operator, a shorthand notation for alternatives can be specified. If the grammar contains an undefined symbol like READ|WRITE|UPDATE, SLK will generate the following production at the end of the output file:
READ|WRITE|UPDATE : READ WRITE UPDATE
Embedded spaces are supported with the tilde operator as follows:
READ~ONLY|WRITE|UPDATE : READ ONLY WRITE UPDATE
This will cause the parser to be generated for a non-LL grammar. This was previously the default behavior. Use it only as a last resort, or when you understand the conflict, as in the case of the if-else ambiguity; the recommended method is to handle each conflict explicitly using the :? operator in the grammar.
These are the code generation options. This one is useful during grammar development.
The SLK parser is not generated; only the code to resolve conflicts for k>1 is generated. This is the get_production method along with its conflict resolution tables, useful for adding LL(k) capability to any parser.
Printable strings that can be used for messages.
Use this for backward compatibility or if an explicit parse tree is not needed. The open source SlkTree code is included with the SLK distribution. It should be compiled and linked with the parsing application.
Printable strings that can be used for messages.
Printable strings that can be used for messages.
Printable strings that can be used for messages.
Mostly useful during grammar development. If not set, duplicate EBNF nonterminals cannot be detected and removed by SLK.
Some warnings cannot be ignored.
This can be adjusted based on the grammar. The final actual k value is determined automatically by SLK. It may be less than what is specified here. This just puts a ceiling on the complexity of the grammar analysis.
This option should only be used to try to resolve conflicts that are marked as unresolvable or ambiguous. The following smart left factoring options only factor out nonterminals that can derive infinite strings because LL(k) lookahead can handle most of the other cases. Grammars like the following slip through the cracks.
value-type : struct-type
             enum-type

struct-type : type-name
              simple-type

enum-type : type-name
Here type-name needs to be factored out. Left recursion should always be removed first.
This option provides some automation to the process. It is a heuristic that is best used with some guidance. Left recursion should always be removed first.
This option provides some automation to the process. It is a heuristic that is best used with some guidance. Left recursion should always be removed first.
This option should only be used on severely constrained systems.
Letters, digits, and underscore are allowed. Files and functions have this as a prefix. This allows more than one grammar and parser in a single program. Otherwise, it is not needed.
Wrap all of the generated C++, C#, or Java code with a namespace or package name.
For debugging a single conflict at a time. The -1 value is useful when comparing two conflict files where the conflicts are nearly the same but the numbers are changed because one or more conflicts were resolved. Useful to see what the incremental change is for a grammar modification.
A single conflict can generate many new ones at each lookahead level. This value can usually be set to 1.
Use a smaller size for embedded systems and a larger one for complex autogenerated grammars.
Display almost no messages.
The productions can be safely removed.
This is also a heuristic, but it is better behaved than the one for left factoring.
The productions can be safely removed. The nonterminal does not appear on the right hand side of any production.
Provided as an automation convenience; not strictly necessary.
Provided as an automation convenience; not strictly necessary.
Used to override the SLK automatic token declarations.
For loading by another parser.
Shows detailed step by step transformation of the grammar.
More detailed messages during the grammar analysis.
A yacc-syntax grammar is used instead of the space-delimited SLK grammar syntax. This can be an SLK grammar or an actual yacc grammar that is being converted to SLK format. The "s" suffix strips all of the actions out of the grammar, which is usually necessary when converting a yacc grammar to SLK format.
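As a hypothetical sketch of the difference in style (the grammar is invented for illustration), a yacc rule such as

```
expr : expr '+' term
     | term
     ;
```

would be written in SLK's space-delimited format roughly as

```
expr : expr + term
       term
```

with no quotation marks, no | operator, and no terminating semicolon; each alternative is simply listed on its own line.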
When we look at translation, however, the left parse appears more desirable.

Comment about LL vs. LR from Grune & Jacobs (1990), page 242:

If one is in a position to design the grammar along with the parser, there is little doubt that LL(1) is to be preferred: not only will parsing and performing semantic actions be easier, text that conforms to an LL(1) grammar is also clearer to the human reader.

Performance comment from Fischer & LeBlanc (1991), page 120:
Since the LL(1) driver uses a stack rather than recursive procedure calls to store symbols yet to be matched, the resulting parser can be expected to be smaller and faster than a corresponding recursive descent parser.
Fischer, Charles N. and Richard J. LeBlanc, Jr. (1991). Crafting a Compiler with C. Benjamin/Cummings, Menlo Park, Ca.
Grune, Dick and Ceriel J.H. Jacobs (1990). Parsing Techniques - A Practical Guide. Ellis Horwood, Chichester, England.