SLK Documentation


Introduction

A parser generator is a tool that automates the process of creating parse tables from a grammar. It may also generate parsing, or even compiling, code from the tables. In top-down parsing, this can be in the form of parse tables and a driver, or in the form of a recursive-descent program. Bottom-up parsers must be table-driven since there is no bottom-up corollary to programming recursion. An advantage of using a parser generator is that it verifies that the grammar is suited to the parsing method being used. Programmers rarely take the time to fully analyze the grammar when creating a recursive-descent parser. This can lead to subtle errors that only appear on certain inputs to the program.

SLK is a top-down parser generator. As such, it is very different from the YACC parser generator, which is bottom-up. SLK is also designed to be much easier to use than other parser generators, so much so that it can be confusing to those familiar with other tools. For example, one might wonder how to create tokens, usually the first thing done in most parser generators. In YACC, for every token a line like

%token DIVIDE
must be written. In SLK, you just write the grammar. The tokens are extracted and named automatically by SLK. So for the token DIVIDE, just put the symbol (probably /) in the grammar. Note that quotation marks are not needed, as they are in some other tools. This makes the grammar cleaner and easier to follow.

Getting Started

The first thing to do is to obtain a reference grammar for the language you are implementing. An internet search will usually give several grammars. The grammar from a standards committee is usually the best one to use. It will be in a generic BNF or EBNF format that is suitable for use by SLK. Minor editing changes will probably be needed to get it into exactly the SLK syntax described below.

Next, run the grammar through SLK and capture the message output to a file. The "equivalent to" messages can usually be ignored, but everything else must be corrected, usually in the order given. The exception is that left factoring and left recursion elimination should be done last where possible. See the FAQ below for a specific example of this. Clearly, things like "duplicate production" should be handled first. These are errors in the grammar, not things that are required to put the grammar into LL(k) format. Questions about context-free grammar formatting should be referred to a compiler or parsing textbook.

Once the grammar is correct, the scanner should be developed. This can be done by hand, or by using any of the available scanner generator tools. A scanner is a simple automaton that assembles characters in the input into tokens that are used by the parser, which itself is a more complicated pushdown automaton. It is usually a good idea to develop a symbol table manager in conjunction with the scanner, since it can make the scanner much simpler by differentiating keywords from identifiers.
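
For example, the keyword lookup can live in the symbol table so that the scanner automaton stays simple. The following C++ sketch illustrates the idea; the names and token values are hypothetical, not part of any SLK interface.

#include <string>
#include <unordered_map>

enum { IDENTIFIER_ = 1, IF_, WHILE_, RETURN_ };   // hypothetical token values

class SymbolTable
{
public:
    // preload the keywords so the scanner itself needs no special cases
    SymbolTable ( void )
      : keywords { { "if", IF_ }, { "while", WHILE_ }, { "return", RETURN_ } }
    {}
    // the scanner calls this for every name it assembles
    int classify ( const std::string &name ) const
    {
        auto it = keywords.find ( name );
        return it != keywords.end() ? it->second : IDENTIFIER_;
    }
private:
    std::unordered_map<std::string, int>  keywords;
};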

At this point, you have what is referred to as a recognizer or parser. It should be able to read language input files and determine if they are syntactically correct. The error reporting can be as simple or as detailed as the application warrants. The SLK parser will call your error routine appropriately, giving information that can be used to customize the error message. But to start, it can be as simple as the dreaded "syntax error."

Converting the recognizer to something more useful, like a compiler or translator, is accomplished by adding action code to the program. This is the hard part. Producing the parser is trivial compared to the work required to transform it into even a simple translator. But a large part of the reason for this is that the parser generator does a lot of the work for you. By contrast, a hand-coded recursive-descent parser is quite a lot of work to produce, and even more so to maintain.

The action code consists of lots of usually small routines that are called in callback fashion by the parser at the points in the grammar that you specify. Two general strategies for action code are prevalent. One is to have it just create an abstract syntax tree (AST) that is then processed again after the input file is completely parsed. The other strategy is to emit target code or an intermediate form directly. The best choice depends on the application at hand as well as personal preference.

SLK grammar syntax

All tokens must be separated by white space. Production alternates are on separate lines. Blank lines separate productions instead of the semicolon. Tokens are assumed to be terminals unless they are seen as the left-hand-side of a production. This is handy during development, but it means that the token declarations must be checked to be sure that all nonterminals were declared. A nonterminal can be converted to a terminal by commenting out its productions. This makes it easier to trace conflicts in the grammar.

The left-hand-side separator ( : ::= --> := -> = :? :! ) is the only operator. However, the algorithm can usually detect the left-hand-side separator from its position, so these operators are not reserved. The :! separator directs SLK not to analyze the conflict. The first production alternate will be used as the parse table entry. This is useful in cases like the dangling-else ambiguity, where the resolution is to use the first production.

The :? separator indicates a conditional nonterminal. It means that if the input is syntactically well-formed for the first production, then use it. Otherwise, try each of the other productions in turn and use the first one that matches. This is the SLK method of backtracking. Actions are not executed during this process. Note that in the C++ grammar, statements should be separated as follows to avoid unnecessary backtracking.

statement:
    ambiguous-statements
    unambiguous-statements

ambiguous-statements :?
    declaration-statement
    expression-statement

unambiguous-statements:
    labeled-statement
    compound-statement
    selection-statement
    iteration-statement
    jump-statement
    try-block

Single or double quotes are never needed. The backslash can be used as an escape to continue a line or to specify an ASCII character in case the algorithm does not work. The usual C escapes are recognized ( \\ \n \b \t ... ). The C comment styles are used. A line that contains tokens cannot begin with a comment. Since white space is used as a separator, almost any combination of characters can be used as a name. For example, ;_opt and ,_id_* are valid symbols. The following names are reserved.

A limited EBNF grammar notation is also supported as follows.

An action that is specified just after the end brace/bracket is only executed when the EBNF construct goes to null. An action that is specified just before the end brace/bracket is only executed when the EBNF construct does not go to null. Of course this also applies to all other actions within the brace/brackets. The following production is an example of a case where this is useful.

list ->  _action_initialize  { item _action_item }  _action_finalize

Care needs to be taken to avoid unexpected results. For example in

list ->  { item } _action_always_do

the action is executed exactly once whether or not there is an item because the braces must eventually go to null either way. However, in

list ->  [ item ] _action_always_do

the action is only executed if no item is present. The following three methods can be used to make this action always fire exactly one time:

list ->  [ item _action_always_do ] _action_always_do
list ->  [ item ] _action_dummy_nop _action_always_do
list ->  [ item ] always_do
always_do ->  _action_always_do

So, for brackets, if the same action is desired for either case, specify the same action twice, both before and after the bracket. Or insert an action that does nothing after the bracket. Or make a nonterminal that just executes the action.

SLK will also accept grammars in a yacc-like syntactic format. If the grammar is not LL(k), then the SLK grammar transformation options can be used to convert it. The main syntactic differences from yacc are

  1. Actions are by number rather than as code. {1} is action 1. Alternatively, actions can be specified by function names as in {MyAction}.
  2. No yacc directives are supported.
  3. There is no %% prefix or suffix code.
  4. SLK does not maintain a semantic stack, so the $$ notation is not supported.

Existing yacc grammars can be syntactically converted to SLK format by using the -ys option with the -f=filename option. The resulting grammar will probably need further processing to get it to LL(k) format. The SLK left recursion and left factoring options can be used to aid in doing this. See the FAQ below for an example of how to convert an LR grammar to LL form. The following sample grammar illustrates the general layout for an SLK grammar. Note that the actions are just names in the grammar, not code fragments as in YACC. For example, _action_Add in the grammar causes the parser to call an action routine named Add() in the action code. SLK handles the details of making this linkage. You just need to write the action routine and include it in the SlkAction class.

Sample EBNF grammar

expression:
    additive_expression  _action_Finish

additive_expression:
    multiplicative_expression { add_op _action_Push multiplicative_expression \
                                _action_Add }

add_op:
    +
    - 

multiplicative_expression:
    exponential_expression { mul_op _action_Push exponential_expression \
                             _action_Multiply }

mul_op:
    *
    /

exponential_expression:
    primary_expression [ ^  exponential_expression  _action_Exponent ]

primary_expression:
    (   additive_expression   )
    -  primary_expression   _action_Negate 
    NUMBER _action_Push

Converting Grammars To SLK Format

The SLK syntax is designed to look as much like a plain text reference grammar as possible. In many cases, some simple text editing can produce an SLK grammar from a reference grammar. However, there are many variations on the basic syntax of published grammars. The -En SLK option enables the automatic conversion of some of the more popular reference formats, as outlined in the succeeding sections. If a manual conversion is needed, some basic rules are outlined in the FAQ below.

Converting IEEE grammars to SLK format

The conversion is done in two stages as follows:

    slk ieee_grammar -E1 -f=out_grammar
    slk out_grammar -f=slk_grammar

The second stage generates OR productions and puts the grammar in a cleaner format. The input grammar syntax is as follows:

Symbol    Meaning

    ::=   The definition operator. This is used in a production rule to
    or    separate the element defined by the rule from its definition.
    :=    The element being defined appears to the left of the operator
          and the formula that defines the element appears to the
          right.

    [ ]   Square brackets indicate optional elements in a formula. The
          portion of the formula within the brackets may be explicitly
          specified or may be omitted.

    { }   Braces indicate repeating elements in a formula. This is
          the zero-or-more operator as in the SLK syntax.

    |     The alternative operator. The vertical bar indicates that
          the portion of the formula following the bar is an alterna-
          tive to the portion preceding the bar. If the vertical bar
          appears at a position where it is not enclosed in braces
          or square brackets, it specifies a complete alternative for
          the element defined by the production rule. If the vertical
          bar appears in a portion of a formula enclosed in braces or
          square brackets, it specifies alternatives for the contents
          of the innermost pair of such braces or brackets.

Converting ISO grammars to SLK format

The conversion is done in two stages as follows:

    slk iso_grammar -E2 -f=out_grammar
    slk out_grammar -f=slk_grammar

The second stage generates OR productions and puts the grammar in a cleaner format. The input grammar syntax is as follows:

Symbol    Meaning

    < >   Angle brackets delimit character strings that are the names
          of syntactic elements, the non-terminal symbols of the 
          language. Spaces are allowed. 

    ::=   The definition operator. This is used in a production rule to
    or    separate the element defined by the rule from its definition.
    :=    The element being defined appears to the left of the operator
          and the formula that defines the element appears to the
          right.

    [ ]   Square brackets indicate optional elements in a formula. The
          portion of the formula within the brackets may be explicitly
          specified or may be omitted.

    { }   Braces group elements in a formula. The portion of the for-
          mula within the braces shall be explicitly specified.

    |     The alternative operator. The vertical bar indicates that
          the portion of the formula following the bar is an alterna-
          tive to the portion preceding the bar. If the vertical bar
          appears at a position where it is not enclosed in braces
          or square brackets, it specifies a complete alternative for
          the element defined by the production rule. If the vertical
          bar appears in a portion of a formula enclosed in braces or
          square brackets, it specifies alternatives for the contents
          of the innermost pair of such braces or brackets.

    ...   The ellipsis indicates that the element to which it applies
          in a formula may be repeated any number of times. If the el-
          lipsis appears immediately after a closing brace "}", then it
          applies to the portion of the formula enclosed between that
          closing brace and the corresponding opening brace "{". If
          an ellipsis appears after any other element, then it applies
          only to that element. This is the one-or-more operator that 
          corresponds to the SLK {}+ usage. 

Converting YACC grammars to SLK format

As noted above, existing yacc grammars can be syntactically converted to SLK format by using the -ys option with the -f=filename option. The resulting grammar will probably need further processing to get it to LL(k) format because the LL(k) requirements are different from those of the LALR(1) grammars used by yacc. The SLK left recursion and left factoring options can be used to aid in doing this.

User Supplied Classes

The following classes are referenced by the parser. They must be supplied by the user. Note that the name "Slk" can be changed by a command line option to the parser generator. The change occurs in all instances, so for example SlkToken could be My_Token. This enables the use of more than one grammar/parser in a single program. Of course, any number of parsers can be instantiated and run concurrently.

SlkToken

Class containing the get and peek methods. The get() method just returns the next token from the input. The peek(level) method returns the token at the lookahead level given. Lookahead level one is the same as the get() level. Peek does not consume input, so its tokens are all returned again by get(). The get() and peek(level) methods act as a layer to insulate the parser from the scanner. The usual practice is to have them call the real scanner to get input, and then to buffer it where needed.
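
A minimal C++ sketch of this layering follows. It assumes a real scanner routine named next_token(), which is a hypothetical name, not part of the generated code.

#include <deque>

extern short next_token ( void );   // the real scanner (hypothetical name)

class SlkToken
{
public:
    // get() consumes and returns the next input token
    short get ( void )
    {
        if ( !buffer.empty() ) {
            short token = buffer.front();
            buffer.pop_front();
            return token;
        }
        return next_token();
    }
    // peek(level) looks ahead without consuming; level one is the get() level
    short peek ( int level )
    {
        while ( (int) buffer.size() < level ) {
            buffer.push_back ( next_token() );
        }
        return buffer [ level - 1 ];
    }
private:
    std::deque<short>  buffer;   // tokens read ahead but not yet consumed
};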

SlkError

Class containing the parse error recovery routines.

Error recovery is tunable according to the returned token from mismatch() or no_entry().

  1. If the same token is returned, the algorithm proceeds as if nothing happened. The parse stack is popped. For mismatch(), this is one way to insert a missing token.
  2. If a different token is returned, the algorithm does a retry using the new token and the same nonterminal.
  3. If zero is returned, the algorithm retries using the old token/nonterminal pair. This is useful when the action required is to change the symbol table information. For example, toggling between identifier and typedef in C.

Changing the input token can cause the parser to enter an infinite loop of calling the error routine if the change does not repair the original error, and the error routine continues to instruct the parser to retry. The mismatch and no_entry methods need to guard against this by keeping a counter or other means to detect that the parse is stuck. Simply advancing the input will break the loop.
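
For example, a guarded mismatch() might be sketched in C++ as follows. The exact generated signatures may differ, and the retry threshold is an arbitrary assumption.

class SlkError
{
public:
    short mismatch ( short expected_token, short input_token, SlkToken &tokens )
    {
        if ( ++stuck_count > STUCK_LIMIT ) {
            stuck_count = 0;
            return tokens.get();   // advance the input to break a stuck parse
        }
        return input_token;        // same token back: the stack is popped,
                                   // in effect inserting the missing token
    }
private:
    enum { STUCK_LIMIT = 10 };     // hypothetical limit on successive retries
    int  stuck_count = 0;
};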

SlkAction

Class containing the semantic action methods. The actions are numbered 1, 2, 3, ... The zero-numbered action is never called. The router for the semantic action methods can be implemented as a call table or a switch. Defining the router can be automated by SLK, or manually supplied by the user code. The class also supplies the predict() and reduce() methods used by SLK to report the sequence of productions used in the parse. This fully defines the parse, so it can be used to display the derivation as well as to construct a persistent parse tree.
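
In outline, a user SlkAction class might look like the following C++ sketch. The action numbers, the action names, and the predict/reduce signatures shown here are illustrative only.

class SlkAction
{
public:
    // the router has this general shape when implemented as a switch
    void execute ( int action_number )
    {
        switch ( action_number ) {
        case 1:   Push();   break;   // _action_Push in the grammar
        case 2:   Add();    break;   // _action_Add in the grammar
        default:            break;   // action zero is never called
        }
    }
    // called by the parser to report the productions used in the parse
    void predict ( int production_number ) { /* record or display */ }
    void reduce  ( int production_number ) { /* e.g. build a parse tree */ }
private:
    void Push ( void ) { /* semantic work */ }
    void Add  ( void ) { /* semantic work */ }
};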

SLK Supplied Classes and Files

The following classes are supplied by SLK. These are implemented as static classes because they need no initialization or instance data. Note that the name "Slk" can be changed by a command line option to the parser generator. The change occurs in all instances, so for example SlkParser could be My_Parser. This enables the use of more than one grammar and parser in a single program. Otherwise, it is not needed and the default "Slk" is fine. These classes can optionally be enclosed in a namespace or package.

SlkParser

The SLK generated tables and parser. The public methods are
void Parse ( SlkAction   action,
             SlkToken    tokens,
             SlkError    error,
             slk_size_t  start_symbol )

slk_size_t GetProduction ( slk_size_t  conflict_number,
                           SlkToken    tokens )

int GetSymbolType ( slk_size_t   symbol )
slk_size_t *SlkGetProductionArray ( slk_size_t   production_number )
slk_size_t *SlkGetState ( slk_size_t  state_number )
boolean IsNonterminal ( slk_size_t   symbol )
boolean IsTerminal ( slk_size_t   symbol )
boolean IsAction ( slk_size_t   symbol )

SlkString

The SLK generated class that defines the printable strings and their access methods. The public methods are
string  GetSymbolName ( short symbol )
string  GetProductionName ( short production_number )

SlkConstants

The SLK generated class that defines the constant values for the nonterminals and the terminals of the grammar. The token definitions are automatically assigned by SLK unless a token file is specified. A token file is usually only needed if the program contains more than one parser, and some of the same tokens are included in each of the grammars. In this case, the automatically assigned token values may be different.

A token file consists of token definitions on separate lines. Each token string is the token exactly as it appears in the grammar. The token value is a number. Selected parts of a token file follow.

                            auto       257
                         typedef       261
                   TYPEDEF_NAME_       271
                              +=       294
               END_OF_SLK_INPUT_       318

The code that is generated converts the token strings to a form that is a valid identifier. Each one ends in an underscore to try to avoid name collisions. The special token END_OF_SLK_INPUT_ is returned by a scanner at the end of the input token stream. It does not appear in a grammar. The following definitions correspond to the tokens given above.

public const short  AUTO_ = 257;
public const short  TYPEDEF_ = 261;
public const short  TYPEDEF_NAME_ = 271;
public const short  PLUS_EQUAL_ = 294;
public const short  END_OF_SLK_INPUT_ = 318;

Using SLK with a C/C++ generated scanner is fairly simple. Just include the header file SlkParse.h in the scanner specification, and then code the scanner actions to return the values defined when the token is recognized. So, for example, the action would be "return PLUS_EQUAL_" when "+=" is seen by the scanner. At EOF, return END_OF_SLK_INPUT_. Java and C# work similarly, except that the token values are found in the SlkConstants class.
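
For a hand-coded C/C++ scanner, the same idea is sketched below. The get_char and unget_char input helpers are hypothetical, not part of SLK.

#include <cstdio>         // for EOF
#include "SlkParse.h"     // the SLK generated token constants

extern int  get_char ( void );      // hypothetical input helpers
extern void unget_char ( int c );

short scan_operator ( int c )
{
    if ( c == EOF ) {
        return END_OF_SLK_INPUT_;   // end of the input token stream
    }
    if ( c == '+' ) {
        int c2 = get_char();
        if ( c2 == '=' ) {
            return PLUS_EQUAL_;     // "+=" recognized
        }
        unget_char ( c2 );          // a bare "+" would be handled here
    }
    return 0;                       // not handled in this fragment
}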

SlkExecute.txt

The SLK generated file that defines the action router. The execute method is implemented as a switch statement. It is not a call table because pointers are not supported in Java. The SlkExecute.txt file is meant as an example, or to be manually edited into the user-written SlkAction class. (The #include directive is not supported, and a separate class is not feasible.) An alternative is to use numbered actions like _action_32_does_this, and define the execute method manually. As with function call actions, the same action can be given many times in the grammar. This can reduce the switch size. A user-maintained switch statement also enables using snippets of code in the switch instead of requiring function calls from the switch.

Specific Generated Code Interfaces

Java version generated code interface

JavaScript version generated code interface

C# version generated code interface

C++ version generated code interface

The C++ code is logically the same as the Java code, but its implementation is different because SLK originally generated C code. Static classes in Java serve only to provide a namespace. The three-letter prefix on names is just as effective, but with fewer letters. So in C++ the parse method is named SlkParse instead of SlkParser.parse. The public methods are
void SlkParse ( SlkAction  &action,
                SlkToken   &tokens,
                SlkError   &error,
                short       start_symbol )    // zero for default symbol

short SlkGetProduction ( short     conflict_number,
                         SlkToken  Slktokens )

int SlkGetSymbolType ( short   symbol )
int SlkIsNonterminal ( short   symbol )
int SlkIsTerminal ( short   symbol )
int SlkIsAction ( short   symbol )

Another difference from Java is that header files are used in C++ to provide the public declarations. Also, SLK is able to fully define an action call table in C++. Since SlkAction is a class, it is necessary to initialize the action table in the constructor. This is done by making a call to the supplied method initialize_table. Note that SlkParse itself is not a class. It is just a reentrant subroutine that drives the parsing process. A file summary for an SLK C++ parser follows.
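
As an illustration of this wiring, here is a minimal hypothetical top level in C++. It is a sketch only, and it assumes the user classes described above have been written.

#include "SlkParse.h"     // generated declarations

int main ( void )
{
    SlkToken   tokens;    // user scanner layer
    SlkError   error;     // user error recovery
    SlkAction  action;    // user actions; the constructor calls the
                          // supplied initialize_table method
    SlkParse ( action, tokens, error, 0 );   // zero for the default symbol
    return 0;
}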

C version generated code interface

The C interface is almost the same as the C++ interface. The main difference is that the arguments to SlkParse are C structures rather than C++ classes. A file summary for an SLK C parser follows.


Appendix A: Parsing Questions and Answers

  1. What is the difference between top-down and bottom-up parsing?

    These methods produce the same parse tree. The difference is in the order of construction. In top-down parsing the parse tree is constructed from the top down. In bottom-up parsing the parse tree is constructed from the bottom up. The bottom-up method has more powerful recognition capability because it delays parsing decisions as long as possible. However, since the semantic actions must also be delayed in the bottom-up method, translation/compiling is generally easier with top-down parsing.

  2. What is recursive-descent parsing?

    Parsing is recursive by nature. Nested constructs like ((x(y))) usually need to be evaluated from the inside out. Recursion handles this process easily by saving partially evaluated constructs on a stack until the innermost parts have been evaluated. If the stack is explicit, the parsing method is referred to as table-driven. If programming language recursion is used, the parsing method is called recursive-descent. In recursive-descent parsing, a subroutine is written for every nonterminal in the grammar. These subroutines call each other in the order specified by the grammar to simulate traversing a parse tree. Most grammars are hierarchical in nature, so descending the grammar from the top down is the parsing procedure used in top-down parsing. Most grammars have repeating constructs, so recursion is necessary. Hence the term recursive-descent parsing.
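
    For example, a production like  expr -> term { + term }  might be hand
    coded as in the following C++ sketch, where lookahead, match, and term
    are the usual hand-written helpers (all names hypothetical):

    extern int  lookahead;            // the current input token
    extern void match ( int token );  // consume a token or report an error
    extern void term ( void );        // the subroutine for nonterminal term

    // expr -> term { + term }
    void expr ( void )
    {
        term();
        while ( lookahead == '+' ) {
            match ( '+' );
            term();
        }
    }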

  3. Is LL(k) parsing suitable for natural language processing?

    No. Highly ambiguous languages are difficult to parse using techniques that are primarily deterministic. The GLR parsing method is better in these cases because it was designed specifically to handle ambiguity.

  4. What is a parser generator?

    A parser generator is a tool that automates the process of creating parse tables from a grammar. It may also generate parsing, or even compiling, code from the tables. In top-down parsing, this can be in the form of parse tables and a driver, or in the form of a recursive-descent program. Bottom-up parsers must be table-driven since there is no bottom-up corollary to programming recursion. An advantage of using a parser generator is that it verifies that the grammar is suited to the parsing method being used. Programmers rarely take the time to fully analyze the grammar when creating a recursive-descent parser. This can lead to subtle errors that only appear on certain inputs to the program.

  5. What is the difference between BNF and EBNF notation?

    Extended BNF adds a zero-or-one grouping operator, [ ], and a zero-or-more grouping operator, { }, to the BNF productions. This shorthand notation makes a grammar more compact. For example,

    S -> a A a
    A -> b
    A -> 
    
    can be expressed in EBNF notation as
    S -> a [b] a 
    

    The use of EBNF notation also tends to avoid the occurrence of some types of grammar conflicts. The only top-down conflicts that can be addressed by EBNF are those that result from the conflict of a null and a non-null production. These conflicts are avoided by substitution in the grammar. The EBNF construct is in effect a new and unique nonterminal that occurs in only one place in the grammar. This removes the possibility of a conflict, since only one continuation is present in the single production. If the same EBNF construct occurs elsewhere, it is a different nonterminal, so no conflict is possible. However, this does not apply to left-recursion and left-factoring conflicts. These cannot be avoided by using EBNF. (Direct left-factoring can be done in EBNF if grouping and the alternative operator are supported, but indirect left-factoring cannot.)

    Clearly, the same conflict avoidance can be achieved with a fully substituted BNF grammar, so EBNF is not more powerful than BNF notation, at least for LL(k) grammars. Not all conflicts are the result of a null/non-null construct. In the C language, of the nine deterministic LL(1) conflicts, only two are the result of null/non-null productions, and even these two cannot be resolved by using EBNF. All of the conflicts can be resolved using an LL(2) grammar, suggesting that the LL(k) method is preferable to EBNF in the area of conflict resolution.

    An interesting consequence of the substitution property of EBNF is that it can make an LL(k) grammar strong LL(k) under the covers. Consider the classic example of a grammar that is LL(2), but not strong LL(2).
    S -> a A a
    S -> b A b a
    A -> b
    A ->
    
    The problem is that "ba" predicts both of the S productions. The left context before the A is needed to resolve the conflict. This grammar is made strong LL(2) by adding a new, duplicate nonterminal A2 in the following way.
    S -> a A a
    S -> b A2 b a
    A -> b
    A ->
    A2 -> b
    A2 ->
    
    Now the left context is not needed because A and A2 are different nonterminals. The equivalent EBNF grammar is also strong LL(2) in its natural expression, without the need to convert a grammar. The LL(2)/strong LL(2) issue never arises at all.
    S -> a [b] a
    S -> b [b] b a
    

    The advantage of EBNF in the case of this grammar is clear.

  6. What is the difference between LL(k) parsing and strong LL(k) parsing?

    LL(k) parsing uses all of the left context of the sentence being parsed. This makes it the most powerful top-down deterministic parsing method. The strong LL(k) parsing method does not use any of the left context of the parse. This is somewhat less powerful, but uses much smaller grammars and parsing tables. As noted above, EBNF SLK grammars are likely to be fully LL(k).

  7. What parsing technique should I use?

    It depends on the language to be parsed. If the language is highly ambiguous, a GLR parser is a good choice because the GLR method was designed specifically to handle ambiguity. If the language is very small, and you have not already used a parser generator, hand-coded recursive-descent is probably a good choice. It is possible that you could be finished with the parser in less time than it would take to master the intricacies of a complicated tool like YACC. For larger projects, the time invested in learning to use a parser generator will probably be offset by the time saved in coding and debugging a recursive-descent parser. This is because recursive-descent parsers require considerably more hand-written code than when a parser generator is used.

    If a suitable grammar for the language can be found or constructed, a top-down parser generator is probably a better choice than a bottom-up parser generator like YACC. Top-down parser generators are usually easier to use, and also easier to learn to use. They follow the familiar paradigm of recursion. Bottom-up parsing is a little like recursion that is turned inside out, or maybe upside down. It takes some getting used to before it becomes comfortable.

  8. Should I use a lexical analyzer tool, or write the scanner by hand?

    Lexical analysis is easier and lower-level than parsing. The need for a tool is questionable, especially if a template is available. An efficient scanner is necessary to prevent a bottleneck in the overall program. The scanner processes characters one at a time and assembles them into larger tokens for the parser. If this process is not efficient, the effect can be noticeable even though the compiler often does more actual processing. Most tools do produce fairly efficient scanners, so this is usually not a problem. FLEX, FLEX++, JFLEX, and RE2C are some popular scanner generators.

    Another reason to hand code the scanner is for flexibility. Often things can be done in the scanner to make an otherwise difficult parsing problem much easier. Some of these techniques cannot be implemented using a generator. One approach to creating a scanner is to use a lex tool for the prototype, and then to hand code a scanner later if necessary.

  9. How do I convert my EBNF grammar to the EBNF format used by SLK?

  10. How do I convert my YACC grammar to SLK format?

    The syntactic conversion can be done automatically by SLK:

        slk -ys -f=new_grammar yacc_grammar
    
    The SLK grammar from yacc_grammar will now be in new_grammar. The new grammar will probably need grammar conversions to put it in top-down form. SLK can also help with these conversions. Use the SLK options to remove left recursion and to do left factoring where needed.

  11. How do I convert the following YACC LR grammar to LL form?

    E : id
      | & id
      | ( E )
      | E [ int ]
      | E = E
      ;
    

    This grammar is ambiguous, so it is not LL(k) or LR(k). The problem is that the grammar does not specify operator precedence and associativity, so more than one parse tree can be constructed for some inputs. It is easy to see that symmetrical constructs like E=E can have more than one parse derivation. A bit more subtle is the ambiguity caused by the interaction of E[int] and E=E. The grammar does not specify whether E=E[int] should be parsed as E=(E[int]) or as (E=E)[int]. Since either interpretation is possible, the grammar is ambiguous. Many grammars can be made LL(1) by first removing ambiguity, then removing left-recursion, and finally left-factoring the grammar. One reasonable and unambiguous form of this grammar would be

    E -> F = E
    E -> F
    F -> F [ int ]
    F -> G
    G -> id
    G -> & id
    G -> ( E )
    
    This gives low precedence to = and higher to []. The grammar makes = right associative, as is usually the case. The first production would be the following if left associativity were required.
    E -> E = F
    
    Removing left-recursion results in the following grammar.
    E -> F = E
    E -> F
    F -> G more_F
    more_F -> [ int ] more_F
    more_F ->
    G -> id
    G -> & id
    G -> ( E )
    
    Left-factoring results in the following LL(1) grammar.
    E -> F E_tail
    E_tail -> = E
    E_tail ->
    F -> G more_F
    more_F -> [ int ] more_F
    more_F ->
    G -> id
    G -> & id
    G -> ( E )
    

    Note that once the ambiguity is removed from the grammar, the other transformations to make the grammar LL(1) could be done using a tool like SLK. In the general case, removing ambiguity from a grammar cannot be done deterministically because no algorithm is capable of discerning the intent of the grammar.

  12. What is the difference between LL(1) parsing and LL(k) parsing?

    LL(k) parsing uses the next k input tokens to make parsing decisions. LL(1) is LL(k) for the case where k=1. LL(k) parsing is considerably more powerful than LL(1) parsing because of the proper superset containment relationship for different k values. For all k>0, the LL(k) languages are a proper superset of the LL(k-1) languages. Note that this is a stronger statement than a similar grammar containment would be. For example, the C programming language is LL(2) but not LL(1), so an LL(1) grammar does not exist for the language.

  13. Given that I have:
    A B="1" C="2" D="3" E="4" F="5"
    In the case above, B, C, and F are optional, while D and E are required, but B, C, D, E, and F can appear in any order; thus it may come in as A C="2" D="3" B="1" E="5". Notice there is a missing F, which is optional. How can I make a grammar for this?

    Permutations are usually handled semantically. Use a broad grammar that accepts a superset of what you want, and check for invalid input in the action code. Were both D and E present? Are any phrases repeated? ...

    s : A { phrase }+
    
    phrase :
        B = "1" 
        C = "2" 
        D = "3"
        E = "4"
        F = "5"
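
    A sketch of the corresponding semantic checks in C++ action code
    follows; all of the names are hypothetical:

    enum { B_SEEN = 1, C_SEEN = 2, D_SEEN = 4, E_SEEN = 8, F_SEEN = 16 };
    static int seen = 0;

    // called from the action routine for each parsed phrase
    bool note_phrase ( int phrase_bit )
    {
        if ( seen & phrase_bit ) {
            return false;             // repeated phrase: reject
        }
        seen |= phrase_bit;
        return true;
    }

    // called from the action routine at the end of s
    bool required_present ( void )
    {
        return ( seen & D_SEEN ) && ( seen & E_SEEN );
    }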
    
  14. Is a hand-written parser more powerful than one that is generated by a parser generator?

    Yes and no. It is generally much easier to code local parsing exceptions in recursive-descent. However, maintaining the parser is difficult, and it is nearly impossible to make the parser correctly handle all error inputs. It should be noted that for modern programming languages, i.e. post C++, hand-written recursive descent may be a good choice. Its weakness is actually an advantage in this case. That is, the language ambiguities that cause so much trouble for a parser generator are simply ignored by the programmer. This works because the vast bulk of code to be compiled is correct, or nearly so, with most types of errors being well known. If a particular parse derivation does not occur in practice, there is no need to account for it in the parser. This can cause incorrect programs to be silently accepted by the compiler. But good testing can usually handle much of this.

  15. Does SLK support Unicode?

    Yes and no. The parser generator itself only deals with integer tokens, not with any kind of text. The scanner converts the input text into a stream of tokens, so it is responsible for handling Unicode. The sample C/C++ scanners are not written to handle Unicode, though they could be modified to do so. The sample Java/C# parsers do handle Unicode using the languages' built-in support.

  16. Where can I find general information about grammars and parsing?

    Any compiler text book is a good source of information.

  17. What is LL(inf)?

    It is a form of top-down short-circuit backtracking parsing. SLK implements a limited, controlled version of this and calls it predictive backtracking. Backtracking is always inefficient, so it should be used with care. The short-circuit algorithm picks the first syntactically valid parse that it encounters, not trying any others at all. In the case of the C++ expression/declaration ambiguity, this is clearly correct because the language defines it to be so. In other cases, the programmer should be cautious, and understand what he is doing.

  18. What is LL(*)?

    It is a limited version of LL(inf) that only backtracks over repeating constructs to resolve conflicts. It helps reduce the need for left factoring in these cases.

  19. Has SLK been used in commercial applications?

    Yes, quite a few from very small to very large. They are not listed here mainly because of nondisclosure concerns.


Appendix B: SLK Performance

Traditional strong LL(k) grammar analysis is usually intractable for common grammars that need a lookahead of 2 or more. This is because the number of lookahead strings is an exponential function of the size of the terminal set of the grammar. For an LL(k) language with an alphabet, or terminal set, T, this size is |T|^k. For example, the database language grammar in the table compaction example below has 321 terminals and is strong LL(7), giving 321^7, or about 3.5 x 10^17, possible lookahead strings.

The SLK algorithms analyze a grammar without having to construct all of the possible lookahead strings, only those that are relevant to the conflict being resolved. So the amount of time that it takes for SLK to analyze a grammar and construct a parser for it is a linear function of the number of conflicts in the grammar. Most grammars have only a linear number of conflicts, so the SLK analysis is linear for them.

SLK Scalability

A production compiler for a small LL(2) knowledge representation language was written using SLK. At one point, input files for this compiler were about 5 megabytes, and increasing steadily over time as new definitions were added. The 5 megabyte files took less than two seconds to compile on an ancient 133 MHz personal computer many years ago. This was a gross observation that included the time to load the program as well as to read in the file, measured from pressing the Enter key to finished.

SLK table compaction

The following is the SLK output for a database language parser. It illustrates how effective the compaction algorithms are for medium to large size languages.

This grammar is strong LL(7) except that
  - conflicts were ignored by request on 2 nonterminal(s)

         total nonterminals   = 504
          ignored conflicts   = 2
         total terminals      = 321
         total productions    = 1371
         parse table size     = 161784

         total base conflicts = 65
         total conflicts      = 297
         conflict table size  = 95337

generating the parser
compacting parse table from 161784 to 6019
compacting conflict table from 95337 to 6804


Appendix C: SLK Command Line

This is the command line and options summary that SLK prints if no parameters are given.

Usage:  slk input_filename [-options]

  -a   append parser name to header file token defines (see -n below) 
  -b   enable aggressive heuristic ambiguity checking
  -c   disable table compacting
  -cc  generate instance model C code (C with Classes)
  -cs  generate C#  code 
  -C++ generate C++ code 
  -Cr  -Gc with a single conflict specified as 2 __dot_ configs in grammar 
  -d   display parse derivation of the input grammar for debugging 
  -dot use ^G for dot in configs instead of $$, prints as dot but beeps in console 
  -Da  display all -D values 
  -Dc  display all conflict traces for debugging only 
  -Dd[=symbol name] display nonterminals that can left derive  symbol name
  -De  display nonterminals that can derive the empty string, epsilon 
  -Df  display first and follow sets 
  -Dl  display lookahead strings for resolved conflicts 
  -Dp  display productions
  -Dr  display LR states 
  -Ds  display symbol table 
  -Dt  display parse tables 
  -En  convert EBNF grammar to SLK format, n=1 is IEEE, n=2 is ISO
  -e    disable EBNF grammar support 
  -f=filename    name of the file for the output grammar
  -Fa  force _action_ naming so __ can be for user tokens
  -g   generate the parser even if the grammar is ambiguous 
  -Ga  do not generate anything 
  -Gc  generate conflict resolver only, no parser
  -Gd  generate parse derivation code in get_predicted_entry for debugging
  -Gn  do not generate nonterminal strings used for displaying 
  -Gp  do not generate the production strings used for displaying 
  -Gs  do not generate action strings used for displaying
  -Gt  do not generate terminal strings used for displaying 
  -Gu  do not generate unique names for duplicate EBNF nonterminals 
  -i   ignore non-fatal warnings 
  -j   generate java code 
  -js  generate javascript code
  -k=maximum k value to test for in the grammar   default k=7
  -l=nonterminal name   indirect left factor nonterminal the old way
  -Ld[=nonterminal name]  direct left factor nonterminal or the grammar
  -Li[=nonterminal name]  indirect left factor nonterminal or the grammar
  -[LA]LR  LALR or LR grammar, (LRS for SLR grammar)
  -m   conserve memory by reusing internal structures 
  -n=name of the parser   default is Slk, must be length 3
  -ns=namespace or package name   default is no namespace
  -O[=n]  only analyze conflict number n ( -1 = don't show conflict numbers )
  -o[=n]  only analyze the nth conflict at each level above level 1
  -Ps=parse_stack_size   size in the generated parse routine, default=512
  -q   quiet mode operation 
  -Rd  remove duplicate productions 
  -Rr[=nonterminal name]  remove left recursion
  -Ru[=nonterminal name]  remove unused nonterminals
  -Sn[=nonterminal name]  substitute nonterminals for equivalent ones
  -St[=nonterminal name]  substitute terminals for equivalent nonterminals
  -Tl=filename    load token strings from a file
  -Tw=filename    write token strings to a file
  -t[=n]  set grammar transformations trace level to n for debugging only
            ( no =n  sets t=1, ( 2, 3, 4 ) )
  -v[=n]  set verbose output level to n ( no =n  sets v=1, ( 2, 3 ) )
  -Xo  eXpand opt/_opt symbols in an LR grammar to reduce conflicts
  -y[s]  yacc style scanner is used, s=strip out all actions 

Usage and options detail


Appendix D: Quotations

Aho & Ullman (1973) state that top-down parsing is better suited to syntax-directed translation than is bottom-up parsing. From page 273:

    When we look at translation, however, the left parse appears more desirable.

Comment about LL vs. LR from Grune & Jacobs (1990), page 242:

    If one is in a position to design the grammar along with the parser, there is little doubt that LL(1) is to be preferred: not only will parsing and performing semantic actions be easier, text that conforms to an LL(1) grammar is also clearer to the human reader.

Performance comment from Fischer & LeBlanc (1991), page 120:

    Since the LL(1) driver uses a stack rather than recursive procedure calls to store symbols yet to be matched, the resulting parser can be expected to be smaller and faster than a corresponding recursive descent parser.


Aho, Alfred V., and Jeffrey D. Ullman (1973). The Theory of Parsing, Translation, and Compiling. Vol. 1: Parsing. Prentice-Hall, Englewood Cliffs, N.J.

Fischer, Charles N. and Richard J. LeBlanc, Jr. (1991). Crafting a Compiler with C. Benjamin/Cummings, Menlo Park, Ca.

Grune, Dick and Ceriel J.H. Jacobs (1990). Parsing Techniques - A Practical Guide. Ellis Horwood, Chichester, England.



Copyright © 2001-2016 SLK Systems