Recursive Descent Parser using LINQ: The Augmented Backus-Naur Form Grammar
A grammar is a device to define syntax for a language. A grammar is made up of rules, sometimes called productions. Each rule defines a symbol, when can then be further used in other rules. Grammars are not hard to understand; most developers instinctively understand grammars when they see them. When you learn a new programming language, almost without thinking about it, you assemble some version of the grammar in your head. One of the benefits of reading the grammar of a language is to make sure that the conceptual grammar you’ve mentally assembled matches the actual grammar of the language.
This blog is inactive.
New blog: EricWhite.com/blog
Microsoft devotes a great deal of effort towards writing interoperability documents. If you are a document format geek like me (or even if you only peripherally use document formats), you can find a treasure trove of information on MSDN under Microsoft Office File Format Documents.
The grammar that we want to use to parse SpreadsheetML formulas is in the interoperability document: Excel Extensions to the Office Open XML SpreadsheetML File Format (.xlsx) Specification. This grammar is expressed in Augmented Backus–Naur Form (ABNF).
In this post, I’m going to distill ABNF to just the set of rules and grammar syntax that we need to understand to write a parser for the grammar in the Excel Extensions Specification (linked above). I’ll take all examples of ABFN grammar from that spec.
Terminals express the actual text of the programming language. A grammar expresses a terminal either as a quoted string, or sometimes as a list of values, such as the hex values for the digits from “0” – “9”.
Following is another symbol that uses a literal string terminal.
full-stop = "."
Just to be clear, this is the terminal:
Where this is the grammar rule:
The terminal in the full-stop rule is just the string literal:
full-stop = "."
A symbol can be comprised of one OR another symbol. In ABNF, “OR” is expressed as a forward slash. The following symbol definition defines constant to be one of the list of varieties of constants.
constant = error-constant / logical-constant / numerical-constant / string-constant / array-constant
The logical-constant symbol is one of two terminals, expressed as quoted strings:
logical-constant = "FALSE" / "TRUE"
Two symbols separated by a space indicate that you must first have the one symbol, followed by the second symbol. The following rule specifies that the fractional-part symbol requires a full-stop followed by a digit-sequence.
fractional-part = full-stop digit-sequence
A symbol or terminal that is optional is surrounded by square brackets. The following definition of the exponent-part symbol indicates that the sign before the digit-sequence is optional.
exponent-part = exponent-character [ sign ] digit-sequence
The definition of exponent-character is of course:
exponent-character = "E"
The following examples could produce an exponent-part symbol:
Zero or More
If a symbol is preceded by an asterisk (*), zero or more of those symbols can occur at that point in the production of the symbol being defined. The following rule says that an expression can be made up of a ref-expression, or zero or more instances of the whitespace symbol, followed by a nospace-expression, followed by zero or more instances of the whitespace symbol.
expression= ref-expression / *whitespace nospace-expression *whitespace
The symbol bring-to-front-params is defined to be an open parenthesis followed by zero or more space symbols, followed by a close parenthesis.
bring-to-front-params = "(" *space ")"
N or More
If a symbol is preceded by a number followed by an asterisk, it indicates that you must have at least n instances of that symbol. The following defines a digit-sequence to be one or more decimal digits:
digit-sequence = 1*decimal-digit
Exactly N Symbols
If a symbol is preceded by a number, it indicates that you must have exactly n instances of that symbol. The following defines that an escaped-double-quote is comprised of exactly two adjacent double-quote symbols.
escaped-double-quote = 2double-quote
N to M Symbols
The following defines that the and-params symbol consists of an open parenthesis, followed by either a single argument-expression or an argument, followed by 1-254 comma/argument pairs, followed by a close parenthesis.
and-params = "(" (argument-expression / (argument 1*254("," argument))) ")"
Symbols in a production can be grouped by parentheses, and then preceded by a symbol quantifier. The following defines that the constant-list-rows symbol consists of one constant-list-row, followed by zero or more pairs of symbols, where the pair is a semicolon, followed by a constant-list-row.
constant-list-rows = constant-list-row *(semicolon constant-list-row)
Exceptions and Special Rules
In some places, the grammar defines some special rules in text. In our case, the following special rule is defined for an array-constant:
An array-constant MUST NOT contain
Columns or rows of unequal length.
In addition, following the grammar, there is additional text that describes further restrictions or exceptions. As necessary, we’ll need to incorporate those restrictions.
You can see that grammar rules are not very complex.
My approach for coding the recursive descent parser will be to paste the grammar rule directly into the class that implements the rule as a C# comment. This makes it very easy to correlate the grammar to the code that implements the rule.
In the next post, I’m going to define a super-small grammar that is a subset of the Excel formulas grammar. Then in subsequent posts, we’ll implement and test a parser for that small grammar.