-
Notifications
You must be signed in to change notification settings - Fork 9
Scannerless Parsing
Grammars writted using the bootbnf.bnf tag language always inherit from some base grammar. This package implements two such base grammars directly in JavaScript:
-
defaultBaseGrammaras a simple lexer for defining a conventional two-level grammar. -
baseScannerlessas a base for scannerless parsing.
If no base grammar is explicitly specified, then the defined grammar implicitly inherits from defaultBaseGrammar, which defines the token-level rules
-
SKIPcallsSPACEandCOMMENTto skip whitespace tokens. -
SPACEcontiguous whitespace characters. -
COMMENTa line comment. -
HOLEa${...}template string hole, i.e., an anti-quote. -
EOFend of input. -
NUMBERuses JSON number syntax. -
STRINGuses JSON string syntax. -
IDENTconventional alphanumeric identifier including underbar (_) and dollar ($).
Except for SKIP, SPACE, and COMMENT, all the other token-level rules first call SKIP to skip whatever a given grammar considers whitespace. Derived grammars should do likewise if they introduce new token-level rules.
In grammars inheriting from defaultBaseGrammar, quoted identifiers define keywords which are thereby excluded from the IDENT production. Recognition of other quoted punctuation should ideally be based on what quoted strings appear in the grammar definition. However, this simple lexer instead simply recognizes
- Any of
[](){},;as a single character token. - Any contiguous sequence of the characters
:~@%&+=*<>.?|-^\/as a single token.
For defining little embedded DSLs, these simple rules work surprisingly well. For more sophicticated lexing needs, you can write an alternative to defaultBaseGrammar, or you can use scannerless parsing.
The baseScannerless grammar defines only the rules
-
SKIPwhich is meant to be overridden by derived grammars which wish to universally ignore whatever they consider to be whitespace. The default here matches only empty. TheSKIPrule should always be defined so accept empty, so that it always succeeds.CHAR,HOLE, andEOFall callSKIPfirst. In derived grammars overridingSKIP, further token-level rules should also start by callingSKIP. -
CHARmatchesSKIPfollowed by a single character. Its semantic value is that character. -
HOLEmatchesSKIPfollowed by a template string hole, i.e., an embedded "```${...}" template string unquoted expression. Its semantic value is the hole number, i.e., the index into the array of substitution arguments. -
EOFmatchesSKIPfollowed only by the end of input.
The scannerless.es6 module uses grammar inheritance to define a derived grammar, scannerless, that provides the same token-level productions as defaultBaseGrammar in a scannerless manner.
const scannerless = bnf.extends(baseScannerless)`
start ::= TOKEN* EOF ${(toks,_) => toks};
SKIP ::= (SPACE / COMMENT)*;
SPACE ::= this.${skip(SPACE_RE)};
# COMMENT is broken out to make it easy to override
COMMENT ::= this.${skip(LINE_COMMENT_RE)};
TOKEN ::= NUMBER / STRING / IDENT / CHAR / HOLE;
NUMBER ::= this.${match(NUMBER_RE)};
STRING ::= this.${match(STRING_RE)};
IDENT ::= this.${match(IDENT_RE)};
`;This example makes use of an additional feature of our bnf language, external rule methods, as well as two helper functions, skip and match, to help define such external rule methods.
The bnf language syntax
"this" "." HOLE
expresses a call to an external rule. The substitution value in the hole is expected to be a this-sensitive function whose this-value is expected to be a parser (i.e., an instance of a Parser class), and whose single argument is a position. It is expected to return a pair of a new position and either a semantic value or FAIL. In other words, this external function is invoked as if it is a rule method of this parser. By virtue of packrat memoization, this function will be called at most once at any given position of a parse.
The skip and match helper functions take a JavaScript RegExp (regular expression) object as argument, and returns a function that can be used as an external rule method. This rule method will eat as many characters as match the regular expression starting at the current position. The difference is that match defines an external rule method that first calls the SKIP rule, which is normally what you want. The skip function does not, which is typically only needed for defining components of the SKIP rule, such as SPACE and COMMENT above.
The *_RE variables above hold the same regular expressions used to define the defaultBaseGrammar, helping to ensure that these agree.