aPTK Grammar Syntax

The syntax of aPTK grammars is oriented towards BNF and, to some extent, towards Perl6 grammars.

A grammar consists of production rules and statements. Statements influence how input is parsed and/or how the parse result is interpreted. Optionally you may add test assertions to prove that your rules meet your expectations.

General

All rules and statements have to start at the same indentation level. If you want to continue a rule or statement on the next line, you can do so by indenting the next line a bit more than the line where the rule or statement started:

:grammar grammar

grammar := [ <statement> 
           | <production-rule>
           | <test-assertion>
           ]*

ws      := \s+

The lines above define a new grammar named “grammar” and its first rule, the default entry point of the grammar.

Statements

A statement is a line that starts with a ”:”. The following statements are supported:

:grammar <name> [ extends [ <grammar-name> ]+ ]?

Define a new grammar named <name>, which extends the grammars given as <grammar-name>. If you do not pass any <grammar-name>, it defaults to Grammar.

This statement is available in contexts where you have not predefined a grammar, as you do, for example, when you define your grammar as a Python class.

Examples for :grammar:

:grammar very-simple-grammar

:grammar another-grammar extends very-simple-grammar

:grammar x extends aptk.BaseGrammar
:parse-actions [ <name> <python-name> ]+

Define a ParseActions class (or module) that can later be used in tests (or simply referenced by its name when creating a parser):

:parse-actions my_module.MyParseAction

This statement imports parse-actions into your grammar so that you can make use of them in test assertions:

<some-rule> ~~ "some string" -MyParseAction-> some ast
:parse-action-map [ <string> <method-name> ]+

Map <string> to <method-name>, which is expected to exist in the parse-actions object passed to the parser. After mapping <string> to <method-name>, you can use <string>= as an operator in a production rule to assign a parse-action:

:parse-action-map
    "foo" make_foo

some-rule  foo= "some right-hand side"

Such parse-action maps come in handy if the same action is to be used for more than one capture.
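
For example, a single mapped action might be reused by several rules (a sketch with hypothetical rule and method names):

:parse-action-map
    "num" make_number

integer  num= \d+
hex-int  num= 0x[0-9a-fA-F]+

Both integer and hex-int would then trigger the same make_number method.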

:sigspace [ <non-terminal> | <terminal> ]
Set the rule used for significant whitespace.
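
For example, a grammar might define a whitespace token and declare it as the rule for significant whitespace (a sketch, mirroring the pattern used further below in this document):

ws = \s+

:sigspace {ws}
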
:args-of <custom-rule-name> [ [ <arg-flag> ]+ | <callable> ]

Specify how the arguments of a complex custom rule are parsed:

arg-flag := "string" | "capturing" | "non-capturing" | "regex" |
               "raw" | "slashed-regex" | "char-class"

callable := <module-name> "." <>
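
A sketch of what such declarations might look like, following the signature above (the rule, flag choice and module names are hypothetical):

:args-of my-custom-rule string

:args-of other-custom-rule my_module.parse_args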

Production Rules

A production rule consists of a name, an operator, and a right-hand side:

production-rule := <token-def> | <rule-def>

token-name := "{" <name> "}" | <name>
rule-name  := "<" <name> ">" | <name>

The operators you can use are listed in the Rules section below.

This was the formal definition of production rules; detailed explanations with examples follow:

Tokens

Tokens are a special form of production rules:

token-def  :-  <token-name> "=" <token-value>
<token-name>

Can be any name, consisting of any characters except whitespace, with two limitations:

  • <token-name> must not start and end with a ”:” nor be enclosed by “{:” and ”:}”
  • <token-name> may optionally be enclosed by “{” and “}” for better readability.
<token-value>
<token-value> is interpreted as a regular expression as described in re.

Tokens are simply macros where {<token-name>} is replaced by <token-value>, in such a way that quantifications of tokens apply to the whole token:

foo1 = bar
foo2 = [bar]
foo3 = a
foo4 = \n
foo5 = [ bar ]*

<some-rule-1> := here\x20is\x20{foo1}*
<some-rule-2> := here\x20is\x20{foo2}*
<some-rule-3> := here\x20is\x20{foo3}*
<some-rule-4> := here\x20is\x20{foo4}*
<some-rule-5> := here\x20is\x20{:foo1:}*
<some-rule-6> := here\x20is\x20[{:foo1:}{:foo4:}]*
<some-rule-7> := here\x20is\x20{foo5}

Token replacement creates the following rules from these, before they are actually parsed:

<some-rule-1> := here\x20is\x20(?:bar)*
<some-rule-2> := here\x20is\x20[bar]*
<some-rule-3> := here\x20is\x20a*
<some-rule-4> := here\x20is\x20\n*
<some-rule-5> := here\x20is\x20bar*
<some-rule-6> := here\x20is\x20[bar\n]*
<some-rule-7> := here\x20is\x20(?:(?:bar)*)

You can see that tokens are expanded in such a way that a quantification after the token always quantifies the entire token, unlike in <some-rule-5>, where the value of the token was simply substituted.

So you can also let your token be expanded with the {:<token-name>:} syntax, which simply expands the value of the token without taking care of grouping for clean quantifications. These expansions are intended to be used e.g. in character classes (which is also the reason for the choice of syntax), as seen in <some-rule-6>, but there may be other use cases.

<some-rule-7> uses the {foo5} token, where you see a special notation:

foo5 = [ bar ]*

In tokens, a “[” surrounded by whitespace is replaced by “(?:”, and a “]” surrounded by whitespace or followed by a quantifier like ”?”, “*”, “+” or “{a,b}” is replaced by ”)” plus the optional quantifier. This is for convenience and better readability of the token rule. Do not confuse this with:

foo6 = [bar]*

Because:

{foo5} ~~ barbar

{foo6} ~~ brarab

{foo5} !~ brarab

Rules

Formally, rules are defined like this:

rule-def     :- <rule-name> <operator> <alternatives>
alternatives :- <sequence> [ {or} <sequence> ]*
sequence     :- [ <non-terminal> | <terminal> ]+
non-terminal := [ <capturing> | <non-capturing> 
                | <sub-rule> ] <quantification>?

terminal     := <string> | <regex>

quantification :=  "?" | "*" | "+" | "{" \d* "," \d* "}"

operator := <token-op> | <backtracking-op> | <non-backtracking-op> 
          | <backtracking-sigspace-op> | <non-backtracking-sigspace-op> 



token-op                     := "="
backtracking-op              := ":" <parse-action> "="
backtracking-sigspace-op     := ":" <parse-action> "-"
non-backtracking-op          := <parse-action> "="
non-backtracking-sigspace-op := <parse-action> "-"

parse-action                 := ":" | [^=]+

A production rule has the form:

:sigspace {ws}
after-ws          = (?<=\s)
before-ws         = (?=\s)
or                = {after-ws} \| {before-ws}

<production-rule> ::- <non-terminal> <rule-op> <alternatives>

<alternatives>    ::- <sequence> [ {or} <sequence> ]*

<sequence>        ::= [ <non-terminal> | <terminal> ]+

<terminal>        ::= <string> | <regex>

<non-terminal>    ::= [ <capturing> | <non-capturing> | <sub-rule> 
                      ] <quantification>?

<quantification>  ::= "?" | "*" | "+" | "{" \d* "," \d* "}"
<non-terminal>

May be enclosed by “<” and “>” to be closer to BNF or for better readability, but this is not necessarily needed. So:

<foo> ::= "bar"

is equivalent to:

foo   ::= "bar"
<rule-op>

This is a tricky thing. Usually you will use ”:=”, but you can use any <parse-action>= for it. See also parse-actions.

There are more flavours of the <rule-op>, for specifying significant whitespace and backtracking on failure:

rule-op   description
=         Specify a token, which can later be used as a macro.
:=        Normal rule.
::=       Backtracking rule.
:foo=     Backtracking rule calling the “foo” method from ParseActions.
foo=      Normal rule calling the “foo” method from ParseActions.
:-        Normal rule using significant whitespace.
::-       Backtracking rule using significant whitespace.

In short:

  • a rule whose <rule-op> has an additional leading ”:” (before the parse-action) does backtracking on failure.
  • a rule whose <rule-op> uses a “-” instead of a “=” has significant whitespace (see the sketch below).
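
As an illustration, the following hypothetical rules combine these flavours (make_word is assumed to be a method of the ParseActions class in use):

word     :=  \w+
words    :-  <word>+
wordlist ::= <word> [ "," <word> ]*
tagged   make_word= \w+

Here words uses significant whitespace, wordlist backtracks on failure, and tagged attaches the make_word parse-action to an otherwise normal rule.
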
<string>

May be a double-quoted or a single-quoted string, like:

"foo" "foo\n" "foo\"" 'foo"bar"' 'bar\''

This is a terminal in terms of grammars.

<regex>
Anything that does not match anything else listed here is interpreted as a regular expression as defined in re.
<non-capturing>

From a syntactical point of view it is a “<.capturing>” rule, i.e. the same as a capturing rule, except that there is a ”.” right behind the opening “<”.

Non-capturing rules pass their captured children to the parent rule, which combines the children of all non-capturing children into its own list of children.

Examples:

<.simple-rule> <.rule-with-arg:foo> <.ext-rule{ here is more }>
<capturing>

A capturing rule has three syntactical flavours:

<ws>        ::= \s+

<simple>    ::= "<" <non-terminal-name> ">"

<with-arg>  ::= "<" <non-terminal-name> ":" <arg>  ">"

<with-args> ::= "<" <non-terminal-name> "{" [ <.ws> <extarg> ]* <.ws> "}>"

<arg>       ::= (?:\\\\|\\>|[^>])*

<extarg>    ::= (?!\}>\s) (?!\}>$) [^\s]+

Where <non-terminal-name> is the name of another non-terminal. The two extended versions of rule calls are for invoking custom rules, which do more than simply parse sequences or alternatives.
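
For illustration only, such calls might appear inside an ordinary rule like this (the custom rule names and their arguments are hypothetical):

value := <.quoted-string> | <number:16> | <table{ rows 3 cols 2 }>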

Please note for <with-args> rules:

Backtracking

Explain here how backtracking works

Significant Whitespace

Explain here how significant whitespace works.

Test Assertions

Assert that your rules match

If you want to assert that a rule matches a certain string, you can add an assertion:

<my-rule> ~~ "foo"

Assert that your rules do not match

If you want to assert that a rule does not match some string, you can add an assertion:

<my-rule> !~ "foo"

Assert that your rules produce an expected syntax tree

If you want to assert that a rule produces a certain syntax tree, you can add an assertion:

<my-rule> ~~ "foo" -> my-rule("foo")

Token and exact match

The difference between a token match and an exact match is that in a token match whitespace is ignored and only non-whitespace tokens are compared, whereas in an exact match the complete string is compared:

<my-rule> ~~ "something" ->
    In token
        match only
         non-whitespace          tokens
     are considered for
               comparison.

<my-rule> =~ "something" -> "Must output exactly this string"

Multiline input

You can specify multiline input (or expected output) by preceding each line with “| ”:

<my-rule> ~~
    | first line
    | second line
    |
    | And a line after an
    | empty line
    ->
    | Same for
    | expected output.

For testing your grammar, you can set up test assertions for your rules:

<my-rule> ~~ "foo"
<my-rule> !~ "foo"
<my-rule> ~~ "foo" -> my-rule("foo")
<my-rule> =~ "foo" -> "foo"
<my-rule> =~
    | a really
    | long, long
    | text.
    | 
    | with another paragraph
    -> here
       is what
       I expect
       to be the ast's output.

<my-rule> =~ "foo" -MyParseActions-> [ 'f', 'o', 'o' ]

Formally, test assertions are created with the following syntax:

<test-assertion> ::= <test-rule> <test-op> <string-to-match> [ <ast-op> <expected-output> ]?

<test-rule> ::= "<" <non-terminal-name> ">"

<test-op>   ::= (?P<token-match>~~)|(?P<not-match>!~)|(?P<equal-match>=~)

<string-to-match>   ::= <quoted-string> | <multi-line-string>
<multi-line-string> ::= [ \s* [ \|(?P<line>\n) | \|\s <line> ] ]+

<ast-op>    ::= -> | -(?P<parse-actions-name>\w+)->

<expected-output> ::= <quoted-string> | <multi-line-string> | <tokens>