Chapter 3. How it works

At the end of the day, parsing an input against an Invisible XML grammar is really parsing that input against an Earley grammar constructed to recognize the same set of inputs (sometimes called “sentences”). When everything works, this is completely concealed. Unfortunately, when things go wrong, coffeepot can do little but show you the underlying grammar.

You can peek at this directly, even on a successful parse, with the --show-grammar option.

The Invisible XML grammar is much more directly expressive than the rules allowed by the Earley parser. The Earley parser accepts a set of rules that map nonterminals to a flat sequence of symbols (either terminals or nonterminals). The underlying grammar has no direct support for optionality or repeatability.

In order to support these “higher level” concepts, the parser rewrites the rules in terms of additional nonterminals. (Some of these rewrites are described in the Hints for Implementors section of the Invisible XML specification.)

Let’s look at some examples.

3.1. Mapping a simple rule

Simple rules, like a match for “aba”, map in a straightforward way.

Example 3.1. The Invisible XML grammar

s: a, b, a .            { match "aba" }
a: 'a' .
b: 'b' .

The resulting Earley grammar has introduced a new nonterminal “$$” as the start symbol. This means that all of the grammars will have the same start symbol, irrespective of the users grammar.

Example 3.2. The Earley grammar

$$ ::= s
 s ::= a, b, a
 a ::= 'a'
 b ::= 'b'

3.2. Mapping alternatives

An Invisible XML grammar must have exactly one rule for each nonterminal. Alternatives must be expressed in that single rule:

Example 3.3. The Invisible XML grammar

s: a, b, a ; b, a, b .  { match "aba" or "bab" }
a: 'a' .
b: 'b' .

In the Earley grammar, there are no alternatives, but there is no prohibition on multiple rules for the same nonterminal.

Example 3.4. The Earley grammar

$$ ::= s
 s ::= a, b, a
 s ::= b, a, b
 a ::= 'a'
 b ::= 'b'

3.3. Mapping “x?”

Invisible XML uses “?” to express that a symbol is optional.

Example 3.5. The Invisible XML grammar

s: a, b?, a .           { match "aba" or "aa" }
a: 'a' .
b: 'b' .

Optionality is implemented by having a new rule where the right hand side is either “nothing” symbolized by the greek letter ε or the other alternative. When CoffeeGrinder has to introduce new symbols, it names them things like “$1_b-optionⁿ”. The leading “$1_” (or “$2_”, “$3_”, etc.) just makes the name unique. What follows is an attempt to describe the rule “b-option”. The superscript “ⁿ” tells you that this rule can match nothing. (Rules that match nothing are a common source of ambiguity, so it’s handy to be able to see them at a glance.)

Example 3.6. The Earley grammar

          $$ ::= s
           s ::= a, $1_b-optionⁿ, a
           a ::= 'a'
           b ::= 'b'
$1_b-optionⁿ ::= ε
$1_b-optionⁿ ::= b

3.4. Mapping “x*”

Invisible XML uses “*” to express that a symbol may be repeated zero or more times.

Example 3.7. The Invisible XML grammar

s: a, b*, a .           { match "aa", "aba", "abba", "abbba" ... }
a: 'a' .
b: 'b' .

This requires a couple of new nonterminals.

Pay particular attention to nullable nonterminals, either your own or ones introduced by rewriting. The most common source of ambiguity in a grammar is to allow more than one nonterminal at a particular location to map to ε.

Example 3.8. The Earley grammar

        $$ ::= s
         s ::= a, $1_b-starⁿ, a
         a ::= 'a'
         b ::= 'b'
$1_b-starⁿ ::= ε
$1_b-starⁿ ::= $2_b-plus
 $2_b-plus ::= b
 $2_b-plus ::= b, $2_b-plus

3.5. Mapping “x+”

Invisible XML uses “+” to express that a symbol may be repeated one or more times.

Example 3.9. The Invisible XML grammar

s: a, b+, a .           { match "aba", "abba", "abbba" ... }
a: 'a' .
b: 'b' .

Requires this new nonterminal.

Example 3.10. The Earley grammar

       $$ ::= s
        s ::= a, $1_b-plus, a
        a ::= 'a'
        b ::= 'b'
$1_b-plus ::= b
$1_b-plus ::= b, $1_b-plus

3.6. Mapping “x1**x2”

Invisible XML uses “**” between two symbols to indicate that the first symbol may be repeated zero or more times, separated by the second. For example, name*',' accepts zero or more names (however they’re defined) separated by commas.

This is a clever syntactic convenience for authors.

Example 3.11. The Invisible XML grammar

s: a, b**'.', a .        { match "aa", "aba", "ab.ba", "ab.b.ba" ... }
a: 'a' .
b: 'b' .

It’s a bit of a mess for the Earley grammar.

Example 3.12. The Earley grammar

            $$ ::= s
             s ::= a, $1_b-star-sepⁿ, a
             a ::= 'a'
             b ::= 'b'
$1_b-star-sepⁿ ::= ε
$1_b-star-sepⁿ ::= $2_b-plus-sep
 $2_b-plus-sep ::= b, $3_L.-starⁿ
   $3_L.-starⁿ ::= ε
   $3_L.-starⁿ ::= $4_L.-plus
    $4_L.-plus ::= '.', b
    $4_L.-plus ::= '.', b, $4_L.-plus

3.7. Mapping “x1++x2”

Invisible XML uses “++” between two symbols to indicate that the first symbol may be repeated one or more times, separated by the second.

Example 3.13. The Invisible XML grammar

s: a, b++'.', a .        { match "aba", "ab.ba", "ab.b.ba" ... }
a: 'a' .
b: 'b' .

Also a bit of a mess for the Earley grammar.

Example 3.14. The Earley grammar

           $$ ::= s
            s ::= a, $1_b-plus-sep, a
            a ::= 'a'
            b ::= 'b'
$1_b-plus-sep ::= b, $2_L.-starⁿ
  $2_L.-starⁿ ::= ε
  $2_L.-starⁿ ::= $3_L.-plus
   $3_L.-plus ::= '.', b
   $3_L.-plus ::= '.', b, $3_L.-plus

3.8. Mapping ¯\_(ツ)_/¯

Of course, the syntax allows these rules to be mixed more-or-less arbitrarily.

Example 3.15. The Invisible XML grammar

{ match "a", "aa", "aba", "ab-ba", "ab.ba" ... }
s: a+, (b**'-' ; b++'.')?, a* .
a: 'a' .
b: 'b' .

Consequently, real grammars tend to be quite messy. (Not that I’d call this a real grammar in any useful sense.)

Example 3.16. The Earley grammar

                $$ ::= s
                 s ::= $8_a-plus, $12_$1_alt-optionⁿ, $5_a-starⁿ
                 a ::= 'a'
                 b ::= 'b'
            $1_alt ::= $2_b-star-sepⁿ
            $1_alt ::= $3_b-plus-sep
    $2_b-star-sepⁿ ::= ε
    $2_b-star-sepⁿ ::= $4_b-plus-sep
     $3_b-plus-sep ::= b, $6_L.-starⁿ
     $4_b-plus-sep ::= b, $7_L--starⁿ
        $5_a-starⁿ ::= ε
        $5_a-starⁿ ::= $9_a-plus
       $6_L.-starⁿ ::= ε
       $6_L.-starⁿ ::= $10_L.-plus
       $7_L--starⁿ ::= ε
       $7_L--starⁿ ::= $11_L--plus
         $8_a-plus ::= a
         $8_a-plus ::= a, $8_a-plus
         $9_a-plus ::= a
         $9_a-plus ::= a, $9_a-plus
       $10_L.-plus ::= '.', b
       $10_L.-plus ::= '.', b, $10_L.-plus
       $11_L--plus ::= '-', b
       $11_L--plus ::= '-', b, $11_L--plus
$12_$1_alt-optionⁿ ::= ε
$12_$1_alt-optionⁿ ::= $1_alt

3.9. A word about “-”, “@”, and “^”

The marks added to the grammar that reflect how (or if) a symbol should be output have no effect on the rule rewriting. They’re handled as attributes on the symbols.

3.10. Insertions

T.B.D.