Chapter 5Ambiguity

Ambiguity is not an error. The fact that Invisible XML grammars allow ambiguity is a feature. It’s also generally observed as the combination of a grammar and an input. Consider this grammar:

Example 5.1SNL™ Shimmer Sketch
product         =  dessert-topping | floor-wax .
dessert-topping = "custard" | "cream" | "Shimmer" .
floor-wax       = "paste-wax" | "quick-shine" | "Shimmer" .

(If you aren’t familiar with the Saturday Night Live “Shimmer Floor Wax” sketch, now would be the time to go search the web.)

Parsed against the input “custard”, it says dessert topping:

$ coffeepot -pp -g:examples/ambig01.ixml custard

Parsed against the input “paste-wax”, it says floor wax:

$ coffeepot -pp -g:examples/ambig01.ixml paste-wax

Parsed against “Shimmer”, it says:

$ coffeepot -pp -g:examples/ambig01.ixml Shimmer
There are 2 possible parses.
<product xmlns:ixml="" ixml:state="ambiguous">

This is an example of an essential ambiguity; you can’t “fix” this grammar. But let’s dig a little deeper anyway. For a small number of parses, one way to investigate the ambiguity is to simply list them all with --parse-count:all:

$ coffeepot -pp -g:examples/ambig01.ixml --parse-count:all Shimmer
<ixml parses='2' totalParses='2'>
<product xmlns:ixml="" ixml:state="ambiguous">
</product><product xmlns:ixml="" ixml:state="ambiguous">

Alternatively, we can ask the parser to describe the ambiguity with --describe-ambiguity:

$ coffeepot -pp -g:examples/ambig01.ixml --describe-ambiguity --no-output Shimmer
Found 2 possible parses.
At /$$[1]
      ✔ product «1,7» ⇒ dessert-topping
        product «1,7» ⇒ floor-wax

This indicates that the characters from 1 to 8 in the input can be matched as a product containing a dessert-topping or a floor-wax. This is another case where the forest graph can be useful.

Figure 5.1The parse forest for “Shimmer”

It is possible for grammars to be infinitely ambigous. Consider this trivial grammar:

Example 5.2An infinitely ambiguous grammar
expr: expr ; 'a' .

There’s no practical way for coffeepot to enumerate infinitely many parses, so it essentially ignores edges in the graph if it encounters them a second time. Parsing “a” yields:

$ coffeepot -pp -g:examples/ambig02.ixml a
Found 2 possible parses (of infinitely many).
<expr xmlns:ixml="" ixml:state="ambiguous">a</expr>

There are 2 parses because there’s one choice of two paths in the graph. A description of the ambiguity, reveals that there are infinitely many parses:

$ coffeepot -pp -g:examples/ambig02.ixml --describe-ambiguity --no-output a
Found 2 possible parses (of infinitely many).
At /$$[1]
      ✔ expr «1,1» ⇒ 'a'
        expr «1,1» ⇒ expr

This is also evident in the graph:

Figure 5.2The forest for an infinitely ambiguous parse

That loop is the source of infinite ambiguity. The parses that coffeepot will enumerate are:

$ coffeepot -pp -g:examples/ambig02.ixml --parse-count:all a
Found 2 possible parses (of infinitely many).
<ixml parses="2" totalParses="2" infinitelyAmbiguous="true">
<expr xmlns:ixml="" ixml:state="ambiguous">a</expr>
<expr xmlns:ixml="" ixml:state="ambiguous">

The default “sequential” tree selector will choose every node and ever edge at least once. For graphs without loops, this matches the number of parses reported. Tree construction from graphs that contain loops is challenging. The total number of parses reported does not take loops into consideration. The tree selector tries to follow all of the paths once, but it may not always succeed. Consider:

{ An absurd grammar to stress-test ambiguity handling. }
S = A .
A = B | C | X .
B = D | E .
C = F | G | A .
D = H | A .
E = I .
F = I .
G = A .
H = D | K .
I = "t" | K | G .
K = A .
X = A .

The only input that parses is the single letter “t”:

coffeepot -g:src/website/xml/examples/horror.ixml --pretty-print t
Found 12 possible parses (of infinitely many).
<S xmlns:ixml="" ixml:state="ambiguous">

The claim of 12 parses is based on the number of choices in the graph. But the graph is pernicously looped:

Figure 5.3A perniciously looped graph

An absurd grammar to test ambiguity handling.

Asking for all of the parses will return 57 different parses. That’s not all of them, even by the criteria that coffeepot is using. The processor tries to follow all of the unique paths, but also tries to avoid falling into an infinite loop.

The priority tree selector will always choose the path with the highest priority (if there is one). If all choices are made by a uniquely highest priority in each case, this fact is recorded. CoffeePot will not report such a parse as ambiguous.

If you’re trying to eliminate ambiguity from a grammar that you think should be unambiguous, look for multiple ways to match “nothing”. For example, if you have a nonterminal that matches zero or more whitespace characters, make sure it isn’t possible for it to match in two different places.

5.1Describing ambiguity

Ambiguity can be described in one of three ways: using a compact, text format; using a minimal XML format; or using the full XML format that will be used when choosing alternatives with a function or XPath expression.

In the examples below, the parse output is from the numbers grammar and input introduced in Chapter 6, Choosing among alternatives.

5.1.1Textual ambiguity descriptions

The text format identifies the location in the result tree with a simple XPath expression and then lists each option in the form “nonterminal => right-hand-side”. An “X” marks the selected alternative.

For example:

At /$$[1]/number-list[1]/$1_number-plus[3]
      ✔ number «10,2» ⇒ hex
        number «10,2» ⇒ decimal

This indicates that at the “number” element, in the third item of the number list, there are two choices “hex” or “decimal”. Both span input tokens 10-11. The “hex” alternative has been selected.

5.1.2XML ambiguity descriptions

The XML format presents the same information in a small XML fragments that represent the parse trees under consideration.

For example:

At /$$[1]/number-list[1]/$1_number-plus[3] (selected C3464)
<symbol name="number" id="N3694" mark="^" start="10" length="2">
   <parent ref="N3699">$1_number-plus</parent>
   <children id="C3464" priority="0">
      <symbol name="hex" ref="N3692" start="10" length="2"/>
   <children id="C3465" priority="0">
      <symbol name="decimal" ref="N3693" start="10" length="2"/>

As before, this indicates that at the “number” element, there are two choices, represented by separate children elements.