Chapter 5. Ambiguity
Ambiguity is not an error. The fact that Invisible XML grammars allow ambiguity is a feature. It’s also generally observed as the combination of a grammar and an input. Consider this grammar:
product = dessert-topping | floor-wax .
dessert-topping = "custard" | "cream" | "Shimmer" .
floor-wax = "paste-wax" | "quick-shine" | "Shimmer" .
(If you aren’t familiar with the Saturday Night Live “Shimmer Floor Wax” sketch, now would be the time to go search the web.)
Parsed against the input “custard”, it says dessert topping:
$ coffeepot -pp -g:examples/ambig01.ixml custard
<product>
<dessert-topping>custard</dessert-topping>
</product>
Parsed against the input “paste-wax”, it says floor wax:
$ coffeepot -pp -g:examples/ambig01.ixml paste-wax
<product>
<floor-wax>paste-wax</floor-wax>
</product>
Parsed against “Shimmer”, it says:
$ coffeepot -pp -g:examples/ambig01.ixml Shimmer
There are 2 possible parses.
<product xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<floor-wax>Shimmer</floor-wax>
</product>
This is an example of an essential ambiguity; you can’t “fix”
this grammar. But let’s dig a little deeper anyway. For a small number
of parses, one way to investigate the ambiguity is to simply list them
all with --parse-count:all
:
$ coffeepot -pp -g:examples/ambig01.ixml --parse-count:all Shimmer
<ixml parses='2' totalParses='2'>
<product xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<floor-wax>Shimmer</floor-wax>
</product><product xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<dessert-topping>Shimmer</dessert-topping>
</product></ixml>
Alternatively, we can ask the parser to describe the ambiguity with
--describe-ambiguity
:
$ coffeepot -pp -g:examples/ambig01.ixml --describe-ambiguity --no-output Shimmer
Found 2 possible parses.
At /$$[1]
✔ product «1,7» ⇒ dessert-topping
product «1,7» ⇒ floor-wax
This indicates that the characters from 1 to 8 in the input can be matched
as a product
containing a dessert-topping
or
a floor-wax
. This is another case where the forest graph
can be useful.
It is possible for grammars to be infinitely ambigous. Consider this trivial grammar:
expr: expr ; 'a' .
There’s no practical way for coffeepot to enumerate infinitely many parses, so it essentially ignores edges in the graph if it encounters them a second time. Parsing “a” yields:
$ coffeepot -pp -g:examples/ambig02.ixml a
Found 2 possible parses (of infinitely many).
<expr xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">a</expr>
There are 2 parses because there’s one choice of two paths in the graph. A description of the ambiguity, reveals that there are infinitely many parses:
$ coffeepot -pp -g:examples/ambig02.ixml --describe-ambiguity --no-output a
Found 2 possible parses (of infinitely many).
At /$$[1]
✔ expr «1,1» ⇒ 'a'
expr «1,1» ⇒ expr
This is also evident in the graph:
That loop is the source of infinite ambiguity. The parses that coffeepot will enumerate are:
$ coffeepot -pp -g:examples/ambig02.ixml --parse-count:all a
Found 2 possible parses (of infinitely many).
<ixml parses="2" totalParses="2" infinitelyAmbiguous="true">
<expr xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">a</expr>
<expr xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<expr>a</expr>
</expr>
</ixml>
The default “sequential” tree selector will choose every node and ever edge at least once. For graphs without loops, this matches the number of parses reported. Tree construction from graphs that contain loops is challenging. The total number of parses reported does not take loops into consideration. The tree selector tries to follow all of the paths once, but it may not always succeed. Consider:
{ An absurd grammar to stress-test ambiguity handling. }
S = A .
A = B | C | X .
B = D | E .
C = F | G | A .
D = H | A .
E = I .
F = I .
G = A .
H = D | K .
I = "t" | K | G .
K = A .
X = A .
The only input that parses is the single letter “t”:
coffeepot -g:src/website/xml/examples/horror.ixml --pretty-print t
Found 12 possible parses (of infinitely many).
<S xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<A>
<B>
<E>
<I>t</I>
</E>
</B>
</A>
</S>
The claim of 12 parses is based on the number of choices in the graph. But the graph is pernicously looped:
Asking for all of the parses will return 57 different parses. That’s not all of them, even by the criteria that coffeepot is using. The processor tries to follow all of the unique paths, but also tries to avoid falling into an infinite loop.
The priority tree selector will always choose the path with the highest priority (if there is one). If all choices are made by a uniquely highest priority in each case, this fact is recorded. CoffeePot will not report such a parse as ambiguous.
If you’re trying to eliminate ambiguity from a grammar that you think should be unambiguous, look for multiple ways to match “nothing”. For example, if you have a nonterminal that matches zero or more whitespace characters, make sure it isn’t possible for it to match in two different places.
5.1. Describing ambiguity
Ambiguity can be described in one of three ways: using a compact, text format; using a minimal XML format; or using the full XML format that will be used when choosing alternatives with a function or XPath expression.
In the examples below, the parse output is from the numbers grammar and input introduced in Chapter 6, Choosing among alternatives.
5.1.1. Textual ambiguity descriptions
The text format identifies the location in the result tree with a simple XPath expression and then lists each option in the form “nonterminal => right-hand-side”. An “X” marks the selected alternative.
For example:
At /$$[1]/number-list[1]/$1_number-plus[3]
✔ number «10,2» ⇒ hex
number «10,2» ⇒ decimal
This indicates that at the “number” element, in the third item of the number list, there are two choices “hex” or “decimal”. Both span input tokens 10-11. The “hex” alternative has been selected.
5.1.2. XML ambiguity descriptions
The XML format presents the same information in a small XML fragments that represent the parse trees under consideration.
For example:
At /$$[1]/number-list[1]/$1_number-plus[3] (selected C3464)
<symbol name="number" id="N3694" mark="^" start="10" length="2">
<parent ref="N3699">$1_number-plus</parent>
<children id="C3464" priority="0">
<symbol name="hex" ref="N3692" start="10" length="2"/>
</children>
<children id="C3465" priority="0">
<symbol name="decimal" ref="N3693" start="10" length="2"/>
</children>
</symbol>
As before, this indicates that at the “number” element,
there are two choices, represented by separate children
elements.