Chapter 7. Choosing among alternatives
This API has been completely rewritten for version 3.0.0.
Where it’s practical to write a grammar that is unambiguous, that’s best. But it isn’t always practical. Sometimes it’s difficult, and sometimes it’s impossible. If the data is actually ambiguous, you may have to reflect that in your grammar.
Invisible XML doesn’t consider ambiguity an error, but it also doesn’t provide any mechanism for controlling it. All parses are considered equal and the processor’s only obligation is to provide one of them.
That may not suit your needs. CoffeeSacks provides a way to
examine the alternatives and select one. You can supply a
choose-alternative
function in the parser options. This
must be a function that takes a single argument, a list of elements,
and returns an integer.
The function receives a context element that identifies its location in the parse forest and a map that describes the current state of the parse:
Property | Value |
---|---|
input | The input string |
available-choices | A list of available choices |
other-choices | A list of other choices |
user-defined | User defined properties |
The function must return a map that contains a selection
property that
identifies the choice made. It may also return an ambiguous-choice
property to
indicate that an abitrary, ambiguous choice was made.
Any other properties returned in the map will be passed back on the next call. This allows the selection function to maintain state between invocations.
The function must return one of the choices passed to it. There will always be
at least one item in the choices
list. The difference between the available
choices and the “other” choices is only relevant when the grammar contains a loop. If the
grammar loops, the previously selected choices will be in other-choices
.
Only choose from that list if you have some other way to avoid looping infinitely.
This function always selects the first alternative:
<xsl:function name="f:first" as="map(*)">
<xsl:param name="context" as="element()"/>
<xsl:param name="options" as="map(*)"/>
<xsl:sequence select="map { 'selection': $options?available-choices[1] }"/>
</xsl:function>
You could pass it in parse options like this:
<xsl:variable name="parser"
select="cs:make-parser($grammar,
map{'choose-alternative: f:first#2})"/>
Of course, unconditionally choosing the first option isn’t very interesting. To explore further, consider this simple, ambiguous grammar:
number-list = (number, -#a)+, number? .
number = hex | decimal .
hex = hex-digit+ .
decimal = decimal-digit+ .
-hex-digit = ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit = ["0"-"9" ] .
If we parse the following input,
bad
cafe
42
We might get this result:
<number-list xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<hex>bad</hex>
<hex>cafe</hex>
<hex>42</hex>
</number-list>
Of course, we might equally get this result:
<number-list xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<hex>bad</hex>
<hex>cafe</hex>
<decimal>42</decimal>
</number-list>
The ambiguity here is between “decimal” and “hexidecimal”. Here’s a function that will always select the decimal alternative:
<xsl:function name="f:choose" as="map(*)">
<xsl:param name="context" as="element()"/>
<xsl:param name="options" as="map(*)"/>
<xsl:variable name="choice"
select="$context/children[symbol[@name='decimal']]/@id"/>
<xsl:sequence select="map { 'selection': $choice }"/>
</xsl:function>
In order to understand how this works, we need to look at what’s passed to the function. The function get’s a context element that is an XML description of the current state of the parse, and a map. For our example, the map would look something like this:
map {
"input":"bad
cafe
42",
"available-choices": ("C3464","C3465"),
"other-choices":()
}
And the context element like this:
<symbol name="number" id="N3694" mark="^" start="10" length="2">
<parent ref="N3695">$4_number-option</parent>
<children id="C3464" priority="0">
<symbol name="hex" ref="N3692" start="10" length="2"/>
</children>
<children id="C3465" priority="0">
<symbol name="decimal" ref="N3693" start="10" length="2"/>
</children>
</symbol>
This is a choice being made for the symbol “number”. There are two choices, identified by “children”. In this example, each of the alternatives is a single symbol, but the context can be more complex. For each symbol, we also get the range of characters that it covers in the input. For each choice, it’s computed priority. The context is always the specific choice that’s being made, but the whole forest is available. Each of the “N” references can be used to find the corresponding nodes for those symbols.
Which aspects of the context (if any) are relevant is going to depend on your grammar and the input.
7.1. Grammar details
Invisible XML is an example of an extended Backus-Naur form (EBNF) grammar. The underlying parsing technlogies, either Earley or GLL, operate on simple Backus-Naur form (BNF) grammars. What that means in practice is that the first thing CoffeeGrinder does to your Invisible XML grammar is convert it into a plain BNF. That process introduces new nonterminals.
Mostly this is a behind-the-scenes transformation that isn’t relevant to the user, but it’s inescapable when describing ambiguity. Ultimately, the ambiguity is in the BNF and that’s the only grammar that the parser can describe.
The oddly named nonterminals seen above come from the fact that our iXML grammar has been transformed into this BNF:
$$ ::= number-list
number-list ::= $1_number-plus, $4_number-optionⁿ
number ::= hex
number ::= decimal
hex ::= $2_hex-digit-plus
decimal ::= $3_decimal-digit-plus
hex-digit ::= ['0'-'9'; 'a'-'f'; 'A'-'F']
decimal-digit ::= ['0'-'9']
$1_number-plus ::= number, #A
$1_number-plus ::= number, #A, $1_number-plus
$2_hex-digit-plus ::= hex-digit
$2_hex-digit-plus ::= hex-digit, $2_hex-digit-plus
$3_decimal-digit-plus ::= decimal-digit
$3_decimal-digit-plus ::= decimal-digit, $3_decimal-digit-plus
$4_number-optionⁿ ::= ε
$4_number-optionⁿ ::= number
This grammar is simpler in the sense that the rules for matching tokens in the input are simpler. There are no repetitions, alternatives, or other features on the “right hand side” of each production.
It appears more complicated partly because there are more productions and there can be multiple productions for any given nonterminal. It also appears more complicated because it’s impossible to generate semantically meaningful names for the new nonterminals introduced.