Chapter 7Choosing among alternatives

Note

This API has been completely rewritten for version 3.0.0.

Where it’s practical to write a grammar that is unambiguous, that’s best. But it isn’t always practical. Sometimes it’s difficult, and sometimes it’s impossible. If the data is actually ambiguous, you may have to reflect that in your grammar.

Invisible XML doesn’t consider ambiguity an error, but it also doesn’t provide any mechanism for controlling it. All parses are considered equal and the processor’s only obligation is to provide one of them.

That may not suit your needs. CoffeeSacks provides a way to examine the alternatives and select one. You can supply a choose-alternative function in the parser options. This must be a function that takes a single argument, a list of elements, and returns an integer.

The function receives a context element that identifies its location in the parse forest and a map that describes the current state of the parse:

PropertyValue
inputThe input string
available-choicesA list of available choices
other-choicesA list of other choices
user-definedUser defined properties

The function must return a map that contains a selection property that identifies the choice made. It may also return an ambiguous-choice property to indicate that an abitrary, ambiguous choice was made.

Any other properties returned in the map will be passed back on the next call. This allows the selection function to maintain state between invocations.

The function must return one of the choices passed to it. There will always be at least one item in the choices list. The difference between the available choices and the “other” choices is only relevant when the grammar contains a loop. If the grammar loops, the previously selected choices will be in other-choices. Only choose from that list if you have some other way to avoid looping infinitely.

This function always selects the first alternative:

<xsl:function name="f:first" as="map(*)">
  <xsl:param name="context" as="element()"/>
  <xsl:param name="options" as="map(*)"/>

  <xsl:sequence select="map { 'selection': $options?available-choices[1] }"/>
</xsl:function>

You could pass it in parse options like this:

<xsl:variable name="parser"
              select="cs:make-parser($grammar,
                      map{'choose-alternative: f:first#2})"/>

Of course, unconditionally choosing the first option isn’t very interesting. To explore further, consider this simple, ambiguous grammar:

   number-list = (number, -#a)+, number? .
        number = hex | decimal .
           hex = hex-digit+ .
       decimal = decimal-digit+ .
    -hex-digit = ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit = ["0"-"9" ] .

If we parse the following input,

bad
cafe
42

We might get this result:

<number-list xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <hex>bad</hex>
  <hex>cafe</hex>
  <hex>42</hex>
</number-list>

Of course, we might equally get this result:

<number-list xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <hex>bad</hex>
  <hex>cafe</hex>
  <decimal>42</decimal>
</number-list>

The ambiguity here is between “decimal” and “hexidecimal”. Here’s a function that will always select the decimal alternative:

<xsl:function name="f:choose" as="map(*)">
  <xsl:param name="context" as="element()"/>
  <xsl:param name="options" as="map(*)"/>

  <xsl:variable name="choice"
                select="$context/children[symbol[@name='decimal']]/@id"/>

  <xsl:sequence select="map { 'selection': $choice }"/>
</xsl:function>

In order to understand how this works, we need to look at what’s passed to the function. The function get’s a context element that is an XML description of the current state of the parse, and a map. For our example, the map would look something like this:

map {
  "input":"bad
cafe
42",
  "available-choices": ("C3464","C3465"),
  "other-choices":()
}

And the context element like this:

<symbol name="number" id="N3694" mark="^" start="10" length="2">
   <parent ref="N3695">$4_number-option</parent>
   <children id="C3464" priority="0">
      <symbol name="hex" ref="N3692" start="10" length="2"/>
   </children>
   <children id="C3465" priority="0">
      <symbol name="decimal" ref="N3693" start="10" length="2"/>
   </children>
</symbol>

This is a choice being made for the symbol “number”. There are two choices, identified by “children”. In this example, each of the alternatives is a single symbol, but the context can be more complex. For each symbol, we also get the range of characters that it covers in the input. For each choice, it’s computed priority. The context is always the specific choice that’s being made, but the whole forest is available. Each of the “N” references can be used to find the corresponding nodes for those symbols.

Which aspects of the context (if any) are relevant is going to depend on your grammar and the input.

7.1Grammar details

Invisible XML is an example of an extended Backus-Naur form (EBNF) grammar. The underlying parsing technlogies, either Earley or GLL, operate on simple Backus-Naur form (BNF) grammars. What that means in practice is that the first thing CoffeeGrinder does to your Invisible XML grammar is convert it into a plain BNF. That process introduces new nonterminals.

Mostly this is a behind-the-scenes transformation that isn’t relevant to the user, but it’s inescapable when describing ambiguity. Ultimately, the ambiguity is in the BNF and that’s the only grammar that the parser can describe.

The oddly named nonterminals seen above come from the fact that our iXML grammar has been transformed into this BNF:

                   $$ ::= number-list
          number-list ::= $1_number-plus, $4_number-optionⁿ
               number ::= hex
               number ::= decimal
                  hex ::= $2_hex-digit-plus
              decimal ::= $3_decimal-digit-plus
            hex-digit ::= ['0'-'9'; 'a'-'f'; 'A'-'F']
        decimal-digit ::= ['0'-'9']
       $1_number-plus ::= number, #A
       $1_number-plus ::= number, #A, $1_number-plus
    $2_hex-digit-plus ::= hex-digit
    $2_hex-digit-plus ::= hex-digit, $2_hex-digit-plus
$3_decimal-digit-plus ::= decimal-digit
$3_decimal-digit-plus ::= decimal-digit, $3_decimal-digit-plus
    $4_number-optionⁿ ::= ε
    $4_number-optionⁿ ::= number

This grammar is simpler in the sense that the rules for matching tokens in the input are simpler. There are no repetitions, alternatives, or other features on the “right hand side” of each production.

It appears more complicated partly because there are more productions and there can be multiple productions for any given nonterminal. It also appears more complicated because it’s impossible to generate semantically meaningful names for the new nonterminals introduced.