Chapter 9. Pragmas
It’s possible to influence the behavior of the processor by placing pragmas in your grammar.
Pragmas are a separate feature; they are not part of Invisible XML 1.0. As of 4 September 2022, the pragma syntax accepted by CoffeeFilter (and CoffeePot) has been updated to the grammar described in “Designing for change: Pragmas in Invisible XML as an extensibility mechanism”, presented at Balisage 2022.
If you run CoffeePot with the --pedantic option, you cannot use pragmas.
A pragma begins with “{[” and is followed by a pragma name, pragma data (which may be empty), and closes with “]}”. The pragma name is a shortcut for a URI which provides the “real” identity of the pragma. This mechanism leverages URI space to achieve distributed extensibility.
The mapping from names to URIs is done with the “pragma” pragma at the top of your grammar. This, for example, declares the name “nineml” as the pragma identified by the URI “https://nineml.org/ns/pragma/”:
{[+pragma nineml "https://nineml.org/ns/pragma/"]}
CoffeePot ignores any pragmas it does not recognize. The rest of this document assumes that you have declared the pragma name “nineml” as shown above. You must do this in every grammar file where you use pragmas.
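As a rough illustration of the name-to-URI mapping, the following Python sketch collects pragma declarations from grammar text and keeps only the pragmas whose names resolve to a known URI. It is not how CoffeePot parses pragmas: the regular expressions, the function names `declared_names` and `recognized_pragmas`, and the `example.com` URIs are assumptions made for this example, and pragma data containing “]” is not handled.

```python
import re

# Illustrative patterns only; the real pragma syntax is defined by a grammar,
# not by these regular expressions.
DECLARATION = re.compile(r'\{\[\+pragma\s+(\S+)\s+"([^"]*)"\s*\]\}')
PRAGMA = re.compile(r'\{\[\+?(\S+)\s*([^\]]*)\]\}')

NINEML = "https://nineml.org/ns/pragma/"


def declared_names(grammar):
    """Map each declared pragma name to the URI that identifies it."""
    return {name: uri for name, uri in DECLARATION.findall(grammar)}


def recognized_pragmas(grammar, known_uris=(NINEML,)):
    """Return (name, data) pairs for pragmas whose name maps to a known URI."""
    names = declared_names(grammar)
    found = []
    for name, data in PRAGMA.findall(grammar):
        if name == "pragma":
            continue  # skip the declarations themselves
        if names.get(name) in known_uris:
            found.append((name, data.strip()))
        # anything else is unrecognized and silently ignored
    return found


# Hypothetical grammar text: "other" is declared but bound to an unknown URI.
grammar = '''{[+pragma nineml "https://nineml.org/ns/pragma/"]}
{[+pragma other "https://example.com/pragma/"]}
{[+nineml ns "http://example.com/ns"]}
{[other something]} rule: 'a'.
'''

names = declared_names(grammar)
kept = recognized_pragmas(grammar)
```

Here `kept` contains only the “nineml” pragma; the “other” pragma is dropped because its URI is not one the sketch knows about.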
Pragmas can be associated with the entire grammar or with a rule, a nonterminal symbol, or a terminal symbol:
-
A pragma placed before a symbol applies to the symbol that follows it:
rule: {[pragma applies to “A”]} A, {[pragma applies to “b”]} 'b'.
-
A pragma placed before a rule applies to the rule that follows it:
{[pragma applies to “rule”]} rule: {[pragma applies to “A”]} A, {[pragma applies to “b”]} 'b'.
-
To apply to the entire grammar, a pragma must be placed in the prolog:
{[+pragma applies to whole grammar]} {[pragma applies to “rule”]} rule: {[pragma applies to “A”]} A, {[pragma applies to “b”]} 'b'.
More than one pragma can appear at any of those locations:
{[+pragma applies to whole grammar]}
{[+second pragma applies to whole grammar ]}
{[pragma applies to “rule”]}
{[second pragma applies to “rule”]}
rule:
{[pragma applies to “A”]}
{[second pragma applies to “A”]} A,
{[pragma applies to “b”]}
{[second pragma applies to “b”]} 'b'.
If a pragma is not recognized, or does not apply, it is ignored. CoffeePot will generate debug-level log messages to alert you to pragmas that it is ignoring.
9.1. Grammar pragmas
The following pragmas apply to the grammar as a whole.
9.1.1. csv-columns
Identifies the columns to be output when CSV output is selected.
Usage:
{[+nineml csv-columns list,of,names]}
Ordinarily, CSV formatted output includes all the columns in (roughly) the order they occur in the XML. This pragma allows you to list the columns you want output and the order in which you want them output.
If a requested column does not exist in the document, it is ignored; no empty column is produced for it.
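The selection behavior described above can be sketched in Python. This is only a model of the documented semantics, not CoffeePot’s CSV serializer: the function name `select_columns`, the sample rows, and the presence of a header row are assumptions made for the example.

```python
import csv
import io


def select_columns(rows, requested):
    """Write rows as CSV using only the requested columns, in the requested order.

    rows: list of dicts, one per record.
    requested: column names, as they would be listed in the pragma data.
    """
    # Collect the columns that actually occur in the data.
    available = []
    for row in rows:
        for key in row:
            if key not in available:
                available.append(key)

    # Requested columns that do not exist are dropped, not emitted empty.
    columns = [name for name in requested if name in available]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row.get(name, "") for name in columns])
    return out.getvalue()


# Hypothetical data; "missing" is requested but never occurs.
rows = [
    {"name": "Ada", "year": "1815", "city": "London"},
    {"name": "Alan", "year": "1912", "city": "London"},
]
result = select_columns(rows, "year,name,missing".split(","))
```

The output lists “year” before “name” because that is the requested order, and neither “city” (not requested) nor “missing” (not present) appears.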
9.1.2. import
Allows one grammar to import another.
Usage:
{[+nineml import "grammar-uri"]}
In principle, this pragma allows you to combine grammars. This feature is experimental and no coherent semantics have yet been established.
9.1.3. ns
Declares the default namespace for the output XML.
Usage:
{[+nineml ns "namespace-uri"]}
9.1.4. record-end
The record-end pragma enables record-oriented processing by default. Its value is the regular expression that marks record ends. Unlike the other pragmas, this one has a different URI binding:
Usage:
{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-end "\n([^ ])"]}
9.1.5. record-start
The record-start pragma enables record-oriented processing by default. Its value is the regular expression that marks record starts. Unlike the other pragmas, this one has a different URI binding:
Usage:
{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-start "([^\\])\n"]}
9.2. Rule pragmas
The following pragmas apply to rules.
9.2.1. discard-empty
If the nonterminal defined by this rule is empty, it will be discarded (not serialized at all).
Usage:
{[nineml discard empty]}
9.2.2. regex
This pragma replaces the “right hand side” of a nonterminal with a regular expression. This is a “greedy match” and can greatly improve performance in some cases.
Usage:
{[nineml regex "regular expression"]}
For example:
{[+pragma n "https://nineml.org/ns/pragma/"]}
number-list = (number, -#a)+, number? .
number = hex | decimal .
hex = hex-digit+ {[n regex "[0-9a-fA-F]+"]} .
decimal = decimal-digit+ {[n regex "[0-9]+"]} .
-hex-digit = ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit = ["0"-"9" ] .
This grammar will consume hex and decimal nonterminals with a regular expression.
The regex pragma is a sharp tool and comes with a number of caveats.
-
The regular expression is used in place of the right hand side specified in the grammar. No effort is made to determine if the original iXML rule (that will be used by a processor that doesn’t support the regex pragma) matches the same input as the regular expression.
-
The regular expression match is greedy. It will consume all of the characters that match. If this consumes “too much” input, the parse will fail. Consider this grammar:
S = 'a', B, 'a' .
B = ('a'|'b')+ .
It will match “aaaba” to produce “<S>a<B>aab</B>a</S>”. At first glance, this grammar may seem equivalent:
{[+pragma n "https://nineml.org/ns/pragma/"]}
S = 'a', B, 'a' .
{[n regex "[ab]+"]} B = ('a'|'b')+ .
But if regular expressions are used, this grammar will not match “aaaba”. The entire string “aaba” is consumed by the regular expression, leaving no “a” for the last terminal in S to match.
-
The GLL parser doesn’t support multiple regular expressions that start at the same place. This is an implementation deficiency. It means that some ambiguous grammars cannot be parsed using the GLL parser and regular expressions.
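The greedy-match caveat can be reproduced outside the parser. The Python sketch below models the rule S = 'a', B, 'a' with B consumed by a single greedy match that is never revisited, and contrasts it with an ordinary backtracking match. The function name `parse_s_with_greedy_b` is hypothetical, and this is a model of the behavior described above rather than of CoffeePot’s internals.

```python
import re

B_REGEX = re.compile(r"[ab]+")


def parse_s_with_greedy_b(text):
    """Model S = 'a', B, 'a' where B takes one greedy match with no backtracking.

    Returns the text consumed by B, or None if the parse fails.
    """
    if not text.startswith("a"):
        return None
    b = B_REGEX.match(text, 1)  # B starts right after the first 'a'
    if b is None:
        return None
    rest = text[b.end():]
    # The final 'a' must match whatever B left behind.
    return b.group(0) if rest == "a" else None


# Greedy B swallows "aaba", so nothing is left for the final 'a'.
greedy = parse_s_with_greedy_b("aaaba")

# A backtracking match gives one character back and succeeds with B = "aab".
backtracking = re.fullmatch(r"a([ab]+)a", "aaaba")
```

Here `greedy` is `None`, while `backtracking.group(1)` is `"aab"`, the same content that the pure iXML grammar assigns to B.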
9.3. Symbol pragmas
The following pragmas apply to symbols.
9.3.1. rename
This pragma changes the name used when the element is serialized. It applies only to nonterminals.
Usage:
{[nineml rename newname]}
An alternative approach is to use the nonterminal renaming proposal. This is an experimental feature. To use it, you must specify that the grammar version is “1.1-nineml”.
9.3.2. rewrite
This pragma is no longer supported. The same effect can be achieved with the standard insertions feature.
9.3.3. priority
This pragma associates a priority with a nonterminal.
Usage:
{[nineml priority 2]}
When an ambiguous parse is being serialized, there will be places in the output where a choice must be made between two or more alternatives. A priority can be used to control the selection. The nonterminal with the highest priority will be selected. If there are no priorities, or if several nonterminals have the same priority, no guarantees are made about which alternative will be selected. The default priority for all nonterminals is 0.
Consider the following grammar:
number: hex | decimal .
hex: hex-digit+ .
decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .
It parses numbers in either hexadecimal or decimal. For a number like “42”, the parse is ambiguous: it matches either hex or decimal:
$ coffeepot --pretty-print -g:hex.ixml 42
Found 2 possible parses.
<number xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<hex>42</hex>
</number>
You can give decimal a higher priority:
{[+pragma nineml "https://nineml.org/ns/pragma/"]}
number: hex | {[nineml priority 2]} decimal .
hex: hex-digit+ .
decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .
Now decimal will be selected:
$ coffeepot --pretty-print -g:hex.ixml 42
Found 2 possible parses.
<number>
<decimal>42</decimal>
</number>
The parse is no longer considered ambiguous because no arbitrary choices were made.
Use the --strict-ambiguity option to mark the output ambiguous.
If a grammar is infinitely ambiguous, the same part of the parse may be serialized more than once. When this happens, the selection is always between the remaining alternatives.
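The selection rule described in this section can be modeled with a few lines of Python. This is a sketch of the documented behavior only: the function name `choose` and the dictionary of priorities are assumptions, and when priorities tie the sketch happens to pick the first alternative, whereas the processor makes no guarantee about which one is selected.

```python
def choose(alternatives, priorities):
    """Pick the alternative with the highest priority (default priority is 0).

    Returns (choice, arbitrary), where arbitrary is True when more than one
    alternative shares the highest priority, so the choice is not determined.
    """
    def priority(name):
        return priorities.get(name, 0)

    # sorted() is stable, so tied alternatives keep their original order.
    ranked = sorted(alternatives, key=priority, reverse=True)
    top = priority(ranked[0])
    tied = [name for name in ranked if priority(name) == top]
    return ranked[0], len(tied) > 1


# No priorities: both alternatives default to 0, so the choice is arbitrary.
without_priority = choose(["hex", "decimal"], {})

# decimal has priority 2: it wins, and no arbitrary choice is made.
with_priority = choose(["hex", "decimal"], {"decimal": 2})
```

The second call mirrors the example above: once decimal outranks hex, the selection is fully determined, which is why the output is no longer marked ambiguous.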