Chapter 2. Pragmas
It’s possible to influence the behavior of the processor by placing pragmas in your grammar.
Pragmas are a separate feature; they are not part of Invisible XML 1.0. As of 4 September 2022, the pragma syntax accepted by CoffeeFilter (and CoffeePot) has been updated to the grammar described in “Designing for change: Pragmas in Invisible XML as an extensibility mechanism”, presented at Balisage 2022.
If you run CoffeePot with the --pedantic option, you cannot use pragmas.
A pragma begins with “{[” and is followed by a pragma name, pragma data (which may be empty), and closes with “]}”. The pragma name is a shortcut for a URI which provides the “real” identity of the pragma. This mechanism leverages URI space to achieve distributed extensibility.
The mapping from names to URIs is done with the “pragma” pragma at the top of your grammar. This, for example, declares the name “nineml” as the pragma identified by the URI “https://nineml.org/ns/pragma/”:
{[+pragma nineml "https://nineml.org/ns/pragma/"]}
CoffeePot ignores any pragmas it does not recognize. The rest of this document assumes that you have declared the pragma name “nineml” as shown above. You must do this in every grammar file where you use pragmas.
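The name-to-URI mapping can be illustrated with a minimal Python sketch. It is not how CoffeePot reads grammars: the regular expression only covers the simple declaration form shown above, and the function name pragma_bindings is hypothetical.

```python
import re

# Simplified pattern for declarations of the form {[+pragma NAME "URI"]}.
# This is an assumption for illustration, not the full pragma grammar.
DECLARATION = re.compile(r'\{\[\+pragma\s+([^\s"\]]+)\s+"([^"]*)"\s*\]\}')


def pragma_bindings(grammar_text):
    """Return a dict mapping each declared pragma name to its URI."""
    return {name: uri for name, uri in DECLARATION.findall(grammar_text)}


grammar = '''{[+pragma nineml "https://nineml.org/ns/pragma/"]}
{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+nineml ns "http://example.com/ns"]}
S: 'a'.
'''

# Only the two "+pragma" declarations create bindings; the "ns" pragma
# on the third line uses a binding and does not create one.
for name, uri in pragma_bindings(grammar).items():
    print(name, "->", uri)
```

Because the URI, not the short name, identifies a pragma, two grammars can bind different names to the same URI and still refer to the same pragma.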
Pragmas can be associated with the entire grammar or with a rule, a nonterminal symbol, or a terminal symbol:
- A pragma placed before a symbol applies to the symbol that follows it:
rule: {[pragma applies to “A”]} A, {[pragma applies to “b”]} 'b'.
- A pragma placed before a rule applies to the rule that follows it:
{[pragma applies to “rule”]} rule: {[pragma applies to “A”]} A, {[pragma applies to “b”]} 'b'.
- To apply a pragma to the entire grammar, place it in the prolog:
{[+pragma applies to whole grammar]}
{[pragma applies to “rule”]} rule: {[pragma applies to “A”]} A, {[pragma applies to “b”]} 'b'.
More than one pragma can appear at any of those locations:
{[+pragma applies to whole grammar]}
{[+second pragma applies to whole grammar ]}
{[pragma applies to “rule”]}
{[second pragma applies to “rule”]}
rule:
{[pragma applies to “A”]}
{[second pragma applies to “A”]} A,
{[pragma applies to “b”]}
{[second pragma applies to “b”]} 'b'.
If a pragma is not recognized, or does not apply, it is ignored. CoffeePot will generate debug-level log messages to alert you to pragmas that it is ignoring.
2.1. Grammar pragmas
The following pragmas apply to a grammar as a whole.
2.1.1. csv-columns
Identifies the columns to be output when CSV output is selected.
Usage:
{[+nineml csv-columns list,of,names]}
Ordinarily, CSV formatted output includes all the columns in (roughly) the order they occur in the XML. This pragma allows you to list the columns you want output and the order in which you want them output.
If the grammar renames nonterminals, the new “renamed” name must be used in the list of column names.
If a column requested does not exist in the document, it is ignored. An empty column is not produced.
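The selection behaviour described above can be sketched in Python. This is only an illustration of the stated rules (listed order wins, unknown names are skipped), not CoffeePot's implementation; the function name select_columns and the sample data are hypothetical.

```python
def select_columns(header, rows, wanted):
    """Keep only the wanted columns, in the wanted order.

    Names in `wanted` that are not in `header` are skipped, so no
    empty column is produced for them.
    """
    keep = [name for name in wanted if name in header]
    positions = [header.index(name) for name in keep]
    return keep, [[row[i] for i in positions] for row in rows]


header = ["year", "month", "day"]
rows = [["2022", "09", "04"], ["2023", "01", "15"]]

# "weekday" does not exist in the data, so it is ignored.
new_header, new_rows = select_columns(header, rows, ["day", "weekday", "year"])
print(new_header)
print(new_rows)
```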
2.1.2. import
Allows one grammar to import another.
Usage:
{[+nineml import "grammar-uri"]}
In principle, this pragma allows you to combine grammars. This feature is experimental and no coherent semantics have yet been established.
2.1.3. ns
Declares the default namespace for the output XML.
Usage:
{[+nineml ns "namespace-uri"]}
2.1.4. record-end
The record-end pragma enables record-oriented processing by default. Its value is the regular expression that marks record ends. Unlike the other pragmas, this one has a different URI binding:
Usage:
{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-end "\n([^ ])"]}
2.1.5. record-start
The record-start pragma enables record-oriented processing by default. Its value is the regular expression that marks record starts. Unlike the other pragmas, this one has a different URI binding:
Usage:
{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-start "([^\\])\n"]}
2.1.6. strict
The strict pragma controls which grammar hygiene rules apply.
This pragma overrides configuration and command-line options that affect hygiene rules.
Note that this option uses a different URI binding.
Usage:
{[+pragma strict "https://gyfre.org/ns/pragma/strict"]}
{[+strict exceptions]}
If the strict pragma is applied, all grammar hygiene rules are strictly applied. Exceptions can be used to relax constraints that would otherwise be enforced:
allow-empty-alt
- The Invisible XML grammar allows an empty string to represent an empty alternative. The rule
S: A;.
says that an S matches either an A or “nothing”. The use of an empty string for this purpose can be difficult to read. If empty alternatives are forbidden, then you must use () to represent empty:
S: A;().
Unlike the other hygiene constraints, this one is purely stylistic.
allow-multiple-definitions
- With this exception, multiple rules defining the same nonterminal are allowed. They are treated as if they were alternatives. In other words,
S = A. S = B.
is the same as S = A|B.
allow-undefined
- With this exception, undefined symbols are allowed in the grammar. An undefined symbol that can actually be encountered during a parse is still forbidden.
allow-unproductive
- With this exception, unproductive symbols are allowed. An unproductive symbol is one that can never match an input.
allow-unreachable
- With this exception, unreachable symbols are allowed. An unreachable symbol is one that is defined but cannot be reached from the start rule.
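Two of the hygiene checks above, undefined and unreachable symbols, can be sketched in Python. The grammar representation is an assumption made for illustration (a dict from nonterminal name to a list of alternatives, with terminals written as quoted strings); it is not CoffeePot's internal model, and the check for unproductive symbols is omitted.

```python
def is_terminal(symbol):
    """In this toy representation, terminals are quoted strings."""
    return symbol[0] in "'\""


def undefined_symbols(grammar):
    """Nonterminals that are used on a right-hand side but never defined."""
    used = {
        symbol
        for alternatives in grammar.values()
        for alternative in alternatives
        for symbol in alternative
        if not is_terminal(symbol)
    }
    return used - set(grammar)


def unreachable_symbols(grammar, start):
    """Nonterminals that are defined but cannot be reached from `start`."""
    seen = set()
    todo = [start]
    while todo:
        name = todo.pop()
        if name in seen or name not in grammar:
            continue
        seen.add(name)
        for alternative in grammar[name]:
            todo.extend(s for s in alternative if not is_terminal(s))
    return set(grammar) - seen


# S: A, 'x'; C.   A: 'a'.   B: 'b'.
toy = {
    "S": [["A", "'x'"], ["C"]],
    "A": [["'a'"]],
    "B": [["'b'"]],
}
print(undefined_symbols(toy))         # C is used but never defined
print(unreachable_symbols(toy, "S"))  # B is defined but never used
```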
2.2. Rule pragmas
The following pragmas apply to rules.
2.2.1. csv-heading
Specify the heading title to use in CSV output if the nonterminal defined by this rule is used as the value of a column. (If no heading is specified, the name of the nonterminal is used as the heading.)
Usage:
{[nineml csv-heading "Heading Title"]}
Heading titles may be quoted with either single (') or double (") quotes.
2.2.2. discard-empty
If the nonterminal defined by this rule is empty, it will be discarded (not serialized at all).
Usage:
{[nineml discard empty]}
2.2.3. regex
This pragma replaces the “right hand side” of a nonterminal with a regular expression. This is a “greedy match” and can greatly improve performance in some cases.
Usage:
{[nineml regex "regular expression"]}
For example:
{[+pragma n "https://nineml.org/ns/pragma/"]}
number-list = (number, -#a)+, number? .
number = hex | decimal .
hex = hex-digit+ {[n regex "[0-9a-fA-F]+"]} .
decimal = decimal-digit+ {[n regex "[0-9]+"]} .
-hex-digit = ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit = ["0"-"9" ] .
This grammar will consume hex and decimal nonterminals with a regular expression.
The regex pragma is a sharp tool and comes with a number of caveats.
- The regular expression is used in place of the right hand side specified in the grammar. No effort is made to determine if the original iXML rule (that will be used by a processor that doesn’t support the regex pragma) matches the same input as the regular expression.
- The regular expression match is greedy. It will consume all of the characters that match. If this consumes “too much” input, the parse will fail. Consider this grammar:
S = 'a', B, 'a' .
B = ('a'|'b')+ .
It will match “aaaba” to produce “<S>a<B>aab</B>a</S>”. At first glance, this grammar may seem equivalent:
{[+pragma n "https://nineml.org/ns/pragma/"]}
S = 'a', B, 'a' .
{[n regex "[ab]+"]} B = ('a'|'b')+ .
But if regular expressions are used, this grammar will not match “aaaba”. The entire string “aaba” is consumed by the regular expression, leaving no “a” for the last terminal in S to match.
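The greedy-match caveat can be reproduced with Python's re module. This is a sketch of the behaviour described above, not of CoffeePot's parser: the function match_s_with_greedy_b is hypothetical and hard-codes the rule S = 'a', B, 'a', treating B as a regular expression that is matched once, greedily, and never revisited.

```python
import re

B_REGEX = re.compile(r"[ab]+")


def match_s_with_greedy_b(text):
    """Match S = 'a', B, 'a' where B is consumed greedily by [ab]+.

    Once B has matched, its extent is fixed: there is no backtracking
    to give characters back to the rest of the rule.
    """
    if not text.startswith("a"):
        return False
    b = B_REGEX.match(text, 1)  # B starts right after the first 'a'
    if b is None:
        return False
    # Whatever B left behind must be exactly the final 'a'.
    return text[b.end():] == "a"


# B swallows "aaba", so nothing is left for the final 'a'.
print(match_s_with_greedy_b("aaaba"))

# For comparison, a single backtracking regex over the whole of S does
# match, with B covering "aab" as in the parse without the pragma.
whole = re.fullmatch(r"a([ab]+)a", "aaaba")
print(whole.group(1))
```

In this particular grammar the final 'a' is always a character that the regular expression for B will also accept, so the greedy version can never succeed. The pragma is safest when the text that follows the nonterminal cannot be matched by the regular expression.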
2.3. Symbol pragmas
The following pragmas apply to symbols.
2.3.1. rename
This pragma changes the name used when the element is serialized. It applies only to nonterminals.
Usage:
{[nineml rename newname]}
An alternative approach is to use the nonterminal renaming proposal.
This is an experimental feature while the renaming proposal is under development. To use it, you must specify that the grammar version is “1.1-nineml”.
2.3.2. rewrite
This pragma is no longer supported. The same effect can be achieved with the standard insertions feature.
2.3.3. priority
This pragma associates a priority with a nonterminal.
Usage:
{[nineml priority 2]}
When an ambiguous parse is being serialized, there will be places in the output where a choice must be made between two or more alternatives. A priority can be used to control the selection. The nonterminal with the highest priority will be selected. If there are no priorities, or if several nonterminals have the same priority, no guarantees are made about which alternative will be selected. The default priority for all nonterminals is 0.
Consider the following grammar:
number: hex | decimal .
hex: hex-digit+ .
decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .
It parses numbers in either hexadecimal or decimal. In the case of a number like “42”, the parse is ambiguous: it matches either hex or decimal:
$ coffeepot --pretty-print -g:hex.ixml 42
Found 2 possible parses.
<number xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<hex>42</hex>
</number>
You can give decimal a higher priority:
{[+pragma nineml "https://nineml.org/ns/pragma/"]}
number: hex | {[nineml priority 2]} decimal .
hex: hex-digit+ .
decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .
Now decimal will be selected:
$ coffeepot --pretty-print -g:hex.ixml 42
Found 2 possible parses.
<number>
<decimal>42</decimal>
</number>
The parse is no longer considered ambiguous because no arbitrary choices were made.
Use the --strict-ambiguity option to mark the output ambiguous.
If a grammar is infinitely ambiguous, the same part of the parse may be serialized more than once. When this happens, the selection is always between the remaining alternatives.
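The selection rule described in this section can be sketched in Python. The function candidates is hypothetical and only models the stated rule (highest priority wins, default priority 0); it says nothing about how CoffeePot walks the parse forest.

```python
def candidates(alternatives, priorities):
    """Return the alternatives that share the highest priority.

    `priorities` maps nonterminal names to integers; names that are
    not listed get the default priority of 0. If more than one name is
    returned, the choice between them is not guaranteed.
    """
    best = max(priorities.get(name, 0) for name in alternatives)
    return [name for name in alternatives if priorities.get(name, 0) == best]


# Without priorities, both alternatives remain and the choice is arbitrary.
print(candidates(["hex", "decimal"], {}))

# With {[nineml priority 2]} on decimal, only decimal remains.
print(candidates(["hex", "decimal"], {"decimal": 2}))
```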