Chapter 2. Pragmas

It’s possible to influence the behavior of the processor by placing pragmas in your grammar.

☝

Experimental

Pragmas are a separate feature; they are not part of Invisible XML 1.0. As of 4 September 2022, the pragma syntax accepted by CoffeeFilter (and CoffeePot) has been updated to the grammar described in “Designing for change: Pragmas in Invisible XML as an extensibility mechanism”, presented at Balisage 2022.

If you run CoffeePot with the --pedantic option, you cannot use pragmas.

A pragma begins with “{[”, followed by a pragma name and pragma data (which may be empty), and closes with “]}”. The pragma name is a shortcut for a URI which provides the “real” identity of the pragma. This mechanism leverages URI space to achieve distributed extensibility.

The mapping from names to URIs is done with the “pragma” pragma at the top of your grammar. This, for example, declares the name “nineml” as the pragma identified by the URI “https://nineml.org/ns/pragma/”:

{[+pragma nineml "https://nineml.org/ns/pragma/"]}

CoffeePot ignores any pragmas it does not recognize. The rest of this document assumes that you have declared the pragma name “nineml” as shown above. You must do this in every grammar file where you use pragmas.
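The name-to-URI mapping can be pictured as a simple lookup table built from the declarations at the top of the grammar. The following Python sketch is illustrative only and is not how CoffeePot parses pragmas: the pattern `DECLARATION`, the function `pragma_bindings`, and the `http://example.com/ns` namespace are assumptions made for the example, and the pattern handles only the double-quoted form shown above.

```python
import re

# Hypothetical, simplified pattern: it only recognizes declarations of the
# form {[+pragma name "uri"]} with a double-quoted URI.
DECLARATION = re.compile(r'\{\[\+pragma\s+([^\s"\]]+)\s+"([^"]*)"\s*\]\}')


def pragma_bindings(grammar_text):
    """Return a dict mapping each declared pragma name to its URI."""
    return dict(DECLARATION.findall(grammar_text))


# The third line uses a pragma rather than declaring one, so it adds
# nothing to the table. The example.com namespace is made up.
grammar_text = """\
{[+pragma nineml "https://nineml.org/ns/pragma/"]}
{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+nineml ns "http://example.com/ns"]}
S: 'a' .
"""
bindings = pragma_bindings(grammar_text)
print(bindings)
```

In this model, a pragma whose name is missing from the table has no identity and would simply be ignored.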

Pragmas can be associated with the entire grammar or with a rule, a nonterminal symbol, or a terminal symbol.

A pragma placed before a symbol applies to the symbol that follows it:

rule: {[pragma applies to “A”]} A,
      {[pragma applies to “b”]} 'b'.

A pragma placed before a rule applies to the rule that follows it:

{[pragma applies to “rule”]}
rule: {[pragma applies to “A”]} A,
      {[pragma applies to “b”]} 'b'.

To apply to the entire grammar, a pragma must be in the prolog:

{[+pragma applies to whole grammar]}
 
{[pragma applies to “rule”]}
rule: {[pragma applies to “A”]} A,
      {[pragma applies to “b”]} 'b'.

More than one pragma can appear at any of those locations:

{[+pragma applies to whole grammar]}
{[+second pragma applies to whole grammar ]}
 
{[pragma applies to “rule”]}
{[second pragma applies to “rule”]}
rule:
   {[pragma applies to “A”]}
   {[second pragma applies to “A”]} A,
   {[pragma applies to “b”]}
   {[second pragma applies to “b”]} 'b'.

If a pragma is not recognized, or does not apply, it is ignored. CoffeePot will generate debug-level log messages to alert you to pragmas that it is ignoring.

2.1. Grammar pragmas

The following pragmas apply to a grammar as a whole.

2.1.1. csv-columns

Identifies the columns to be output when CSV output is selected.

Usage:

{[+nineml csv-columns list,of,names]}

Ordinarily, CSV-formatted output includes all the columns in (roughly) the order they occur in the XML. This pragma allows you to list the columns you want output and the order in which you want them output.

If the grammar renames nonterminals, the new “renamed” name must be used in the list of column names.

If a column requested does not exist in the document, it is ignored. An empty column is not produced.
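The selection rule can be modeled in a few lines. The following Python sketch is illustrative only and is not CoffeePot's implementation; the function name `select_columns` and the one-dictionary-per-record representation are assumptions made for the example.

```python
def select_columns(records, requested):
    """Keep only the requested columns, in the requested order.

    `records` is a list of dicts (one per CSV row); this representation is
    an assumption for illustration, not CoffeePot's internal model.
    """
    # Columns that actually occur somewhere in the data.
    present = {name for record in records for name in record}
    # Requested columns that do not exist are dropped, not emitted empty.
    columns = [name for name in requested if name in present]
    rows = [[record.get(name, "") for name in columns] for record in records]
    return columns, rows


records = [
    {"year": "2022", "month": "09", "day": "04"},
    {"year": "2023", "month": "01", "day": "15"},
]
# "century" is requested but never occurs, so it is ignored.
columns, rows = select_columns(records, ["day", "year", "century"])
print(columns)
print(rows)
```

Here the request corresponds to a pragma such as `{[+nineml csv-columns day,year,century]}`: the output has two columns, in the order requested, and no empty "century" column.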

2.1.2. import

Allows one grammar to import another.

Usage:

{[+nineml import "grammar-uri"]}

In principle, this pragma allows you to combine grammars. This feature is experimental and no coherent semantics have yet been established.

2.1.3. ns

Declares the default namespace for the output XML.

Usage:

{[+nineml ns "namespace-uri"]}

2.1.4. record-end

The record-end pragma enables record-oriented processing by default. Its value is the regular expression that marks record ends. Unlike the other pragmas, this one has a different URI binding:

Usage:

{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-end "\n([^ ])"]}

2.1.5. record-start

The record-start pragma enables record-oriented processing by default. Its value is the regular expression that marks record starts. Unlike the other pragmas, this one has a different URI binding:

Usage:

{[+pragma opt "https://nineml.org/ns/pragma/options/"]}
{[+opt record-start "([^\\])\n"]}

2.1.6. strict

The strict pragma controls which grammar hygiene rules apply. This pragma overrides configuration and command-line options that affect hygiene rules. Note that this option uses a different URI binding.

Usage:

{[+pragma strict "https://gyfre.org/ns/pragma/strict"]}
{[+strict exceptions]}

If the strict pragma is applied, all grammar hygiene rules are strictly applied. Exceptions can be used to relax constraints that would otherwise be enforced:

allow-empty-alt

The Invisible XML grammar allows an empty string to represent an empty alternative: the rule “S: A;.” says that an S matches either an A or “nothing”. Using an empty string for this purpose can be difficult to read. If empty alternatives are forbidden, you must write () to represent empty, as in “S: A;().”

Unlike the other hygiene constraints, this one is purely stylistic.

allow-multiple-definitions

With this exception, multiple rules defining the same nonterminal are allowed. They are treated as if they were alternatives. In other words, “S = A. S = B.” is the same as “S = A|B.”

allow-undefined

With this exception, undefined symbols are allowed. An undefined symbol that can be encountered during a parse is forbidden.

allow-unproductive

With this exception, unproductive symbols are allowed. An unproductive symbol is one that can never match an input.

allow-unreachable

With this exception, unreachable symbols are allowed. An unreachable symbol is one that is defined but cannot be reached from the start rule.
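The three structural conditions (undefined, unproductive, and unreachable) can be computed with standard graph-style passes over the rules. The Python sketch below shows one way to do that on a toy grammar. It is not CoffeePot's implementation: the function `hygiene_report` and the dictionary encoding of the grammar are hypothetical, chosen only to make the definitions concrete.

```python
def hygiene_report(grammar, start):
    """Classify symbols in a toy grammar.

    `grammar` maps each nonterminal to a list of alternatives; each
    alternative is a list of symbols. Symbols that begin with a single
    quote are terminals. This encoding is hypothetical.
    """
    def is_terminal(symbol):
        return symbol.startswith("'")

    # Undefined: referenced on some right-hand side but never defined.
    referenced = {
        symbol
        for alternatives in grammar.values()
        for alternative in alternatives
        for symbol in alternative
        if not is_terminal(symbol)
    }
    undefined = referenced - set(grammar)

    # Reachable: walk the rules outward from the start symbol.
    reachable, pending = set(), [start]
    while pending:
        name = pending.pop()
        if name in reachable or name not in grammar:
            continue
        reachable.add(name)
        for alternative in grammar[name]:
            pending.extend(s for s in alternative if not is_terminal(s))
    unreachable = set(grammar) - reachable

    # Productive: repeat until no new nonterminal can be shown to match.
    productive = set()
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            if name in productive:
                continue
            if any(all(is_terminal(s) or s in productive for s in alt)
                   for alt in alternatives):
                productive.add(name)
                changed = True
    unproductive = set(grammar) - productive

    return {
        "undefined": undefined,
        "unreachable": unreachable,
        "unproductive": unproductive,
    }


toy = {
    "S": [["A"], ["L"]],
    "A": [["'a'", "A"], ["'a'"]],
    "B": [["'b'", "C"]],   # never referenced from S; C is undefined
    "L": [["'x'", "L"]],   # recursion with no way out: matches nothing
}
report = hygiene_report(toy, "S")
print(report)
```

In the toy grammar, C is undefined, B is unreachable, and both B and L are unproductive: L because its only alternative requires another L, and B because it depends on the undefined C.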

2.2. Rule pragmas

The following pragmas apply to rules.

2.2.1. csv-heading

Specify the heading title to use in CSV output if the nonterminal defined by this rule is used as the value of a column. (If no heading is specified, the name of the nonterminal is used as the heading.)

Usage:

{[nineml csv-heading "Heading Title"]}

Heading titles may be quoted with either single (') or double (") quotes.

2.2.2. discard-empty

If the nonterminal defined by this rule is empty, it will be discarded (not serialized at all).

Usage:

{[nineml discard empty]}

2.2.3. regex

This pragma replaces the “right hand side” of a nonterminal with a regular expression. This is a “greedy match” and can greatly improve performance in some cases.

Usage:

{[nineml regex "regular expression"]}

For example:

{[+pragma n "https://nineml.org/ns/pragma/"]}
 
   number-list = (number, -#a)+, number? .
        number = hex | decimal .
           hex = hex-digit+        {[n regex "[0-9a-fA-F]+"]} .
       decimal = decimal-digit+    {[n regex "[0-9]+"]} .
    -hex-digit = ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit = ["0"-"9" ] .

This grammar will consume hex and decimal nonterminals with a regular expression.

The regex pragma is a sharp tool and comes with a number of caveats.

The regular expression is used in place of the right hand side specified in the grammar. No effort is made to determine if the original iXML rule (that will be used by a processor that doesn’t support the regex pragma) matches the same input as the regular expression.

The regular expression match is greedy. It will consume all of the characters that match. If this consumes “too much” input, the parse will fail. Consider this grammar:
```
S = 'a', B, 'a' .
B = ('a'|'b')+ . 
```
It will match “aaaba” to produce “<S>a<B>aab</B>a</S>”. At first glance, this grammar may seem equivalent:
```
{[+pragma n "https://nineml.org/ns/pragma/"]}
 
                    S = 'a', B, 'a' .
{[n regex "[ab]+"]} B = ('a'|'b')+ . 
```
But if regular expressions are used, this grammar will not match “aaaba”. The entire string “aaba” is consumed by the regular expression, leaving no “a” for the last terminal in S to match.
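The failure can be reproduced outside the parser. The following Python sketch uses the standard `re` module to imitate the steps described above; it is a model of the behavior, not CoffeePot's code, and the variable names are chosen only for the example.

```python
import re

text = "aaaba"

# S = 'a', B, 'a'  with B matched greedily by the regex [ab]+ .
# Step 1: the leading 'a' of S consumes one character.
position = 1 if text.startswith("a") else 0

# Step 2: the regex for B runs once, greedily. The surrounding grammar
# cannot ask it to give characters back.
b_match = re.compile(r"[ab]+").match(text, position)
consumed = b_match.group(0)
remaining = text[b_match.end():]

print(consumed)   # the text taken by B
print(remaining)  # what is left for the final 'a' in S

# For contrast: when the whole rule is a single pattern, the regex engine
# can backtrack, which mirrors what the plain grammar achieves.
whole = re.fullmatch(r"a([ab]+)a", text)
print(whole.group(1))
```

The isolated match for B takes “aaba” and leaves nothing, while the single combined pattern backtracks and assigns “aab” to B, just as the grammar without the pragma does.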

2.3. Symbol pragmas

The following pragmas apply to symbols.

2.3.1. rename

This pragma changes the name used when the element is serialized. It applies only to nonterminals.

Usage:

{[nineml rename newname]}

An alternative approach is to use the nonterminal renaming proposal. This is an experimental feature while the renaming proposal is under development. To use it, you must specify that the grammar version is “1.1-nineml”.

2.3.2. rewrite

This pragma is no longer supported. The same effect can be achieved with the standard insertions feature.

2.3.3. priority

This pragma associates a priority with a nonterminal.

Usage:

{[nineml priority 2]}

When an ambiguous parse is being serialized, there will be places in the output where a choice must be made between two or more alternatives. A priority can be used to control the selection. The nonterminal with the highest priority will be selected. If there are no priorities, or if several nonterminals have the same priority, no guarantees are made about which alternative will be selected. The default priority for all nonterminals is 0.

Consider the following grammar:

 number: hex | decimal .
 hex: hex-digit+ .
 decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .

It parses numbers in either hexadecimal or decimal. In the case of a number like “42”, the parse is ambiguous: it matches either hex or decimal.

$ coffeepot --pretty-print -g:hex.ixml 42
Found 2 possible parses.
<number xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <hex>42</hex>
</number>

You can give decimal a higher priority:

{[+pragma nineml "https://nineml.org/ns/pragma/"]}
 
 number: hex | {[nineml priority 2]} decimal .
 hex: hex-digit+ .
 decimal: decimal-digit+ .
-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
-decimal-digit: ["0"-"9" ] .

Now decimal will be selected:

$ coffeepot --pretty-print -g:hex.ixml 42
Found 2 possible parses.
<number>
   <decimal>42</decimal>
</number>

The parse is no longer considered ambiguous because no arbitrary choices were made. Use the --strict-ambiguity option to mark the output ambiguous.
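The selection rule, including the point at which a choice counts as arbitrary, can be sketched in Python as follows. This is a simplified model rather than CoffeePot's algorithm; the function `choose` and its return shape are assumptions made for the example.

```python
def choose(alternatives, priorities):
    """Pick an alternative by priority.

    `alternatives` is a list of nonterminal names competing at one point in
    an ambiguous parse; `priorities` maps names to integers. Names without
    an entry get the default priority 0. Returns (choice, arbitrary), where
    `arbitrary` is True if several alternatives tied for the top priority.
    """
    ranked = [(priorities.get(name, 0), name) for name in alternatives]
    best = max(priority for priority, _ in ranked)
    winners = [name for priority, name in ranked if priority == best]
    # On a tie the real processor makes no guarantee; this sketch simply
    # takes the first winner.
    return winners[0], len(winners) > 1


without_priority = choose(["hex", "decimal"], {})
with_priority = choose(["hex", "decimal"], {"decimal": 2})
print(without_priority)
print(with_priority)
```

With no priorities, both alternatives tie at 0 and the choice is arbitrary, which corresponds to output marked ambiguous. With decimal at priority 2 there is a single winner, so no arbitrary choice is made.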

ⓘ

Note

If a grammar is infinitely ambiguous, the same part of the parse may be serialized more than once. When this happens, the selection is always between the remaining alternatives.