Chapter 1General use

CoffeeGrinder is the lowest-level parser API. You can use it to build a grammar, construct a parser with that grammar, and then parse input with that parser. But you have to build all the pieces with Java APIs.

The examples in this section are taken from “running code” that you can find in src/test/java/org/nineml/examples/CoffeeGrinderExamples in the repository. The code lives in the unit testing framework, but doesn’t depend on that framework.

This is just an overview, consult the JavaDoc for more details.

1.1Constructing a grammar

In this section, we’ll construct a small grammar for parsing numbers: integers (1, 2, -7), floating point numbers (3.14, -2.7), and numbers in “scientific notation” (1.0E6, -2.3E-7). The first thing you need to do to build a parser is construct a source grammar:

ParserOptions options = new ParserOptions();
SourceGrammar grammar = new SourceGrammar(options);

There are a number of options that you can specify to control various aspects of the parse, but the defaults are reasonable. A source grammar is where you build the BNF rules for your grammar.

With a grammar in hand, you add rules to it. Rules define nonterminals in terms of other nonterminals and terminal symbols. The example grammar for parsing numbers begins by defining four nonterminals: one for numbers in general and then one for each of the three kinds of numbers.

NonterminalSymbol number = grammar.getNonterminal("number");
NonterminalSymbol integer = grammar.getNonterminal("integer");
NonterminalSymbol floatpt = grammar.getNonterminal("float");
NonterminalSymbol scientific = grammar.getNonterminal("scientific");

(All nonterminals created with the same grammar and the same name are “the same”, even if they’re created in different places.)

Once created, you define nonterminals with rules:

grammar.addRule(number, integer);
grammar.addRule(number, floatpt);
grammar.addRule(number, scientific);

These rules say that a number is either an integer, a floatpt, or a scientific. The nonterminal sign is defined in terms of terminal symbols + and -. Note that the last rule defines sign with “nothing on the right hand side”. That allows sign to match the empty sequence (often called ε in the literature).

TerminalSymbol plus = TerminalSymbol.ch('+');
TerminalSymbol minus = TerminalSymbol.ch('-');
NonterminalSymbol sign = grammar.getNonterminal("sign");
grammar.addRule(sign, plus);
grammar.addRule(sign, minus);
grammar.addRule(sign);

Additional rules define digits and the rest of the decimal forms.

1.2Constructing a parser

Once the source grammar is complete, it must be turned into a parser grammar in order to construct a parser:

NonterminalSymbol startSymbol = grammar.getNonterminal("number");
ParserGrammar pgrammar = grammar.getCompiledGrammar(startSymbol);

The parser grammar must be constructed for a particular start symbol, the nonterminal that the parser will be attempting to match. Unlike source grammars, parser grammars are immutable.

1.3Parsing the input

From a parser grammar, we can get an actual parser. And from a parser, we can get a result:

GearleyParser parser = pgrammar.getParser();
GearleyResult result = parser.parse(number);

Methods on the result can be used to check if the parse succeeded. If it did, we can construct a result tree.

if (result.succeeded()) {
    StringTreeBuilder builder = new StringTreeBuilder();
    ParseForest forest = result.getForest();
    Arborist.getArborist(forest).getTree(builder);
    String tree = builder.getTree();
    return String.format("%s is a number: %s", number, tree);
} else {
    return String.format("%s is not a number", number);
}

1.4Processing the results

The result of a successful parse is a parse forest.

ParseForest forest = result.getForest();

The parse tree (or trees, in the case of an ambiguous parse) are obtained from the forest with an arborist.

Arborist.getArborist(forest).getTree(builder);

There are two kinds of arborists, lumberjacks who work sequentially through the forest returning all of the parses and tree surgeons who return a single tree. The kind of arborist you get is determined by the axe you use. The default axe is the SequentialAxe that returns all of the parses. The PriorityAxe returns the highest priority parses first. The RandomAxe returns a random parse.

Trees are constructed by an arborist by calling methods on a tree builder. CoffeeGrinder comes with three standard tree builders: GenericTreeBuilder that constructs a generic tree of branches and leaves; PrintStreamTreeBuilder that prints the results (StdoutTreeBuilder sends them to standard output); and StringTreeBuilder that returns them as a string. The print stream and string tree builders represent the trees using XML.

1.5Ambiguous results

In the case of ambiguous results, you can ask how many trees there are:

int parseCount = forest.getParseTreeCount();
if (parseCount > 1) {
    if (forest.isInfinitelyAmbiguous()) {
        return String.format("%s is a date (in infinite ways): %s", date, tree);
    }
    return String.format("%s is a date (%d ways): %s", date, parseCount, tree);
}

Beware that in an infinitely ambigous forest, the number returned is wildely inaccurate (in as much as it is infinitely less than ∞!). It tells you something about how the forest divides into ambiguous branches, but doesn’t attempt to account for loops.