Chapter 1. General use
CoffeeGrinder is the lowest-level parser API. You can use it to build a grammar, construct a parser with that grammar, and then parse input with that parser. But you have to build all the pieces with Java APIs.
The examples in this section are taken from “running code” that you can find in
src/test/java/org/nineml/examples/CoffeeGrinderExamples
in
the repository. The code lives in the unit testing framework, but doesn’t
depend on that framework.
This is just an overview, consult the JavaDoc for more details.
1.1. Constructing a grammar
In this section, we’ll construct a small grammar for parsing numbers: integers (1, 2, -7), floating point numbers (3.14, -2.7), and numbers in “scientific notation” (1.0E6, -2.3E-7). The first thing you need to do to build a parser is construct a source grammar:
ParserOptions options = new ParserOptions();
SourceGrammar grammar = new SourceGrammar(options);
There are a number of options that you can specify to control various aspects of the parse, but the defaults are reasonable. A source grammar is where you build the BNF rules for your grammar.
With a grammar in hand, you add rules to it. Rules define nonterminals in terms of other nonterminals and terminal symbols. The example grammar for parsing numbers begins by defining four nonterminals: one for numbers in general and then one for each of the three kinds of numbers.
NonterminalSymbol number = grammar.getNonterminal("number");
NonterminalSymbol integer = grammar.getNonterminal("integer");
NonterminalSymbol floatpt = grammar.getNonterminal("float");
NonterminalSymbol scientific = grammar.getNonterminal("scientific");
(All nonterminals created with the same grammar and the same name are “the same”, even if they’re created in different places.)
Once created, you define nonterminals with rules:
grammar.addRule(number, integer);
grammar.addRule(number, floatpt);
grammar.addRule(number, scientific);
These rules say that a number is either an integer
, a floatpt
,
or a scientific
. The nonterminal sign
is defined in terms of
terminal symbols +
and -
.
Note that the last rule defines sign
with “nothing on the right hand side”.
That allows sign
to match the empty sequence (often called ε in the literature).
TerminalSymbol plus = TerminalSymbol.ch('+');
TerminalSymbol minus = TerminalSymbol.ch('-');
NonterminalSymbol sign = grammar.getNonterminal("sign");
grammar.addRule(sign, plus);
grammar.addRule(sign, minus);
grammar.addRule(sign);
Additional rules define digits and the rest of the decimal forms.
1.2. Constructing a parser
Once the source grammar is complete, it must be turned into a parser grammar in order to construct a parser:
NonterminalSymbol startSymbol = grammar.getNonterminal("number");
ParserGrammar pgrammar = grammar.getCompiledGrammar(startSymbol);
The parser grammar must be constructed for a particular start symbol, the nonterminal that the parser will be attempting to match. Unlike source grammars, parser grammars are immutable.
1.3. Parsing the input
From a parser grammar, we can get an actual parser. And from a parser, we can get a result:
GearleyParser parser = pgrammar.getParser();
GearleyResult result = parser.parse(number);
Methods on the result can be used to check if the parse succeeded. If it did, we can construct a result tree.
if (result.succeeded()) {
StringTreeBuilder builder = new StringTreeBuilder();
ParseForest forest = result.getForest();
Arborist.getArborist(forest).getTree(builder);
String tree = builder.getTree();
return String.format("%s is a number: %s", number, tree);
} else {
return String.format("%s is not a number", number);
}
1.4. Processing the results
The result of a successful parse is a parse forest.
ParseForest forest = result.getForest();
The parse tree (or trees, in the case of an ambiguous parse) are obtained from the forest with an arborist.
Arborist.getArborist(forest).getTree(builder);
There are two kinds of
arborists,
lumberjacks
who
work sequentially through the forest returning all of the parses and
tree surgeons
who return a single tree. The kind of arborist you get
is determined by the axe you use. The
default axe is the
SequentialAxe
that returns
all of the parses.
The PriorityAxe
returns the highest priority parses first.
The RandomAxe
returns a random parse.
Trees are constructed by an arborist by calling methods on a
tree builder.
CoffeeGrinder comes with three
standard tree builders: GenericTreeBuilder
that constructs a generic
tree of branches and leaves; PrintStreamTreeBuilder
that prints
the results (StdoutTreeBuilder
sends them to standard output);
and StringTreeBuilder
that returns them as a string.
The print stream and string tree builders represent the trees using XML.
1.5. Ambiguous results
In the case of ambiguous results, you can ask how many trees there are:
int parseCount = forest.getParseTreeCount();
if (parseCount > 1) {
if (forest.isInfinitelyAmbiguous()) {
return String.format("%s is a date (in infinite ways): %s", date, tree);
}
return String.format("%s is a date (%d ways): %s", date, parseCount, tree);
}
Beware that in an infinitely ambigous forest, the number returned is wildely inaccurate (in as much as it is infinitely less than ∞!). It tells you something about how the forest divides into ambiguous branches, but doesn’t attempt to account for loops.