Introduction

This package provides an implementation of two different parsers: an Earley parser and a Generalized LL (GLL) parser. Both parsing algorithms are able to use grammars that are ambiguous. These APIs parse a sequence of tokens and return a parse forest. Individual parse trees can be obtained from the forest.

Why two parsers?

After implementing the Earley parser, I decided to implement the GLL parser in the hopes that it would be substantially faster than the Earley parser. After investing some effort optimizing the GLL implementation it appears to be a faster than the Earley parser, but not dramatically so. (Both parsers are “fast enough” for a lot of projects, but if your goal is parsing megabytes of input, you may be disappointed.)

Both parser implementations expose the same interface. The Earley parser has one feature that the GLL parser does not: it is possible to extract a “prefix parse” and restart parsing after the prefix. Suppose for example, you have a grammar that recognizes a+. If you give it the input “aaabbb, you can extract the graph for the recognized prefix “aaa” even though parse as a whole did not succeed. You can subsequently restart parsing at “bbb” with the same or a different parser.

Conversely, the GLL parser represents derivations not with a shared packed parse forest (SPPF) directly, but with binary subtree sets (BSRs). It may be possible to improve the performance of the GLL parser further by directly extracting trees from the BSRs rather than converting it into an SPPF first. But my initial explorations of this idea were unsuccessful. (There’s part of the paper that describes BSRs that I don’t understand.)