So I've been reading a lot about Earley parsers. They're divided into predictor completer and scanner sections. There's some “interesting observations” in the “Directly Executable Earley Parsing” paper: mainly, that additions are only ever made to the current and next Earley sets. But we still have to store the Earley sets; on the other hand, I noticed that we only have to look at them indexed by production. So we can index that by the next non-terminal and then look back to retrieve them. But it seems clear that the sets are mostly for notational, pedagogical, and debugging purposes, and the algorithm only needs the lists of suspended nonterminals pointed to by in-progress non terminals.
The other thing I looked at was the Earley-LR optimization, basically turning the grammar into an LR(0) automaton with shift/reduce ambiguity resolved by Earley sets, and generally speaking it seems like it can speed up the parser. Each Earley item is normally a dotted production rule together with a parent set, but you can replace the production rule with an LR(0) DFA state, which is a set of LR(0) items.
According to Jeffrey Kegler: “In creating Marpa, I may have acquired as much experience with the AH algorithm as anybody, possible even Aycock/Horspool themselves. I found the special LR(0) items they create to be counter-productive. AH's own implementation apparently had bugs due to duplication of Earley items. I solved these, but it made the code very complex and slower. Most of the compression that the AH method bought was in predictions, which could be compressed in other, less complicated, ways. Using AH items means that whenever something needs to be done in terms of the user's grammar, there has to be a translation. Since the AH item to Earley item relationship is many-to-many, this is not trivial. The cost of the more complex evaluation phase seems likely to eliminate any speed-up in the recognizer.
In “Practical Earley Parsing”, as a side issue, they describe a way of handling nulled symbols and productions via grammar rewrites. This proved very valuable and Marpa *does* continue to use that. It's a bit ironic that I ended up ignoring a lot of AH's paper, but that their little implementation trick has proved central to my efforts.” (edited, see here)
So in summary LR doesn’t add much to Earley parsing, but computing nullability and handling it specially does. It seems like they're optimizing their parser a lot, but they're optimising the wrong things. The memory layout and cache use is the main thing to pay attention to on modern processors and they don't seem to have a very interesting layout. And the radix tree for Earley items seems completely pointless. You have to iterate over all the pending Earley items anyway so you might as well just use a linked list.
The practical paper also leads to this rabbit hole of direct code execution / threaded execution / direct threaded code. It’s useful in parsers, but appears in any interpreter, e.g. Python. Python uses computed gotos and the basic idea is that instead of jumping back to a dispatch loop you include the code which looks up the dispatch table and jumps directly. Python uses the first class labels feature available in GNU C. But reading the “Directly Executable Earley Parsing” paper it just seems like a bunch of pointer stuff and it doesn't do the state splitting that they do in their later papers.
Another question is whether we can implement parser combinators on top of Earley, memoizing all the states somehow. It seems like it might be possible with enough recursion and memoization hacking but probably not worth it.
And I still don't know how to handle extended grammars, and I haven't looked at the Leo paper at all. But from some code skims of people who’ve implemented it Leo seems like it’s just a matter of keeping track of certain items. Extended Earley grammars generalize the dot position in the rule, so it’s an NFA (DFA?) instead of just a linear sequence. This of course should be easily accomplished with the help of some papers I have on extended rules.
Thursday, January 9, 2020
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment