Sunday, December 22, 2019

Stroscot Devlog 2

Okay so programming language project post 2. Sorry that these are a little slow, I get backlogged...

I was wandering down a misty road and I saw a parser at the end of the tunnel. But did I see this parser’s code? No, I did not. 😞 I saw many parsers, all different! There was a tangled mess of parsers!

Well, I guess what I'm trying to say is that I've been reading lots of parsing papers, and some of them work with each other and some of them don't, and it's not clear which is which. I mean, it looks like they can all be assembled together into a single massive parser, but I don't know. So far what's winning is the Earley parser. It basically has three stages: the scanner, the predictor, and the completer. They correspond to state transitions in LR parsers: a reduce is the same as a complete and a shift is the same as a scan. There's no prediction step in an LR parser because it precomputes all the predictions when it processes the grammar into states.
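To make that concrete for myself, here's roughly what the three operations look like in a toy recognizer. This is my own sketch, not from any of the papers; the grammar encoding (a dict of nonterminal to tuples) is made up, and it punts on nullable rules, which is exactly the hard part the papers deal with.

```python
# Minimal Earley recognizer sketch: items are (lhs, rhs, dot, origin).
# Hypothetical representation: GRAMMAR maps a nonterminal to a list of
# right-hand sides (tuples of symbols). No nullable rules handled here.

GRAMMAR = {
    "S": [("S", "+", "M"), ("M",)],
    "M": [("M", "*", "T"), ("T",)],
    "T": [("1",), ("2",)],
}

def is_nonterminal(sym):
    return sym in GRAMMAR

def recognize(tokens, start="S"):
    n = len(tokens)
    sets = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        sets[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        worklist = list(sets[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if is_nonterminal(sym):
                    # PREDICT: expand the nonterminal after the dot.
                    for prod in GRAMMAR[sym]:
                        new = (sym, prod, 0, i)
                        if new not in sets[i]:
                            sets[i].add(new); worklist.append(new)
                elif i < n and tokens[i] == sym:
                    # SCAN: the next token matches; move the item forward.
                    sets[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # COMPLETE: lhs is finished; advance items waiting on it.
                for plhs, prhs, pdot, porigin in list(sets[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        new = (plhs, prhs, pdot + 1, porigin)
                        if new not in sets[i]:
                            sets[i].add(new); worklist.append(new)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in sets[n])

print(recognize(["1", "+", "2", "*", "2"]))  # True
```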

The standard paper on Earley parsing is “Practical Earley Parsing” by Aycock and Horspool. It shows how to merge an LR automaton with an Earley parser, based on work from an earlier 1996 paper, “A Faster Earley Parser”. There's also the Marpa paper, which has some code snippets and improvements on the Aycock paper. Then there are a bunch of LR papers that look vaguely interesting but I haven't read yet. I'm honestly not sure I will.

So an LL parser has the first and follow sets, and then an LR parser is shift-reduce and builds up this LR automaton. And the LR automaton is apparently the same for all the LR algorithms SLR/LALR/GLR, an LR(0) automaton, because LR(1) automata with lookahead are too big. And then the Aycock-Horspool paper uses Earley-LR, a.k.a. LRE following the 1996 paper, also built on top of an LR(0) automaton. Meanwhile GLR is inefficient (exponential in the worst case) and nobody uses it besides people who don’t care about parser speed.
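For reference, the LR(0) automaton itself isn't much code to build. Here's a sketch of the usual closure/goto construction (my own toy code, not from the Aycock-Horspool paper; same made-up grammar encoding as the earlier snippet):

```python
# Sketch of LR(0) item-set construction (closure and goto).
# GRAMMAR maps a nonterminal to a list of right-hand-side tuples.

GRAMMAR = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",)],
}

def closure(items):
    """Add a fresh item X -> . rhs for every nonterminal X after a dot."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    if (rhs[dot], prod, 0) not in items:
                        items.add((rhs[dot], prod, 0))
                        changed = True
    return frozenset(items)

def goto(items, sym):
    """Advance the dot over `sym` in every item where it applies."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == sym}
    return closure(moved)

def build_automaton(start="S"):
    start_state = closure({(start, rhs, 0) for rhs in GRAMMAR[start]})
    states, transitions = {start_state}, {}
    worklist = [start_state]
    while worklist:
        state = worklist.pop()
        symbols = {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}
        for sym in symbols:
            target = goto(state, sym)
            transitions[(state, sym)] = target
            if target not in states:
                states.add(target)
                worklist.append(target)
    return states, transitions

states, transitions = build_automaton()
print(len(states), "LR(0) states")
```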

Then there’s the Adaptive LL paper by the ANTLR guy about how to use LL lookahead to predict things. That actually has a very simple structure: it just pretends it's an LL(0) parser, except it has a lookahead function which constructs a deterministic finite automaton, and this DFA is what actually predicts the next production. And then it extends the automaton adaptively as it parses the input (which is the complicated part), and apparently it goes pretty fast. But the approach doesn’t handle ambiguous grammars or left recursion; he has a trick in the appendix but it doesn’t handle all cases. I still think I need to read more papers though. I haven’t looked at the “parsing with derivatives” paper, which is probably the most promising since it’s the newest approach.
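This isn't ALL(*) — the real thing simulates an augmented transition network — but the flavor of "adaptive" that I took away is just: decide an alternative the slow way once, then cache the decision keyed by the lookahead you saw, so the prediction structure grows as the parser sees input. A toy sketch of that idea (all the rule shapes and names here are made up):

```python
# Not ALL(*) itself, just the adaptive-prediction flavor: on first visit try
# each alternative speculatively, then cache the winner keyed by the
# lookahead prefix, so later decisions at the same lookahead are a lookup.

prediction_cache = {}  # (rule, lookahead prefix) -> alternative index

def adaptive_predict(rule, alternatives, tokens, pos, k=2):
    key = (rule, tuple(tokens[pos:pos + k]))
    if key not in prediction_cache:
        for i, alt in enumerate(alternatives):
            if alt(tokens, pos) is not None:   # speculative attempt
                prediction_cache[key] = i
                break
        else:
            raise SyntaxError(f"no alternative of {rule} matches at {pos}")
    return prediction_cache[key]

# Toy alternatives for a hypothetical rule stmt -> ID "=" ID ";" | ID "(" ")" ";"
def assign(tokens, pos):
    t = tokens[pos:pos + 4]
    return pos + 4 if len(t) == 4 and t[1] == "=" and t[3] == ";" else None

def call(tokens, pos):
    t = tokens[pos:pos + 4]
    return pos + 4 if len(t) == 4 and t[1] == "(" and t[2] == ")" and t[3] == ";" else None

alts = [assign, call]
print(adaptive_predict("stmt", alts, ["x", "=", "y", ";"], 0))   # 0
print(adaptive_predict("stmt", alts, ["f", "(", ")", ";"], 0))   # 1
```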


But first I'm going to read this paper “SPPF-style parsers from cubic recognizers”. There are two main algorithms in the paper: the Earley algorithm and RIGLR. So we'll start with the Earley algorithm. The Earley algorithm, let’s see... so it has packed nodes and non-packed nodes. I found a nice blog post on SPPF trees. But the algorithm is still impenetrable.


I guess going section by section is easier than trying to scan back and forth in the paper for clues. The first section is just the introduction: extending the textbook Earley recognizer to a parser in the obvious way returns spurious derivations. Then they have BRNGLR, which is the alternative that they compare with. Section 2 is just definitions; they include the created SPPF nodes in the Earley items, so the triples are labeled with grammar symbol / position, parent set index, and SPPF node. And then there are derivation trees, and the shared forests are binarized because nodes have out-degree at most 2. Then they have a summary of Earley’s algorithm in one sentence, around the middle of page 57. I think the contributions of this SPPF paper are mostly in the form of how to write an Earley parser (and RIGLR parser) correctly, and the amount of explanation of the motivation or reasoning behind it is essentially nil. However, it does have a good section that separates the phases of Earley parsing out, and it seems to pay more attention to the order in which the scanner / completer / predictor get executed than other papers, which are just like “execute them all until they stop adding items”.
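As far as I can tell, the node shapes boil down to something like this. This is my own hedged sketch of the data structures, with made-up names; in the paper the labels are grammar symbols or dotted rules plus left/right extents.

```python
# Sketch of SPPF node shapes: symbol/intermediate nodes labeled
# (label, start, end), with packed nodes underneath; each packed node has at
# most two children, which is the binarization that keeps the forest cubic.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PackedNode:
    production: object              # which production / dotted rule was used
    pivot: int                      # split point between the two children
    left: Optional["SymbolNode"]    # at most two children per packed node
    right: Optional["SymbolNode"]

@dataclass
class SymbolNode:
    label: object                   # grammar symbol, or an intermediate label
    start: int                      # left extent in the input
    end: int                        # right extent in the input
    families: List[PackedNode] = field(default_factory=list)

    def add_family(self, production, pivot, left, right):
        # A second family under the same (label, start, end) node is exactly
        # how the SPPF represents an ambiguity. (Duplicate checking omitted.)
        self.families.append(PackedNode(production, pivot, left, right))
```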


So the other thing in life I’m thinking about is the operational semantics and whether using the optimal reduction graph machine is a good idea. I've been mulling it over and concluded that it can’t hurt; I mean, I know you can't implement eager reduction later, but it’s easy to add special-purpose reductions that split up the graph, and hard to find those just by thinking about the problem in advance. So yeah, in conclusion, I think the best thing is to implement optimal reduction.


The other thing I was thinking about was scopes. So if we have a statement like “let x equal 3 in y”, then, interpreting that statement, it seems like the best approach would be to construct a record with x = 3 as its contents, and then y will be passed that record as an implicit environment argument. That should let all the names and stuff be resolved appropriately, and then you'll have read and write scope commands, and those commands will end up passing things through.
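Something like this toy interpreter fragment is what I have in mind (the AST encoding is made up, and the real thing would pass scope read/write commands around rather than a bare dict):

```python
# Toy sketch: interpret `let x = 3 in y` by building a record {x: 3} and
# passing it to the body as an implicit environment argument.

def eval_expr(expr, env):
    kind = expr[0]
    if kind == "var":                      # ("var", name)
        return env[expr[1]]
    if kind == "lit":                      # ("lit", value)
        return expr[1]
    if kind == "let":                      # ("let", name, bound, body)
        _, name, bound, body = expr
        # Extend the implicit environment record with the new binding.
        new_env = {**env, name: eval_expr(bound, env)}
        return eval_expr(body, new_env)
    raise ValueError(f"unknown node {kind}")

print(eval_expr(("let", "x", ("lit", 3), ("var", "x")), {}))  # 3
```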

So another thing to do is look at the difference between parsing regular expressions with derivatives and parsing grammars with derivatives, and I also want to see what Antimirov derivatives are about, because apparently they help with the capturing-group stuff and that might be useful for parsing. It would be interesting to have non-capturing productions that don’t show up in the generated parse tree.
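For my own reference, the regular-expression version of derivatives is small enough to write down — this is the Brzozowski version; an Antimirov (partial) derivative would return a set of expressions instead of a single one. The constructor encoding here is made up:

```python
# Minimal Brzozowski-derivative regex matcher (no capturing groups).

def nullable(r):
    kind = r[0]
    if kind in ("empty", "char"): return False
    if kind in ("eps", "star"):   return True
    if kind == "seq":             return nullable(r[1]) and nullable(r[2])
    if kind == "alt":             return nullable(r[1]) or nullable(r[2])

def deriv(r, c):
    kind = r[0]
    if kind in ("empty", "eps"):  return ("empty",)
    if kind == "char":            return ("eps",) if r[1] == c else ("empty",)
    if kind == "alt":             return ("alt", deriv(r[1], c), deriv(r[2], c))
    if kind == "star":            return ("seq", deriv(r[1], c), r)
    if kind == "seq":
        d = ("seq", deriv(r[1], c), r[2])
        return ("alt", d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (ab)*a  matches "aba" but not "ab"
r = ("seq", ("star", ("seq", ("char", "a"), ("char", "b"))), ("char", "a"))
print(matches(r, "aba"), matches(r, "ab"))  # True False
```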


So back to the big picture, parser goals:

Number 1 is getting to parse regular-expression-type things and EBNF, or even boolean grammars, on the right-hand sides

Number 2 is parsing LR(k) grammars in linear time

Number 3 is generating SPPF-packed parse forests as the result

Number 4 is handling recursive grammars, cyclic productions, and nullable elements


So let's talk about the structure of a parser. A parser is generally something that takes a string of characters and the grammar and produces an output data structure, which in our case is going to be the SPPF format, because that's the easiest format to use so far and it’s cubic time for ambiguous grammars. I guess that’s slow, but there hopefully isn’t much ambiguity in a grammar.

The first step in parsing is to process the grammar into a usable form, and so when we're building a parser generator, this does a lot of pre-processing and probably builds a big finite state machine that handles every character in the grammar. It seems like that's kind of necessary for efficiency. And there are some papers on incremental parsers and on modifying LR state machines to add or remove grammar rules.

As far as testing goes, I guess I’ll use a number of different complicated expressions. First are the examples I've collected from other parser papers, the counter-examples that are not supposed to work. The simplest is probably the calculator or arithmetic language; it’s not LR, and handling precedence and associativity is hard.


So the first goal is to understand the SPPF format. Parsing is in some sense an assignment problem: we want to assign values to the different tree start-and-stop positions in order to construct a tree. And in Marpa it’s a bit more complicated because we have different start-and-stop positions for each token-Earleme-lexeme thing.

The interesting part is handling recursion. Recursion happens on the left side, on the right side, and in the middle, and then you can have cyclic rules that refer to each other directly or indirectly, so you have indirect recursion. So how do we deal with this? The answer is we have to maintain multiple states. The question is how you maintain them. The Earley parser seems to have the most efficient representation of ambiguity because it allows sharing token states between multiple different forms, whereas GLR uses a more complicated stack and seems to have a lot of duplication since it only shares prefixes and suffixes. So the Earley parser seems to have a finer-grained sharing/tracking algorithm.

So the ambiguity part is interesting as well. Earley parsing handles left recursion really well: it just collapses the left recursion into a single state, and then Leo’s optimization from his paper handles right recursion really well and again collapses everything into a single state. The reason to use Aycock-Horspool’s LR stuff is that it collapses a lot of the state transitions by running them through the LR automaton. The questionable part is how it handles nullable rules. Nullable rules, I was thinking, can also be handled by creating an epsilon-LR automaton and adding some more complicated transition logic, so it's not clear that you have to modify the grammar like Aycock-Horspool does, or if it's possible to just construct a special deterministic LR automaton instead that works for the graph.
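Either way, the set of nullable nonterminals has to be precomputed, and that part at least is easy (a sketch, same made-up grammar encoding as the earlier snippets):

```python
# Compute the set of nullable nonterminals: the precomputation that both the
# Aycock-Horspool grammar rewrite and any epsilon-aware automaton would need.

def nullable_nonterminals(grammar):
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            if lhs in nullable:
                continue
            for rhs in productions:
                if all(sym in nullable for sym in rhs):  # empty rhs counts
                    nullable.add(lhs)
                    changed = True
                    break
    return nullable

print(nullable_nonterminals({"A": [()], "B": [("A", "A")], "C": [("c",)]}))
# {'A', 'B'} (in some order)
```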

The next question is how do I use it for operators? The operator precedence graph is also hard for the parser to deal with. I found this parsing thesis that disambiguates SPPF trees using mixfix precedence information, and it seems easy enough, so I just have to read it.

The next thing is how to handle layout parsing in lexing, indentation, and whitespace. These seem like trivial things, since I found some nice papers (which I haven't read yet either), but they should just be modifications to the grammar with more concatenative operators and extra ways of matching tokens.

I have no idea how this cubic Earley algorithm works. It's just so complicated. It's really intimidating. The start can be ignored; it’s initialization. You can just use a dummy rule with the start symbol on the right hand side and it will initialize itself, and this will also be needed for the Leo optimization to work anyway. This strategy is used in Bison too in its LR parsers.
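The augmentation itself is just this (a sketch, with a made-up name for the fresh symbol):

```python
# The augmentation trick: add a fresh start rule S' -> S so initialization
# (and the Leo right-recursion optimization) has a single item to start from.
grammar = {"S": [("a", "S"), ("a",)]}
AUGMENTED_START = "S'"                      # hypothetical fresh symbol name
grammar[AUGMENTED_START] = [("S",)]
```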

The next part of the algorithm is all in a loop over the Earley sets; the loop index is the position of the character in the input.

Then there's the set R, which is defined in their earlier 2008 paper on Earley parsing along with Q. They’re both queues of pending items to be processed. Q is the set of items whose next symbol is a terminal and R is everything else. So R is processed first and then Q is processed to move to the next token. And then there is a holding set H for completing null derivations.
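So the way I'd write the R / Q split — hedged, this is my reading and not the paper's pseudocode — is something like:

```python
# My reading of the R / Q split: a new item goes into Q if the symbol after
# the dot is a terminal that matches the upcoming input token (so it's
# waiting to be scanned), and into R otherwise (so it's waiting for the
# predictor or completer). `grammar` maps nonterminals to their productions.

def enqueue(item, next_token, grammar, R, Q):
    lhs, rhs, dot, origin, node = item
    next_sym = rhs[dot] if dot < len(rhs) else None
    if next_sym is not None and next_sym not in grammar:
        # Terminal after the dot: only keep it if it can actually be scanned.
        if next_sym == next_token:
            Q.append(item)
    else:
        # Nonterminal after the dot, or a completed item.
        R.append(item)
```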

The Earley parser algorithm's input is a grammar Gamma that is composed of (let me look it up again) a set N of non-terminal symbols, a set T of terminal symbols, an element S in N called the start symbol, and a set P of numbered grammar rules of the form A ::= alpha, where A is a non-terminal symbol and alpha is a possibly empty string of terminals and non-terminals. The symbol epsilon denotes the empty string.
And then there is the input string a1 a2 ... an. Okay.
So the actual Earley parser algorithm: first, we initialize E0 through En to the empty set; we also initialize the sets R, Q′ (Q-prime), and V to be empty.
Next, for all rules in P (the set of grammar rules), we find the ones whose left-hand side is the start symbol S, call their right-hand side alpha, and then check whether alpha is empty or begins with a non-terminal, or whether it begins with a terminal... alright, this is the point where writing it down as code directly is easier.
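So, a hedged transcription of that initialization step as I understand it. It follows the paper's setup, but the data representation — tuples of (lhs, rhs, dot, origin, SPPF node) — is mine:

```python
# Initialization: items S ::= . alpha go into E0 if alpha is empty or starts
# with a nonterminal, and into Q' if alpha starts with the first input token.

def initialize(grammar, start, tokens):
    n = len(tokens)
    E = [set() for _ in range(n + 1)]   # Earley sets E_0 .. E_n
    Q_prime = set()                     # items whose next symbol is tokens[0]
    for rhs in grammar[start]:
        item = (start, rhs, 0, 0, None)     # (lhs, rhs, dot, origin, SPPF node)
        if not rhs or rhs[0] in grammar:    # empty, or starts with a nonterminal
            E[0].add(item)
        elif n > 0 and rhs[0] == tokens[0]: # starts with the next terminal
            Q_prime.add(item)
    return E, Q_prime
```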

So I also wanted to talk about types. I really don't like types cluttering the program and making it run slow, but they are useful for type inference. In particular you have constraint solving. Constraint solving works really well in Haskell, and it lets you write more code automatically, so I want to expand on that in my language and have at least some amount of constraint solving. In particular you need a coerce function that automatically converts between any two types, and then you need to be able to overload methods and have them resolved by type.
The most useful thing I've seen so far is the subtyping algorithm in Brick and MLSub which Stephen Dolan wrote. As far as doing something else, I think the only solution is to hack until it works. I’ve read a bunch of type papers and they all end up leaving it up to you to make it work. If they do present an algorithm it’s very simple, and then the papers are mostly about proving the algorithm correct and complete rather than about the implementation.
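As a toy of what I mean by coerce plus overload-by-type — this is just an illustration in Python, nothing to do with how Stroscot would actually implement it, and all the names are made up:

```python
# Toy: a coerce function between types, and overloads resolved by argument
# type, with coercion as a fallback when there's no exact match.

coercions = {(int, float), (int, complex), (float, complex)}

def coerce(value, target):
    if type(value) is target:
        return value
    if (type(value), target) in coercions:
        return target(value)
    raise TypeError(f"no coercion from {type(value).__name__} to {target.__name__}")

overloads = {}   # (name, argument type) -> implementation

def overload(name, argtype):
    def register(fn):
        overloads[(name, argtype)] = fn
        return fn
    return register

@overload("show", int)
def show_int(x): return f"int {x}"

@overload("show", float)
def show_float(x): return f"float {x}"

def call(name, arg):
    # Resolve by exact type, falling back to any coercible overload.
    key = (name, type(arg))
    if key in overloads:
        return overloads[key](arg)
    for (n, t), fn in overloads.items():
        if n == name and (type(arg), t) in coercions:
            return fn(coerce(arg, t))
    raise TypeError(f"no overload of {name} for {type(arg).__name__}")

print(call("show", 3))     # int 3
print(call("show", 3.5))   # float 3.5
```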

I did a crawl of old papers citing Earley parsers and I didn't really find anything new. Google only shows the first 1000 results, sorted by citations, and there’s the Leo paper and the Horspool paper but not much else. I downloaded a few just to look at later but there aren't any really novel ideas, just optimizations and benchmarks.