Wednesday, January 23, 2019

More on parsing for programming languages

Marpa, although nice at first glance (parser shootout w/ Marpa in the lead), is a nightmare from a maintainability/forkability perspective. All the code is written by one guy, hence the bus factor is 1. The only real documentation of the code is this handwavy paper focused on the algorithm's complexity, the API overview, and then the literate-programming book of code. The literate style seems necessary to deal with the syntactic overhead of writing C, but if you're writing a new programming language that stuff isn't relevant, and it turns a 10-line algorithm into a 30-page nightmare. There's an attempt at re-implementing it in C#, but unfortunately I think it's nowhere near Marpa in terms of efficiency or features, since the author isn't well-versed in parsing theory. The Marpa guy honestly doesn't have much background either, but I guess he's old enough to have been in university when these algorithms were the hot research topic, and luckily some of it stuck.

One interesting feature of Marpa that isn't in the C code is its two-level grammar support (the "scanless interface" or SLIF). I'm not sure exactly how it works since I didn't look at the Perl, but there seems to be something called an "earleme" that gets processed for each character position, and under the hood there are two Marpa-C grammars. At the surface level, though, you just have two types of rules, lexical rules and grammar rules, both CFGs, plus some extra lexical conventions about discarding whitespace and finding the "longest acceptable token match".
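To make the "longest acceptable token match" idea concrete, here's a rough sketch of how I imagine the token-level loop working; the rule set, the regexes standing in for the lexical CFGs, and the parser.accepts() predicate are all invented for illustration, not Marpa's actual API:

    import re

    # Hypothetical setup: regexes stand in for the lexical rules, a discard rule
    # eats whitespace, and `parser` is assumed to expose accepts(token_type),
    # i.e. "which tokens could the grammar level take next?"
    LEXICAL_RULES = [
        ("number", re.compile(r"\d+")),
        ("name",   re.compile(r"[A-Za-z_]\w*")),
        ("op",     re.compile(r"[-+*/=]")),
    ]
    DISCARD = re.compile(r"\s+")

    def next_token(text, pos, parser):
        """Longest acceptable token match at position pos."""
        m = DISCARD.match(text, pos)
        if m:
            pos = m.end()                  # discarded: no token emitted
        if pos >= len(text):
            return None, pos
        best = None
        for tok_type, pattern in LEXICAL_RULES:
            m = pattern.match(text, pos)
            # only tokens the grammar level can currently accept are candidates
            if m and parser.accepts(tok_type) and (best is None or m.end() > best[2]):
                best = (tok_type, m.group(), m.end())
        if best is None:
            raise SyntaxError("no acceptable token at position %d" % pos)
        tok_type, value, end = best
        return (tok_type, value), end

The point is that the grammar level feeds back into the lexer: only tokens the parser can currently accept are candidates, and among those the longest match wins.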

The Spoofax Workbench implements a similar two-level "scannerless" approach, though with GLR. I didn't look at the (Java-ish) code at all, but conceptually at least they have an explicit "whitespace expansion" step which inserts optional layout (whitespace) productions between the symbols of every grammar production, and greedy matching is implemented by means of follow restrictions. They also have keyword priorities so that reserved words generate syntax errors when they are used as ordinary identifiers; this seems like a misfeature though. Why reject a perfectly unambiguous program just because of a stupid language specification? In general, just browsing the documentation, it seems like GLR parsing leads to mysterious parse errors when you try to disambiguate using priority or associativity. Marpa hasn't seen much real-world use, so it could be a problem there too, I don't know.
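As a sketch of what I think the whitespace-expansion step amounts to (my own toy grammar representation, not Spoofax's actual machinery): take each production and interleave an optional layout nonterminal between its symbols:

    # grammar: dict mapping a nonterminal to a list of productions (symbol lists);
    # LAYOUT? is a nonterminal that derives whitespace/comments or nothing.
    LAYOUT = "LAYOUT?"

    def expand_whitespace(grammar):
        expanded = {}
        for lhs, productions in grammar.items():
            new_productions = []
            for prod in productions:
                new_prod = []
                for i, symbol in enumerate(prod):
                    if i > 0:
                        new_prod.append(LAYOUT)   # optional layout between symbols
                    new_prod.append(symbol)
                new_productions.append(new_prod)
            expanded[lhs] = new_productions
        return expanded

    # Stmt -> "if" Expr "then" Stmt
    # becomes
    # Stmt -> "if" LAYOUT? Expr LAYOUT? "then" LAYOUT? Stmt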

Another interesting point is the rule-alternative syntax. Marpa models itself after EBNF with pipes, and then has an action after each alternative defining how to construct the tree, like "A := B { B a } | C { C a }". Spoofax uses a separate line for each alternative, like "A.B = B" / "A.C = C". The action syntax closely models yacc/bison/etc. and their philosophy, while the Spoofax approach gives each alternative a constructor name and hence an automatic ATerm ("Annotated Term") representation of the parse tree. For comparison, DMS just uses "A = B" / "A = C" and then uses "token numbers" (which seem to act roughly like grammar source line numbers, only less useful) to specify which production was used in the parser output. This is the least writing overhead for grammars but seems unmaintainable. A compromise might be numbering the constructors, A.1, A.2, etc., which is implicit like DMS but maps to code structure instead of surface syntax.
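A toy illustration of the difference, with invented names: constructor-per-alternative gives you a term representation for free, whereas pipe-plus-action attaches an explicit build function to each alternative:

    # Spoofax-style: each alternative has a constructor name, and the name
    # doubles as the node label, so a term representation falls out automatically.
    rules_with_constructors = {
        ("Expr", "Add"): ["Expr", "+", "Expr"],
        ("Expr", "Num"): ["number"],
    }

    def build_term(constructor, children):
        return (constructor, children)     # e.g. ("Add", [("Num", ["1"]), ("Num", ["2"])])

    # yacc/Marpa-style: a list of alternatives, each with an explicit action.
    rules_with_actions = {
        "Expr": [
            (["Expr", "+", "Expr"], lambda left, _op, right: ("Add", [left, right])),
            (["number"],            lambda n: ("Num", [n])),
        ],
    }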

DMS just uses a 'real' regexp-based lexer. I guess it's alright, but what I really want is composable/scopable lexing, so that you can embed arbitrary fragments of one language inside another language. That doesn't really work with a single fixed lexer when quotes, escapes, etc. vary from language to language. This paper on "local lexing" seems like the right idea, though maybe impossible to implement well. At least there's actual code.
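Here's a half-baked sketch of what I mean by scopable lexing: a stack of lexer scopes, where entering an embedded-language region pushes that language's token rules, including its own quoting and escape conventions. Everything here is invented for illustration; it's not from the local-lexing paper:

    import re

    class LexerScope:
        def __init__(self, name, rules, enter=None, leave=None):
            self.name = name
            self.rules = rules        # list of (token_type, compiled regex)
            self.enter = enter or {}  # token_type -> scope to push
            self.leave = leave        # token_type that pops this scope

    # Embedded "SQL" with its own '' string escapes, inside a host language
    # with backslash escapes; $$ delimits the embedded region.
    SQL = LexerScope("sql",
        rules=[("sql_str",  re.compile(r"'(?:[^']|'')*'")),
               ("sql_word", re.compile(r"[A-Za-z_]\w*")),
               ("end_sql",  re.compile(r"\$\$"))],
        leave="end_sql")

    HOST = LexerScope("host",
        rules=[("host_str",  re.compile(r'"(?:[^"\\]|\\.)*"')),
               ("name",      re.compile(r"[A-Za-z_]\w*")),
               ("begin_sql", re.compile(r"\$\$"))],
        enter={"begin_sql": SQL})

    def lex(text):
        stack, pos, out = [HOST], 0, []
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            scope = stack[-1]
            # longest match among the current scope's rules
            candidates = [(t, p.match(text, pos)) for t, p in scope.rules]
            tok_type, m = max(candidates, key=lambda tm: tm[1].end() if tm[1] else -1)
            if m is None:
                raise SyntaxError("stuck at %d in scope %s" % (pos, scope.name))
            out.append((scope.name, tok_type, m.group()))
            if tok_type in scope.enter:
                stack.append(scope.enter[tok_type])
            elif tok_type == scope.leave:
                stack.pop()
            pos = m.end()
        return out

    # lex("wrap $$ SELECT name FROM t $$ done")
    # tokens come out tagged host/sql/host, each scope using its own rules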

Watching Parr's talk on ANTLR, it seems pretty clear you need a generalized tool. Grammar ambiguity isn't obvious in practice (it's undecidable in general); I guess you could try statically checking for it, but honestly letting the grammar be ambiguous and fixing the common cases seems like the right approach. Putting in parentheses isn't that expensive for readability, while keeping track of hundreds of disambiguation rules is. Even some ML approach, like predicting which rule is more likely the intended one, is probably going to break in practice. The rest of the talk is mostly Parr spreading FUD and bashing everything besides ANTLR. He does bring up another idea though, which is how to process the parse tree afterwards. He listed OMeta, TXL, Stratego, TOM/Java, Rascal, and his own XPath-based ANTLR visitor-pattern thing as possibilities/inspiration; I'll research those further.

The most promising approach I missed previously is parsing with derivatives; it's hard to tell what the algorithm actually does, though, since it works by rewriting the grammar rather than by building a state machine.
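To get a feel for the "rewrite the grammar" style, the regular-language version of the idea (Brzozowski derivatives) is easy to write down; the CFG version from the paper adds laziness, memoization, and fixpoints for nullability on top of this, which the sketch skips:

    # Regexes as little "grammars": each character consumed rewrites the
    # expression into one describing the remaining language.
    EMPTY, EPS = ("empty",), ("eps",)      # no strings / only the empty string

    def char(c):   return ("char", c)
    def alt(a, b): return ("alt", a, b)
    def seq(a, b): return ("seq", a, b)
    def star(a):   return ("star", a)

    def nullable(r):
        tag = r[0]
        if tag in ("eps", "star"):
            return True
        if tag == "alt":
            return nullable(r[1]) or nullable(r[2])
        if tag == "seq":
            return nullable(r[1]) and nullable(r[2])
        return False                       # empty, char

    def deriv(r, c):
        """Language of r after consuming c -- a rewritten expression, not an automaton state."""
        tag = r[0]
        if tag in ("empty", "eps"):
            return EMPTY
        if tag == "char":
            return EPS if r[1] == c else EMPTY
        if tag == "alt":
            return alt(deriv(r[1], c), deriv(r[2], c))
        if tag == "seq":
            first = seq(deriv(r[1], c), r[2])
            return alt(first, deriv(r[2], c)) if nullable(r[1]) else first
        if tag == "star":
            return seq(deriv(r[1], c), r)

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    # matches(seq(char("a"), star(char("b"))), "abbb")  ->  True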
