Wednesday, December 19, 2018

Programming language notes

Typical compiler courses focus on syntax and a compilation pipeline, small projects satisfying whatever toy problem the author had in mind. If we're lucky the pipeline will be structured and decomposed into multiple entry points for a REPL, but usually it'll just be a big blob.

Except for the parser, we won't get a usable program without a parser and LR parsing is pretty much all there is, since LL is a subset of LR. A more general category would be shift-reduce. There are some interesting non-deterministic parsers like GLL (2), GLR, Marpa, Viterbi, ALL(*), etc.  DMS Software Reengineering Toolkit claims to use GLR for pretty much all mainstream programming languages. All the courses I've seen use parser generators or hand-rolled recursive descent. The Marpa guy wrote a timeline which seems pretty accurate. He calls CYK "Sakai's Algorithm", a paper shows the only time it outperforms Earley is with A -> a | AA grammar due to state explosion. A small but significant subproblem is layout parsing, whitespace and so forth; this paper and related work seems relevant and efficient.

Processing a raw AST by itself is hard work, so a compiler also has auxiliary data structures such as symbol tables, import graphs, and type annotations. This paper coins the term program object model (POM) for the object model representing a program in the compiler.

Left out of compiler courses are the standard features of any program: logging, options, error messages, monitoring and debugging.

For compilers in particular though, once a language breaks out of the few-month-throwaway death trap, the language itself pretty much fossilizes and then the standard library begins to evolve and grow. Most closely linked is the runtime system, which specifies a loose collection of calling conventions, bytecode interpretation, and memory management. Another way to look at it is that interpreters are just compilers with really big run-time systems. In fact we can formalize this with the Futamura projections by introducing the notion of a "specializer".

Concurrency is interesting, and in today's multi-core world probably a necessity. Task graphs are definitely something to keep in mind, at a course-grained level. The other kind of concurrency is basically databases, where STM shines, when data isn't just being processed but has to be shared. Both of these end up in the runtime system usually.

A key component at this stage is motivation. It's not like anyone's paying for programming language design. The value of the top programming languages lies in their extensive test suites, documentation, and optimizations for performance, which are basically self-fulfilling prophecies. More people use it, the tests/docs/speed improves, rinse and repeat.

It would be nice if machine learning could just take test suites and produce a compiler. Test suites are easy to write and modify, refactoring is as simple as overwriting the expected results with what the compiler says, and test cases are small and self-contained so the amount of context/experience needed is small as well.

Repositories are log-normal sized. But the mean size is a factor of what programming language is used; Java programs are bigger.

Documenting decisions is a key point in software development. Without documentation, there's no way to go back and revisit the decision except by extensive refactoring and commit log inspection.

No comments:

Post a Comment