Wednesday, January 30, 2019

Name resolution

The part of Spoofax I haven't really looked at yet, but which sounds cool, is their name binding language: a (partial) implementation of semantics using declarative rules! But Stroscot is imperative, so the interesting part is the name resolution algorithm. We have "scopes", which have a parent-child relationship where lookup falls back to the parent if the name isn't found in the child.
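
To make the lookup rule concrete, here's a minimal sketch in Haskell (the types and names are my own, not Spoofax's):

    import qualified Data.Map as Map

    -- Each scope has a table of local definitions and an optional parent;
    -- lookup falls back to the parent when the child has no entry.
    data Scope v = Scope
      { locals :: Map.Map String v
      , parent :: Maybe (Scope v)
      }

    lookupName :: String -> Scope v -> Maybe v
    lookupName name scope =
      case Map.lookup name (locals scope) of
        Just v  -> Just v
        Nothing -> parent scope >>= lookupName name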

In their example, there are scopes 1, 2, and 3, plus use sites with outgoing edges and declarations with incoming edges. There are also imports and exports. Then a complicated tie-breaking graph-walking algorithm determines which use maps to which declaration. I don't know, I expected more. There's another paper that points out that name resolution and type checking are often co-dependent; it seems to amount to even more complex graph walking.
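
My rough reading of the visibility order, as a toy sketch (an approximation of the idea, not the calculus from the paper): a use resolves to a local declaration first, then to declarations in imported scopes, then recursively to the parent scope.

    import qualified Data.IntMap as IntMap
    import qualified Data.Map as Map

    data ScopeNode = ScopeNode
      { decls    :: Map.Map String Int  -- name -> declaration id
      , imports  :: [Int]               -- ids of imported scopes
      , parentOf :: Maybe Int           -- id of the enclosing scope
      }

    type ScopeGraph = IntMap.IntMap ScopeNode

    -- Resolve a use of `name` occurring in scope `scopeId`.
    resolve :: ScopeGraph -> Int -> String -> Maybe Int
    resolve graph scopeId name = do
      node <- IntMap.lookup scopeId graph
      case Map.lookup name (decls node) of
        Just d  -> Just d                      -- local declarations win
        Nothing ->
          case [ d | s <- imports node
                   , Just n <- [IntMap.lookup s graph]
                   , Just d <- [Map.lookup name (decls n)] ] of
            (d:_) -> Just d                    -- then imported declarations
            []    -> parentOf node >>= \p -> resolve graph p name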

Haskell doesn't have type-directed name resolution (TDNR), so it has a separate "renaming" pass before type checking.

Wednesday, January 23, 2019

More on parsing for programming languages

Marpa, although nice at first glance (there's a parser shootout with Marpa in the lead), is a nightmare from a maintainability/forkability perspective. All the code is written by one guy, so the bus factor is 1. The only real documentation of the code is this handwavy paper focused on the algorithm's complexity, the API overview, and the literate-programming book of code. The literate style seems necessary to deal with the syntactic overhead of writing C, but if you're writing a new programming language that stuff isn't relevant, and it turns a 10-line algorithm into a 30-page nightmare. There's an attempt at re-implementing it in C#, but unfortunately I think it's nowhere near Marpa in terms of efficiency or features, since the author isn't well-versed in parsing theory. The Marpa guy honestly doesn't have much background either, but I guess he's old enough to have been in university when these algorithms were the hot research topic, and luckily some of it stuck.

One interesting feature of Marpa that isn't in the C code is its two-level grammar support (the "scanless interface", or SLIF). I'm not sure how it works, since I didn't look at the Perl, but there seems to be something called an "earleme" that gets processed for each character position, and then two Marpa-C grammars. At the surface level, though, you just have two kinds of rules, lexical rules and grammar rules, both CFGs, plus some extra lexical conventions for discarding whitespace and finding the "longest acceptable token match".
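
The "longest acceptable token match" part is easy to sketch. Roughly (in Haskell; the token names and rule representation are invented for illustration, this is not Marpa's API): among the tokens the parser is currently willing to accept, pick the lexical rule with the longest match.

    import Data.Char (isAlpha, isDigit, isSpace)
    import Data.List (maximumBy)
    import Data.Ord (comparing)

    data Tok = Ident | Number | Whitespace deriving (Eq, Show)

    -- A lexical rule: a token name plus a function giving the length of
    -- the longest prefix of the input it matches (0 = no match).
    type LexRule = (Tok, String -> Int)

    rules :: [LexRule]
    rules =
      [ (Ident,      length . takeWhile isAlpha)
      , (Number,     length . takeWhile isDigit)
      , (Whitespace, length . takeWhile isSpace) ]

    -- Pick the longest match among the tokens the parser will accept here,
    -- returning the token, its lexeme, and the remaining input.
    longestAcceptable :: (Tok -> Bool) -> String -> Maybe (Tok, String, String)
    longestAcceptable acceptable input =
      case [ (t, n) | (t, match) <- rules, acceptable t
                    , let n = match input, n > 0 ] of
        [] -> Nothing
        cs -> let (t, n) = maximumBy (comparing snd) cs
              in Just (t, take n input, drop n input)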

The Spoofax Workbench implements a similar two-level "scannerless" thing, though they use GLR. I didn't look at the (Java-ish) code at all, but conceptually at least they have an explicit "whitespace expansion" step which inserts optional layout productions between the symbols of every grammar production. Greedy matching is then implemented by means of follow restrictions. They also have keyword priorities so that reserved words generate syntax errors when used as identifiers; this seems like a misfeature though. Why reject a perfectly unambiguous program just because of a stupid language specification? In general, just browsing the documentation, it seems like GLR parsing leads to mysterious parse errors when trying to disambiguate using priority or associativity. Marpa hasn't seen much real-world use, so it could be a problem there too, I don't know.
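
The whitespace-expansion step itself is simple enough to sketch. Roughly (my own toy representation, not Spoofax's implementation): interleave an optional-layout nonterminal between the symbols of each production.

    import Data.List (intersperse)

    data Symbol = T String | NT String deriving (Eq, Show)

    data Production = Production { lhs :: String, rhs :: [Symbol] }
      deriving Show

    -- An optional-layout nonterminal matching whitespace and comments.
    layout :: Symbol
    layout = NT "LAYOUT?"

    -- Insert optional layout between every pair of adjacent symbols.
    expandWhitespace :: Production -> Production
    expandWhitespace (Production l syms) =
      Production l (intersperse layout syms)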

Another interesting point is the rule alternative syntax. Marpa models itself after EBNF with pipes, and then has an action after each alternative defining how to construct the tree, like "A := B { B a } | C { C a }". Spoofax uses a separate line per alternative with a constructor name, like "A.B = B" and "A.C = C". The action syntax closely models yacc/bison/etc. and their philosophy, while the Spoofax approach gives constructor names and hence an automatic ATerm ("Annotated Term") representation of the parse tree. For comparison, DMS just uses "A = B" and "A = C" and then uses "token numbers" (which seem to act roughly like grammar source line numbers, but less useful) to specify which production is used in its parser output. This is the least writing overhead for grammars but seems unmaintainable. A compromise might be numbering the constructors, A.1, A.2, etc., which is implicit like DMS but maps more closely to code structure instead of surface syntax.
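
The constructor-name style maps directly onto an algebraic data type, which is why you get a term representation for free. A small illustration (the grammar and names are invented for the example):

    -- Spoofax-style alternatives:
    --   Exp.Add = Exp "+" Exp
    --   Exp.Num = INT
    -- Each alternative's constructor name becomes a node label, so the
    -- parser can build an ATerm-like tree with no user-written actions:
    data Exp = Add Exp Exp   -- from Exp.Add
             | Num Int       -- from Exp.Num
             deriving Show

    -- The yacc/Marpa style instead attaches an action to each alternative,
    --   exp ::= exp '+' exp  { Add $1 $3 }
    --         | INT          { Num $1 }
    -- so the tree shape is whatever the hand-written actions build.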

DMS just uses a 'real' regexp-based lexer. I guess that's alright, but what I really want is composable/scopable lexing, so that you can embed arbitrary fragments of one language inside another language. That doesn't really work with a fixed lexer when quotes, escapes, etc. vary from language to language. This paper on "local lexing" seems like the right idea, though it may be impossible to implement well. At least there's actual code.

Watching Parr on ANTLR, it seems pretty clear you need a generalized tool. Grammar ambiguity isn't obvious in practice (detecting it is undecidable in general); I guess you could try statically checking for it, but honestly letting the grammar be ambiguous and fixing the common cases seems like the right approach. Putting in parentheses isn't that expensive for readability, while keeping track of hundreds of disambiguation rules is. Even some ML approach like predicting which rule is more useful is probably going to break in practice. The rest of the talk is mostly Parr spreading FUD and bashing everything besides ANTLR. He does bring up another idea though, which is how to process the parse tree afterwards. He listed OMeta, TXL, Stratego, TOM/Java, Rascal, and his own XPath-based ANTLR visitor-pattern thing as possibilities/inspiration; I'll research those further.

The most promising approach I missed previously was parsing with derivatives; it's hard to tell what the algorithm does though, since it modifies the grammar rather than a state machine.
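
The core idea is easiest to see in the regular-expression case (Brzozowski derivatives); the parsing-with-derivatives papers extend this to full CFGs with laziness, memoization, and fixed points, all of which this sketch omits:

    -- The derivative of a grammar r with respect to a character c is a new
    -- grammar matching exactly the remainders of strings in r that start
    -- with c. Matching is repeated differentiation of the grammar itself;
    -- no state machine is ever built.
    data Re = Empty           -- matches nothing
            | Eps             -- matches only the empty string
            | Chr Char
            | Alt Re Re
            | Seq Re Re
            | Star Re

    nullable :: Re -> Bool    -- does the language contain the empty string?
    nullable Empty     = False
    nullable Eps       = True
    nullable (Chr _)   = False
    nullable (Alt a b) = nullable a || nullable b
    nullable (Seq a b) = nullable a && nullable b
    nullable (Star _)  = True

    deriv :: Char -> Re -> Re
    deriv _ Empty     = Empty
    deriv _ Eps       = Empty
    deriv c (Chr c')  = if c == c' then Eps else Empty
    deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
    deriv c (Seq a b)
      | nullable a    = Alt (Seq (deriv c a) b) (deriv c b)
      | otherwise     = Seq (deriv c a) b
    deriv c (Star r)  = Seq (deriv c r) (Star r)

    matches :: Re -> String -> Bool
    matches r = nullable . foldl (flip deriv) r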

Wednesday, January 16, 2019

The end

The nature of death has always been a subject of debate. One might compare it to sleep, another poorly understood phenomenon. On the one hand people die all the time, so it should be well-understood. On the other hand controlling the process of death (or indeed that of sleep) is extremely difficult. Hence the lack of understanding.

Even defining death is a difficult task. The subjective experience of death is by definition not possible to retrieve. The closest we get is near-death experiences, which are pretty much indistinguishable from comas or sleeping.

Another problem is discussing death: death anxiety means that discussion or even contemplation of the subject quickly gets dropped. Although perhaps this is because there just isn't much to say.

Wednesday, January 9, 2019

PL Design process

Designing a programming language without some idea of what it will be used for is a recipe for failure. The first observation is that programming languages tend to end up with particular patterns or 'styles' of coding, and these features get used heavily. For example, Haskell programs tend to have some giant record for keeping track of state, some ADT tree that gets manipulated, and various glue code in between. Tcl tends towards small programs and throwaway scripts, for when bash isn't powerful enough and you need complex loop logic or regular expressions. Agda, Coq, and other proof assistants are so complex to write that code is only written with the accompanying IDE. C++ tends towards giant methods ensconced in a few classes, plus a lot of small copy-paste classes for data storage and interface munging. Etc. Honestly though, my sample size is small, so evaluating this is hard.

Another idea was the schlep filter: writing a compiler is tedious and uninteresting. Hence any language gets a small boost once it's implemented, as other people will use it instead of writing their own. We can imagine a conversion funnel: name recognition -> basic syntax -> toy program -> side project -> startup -> active language developer. Designing a simple website for the PL and measuring conversion rates over time (even if/while it's not implemented) might be useful.

The "uninteresting" part isn't quite true; by my estimate a new PL gets posted to HN every 3 months or so. Perhaps information/attention is the most valuable commodity: nobody knows what PLs are out there or what's suited to their project. They might know the library or tool they need, though, and that fixes their choice of PL.

PL design seems to be mostly an art, similar to painting or writing. Since notation is a tool of thought, a well-designed PL ends up reshaping the programmer's view; thoughts that are hard to express are avoided, while thoughts that are easy are reinforced. We measure the performance of a PL as in the last post: possibility first, then speed and correctness.

Hence I think the best start is a collection of programming examples, one might almost say a test suite, where we use and/or implement various abstractions and algorithms. In literate programming the documentation comes with the code; here we want the code to be the documentation. It's not so much that the code is self-documenting as that the knowledge, the meaning, comes from experience with other PLs. For example, map is a common function in Haskell, but in languages like C one never sees it because writing it is really hard. And once we write down enough algorithms, maybe we can combine them to form a compiler. So the examples should include a few beginner-level programs but focus mostly on compiler algorithms. Then, as the language progresses, the collection will go from being interesting data to an executable test suite. Another good idea is Stone Forth, where any abstraction used must also have its implementation expressed. But this might bog things down, which is why everything starts as a sketch rather than an implementation. We want to design the language starting from the things we want to write, not from what can be implemented easily; all that matters is that it's implementable at all. VPRI demonstrated with their Nile project that programming language design can make some problems very simple. But Nile itself is pretty big, so it's not clear that they actually end up with a reduction in complexity.

Wednesday, January 2, 2019

Project Stroscot

Using a Markov name generator and pasting in GitHub names, esolangs and jokes, and Wikipedia names, I came up with the name Stroscot for my next programming language. I have created a GitHub repo and subreddit for it.

Any design for a language should first take into consideration the fate of ALGOL 68. Years of committee meetings resulted in an incomprehensible and almost unusable specification, defining numerous overspecialized features in exacting detail, often in a precise but impenetrable formalism. The goal of a language is to assist the programmer in "the reliable creation of sophisticated programs". Hence, the three tests for a language are: Can you write the program? Can you write the program quickly? Can you write the program correctly?

Declarative programming is basically dead, besides SQL and maybe Prolog. Even SQL doesn't look that declarative anymore; it's basically functional programming in disguise, just with a little query planner on top. Prolog is a bit more interesting, but at this point it's basically synonymous with depth-first SLD resolution, and everybody who writes Prolog assumes those are the execution semantics; hence it's just an interesting syntax for imperative programming with some backtracking capabilities thrown in. Anyway, the conclusion is that since "declarative programming" is defined as "anything that's not imperative", and declarative programming is dead, Stroscot will be an imperative programming language.
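
To illustrate the "functional programming in disguise" point, a query like SELECT name FROM users WHERE age > 30 is essentially a map over a filter, which the query planner is then free to reorder and optimize (toy Haskell types of my own, not any real API):

    data User = User { name :: String, age :: Int }

    -- SELECT name FROM users WHERE age > 30
    selectNames :: [User] -> [String]
    selectNames users = map name (filter (\u -> age u > 30) users)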

As far as types go, I'm still undecided. I definitely want to implement a lot of static analysis passes, but maybe types are a decoy; type inference is basically a limited form of flow and liveness analysis. Gradual types and inferred types are ideas floating around, though.

In the medium term, generating reasonably efficient compiled code is a goal. Doing assembly directly is a mess, although Intel maintains this x86 instruction library in C. Stone Knife Forth is a good start along those lines. Potion is similar, with a hard-coded JIT. There are also mid-level JIT libraries: Zed Shaw wrote a little prototype, EaRing, on top of GNU Lightning (slides), but GNU Lightning is basically dead and really crufty, so a new effort should use libjit or something (recommended here). The most recent thing is probably WebAssembly, but I'm not sure how useful targeting that would be. GHC and friends are experimenting with LLVM; LLVM was designed with C++ in mind, though, so it's not really flexible enough for assembly. But Kavon Farvardin is working on that.

The other big issue with compilation is memory management. Implementing garbage collectors is a pain, and I haven't really researched them, so for now I'll focus on just an interpreter, since then I can use the host language's GC. Alternatively, this micro virtual machines project looks interesting; at the very least they have a Rust implementation of the Immix GC. Skipping GC and doing a memory-safe language is also a possibility, but borrow checking seems hard. This guy spent a year on "Gradual Memory Management" and still can barely compile a simple C-like language. Although, to be fair, he's working on a harder problem; Rust can't even do a doubly-linked list without going into unsafe code or adding reference-counting overhead, i.e. basically doing GC yourself. Still another possibility is to just use malloc and free and the various specialized allocators, like normal C programmers do, and then maybe add static or dynamic analysis to detect errors. Tracing GC is invasive, so supporting explicitly managed and GC-managed memory simultaneously is still mostly an open research problem. But explicit is efficient: see this paper on "Destination Passing Style" memory usage, where their optimizer produces the fastest C code.

Henri Tuhola describes compiling via CPS as "insanely effective", which seems hard to argue with. There's a CPS-to-SSA pass here. According to this mail, SSA uses a DAG to represent scoping while CPS uses a tree plus parameter passing, hence CSE is harder in CPS. Some constructs in CPS can't be translated to SSA though, mostly ones generated by call/cc and related functions. The other supposed problems with CPS are that it fixes the evaluation order and that the transformed terms are harder to reason about or apply rewrite rules to. This paper says that CPS improves binding-time analysis and flow analysis by exposing more static calls. But again, CPS doesn't work well with LLVM.
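
For concreteness, here's a minimal call-by-value CPS transform (Plotkin-style) for a toy lambda calculus; the datatype and the fresh-name handling are my own simplifications, not anyone's production pass. Every intermediate result gets named and the order of evaluation becomes explicit, which is exactly the "fixes the evaluation order" point above.

    import Control.Monad.State (State, evalState, get, put)

    data Expr = Var String
              | Lam String Expr
              | App Expr Expr
              deriving Show

    fresh :: String -> State Int String
    fresh base = do { n <- get; put (n + 1); return (base ++ show n) }

    -- cps e builds a term that expects a continuation and passes e's value to it.
    cps :: Expr -> State Int Expr
    cps (Var x) = do
      k <- fresh "k"
      return (Lam k (App (Var k) (Var x)))
    cps (Lam x body) = do
      k <- fresh "k"
      body' <- cps body
      return (Lam k (App (Var k) (Lam x body')))
    cps (App e1 e2) = do
      k <- fresh "k"
      f <- fresh "f"
      v <- fresh "v"
      e1' <- cps e1
      e2' <- cps e2
      -- evaluate the function, then the argument, then make the call
      return (Lam k (App e1' (Lam f
                     (App e2' (Lam v
                      (App (App (Var f) (Var v)) (Var k)))))))

    cpsTransform :: Expr -> Expr
    cpsTransform e = evalState (cps e) 0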

What evaluation order do we want? For Haskell it is call-by-need; for ML and most other languages it is call-by-value. Personally, though, I am attracted to optimal reduction: it makes repeated application logarithmic, and it also exposes some opportunities for fusion, deforestation, etc. The underlying representation, interaction nets, can support call-by-name, call-by-value, and call-by-need. The issues seem to be (1) blowup in the graph representation due to redundant fans and duplication, (2) handling recursion (e.g. let), (3) no efficient implementation of the 'bookkeeping' oracle which determines whether two fans should cancel or copy, and (4) graph representation / reduction overhead. I tried hacking on an optimal reduction implementation for a week; my conclusion is that it has a high learning curve and a high tool-support requirement, but that it's workable. Solving the bookkeeping overhead seems like a PhD topic though, so an initial prototype could just use closed reduction, which is still better than call-by-need.