Wednesday, December 19, 2018

Programming language notes

Typical compiler courses focus on syntax and a compilation pipeline, with small projects satisfying whatever toy problem the author had in mind. If we're lucky the pipeline will be structured and decomposed into multiple entry points for a REPL, but usually it'll just be a big blob.

The parser is the exception: we won't get a usable program without one, and LR parsing is pretty much all there is, since LL is a subset of LR. A more general category would be shift-reduce. There are some interesting non-deterministic parsers like GLL (2), GLR, Marpa, Viterbi, ALL(*), etc. The DMS Software Reengineering Toolkit claims to use GLR for pretty much all mainstream programming languages. All the courses I've seen use parser generators or hand-rolled recursive descent. The Marpa guy wrote a timeline which seems pretty accurate. He calls CYK "Sakai's Algorithm"; a paper shows the only time it outperforms Earley is on the grammar A -> a | AA, due to Earley's state explosion there. A small but significant subproblem is layout parsing (whitespace and so forth); this paper and related work seem relevant and efficient.
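To have something concrete to point at, here's roughly what hand-rolled recursive descent looks like, on a made-up toy expression grammar (nothing to do with the parsers above, just a sketch of the technique):

    # expr -> term (('+'|'-') term)*, term -> factor (('*'|'/') factor)*,
    # factor -> NUMBER | '(' expr ')'
    import re

    TOKEN = re.compile(r'\s*(?:(\d+)|(.))')

    def tokenize(src):
        for num, op in TOKEN.findall(src):
            yield ('NUM', int(num)) if num else ('OP', op)

    class Parser:
        def __init__(self, src):
            self.tokens = list(tokenize(src)) + [('EOF', None)]
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos]

        def eat(self, kind, value=None):
            k, v = self.tokens[self.pos]
            if k != kind or (value is not None and v != value):
                raise SyntaxError('expected %s %r, got %r' % (kind, value, (k, v)))
            self.pos += 1
            return v

        def expr(self):
            node = self.term()
            while self.peek() in (('OP', '+'), ('OP', '-')):
                op = self.eat('OP')
                node = (op, node, self.term())
            return node

        def term(self):
            node = self.factor()
            while self.peek() in (('OP', '*'), ('OP', '/')):
                op = self.eat('OP')
                node = (op, node, self.factor())
            return node

        def factor(self):
            if self.peek()[0] == 'NUM':
                return ('num', self.eat('NUM'))
            self.eat('OP', '(')
            node = self.expr()
            self.eat('OP', ')')
            return node

    print(Parser('1 + 2 * (3 - 4)').expr())
    # ('+', ('num', 1), ('*', ('num', 2), ('-', ('num', 3), ('num', 4))))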

Processing a raw AST by itself is hard work, so a compiler also has auxiliary data structures such as symbol tables, import graphs, and type annotations. This paper coins the term program object model (POM) for the object model representing a program in the compiler.
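The paper has its own object model; the sketch below is just my guess at the kind of thing a POM holds beyond the raw AST (the names Node, SymbolTable, and Module are invented, not from the paper):

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class Node:
        kind: str                                 # e.g. 'call', 'var', 'lit'
        children: List['Node'] = field(default_factory=list)
        type_annotation: Optional[str] = None     # filled in by the type checker

    @dataclass
    class SymbolTable:
        parent: Optional['SymbolTable'] = None
        symbols: Dict[str, Node] = field(default_factory=dict)

        def lookup(self, name):
            scope = self
            while scope is not None:
                if name in scope.symbols:
                    return scope.symbols[name]
                scope = scope.parent
            return None

    @dataclass
    class Module:
        name: str
        ast: Node
        globals: SymbolTable
        imports: List[str] = field(default_factory=list)   # edges of the import graph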

Left out of compiler courses are the standard features of any program: logging, options, error messages, monitoring and debugging.
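As a rough idea of the boilerplate involved, a driver skeleton with options, logging, and error reporting might look like this (mycc and compile_file are placeholders, not any real tool):

    import argparse, logging, sys

    def compile_file(path, optimize):
        logging.debug('compiling %s (optimize=%s)', path, optimize)
        # ... parse, check, generate code ...

    def main(argv=None):
        ap = argparse.ArgumentParser(prog='mycc')
        ap.add_argument('files', nargs='+')
        ap.add_argument('-O', '--optimize', action='store_true')
        ap.add_argument('-v', '--verbose', action='store_true')
        args = ap.parse_args(argv)
        logging.basicConfig(level=logging.DEBUG if args.verbose else logging.WARNING)
        status = 0
        for path in args.files:
            try:
                compile_file(path, args.optimize)
            except (OSError, SyntaxError) as e:
                logging.error('%s: %s', path, e)   # error messages go through logging too
                status = 1
        return status

    if __name__ == '__main__':
        sys.exit(main())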

For compilers in particular though, once a language breaks out of the few-month-throwaway death trap, the language itself pretty much fossilizes and then the standard library begins to evolve and grow. Most closely linked is the runtime system, which provides a loose collection of calling conventions, bytecode interpretation, and memory management. Another way to look at it is that interpreters are just compilers with really big run-time systems. In fact we can formalize this with the Futamura projections by introducing the notion of a "specializer".
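A toy illustration of the first projection: specializing the interpreter to a fixed program yields a "compiled" program. The specializer below is deliberately trivial (it just closes over the program); a real partial evaluator would also unfold the dispatch loop away.

    def interpret(program, inp):
        """Interpreter for a tiny stack language: a list of ('push', n) / ('add',)."""
        stack = [inp]
        for instr in program:
            if instr[0] == 'push':
                stack.append(instr[1])
            elif instr[0] == 'add':
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
        return stack[-1]

    def specialize(f, static_arg):
        """A (trivial) specializer: fix f's first argument."""
        return lambda dynamic_arg: f(static_arg, dynamic_arg)

    add_three = [('push', 1), ('add',), ('push', 2), ('add',)]

    # First Futamura projection: target = specialize(interpreter, source_program)
    compiled = specialize(interpret, add_three)
    print(compiled(10))   # 13, same as interpret(add_three, 10)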

Concurrency is interesting, and in today's multi-core world probably a necessity. Task graphs are definitely something to keep in mind, at a coarse-grained level. The other kind of concurrency is basically databases: when data isn't just being processed but has to be shared, which is where STM shines. Both of these usually end up in the runtime system.
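A minimal sketch of coarse-grained task-graph execution on a thread pool; a task is submitted as soon as all of its dependencies have finished (the example tasks at the bottom are made up):

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def run_graph(tasks, deps, max_workers=4):
        """tasks: name -> zero-argument callable; deps: name -> prerequisite names."""
        done, results, running = set(), {}, {}
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while len(done) < len(tasks):
                # submit every task whose dependencies are all finished
                for name, fn in tasks.items():
                    if name not in done and name not in running \
                            and all(d in done for d in deps.get(name, ())):
                        running[name] = pool.submit(fn)
                if not running:
                    raise ValueError('dependency cycle or unknown task')
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for name, fut in list(running.items()):
                    if fut in finished:
                        results[name] = fut.result()
                        done.add(name)
                        del running[name]
        return results

    print(run_graph(
        {'lex': lambda: 'tokens', 'parse': lambda: 'ast',
         'typecheck': lambda: 'ok', 'codegen': lambda: 'obj'},
        {'parse': ['lex'], 'typecheck': ['parse'], 'codegen': ['typecheck']}))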

A key component at this stage is motivation. It's not like anyone's paying for programming language design. The value of the top programming languages lies in their extensive test suites, documentation, and performance optimizations, which are basically self-fulfilling prophecies: the more people use a language, the more its tests/docs/speed improve, rinse and repeat.

It would be nice if machine learning could just take test suites and produce a compiler. Test suites are easy to write and modify; refactoring is as simple as overwriting the expected results with what the compiler says; and test cases are small and self-contained, so the amount of context/experience needed is small as well.
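The kind of test suite I have in mind is golden tests, where "refactoring" is just re-recording the compiler's output. A sketch of a runner (the file layout and --update flag are my own convention, not any established tool):

    import pathlib, subprocess, sys

    def run_golden(compiler, test_dir, update=False):
        failures = 0
        for src in sorted(pathlib.Path(test_dir).glob('*.src')):
            expected_file = src.with_suffix('.expected')
            actual = subprocess.run([compiler, str(src)],
                                    capture_output=True, text=True).stdout
            if update:
                expected_file.write_text(actual)     # re-record the expected output
            elif not expected_file.exists() or expected_file.read_text() != actual:
                print('FAIL', src)
                failures += 1
        return failures

    if __name__ == '__main__':
        sys.exit(run_golden(sys.argv[1], sys.argv[2], update='--update' in sys.argv))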

Repository sizes follow a log-normal distribution, but the mean depends on which programming language is used; Java programs are bigger.

Documenting decisions is a key point in software development. Without documentation, there's no way to go back and revisit the decision except by extensive refactoring and commit log inspection.

Tuesday, December 11, 2018

The state of America

According to Peter Turchin, the nation is collapsing, or at least becoming more chaotic. But is it collapsing entirely, or just in a few states?

Who knows. From maps it looks like the south has major problems, as do New Mexico and Michigan, but this doesn't really seem like enough data.

A better question is where the money is. During the Civil War, profiteering by businesses skyrocketed as traditional governmental and societal norms broke down. Corporations and businessmen could basically charge whatever prices they wanted, and the government paid for it with the spoils of war. So in the coming crisis, who stands to profit most? Companies like Facebook, Google, and Twitter are going to broadcast the revolution, no question. Due to the First Amendment and settled case law they are unlikely to be stopped. But they probably won't make much money, since their business models rely on advertising rather than direct supply.

Civil wars in other countries have funded arms manufacturers and mercenaries. Production of handguns in the U.S. has reached record levels but has flattened out again. Guns are pretty much saturated in America though, with an estimated 1.2 guns per person. Even if they weren't, the present rate of gun sales, roughly 1 per 10 people per year, means 50% will own a gun by the time the predicted 2020-2021 crisis peak rolls around. And more advanced weapons like missiles or nukes are unlikely to be accessible. Typical protest accessories like bandanas, smoke grenades, tear gas, water trucks, and riot shields are likely to sell well though. Cameras and phones will continue to sell, although the market for phones is likely saturated.

Sunday, December 2, 2018

Design for the build system

Inspired by buildsome, the awesome build system, we can outline what a new build system would look like. The main technology is FUSE, as suggested in this issue, because it's the only reasonable way to globally hook filesystem access. We mount a basic pass-through FUSE file system over / and then just run whatever commands we want. FUSE traps all requests, and fuse_req_ctx or fuse_get_context gives us the pid of the requesting process, which through /proc gives us more useful information: the process start time (pid + start time is unique), command line (/proc/PID/cmdline), environment (/proc/PID/environ), and working directory (/proc/PID/cwd - this last is only valid before the process changes it, but the first thing any program does is read its executable files, so we can trap it early). We can also get the parent process id, to recover the tree of subprocesses.
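buildsome itself is written in Haskell, so none of this is its code; below is a minimal sketch of the pass-through-plus-/proc idea using the fusepy library (an assumption on my part), mounted over a subtree rather than / for simplicity:

    import os, sys
    from fuse import FUSE, Operations, fuse_get_context

    class TracingPassthrough(Operations):
        def __init__(self, root):
            self.root = root

        def _real(self, path):
            return os.path.join(self.root, path.lstrip('/'))

        def _record_access(self, path):
            uid, gid, pid = fuse_get_context()
            try:
                with open('/proc/%d/cmdline' % pid, 'rb') as f:
                    cmdline = f.read().split(b'\0')
                cwd = os.readlink('/proc/%d/cwd' % pid)
                with open('/proc/%d/stat' % pid) as f:
                    fields = f.read().rsplit(')', 1)[1].split()
                ppid, start_time = fields[1], fields[19]   # stat fields 4 and 22
            except OSError:
                cmdline = cwd = ppid = start_time = None
            print('access', path, 'pid', pid, 'ppid', ppid,
                  'start', start_time, cmdline, cwd)

        def getattr(self, path, fh=None):
            st = os.lstat(self._real(path))
            return {k: getattr(st, k) for k in (
                'st_mode', 'st_size', 'st_uid', 'st_gid',
                'st_atime', 'st_mtime', 'st_ctime', 'st_nlink')}

        def readdir(self, path, fh):
            return ['.', '..'] + os.listdir(self._real(path))

        def open(self, path, flags):
            self._record_access(path)
            return os.open(self._real(path), flags)

        def read(self, path, size, offset, fh):
            return os.pread(fh, size, offset)

        def release(self, path, fh):
            return os.close(fh)

    if __name__ == '__main__':
        # e.g. python trace_fs.py /real/tree /mnt/traced
        FUSE(TracingPassthrough(sys.argv[1]), sys.argv[2], foreground=True)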

But just running commands isn't enough; we also have to handle rebuilds. When re-executing a process we must first roll back all the changes it and its sub-processes made, in order to ensure a clean build. To ensure an isolated build, we must also hide all the changes made by other processes that ran after the re-executed process. These two steps should suffice for most purposes to enable deterministic builds, but of course more work can be done.
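One way to picture the rollback bookkeeping, assuming the FUSE layer above reports writes to us (the WriteJournal class is invented, and where the write events come from is elided):

    import os, shutil

    class WriteJournal:
        def __init__(self, journal_dir):
            self.journal_dir = journal_dir
            self.saved = {}          # path -> saved copy (or None if it didn't exist)

        def before_write(self, path):
            """Call from the FUSE layer before the first write to `path`."""
            if path in self.saved:
                return
            if os.path.exists(path):
                backup = os.path.join(self.journal_dir, str(len(self.saved)))
                shutil.copy2(path, backup)
                self.saved[path] = backup
            else:
                self.saved[path] = None

        def rollback(self):
            """Undo every recorded write, restoring the pre-build state."""
            for path, backup in self.saved.items():
                if backup is None:
                    if os.path.exists(path):
                        os.remove(path)
                else:
                    shutil.copy2(backup, path)
            self.saved.clear()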

Next we introduce caching; as opposed to rebuilding, caching means we can skip the build in the first place by downloading the pre-built result instead. For this we need a content-addressed store for all of the build files etc., plus metadata on which commands use which files. We also need a special tool to run commands with, one that checks whether the command is cached before running it.
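A sketch of what that cache-aware wrapper might look like; in the real system the input and output files would be discovered through the FUSE layer rather than passed in, and the on-disk layout here is invented:

    import hashlib, json, os, shutil, subprocess

    STORE = os.path.expanduser('~/.cache/mybuild')

    def file_hash(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def cache_key(cmd, input_files):
        # key = hash of the command plus the hashes of the files it reads
        h = hashlib.sha256(json.dumps(cmd).encode())
        for path in sorted(input_files):
            h.update(file_hash(path).encode())
        return h.hexdigest()

    def run_cached(cmd, input_files, output_files):
        key = cache_key(cmd, input_files)
        entry = os.path.join(STORE, key)
        if os.path.isdir(entry):                   # cache hit: copy outputs back
            for out in output_files:
                shutil.copy2(os.path.join(entry, os.path.basename(out)), out)
            return
        subprocess.check_call(cmd)                 # cache miss: run and record
        os.makedirs(entry, exist_ok=True)
        for out in output_files:
            shutil.copy2(out, os.path.join(entry, os.path.basename(out)))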

Finally we have to solve the problem of dependency hell. To summarize, this is the situation where you have a library that gets updated, but an application that depends on the library breaks on the update. The easiest solution is to do as Nix does and install libraries into their own hashed directories, so that the new library goes into a new directory while the old directory sticks around and the application keeps using it. But this breaks compatibility with a lot of software, because the directory paths have to be hard-coded into the application somehow. A better solution is to use filesystem traps: a process that accesses the application's files gets a special label applied to it, so that when it later accesses the library, the label makes it see the old version of the library rather than the new one. This might(?) be implementable as another FUSE filesystem, but an easier method is to just replace all executables with setcap CAP_SYS_ADMIN wrappers that mount an overlayfs on top and then exec the process.
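A very rough sketch of that wrapper, assuming it runs with CAP_SYS_ADMIN and that the overlay directories already exist; all the paths below are hypothetical placeholders:

    import ctypes, os, subprocess, sys

    CLONE_NEWNS = 0x00020000
    libc = ctypes.CDLL('libc.so.6', use_errno=True)

    def main():
        # Enter a private mount namespace so the overlay is visible only to
        # this process tree, not the whole system.
        if libc.unshare(CLONE_NEWNS) != 0:
            raise OSError(ctypes.get_errno(), 'unshare failed')
        subprocess.check_call(['mount', '--make-rprivate', '/'])
        # Shadow the system library directory: files in upperdir (the pinned old
        # library version) take precedence over the current /usr/lib in lowerdir.
        subprocess.check_call([
            'mount', '-t', 'overlay', 'overlay',
            '-o', 'lowerdir=/usr/lib,upperdir=/opt/oldlibs/upper,workdir=/opt/oldlibs/work',
            '/usr/lib'])
        # Finally exec the real application binary with the original arguments.
        os.execv('/usr/libexec/myapp.real', ['/usr/libexec/myapp.real'] + sys.argv[1:])

    if __name__ == '__main__':
        main()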

Monday, February 5, 2018

Nix pros and cons

Pros
  • Sandboxed builds
  • Cross-compile support
  • Run-time dependency information via hash-scanning
    • not unique, e.g. NuTyX (cards) scans dependencies in the ELF header.
  • Multiple software versions don't conflict 
  • Non-destructive updates
  • Automated builds/tests with Hydra
  • Single config file for NixOS
Cons
  • Non-FHS layout
  • Repeated dependency information
  • Slow to evaluate
  • Manual version bumps, cluttering VCS