Using a Markov name generator fed with GitHub names, esolang names and jokes, and Wikipedia names, I came up with the name Stroscot for my next programming language. I have created a GitHub repo and a subreddit for it.
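For the curious, the generator was nothing fancy; here is a minimal sketch of that kind of character-level Markov chain in Python, with a tiny made-up seed corpus standing in for the real name lists:

```python
import random
from collections import defaultdict

def train(names, order=2):
    """Map each length-`order` prefix to the characters observed after it."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2, max_len=12):
    state, out = "^" * order, ""
    while len(out) < max_len:
        ch = random.choice(model[state])
        if ch == "$":          # end-of-name marker
            break
        out += ch
        state = state[1:] + ch
    return out.capitalize()

# hypothetical seed corpus; the real one was GitHub, esolang, and Wikipedia names
corpus = ["stratos", "scotch", "crosstalk", "prolog", "cobol", "forth"]
model = train(corpus)
print(generate(model))
```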
Any design for a language should first take into consideration the fate of ALGOL 68. Years of committee meetings resulted in an incomprehensible and almost unusable specification, defining numerous overspecialized features in exacting detail, often in dense formalism. The goal of a language is to assist the programmer in "the reliable creation of sophisticated programs". Hence, the three tests for a language are: Can you write the program? Can you write the program quickly? Can you write the program correctly?
Declarative programming is basically dead, besides SQL and maybe Prolog; even SQL doesn't look that declarative anymore, it's basically functional programming in disguise, just with a query planner bolted on. Prolog is a bit more interesting, but at this point it's basically synonymous with depth-first SLD resolution, and everybody who writes Prolog assumes those are the execution semantics; hence, it's just an interesting syntax for imperative programming with some backtracking capabilities thrown in. Anyway, the conclusion is that since "declarative programming" is defined as "anything that's not imperative", and declarative programming is dead, Stroscot will be an imperative programming language.
As far as types go, I'm still undecided. I definitely want to implement a lot of static analysis passes, but maybe types are a decoy; type inference is basically a limited form of dataflow and liveness analysis. Gradual typing and inferred types are ideas floating around, though.
In the medium term, generating reasonably efficient compiled code is a goal. Doing assembly directly is a mess, although Intel maintains this x86 instruction library in C. StoneKnifeForth is a good start along those lines; Potion is similar, with a hard-coded JIT. There are also mid-level JIT libraries. Zed Shaw wrote a little prototype, EaRing, on top of GNU Lightning (slides); GNU Lightning is basically dead and really crufty, though, so a new one should use libjit or something (recommended here). The most recent thing is probably WebAssembly, but I'm not sure how useful targeting that would be. GHC and friends are experimenting with using LLVM; LLVM was designed with C++ in mind, though, so it's not really flexible enough at the assembly level. But Kavon Farvardin is working on that.
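For a sense of how little machinery a hard-coded JIT needs, here's a minimal sketch in Python (Linux/x86-64 only, unrelated to any of the libraries above): write machine-code bytes into an executable buffer and call them through ctypes.

```python
import ctypes, mmap

# mov eax, 42 ; ret  (x86-64 System V: the return value goes in eax)
code = bytes([0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3])

# anonymous read/write/execute mapping (W^X policies may forbid this)
buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(code)

# treat the buffer's address as a C function returning int
fn = ctypes.CFUNCTYPE(ctypes.c_int)(
    ctypes.addressof(ctypes.c_char.from_buffer(buf)))
print(fn())  # 42
```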
The other big issue with compilation is memory management; implementing garbage collectors is a pain, and I haven't really researched them. So for now I'll focus on just an interpreter, since then I can use the host language's GC. Alternatively, this micro virtual machines project looks interesting; at the very least they have a Rust implementation of an Immix GC. Skipping GC and doing a memory-safe language is also a possibility, but borrow checking seems hard. This guy spent a year on "Gradual Memory Management" and can still barely compile a simple C-like language, although he is working on a harder problem; Rust can't even do a doubly-linked list without going into unsafe code or adding reference-counting overhead, which is basically doing GC yourself. Still another possibility is to just use malloc and free and the various specialized allocators like normal C programmers do, and then maybe add static or dynamic analysis to detect errors. Tracing GC is invasive, so supporting explicitly managed and GC-managed memory in the same system is still mostly an open research problem. But explicit is efficient: see this paper on "Destination Passing Style" memory management, where their optimizer produces the fastest C code.
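As a toy illustration of the destination-passing idea (in Python only for the sake of a sketch; the payoff is really in C-like code): the caller owns the allocation and the callee writes into it, so a fused pipeline can reuse one buffer instead of allocating at every stage.

```python
# Direct style: every stage allocates a fresh result list.
def doubled(xs):
    return [2 * x for x in xs]

# Destination-passing style: the caller provides the output buffer.
def doubled_dps(xs, out):
    for i, x in enumerate(xs):
        out[i] = 2 * x

data = [1, 2, 3, 4]
buffer = [0] * len(data)   # one allocation, owned by the caller
doubled_dps(data, buffer)
print(buffer)  # [2, 4, 6, 8]
```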
Henri Tuhola describes compiling via CPS as "insanely effective", which seems hard to argue with. There's a CPS-to-SSA pass here. According to this mail, SSA uses a DAG to represent scoping while CPS uses a tree plus parameter passing, hence CSE is harder in CPS. Some constructs in CPS can't be translated to SSA, though, mostly ones generated by callcc and related functions. The other supposed problems with CPS are that it fixes the evaluation order and that the transformed terms are harder to reason about or apply rewrite rules to. This paper says that CPS improves binding-time analysis and flow analysis by exposing more static calls. But again, CPS doesn't work well with LLVM.
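For reference, the transform itself is simple; a small Python sketch where every call takes an explicit continuation k instead of returning:

```python
# Direct style
def fact(n):
    return 1 if n == 0 else n * fact(n - 1)

# CPS: the "rest of the computation" is passed in as the continuation k,
# so every intermediate result is named and control flow is explicit.
def fact_cps(n, k):
    if n == 0:
        return k(1)
    return fact_cps(n - 1, lambda r: k(n * r))

print(fact(5))                    # 120
print(fact_cps(5, lambda x: x))   # 120, identity continuation at the top
```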
What evaluation order do we want? For Haskell it's call-by-need; for ML and most other languages it's call-by-value. Personally, though, I'm attracted to optimal reduction; it optimizes repeated application to be logarithmic, and also exposes some opportunities for fusion, deforestation, etc. And the underlying representation, interaction nets, can support call-by-name, call-by-value, and call-by-need. The issues seem to be (1) blowup in the graph representation due to redundant fans and duplication, (2) handling recursion (e.g. let), (3) no efficient implementation of the 'bookkeeping' oracle that determines whether two fans should cancel or copy, and (4) graph representation/reduction overhead. I tried hacking on an optimal reduction implementation for a week, and my conclusion is that it has a high learning curve and heavy tooling requirements, but that it's workable. Solving the bookkeeping overhead looks like a PhD topic, though, so an initial prototype could just use closed reduction, which is still better than call-by-need.
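As a baseline for comparison, the difference between call-by-name and call-by-need is just memoization of the suspended argument; a quick Python sketch with a hypothetical Thunk class (nothing to do with interaction nets):

```python
class Thunk:
    """A suspended computation, evaluated at most once (call-by-need)."""
    def __init__(self, compute):
        self.compute = compute
        self.forced = False
        self.value = None

    def force(self):
        if not self.forced:
            self.value = self.compute()  # call-by-name would re-run this every time
            self.compute = None          # drop the closure so it can be collected
            self.forced = True
        return self.value

def expensive():
    print("evaluating...")
    return 42

x = Thunk(expensive)
print(x.force())  # prints "evaluating..." then 42
print(x.force())  # prints 42 only; the work is shared
```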
Wednesday, January 2, 2019
Wednesday, December 19, 2018
Programming language notes
Typical compiler courses focus on syntax and a compilation pipeline: small projects satisfying whatever toy problem the author had in mind. If we're lucky the pipeline is structured and decomposed, with multiple entry points for a REPL, but usually it's just one big blob.
As for the parser: we won't get a usable program without one, and LR parsing is pretty much all there is, since LL is a subset of LR. A more general category would be shift-reduce. There are some interesting non-deterministic parsers like GLL (2), GLR, Marpa, Viterbi, ALL(*), etc. The DMS Software Reengineering Toolkit claims to use GLR for pretty much all mainstream programming languages. All the courses I've seen use parser generators or hand-rolled recursive descent. The Marpa guy wrote a timeline which seems pretty accurate. He calls CYK "Sakai's Algorithm"; a paper shows the only time it outperforms Earley is with an A -> a | AA grammar, due to state explosion. A small but significant subproblem is layout parsing (whitespace and so forth); this paper and related work seems relevant and efficient.
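For reference, CYK itself fits in a few lines; here is a sketch of a recognizer in Python, hard-coded to that one-nonterminal grammar A -> a | AA:

```python
def cyk_recognize(s):
    """Recognizer for the CNF grammar A -> 'a' | A A, start symbol A."""
    n = len(s)
    if n == 0:
        return False
    # table[i][j] is True if A derives the substring s[i : i + j + 1]
    table = [[False] * n for _ in range(n)]
    for i, c in enumerate(s):
        table[i][0] = (c == 'a')
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                if table[i][split - 1] and table[i + split][length - split - 1]:
                    table[i][length - 1] = True
                    break
    return table[0][n - 1]

print(cyk_recognize("aaaa"))  # True
print(cyk_recognize("ab"))    # False
```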
Processing a raw AST by itself is hard work, so a compiler also has auxiliary data structures such as symbol tables, import graphs, and type annotations. This paper coins the term program object model (POM) for the object model representing a program in the compiler.
Left out of compiler courses are the standard features of any program: logging, options, error messages, monitoring and debugging.
For compilers in particular, though, once a language breaks out of the few-month-throwaway death trap, the language itself pretty much fossilizes and then the standard library begins to evolve and grow. Most closely linked is the runtime system, which specifies a loose collection of calling conventions, bytecode interpretation, and memory management. Another way to look at it is that interpreters are just compilers with really big runtime systems. In fact we can formalize this with the Futamura projections by introducing the notion of a "specializer".
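To make the projections concrete, here is a toy Python sketch; the "specializer" just curries its static argument (a real partial evaluator would unfold the interpreter loop and emit actual code), and the little stack language is made up for the demo:

```python
def specialize(f, static):
    """Trivial specializer ("mix"): fix the static argument of f."""
    return lambda dynamic: f(static, dynamic)

def interpret(program, stack):
    """Tiny interpreter for a made-up stack language."""
    stack = list(stack)
    for op in program:
        if op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "dup":
            stack.append(stack[-1])
        else:                      # anything else is a literal to push
            stack.append(op)
    return stack

source = ["dup", "add"]            # doubles the top of the stack

# First projection: specializing the interpreter to a source program
# yields a "compiled" program (here, just a closure over it).
target = specialize(interpret, source)
print(target([21]))                # [42]

# Second projection: specializing the specializer to the interpreter
# yields a "compiler"; the third would yield a compiler generator.
compiler = specialize(specialize, interpret)
print(compiler(source)([21]))      # [42]
```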
Concurrency is interesting, and in today's multi-core world probably a necessity. Task graphs are definitely something to keep in mind, at a coarse-grained level. The other kind of concurrency is basically databases, where data isn't just being processed but has to be shared; that's where STM shines. Both of these usually end up in the runtime system.
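For instance, a coarse-grained task graph can be run directly on a thread pool; a minimal sketch using Python's graphlib (3.9+) and concurrent.futures, with made-up build-ish task names:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter
import time

# task -> set of tasks it depends on (hypothetical tasks)
graph = {
    "link":      {"compile_a", "compile_b"},
    "compile_a": {"generate"},
    "compile_b": {"generate"},
    "generate":  set(),
}

def run(task):
    print("start", task)
    time.sleep(0.1)        # stand-in for real work
    print("done ", task)

ts = TopologicalSorter(graph)
ts.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    running = {}
    while ts.is_active():
        # schedule every task whose dependencies have finished
        for task in ts.get_ready():
            running[pool.submit(run, task)] = task
        # wait for at least one running task, then mark it done
        finished, _ = wait(running, return_when=FIRST_COMPLETED)
        for fut in finished:
            ts.done(running.pop(fut))
```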
A key component at this stage is motivation. It's not like anyone's paying for programming language design. The value of the top programming languages lies in their extensive test suites, documentation, and performance optimizations, which are basically self-fulfilling prophecies: the more people use a language, the more the tests/docs/speed improve, rinse and repeat.
It would be nice if machine learning could just take test suites and produce a compiler. Test suites are easy to write and modify, refactoring is as simple as overwriting the expected results with what the compiler says, and test cases are small and self-contained so the amount of context/experience needed is small as well.
Repository sizes follow a log-normal distribution, but the mean size depends on which programming language is used; Java programs are bigger.
Documenting decisions is a key point in software development. Without documentation, there's no way to go back and revisit the decision except by extensive refactoring and commit log inspection.
Tuesday, December 11, 2018
The state of America
According to Peter Turchin, the nation is collapsing, or at least becoming more chaotic. But is it collapsing entirely, or just in a few states?
Who knows. From maps it looks like the South has major problems, as do New Mexico and Michigan, but this doesn't really seem like enough data.
A better question is where the money is. During the Civil War, profiteering by businesses skyrocketed as traditional governmental and societal norms broke down. Corporations and businessmen could basically charge whatever prices they wanted, and the government paid for it with the spoils of war. So in the coming crisis, who stands to profit most? Companies like Facebook, Google, and Twitter are going to broadcast the revolution, no question. Due to the First Amendment and settled case law they are unlikely to be stopped. But they probably won't make much money, since their business models rely on advertising rather than direct supply.
Civil wars in other countries have funded arms manufacturers and mercenaries. U.S. handgun production has reached record levels but has flattened out again. Guns are pretty much saturated in America, though, with an estimated 1.2 guns per person. Even if they weren't, the present sales rate of roughly 1 gun per 10 people per year means 50% will own a gun by the time the predicted 2020-2021 crisis peak rolls around. More advanced weapons like missiles or nukes are unlikely to be accessible. Typical protest accessories like bandanas, smoke grenades, tear gas, water trucks, and riot shields are likely to sell well, though. Cameras and phones will continue to sell, although the phone market is likely saturated.
Sunday, December 2, 2018
Design for the build system
Inspired by buildsome, the awesome build system, we can outline what a new build system would look like. The main technology is FUSE, as suggested in this issue, because it's the only reasonable way to globally hook filesystem access. We mount a basic pass-through FUSE filesystem over / and then just run whatever commands we want. FUSE traps all requests, and we can get the pid of the calling process using fuse_req_ctx or fuse_get_context, which via /proc gives us more useful information like process start time (pid + start time is unique), process command line (/proc/PID/cmdline), process environment (/proc/PID/environ), and process working directory (/proc/PID/cwd; this last is only valid before the process changes it, but the first thing any program does is read its executable file, so we can trap it early). We can also get the parent process id, to recover the tree of subprocesses.
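As a proof of concept, here is a minimal sketch using the fusepy package: a passthrough filesystem mounted over a scratch directory (not /) that logs which process opened which file by combining fuse_get_context() with /proc. A real build tracer would hook many more operations (stat, write, readdir, failed lookups, etc.):

```python
#!/usr/bin/env python3
# usage: tracefs.py <source-dir> <mountpoint>
import os, sys
from fuse import FUSE, Operations, fuse_get_context  # pip install fusepy

class TracingFS(Operations):
    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path.lstrip('/'))

    def _log(self, op, path):
        uid, gid, pid = fuse_get_context()
        try:
            with open('/proc/%d/cmdline' % pid, 'rb') as f:
                cmd = f.read().replace(b'\0', b' ').decode(errors='replace').strip()
        except OSError:
            cmd = '<unknown>'
        print('%s %s by pid %d (%s)' % (op, path, pid, cmd))

    def getattr(self, path, fh=None):
        st = os.lstat(self._full(path))
        return {k: getattr(st, k) for k in
                ('st_mode', 'st_nlink', 'st_size', 'st_uid', 'st_gid',
                 'st_atime', 'st_mtime', 'st_ctime')}

    def readdir(self, path, fh):
        return ['.', '..'] + os.listdir(self._full(path))

    def open(self, path, flags):
        self._log('open', path)
        return os.open(self._full(path), flags)

    def read(self, path, size, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def release(self, path, fh):
        return os.close(fh)

if __name__ == '__main__':
    FUSE(TracingFS(sys.argv[1]), sys.argv[2], foreground=True, nothreads=True)
```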
But just running commands isn't enough, we also have to handle rebuilds. When re-executing a process we must first roll back all the changes it and its sub-processes made, in order to ensure a clean build. To ensure an isolated build, we must also hide all the changes that were made by other processes that run after the re-executed process. These two steps should suffice for most purposes to enable deterministic builds, but of course more work can be done.
Next we introduce caching; as opposed to rebuilding, caching means that we can skip the build in the first place by downloading the pre-built result instead. For this we need a content-addressed store for all of the build files etc. and to store the metadata on which commands use which files. Also we need a special tool to run commands with, which checks if the command is cached before running it.
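A minimal sketch of that cache-checking tool, with a hypothetical cache directory and the input files listed by hand; in the real design the input set would come from the FUSE trace rather than being declared:

```python
import hashlib, json, os, subprocess

CACHE = os.path.expanduser("~/.cache/mybuild")   # hypothetical cache location

def command_key(argv, input_files):
    """Content-address a command by its argv and the contents of its inputs."""
    h = hashlib.sha256()
    h.update(json.dumps(argv).encode())
    for path in sorted(input_files):
        h.update(path.encode())
        with open(path, "rb") as f:
            h.update(hashlib.sha256(f.read()).digest())
    return h.hexdigest()

def run_cached(argv, input_files, output_file):
    key = command_key(argv, input_files)
    cached = os.path.join(CACHE, key)
    if os.path.exists(cached):
        # cache hit: skip the build and copy the pre-built result into place
        with open(cached, "rb") as src, open(output_file, "wb") as dst:
            dst.write(src.read())
        return
    subprocess.run(argv, check=True)
    os.makedirs(CACHE, exist_ok=True)
    with open(output_file, "rb") as src, open(cached, "wb") as dst:
        dst.write(src.read())

# e.g. run_cached(["gcc", "-c", "foo.c", "-o", "foo.o"], ["foo.c"], "foo.o")
```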
Finally we have to solve the problem of dependency hell. To summarize, this is the situation where a library gets updated and an application that depends on the library breaks on the update. The easiest solution is to do like Nix and install libraries into their own hashed directories, so that the new library goes in a new directory while the old directory sticks around and the application keeps using it. But this breaks compatibility with a lot of software, because the directory paths have to be hard-coded into the application somehow. A better solution is to use filesystem traps: a process that accesses the application's files has a special label applied to it, so that when it later accesses the library, the label makes it see the old version of the library rather than the new one. This might(?) be implementable as another FUSE filesystem, but an easier method is to just replace all executables with setcap CAP_SYS_ADMIN wrappers that mount an overlayfs on top and then exec the real process.
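A sketch of what such a wrapper might look like, with entirely hypothetical paths; it needs CAP_SYS_ADMIN (or root) for unshare and mount, matching the setcap idea above:

```python
#!/usr/bin/env python3
# Hypothetical wrapper sketch: run the real binary in a private mount
# namespace where an old snapshot of a library is overlaid on top of the
# system copy, so only this application sees the old version.
import os, sys

REAL_BIN = "/usr/lib/myapp/myapp.real"       # hypothetical relocated binary
OLD_LIB  = "/var/lib/snapshots/libfoo-1.2"   # hypothetical old library snapshot
LIB_DIR  = "/usr/lib/libfoo"                 # directory the app expects

# In overlayfs the first lowerdir listed is the topmost layer, so the old
# snapshot shadows the current files; with no upperdir the overlay is read-only.
script = (
    "mount -t overlay overlay -o lowerdir={old}:{lib} {lib} && "
    'exec {bin} "$@"'
).format(old=OLD_LIB, lib=LIB_DIR, bin=REAL_BIN)

os.execvp("unshare", ["unshare", "--mount", "sh", "-c", script, "wrapper"]
          + sys.argv[1:])
```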
Monday, February 5, 2018
Nix pros and cons
Pros
- Sandboxed builds
- Cross-compile support
- Run-time dependency information via hash-scanning
  - Not unique; e.g. NuTyX (cards) scans dependencies in the ELF header.
- Multiple software versions don't conflict
- Non-destructive updates
- Automated builds/tests with Hydra
- Single config file for NixOS
Cons
- Non-FHS layout
- Repeated dependency information
- Slow to evaluate
- Manual version bumps, cluttering VCS