Tuesday, October 31, 2017

Continuous delivery and distribution

Software development suffers from lag; every change must be made, then compiled, then tested in a few dozen ways, before the next change can be made. The faster this process completes, the faster development can move forward. Non-incremental development, where changes are made without testing or compiling, leads to slower development overall because of the numerous regressions that are introduced and have to be tracked down later.

Nix encourages incremental development by making it easy to download and use software; it creates a uniform interface for adding libraries and other dependencies, in the .nix file. But it also slows down development because the build process itself is non-incremental; it has to pull down every source file on every single build, which adds significant overhead (1.5 minutes for a simple Java app). Part of this is necessary overhead; in a cluster, a build file has to be replicated among the machines. And dependencies have to be expired every so often to ensure they don't get stale. Nix's insistence on knowing the hash of every source dependency before it's fetched means that you can't implement automatic updates without updating your Nix files automatically. But the whole point of Nix is to be a human-readable and human-editable description of your configuration; updating those files automatically turns them into just another intermediate file format.

A cardinal rule of source control is to never check generated files into the repository; Nixpkgs has been violating this rule, with the consequent ballooning of repository size. Hence the need to start over with a newer, cleaner distro: solve both the incremental-build problem, via a new package manager and build tool, and the repository problem, via a more principled approach to dependency management.

The place to start is a small prototype to prove that the idea works; I only program in Haskell these days. For monitoring file dependencies we can use hs-watchman, although it needs to be updated for 2017. For logging we can use gzipped JSON as the file format and the pipes-zlib library. For process tracing we can re-use the command infrastructure from Shake.
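
As a sketch of the logging piece, here is the gzipped-JSON format in miniature. It uses the plain zlib and aeson packages rather than pipes-zlib, just to show the shape; the LogEntry fields are made up for illustration.

    {-# LANGUAGE DeriveGeneric #-}
    module Log where

    import Codec.Compression.GZip (compress, decompress)
    import Data.Aeson (ToJSON, FromJSON, encode, decode)
    import qualified Data.ByteString.Lazy.Char8 as L
    import GHC.Generics (Generic)

    data LogEntry = LogEntry
      { timestamp :: Double
      , phase     :: String     -- e.g. "fetch", "compile", "link"
      , message   :: String
      } deriving (Show, Generic)

    instance ToJSON LogEntry
    instance FromJSON LogEntry

    -- One JSON object per line, gzipped on the way out.
    writeLog :: FilePath -> [LogEntry] -> IO ()
    writeLog path = L.writeFile path . compress . L.unlines . map encode

    readLog :: FilePath -> IO [LogEntry]
    readLog path = do
      bytes <- L.readFile path
      pure [ e | line <- L.lines (decompress bytes), Just e <- [decode line] ]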

Type system features

Subtyping and open data types

One interesting feature of type systems is extensibility; we have the typical object-oriented inheritance, but this is actually too narrow for a proper type system. We instead have the notion of an extended conversion, a function from the source type to the target type. Useful examples start with integer conversions, but the most interesting one is adding more constructors to a datatype: A | B is a subtype of A | B | C. We can also do it with product types, by removing elements: (A,B,C) is a subtype of (A,B), which is a subtype of (A). It makes more sense with labels and records, though: {foo: A, bar: B} is a subtype of {foo: A}. This is the closest it gets to inheritance in a type system. Implementations are few, unfortunately; it's primarily Stephen Dolan's work on BRICK (blog archived) and MLsub. BRICK is notable for supporting equirecursive and regular types, so that records do not need a newtype to destructure or bind, similar to dynamically typed languages such as Python, and for compiling to LLVM (no GC, unfortunately). MLsub is just a boring extension with lots of correctness proofs in the thesis and the coined terms "biunification" and "bisubstitution". Dolan basically tries to convince you that his subtyping algorithm is almost as simple to implement as Hindley-Milner, and honestly I agree.
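
Haskell has no subtyping, but the "conversion function" reading can be made concrete: each subtyping judgment above corresponds to an explicit coercion. The type and field names here are made up.

    -- A | B <: A | B | C : every value of the smaller sum embeds into the larger one.
    data AB  = A  | B
    data ABC = A' | B' | C'

    widenSum :: AB -> ABC
    widenSum A = A'
    widenSum B = B'

    -- {foo: A, bar: B} <: {foo: A} : a record with more fields coerces to one
    -- with fewer by forgetting the extras.
    data FooBar = FooBar { foo :: Int, bar :: Bool }
    newtype OnlyFoo = OnlyFoo { onlyFoo :: Int }

    forgetBar :: FooBar -> OnlyFoo
    forgetBar r = OnlyFoo (foo r)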

The most interesting part of subtyping is that it explicitly exposes the idea of polarization or duality; for example, a function arrow (A -> B) is contravariant in A and covariant in B. When we type check, we can divide the type variables into two classes: the pos-variables (lower bounds) and the neg-variables (upper bounds). In fact we usually need both; for mutable variables, arrays, etc. we need both in and out parameters (what can be stored vs. what can be retrieved). For example, the parameter type for a function that accepts a mutable array of Ints and only reads from it will be MutArray[store=⊥, retrieve=Int], where store is a lower bound and retrieve is an upper bound.
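
To make the in/out split concrete, here is a rough Haskell rendering: the write side of a mutable cell is contravariant (what may be stored), the read side covariant (what comes out). Sink, Source, and splitRef are made-up names; Contravariant comes from the contravariant package (or recent base).

    import Data.Functor.Contravariant (Contravariant (..))
    import Data.IORef (IORef, readIORef, writeIORef)

    -- The "out" half: values retrieved from the cell; covariant in a.
    newtype Source a = Source { retrieve :: IO a }
    instance Functor Source where
      fmap f (Source get) = Source (fmap f get)

    -- The "in" half: values stored into the cell; contravariant in a.
    newtype Sink a = Sink { store :: a -> IO () }
    instance Contravariant Sink where
      contramap f (Sink put) = Sink (put . f)

    splitRef :: IORef a -> (Sink a, Source a)
    splitRef ref = (Sink (writeIORef ref), Source (readIORef ref))

    -- A consumer that only reads needs just the covariant half, which is the
    -- MutArray[store=⊥, retrieve=Int] intuition above.
    addTwice :: Source Int -> IO Int
    addTwice src = (+) <$> retrieve src <*> retrieve src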

Dependent types

Once you have subtyping, what point is there in distinguishing values and types? Letting them be used almost interchangeably in any position requires some complicated compilation tricks, and an interactive interpreter, but Idris has shown that it works reasonably well in a practical programming language. But dependent types can be extended further.

Insanely dependent types let you define dependent pairs (Σ A (B a), written a ** B a in Idris) in terms of the dependent product (Π, (a : A) -> B a). There's some trickery involved in the implementation; see the source for details. The short answer is that you need extra scope to reference all the values, and a careful evaluation strategy to avoid infinite loops. The related idea of circular type signatures allows indexing datatypes by values of the same datatype; for example, a matrix indexed by a vector, while defining a vector as a single-dimension matrix of length n. (Their notion of matrix is n-dimensional, so really it's a tensor.) Also interesting from that reddit thread is the idea of making values (terms) a subtype/instance of their type, so that for example 1 is a subtype of Integer.
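
Haskell can approximate one direction of this today: index a tensor type by a type-level list of dimensions and recover vectors as the one-dimensional case. The genuinely circular version, where the index is itself a term-level vector (i.e. a one-dimensional matrix), is exactly what needs the machinery above. The names are made up, and note that the length n is not actually enforced here.

    {-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}
    module Tensor where

    import GHC.TypeLits (Nat)

    -- A tensor indexed by its list of dimensions.
    data Tensor (dims :: [Nat]) a where
      Scalar :: a -> Tensor '[] a
      Axis   :: [Tensor dims a] -> Tensor (n ': dims) a
        -- Nothing checks that the list really has length n; enforcing that
        -- is where the dependent machinery starts.

    -- A vector is then just a one-dimensional tensor.
    type Vector n a = Tensor '[n] a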

Richard Eisenberg is working on adding dependent types to Haskell, although only the ordinary variety, not the insane one.

Linear Types

We could borrow type constructors from Girard's linear logic: ! and ?, which respectively relax a type to be usable more than once and to be discarded without use. There remains the linear arrow ⊸. The usual intuitionistic arrow A -> B then desugars as !A ⊸ B. But this cannot be the final answer, because in IL we decompose A -> B as ¬(A & ¬B); functions do not really exist. Following this paper, it seems to make more sense to just add multiplicities to the constructors: 0, 1, and ω for erasable, linear, and ordinary types respectively. A note here mentions extending ω to having multiple types.
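
A sketch of what multiplicity-annotated arrows look like in practice, assuming a LinearTypes extension where %1 -> is the linear arrow and plain -> is the unrestricted (ω) arrow, roughly as in Tweag's proposal; the design was still in flux when this was written, so treat the syntax as an assumption.

    {-# LANGUAGE LinearTypes, GADTs #-}
    module LinearSketch where

    -- '%1 ->' is the linear arrow ⊸; plain '->' is the unrestricted (ω) arrow.
    swap :: (a, b) %1 -> (b, a)
    swap (x, y) = (y, x)            -- each component is consumed exactly once

    -- A wrapper whose field is unrestricted (GADT-syntax '->' means ω), so
    -- that 'Bang a %1 -> b' plays the role of !A ⊸ B, i.e. ordinary A -> B.
    data Bang a where
      Bang :: a -> Bang a

    twice :: Bang (a %1 -> a) %1 -> a %1 -> a
    twice (Bang f) x = f (f x)      -- f escapes linearity because it is wrapped in Bang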

We could also use monads, or their more general cousins, Arrows - but why?

Tweag says they're adding them to GHC, although it's hard to say.

Higher-order Types

This paper seems interesting and useful, and shouldn't be too hard to add.

Extended type classes

Haskell's type classes have been wildly successful; from an implementation perspective, they amount to adding a constraint-solving engine that discovers and inserts new terms. But I think type classes should be able to add new type information too; unfortunately my efforts in that direction have been unsuccessful so far. It seems like a simple idea though: the set of type class instances in scope is fixed, so if only one instance unifies with the types in scope then it must be the instance to use (otherwise the term must be resolved elsewhere). This is to be distinguished from GHC's current implementation, which has a more restrictive condition requiring matching instead of unification, where matching is defined to add no new information. The question is how to propagate the information from unification in the constraint solver back to the type checker at large; I don't really know enough GHC internals to diagnose the errors I ran into in the test suite.
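
A small illustration of the matching-versus-unification distinction; the class and instance are made up.

    class Serialize a where
      emit  :: a -> String
      blank :: a

    instance Serialize Bool where
      emit  = show
      blank = False

    -- example :: String
    -- example = emit blank
    --
    -- GHC rejects the definition above with "Ambiguous type variable a0":
    -- the wanted constraint 'Serialize a0' does not *match* the instance
    -- head 'Serialize Bool', because matching is not allowed to instantiate
    -- a0.  Under the unification rule sketched above, the solver could
    -- commit to a0 ~ Bool, the only instance in scope that unifies.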

Type-directed name resolution

One of the most annoying parts of Haskell is that record field names must be unique within a module, since they double as selector functions. This has been worked on here, but the problem also shows up with type classes (particularly RebindableSyntax) and in general with similarly-named or similarly-behaving functions. What would be ideal, IMO, is if Haskell supported ad-hoc overloading, so you could use a name as many times as you wanted and the compiler would disambiguate by type. This would also get rid of the 'import Data.Map as M, Data.Set as S' boilerplate. But this would require intermingling GHC's renaming and type-checking phases, so it's unlikely to happen.
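
For example, the qualified-import boilerplate in question; the ad-hoc overloading version at the end is hypothetical, not valid Haskell today.

    import qualified Data.Map as M
    import qualified Data.Set as S

    -- Both modules export 'insert', 'union', 'singleton', 'member', ..., so
    -- every call site needs a qualifier.
    addEdge :: Ord a => a -> a -> M.Map a (S.Set a) -> M.Map a (S.Set a)
    addEdge from to = M.insertWith S.union from (S.singleton to)

    -- With type-directed ad-hoc overloading, the unqualified names could be
    -- resolved from their types alone:
    --   addEdge from to = insertWith union from (singleton to)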

Sunday, October 29, 2017

Scope and closures


In functional programming, values are bound to names. But how are names resolved? The main strategies are lexical scope and dynamic scope. Stallman writes some sections on why Emacs Lisp has dynamic scope (it has since been extended to support lexical scope as well). But of course the two strategies are not mutually exclusive; Vesta supports both dynamic and static scoping, with dynamic as the fallback for static (PDF page 27). Dynamic scoping is limited to function blocks with ... in the argument list, but at least it's there. Such a feature would be quite useful in Nixpkgs, where each package name is written twice, first in the formal parameters and again in the actual dependencies, although a version of that syntax was considered and rejected for being too slow to execute. Vesta and Nix share much in common, though; Vesta's runtool is very similar to Nix's derivation, although Nix uses cryptographic hashes and Vesta uses virtual filesystems.

In more formal terms, we can classify name resolution along two dimensions: the semantically enclosing environment (the stack frames, i.e. dynamic scope) and the syntactically enclosing environment (the enclosing AST nodes, i.e. lexical scope). A filtering function then selects among the candidate environments and produces either an error or the proper resolution of the name.
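
A minimal sketch of that classification, with made-up names: a name is looked up against both chains of environments, and a strategy picks among the candidates, here Vesta-style with lexical scope preferred and dynamic scope as the fallback.

    import qualified Data.Map as Map

    type Name  = String
    type Env v = Map.Map Name v

    data Candidate v = Lexical v | Dynamic v

    -- Collect every binding of the name, innermost first, along both the
    -- syntactic (lexical) chain and the semantic (dynamic, call-stack) chain.
    candidates :: [Env v] -> [Env v] -> Name -> [Candidate v]
    candidates lexChain dynChain n =
      [ Lexical v | e <- lexChain, Just v <- [Map.lookup n e] ] ++
      [ Dynamic v | e <- dynChain, Just v <- [Map.lookup n e] ]

    -- Vesta-style filter: prefer lexical bindings, fall back to dynamic ones.
    resolve :: [Env v] -> [Env v] -> Name -> Either String v
    resolve lexChain dynChain n =
      case candidates lexChain dynChain n of
        Lexical v : _ -> Right v
        Dynamic v : _ -> Right v
        []            -> Left ("unbound name: " ++ n)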

Despite the name, dynamic scoping is not incompatible with static typing; a syntactic analysis of which names resolve where is easy, and whatever names remain unresolved can be resolved dynamically.

But the question of how environments/scopes are implemented is not trivial; a paper I read but lost lists many strategies, with choices such as heap-allocated vs stack-allocated, callee-saved vs caller-saved, and ordering of the frame layout. Given the choices, it seems foolish to lock oneself into a particular strategy, at least at a low level.

Saturday, October 28, 2017

A new build system

A survey of build systems, and their advantages:
  • Make - installed everywhere
  • Ninja - portable format
  • Shake - parallelism, unchanging files, monadic dependencies (see the sketch below)
  • Pluto - dynamic dependencies
  • Tup - fast rebuilds with a filesystem monitor
  • Nix - configuration management, isolated builds
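
To make "monadic dependencies" concrete, here is a minimal Shake rule in which the dependency list is itself computed during the build, something Make's static rules can't express; the file names are made up.

    import Development.Shake

    main :: IO ()
    main = shakeArgs shakeOptions $ do
      want ["all.o"]

      "all.o" %> \out -> do
        -- The list of objects to link is discovered at build time...
        srcs <- readFileLines "sources.txt"
        -- ...and only then registered as dependencies.
        need srcs
        cmd "ld" "-r" "-o" out srcs
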
There is no easily found build system that combines all these advantages. Why not? Probably the difficulty of switching projects over to new build systems; most people are happy to let sleeping dogs lie, even if in this case the dogs are really hissing snakes that keep biting and constricting developers.

With no existing build system that fits, the solution is to write a new one. Relevant xkcd:
https://xkcd.com/927/

But the problem is in fact that there is no standard for build systems, just people (companies) doing whatever they feel like. When a committee comes along and develops USB-C, Unicode, etc., in practice it seems that everyone does end up switching. In software there's an even stronger standardization process: the best open-source implementation usually ends up becoming the de-facto standard, just because everyone else is too lazy or too stupid to reimplement it. Look at LAMP and its history, or more recently WordPress.


The closest candidate to a de-facto standard for build systems is Make; it has a large user community. But it's missing all the features of the other systems. Can we extend Make? Probably not; look at the source, e.g. the dependencies file, main.c, and job.c. It's heavily process-based and GNU-centric, the deferred string processing and loading of commands makes everything really slow, and then of course it's written in C, which isn't exactly my language of choice. Also, it doesn't support parallelism on Windows (maybe?).

OTOH a new implementation maintaining backwards compatibility with Make is definitely possible - for example, Mozilla has been using PyMake since 2011. Python has been much more successful as a scripting language than GNU Make's built-in scripting language, Guile. And porting even a large project isn't that hard; the Tup guy got halfway there already. The hard part is changing over the tooling / automation / infrastructure that assumes the use of Make. But as Tup points out, Make is flawed in the way it's most commonly used (overly recursive, incomplete DAG) and also slow because of how it processes files. And it doesn't use a database; all the cool build systems have a database. So all in all, Make is dead for standards-making purposes; nobody uses it anymore in their new projects, except GNU. And autoconf can generate config files, so there's no need to replace the whole GNU toolchain; just replacing automake should be sufficient.

Requirements for a new tool in priority order, from 2007:
  1. Dependency Management (we have many many libraries)
    1. Search Paths 
    2. Variants
  2. Maintainability
  3. Cost of training for day-to-day tasks (40+ developers)
  4. Portability (but Linux is a must)
    1. Windows, MSYS
  5. Speed (A distant fifth; anything would be better than recursive make.)
But of course everyone has their own priorities. The only real priority is avoiding yet another Make DSL.