Saturday, May 11, 2019

Stroscot Devlog 1

A programming language has a lot of components: the parser, the operational semantics, the code generator...
For the code generator I think we'll just use LLVM, because it's the most mature option. It handles register spilling and a lot of the other low-level stuff we'd otherwise have to deal with. LLVM is used by Haskell and Manticore (the interesting CPS compiler I wanted to try), not to mention Zig, Rust, Jai, Julia, and Pure. And of course LLVM has Clang, so it's easy to interface with C and C++ code. So there are a lot of good reasons to use LLVM and not many projects competing with it. The issues seem to be that it changes frequently and that the optimizations are hit-or-miss depending on what kind of code you generate (although they can be disabled). If I were going to write an alternative I would use the Intel xed library, because I don't want to read all those huge Intel PDFs of instruction tables, but LLVM is like xed on steroids with cross-platform compatibility thrown in.
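As a minimal sketch of what "using LLVM" looks like from the compiler's side, here's IR generation through the llvmlite Python bindings; the binding choice and the toy `add` function are just stand-ins for whatever the real code generator would emit:

```python
from llvmlite import ir

# Build a module containing one function: i64 add(i64 a, i64 b)
module = ir.Module(name="demo")
i64 = ir.IntType(64)
fnty = ir.FunctionType(i64, [i64, i64])
fn = ir.Function(module, fnty, name="add")

block = fn.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
a, b = fn.args
builder.ret(builder.add(a, b))

# The textual IR; LLVM handles register allocation, spilling, etc. from here.
print(module)
```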

Another part of the project is the support tools: the documentation, the code highlighting, the pretty printer / source code formatter, the automated refactoring, and the hyperlinked source code browser / search interface.

That last one is implemented at Mozilla with the searchfox interface. Then there's Debian's code search, which uses the codesearch library. But for a small project grep probably suffices. I also hope these devlog notes will get turned into reference material somehow, like GHC's Notes and Commentary.

For documentation the main functionality I want is to include highlighted snippets of code for elucidating explanations, and then just a way to write normal HTML stuff for a web browser. I assume the web is the best way to convey documentation. I don't know about videos; I find they're generally hard to skim, so they're best suited for introductions. There are 2-minute videos for real introductions and then 10-minute videos which seem more like infomercials than useful video. And then you go to a talk that's an hour long or something; it's a complete waste of time unless you're the speaker.
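For the highlighted-snippets part, here's a sketch of the shape of it, using Pygments purely as an example of the kind of tool that turns a snippet into embeddable HTML (not a commitment to that library):

```python
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

snippet = "def hello():\n    print('hi')\n"

# noclasses=True inlines the styles so the fragment can be dropped
# straight into any page without a separate stylesheet.
formatter = HtmlFormatter(noclasses=True)
html = highlight(snippet, PythonLexer(), formatter)
print(html)   # a <div class="highlight"> ... </div> block
```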

Another big component is the standard library. And of course the standard library ends up being used by the compiler to write itself, so you have to design the library at least somewhat concurrently with the actual compiler. Most of the library writes itself or uses standard algorithms, but then there are intrinsics and primops. Primops are the most interesting part of the compiler, because they're the operations you supposedly can't implement, and yet somewhere they are implemented. Perhaps primop is a misleading name, because what they really are is calls to external functions written in other languages, so they're really "foreign function calls"; they're just called "foreign" because they're not written in the native programming language you're using right now. At this point, for GHC, primops are just LLVM IR with a special calling convention, so it's probably not necessary to do it the way GHC does.
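To make "foreign function call" concrete, here's the pattern in miniature with Python's ctypes calling strlen from the system C library; the only platform-specific assumption is the library lookup, which works on typical Unix systems:

```python
import ctypes
import ctypes.util

# Load the system C library (Unix; the name differs on other platforms).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the foreign function's signature so the call is checked on our side.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))   # 5
```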

Now, the best way to do an FFI is one thing I've been looking at a lot, because the better the FFI is, the easier your language is to use. You just say "Aha! I want to use this PL", you drop it into your project, you import all your existing files (which is where the work is, depending on how hard your FFI is to use), and then you write your programs in your new language. That's how Mozilla is translating Firefox from C++ to Rust: they're using the Rust FFI with C bindings. It's not very good, honestly. I think the best thing is to use the LLVM/Clang interface and talk to C++ directly. But C++ as a language has problems; it's huge. It has templates and gobbledygook, basically the name mangling for function overloading, and to write a good C++ FFI you have to match this interface somehow. So the only way to interface with C++, basically, is to write your own vtable and name-mangling support, or to use someone else's that's already implemented, and of course Clang is the obvious choice there.

So what I want to do is use the ROOT C++ interpreter Cling to compile snippets of code, and then gradually extend that to building a C++ AST ourselves. But right now I'm not sure how hard it will be; it might be impossible. There are some slides about how Cling builds up a global definition table and then compiles things over and over, but I'm not sure how that works for ahead-of-time compilation. I have to look at the code, because the slides are no help.
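For a feel of the snippet-compiling workflow, cppyy (which sits on top of Cling) already does roughly this from Python; I'm using it here only to illustrate the interaction model, not as the actual plan:

```python
import cppyy   # Python bindings built on top of Cling

# JIT-compile a C++ snippet into the running process...
cppyy.cppdef("""
int add(int a, int b) { return a + b; }
""")

# ...and call it immediately through the generated binding.
print(cppyy.gbl.add(2, 3))   # 5
```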

I still need to figure out the architecture for my project folder. I'm not sure it matters if you're doing everything with code search and just whizzing around the project with hyperlinks, but it would be good to figure out which modules can import each other and assign places for new functions that make sense. So it's a classification problem with a DAG requirement attached. Right now I have this docs/glue/library layout, and it's actually starting to be too much work. So maybe the most pressing priority now is for the documentation to be more navigable than just simple text files, so I need to build an HTML linking thing.
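The DAG requirement itself is easy to check mechanically. Here's a sketch with a hypothetical module layout, using Python's graphlib:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical module layout: each module lists the modules it may import.
allowed_imports = {
    "codegen": {"core", "ir"},
    "ir": {"core"},
    "parser": {"core"},
    "core": set(),
}

try:
    order = list(TopologicalSorter(allowed_imports).static_order())
    print("build order:", order)   # dependencies first, e.g. core before codegen
except CycleError as exc:
    print("import cycle detected:", exc.args[1])
```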

I also want to know how to make money with programming language design. Tabled for now, since the language has no users. I'm interested in Python machine learning and CUDA; the hard part is mixing C++, machine learning, and all that in one language. Ideally our FFI will support more than just C++, so that you can use programming languages from one place and just mix and match them and have this polyglot language that kind of patches things together.

Another option is to fake mixing syntax by using each language's syntax definition, but figuring out how to compose grammars is a hard problem. There are some papers to read. My basic theory now is that you do it with grammar rules. You just say: I have a grammar for the C language, I have a grammar for the Java language, I'll take the expression rule from one language and the other language and write a new expression rule that combines the syntax of both. Then there's a slightly tricky step where you tie the recursive references (the fixed point) back to the combined rule. And with scannerless parsing it actually works.
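A toy sketch of that rule-merging idea. The grammars here are made up and far too small, but they show the fixed-point tying: every production still names "expr", so recursion now lands on the merged rule:

```python
# Toy grammars: nonterminal -> list of alternative productions (tuples of symbols).
c_expr = {
    "expr": [("NUMBER",), ("expr", "+", "expr"), ("(", "expr", ")")],
}
java_expr = {
    "expr": [("NUMBER",), ("expr", "+", "expr"), ("new", "IDENT", "(", ")")],
}

def merge_rule(name, *grammars):
    """Union the alternatives for `name` across grammars, dropping duplicates.
    Every production refers to the nonterminal by name, so the recursive
    references (the fixed point) now resolve to the merged rule."""
    seen, merged = set(), []
    for g in grammars:
        for prod in g.get(name, []):
            if prod not in seen:
                seen.add(prod)
                merged.append(prod)
    return merged

combined = {"expr": merge_rule("expr", c_expr, java_expr)}
print(combined["expr"])
# [('NUMBER',), ('expr', '+', 'expr'), ('(', 'expr', ')'), ('new', 'IDENT', '(', ')')]
```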

On a digression, changing the fixed point in the code is refactoring. I want refactoring to be possible by just writing a wrapper around a method that redefines its semantics, and then you run a little compiler pass, like "expand this definition": it goes through and, without interpreting most of the semantics, fuses the functions enough that the old definition disappears and you get a new, prettier definition that does what you wanted. There's a debate over whether you want to refactor directly: sometimes you want to do the rewrite in place, but other times you want to include the definition somewhere else without modifying the original, i.e. it's clearer to write it as the refactoring itself than to run the refactoring over the code. Like when you're writing a parser incrementally as a tutorial: you might want to start with a naive slow algorithm and then gradually add more complex optimizations. But you don't want the optimizations to be in the code; you want them to be a separate layer on top of the code.
And so what we need is an optimization database, kind of like GHC interface files. Interface files are complicated and store a lot of data, but the data is all derived, so you need some kind of dependency management system. Eventually you want an entire package manager for a single programming language, one that manages the code in the files, but I don't know how that works. Dependency management is complicated, except it's not really that hard. There's a project called Adapton: you have a graph and two phases, invalidation, which propagates up, and evaluation, which propagates down, and there's an Adapton paper. The problem is when you get names and recursion and all that stuff; that's when you're like "oh, I'll just fuck it all" and you need an entire programming language just to handle incremental computations. Incremental computation is hard, so it's a little complicated to extend it to handle recursion.
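Here's the two-phase idea in miniature, a sketch of my reading of Adapton rather than the real library (no names or recursion, which is exactly the hard part):

```python
class Node:
    """One cell in the dependency graph: either an input or a computed thunk."""
    def __init__(self, compute=None, value=None):
        self.compute = compute        # function of dependency values, or None for inputs
        self.value = value
        self.deps = []                # nodes this one reads
        self.dependents = []          # nodes that read this one
        self.dirty = compute is not None

def input_node(value):
    return Node(value=value)

def thunk(compute, *deps):
    n = Node(compute=compute)
    n.deps = list(deps)
    for d in deps:
        d.dependents.append(n)
    return n

def set_value(node, value):
    """Phase 1: a change invalidates upward, marking dependents dirty."""
    node.value = value
    stack = list(node.dependents)
    while stack:
        n = stack.pop()
        if not n.dirty:
            n.dirty = True
            stack.extend(n.dependents)

def get_value(node):
    """Phase 2: demand re-evaluates downward, recomputing only dirty thunks."""
    if node.dirty:
        node.value = node.compute(*[get_value(d) for d in node.deps])
        node.dirty = False
    return node.value

a, b = input_node(1), input_node(2)
total = thunk(lambda x, y: x + y, a, b)
print(get_value(total))   # 3
set_value(a, 10)
print(get_value(total))   # 12, recomputed on demand; b was never touched
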
Also relevant is Nix. Nix sort of has a programming language, and in Nix you have this derivation thing: a derivation hashes its input files, then computes stuff and stores the outputs in locations named by the hash. But hashing doesn't really work, it's too expensive, so instead of hashes you have to use modification times. And the other problem with hashing the way Nix does it is that it makes working with existing software really hard, because the files are stored in random locations for isolation. The solution there is to use file access times, or ptrace, or a virtual file system, or maybe even some virus-scanning API.

Another problem with hashes, for checking for modifications, is that they're slow. So you do want to use modification times, but you want to treat them as hashes: basically, pretend the modification time is a hash of the file, and don't compare timestamps with less-than, just ask "are these equal or not equal?". And they're cheap. The alternative is some kind of filesystem watcher, but it turns out a watcher basically just generates custom timestamps in a change list. Timestamps are provided by the operating system anyway; the watcher just turns something that does lots of filesystem accesses into something that processes a change list, so you only need it to avoid scanning for changes yourself. So the only thing to be careful about when writing an algorithm that uses timestamps, to keep it extensible to watchers, is that it has to scan first and then process the files as two separate steps; they can't be fused. It has to handle changes sequentially, basically, so it can't assume a file hasn't changed in the meantime. And the timestamp format has to be flexible.
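A sketch of the "timestamps as opaque fingerprints" discipline (the file names are hypothetical; the point is that changed() only ever tests equality, never which timestamp is newer):

```python
import os

def snapshot(paths):
    """Record each file's mtime as an opaque fingerprint, like a cheap hash."""
    return {p: os.stat(p).st_mtime_ns for p in paths}

def changed(old, new):
    """Compare fingerprints only for equality; never ask which is 'newer'."""
    return [p for p in new if old.get(p) != new[p]]

files = ["notes.txt", "devlog.txt"]   # hypothetical files in the project
before = snapshot(files)
# ... the build runs, an editor saves notes.txt ...
after = snapshot(files)
print(changed(before, after))
```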

And keywords: yeah, I should implement keyword-based programming as well. That ties in with the whole polyglot-languages thing: you pick a code snippet in any language, say "I want to use this code snippet", and then just use it.
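In its smallest form, keyword programming is just "rank the snippets by how well they match the words I typed". A toy sketch with a made-up snippet table:

```python
# Made-up snippet table: description -> code. Real keyword programming would
# search an actual codebase or library, but the ranking idea is the same.
SNIPPETS = {
    "read a file into a string": "open(path).read()",
    "parse json from a string": "json.loads(text)",
    "sort a list in reverse": "sorted(xs, reverse=True)",
}

def keyword_lookup(query):
    """Pick the snippet whose description shares the most words with the query."""
    words = set(query.lower().split())
    best = max(SNIPPETS, key=lambda desc: len(words & set(desc.split())))
    return SNIPPETS[best]

print(keyword_lookup("read file"))   # open(path).read()
```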

So as well as keyword programming we could also do aspect-oriented programming, but I feel it's not necessary, and we can do similar things with documentation, because it's easier to store everything in one place. "What calls this function" is hard either way, since it requires a global semantic search; but for "what does this function do", it's easier to have one big imperative method than lots of small stupid ones. That was some coding advice from Jonathan Blow about Jai.

Jai is not open source, but he has a ton of YouTube videos where he programs and hacks on it, so you can see a lot of the code just because it's on his screen. It's a very standard C++-style language; it doesn't really have anything new as far as syntax goes. I'm honestly not sure why he's writing it, but it's cool to watch.

So having the documentation be a single big hyperlinked mess on the web is probably the best way to deal with it.

In other news, I'm reading all these papers and I don't really feel like I'm processing them. I don't know how to process these big papers. I was writing an Earley parser and I was like "this paper is so complicated!" It wants me to implement a DFA, and I didn't know how they implemented it, so I looked up redgrep, a regular expression library I knew implemented a DFA, and I was like, oh yeah, that's how it works. But I don't know how to adapt that to a parser, and it's all so complex. From Googling, it seems there is no solution to a complex project except to document it carefully and work through it step by step, hence these development logs and trying to code step by step. It might work.

Maybe we could have a voice-activated programming language, because if we combined keyword programming with voice recognition then you could have a programming language that's driven by voice. Voice-activated programming for the win. I don't know, we'll see. But if I could write most of the code with my voice, like I'm writing these devlogs, that would be a good thing. And I was thinking about using English syntax for a lot of things, so one of the goals of the grammar parser is to have English-like syntax. I should check what parser Inform 7 uses, because that's the closest thing to natural-language parsing.

That reminds me of Zarf (Andrew Plotkin). He had this thing called rule-based programming, and he had a page of ideas. The idea was that you could have a list, and you would just specify precedence rules, and then it would sort the list for you. I don't know, that's kind of it, I think: just do a topological sort. And then there were the rules, and, I don't know, rules are complicated. You have wrappers and defeasible rules and before rules and after rules and rules that trigger under certain conditions. The hard part is figuring out how to combine the rules. Common Lisp has multimethods, and I think they were better done in Dylan, Slate, ... something had a better multimethod system. This whole idea of dispatching commands is something you can play with. Smalltalk is the original, but Atomo is a programming language built around this message-passing thing. I also need to look it up to see how it handled defining assignment as syntactic sugar.
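The "specify precedence rules and it sorts the list for you" part really is just a topological sort. A sketch with made-up rule names in the spirit of those notes:

```python
# Toy rule list plus "a must run before b" precedence constraints (hypothetical names).
rules = ["open door", "unlock door", "light lamp", "take lamp"]
before = [("unlock door", "open door"), ("take lamp", "light lamp")]

def sort_by_precedence(rules, before):
    """Kahn's algorithm: every (a, b) constraint puts a ahead of b."""
    succs = {r: [] for r in rules}
    indegree = {r: 0 for r in rules}
    for a, b in before:
        succs[a].append(b)
        indegree[b] += 1
    ready = [r for r in rules if indegree[r] == 0]
    order = []
    while ready:
        r = ready.pop(0)
        order.append(r)
        for s in succs[r]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(rules):
        raise ValueError("precedence constraints contain a cycle")
    return order

print(sort_by_precedence(rules, before))
# ['unlock door', 'take lamp', 'open door', 'light lamp']
```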

Then I need to go over my entire blog again looking for the functions and stuff, because they aren't in the searchable interface yet. I'd like a searchable interface where I can just type words and have them show up. Google does an index thing, but most of my blog doesn't seem to be indexed; they've been getting worse lately, and I think Bing has a better search cache. I want to build this website where you can search the thing, and it has to be fast, because fast is good. I need to figure out how to set up a website. That's the first project: set up a website. And then of course the website has to be reproducible, so you need a way to just set it up without losing your stuff. So GitHub will not work, because I want search, and the search has to run a script, and the script is not runnable on GitHub. Or maybe it is; just generate a big JS blob. Or I could find free hosting; I think Heroku has free hosting or something with dynos and stuff, but let's see.
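The "big JS blob" route would look something like this: a sketch that assumes a hypothetical posts/ folder of text files and emits a static index the page's JavaScript can fetch and filter client-side:

```python
import json
from pathlib import Path

def build_index(post_dir="posts"):
    """Write a JSON blob mapping each post to its word set, for client-side search."""
    index = {}
    for path in Path(post_dir).glob("*.txt"):
        index[path.name] = sorted(set(path.read_text().lower().split()))
    Path("search-index.json").write_text(json.dumps(index))

build_index()   # the static page then fetches search-index.json and filters in JS
```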