Saturday, November 11, 2023

Exception handling claims and counterpoints

Unsupported claim: The distinction between (unrecoverable) bugs/panics and recoverable errors is incredibly useful
Counterpoint: Most languages make no such distinction. The distinction is subjective and depends on the context - one person's "unrecoverable error" is another person's "must recover". For example, say we have a service like Repl.it - user code running in threads running on a server. For expressiveness, the user code should be able to fail by throwing an exception. But that shouldn't take down the entire thread (unless it needs to), and it should never take down the entire server (running thousands of different users' threads). Most exception systems here would only give a false sense of safety: they allow some construct for recoverability (like try-catch or the ? operator), but then they allow exit() or some kinds of errors (OOM, divide by zero, ...) to take down the entire system. This is unacceptable. For example, H2O handled OOM in Java and generally recovered nicely - if OOM were unrecoverable, this couldn't be done. As soon as you allow any sort of unrecoverable bugs/panics, then one unvetted library or untrusted user code snippet can take down the entire system. The conclusion: unrecoverable panics shouldn't exist. A true exception system needs to actually take recoverability seriously and allow every single error in the entire software stack to be recoverable at some level of containment.
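
As a sketch of the containment idea, here is what such a boundary could look like in Python (the run_user_snippet boundary and its behavior are hypothetical, not how any particular host does it): every failure in a user snippet, including exit() and out-of-memory, is caught and logged at one containment level, and the worker and the server keep running.

    import logging
    import threading

    def run_user_snippet(snippet, env):
        # Containment boundary: no failure in the snippet may escape this function.
        try:
            exec(snippet, env)   # untrusted user code (sandboxing elided)
        except MemoryError:
            logging.error("user code ran out of memory; worker keeps running")
        except SystemExit:
            logging.error("user code called exit(); contained at this boundary")
        except BaseException:
            logging.exception("user code failed; logging and continuing")

    # Each user's snippet runs in its own thread; a failure in one snippet never
    # propagates past run_user_snippet, so the server itself stays up.
    threading.Thread(target=run_user_snippet, args=("1/0", {})).start()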

Unsupported claim: Rust users over-estimate how often they need code that never panics
Counterpoint: A lot of software is mission-critical and absolutely has to keep going no matter what, like the server example mentioned previously. A server that constantly crashes is a useless server compared to one that logs the error and keeps going. In contrast, there is never a time when a Rust user said "Gee, my program kept running. I really wish it had crashed with a panic instead of catching the exception and printing a stack trace" - such a user would just add an exit() to their exception handler.

Unsupported claim: Every error in a high-assurance system (think pacemakers/medical devices, critical automotive systems like brake control, flight control, engine control, nuclear device release control software) has to be a hard compile-time error. You absolutely should not be able to make "mistakes" here.
Counterpoint: It is very useful during development to be able to see all the errors in a program at once. If a compiler stops on the first error, this is impossible. Similarly, some errors in programs are only visible by actually running them. If programs under development cannot be run due to not passing one or more quality checks, this means a developer will have to waste time fixing compiler errors in code that may end up being deleted or completely rewritten anyway. If there is a valid semantics to the program, the compiler should compile the program with this semantics and not get in the developer's way. Certainly for the final production-ready binary, it is worth addressing all the quality checks, but not necessarily before then. It is unwise to encode quality control processes at the technical level when organizational policies such as commit reviews, continuous integration testing (which will include quality checks), and QA teams are so much more effective. Similarly to how one person's "must-recover" exception may be another person's "unrecoverable error", one person's "hard compile-time error" may be another person's "useless noise", and policies on such errors can only be dictated by the organization.

Unsupported claim: Most projects don't configure their compiler warning flags properly. Warnings are not enough for mission-critical code; it has to be a hard compile-time error.
Counterpoint: Per https://devblogs.microsoft.com/cppblog/broken-warnings-theory/, in Microsoft's Visual Studio Diagnostics Improvements Survey, 15% of 270 respondents indicated they build their code with /Wall /WX, indicating they have zero tolerance for any warnings. Another 12% indicated they build with /Wall. Another 30% build with /W4. These were disjoint groups that altogether make up 57% of users with stricter requirements than the Visual Studio IDE default (/W3). Thus, a majority of users definitely configure their warning flags, and likely there are more who simply left the flags at the default after determining that the default satisfied their needs. If this is not the case, it is easy enough to make it mandatory to specify at least one compiler warning level flag. The NASA/JPL rules specify to enable all warnings.

Unsupported claim: Guaranteeing all possibilities are handled is not something a general purpose language can handle. We'd need some sort of type system capable of expressing everything. This in turn would require some form of advanced dependent type checking, but such a type system would likely be far too complex to be practical. You'll only get 5 users, 4 of whom work on proof assistants at INRIA.
Counterpoint: Ada is/was a general-purpose language, and Spark is a conservative extension of Ada that adds high-assurance capabilities without changing the underlying language. Many practical projects have been implemented with Ada and with Ada/Spark. And adding a basic analysis that determines what exceptions a function can throw and whether these are all handled is easy (it is just Java's checked exception system).
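
As a rough illustration of how simple that analysis can be, here is a toy version in Python (the call-graph encoding and the function names are invented for the example; a real analysis would also need a fixpoint for recursive call graphs):

    # Each function records which exceptions it raises directly, which functions
    # it calls, and which exceptions it handles. The analysis propagates
    # escaping-exception sets bottom-up through the (acyclic) call graph.
    def escaping_exceptions(fn, functions):
        info = functions[fn]
        escaping = set(info["raises"])
        for callee in info["calls"]:
            escaping |= escaping_exceptions(callee, functions)
        return escaping - info["handles"]

    functions = {
        "parse": {"raises": {"ParseError"}, "calls": [], "handles": set()},
        "load":  {"raises": {"IOError"}, "calls": ["parse"], "handles": {"ParseError"}},
        "main":  {"raises": set(), "calls": ["load"], "handles": set()},
    }
    print(escaping_exceptions("main", functions))  # {'IOError'} escapes unhandled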

Unsupported claim: "Yorick's shitty law of error handling": Mixing guaranteed handling and abort-on-error into the same language is likely to result in users picking the wrong approach every time, and a generally confusing mess. You can't make an error handling mechanism that fits 100% of the use cases, rather you can either target the first 90%, or the remaining 10%, with the latter requiring an increasingly difficult approach the more of that chunk you want to cover
Counterpoint: This is exactly what Java did with its checked and unchecked exceptions, and nobody was confused. Annoyed, yes (at the dichotomy), but not confused. And in my approach there is actually not a checked/unchecked distinction - it is only the function signature that controls whether an exception is checked, so it is even less likely to result in confusion or annoyance. Don't want to bother with checked exceptions? Don't mention them in the function signature.

Unsupported claim: From the point of view of the code that says throw, everything is a panic.
Counterpoint: There are actually several ways to handle a throw (whether it is a traditional exception, an assertion failure, a sys.exit call, etc.); see the sketch after this list:

  • warn: compile time error/warning
  • error: crash the program (exit, infinite loop, throw exception)
  • lint: log failure and keep going (banned at Bloomberg)
  • allow: (in release mode) keep going as though the throw call never existed
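
A sketch of what a per-project policy table over these treatments might look like (Python; the throw kinds and the policy assignments are purely illustrative):

    import logging

    # Hypothetical policy table assigning each kind of throw one of the
    # treatments listed above.
    POLICY = {"assertion_failure": "error", "deprecated_call": "lint", "debug_throw": "allow"}

    def on_throw(kind, message):
        policy = POLICY.get(kind, "error")
        if policy == "error":        # crash the program (here: rethrow)
            raise RuntimeError(f"{kind}: {message}")
        if policy == "lint":         # log the failure and keep going
            logging.warning("%s: %s", kind, message)
        # "allow": keep going as though the throw never existed (no runtime action);
        # "warn" is a compile-time diagnostic, so it has no runtime branch at all.

    on_throw("deprecated_call", "old_api() will be removed")  # logged, execution continues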

Supported claim: Java checked exceptions have issues https://testing.googleblog.com/2009/09/checked-exceptions-i-love-you-but-you.html
Counterpoint: The issues are (1) long throws clauses (2) exception-swallowing traps that rethrow as an unchecked exception (3) unreachable exceptions. Exception sets solve the issue of long throws clauses - Java was trying to do that with the inheritance hierarchy, but I think being able to have overlapping, customizable sets of exceptions will make a huge difference in usability. Allowing type signatures to skip over exceptions (similar to unchecked exceptions) avoids the need for exception swallowing. For unreachable exceptions there is a specific pragma telling the compiler that an exception is unreachable and not to emit any warnings about it.
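
A sketch of the exception-set idea (Python; the set names and the checker are hypothetical, just to show how overlapping, named sets keep signatures short):

    # Named, possibly overlapping groups of exceptions that a signature can
    # reference instead of listing each exception individually.
    IO_ERRORS  = {"FileNotFound", "PermissionDenied", "DiskFull"}
    NET_ERRORS = {"Timeout", "ConnectionReset", "PermissionDenied"}   # sets may overlap

    SIGNATURES = {
        # fetch_config declares the union of both sets in its signature.
        "fetch_config": IO_ERRORS | NET_ERRORS,
    }

    def check_signature(fn, raised):
        # Warn about any exception the body may raise that the signature omits.
        undeclared = raised - SIGNATURES[fn]
        if undeclared:
            print(f"warning: {fn} may raise undeclared exceptions: {sorted(undeclared)}")

    check_signature("fetch_config", {"Timeout", "ParseError"})  # ParseError is undeclared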

Saturday, August 22, 2020

The perfect time-frequency transform

A while back I took a music class. We had to do a final project, for which I researched time-frequency transforms. These take a sound signal in the time domain and produce a 2D graph (spectrogram) of intensity at each time/frequency pair. For example, there is the short-time Fourier transform (STFT), which takes slices of the signal, runs the Fourier transform over each slice to get a frequency pattern, and then averages them together.
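
For concreteness, a minimal numpy sketch of the sliced-FFT idea (window size and hop are arbitrary choices):

    import numpy as np

    def stft_spectrogram(signal, window_size=256, hop=128):
        # Slice the signal, window each slice, FFT it, and keep the magnitudes.
        window = np.hanning(window_size)
        frames = []
        for start in range(0, len(signal) - window_size + 1, hop):
            piece = signal[start:start + window_size] * window
            frames.append(np.abs(np.fft.rfft(piece)))
        return np.array(frames).T   # rows = frequency bins, columns = time slices

    # Example: a 440 Hz tone sampled at 8 kHz.
    t = np.arange(0, 1, 1 / 8000)
    spectrogram = stft_spectrogram(np.sin(2 * np.pi * 440 * t))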

The problem with the STFT is its resolution; it requires picking a window size for the sliding slices and this window restricts the frequency information that may be obtained. The right way to handle sampling is to construct an ideal signal using the Whittaker–Shannon interpolation formula, transform that to an ideal time-frequency spectrum, and then compute an average/integral of the area of the spectrum corresponding to each pixel.
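
For reference, the Whittaker–Shannon formula reconstructs the ideal signal from samples \(x[n]\) taken with period \(T\):

\[ f(t) = \sum_{n=-\infty}^{\infty} x[n] \, \operatorname{sinc}\!\left(\frac{t - nT}{T}\right), \qquad \operatorname{sinc}(u) = \frac{\sin(\pi u)}{\pi u} \]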

So, handling sampling is easy and just requires summing a double integral of the transform of a sinc function a number of times. But what is the actual transform? Reviewing the scientific literature I found many different transforms; the two most interesting were minimum cross-entropy (MCE-TFD) and the tomography time-frequency transform (TTFT). The MCE method tries to minimize the entropy of the spectrum, using an iterative process. The TTFT uses a fractional Fourier transform to get intensities along each frequency/time angle, and then uses the inverse Radon transform to turn these sections into a 2D spectrum.

The TTFT perfectly captures linear chirps; a linear chirp \(\sin((a+ b t) t)\) creates a line on the time-frequency spectrum. But when two or more chirps are present the TTFT shows interference between them. This is due to a quadratic cross-term. The MCE minimizes entropy, not the cross-term, so it too has interference, although less of it. So the question is, can we get rid of the interference? This amounts to the transform being linear; the transform of a weighted sum is the weighted sum of the spectra.

We also want "locality" or clamping; a waveform that is 0 everywhere but a small time slice should have a spectrum that is 0 everywhere but in that slice. Similarly if the frequencies are limited to a band then the spectrum should also be limited to that band. Additionally we want time-shifting, so that the spectrum of \(f(x+k)\) is the spectrum of \(f\) shifted by \(k\).

So to review the properties:

  • Linearity: \(T(a f + b g) = a T(f)+b T(g) \)
  • Chirps: \(T(\sin((a+b t) t))(t,\omega) = \delta ( \omega - (a+b t) ) \) where \(\delta\) is the Dirac delta
  • Locality: \(T(H (k t) f(t))(t,\omega) = H ( k t) T(f(t))(t,\omega) \) where \(H\) is the step function
  • Bandlimiting: if for all \( \omega \in [a,b] \), \( \hat f (\omega) = 0 \), then \( T (f) (t,\omega) = 0 \) for all \( \omega \in [a,b] \)
  • Shifting: \(T(f(t))(t,\omega) = T(f(t+k))(t+k,\omega)\)

The question now is whether such a transform exists. It seems like even these conditions are not sufficient to specify the result, because writing the function as a sum of chirps isn't enough to get a converging graph.

Monday, January 13, 2020

Stroscot devlog 4

The parser’s kind of working, besides features like EBNF, scannerless parsing, etc. that will be useful later but aren’t strictly necessary. So now I am writing the middle parts for the Stroscot programming language.

I started out by wanting to implement universes with dependent types, in particular experimenting with the New Foundations idea. I also wanted to add subtyping a la Dolan. Then I remembered insane and that it had to suspend most of the “check” calls in order to check recursive types successfully, so I needed an evaluation model. I wondered if type inference was any more complicated, so I looked at Roy and saw the unify function and concluded that implementing type inference was going to involve adding type variables and type solving (unification) to an existing type checker, but wouldn’t change much about the type checker structure, so starting with a dependent type checker would be best.

Then I looked for dependent type checker tutorials and found this; it starts out by presenting an evaluator for the untyped lambda calculus and then extends it gradually. LambdaPi takes a similar approach. They both use environments and closures, but with optimal reduction there are duplicators rather than closures and the environments are stored in the graph, so the implementation is quite different.

So I figured I’d start from the optimal lambda calculus implementation in macro lambda calculus. But the first thing it does is a pseudo-alpha-renaming pass that identifies all the free variables and replaces them with wires. The code is complicated quite a bit by being split into a separate inet-lib library which takes a custom-syntax interaction net description and uses JavaScript’s eval features to implement the interaction rules. I went and read “Optimality and inefficiency” and they show poor behavior on C_n = \x -> (2 (2 (2 (2 x)))) where 2 = \s z -> s(s z), when applied to \x y -> x y and \x y -> y x. I still like optimality though; the issues they present are unrelated to interaction nets and can be worked around with the garbage collection in Asperti/BOHM.

I looked for more implementations of LambdaScope and it seems that Jan Rochel’s graph-rewriting-lambdascope package (Haskell) is the only one. There is MaiaVictor’s optlam / absal (JS) and Vizaxo’s optal (Haskell) but they use Lamping’s algorithm which has brackets/croissants and more overhead. They’re still worth looking at for ideas on how to implement interaction nets though.

I was thinking that perhaps Kernel/Fexprs had an interesting solution to variable lookup, as the environment is passed explicitly, so I was reading through the thesis and got lost. I tried reading the paper and it also seems very long and full of Scheme-isms with not much on lambdas. The primitives are wrap and vau; somewhere in there we can insert optimal evaluation, but I didn’t get to environments yet.


Thursday, January 9, 2020

Stroscot Devlog 3

So I've been reading a lot about Earley parsers. They're divided into predictor, completer, and scanner sections. There are some “interesting observations” in the “Directly Executable Earley Parsing” paper: mainly, that additions are only ever made to the current and next Earley sets. But we still have to store the Earley sets; on the other hand, I noticed that we only have to look at them indexed by production. So we can index them by the next non-terminal and then look back to retrieve them. But it seems clear that the sets are mostly for notational, pedagogical, and debugging purposes, and the algorithm only needs the lists of suspended non-terminals pointed to by in-progress non-terminals.
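
To make the structure concrete, here is a compact (and deliberately unoptimized) Earley recognizer sketch in Python; the toy grammar and the item encoding are my own, but it shows the predictor and completer only ever adding to the current set while the scanner only adds to the next one:

    # Items are (lhs, rhs, dot, origin).
    GRAMMAR = {                        # Sum -> Sum '+' Num | Num ;  Num -> 'n'
        "Sum": [("Sum", "+", "Num"), ("Num",)],
        "Num": [("n",)],
    }

    def recognize(tokens, start="Sum"):
        sets = [set() for _ in range(len(tokens) + 1)]
        sets[0] = {(start, rhs, 0, 0) for rhs in GRAMMAR[start]}
        for i in range(len(tokens) + 1):
            changed = True
            while changed:
                changed = False
                for (lhs, rhs, dot, origin) in list(sets[i]):
                    if dot < len(rhs) and rhs[dot] in GRAMMAR:      # predictor
                        new = {(rhs[dot], r, 0, i) for r in GRAMMAR[rhs[dot]]}
                    elif dot == len(rhs):                           # completer
                        new = {(l2, r2, d2 + 1, o2)
                               for (l2, r2, d2, o2) in sets[origin]
                               if d2 < len(r2) and r2[d2] == lhs}
                    else:
                        new = set()
                    if not new <= sets[i]:
                        sets[i] |= new
                        changed = True
                if i < len(tokens):                                 # scanner
                    sets[i + 1] |= {(lhs, rhs, dot + 1, origin)
                                    for (lhs, rhs, dot, origin) in sets[i]
                                    if dot < len(rhs) and rhs[dot] == tokens[i]}
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for (lhs, rhs, dot, origin) in sets[-1])

    print(recognize(["n", "+", "n"]))   # True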

The other thing I looked at was the Earley-LR optimization, basically turning the grammar into an LR(0) automaton with shift/reduce ambiguity resolved by Earley sets, and generally speaking it seems like it can speed up the parser. Each Earley item is normally a dotted production rule together with a parent set, but you can replace the production rule with an LR(0) DFA state, which is a set of LR(0) items.

According to Jeffrey Kegler: “In creating Marpa, I may have acquired as much experience with the AH algorithm as anybody, possible even Aycock/Horspool themselves. I found the special LR(0) items they create to be counter-productive. AH's own implementation apparently had bugs due to duplication of Earley items. I solved these, but it made the code very complex and slower. Most of the compression that the AH method bought was in predictions, which could be compressed in other, less complicated, ways. Using AH items means that whenever something needs to be done in terms of the user's grammar, there has to be a translation. Since the AH item to Earley item relationship is many-to-many, this is not trivial. The cost of the more complex evaluation phase seems likely to eliminate any speed-up in the recognizer.

In “Practical Earley Parsing”, as a side issue, they describe a way of handling nulled symbols and productions via grammar rewrites. This proved very valuable and Marpa *does* continue to use that. It's a bit ironic that I ended up ignoring a lot of AH's paper, but that their little implementation trick has proved central to my efforts.” (edited, see here)

So in summary LR doesn’t add much to Earley parsing, but computing nullability and handling it specially does. It seems like they're optimizing their parser a lot, but they're optimizing the wrong things. The memory layout and cache use are the main things to pay attention to on modern processors, and they don't seem to have a very interesting layout. And the radix tree for Earley items seems completely pointless. You have to iterate over all the pending Earley items anyway so you might as well just use a linked list.

The practical paper also leads to this rabbit hole of direct code execution / threaded execution / direct threaded code. It’s useful in parsers, but it appears in any interpreter, e.g. Python. Python uses computed gotos: the basic idea is that instead of jumping back to a dispatch loop, each handler ends with code that looks up the dispatch table and jumps directly to the next handler. Python uses the first-class labels (labels as values) feature available in GNU C. But reading the “Directly Executable Earley Parsing” paper it just seems like a bunch of pointer stuff, and it doesn't do the state splitting that they do in their later papers.

Another question is whether we can implement parser combinators on top of Earley, memoizing all the states somehow. It seems like it might be possible with enough recursion and memoization hacking but probably not worth it.

And I still don't know how to handle extended grammars, and I haven't looked at the Leo paper at all. But from some code skims of people who’ve implemented it, Leo's optimization seems like it’s just a matter of keeping track of certain items. Extended Earley grammars generalize the dot position in the rule, so it’s an NFA (DFA?) instead of just a linear sequence. This of course should be easily accomplished with the help of some papers I have on extended rules.

Sunday, December 22, 2019

Stroscot Devlog 2

Okay so programming language project post 2. Sorry that these are a little slow, I get backlogged...

I was wandering down a misty road and I saw a parser at the end of the tunnel. But did I see this parser’s code? No, I did not. 😞 I saw many parsers, all different! There was a tangled mess of parsers!

Well, I guess what I'm trying to say is that I've been reading lots of parsing papers, and some of them work with each other and some of them don't, and it's not clear which is which. I mean, it looks like they can all be assembled together into a single massive parser, but I don't know. So far what’s winning is the Earley parser. It basically has three stages: the scanner, the predictor, and the completer. They correspond to state transitions in LR parsers so a reduce is the same as a complete and a shift is the same as a scan. There’s no prediction step in an LR parser because it precomputes all the predictions when it processes the grammar into states.

The standard paper on Earley parsing is “Practical Earley Parsing” by Aycock and Horspool. It shows how to merge an LR automaton with an Earley parser, based on work from an earlier 1996 paper, “A Faster Earley Parser”. There's also the Marpa paper, which has some code snippets and improvements on the Aycock paper. Then there are a bunch of LR papers that look vaguely interesting but I haven't read yet. I'm honestly not sure I will.

So an LL parser has the FIRST and FOLLOW sets, while an LR parser is shift-reduce and builds up this LR automaton. And the LR automaton is apparently the same for all the LR algorithms (SLR/LALR/GLR): an LR(0) automaton, because LR(1) automata with lookahead are too big. And then the Aycock-Horspool paper uses Earley-LR, a.k.a. LRE following the 1996 paper, also built on top of an LR(0) automaton. Meanwhile GLR is inefficient (exponential) and nobody uses it besides people who don’t care about parser speed.

Then there’s the Adaptive LL paper by the ANTLR guy about how to use LL look ahead to predict things. That actually has a very simple structure: it just pretends it's an LL(0) parser except it has a look-ahead function which constructs a deterministic finite automaton and this DFA is what actually predicts the next production. And then it extends the automaton adaptively as it parses the input (which is the complicated part) and apparently it goes pretty fast. But the approach doesn’t handle ambiguous grammars or left recursion; he has a trick in the appendix but it doesn’t handle all cases. I still think I need to read more papers though. I haven’t looked at the “parsing with derivatives” paper which is probably the most promising since it’s the newest approach.


But first I'm going to read this paper “SPPF-style parsers from cubic recognizers”. So there are two main algorithms in the paper: there's the Earley algorithm and the RIGLR. So we'll start with the Earley algorithm. The Earley algorithm, let’s see...  so it has packed nodes and non packed nodes. I found a nice blog post on SPPF trees. But the algorithm is still impenetrable.


I guess going section by section is easier than trying to scan back and forth in the paper for clues. The first section is just the introduction: extending the Earley recognizer given in textbooks into a parser in the obvious way returns spurious derivations. Then they have BRNGLR, which is the alternative that they compare with. Section 2 is just definitions; they include created SPPF nodes in the Earley triples, so the triples are labeled with grammar symbol / position, parent set index, and SPPF node. And then there are derivation trees, and the shared forests are binarized because nodes have out-degree at most 2. Then they have a summary of Earley’s algorithm in one sentence, around the middle of page 57. I think the contributions of this SPPF paper are mostly about how to write an Earley parser (and RIGLR parser) correctly, and the amount of explanation of the motivation or reasoning behind it is essentially nil. However, it does have a good section that separates the phases of Earley parsing out, and it pays more attention to the order in which the scanner / completer / predictor get executed than other papers, which are just like “execute them all until they stop adding items”.


So the other thing in life I’m thinking about is the operational semantics and whether using the optimal reduction graph machine is a good idea. I've been mulling it over and concluded that it can’t hurt; I mean, I know you can't implement eager reduction later, but it’s easy to add special-purpose reductions that split up the graph and hard to find those just by thinking about the problem in advance. So yeah, in conclusion, I think the best thing is to implement optimal reduction.


The other thing I was thinking about was scopes. So if we have a statement like “let x equal 3 in y” then, interpreting that statement, it seems like the best approach would be to construct a record with x equals 3 as its contents; y will then be passed that record as an implicit environment argument. That should let all the names be resolved appropriately, and then the read and write scope commands will end up threading the record through.
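
A minimal sketch of that record-as-environment idea (Python, with a toy expression encoding of my own):

    # Expressions are tiny tagged tuples; the environment is just a record (dict).
    def interpret(expr, env):
        tag = expr[0]
        if tag == "lit":                      # ("lit", 3)
            return expr[1]
        if tag == "var":                      # ("var", "x") -> a read from the record
            return env[expr[1]]
        if tag == "let":                      # ("let", "x", rhs, body)
            _, name, rhs, body = expr
            extended = {**env, name: interpret(rhs, env)}   # the new record
            return interpret(body, extended)  # body receives the record implicitly
        raise ValueError(f"unknown expression: {tag}")

    print(interpret(("let", "x", ("lit", 3), ("var", "x")), {}))  # 3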

So another thing to do is look at the difference between parsing regular expressions with derivatives and parsing grammars with derivatives. I also want to see what Antimirov derivatives are about, because apparently they help with the capturing-group stuff, and that might be useful for parsing. It would be interesting to have non-capturing productions that don’t show up in the generated parse tree.


So back to the big picture, parser goals:

  • Parsing regular-expression-type things, with EBNF or even boolean grammars on the right-hand sides

  • Linear time on LR(k) grammars

  • Generating SPPF-packed results

  • Handling recursive grammars, cyclic productions, and nullable elements


So let's talk about the structure of a parser. A parser is generally something that takes a string of characters and a grammar and produces an output data structure, which in our case is going to be the SPPF format, because that's the easiest format to use so far and it’s cubic time for ambiguous grammars. I guess that’s slow, but hopefully there isn’t much ambiguity in a grammar.

The first step in parsing is to process the grammar into a usable form. When we're building a parser generator, this does a lot of pre-processing and probably builds a big finite state machine that handles every character in the grammar; it seems like this is kind of necessary for efficiency. And there are some papers on incremental parsers and on modifying LR state machines to add or remove grammar rules.

As far as testing goes, I guess I’ll use a number of different complicated expressions. First are the counter-examples I've gotten from other parser papers, the ones that are not supposed to work. The simplest is probably the calculator or arithmetic language; it’s not LR, and handling precedence and associativity is hard.


So the first goal is to understand the SPPF format. Parsing is in some sense an assignment problem: we want to assign values to the different tree start-and-stop positions in order to construct a tree. And in Marpa it’s a bit more complicated because there are different start-and-stop positions for each token/Earleme/lexeme.

The interesting part is handling recursion. Recursion happens on the left side, on the right side, and in the middle, and then you can have cyclic rules that refer to each other directly or indirectly, giving indirect recursion. So how do we deal with this? The answer is that we have to maintain multiple states; the question is how to maintain them. The Earley parser seems to have the most efficient representation of ambiguity because it allows sharing token states between multiple different forms, whereas GLR uses a more complicated stack and seems to have a lot of duplication since it only shares prefixes and suffixes. So the Earley parser seems to have a finer-grained sharing/tracking algorithm.

So the ambiguity part is interesting as well. Earley parsing handles left recursion really well: it just collapses the left recursion into a single state, and then Leo’s optimization from his paper handles right recursion really well, again collapsing everything into a single state. The reason to use Aycock-Horspool’s LR stuff is that it collapses a lot of the state transitions. The questionable part is how it handles nullable rules. Nullable rules, I was thinking, could also be handled by creating an epsilon-LR automaton and adding some more complicated transition logic, so it's not clear whether you have to modify the grammar like Aycock-Horspool do or whether it's possible to just construct a special deterministic LR automaton that works for the grammar.

The next question is how do I use it for operators? The operator precedence graph is also hard for the parser to deal with. I found this parsing thesis that disambiguates SPPF trees using mixfix precedence information, and it seems easy enough, so I just have to read it.

The next thing is how to handle layout parsing: lexing, indentation, and whitespace. These seem like trivial things since I found some nice papers (which I haven't read yet either), but they should just be modifications to the grammar with more concatenative operators and extra ways of matching tokens.

I have no idea how this cubic Earley algorithm works. It's just so complicated. It's really intimidating. The start can be ignored; it’s initialization. You can just use a dummy rule with the start symbol on the right hand side and it will initialize itself, and this will also be needed for the Leo optimization to work anyway. This strategy is used in Bison too in its LR parsers.

The next part of the algorithm is all in a loop over the Earley sets; the set index is the position of the character in the input.

Then there's the set R, which is defined in their earlier 2008 paper on Earley parsing along with Q. They’re both queues of pending items to be processed. Q is the set of items whose next symbol is a terminal and R is everything else. So R is processed first and then Q is processed to move to the next token. And then there is a holding set H for completing null derivations.

The Earley parser algorithm's input is a grammar gamma composed of (let me look it up again) a set N of non-terminal symbols, a set T of terminal symbols, an element S of N called the start symbol, and a set P of numbered grammar rules of the form A ::= alpha, where A is a non-terminal symbol and alpha is a possibly empty string of terminals and non-terminals. The symbol epsilon denotes the empty string.
And then there is the input string a1 a2 etc. up to an. Okay.
So the actual Earley parser algorithm: first, we initialize E0 through En to the empty set; we also initialize the sets R, Q_old, and V to be empty.
Next, for all rules in P whose left-hand side is the start symbol S, with right-hand side alpha: if alpha is empty or begins with a non-terminal, the item goes into E0, and if alpha begins with a terminal matching the first input symbol, it goes into the terminal queue... alright, this is the point where writing it down as code directly is easier.
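
Here is that initialization written as code (a Python sketch of my reading of it, not a faithful transcription of the paper; the set names follow the description above):

    def initialize(grammar, start, tokens):
        n = len(tokens)
        E = [set() for _ in range(n + 1)]       # Earley sets E_0 .. E_n
        Q, V = set(), {}                        # terminal-item queue and SPPF-node table
        for rhs in grammar[start]:              # every rule S ::= alpha
            item = (start, tuple(rhs), 0, 0)    # dot at position 0, parent set 0
            if not rhs or rhs[0] in grammar:    # alpha empty or starting with a non-terminal
                E[0].add(item)
            elif n > 0 and rhs[0] == tokens[0]: # alpha starting with the matching terminal
                Q.add(item)
        R = set(E[0])                           # non-terminal items pending processing
        return E, R, Q, V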

So I also wanted to talk about types. I really don't like types cluttering the program and making it run slow, but they are useful for type inference. In particular you have constraint solving. Constraint solving works really well in Haskell, and it lets you write more code automatically, so I want to expand on that in my language and have at least some amount of constraint solving. In particular you need a coerce function that automatically converts between any two types, and then you need to be able to overload methods and have them resolved by type.
The most useful thing I've seen so far is the subtyping algorithm in Brick and MLSub, which Stephen Dolan wrote. As far as doing something else, I think the only solution is to hack until it works. I’ve read a bunch of type papers and they all end up leaving it up to you to make it work. If they do present an algorithm it’s very simple, and the papers are mostly about proving the algorithm correct and complete rather than about the implementation.
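
A toy sketch of the coerce-plus-overloading idea from the paragraph above (Python; the coercion table and the overload map are invented for the example, and this is nothing like a real inference algorithm):

    COERCIONS = {(int, float): float}   # an int is accepted where a float is expected

    def coerce(value, target):
        if type(value) is target:
            return value
        if (type(value), target) in COERCIONS:
            return target(value)
        raise TypeError(f"cannot coerce {type(value).__name__} to {target.__name__}")

    OVERLOADS = {                       # "show" is resolved by the argument's type
        ("show", float): lambda x: f"{x:.2f}",
        ("show", str):   lambda x: x,
    }

    def call(name, arg):
        for (n, param_type), impl in OVERLOADS.items():
            if n == name:
                try:
                    return impl(coerce(arg, param_type))
                except TypeError:
                    continue
        raise TypeError(f"no overload of {name} for {type(arg).__name__}")

    print(call("show", 3))              # coerces int -> float and prints "3.00"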

I did a crawl of old papers citing Earley parsers and I didn't really find anything new. Google only shows the first 1000 results, sorted by citations, and there’s the Leo paper and the Horspool paper but not much else. I downloaded a few just to look at later but there aren't any really novel ideas, just optimizations and benchmarks.