Co-dfns versus BQN's implementation

Co-dfns is under active development so this document might not reflect its current state. Last update 2025-05-02.

The BQN self-hosted compiler is directly inspired by the Co-dfns project, a compiler for a subset of Dyalog APL. I'm very grateful to Aaron for showing that array-oriented compilation is even possible! In addition to the obvious difference of target language, BQN differs from Co-dfns both in goals and methods.

The shared goals of BQN and initial Co-dfns research are to implement a compiler for an array language with whole-array operations. This provides the theoretical benefit of a short critical path, which in practice means that both can make good use of a GPU or a CPU's vector instructions simply by providing an appropriate runtime. The two implementations also share a preference for working "close to the metal" by passing around arrays of numbers rather than creating abstract types to work with data. Objects are right out. These choices lead to a compact source code implementation, and may have some benefits in terms of how easy it is to write and understand the compiler.

More recent versions of Co-dfns are far less strict about the array style. The patterns are still usually parallel in some sense, but they might work with many small arrays and it's no longer clear that they are suitable for GPU evaluation, or fast in a classic APL interpreter. Co-dfns has never truly run on a GPU to my knowledge. The compiler has never been self-hosting, and the GPU measurements in Aaron's thesis were taken with a hand-translated version of the APL code.

Compilation strategy

The Co-dfns source separates tokenization and parsing, core compilation, and code generation from each other. Initial work focused on the core compiler only. The associated Ph.D. thesis and famous 17 lines figure refer to this section, which transforms the abstract syntax tree (AST) of a program to a lower-level form, and resolves lexical scoping by linking variables to their definitions. While all of Co-dfns is written in APL, other sections aren't necessarily data-parallel, and could perform differently. In contrast, the BQN compiler is entirely written in a data-parallel style. Its source splits out tokenization but mixes everything else together; code generation is much simpler because it only outputs IR, as discussed in the next section.

The core Co-dfns compiler is based on manipulating the syntax tree, which is mostly stored as parent and sibling vectors—that is, lists of indices of other nodes in the tree. BQN is less methodical, but generally treats the source tokens as a simple list. This list is reordered in various ways, allowing operations that behave like tree traversals to be performed with scans under the right ordering. This strategy allows BQN to be much stricter in the kinds of operations it uses. The compiler shown in Aaron's thesis regularly uses ⍣≡ (repeat until convergence) for iteration and creates nested arrays with ⌸ (Key, like Group). BQN has only a single instance of iteration to resolve quotes and comments, plus one complex but parallelizable scan for numeric literal processing. And it only uses Group to extract identifiers and strings, and collect metadata for final output. This means that most primitives, if we count simple reductions and scans as single primitives, are executed a fixed number of times during execution, making complexity analysis even easier than in Co-dfns.

As of 2025 I think Co-dfns has backed off a bit from the data-parallel array paradigm: even the core compiler slips in a few each-ish patterns that work on an arbitrary number of small arrays (it does rely less on nested arrays). These parts are still parallel with good complexity in a multi-core model, because the different iterations are independent, but they don't have the homogeneity that makes array primitives such a good fit for GPU and SIMD CPU implementations. For parsing, many different methods have been used historically, beginning with sequential ones like parser combinators and a parsing expression grammar. In 2021, tokenization and most of parsing were fairly close to the style used by the core compiler. In 2025 I would describe both the tokenizer and parser as using a hybrid system, which sticks to flat arrays when convenient, but also readily uses loops or occasionally nested arrays based on source code features like numeric literals, function headers, or expressions. These changes point to an emphasis on handling more APL features and improving the compiler output, recognizing compiler speed as less practically important.

Backends and optimization

Co-dfns was designed from the beginning to build GPU programs, and outputs code in ArrayFire (a C++ framework), which is then compiled. GPU programming is quite limiting, and as a result Co-dfns began with strict limitations in functionality that were slowly removed—the big ones like arbitrary rank and nested arrays are gone now. BQN is designed with performance in mind, but implementation effort focused on functionality first, so that arbitrary array structures as well as trains and lexical closures have been supported from the beginning. Rather than target a specific language, it outputs object code to be interpreted by a virtual machine. Another goal for BQN was to not only write the compiler in BQN but to use BQN for the runtime as much as possible. The BQN-based runtime uses a small number of basic array operations provided by the VM. The extra abstraction causes this runtime to be very slow, so CBQN relies on it very little and re-implements most functions natively.

The BQN compiler essentially outputs IR that does the same thing as the source, leaving optimization up to the IR interpreter (it does try to find variable lifetimes in blocks so that the last access can clear the variable). Co-dfns also leaves most optimization to ArrayFire, but in v5.7.0 (2024) started adding some classic improvements like dead code elimination. It also switched to a register model for arrays to reuse stack slots, which are fixed-size and store metadata or the values of small arrays. This is done with a randomized dense matrix algorithm that doesn't look so hot in terms of asymptotics (at least quadratic in the number of relevant nodes), and is called out as being slow in the release notes. Not like I know anything better though, if we're sticking to array programming.

Error reporting

It started a year or two after we finished, but Co-dfns is now pretty similar to BQN in its error handling and source position tracking. BQN has complete error checking and good error messages, and includes source positions in compiler errors as well as in the compiled code for use in runtime errors. Position tracking and error checking together add about 20% to the number of lines in compiler and 10% to compile times. And improving the way errors are reported once found has no cost for working programs, because reporting code only needs to be run if there's a compiler error. The only thing that really takes advantage of this now is the reporting for bracket matching, which goes over all brackets with a stack-based (not array-oriented or parallel) method.

Comments

At the time I began work on BQN, Aaron advocated the almost complete separation of code from comments (thesis) in addition to his very terse style as a general programming methodology. I find that such a style makes it hard to connect the documentation to the code, and is very slow in providing a summary or reminder of functionality that a comment might. I chose one comment per line as a better balance of compactness and faster accessibility. Co-dfns has since fluctuated wildly in its approach to documenting the code, including a stab at literate programming with perhaps half a page to describe each line for the parts that were finished.

Is it a good idea?

In short: no, I don't think it makes sense to use an array style for a compiler without a good reason. BQN uses it so it can self-host while maintaining good performance; Co-dfns uses it to prove it's possible to implement the core of a compiler with low parallel asymptotic complexity. It could also make a fun hobby project, although it's very draining. If the goal is to produce a working compiler then my opinion is that using the array style will take longer and require more skill, and the resulting compiler will be slower than a traditional one in a low-level language for practical tasks. Improvements in methodology could turn this around, but I'm pessimistic.

It's important to note that my judgment here applies specifically to implementing compilers. Many people already apply APL successfully to problems in finance, or NumPy in science, TensorFlow in machine learning, and so on. Whatever their other qualities, these programs can be developed quickly and have competitive performance. Other problems, such as sparse graph algorithms, are unlikely to be approachable with array programming (and the P-complete problems are thought to not be efficiently parallelizable, which would rule out any array solution in the sense used here). Compiling seems to occupy an interesting spot near the fuzzy boundary of that useful-array region. Is it more inside, or outside?

Ease of development

It needs to be said: Aaron is easily one of the most talented programmers I know, and I have something of a knack for arrays myself. At present, building an array compiler requires putting together array operations in new and complicated ways, often with nothing but intuition to hint at which ones to use. It is much harder than making a compiler the normal way. However, there is some reason for hope in the history of traditional compilers. It took a team led by John Backus two and a half years to produce the first FORTRAN compiler, and they gave him a Turing award for it! Between the 1950s and 1970s, developments like the LR parser brought compilers from being a challenge for the greatest minds to something accessible to a typical programmer with some effort. I don't believe the change will be so dramatic for array-based compilers, because many advantages in languages and tooling—keep in mind the FORTRAN implementers used assembly—are shared with ordinary programming. But Aaron and I have already discovered some concepts like tree manipulation and depth-based reordering that make it easier to think about compilation, and there are certainly more to be found.

One advantage that sticks out, despite the common jeer about "write-only" array code, is that the BQN compiler has proven very easy to debug. It can be months between reports and by now I never touch the compiler otherwise, but fixing bugs is still quick and comfortable. Almost every line is run once, in order, so when the output is wrong I skim the comments or work backwards through the code to find which variable was computed wrong, and work out how that happened by printing precursor values, sometimes comparing to similar but correctly-handled input. Usually the relevant variables have the same ordering and can be printed as a table, and the operations on them are all basic and familiar BQN except for a few utility functions, which aren't that complicated either. Flat data is a big benefit because of how easy it is to display and manipulate.

At a larger scale, I find that variable management is a major difficulty in working with the compiler. This is a problem that Aaron doesn't (or didn't) have, because his compiler is 17 lines long. What happens in a larger program is that various properties need to be computed in one place and used in another, making it hard to keep track of how these were computed and what they mean. In BQN, different sections of the compiler use different source orderings (one thing I've expended some effort on is to reduce the number of orderings used). A tree-based compiler would probably have similar problems, unless all the state is going to be transformed at each step, which would perform poorly. Using a variable with one ordering in the wrong place is a frequent source of errors, particularly if the ordering is something like expanding function bodies that has no effect in many small programs. Is there some way to protect against these errors?

Does APL Need a Type System?

Here's Aaron's take. Honestly I can't really go along with this: I think he ignores a lot of real distinctions between array and typed functional programming because it's convenient for the point he wants to make. On the other hand, it's abundantly clear that C-style types would be useless for an array compiler, because nearly every variable is a list of integers.

The sort of static guarantee I want is not really a type system but an axis system. That is, if I take a∧b I want to know that the arithmetic mapping makes sense because the two variables use the same axis. And I want to know that if a and b are compatible, then so are i⊏a and i⊏b, but not a and i⊏b. I could use a form of Hungarian notation for this, and write ia←i⊏a and ib←i⊏b, but it's inconvenient to rewrite the axis every time the variable appears, and I'd much prefer a computer checking agreement rather than my own fallible self.

Performance

In his Co-dfns paper Aaron compares to nanopass implementations of his compiler passes. Running on the CPU and using Chez Scheme (not Racket, which is also presented) for nanopass, he finds Co-dfns is up to 10 times faster for large programs. The GPU is of course slower for small programs and faster for larger ones, breaking even above 100,000 AST nodes—quite a large program.

The self-hosted compiler running in CBQN reaches full performance at 3KB or so of dense source code, and slows down slightly at larger sizes as the type for indices gets larger. Handling around 15MB/s on my current laptop, it's between the same speed and half as fast as dzaima/BQN's compiler when that's fully warmed up (which takes hundreds of runs and uses 3GB of RAM). That compiler was written in Java by dzaima in a much shorter time than the self-hosted compiler, and is very similar for benchmarking purposes—it's missing some functionality but this shouldn't change timings more than 10% at a very conservative estimate. The Java compiler is written with performance in mind, but dzaima has expended only a moderate amount of effort to optimize it.

We also have equivalent C and BQN implementations of a restricted compiler for bootstrapping. Full results here; the BQN version of the compiler tested 35% faster but there are good reasons to think BQN wouldn't have the same advantage on a full compiler.

There are a few factors to consider regarding the difference in results:

CBQN is a different runtime from Dyalog. We've profiled the compiler and implemented lots of things in SIMD, and I'm confident CBQN runs the compiler at least as fast as Dyalog 17.0, the latest version for Aaron's thesis, would have.
Co-dfns and BQN use different compilation strategies. I think that my methods are at least as fast, and scale better to a full compiler.
The portion of the compiler implemented for the paper could be better for arrays than other sections, which in fact seems likely to me—I would say parsing is the worst for array relative to scalar programming.
The Co-dfns and nanopass implementations are pass-for-pass equivalent, while BQN and Java are only comparable as a whole. As the passes were chosen to be ideal for Co-dfns, there's some chance this could be slowing down nanopass in a way that doesn't translate to real-world performance.

I reckon that most of the effect is that the nanopass compiler just isn't very fast. While our whole-compiler results aren't definitive, I do see them as superseding Aaron's comparison. So I conclude that, with current technology, array-based compilers are competitive with scalar compilers on the CPU and not vastly better. This result is still remarkable! APL and BQN are high-level dynamically-typed languages, and wouldn't be expected to go near the performance of a compiled language like Java. However, it's much harder to recommend array-based compilation without the dominant advantage the Co-dfns paper presents.

I stress here that I don't think there's anything wrong about the way Aaron has conducted or presented his research, even if it now appears that nanopass was a poor baseline. I think he chose nanopass because of his familiarity with the functional programming compiler literature and because its multi-pass system is more similar to Co-dfns. And I know that he actively sought out other programmers willing to implement the compiler in other ways including imperative methods; apparently these efforts didn't pan out. Aaron even mentioned to me that his outstanding results were something of a problem for him, because reviewers found them unbelievable!

What about the GPU?

BQN's compiler could certainly be made to run on a GPU, and it's fascinating that this is possible merely because I stuck to an array-based style. In Co-dfns, Aaron found a maximum factor of 6 improvement with the GPU translation, and if he used the ArrayFire runtime for it then there's probably further improvement to be had. The problem is this: who could benefit from this speed?

Probably not BQN. Recall that the BQN compiler runs at 15MB/s. This is fast enough that it almost certainly takes much longer to run the program than to compile it. The exception would be when a lot of code is loaded but not used, which can be solved by splitting the code into smaller files and only using those which are needed.

The programmers who complain about compile times are using languages like C++, Rust, and Julia that compile to native code, often through LLVM. The things that these compilers spend time on aren't the things that the BQN compiler does! BQN has a rigid syntax, no metaprogramming, and compiles to bytecode. The slow compilers are churning to perform tasks like:

Type checking
Templates, polymorphism, and other code generation
Reachability computations, for optimization
Automatic vectorization

I don't know how to implement these in an array style. I suspect many are simply not possible. They tend to involve relationships between distant parts of the program, and are approached with graph-based methods for which efficient array-style implementations aren't known.

It might be possible to translate a compiler with less optimization such as Go to an array style, partially or completely. But Go compiles very fast, so speeding it up isn't as desirable. In slower compilers, attacking problems like the above strikes me as many levels beyond the work on Co-dfns or BQN.

Pareas

Pareas is a GPU-based compiler that takes a simple custom-designed imperative language as input and outputs RISC-V. It's implemented in Futhark, an array-oriented ML-family language with GPU support. Pareas and the page you are reading were developed in… parallel, and I became aware of it after writing the pessimistic take on GPU compilers above. I'm very glad this work is being done, but it seems fully compatible with my conclusion that our current data-parallel methods aren't advanced enough to make useful improvements in slow compilers. Both the source and target language had to be kept simple and not many optimizations are performed. It does show that APL syntax isn't necessary to develop a GPU compiler. I think it's too different from Co-dfns and BQN to say whether one development style might be more effective, though.

Pareas seems to focus on applying general methods in a smart way before going special-purpose, even more so than Co-dfns. I think this is absolutely a good choice and the right direction for GPU research. But it does reduce the power available to the compiler writers in some sense, making Pareas appear less capable than it might otherwise—until you account for the flexibility to handle other languages. The main parser targets a class of languages called LLP(q,k), and for simplicity it's restricted further to LLP(1,1). This isn't expressive enough for any serious language, and even with their specialized input language the authors add a second pass using tree manipulation to refine the grammar. In particular, BQN isn't in this class because any part of an expression can be parenthesized an arbitrary number of times, requiring unbounded lookahead (or behind) to determine the role from the outside.

As it stands, the compiler's performance isn't great, but that's entirely due to slow register allocation. The rest of the compiler is capable of over 1GB/s on their high-end GPU. Sure, it takes megabytes of source to reach these speeds, but that leaves us at "compile just about anything in a fraction of a second" which is very attractive. The register allocation is kind of worrying because they rejected graph coloring for performance reasons and went with a very basic allocator that spills whenever things get tough. Nonetheless I think challenges like more realistic syntax and a good allocator can probably be accomodated and perform substantially faster than current CPU-based code. But I see that allocator as the tip of an iceberg: it's nowhere near the hardest or scalar-est task for an optimizing compiler, and a few intractable problems can sink the whole thing. Maybe there's room to pass such things off to the CPU. Maybe a linker is a better goal for GPU hosting? Even though I don't expect a GPU gcc on the horizon, I'm excited to see what comes out of this line of research!