Replicate and Indices benchmarks

Implementation notes

CBQN's boolean Replicate functions (bool/𝕩, "compress", and /bool, "where") are heavily optimized, and its non-boolean cases somewhat less so. It chooses algorithms based on the density +´÷≠ of the replication counts, and for the boolean case, the density of switches (+´÷≠)≠˝˘2↕c. Note that Replicate is generally faster than Select because it can use vectorized methods instead of reading single values from memory.

For compress and where, there are optimizations for several SIMD levels, with AVX2+BMI2 shown here. Methods other than compress of boolean mostly use lookup tables; AVX2 is only used for 4-byte and 8-byte data, as the very old SSSE3 is equally effective for 1- and 2-byte. AVX-512 compresses are used for all these widths if available and are several times faster. For boolean compress, BMI2's pext instruction is used if possible, and alternatives are several times slower.

The BMI2 and table-based methods make up the relatively flat part of graph where compress speed is plotted against density. Except for the boolean case, which writes each output word once, they read units from the arguments and perform the same instructions on each one, so that any performance difference comes from differing write locations. Writes overlap more on the left, and the slight tilt down here is because this uses less overall bandwidth. At the left edge there are two sparse methods, a branchless sparse one that has a more gentle slope and mostly appears for 8-byte data, and a branching sparse method used for much lower densities. At the right, a grouped method, which is also branching, is used.

General Replicate uses methods based on scans for smaller densities, and a simple loop for larger ones. Replicating by a constant factor uses many methods including lookup tables for permutations. We get a rather drastic performance drop from 64 to 65 for boolean replication; parts of the method we use for smaller numbers break down when there's no longer one result group starting point per word, but there are probably workarounds that would get similar performance.

The same concerns for Replicate apply to Indices, except that the output type is determined by the input length (or rather, the index of its last nonzero element), so that 1-byte and 8-byte types aren't as relevant. The grouped method from compress isn't yet used for boolean Indices.

Indices inverse is somewhat related to Indices, but also to Group and Bins. The main idea is just to increment a count for each element of 𝕩, although there are some additional improvements. For small ranges, like 10 below, counts are computed by comparing to all possible values and adding instead. And for long enough arguments (particularly 1-byte ints), a table that fits all possible values for the type is used to avoid needing an explicit range check. Incrementing is not a particularly fast operation, and it also slows down if values repeat often because this creates dependencies between an increment and the previous one at the same location.