Replicate and Indices benchmarks

Implementation notes

CBQN's boolean Replicate functions (bool/𝕩, "compress", and /bool, "where") are heavily optimized, and its non-boolean cases somewhat less so. It chooses algorithms based on the density +´÷≠ of the replication counts, and for the boolean case, the density of switches (+´÷≠)≠˝˘2↕c. Note that Replicate is generally faster than Select because it can use vectorized methods instead of reading single values from memory.

For compress and where, there are optimizations for several SIMD levels, with AVX2+BMI2 shown here. Methods other than compress of boolean mostly use lookup tables; AVX2 is only used for 4-byte and 8-byte data, as the very old SSSE3 is equally effective for 1- and 2-byte. AVX-512 compresses are used for all these widths if available and are several times faster. For boolean compress, BMI2's pext instruction is used if possible, and alternatives are several times slower.

The BMI2 and table-based methods make up the relatively flat part of graph where compress speed is plotted against density. Except for the boolean case, which writes each output word once, they read units from the arguments and perform the same instructions on each one, so that any performance difference comes from differing write locations. Writes overlap more on the left, and the slight tilt down here is because this uses less overall bandwidth. At the left edge there are two sparse methods, a branchless sparse one that has a more gentle slope and mostly appears for 8-byte data, and a branching sparse method used for much lower densities. At the right, a grouped method, which is also branching, is used.

General Replicate uses methods based on scans for smaller densities, and a simple loop for larger ones. Replicating by a constant factor uses many methods including lookup tables for permutations. For booleans, odd replication factors from 64 to 256 are somewhat worse on AVX2, roughly because propagating bits is harder when a result word doesn't have to contain any boundaries between argument bits.

The same concerns for Replicate apply to Indices, except that the output type is determined by the input length (or rather, the index of its last nonzero element), so that 1-byte and 8-byte types aren't as relevant. The grouped method from compress isn't yet used for boolean Indices.

Indices inverse

Indices inverse is somewhat related to Indices, but also to Group and Bins. The main idea is just to increment a count for each element of 𝕩, but there are various improvements. For large-range arguments, several tricks are used to keep from initializing a result with a type that's too large. Conversely, for long enough arguments of a small type, a table that fits all possible values is allocated to avoid a range-checking pass.

SIMD alternatives to incrementing are valuable because it's never very fast, and slows down with duplicates since an increment has a dependency on the previous one at the same location. One is enabled by checking the range in blocks. If it's small (10 in the graphs below), counts are computed by comparing to all possible values and summing in SIMD.

Adjacent equal elements are counted at the same time to switch to a method with SIMD run identification (like /»⊸≠𝕩) if there are many of them, giving the corners around average run length 2 and range/length 0.9 below. For sorted arguments there is also a galloping search that appears most clearly in long small-range arrays above.