Bins benchmarks

Implementation notes

Bins is well optimized in most cases, the major exception being when 𝕨 is larger than the L2 cache. The graph below shows its performance for long 𝕩. This allows preprocessing steps like building a table to be effective, which is what the 1-byte and 2-byte cases do (lookups are constant time; the lift at the right comes from the cost of building the table). The general case for 4-byte and 8-byte data uses a branchless binary search with multiple searches interleaved. Integer types use AVX2 vector binary searches for small 𝕨, allowing an entire register from 𝕩 to be searched in parallel. Both kinds of binary search show a characteristic stepped pattern, since the number of steps increases only when the search size crosses a power of two.

And graphs for shorter 𝕩 arrays, showing when table-based methods become effective. The jaggedness at the left is caused by interleaving: there's one loop that interleaves 4 searches at a time, followed by one that does just 1, so multiples of 4 are the most efficient.

Larger elements rely on binary searches, except the 10-element 4-byte search. A large 𝕨 array can't be kept in cache, making each access slow. This cost can be mitigated by partitioning, which isn't implemented yet.