Bins is well optimized in most cases, the major exception being when 𝕨
is larger than the L2 cache. The graph below shows its performance for long 𝕩
. This allows preprocessing steps like building a table to be effective, which is what the 1-byte and 2-byte cases do (lookups are constant time; the lift at the right comes from the cost of building the table). The general case for 4-byte and 8-byte data uses a branchless binary search with multiple searches interleaved. Integer types use AVX2 vector binary searches for small 𝕨
, allowing an entire register from 𝕩
to be searched in parallel. Both kinds of binary search show a characteristic stepped pattern, since the number of steps increases only when the search size crosses a power of two.
And graphs for shorter 𝕩
arrays, showing when table-based methods become effective. The jaggedness at the left is caused by interleaving: there's one loop that interleaves 4 searches at a time, followed by one that does just 1, so multiples of 4 are the most efficient.
Larger elements rely on binary searches, except the 10-element 4-byte search. A large 𝕨
array can't be kept in cache, making each access slow. This cost can be mitigated by partitioning, which isn't implemented yet.