Not all benchmarks are included on this page. Header links go to pages with full benchmarks and sometimes extra explanation.
The benchmarks found in bencharray are taken on a Tiger Lake i5-1135G7 processor using CBQN compiled with `make o3 has='pclmul avx2 bmi2'`. These options are used instead of the usual `make o3n` to compile without AVX-512 support, as this CPU supports it but most consumer hardware doesn't.
The arithmetic functions `+-×⌊⌈`, and comparisons, are standard SIMD functionality. With `+-×` an overflow check is needed, and if it happens a result in a larger type needs to be created. Dyadic `|` has some optimization for integers but is only really fast when `𝕨` is an atom.
Most other primitives, including `÷⋆√` and `⋆⁼`, require conversion to floats, so will ideally run at the same speed for all types. `÷` and monadic `√` have native SIMD support. Libraries to compute others using SIMD exist, but CBQN doesn't use anything like this yet.
Folds and scans are defined as a sequential evaluation, but for operand functions that are associative and commutative, evaluation can be reordered to give efficient SIMD implementations. These include `+-×⌊⌈` on integers, `⌊⌈` on floats, and many boolean functions. Because floating-point `+` (along with `-` and `×`) isn't exactly associative, we have `•math.Sum` to perform a fast sum with unspecified ordering. CBQN currently has special code for common operands but not rarely-used ones like `-`. Uncovered cases run at about 5ns/value because that's the overhead of calling a primitive function.
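Here's a small BQN-level illustration of the reordering (a sketch of the idea, not CBQN's actual kernels; the chunk size 500 is arbitrary): splitting a fold into chunks is exact for an operand like `⌈`, while for floating-point `+` the grouping can change the rounding, which is exactly what `•math.Sum` leaves unspecified.

```
x ← ÷⟜3 ↕1e4                      # Floats whose sums depend on rounding
(⌈´x) = (⌈´ 500↑x) ⌈ ⌈´ 500↓x     # Always 1: ⌈ reassociates exactly
(+´x) = (+´ 500↑x) + +´ 500↓x     # May be 0: grouping changes float rounding
(+´x) = •math.Sum x               # May also be 0, for the same reason
```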
If it had no overhead, Table would have the same per-output timing as scalar-list arithmetic, or sometimes better, as overflow checking can be done purely using the range of both arguments. It's close for large right arguments, but CBQN still does an extra copy for now. For smaller right arguments the limiting operation is a constant Replicate (`/`) on the right argument, or sections of it.
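One way to see the shape of the work, as a BQN identity rather than CBQN's actual data flow: the flattened Table result pairs a constant-Replicated copy of one argument with a cycled copy of the other.

```
w ← 2‿7‿1  ⋄  x ← 10‿20‿30‿40
(⥊ w +⌜ x) ≡ ((≠x)/w) + ((≠w)×≠x) ⥊ x   # 1: Table as Replicate plus a cycled copy
```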
CBQN uses AVX2 for most cases of selection, with shuffles on small `𝕩` sizes and the gather instruction on larger ones (unfortunately, gather is slower than scalar code prior to Skylake in 2015). AVX2 can shuffle 16 bytes or 8 4-byte values in one instruction, giving the various steps as CBQN blends more registers together. The 16KB benchmark shows these small sizes better, while the 1e6-element benchmark is limited by write bandwidth. Selection from a boolean array is usually slower because it requires reading the right byte from memory and then picking out the right bit within it, but for larger `𝕩` it takes up less cache space and can be faster. For a small `𝕩` but large `𝕨`, `𝕩` is temporarily converted to 1-byte values so that it runs at the same speed as 1-byte selection.
The speed of Reshape with a large result should essentially be limited by the amount of memory written. The current method is to copy the argument to the result, then repeatedly double it until it reaches a suitable block size for filling the rest of the result. There are special cases for 1-byte and 8-byte blocks; however, the 8-byte case measures slower than the general case, giving the bumps at the left of the first graph. We've left it in on the grounds that it may be specific to my CPU (in fact, the 1-byte case went from slower to faster when I upgraded), and if not, later compiler and libc improvements could fix it.
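A rough BQN sketch of the doubling strategy (`Fill` is a made-up name, and this ignores the block-size cap described above; CBQN does the real thing in native code):

```
Fill ← {𝕨 𝕊 𝕩: 𝕨 ≤ ≠𝕩 ? 𝕨 ↑ 𝕩 ; 𝕨 𝕊 𝕩: 𝕨 𝕊 𝕩 ∾ 𝕩}   # Double until long enough, then truncate
12 Fill "abc"               # "abcabcabcabc"
(1e4 Fill ↕7) ≡ 1e4 ⥊ ↕7    # 1
```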
Replicate is a complicated family of functions, and CBQN does a lot of work to get consistent worst-case performance and take advantage of special cases: sparse input, and booleans with 1s and 0s mostly grouped together. Except for the grouped case, each algorithm takes some amount of time to process each input value, and some to produce an output value, and CBQN chooses between them based on the density. Note that the graphs for a boolean argument use ns/input and those for the general case use ns/output! In the flat sections for the boolean case, the performance doesn't depend on density at all, usually because the same writes are performed regardless, just overlapping more when the density is smaller.
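To give a sense of the per-output side, the sparse boolean case is roughly indices-then-select: once the few indices are known, the remaining work follows the output rather than the input. This identity shows the shape of that approach, not the exact kernel:

```
x ← "abcdefghij"
w ← 0‿0‿1‿0‿0‿0‿1‿0‿1‿0
(w/x) ≡ (/w) ⊏ x    # 1: for sparse 𝕨, compute the few indices, then select
```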
The benchmarks here are not very complete. The first shows how performance scales with the number of `¯1`s in the argument; when there are a lot, Replicate is used to filter out these values and avoid excessive branching. The second concerns a sorted `𝕨` value, so that the groups are slices from `𝕩`.
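Filtering the `¯1`s is valid because Group ignores them entirely; the Replicate step mentioned above corresponds to this identity (the variable names are just for illustration):

```
w ← 0‿¯1‿2‿¯1‿1‿2‿¯1
x ← "windows"
g ← w ≥ 0                  # Mask of kept positions
(w ⊔ x) ≡ (g/w) ⊔ g/x      # 1: ¯1 entries can be filtered out up front
```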
Searching and sorting are the heaviest of BQN's commonly used functions, performance-wise. Element size is very important, as doubling it often slows things down by more than a factor of two.
For 1- and 2-byte elements, lookup tables usually fit entirely in L1 cache (2-byte tables with large entries can be iffy). Lookup tables are so fast that it doesn't make sense to try for adaptivity on these arguments.
For larger elements, comparison sorting and hash tables are usually needed, and adaptive sorting can be used. CBQN doesn't have particularly fast implementations of these yet. Some cases of 4-byte elements are better handled by radix partitioning plus lookup tables.
Currently CBQN has solid counting and radix sort implementations used for 1 to 4-byte elements, and uses the generic Timsort for 8-byte elements. 4-byte sorting is competitive with the state of the art for random elements but isn't adaptive at all.
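As a BQN-level picture of the counting/bucket idea for 1-byte data (an illustration only; `SortBytes` is a made-up name and CBQN's sorts are native code): shift values into the range 0…255, bucket by that key, and join the buckets in order.

```
SortBytes ← {∾ (128+𝕩) ⊔ 𝕩}        # Bucket by byte value, then join in order
x ← 20‿¯3‿5‿¯3‿127‿¯128‿0
(SortBytes x) ≡ ∧x                 # 1
```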
The x-axis format here is used to display a small searched-in array (`𝕨` for `⊐` and `⊒`, and `𝕩` for `∊`) on the left and a small searched-for array (the other argument) on the right. "Half hits" means that half the values in the searched-for argument are found at some index, and is tough on branchy implementations; depending on the application, all hits could be more relevant.
There is special code for 1- and 2-byte integers, and for a small searched-in argument. SIMD searches would still be faster for some of these cases. Larger types are handled with generic hash table code.
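The lookup-table approach for 1-byte values can be written out in BQN (a sketch only; `MemberOf` is a made-up name, and it assumes the searched-in argument has no repeated values, which `⍷` could ensure):

```
# Membership for 1-byte (¯128…127) integers via a 256-entry table:
# one write per searched-in element, one read per searched-for element.
MemberOf ← {𝕨 𝕊 𝕩: (128+𝕨) ⊏ 1¨⌾((128+𝕩)⊸⊏) 256⥊0}
(3‿¯7‿100 MemberOf ¯7‿5‿6) ≡ 3‿¯7‿100 ∊ ¯7‿5‿6    # 1
```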
CBQN runs fast on all CPU-native types and booleans. General cases use a mix of direct lookup and hash tables, and there's also all-pairs comparison for short arguments and adjacent-pairs for sorted ones. The function Classify (`⊐`) is troublesome because of the way indices depend on the number of unique elements seen so far, a problem that dyadic `⊐` (or `⊐˜`) doesn't have. `⍷` is mostly just implemented as `∊⊸/`, which is hardly slower than `∊`.
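The difficulty with Classify follows from standard BQN identities (definitions, not CBQN's internal code): Classify is Index-of into the deduplicated list, so each result index depends on how many distinct values have appeared so far, while `∊` only needs to know whether a value has been seen before.

```
x ← "mississippi"
(⊐x) ≡ (⍷x) ⊐ x     # 1: Classify is dyadic ⊐ against the deduplicated values
(⍷x) ≡ ∊⊸/ x        # 1: Deduplicate keeps the elements that ∊ marks as firsts
```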