The dyadic arithmetic functions are `+-×÷⋆√⌊⌈|¬∧∨<>≠=≤≥`. There are also monadic arithmetic functions, but they're mostly easy to optimize. Arithmetic with Table, leading-axis extension, and Rank is also covered below.
IEEE defines the float value `¯0`. But to make sure integer-valued floats can be consistently optimized as integers, it's best to treat it identically to 0 (this is much easier than trying to not produce `¯0`s, as a negative number times 0 is `¯0`). To convert a number to 0 if it's `¯0`, just add 0. This needs to be done in `÷` for `𝕩`, in `⋆` for `𝕨`, and in `√` for both arguments if it's defined separately. Also in `•math.Atan2` for both arguments if defined.
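As a minimal C sketch (assuming a C implementation; `fix_negzero` is an illustrative name), the whole normalization is a single addition:

```c
#include <math.h>

/* Normalize -0.0 to +0.0: under IEEE 754 round-to-nearest,
   -0.0 + +0.0 is +0.0, while x + 0.0 leaves every other value
   (including NaN) unchanged. */
static double fix_negzero(double x) {
    return x + 0.0;
}
```

This would be applied to `𝕩` before `÷`, to `𝕨` before `⋆`, and so on, per the list above.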
Many arithmetic functions give boolean results when both arguments are boolean. Because there are only 16 possible functions like this, they overlap a lot. Here's a categorization:
↗️
```
    f ← ∧‿∨‿<‿>‿≠‿=‿≤‿≥‿+‿-‿×‿÷‿⋆‿√‿⌊‿⌈‿|‿¬
    bt ← {⥊𝕏⌜˜↕2}¨f
    ∧⌾(⌽¨⌾(⊏˘)) bt (⍷∘⊣≍˘⊐⊸⊔)○((∧´∘∊⟜(↕2)¨bt)⊸/) f
┌─
╵ ⟨ 0 1 0 0 ⟩ ⟨ < | ⟩
  ⟨ 0 0 1 0 ⟩ ⟨ > ⟩
  ⟨ 0 1 1 0 ⟩ ⟨ ≠ ⟩
  ⟨ 0 0 0 1 ⟩ ⟨ ∧ × ⌊ ⟩
  ⟨ 1 0 0 1 ⟩ ⟨ = ⟩
  ⟨ 0 1 0 1 ⟩ ⟨ √ ⟩
  ⟨ 1 1 0 1 ⟩ ⟨ ≤ ⟩
  ⟨ 1 0 1 1 ⟩ ⟨ ≥ ⋆ ⟩
  ⟨ 0 1 1 1 ⟩ ⟨ ∨ ⌈ ⟩
┘
```
Some functions have fast implementations when one argument is boolean. The only ones that really matter are `×`/`∧`, which can be implemented with a bitmask, and `∨`, which changes the other argument to 1 when the boolean argument is 1 and otherwise leaves it alone. `⋆` is `∨⟜¬` when `𝕩` is boolean.
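A scalar C sketch of the `×` and `∨` kernels (a minimal sketch; real code would apply the same masks across SIMD vectors, and the names are illustrative):

```c
#include <stdint.h>

/* x × b (also x ∧ b) with boolean b: keep x when b is 1, else 0.
   -b is 0x00 or 0xFF, so this is a bitmask. */
static int8_t mul_bool(int8_t x, int8_t b) {
    return (int8_t)(x & -b);
}

/* x ∨ b with boolean b: force the result to 1 when b is 1,
   otherwise leave x alone.  b-1 is 0xFF exactly when b is 0. */
static int8_t or_bool(int8_t x, int8_t b) {
    return (int8_t)((x & (b - 1)) | b);
}
```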
A function of an atom and a boolean array, or a monadic function on a boolean array, can be implemented by a lookup performed with (preferably SIMD) bitmasking.
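For a monadic function on a bit-packed boolean array, the lookup has only two entries, so every result element is a blend of `f(0)` and `f(1)`. A scalar C sketch (a SIMD version would broadcast each bit to a lane mask; `lookup_bool` is an illustrative name):

```c
#include <stdint.h>
#include <stddef.h>

/* Apply a monadic function to a bit-packed boolean array as a
   two-entry lookup: broadcast each bit to a byte mask m, then
   blend the precomputed results f0 = f(0) and f1 = f(1). */
static void lookup_bool(int8_t *r, const uint64_t *bits, size_t n,
                        int8_t f0, int8_t f1) {
    for (size_t i = 0; i < n; i++) {
        int8_t m = (int8_t)-(int8_t)((bits[i / 64] >> (i % 64)) & 1);
        r[i] = (int8_t)((f1 & m) | (f0 & ~m));   /* m is 0x00 or 0xFF */
    }
}
```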
Several cases where either one argument is an atom, or both arguments match, have a trivial result. Either the result value is constant, or it matches the argument.
| Constant (matched arguments) | Constant (atom argument) | Identity      |
|------------------------------|--------------------------|---------------|
|                              |                          | `a+0`         |
| `a-a`                        |                          | `a-0`         |
| `a¬a`                        |                          | `a¬1`         |
|                              | `a×0` \*                 | `a×1`         |
|                              | `a∨1`                    | `a∨0`         |
| `a÷a` \*                     | `0÷a` \*                 | `a÷1`         |
|                              | `a⋆0`, `1⋆a`             | `a⋆1`         |
|                              | `¯∞⌊a`, `∞⌈a`            | `¯∞⌈a`, `∞⌊a` |
| `a>a` etc.                   |                          | `a⌊a`, `a⌈a`  |
None of the constant column entries work for NaNs, except `a⋆0` and `1⋆a`, which really are always 1. Starred entries have some values of `a` that result in NaN instead of the expected constant: 0 for division and ∞ for multiplication. This means that constant-result `÷` always requires checking for NaN while the other entries work for integers without a check.
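A C sketch of the check this implies for constant-result `÷` (`div_self_is_const_one` is an illustrative name):

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* a÷a is the constant 1.0 unless some element is 0 (0÷0 is NaN) or
   already NaN; either forces the slow path.  Integer arguments only
   need the zero test. */
static bool div_self_is_const_one(const double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] == 0.0 || isnan(a[i])) return false;
    return true;
}
```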
Division, integer division, and Modulus by an atom (`a÷n`, `a⌊∘÷n`, `n|a`) are subject to many optimizations depending on the value of `n`. For smaller integer types, using SIMD code with 32-bit floats is also fast: see below.

If `p` and `q` are integers and `p` is small enough, then the floor division `⌊p÷q` gives an exact result. This isn't always the case for floats: for example `1 ⌊∘÷ (÷9)+1e¯17` rounds up to 9 when an exact floor-divide would round down to 8. It means that `q|p` is exactly `p-q×⌊p÷q`, a faster computation.
In particular, if `p` is a 16-bit or smaller integer and `q` is any integer, then the floor division `⌊p÷q` and modulus `q|p` can be computed exactly in single (32-bit float) precision. If `p` is a 32-bit integer these can be computed exactly in double precision.
We will prove the case with 32-bit `p` in double precision. If `q` is large, say it has absolute value greater than `2⋆32`, then `(|p÷q)<1`, meaning the correct result is `¯1` if the quotient is negative and 0 otherwise. If `p` is nonzero, then `(|p÷q) ≥ ÷|q`, which is greater than the smallest subnormal for any `q`, and so the division won't round to 0. So suppose `q` is small. If `p` is a multiple of `q`, say `p=k×q`, then `k≤○|p` so `k` is exactly representable as a double and the division is exact. If `p` is not an exact multiple, say `p=(k×q)-o` with nonzero `o<q`, then write it as `k×(q-o÷k)`, and the quotient is `k×1-o÷k×q`, or `k×1-o÷p+o`. For this number to round up to `k` it would have to be larger than `k×1-÷2⋆52`, but `p+o` is less than `p+q`, which is much less than `2⋆52`, so the rounded division result is less than `k`, giving a floor of `k-1`.
While they can be viewed as special cases of the nested rank discussed in the next section, Table and leading-axis extension are easier to analyze, and are the most common forms. To avoid some tedium with shapes, we'll consider a result shape `m‿n`, by assuming `𝕨` is a list with length `m`, and `𝕩` is either a list with length `n`, for Table, or a shape `m‿n` array, for leading-axis extension.
With these definitions, `⥊𝕨𝔽⌜𝕩` is `(n/𝕨) 𝔽 (m×n)⥊𝕩`, which can make use of a fast constant replicate. An ideal Table implementation is not actually to compute these expanded arguments in full, but to pick a unit size `k≤m` so that `k×n` is small but not too small. Then precompute `(k×n)⥊𝕩`, and work on `𝕨` in chunks of length `k`. For example, expand each chunk with `k/` into the result, then apply the function in place with the saved `(k×n)⥊𝕩`. Of course, if the ideal `k` is 1, then a scalar-vector operation into the result works just as well.
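A C sketch of this chunked Table scheme for `+` on doubles (the chunk size and names are illustrative; real code would tune `k` and call the interpreter's own replicate and arithmetic kernels):

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* r gets ⥊𝕨+⌜𝕩: pick chunk size k, precompute tile=(k×n)⥊𝕩 once,
   then for each chunk expand the 𝕨 values with k/ into the result
   and add the tile in place. */
static void table_add(double *r, const double *w, size_t m,
                      const double *x, size_t n) {
    size_t k = m < 8 ? m : 8;            /* unit size: k×n small but not tiny */
    double *tile = malloc(k * n * sizeof *tile);
    for (size_t j = 0; j < k; j++)       /* tile = (k×n)⥊𝕩 */
        memcpy(tile + j * n, x, n * sizeof *x);
    for (size_t i0 = 0; i0 < m; i0 += k) {
        size_t kc = m - i0 < k ? m - i0 : k;   /* short tail chunk */
        double *rc = r + i0 * n;
        for (size_t j = 0; j < kc; j++)  /* expand 𝕨 chunk with kc/ */
            for (size_t c = 0; c < n; c++)
                rc[j * n + c] = w[i0 + j];
        for (size_t t = 0; t < kc * n; t++)
            rc[t] += tile[t];            /* apply + in place with the tile */
    }
    free(tile);
}
```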
Leading-axis extension is similar: `⥊𝕨𝔽𝕩` is `(n/𝕨) 𝔽 ⥊𝕩`, so the same strategy works, with minor modifications. Instead of `(k×n)⥊𝕩`, a new chunk of length `k×n` from `𝕩` is needed at each step. And if `k` is 1, the base case is vector-vector instead of scalar-vector.
Table also admits faster overflow checking for well-behaved functions like `+-¬×`: all combinations of `𝕨` and `𝕩` will be used, so there's an overflow exactly if the extreme values would overflow. The range of `𝕨+⌜𝕩` is `𝕨+○(⌊´)𝕩` to `𝕨+○(⌈´)𝕩`, and similarly for `𝕨-⌜𝕩` but swapping the max and min of `𝕩`. For `𝕨×⌜𝕩` all four combinations of min and max need to be checked.
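A C sketch of this range check for a 16-bit `+⌜` (illustrative names; if the check passes, the whole table can be computed in the small type with no per-element overflow test):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Overflow pre-check for 𝕨+⌜𝕩 on 16-bit arguments: every pairing
   occurs, so it suffices to test the extreme sums
   𝕨+○(⌊´)𝕩 and 𝕨+○(⌈´)𝕩 in a wider type. */
static bool table_add_fits_i16(const int16_t *w, size_t m,
                               const int16_t *x, size_t n) {
    int32_t wmin = w[0], wmax = w[0], xmin = x[0], xmax = x[0];
    for (size_t i = 1; i < m; i++) {
        if (w[i] < wmin) wmin = w[i];
        if (w[i] > wmax) wmax = w[i];
    }
    for (size_t i = 1; i < n; i++) {
        if (x[i] < xmin) xmin = x[i];
        if (x[i] > xmax) xmax = x[i];
    }
    return wmin + xmin >= INT16_MIN && wmax + xmax <= INT16_MAX;
}
```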
Dyadic arithmetic can be applied to various combinations of axes with leading-axis extension, Table (`⌜`), and the Cells (`˘`) and Rank (`⎉`) modifiers. Cells is of course `⎉¯1`, and Table is `⎉0‿∞`, so the general case is arithmetic applied with the Rank operator any number of times.
An application of Rank is best described in terms of its frame ranks, not cell ranks. If `a` and `b` are these ranks, then there are `a⌊b` shared frame axes and `a(⌈-⌊)b` nonshared axes from the argument with the higher-rank frame. Repeating on rank applications from the outside in, with a final rank 0 inherent in the arithmetic itself, the two sets of argument axes are partitioned into sets of shared and nonshared axes. For example, consider `+⌜˘⎉4‿3` on arguments of ranks 6 and 8.
- `⎉4‿3` implies frame ranks of 6-4=2 and 8-3=5. This gives 2 shared axes (label 0 below), then 3 nonshared axes from `𝕩` (label 1).
- `˘` takes a shared axis (label 2).
- `⌜` takes all axes from `𝕨` nonshared (label 3), then all from `𝕩` (label 4).

| axis | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|------|---|---|---|---|---|---|---|---|
| `𝕨`  | 0 | 0 | 2 | 3 | 3 | 3 |   |   |
| `𝕩`  | 0 | 0 | 1 | 1 | 1 | 2 | 4 | 4 |
An axis set behaves like a single axis, and any adjacent axis sets of the same type (for example, nonshared from `𝕨`) can be combined. Length-1 axes can be ignored as well, so a simplification pass like the one for Transpose might make sense.
Then the implementation needs to expand this mapping quickly. As usual, long axes are easy and short axes are harder. It's best to take enough result axes that a cell is not too small, then use a slow outer loop (for example, based on index lists) to call that inner loop. One result axis can be split into blocks if the cell size would be too small without it, but too big with it.
Because a result cell should be much larger than a cache line, there's no need for the outer loop to traverse these cells in order—that is, the axes can be moved around to iterate in a transposed order. An easy way to do this is to represent each result axis with a length, stride in the result, and stride in both arguments; axes represented this way can be freely rearranged. It's best to iterate over shared axes first (outermost), then nonshared axes, because a nonshared axis means cells in the other argument are repeated, and if it's placed last then no other cells will be used between those repeated accesses. If both arguments have nonshared axes then a blocked order that keeps the repeated cells in cache might be best, but it's complicated to implement.
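A C sketch of this representation (names are illustrative): each result axis records a length plus strides into the result and both arguments, with stride 0 encoding a repeated (nonshared) argument, and the axis entries can be reordered freely before the loop nest runs:

```c
#include <stddef.h>

/* One result axis: length, and element strides in the result, 𝕨, and 𝕩.
   A zero argument stride means that argument's cells repeat along
   this axis. */
typedef struct { size_t len; size_t rs, ws, xs; } Axis;

/* Outer loops recurse over axes; the innermost axis is the
   vector-vector base case (here + on doubles). */
static void nd_add(double *r, const double *w, const double *x,
                   const Axis *ax, size_t nax) {
    for (size_t i = 0; i < ax->len; i++) {
        if (nax == 1)
            r[i * ax->rs] = w[i * ax->ws] + x[i * ax->xs];
        else
            nd_add(r + i * ax->rs, w + i * ax->ws, x + i * ax->xs,
                   ax + 1, nax - 1);
    }
}
```

With axes `{2,3,1,0}` and `{3,1,0,1}` this computes a 2×3 Table; swapping the two `Axis` entries traverses the same result in transposed order, as described above.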
The scalar-vector and vector-vector cases work as base cases on one axis, and Table and leading-axis extension are two two-axis cases. There's also a flipped Table `˜⌜˜` and a trailing-axis case like `⎉1`. Representing each result axis with "w" if it comes from `𝕨` only, "x" for `𝕩` only, and "wx" for shared, we can organize these cases: Table is w-x, flipped Table is x-w, leading-axis extension is wx-x, trailing-axis is x-wx, and swapping the arguments of the last two gives wx-w and w-wx. That's six two-axis combinations; the remaining three possibilities w-w, x-x, and wx-wx are simplifiable. Trailing-axis agreement takes half of Table like leading-axis agreement does, but it's the reshaping half instead of the replicate half.
The general case is to expand the arguments with Replicate along various axes so that they have the same shape as the result, and then use vector-vector arithmetic. More concretely, insert a length-1 axis into the argument for each nonshared axis in the other argument, then replicate these axes to the required length. This can be done one axis at a time from the bottom up, or by constructing an array of indices and applying them all at once, and probably other ways as well. See also high-rank Replicate.
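A C sketch of the axis-at-a-time expansion (illustrative name): viewing the argument as outer×1×inner, replicating the inserted length-1 axis to length `l` is a row-copy loop:

```c
#include <string.h>
#include <stddef.h>

/* Expand one length-1 axis to length l: the source is viewed as
   shape outer×1×inner, and each inner-sized row is copied l times.
   Applying this once per nonshared axis brings an argument to the
   full result shape, after which plain vector-vector arithmetic
   applies. */
static void replicate_axis(double *dst, const double *src,
                           size_t outer, size_t l, size_t inner) {
    for (size_t o = 0; o < outer; o++)
        for (size_t j = 0; j < l; j++)
            memcpy(dst + (o * l + j) * inner, src + o * inner,
                   inner * sizeof *src);
}
```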