Commentary on the best methods I know for implementing various primitives. Often there are many algorithms that are viable in different situations, and in these cases I try to discuss the tradeoffs.
Raw speed is of course the most important factor; I also consider predictability and memory usage to be important. Predictability mostly for the benefit of the programmer, but it's also important when there are multiple algorithms to be able to compute which one will be fastest. In some cases an algorithm is best on some subset of inputs, but is effectively useless because it's too difficult to tell if the input falls in that set. Whitney and other K users sometimes profess that binary size is critically important; I don't believe this.
CBQN uses 1-byte, 2-byte, and 4-byte integers, 8-byte floats, and bit booleans. The various advantages of these types are discussed here. We optimize for x86 and 64-bit ARM with vector extensions, focusing on AVX2 rather than AVX-512 because of the slow rollout of these wider instructions.
It's difficult to know when better algorithms exist. I try to research the state of the art and to estimate how much room for improvement there is, but it will come as no surprise if I've completely overlooked something that's simple in hindsight, or postulated a lower bound that can be beaten with a clever trick.