Implementation of random stuff

BQN's random number utilities are provided by system functions and include some with non-obvious implementations. In the text below, rand represents any random number generator: •rand, or a result of •MakeRand.

Random number generation

CBQN is currently using wyrand, part of the wyhash library. It's extremely fast, passes the expected test suites, and no one's raised any concerns about it yet that I know of. It uses only 64 bits of state and doesn't have extra features like jump ahead.
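For reference, here's roughly what the wyrand step looks like in C. This is a sketch from memory of the wyhash sources; check the library itself for the authoritative constants and details.

```c
#include <stdint.h>

// Sketch of wyrand: 64 bits of state advanced by a fixed odd constant,
// then mixed with a 64x64 -> 128-bit multiply and an xor of the halves.
// Constants here are from memory of the wyhash sources.
static uint64_t wyrand_step(uint64_t *state) {
  *state += 0xa0761d6478bd642full;
  __uint128_t m = (__uint128_t)(*state ^ 0xe7037ed1a0b428dbull) * *state;
  return (uint64_t)m ^ (uint64_t)(m >> 64);
}
```

The entire mixing step is that one widening multiply, which is a big part of the speed; __uint128_t assumes GCC or Clang on a 64-bit target.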

Other choices are xoshiro++ and PCG. The authors of these algorithms (co-author for xoshiro) hate each other very much and have spent quite some time slinging mud at each other. As far as I can tell they both have the normal small bias in favor of their own algorithms but are wildly unfair towards the other side, choosing misleading examples and inflating minor issues. I think both generators are good but find the case for xoshiro a little more convincing, and I think it's done better in third-party benchmarks.

Shuffling

rand.Deal shuffles the natural numbers ↕𝕩. To shuffle arbitrary values an extra selection is needed, which adds a little overhead for small values but can be more than double the cost for large ones (depending on available algorithms). So it may also make sense to have a rand.Shuffle defined as {(𝕨 Deal ≠𝕩) ⊏ 𝕩}.

For small sizes the Knuth shuffle works great. Initialize the whole result at the start; don't try anything fancy since it's only going to add branching. For large sizes CBQN uses one 256-way or smaller split as described below.
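As a point of reference, the plain version for a permutation of ↕n looks like this. It's a sketch: rand_range stands in for whatever bounded-random helper is used, not CBQN's actual interface.

```c
#include <stdint.h>
#include <stddef.h>

// Stand-in for a bounded random helper: uniform integer in [0, bound).
extern uint64_t rand_range(uint64_t *state, uint64_t bound);

// Knuth (Fisher-Yates) shuffle producing a random permutation of 0..n-1:
// initialize the whole result up front, then swap each position with a
// uniformly chosen position at or after it.
static void deal_small(uint32_t *r, size_t n, uint64_t *state) {
  for (size_t i = 0; i < n; i++) r[i] = (uint32_t)i;
  for (size_t i = 0; i + 1 < n; i++) {
    size_t j = i + (size_t)rand_range(state, n - i);  // j uniform in [i, n)
    uint32_t t = r[i]; r[i] = r[j]; r[j] = t;
  }
}
```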

Knuth shuffle makes random accesses by design, so if the result doesn't fit in cache there's going to be a serious slowdown. Sorting-based strategies mitigate this: the two methods I've found in the literature are Rao-Sandelius, like quicksort, and MergeShuffle, like mergesort. A typical implementation of Rao-Sandelius sends each value to a random half in-place by swapping, and then the halves are shuffled. In fact, Sandelius described a decimal method with ten partitions, but his target hardware was… actual physical card decks, which don't have the same implementation issues present on CPUs. In MergeShuffle, halves are shuffled then merged, but the required merging technique is more sophisticated. However, I've found that the SIMD implementation presented by the authors isn't actually that fast (once corrected), and branchlessly iterating over the set bits of a random word works better.
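To make the splitting side concrete before getting to the merge, here's a sketch of an in-place binary Rao-Sandelius-style split. It's illustrative only; rand_u64 stands in for the generator, and recursion and the small-size cutoff are left to the caller.

```c
#include <stdint.h>
#include <stddef.h>

extern uint64_t rand_u64(uint64_t *state);  // stand-in: next 64 random bits

// Send each element to the left or right half with a fair coin flip, in
// place by swapping; each half is then shuffled (recursively, or with a
// Knuth shuffle once it fits in cache). Pulling bits one at a time from a
// 64-bit word amortizes the generator calls.
static size_t rs_split(uint32_t *a, size_t n, uint64_t *state) {
  size_t left = 0;                       // a[0..left) is the left half so far
  uint64_t bits = 0; int avail = 0;
  for (size_t i = 0; i < n; i++) {
    if (avail == 0) { bits = rand_u64(state); avail = 64; }
    int go_left = bits & 1; bits >>= 1; avail--;
    if (go_left) {                       // swap a[i] into the left block
      uint32_t t = a[left]; a[left] = a[i]; a[i] = t;
      left++;
    }
  }
  return left;                           // boundary between the halves
}
```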

MergeShuffle initially follows the naive strategy of choosing from one half or the other at random, with a single bit. On its own this is biased: the actual chance of selecting from a list should be proportional to the number of elements remaining in it. But at any given point the elements that have been chosen are shuffled, assuming the initial lists were. This is true by symmetry because merge selections are made independently with the same probability, so given the number of selections from each side, any ordering is equally likely. Then MergeShuffle stops when it would try to get an element from an empty list, and shuffles the remaining elements into the result one at a time.
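Here's a scalar sketch of that merge step (not the vectorized version; rand_u64 and rand_range are stand-in helpers, not CBQN's interface).

```c
#include <stdint.h>
#include <stddef.h>

extern uint64_t rand_u64(uint64_t *state);                    // stand-in: 64 random bits
extern uint64_t rand_range(uint64_t *state, uint64_t bound);  // stand-in: uniform in [0, bound)

// Random in-place merge of two adjacent shuffled ranges a[0..mid) and
// a[mid..n). One random bit picks the side of the next output element;
// when a side runs out, the leftovers are inserted one at a time at
// uniformly random positions.
static void merge_shuffled(uint32_t *a, size_t mid, size_t n, uint64_t *state) {
  size_t i = 0, j = mid;    // a[i..j) = left remainder, a[j..n) = right remainder
  uint64_t bits = 0; int avail = 0;
  for (;;) {
    if (avail == 0) { bits = rand_u64(state); avail = 64; }
    int take_right = bits & 1; bits >>= 1; avail--;
    if (take_right) {
      if (j == n) break;                                  // right side exhausted
      uint32_t t = a[i]; a[i] = a[j]; a[j] = t; j++;      // pull a right element forward
    } else {
      if (i == j) break;                                  // left side exhausted
    }
    i++;
  }
  for (; i < n; i++) {                                    // insert the leftovers
    size_t k = (size_t)rand_range(state, i + 1);          // position in [0, i]
    uint32_t t = a[i]; a[i] = a[k]; a[k] = t;
  }
}
```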

The same argument applies to multi-way merging, and more obliquely to splitting. Splitting is like radix sorting, which would typically require counting partition sizes. But it also works to make fixed-size partitions in the result, and stop when one runs out. The partitions have gaps between them but can be packed together by moving them as they're shuffled. Multi-way can be several times faster than binary because it does fewer moves—binary needs one per bit.
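A sketch of a 256-way split with fixed-capacity buckets follows; the capacity choice, leftover handling, and later packing are assumptions about one reasonable arrangement, not CBQN's exact code.

```c
#include <stdint.h>
#include <stddef.h>

extern uint64_t rand_u64(uint64_t *state);  // stand-in: 64 random bits

// 256-way split into fixed-capacity buckets laid out in one buffer,
// stopping as soon as any bucket fills up. buf needs 256*cap slots, and
// cap should be comfortably above n/256. Returns how many elements were
// distributed; the caller shuffles each bucket, packs the buckets together
// to close the gaps, and mixes the remaining src[i..n) in one at a time.
static size_t split256(const uint32_t *src, size_t n,
                       uint32_t *buf, size_t cap,
                       size_t *len,                       // len[256]: fill of each bucket
                       uint64_t *state) {
  for (size_t b = 0; b < 256; b++) len[b] = 0;
  size_t i = 0;
  for (; i < n; i++) {
    size_t b = (size_t)(rand_u64(state) & 0xFF);          // random bucket, 8 bits
    if (len[b] == cap) break;                             // this bucket is full: stop
    buf[b*cap + len[b]++] = src[i];
  }
  return i;
}
```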

Merging or splitting with two partitions is special in that it can be implemented in place with swapping; I don't think other numbers support this in general. But in Deal the first round of splitting has an implicit argument ↕𝕩, so the split can be performed directly into result space with only an index and ending position for each partition. Merging or splitting each of these partitions would take a fraction of the space, but one 256-way split seems fine for CBQN as it only runs into cache trouble at sizes in the billions.

Simple random sample

A simple random sample from a set is a subset with a specified size, chosen so that each subset of that size has equal probability. rand.Deal gets a sample of size 𝕨 from the set ↕𝕩 with elements in a uniformly random order, and rand.Subset does the same but sorts the elements.

Deal uses a Knuth shuffle, stopping after the first 𝕨 elements have been shuffled, as the algorithm won't touch them again. Usually it creates ↕𝕩 explicitly for this purpose, but if 𝕨 is very small then initializing it is too slow. In this case we initialize ↕𝕨, but use a "hash" table with an identity hash—the numbers are already random—for 𝕨↓↕𝕩. The default is that every value in the table is equal to its key, so that only entries where a swap has happened need to be stored. The hash table is the same design as for self-search functions, with open addressing and linear probing.
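Here's a sketch of that scheme; table sizing and growth are omitted, and rand_range is a stand-in helper rather than CBQN's interface.

```c
#include <stdint.h>
#include <stddef.h>

extern uint64_t rand_range(uint64_t *state, uint64_t bound);  // stand-in: uniform in [0, bound)

// Open-addressed table with linear probing and an identity "hash" (the
// keys are already random indices). A missing key behaves as if it maps
// to itself. EMPTY marks unused slots; mask+1 is the power-of-two table
// size, which needs to be comfortably larger than w.
#define EMPTY UINT64_MAX
typedef struct { uint64_t key, val; } slot;

static uint64_t exchange(slot *tab, size_t mask, uint64_t key, uint64_t val) {
  size_t h = (size_t)key & mask;
  while (tab[h].key != EMPTY && tab[h].key != key) h = (h + 1) & mask;
  uint64_t old = (tab[h].key == key) ? tab[h].val : key;  // default: identity
  tab[h].key = key; tab[h].val = val;
  return old;
}

// Partial Knuth shuffle for small w: positions below w live in the result
// array r (initialized to 0..w-1); positions at or above w exist only in
// the table, and only once they've been swapped into.
static void deal_sparse(uint64_t *r, uint64_t w, uint64_t n,
                        slot *tab, size_t mask, uint64_t *state) {
  for (size_t t = 0; t <= mask; t++) tab[t].key = EMPTY;
  for (uint64_t i = 0; i < w; i++) r[i] = i;
  for (uint64_t i = 0; i < w; i++) {
    uint64_t j = i + rand_range(state, n - i);               // swap target in [i, n)
    if (j < w) { uint64_t t = r[i]; r[i] = r[j]; r[j] = t; }
    else       { r[i] = exchange(tab, mask, j, r[i]); }
  }
}
```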

Subset uses Floyd's method, which is sort of a modification of shuffling where only the selected elements need to be stored, not what they were swapped with. This requires a lookup structure that can be updated efficiently and output all elements in sorted order. The choices are a bitset for large 𝕨 and another not-really-hash table for small 𝕨. The table uses a right shift—that is, division by a power of two—as a hash so that hashing preserves the ordering, and inserts like an insertion sort: any larger entries are pushed forward. Really this is an online sorting algorithm that works because we know the input distribution is well-behaved (it degrades to quadratic performance only in very unlikely cases). When 𝕨>𝕩÷2, we always use a bitset, but select 𝕩-𝕨 elements and invert the selection.
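A sketch of Floyd's method with the bitset as the lookup structure (the large-𝕨 branch; the shift-hashed sorted table isn't shown, and rand_range is a stand-in helper):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

extern uint64_t rand_range(uint64_t *state, uint64_t bound);  // stand-in: uniform in [0, bound)

// Floyd's method with a bitset as the lookup structure: choose k distinct
// values from 0..n-1 so that every k-element subset is equally likely.
// Scanning the bitset afterwards reads the sample out in sorted order.
static void subset_floyd(uint64_t *bits, uint64_t k, uint64_t n, uint64_t *state) {
  memset(bits, 0, ((size_t)(n + 63) / 64) * sizeof(uint64_t));
  for (uint64_t j = n - k; j < n; j++) {
    uint64_t t = rand_range(state, j + 1);                   // uniform in [0, j]
    uint64_t pick = (bits[t >> 6] >> (t & 63)) & 1 ? j : t;  // t already chosen? take j
    bits[pick >> 6] |= 1ull << (pick & 63);
  }
}
```

The trick is that at step j, the value j can't already be in the set, so redirecting a collision to j keeps every outcome equally likely.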

I'm aware of algorithms like Vitter's Method D that generate a sorted sample in order by drawing the gaps between selected elements from the appropriate distribution. There are a few reasons why I prefer Floyd's method. It's faster, because it uses one random generation per element while Vitter's requires several, and it does less branching. It's exact, in that if it's given uniformly random numbers then it makes a uniformly random choice of sample. Vitter's method depends on floating-point randomness and irrational functions, so it can't accomplish this with finite-precision random inputs. And the pattern of requests for Floyd's method is simple and deterministic. The advantage of sequential algorithms like Vitter's is that they can generate the sample one element at a time using a small fixed amount of memory. •MakeRand only allows the user to request a sample all at once, so this advantage doesn't matter as much. The CBQN algorithms are tuned to use much more temporary memory than the size of the final result. It could be lowered, but there's usually plenty of temporary memory available.