Another approach is to simply pick elements randomly from the collection and keep track of which elements have already been picked. If you hit an element that has already been picked, you simply pick another one. This works as long as the entropy source is good enough to pick numbers in the range with equal probability, so that you don't get stuck repeatedly drawing numbers that are already used. If K is much smaller than N this works well, but when K is close to N you end up doing far too many retries to be reasonable as soon as the result starts filling up.
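As a minimal sketch of that retry approach (my own illustration, assuming the same java.util.Random field named random as in the snippets below, and k <= n):

public <T> List<T> takeByRetry(List<T> source, int k) {
    int n = source.size();
    List<T> result = new ArrayList<T>(k);
    Set<Integer> seen = new HashSet<Integer>();
    while (result.size() < k) {
        int off = random.nextInt(n);
        // add() returns false if this index was already picked; just redraw then
        if (seen.add(off)) {
            result.add(source.get(off));
        }
    }
    return result;
}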
Instead, you could use the original Fisher-Yates strike-out approach and simply stop after K elements. A naive implementation of this runs in O(K*(log(K) + K)), which simplifies to O(K^2):
public <T> List<T> take(List<T> source, int k) {
    int n = source.size();
    if (k > n) {
        throw new IllegalStateException(
            "Can not take " + k +
            " elements from a list with " +
            n + " elements");
    }
    List<T> result = new ArrayList<T>(k);
    SortedSet<Integer> offsets = new TreeSet<Integer>();
    for (int i = 0; i < k; i++) {           // O(K)
        int off = random.nextInt(n - i);
        for (int offset : offsets) {        // O(K)
            if (off >= offset) {
                off++;
            } else {
                break;
            }
        }
        offsets.add(off);                   // O(log(K))
        result.add(source.get(off));
    }
    return result;
}
If K is on the order of sqrt(N) or larger, the plain shuffle (which costs O(N)) is better, but when K is smaller than sqrt(N) this approach is the winner, since then K^2 < N.
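For reference, the plain shuffle baseline could look something like this (a rough sketch of my own, not tuned code): copy the list, shuffle the whole copy in O(N), and keep the first k elements.

public <T> List<T> takeByFullShuffle(List<T> source, int k) {
    List<T> copy = new ArrayList<T>(source);      // O(N) copy
    Collections.shuffle(copy, random);            // O(N) Fisher-Yates shuffle
    return new ArrayList<T>(copy.subList(0, k));  // keep the first k elements
}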
I wonder if it's possible to do it faster than O(K^2). One idea is to use some sort of binary search to find the actual offset more quickly, without having to iterate over the list of offsets. That would make the algorithm O(K*log(K)), which would be much better.
I experimented a bit with repeatedly running a binary search over the used offsets until the offset can not be incremented any further. If N is large and K is small, this often stops after the first iteration since the used offsets are sparse, but you can still easily construct worst-case scenarios that need O(K) iterations.
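To sketch the idea (this is just an illustration of the resolution step, not the exact code I benchmarked): keep the used offsets in a sorted list and iterate off = r + |{used <= off}| until it stops changing, with each count done by a binary search.

// Map a draw r in [0, n - i) to the r-th unused index, given the used
// offsets in ascending order. Each count is O(log(K)), but the loop can
// still run O(K) times in the worst case.
int resolve(List<Integer> sortedUsed, int r) {
    int off = r;
    while (true) {
        int pos = Collections.binarySearch(sortedUsed, off);
        int countAtOrBelow = (pos >= 0) ? pos + 1 : -(pos + 1);
        int next = r + countAtOrBelow;
        if (next == off) {
            return off;
        }
        off = next;
    }
}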
Does anyone know any good articles / papers / blogs about algorithms for picking K distinct random numbers in the range 1..N?
Update:
After having eaten lunch, I figured out how to make it run in O(K):
The idea is, instead of recalculating off to the actual position as in Fisher-Yates, to just assume that the random index is correct. Then use the map to check whether the index has already been used, and if so, use another guaranteed-free index instead.
public <T> List<T> take(List<T> source, int k) {
    int n = source.size();
    if (k > n) {
        throw new IllegalStateException(
            "Can not take " + k +
            " elements from a list with " + n +
            " elements");
    }
    List<T> result = new ArrayList<T>(k);
    Map<Integer,Integer> used = new HashMap<Integer,Integer>();
    int metric = 0;
    for (int i = 0; i < k; i++) {
        int off = random.nextInt(n - i);
        while (true) {
            metric++;
            // Claim off, and redirect future hits of it to n - i - 1, the
            // index that is about to drop out of the drawable range [0, n - i).
            Integer redirect = used.put(off, n - i - 1);
            if (redirect == null) {
                break;
            }
            // off was already used; follow its redirect instead.
            off = redirect;
        }
        result.add(source.get(off));
    }
    assert metric <= 2*k;
    return result;
}
The while loop may look bad, but it never runs more than 2*K times in total.
(I haven't proven this formally, I just ran a lot of benchmarks on random input.)
If I were to prove it though, my general strategy would be that the loop is entered exactly K times, and you will follow at most K links, since only K links will be constructed and no link is followed more than once.
Update 2:
Apparently Robert Floyd already invented this algorithm, or at least a very similar one; it was published in Communications of the ACM, September 1987, Volume 30, Number 9.
public <T> List<T> take(List<T> source, int k) {
    int n = source.size();
    List<T> result = new ArrayList<T>(k);
    Set<Integer> used = new HashSet<Integer>();
    for (int i = n - k; i < n; i++) {
        int off = random.nextInt(i + 1);
        if (!used.add(off)) {
            off = i;
            used.add(off);
        }
        result.add(source.get(off));
    }
    return result;
}
The only problem with this is that the ordering of result is not random, only which elements have been chosen. This is easily fixed by running Collections.shuffle(result, random) though.
I see two significant differences with this approach.
The first (and the less relevant of the two) is the amount of entropy used.
If we assume that calling random.nextInt(n) consumes log_2(n) bits of entropy then my algorithm uses: log_2(n) + log_2(n-1) + log_2(n-2) + ... + log_2(n - k + 1) = log_2(n! / (n - k)!) = log_2(n!) - log_2((n - k)!) bits of entropy.
Robert Floyd's algorithm with an additional plain shuffle uses:
log_2(n!) - log_2((n - k)!) + log_2(k!) bits of entropy which may be significant if k is large compared to n, and if entropy is expensive where the algorithm is used.
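As a rough worked example (my own arithmetic, just to give a feel for the sizes, assuming the idealised log_2(n) bits per nextInt(n) call as above): with n = 52 and k = 5, choosing the set costs log_2(52 * 51 * 50 * 49 * 48) ≈ 28.2 bits either way, and the extra Collections.shuffle of the 5-element result adds log_2(5!) = log_2(120) ≈ 6.9 bits, roughly a 25% increase.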
The second difference is much more important. The Robert Floyd algorithm can't be run interactively, picking one or more numbers at a time. You need to know the value of k (or an upper bound of k) before requesting a random number, since the first number chosen by the algorithm will be in the range [0, n - k].
My algorithm on the other hand can be run partially and resumed without knowing the value of k at all.
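To make that concrete, here is a sketch of how the state could be wrapped up so indices are handed out one at a time (the class and method names are just illustrative, not code from above):

class IncrementalSampler {
    private final Random random;
    private final int n;
    private final Map<Integer,Integer> used = new HashMap<Integer,Integer>();
    private int taken = 0;

    IncrementalSampler(Random random, int n) {
        this.random = random;
        this.n = n;
    }

    // Returns the next distinct index in [0, n); k never needs to be known.
    int next() {
        if (taken >= n) {
            throw new IllegalStateException("All " + n + " indices are taken");
        }
        int off = random.nextInt(n - taken);
        while (true) {
            // Same redirect trick as in take() above, with i replaced by taken.
            Integer redirect = used.put(off, n - taken - 1);
            if (redirect == null) {
                break;
            }
            off = redirect;
        }
        taken++;
        return off;
    }
}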