Clever optimizations to finding the smallest/largest k elements in Ordering

Guava's Ordering API [1] is a Comparator that offers additional fluent functionality (it very well could have been called FluentComparator!).  Ordering allows you to perform partial sorting [2] (more specifically, find the smallest or largest k elements from an Iterator/Iterable containing n elements) using Ordering#leastOf and Ordering#greatestOf. We investigated several different optimizations to these methods:

1. The most straightforward approach is to sort the elements and then return a sublist. However, this requires O(n log n) time (to sort), plus O(n) space! If you’re trying to apply this operation to a large iterator (which won’t all fit in memory at once), this approach won’t work. However, if your k is large, we actually fall back to this approach.

2. Another relatively straightforward approach is to use a PriorityQueue. With this approach, you simply add elements to the queue, and remove the largest (or smallest) element every time you exceed k elements. This approach requires O(n log k) time and only O(k) space.

3. A slightly more complicated approach relies on the fact that you can find the median of a list and partition around it in linear time. We use the following algorithm: maintain a buffer of size 2k. Every time the buffer gets full, we find the median and partition around it, keeping only the lowest k elements.  This requires n/k find-median-and-partition steps, each of which take O(k) time with a traditional quickselect. The final output buffer is then sorted, which adds an additional O(k log k) time. In total, this approach requires only O(n + k log k) time and O(k) space. Excellent!

Cheers,
-+Kurt Alfred Kluever, Guavian

[1] http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Ordering.html
[2] http://en.wikipedia.org/wiki/Partial_sorting
Shared publiclyView activity