
**Clever optimizations for finding the smallest/largest k elements in Ordering**

Guava's *Ordering* API [1] is a *Comparator* that offers additional fluent functionality (it very well could have been called *FluentComparator*!). Ordering allows you to perform *partial sorting* [2] (more specifically, find the smallest or largest *k* elements from an Iterator/Iterable containing *n* elements) using *Ordering#leastOf* and *Ordering#greatestOf*. We investigated several different optimizations for these methods:

1. The most straightforward approach is to sort the elements and then return a sublist. However, this requires *O(n log n)* time (to sort), plus *O(n)* space! If you're trying to apply this operation to a large iterator (which won't all fit in memory at once), this approach won't work. However, if your *k* is large, we actually fall back to this approach.

2. Another relatively straightforward approach is to use a *PriorityQueue*. With this approach, you simply add elements to the queue, and remove the largest (or smallest) element every time you exceed *k* elements. This approach requires *O(n log k)* time and only *O(k)* space.

3. A slightly more complicated approach relies on the fact that you can find the median of a list and partition around it in linear time. We use the following algorithm: maintain a buffer of size *2k*. Every time the buffer gets full, we find the median and partition around it, keeping only the lowest *k* elements. This requires *n/k* find-median-and-partition steps, each of which takes *O(k)* time with a traditional quickselect. The final output buffer is then sorted, which adds an additional *O(k log k)* time. In total, this approach requires only *O(n + k log k)* time and *O(k)* space. Excellent!
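To make approach (1) concrete, here's a minimal plain-Java sketch (the class and method names here are mine for illustration, not Guava's):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedLeastK {
  /** Returns the k smallest elements in ascending order by fully sorting a copy. */
  static List<Integer> leastOf(List<Integer> input, int k) {
    List<Integer> copy = new ArrayList<>(input); // O(n) extra space
    Collections.sort(copy);                      // O(n log n) time
    return copy.subList(0, Math.min(k, copy.size()));
  }
}
```

Simple, but note it needs the whole input in memory at once, which is exactly why it only makes sense as the large-*k* fallback.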
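Approach (2) can be sketched with the JDK's *PriorityQueue* — an illustrative stand-in, not Guava's actual code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class HeapLeastK {
  /** Returns the k smallest elements in ascending order using a bounded max-heap. */
  static List<Integer> leastOf(Iterable<Integer> input, int k) {
    // Max-heap holding the k smallest elements seen so far (O(k) space).
    PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
    for (int x : input) {
      heap.offer(x);   // O(log k)
      if (heap.size() > k) {
        heap.poll();   // evict the largest of the k+1 candidates, O(log k)
      }
    }
    List<Integer> result = new ArrayList<>(heap);
    Collections.sort(result); // present the k survivors in ascending order
    return result;
  }
}
```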
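Approach (3), the buffer-of-*2k* scheme described above, might look roughly like this in plain Java (a simplified sketch, not Guava's implementation; assumes k ≥ 1 and int elements for brevity):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class BufferLeastK {
  private static final Random RND = new Random();

  /** Returns the k smallest elements in ascending order, using only O(k) extra space (assumes k >= 1). */
  static List<Integer> leastOf(Iterable<Integer> input, int k) {
    int[] buf = new int[2 * k];
    int size = 0;
    for (int x : input) {
      buf[size++] = x;
      if (size == 2 * k) {
        // Buffer full: rearrange so the k smallest occupy buf[0..k), discard the rest.
        quickselect(buf, 0, size - 1, k);
        size = k;
      }
    }
    if (size > k) { // handle a partially full buffer at the end of the stream
      quickselect(buf, 0, size - 1, k);
      size = k;
    }
    Arrays.sort(buf, 0, size); // final O(k log k) sort of the survivors
    List<Integer> out = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      out.add(buf[i]);
    }
    return out;
  }

  /** Rearranges a[lo..hi] so that the k smallest elements occupy indices [0..k). */
  private static void quickselect(int[] a, int lo, int hi, int k) {
    while (lo < hi) {
      int p = partition(a, lo, hi, lo + RND.nextInt(hi - lo + 1));
      if (p == k) {
        return;
      } else if (p < k) {
        lo = p + 1;
      } else {
        hi = p - 1;
      }
    }
  }

  /** Lomuto partition of a[lo..hi] around a[pivotIndex]; returns the pivot's final index. */
  private static int partition(int[] a, int lo, int hi, int pivotIndex) {
    int pivot = a[pivotIndex];
    swap(a, pivotIndex, hi);
    int store = lo;
    for (int i = lo; i < hi; i++) {
      if (a[i] < pivot) {
        swap(a, i, store++);
      }
    }
    swap(a, store, hi);
    return store;
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
  }
}
```

Each time the buffer fills with *2k* elements, one expected-*O(k)* quickselect pass discards *k* of them, so the stream's *n* elements cost *O(n)* in total before the final *O(k log k)* sort.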

Cheers,

- +Kurt Alfred Kluever, Guavian

[1] http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Ordering.html

[2] http://en.wikipedia.org/wiki/Partial_sorting

- This is cool and I'm glad you considered some alternatives. Just curious: did you do performance testing of (2) vs. (3)? I'd expect the constant factors associated with (3) to be higher than those of (2), and in practice I'd expect O(n log k) to be not much different from O(n) for many common values of k (in this context). (Feb 27, 2013)

- +Joshua O'Madadhain: Great question! I had similar concerns during the original code review, but then did some more thinking about it. The worst case for the PQ implementation is actually pretty bad: you need to do a remove and an offer for every element in the stream (i.e., when the stream is in decreasing order):

n(k + log k) // remove + offer for each element in the stream of n elements

+ k log k // remove k elements from the PQ

= O(n*k + ((n + k)*log k))

A Caliper benchmark confirmed that the theoretical worst case using the priority queue is extremely slow.

The benchmarks also show that for completely random input, the PQ solution is *slightly* faster than the quickselect approach, but not by orders of magnitude. We went with something that performs very well for all inputs. (Feb 27, 2013)

- ...of course, if you know the stream is in decreasing order then you just traverse it from the other end (if possible) and take the top k, so either O(n) or O(k), depending. ;) (Granted, you don't know that in general.)

I think you have a small typo in your complexity analysis above: that ought to be n(1 + log k), or possibly n(log k + log k), for the remove/offer (depending on what PQ implementation you're using). But your original complexity looks right. (Feb 27, 2013)

- Great post. I thought the worst-case performance of quickselect was O(n²). (Mar 8, 2013)