If you have ever read an algorithms textbook, you know about the handful of sorting algorithms that run in O(n*log(n))
time. These include quicksort, heapsort, and mergesort. Under the hood, Python’s List.sort
function uses yet another one called Timsort.
That’s not the point of this post. The point of this post is show you how you can use a partial heapsort to outperform all of the other sorts when you only need to know the top or bottom k
values of a list.
def naive_find_top_k(values, k=20): """ Given a list of values, find the highest k values by sorting the entire list first. This is O(n*log(n)) """ values.sort(reverse=True) return values[:k] import heapq def heap_find_top_k(values, k=20): """ Given a list of values, convert it into a heap in-place, then extract the top k values from the heap. This is O(n + k*log(n)) """ return heapq.nlargest(k, values)
The first function in the above code snippet is O(n*log(n))
, but the second one is O(n + k*log(n))
when extracting the top k
values or returning the k
-th highest value. To understand why heaps allow for this speedup, you need to understand a lot more than is covered here about heaps. I encourage anyone that is curious to check out my book Fast Python and read the chapter on sorting. In the book, we go even deeper into the mechanics of heaps by building one from scratch on a vanilla Python list.
To give you a teaser about how heaps work in Python, see the expanded version of heap_find_top_k
below. This function does the same thing, but accesses Python’s heapq
library at a lower level.
def heap_find_top_k_expanded(values, k=20): """ Given a list of values, convert it into a heap in-place, then extract the top k values from the heap. This is O(n + k*log(n)) """ _heap = [] for v in values: heapq.heappush(_heap, -v) top_k = [] for i in range(k): top_k.append(-heapq.heappop(_heap)) return top_k
Below are the code profiles for naive_find_top_k
and heap_find_top_k
for many values of n
and k=20
.
As you can see, the heapsort variant is faster around n=500
, and the difference gets more and more significant as n
increases.
If result order is not important, then QuickSelect could make it in O(n) regardless to k value. If important, then with result sorting will be O(n+k*log k) .
https://en.m.wikipedia.org/wiki/Quickselect