If you have ever read an algorithms textbook, you know about the handful of sorting algorithms that run in `O(n*log(n))`

time. These include quicksort, heapsort, and mergesort. Under the hood, Python’s `List.sort`

function uses yet another one called Timsort.

That’s not the point of this post. The point of this post is show you how you can use a partial heapsort to outperform all of the other sorts when you only need to know the top or bottom `k`

values of a list.

def naive_find_top_k(values, k=20): """ Given a list of values, find the highest k values by sorting the entire list first. This is O(n*log(n)) """ values.sort(reverse=True) return values[:k] import heapq def heap_find_top_k(values, k=20): """ Given a list of values, convert it into a heap in-place, then extract the top k values from the heap. This is O(n + k*log(n)) """ return heapq.nlargest(k, values)

The first function in the above code snippet is `O(n*log(n))`

, but the second one is `O(n + k*log(n))`

when extracting the top `k`

values or returning the `k`

-th highest value. To understand why heaps allow for this speedup, you need to understand a lot more than is covered here about heaps. I encourage anyone that is curious to check out my book **Fast Python** and read the chapter on sorting. In the book, we go even deeper into the mechanics of heaps by building one from scratch on a vanilla Python list.

To give you a teaser about how heaps work in Python, see the expanded version of `heap_find_top_k`

below. This function does the same thing, but accesses Python’s `heapq`

library at a lower level.

def heap_find_top_k_expanded(values, k=20): """ Given a list of values, convert it into a heap in-place, then extract the top k values from the heap. This is O(n + k*log(n)) """ _heap = [] for v in values: heapq.heappush(_heap, -v) top_k = [] for i in range(k): top_k.append(-heapq.heappop(_heap)) return top_k

Below are the code profiles for `naive_find_top_k`

and `heap_find_top_k`

for many values of `n`

and `k=20`

.

As you can see, the heapsort variant is faster around `n=500`

, and the difference gets more and more significant as `n`

increases.

Юрий Соколов says

June 12, 2020 at 8:16 amIf result order is not important, then QuickSelect could make it in O(n) regardless to k value. If important, then with result sorting will be O(n+k*log k) .

https://en.m.wikipedia.org/wiki/Quickselect