Chris Conlan

Financial Data Scientist

  • About
  • Blog
    • Business Management
    • Programming with Python
    • Programming with R
    • Automated Trading
    • 3D Technology and Virtual Reality
  • Books
    • The Financial Data Playbook
    • Fast Python
    • Algorithmic Trading with Python
    • The Blender Python API
    • Automated Trading with R
  • Snippets

When to Use Heap Sort to Speed Up your Python Code

June 10, 2020 By Chris Conlan 1 Comment

If you have ever read an algorithms textbook, you know about the handful of sorting algorithms that run in O(n*log(n)) time. These include quicksort, heapsort, and mergesort. Under the hood, Python’s List.sort function uses yet another one called Timsort.

That’s not the point of this post. The point of this post is show you how you can use a partial heapsort to outperform all of the other sorts when you only need to know the top or bottom k values of a list.

def naive_find_top_k(values, k=20):
    """
    Given a list of values, find the highest k values by 
    sorting the entire list first. This is O(n*log(n))
    """
    values.sort(reverse=True)
    return values[:k]


import heapq
def heap_find_top_k(values, k=20):
    """
    Given a list of values, convert it into a heap in-place,
    then extract the top k values from the heap. This is 
    O(n + k*log(n))
    """
    return heapq.nlargest(k, values)

The first function in the above code snippet is O(n*log(n)), but the second one is O(n + k*log(n)) when extracting the top k values or returning the k-th highest value. To understand why heaps allow for this speedup, you need to understand a lot more than is covered here about heaps. I encourage anyone that is curious to check out my book Fast Python and read the chapter on sorting. In the book, we go even deeper into the mechanics of heaps by building one from scratch on a vanilla Python list.

To give you a teaser about how heaps work in Python, see the expanded version of heap_find_top_k below. This function does the same thing, but accesses Python’s heapq library at a lower level.

def heap_find_top_k_expanded(values, k=20):
    """
    Given a list of values, convert it into a heap in-place,
    then extract the  top k values from the heap. This is 
    O(n + k*log(n))
    """
    _heap = []
    for v in values:
        heapq.heappush(_heap, -v)

    top_k = []
    for i in range(k):
        top_k.append(-heapq.heappop(_heap))

    return top_k

Below are the code profiles for naive_find_top_k and heap_find_top_k for many values of n and k=20.

As you can see, the heapsort variant is faster around n=500, and the difference gets more and more significant as n increases.

Filed Under: Programming with Python, Snippets

Comments

  1. Юрий Соколов says

    June 12, 2020 at 8:16 am

    If result order is not important, then QuickSelect could make it in O(n) regardless to k value. If important, then with result sorting will be O(n+k*log k) .
    https://en.m.wikipedia.org/wiki/Quickselect

    Reply

Leave a Reply Cancel reply

Fast Python: Master the Basics to Write Faster Code

Fast Python: Master the Basics to Write Faster Code

Available for purchase at Amazon.com.

Featured Posts: Python Programming

Pulling All Sorts of Financial Data in Python [Updated for 2021]

Fastest Way to Flatten a List in Python

70+ Code Profiles of Common Python Algorithms

Topics

  • 3D Technology and Virtual Reality (8)
  • Automated Trading (9)
  • Business Management (9)
  • Chris Conlan Blog (5)
  • Computer Vision (2)
  • Programming with Python (16)
  • Programming with R (6)
  • Snippets (8)
  • Email
  • LinkedIn
  • RSS
  • YouTube

Copyright © 2022 · Enterprise Pro Theme On Log in