Python Dict – An Explorable Explanation

qsantos · on June 18, 2023

Tangentially related, but I like how readable the source code of CPython (the official interpreter for Python) is. It is plain C without many shenanigans. If you wonder how Python handles big integer, hashmaps, and so on, just dive in the source code!

A few days ago, I wanted a concrete example about dynamic arrays. The examples in computer science textbook always uses a factor of 2 when expanding, since it is very convenient to demonstrate that it gives an amortized O(1) insertion time. Some people find it wasteful, and may resort to use a constant expansion (e.g. 128) instead, leading to a O(n / 128), i.e. O(n) insertion time instead.

But you can just make the expansion factor less than 2. I wanted a concrete example, so, naturally, I just went for CPython's implementation of `list`. It took me less than 2 minutes to find it: https://github.com/python/cpython/blob/main/Objects/listobje....

If you're curious, CPython essentially grows its dynamic array by ⅛ (12.5%) whenever it needs to.

Yajirobe · on June 18, 2023

I don't know if I can call 'new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;' very readable...

pavlov · on June 18, 2023

By C standards that’s perfectly fine. No crazy macros, just plain old bit-twiddling. The and-not combination is a very traditional pattern for bit masking.

defrost · on June 18, 2023

As pavlov said, it's readable enough to many (those that understand bitfields).

It's literally current size + (current size/8) + fiddle

( the fiddle is there to prevent a fail to increase for small input (exercise for reader?) )

followed by a mask to ensure a multiple of 8.

The rational would be no floating point ops and all the weirdness that can entail plus the speed of raw shift | addition | mask.

You're correct that it deserves a twice over as the fiddle raises an eyebrow that invites investigation across inputs.

bool3max · on June 18, 2023

What a fantastic post and explanation, big kudos to the author. I recently went from knowing next to nothing about hashtables, to implementing my own in C using simple separate chaining, and was surprised to find out that it's around twice as slow as an equivalent CPython `dict` benchmark, while also only being around 20% slower than an equivalent Golang `map` benchmark.

ztraxc · on June 18, 2023

For inserting and looking up 10,000,000 integer values with string keys I get 6.4s for both Python and SBCL and 5.2 seconds for C++ (std::string, unordered_map).

Python's dict implementation (in C) is good and the interpreter loop overhead is dwarfed by the dict overhead.

SBCL's implementation is in Lisp and achieves the same speed.

(For the general case of Python speed this benchmark is of course useless, since raw dicts written in C cannot be optimized much. Nevertheless it is sometimes quoted as "evidence" that "Python is almost as fast as C++" ...)

bool3max · on June 18, 2023

In such an environment my implementation achieves a runtime of around 5.8s while Python3 manages 5.2s.

41% of the runtime of my impl. is spent on resizing (19 times) + finally freeing (once) the hashtable.

It's interesting that in your testing unordered_map managed a faster runtime than Python. I've attributed my implementation's defeat to the fact that it uses separate chaining while CPython uses open addressing (+ a bunch of other clever optimizations...), but seeing as unordered_map also uses separate chaining (as mandated by the C++ standard, apparently) and it managed a win in your testing, I'm interested to see how they do things.

raymondh · on June 18, 2023

For those who learn better from videos, here's a 37 minute Pycon 2017 talk that covers all of the major aspects of Python dictionaries including how they became ordered and compacted, how object dictionaries share keys, why full hashes are stored, how the probing sequence works, when resizing occurs, etc.

https://www.youtube.com/watch?v=npw4s1QTmPg

joeman1000 · on June 18, 2023

Excellently explained and visualised. Thanks to the author for this.

bloopernova · on June 18, 2023

This was a great read. Very understandable and useful!