Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Python Dict – An Explorable Explanation (just-taking-a-ride.com)
78 points by akbarnama on June 18, 2023 | hide | past | favorite | 10 comments


Tangentially related, but I like how readable the source code of CPython (the official interpreter for Python) is. It is plain C without many shenanigans. If you wonder how Python handles big integer, hashmaps, and so on, just dive in the source code!

A few days ago, I wanted a concrete example about dynamic arrays. The examples in computer science textbook always uses a factor of 2 when expanding, since it is very convenient to demonstrate that it gives an amortized O(1) insertion time. Some people find it wasteful, and may resort to use a constant expansion (e.g. 128) instead, leading to a O(n / 128), i.e. O(n) insertion time instead.

But you can just make the expansion factor less than 2. I wanted a concrete example, so, naturally, I just went for CPython's implementation of `list`. It took me less than 2 minutes to find it: https://github.com/python/cpython/blob/main/Objects/listobje....

If you're curious, CPython essentially grows its dynamic array by ⅛ (12.5%) whenever it needs to.


I don't know if I can call 'new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;' very readable...


By C standards that’s perfectly fine. No crazy macros, just plain old bit-twiddling. The and-not combination is a very traditional pattern for bit masking.


As pavlov said, it's readable enough to many (those that understand bitfields).

It's literally current size + (current size/8) + fiddle

( the fiddle is there to prevent a fail to increase for small input (exercise for reader?) )

followed by a mask to ensure a multiple of 8.

The rational would be no floating point ops and all the weirdness that can entail plus the speed of raw shift | addition | mask.

You're correct that it deserves a twice over as the fiddle raises an eyebrow that invites investigation across inputs.


What a fantastic post and explanation, big kudos to the author. I recently went from knowing next to nothing about hashtables, to implementing my own in C using simple separate chaining, and was surprised to find out that it's around twice as slow as an equivalent CPython `dict` benchmark, while also only being around 20% slower than an equivalent Golang `map` benchmark.


For inserting and looking up 10,000,000 integer values with string keys I get 6.4s for both Python and SBCL and 5.2 seconds for C++ (std::string, unordered_map).

Python's dict implementation (in C) is good and the interpreter loop overhead is dwarfed by the dict overhead.

SBCL's implementation is in Lisp and achieves the same speed.

(For the general case of Python speed this benchmark is of course useless, since raw dicts written in C cannot be optimized much. Nevertheless it is sometimes quoted as "evidence" that "Python is almost as fast as C++" ...)


In such an environment my implementation achieves a runtime of around 5.8s while Python3 manages 5.2s.

41% of the runtime of my impl. is spent on resizing (19 times) + finally freeing (once) the hashtable.

It's interesting that in your testing unordered_map managed a faster runtime than Python. I've attributed my implementation's defeat to the fact that it uses separate chaining while CPython uses open addressing (+ a bunch of other clever optimizations...), but seeing as unordered_map also uses separate chaining (as mandated by the C++ standard, apparently) and it managed a win in your testing, I'm interested to see how they do things.


For those who learn better from videos, here's a 37 minute Pycon 2017 talk that covers all of the major aspects of Python dictionaries including how they became ordered and compacted, how object dictionaries share keys, why full hashes are stored, how the probing sequence works, when resizing occurs, etc.

https://www.youtube.com/watch?v=npw4s1QTmPg


Excellently explained and visualised. Thanks to the author for this.


This was a great read. Very understandable and useful!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: