Skip Lists: A Probabilistic Alternative to Balanced Trees

BrandonM · on June 27, 2008

I've used a skiplist before to implement the "Six Degrees of Kevin Bacon" program in C++. It was easier to implement than a balanced tree, and it had similar performance, so it was pretty interesting and worthwhile. It took about 30 seconds to read in over 100K actors (if I remember correctly), but after that, searches were very fast.

There are several negatives, however. One is that skiplists use about (lg n)/2 times as many pointers as a balanced tree, if you actually make your skiplist tall enough to be worthwhile. This could mean an extra MB or more of memory for a skiplist with 100K items in it. In some cases it could be that half of the memory consumed by the skiplist is the overhead involved in creating the skiplist. This is also true of hash tables that don't have many elements, of course, but with hash tables, the situation gets better as more items are added. With skiplists, the overhead remains high regardless of the number of items in the list.

Another "problem" is that of standard libraries. Even C has a tsearch function (in search.h) that implements balanced search trees, and many other languages include balanced search trees as well. Balanced search trees have less overhead, both in terms of memory and performance. With a skiplist, you have to choose a few random numbers (on average) for every item added to the tree. Then with searches, you always have to start at the top (the "skippiest" part) with comparisons, working your way down towards the bottom until you figure out which node to jump to. With BSTs, it's just a single comparison and then a movement left or right.

And of course, if your items are hashable, skiplists really lose out to hash tables, which have near-constant lookup time, very little overhead as the number of items in your table approaches the capacity, and implementations in nearly every language.

ajross · on June 27, 2008

I think your size computation is wrong, no? A skip list is a doubly-linked list at each level, so two pointers per node at level zero (i.e. same as a tree). Each higher level has (statistically) half as many entries, so you have to multiply by the sum of the 1/2^N series, which is 2. So on average a skip list takes 4 pointers per node, which is twice as much as a tree. It's just a constant factor, there's no dependence on set size.

(It's true that balanced trees take a little more than two pointer per node, but e.g. red-black trees only need one extra bit, which you can pack into the bottom of one of the pointers if you like).

On the whole, skip lists are cute, but not really "better" than a red-black or AVL tree. And the balanced trees aren't that hard to implement, even if you decide to do your own. Add that to the fact that tree stuctures in code are kinda esoteric (almost always, you just want to be able to fetch an item by identity. Range comparisons are pretty rare in memory -- that's what databases are for) and skip lists leave me a little ... meh.

BrandonM · on June 27, 2008

Yes, the size could be computed that way. However, I was going by the original paper in my implementation. This meant that I had to first estimate how many items would be in the tree, then take the base-2-log of that (in order to get O(lg n) performance), and then make each node allocate that many pointers. Of course, with most nodes, most of these pointers are uninitialized and unused.

In my comment, I was referring to my (and the reference) implementation. After reading your comment, though, you're right that you could allocate a number of pointers equal to the height of the specific node. The only downside is that you then have two mallocs for each new node (once for the node and once for its array of pointers), unless you do something hackish like putting the array of pointers at the end of the struct and manually adjusting your malloc call to allocate the right amount.

One other note: I don't think my skiplist was doubly-linked. The search algorithm involves moving forward on the highest level until the move would take you past the desired element, then you move down a level and continue. By the time you hit the bottom, you're guaranteed to either hit the element you're looking for, or your position is the node before where the element would be if it was in the list. Thus double-linking is entirely unnecessary.

silentbicycle · on June 27, 2008

A constant time, yes, but that constant time may be significantly higher than expected. There are quite a few factors that can reduce hash tables efficiency.

This post summarizes several potential problems with hash tables: http://enfranchisedmind.com/blog/2008/02/25/problems-with-ha...

BrandonM · on June 27, 2008

Okay, but which is slower, hashing a string and then looking into an array indexed at that position (possibly traversing a very short (2- or 3-item) list), or performing an average of 17 string comparisons (for the case of ~130,000 items)?

In particular, the hashing approach gives similar lookup times for any item. The skiplist approach is dependent on the (random) structure of the built-up skiplist. A given item could be found after 2 pointer traversals, or it could take 30. Also disturbing is that as you get closer to the item you're looking for, the string comparisons take longer and longer (because the common prefixes are getting longer and longer).

With a good hashing algorithm, similar strings will hash to different values. This means that even if 2 or 3 string comparisons have to be made (for strings that land in the same bucket), failing comparisons will be fast since they will likely fail on the first character.

silentbicycle · on June 27, 2008

The lookup would remain O(1) fast, which will overtake O(log N) skiplist lookup, true. The insertion and other costs differ between the two as well, though. A hash table would be clearly best in a case where the approx. final table size needed is known ahead of time, most data is inserted upfront, and characteristics of the data are known (for tuning the hashing function).

Also, good point about string comparisons getting longer among close nodes.

Personally, I think skiplists are most interesting because they rely on randomness to make the underlying structure efficient. I wonder what over data structures that could apply to.

johnm · on June 27, 2008

Joe Bob says: check out Radix/Crit-Bit/Patricia Trees http://en.wikipedia.org/wiki/Radix_tree

huhtenberg · on June 27, 2008

I remember using a skiplist few years ago and I implemented it so that its memory usage averaged to just two pointers and no ancillary information per node. This is as small as it gets when it comes to the binary search trees and alike.

In other words the memory usage is not really an issue with skiplists.

BrandonM · on June 27, 2008

Interesting... now that I think of it, it's not necessary for a node to know how high it is, because that information is available in the traversal algorithm (the node is guaranteed to be encountered at the level which represents its height). Using a log-2 dice roll (as the algorithm specifies), you will get nodes with an average height of 2, and with only forward pointers, that does work out to 2 pointers per node.

I may have to revisit my implementation and do it "right".

ntoshev · on June 27, 2008

It is a cool structure, but not a practical one because it is very cache-unfriendly. I'm surprised Wikipedia doesn't mention this.

johnm · on June 27, 2008

In terms of cache friendly and massively scalable, check out Cliff's work on non-blocking hashtables: http://blogs.azulsystems.com/cliff/2007/03/a_nonblocking_h.h...

gaius · on June 27, 2008

There are cases in which your dataset is larger than the cache, y'know.

seregine · on June 27, 2008

That is the most common reason for using a cache in the first place, so you seem to be stating the obvious. Am I missing something?

If I remember right, skip lists are unfriendly to caches (compared to balanced search trees) because they don't optimize locality of reference. This matters, for example, when your huge dataset is on disk and you're trying to cache the working set in memory. In a skip list, related elements don't end up on the same page as often, which means you have to read more data than you need from disk, and read from the disk more often.

gaius · on June 27, 2008

Well yes, that is true, but if you happen to know upfront that you're going to be accessing your data randomly, and that for whatever reason you can't cache it all, then you should optimize for that case. If on the other hand, you know that you're going to have hotspots then you can size your cache accordingly. Choosing before you know (or have had a chance to observe) is a premature optimization.

johnm · on June 27, 2008

But that's another argument of where skip list are not appropriate. Or do you have a real example?

gaius · on June 27, 2008

It is simply a matter of your cache hit ratio. If that is very low then you have two choices - increase the size of your cache, or use an algorithm that is less reliant on caching for predictable performance. The size of your cache is a pure price/performance calculation, "is it worth spending X to improve performance by Y%".

By all means stick to caching/LRU/B*tree if that works for your application. A cache is a brute-force solution, tho', and it helps to have more than one tool in your bag.

johnm · on June 27, 2008

You seem to be talking about something different than the skiplists and their effects on the cache. Unless you're building the actual chipsets, you don't get really any say in the amount and style of caching that your skiplist code is running on.

Skiplists end up with a ton of pointer chasing and hence dramatically increase the pressure on the memory cache. There's no real, profitable way around that fact.

You might want to look at the non-block hashtable work that I pointed to previously for a lot of information (and the actual code (in Java) is up on SF.net so you can actually try it out for yourself).

dreish · on June 27, 2008

But by increasing the number of pointers by log n, you reduce your cache hit ratio, do you not? It would seem that the only time this doesn't matter is when you've long since blown past the CPU cache and your RAM cache hit ratio is going to be in the <5% range, with almost every access to the data structure resulting in disk accesses.

gaius · on June 27, 2008

Yes that is the case to which I am referring. I am often working with multi-terabyte datasets.

silentbicycle · on June 27, 2008

I meant to link to William Pugh's original paper when posting, but submit doesn't seem to take FTP links: ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf

alexstaubo · on June 27, 2008

Skip lists are one of my favourite data structures. They're slower than balanced trees, but ridiculously simple to implement.