
I have some candidate aspects:

  1. The hash function is important: hash codes should be as close to unique (collision-free) as possible.
  2. The backend data structure is important: search, insert, and delete should all run in O(1) time.
  3. Memory management is important: the per-entry memory overhead should be as small as possible. When the hash table grows, memory should be acquired efficiently, and when it shrinks, memory should be reclaimed efficiently. Aspect 2 should still hold under these memory operations.
  4. If the hash table will be used from multiple threads, it should be thread-safe and still efficient.

My questions are:

  1. Are there any other aspects worth considering?
  2. How should the hash table be designed to fulfill these aspects?
  3. Are there any resources I can refer to?

Many thanks!



After reading some material, I've updated my questions. :)


In a book explaining the source code of the SGI STL, I found some useful information:
  1. The backend data structure is a bucket array of linked lists. When searching for, inserting, or deleting an element in the hash table:
    1. A hash function computes the corresponding position in the bucket array, and the elements are stored in the linked list hanging off that position.
    2. When the number of elements exceeds the number of buckets, the bucket array needs to be resized: it grows to roughly twice the old size, with the new size chosen to be prime. The old buckets and elements are then copied into the new array.
    3. I didn't find any logic for shrinking the bucket array when the number of elements becomes much smaller than the number of buckets. But I think this case should be considered when there are many inserts first and many deletes later.
  2. Other data structures, such as arrays with linear probing or quadratic probing, are not as good as linked lists.
  3. A good hash function can avoid clustering, and double hashing can help resolve clusters.
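The chaining scheme described in point 1 can be sketched roughly like this (a minimal illustration, not the actual SGI STL code: the class and member names are made up, and the bucket count is simply doubled rather than stepped through SGI's prime table):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Minimal separate-chaining hash table in the spirit of the design
// described above: an array of buckets, each holding a linked list
// of key/value pairs, rehashed to a larger array once the element
// count exceeds the bucket count.
template <typename K, typename V>
class ChainedHashTable {
public:
    ChainedHashTable() : buckets_(7), size_(0) {}

    void insert(const K& key, const V& value) {
        if (size_ + 1 > buckets_.size()) rehash(buckets_.size() * 2);
        auto& chain = buckets_[index(key, buckets_.size())];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }  // update in place
        chain.emplace_back(key, value);
        ++size_;
    }

    V* find(const K& key) {
        auto& chain = buckets_[index(key, buckets_.size())];
        for (auto& kv : chain)
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    bool erase(const K& key) {
        auto& chain = buckets_[index(key, buckets_.size())];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); --size_; return true; }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    std::size_t index(const K& key, std::size_t n) const {
        return std::hash<K>{}(key) % n;
    }

    // Full rehash: every element is moved into a freshly sized array.
    void rehash(std::size_t new_count) {
        std::vector<std::list<std::pair<K, V>>> fresh(new_count);
        for (auto& chain : buckets_)
            for (auto& kv : chain)
                fresh[index(kv.first, new_count)].push_back(std::move(kv));
        buckets_.swap(fresh);
    }

    std::vector<std::list<std::pair<K, V>>> buckets_;
    std::size_t size_;
};
```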

The question about multithreading is still open. :D


pengdu
  • One additional point: persistence? Persistent data structures are awesome for getting sharing for free. They are naturally thread-safe (since the blocks are read-only, and mutations create a new structure) but can be downright hairy. I know of no persistent hash table, for example. – Matthieu M. May 09 '11 at 16:32
  • Full thread safety in a low-level data structure is rarely worth it and is almost always a waste of resources. The reason is that only the user of your class really knows their synchronization needs; e.g., it is often necessary for a read and a subsequent write to be one atomic operation. As a real-world example, Java's old `Hashtable` was thread-safe, but the newer `HashMap` is not. – May 09 '11 at 21:50

2 Answers


I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable

The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are:

  • To go from hash to index, use binary AND instead of modulo a prime. This is many times faster. Your table size must be a power of two.
  • For hash collisions don't use a linked list, store the values in the table to improve cache performance.
  • By using a state machine you can get a very fast multi-thread implementation. In his blog entry he lists the states in the state machine, but due to license problems he does not provide source code.
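The first two bullets can be sketched as follows (a hedged illustration, not code from the blog; `index_for`, `Slot`, and `probe_find` are invented names, and the key is used as its own hash value for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hash-to-index via bitwise AND: valid only when `n` is a power of
// two, since then (h & (n - 1)) == (h % n) but costs no division.
inline std::size_t index_for(std::uint64_t h, std::size_t n) {
    return static_cast<std::size_t>(h & (n - 1));
}

// Open addressing with linear probing: on a collision, step to the
// next slot in the table itself instead of following a linked list,
// so a lookup tends to stay within a few cache lines.
struct Slot {
    std::uint64_t key = 0;
    bool used = false;
};

bool probe_find(const std::vector<Slot>& table, std::uint64_t key) {
    std::size_t i = index_for(key, table.size());  // key doubles as hash here
    for (std::size_t step = 0; step < table.size(); ++step) {
        const Slot& s = table[(i + step) & (table.size() - 1)];
        if (!s.used) return false;      // empty slot: key is absent
        if (s.key == key) return true;  // found, stored in the table itself
    }
    return false;  // table full and key not found
}
```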
Peer Sommerlund
  • @Peer: it gave rise to an awesome presentation on GoogleTalk http://video.google.com/videoplay?docid=2139967204534450862# by the way. Still, it is considered good practice on SO to cite or sum up part of what you link to, both because people are then not forced to click through to see what the link is about, and because the link may go dead, at which point your answer becomes irrelevant. Cheers :) – Matthieu M. May 09 '11 at 16:30
  • @Matthieu M: Thank you for telling me about good practice at stackoverflow. I have added a little info about what is at the other end of the link. – Peer Sommerlund May 10 '11 at 04:41
  • @Peer: no problem, I hope you'll enjoy providing answers :) Using the table itself is called Open Addressing (http://en.wikipedia.org/wiki/Hash_table#Open_addressing). Note that Open Addressing is not suitable for large elements; this is not an issue in Java (since a pointer is stored) but can become one in C++. I am afraid I never saw a good benchmark on hash tables (and their various implementations). I guess it's critical enough that companies keep their own implementations private :/ – Matthieu M. May 10 '11 at 06:25
  • @Matthieu M.: I have no idea why Open Addressing is not suitable for large elements; one can store pointers in C++ too. Actually, in the SGI STL it stores K/V pairs, not pointers. I don't know whether there is an efficiency issue; maybe a benchmark can tell. – pengdu May 10 '11 at 08:03
  • @pengdu: the issue with Open Addressing and large elements (really large) is that cache lines are filled with only a few elements. This means that when looking up an item, you risk cache misses more often, because the cache fills up faster. Obviously, you can simply let the user switch to pointers to heap allocated items at her discretion (just provide a note with your implementation indicating that large objects may cause cache misses and suggest the work-around). – Matthieu M. May 10 '11 at 08:20
  • @Matthieu M.: Very helpful, I had never thought of the **cache** problem. But why is "this not an issue in Java (since a pointer is stored)"? In my opinion, if you look up an item via a pointer, cache misses will be more frequent, because the heap-allocated items are unordered. – pengdu May 11 '11 at 01:49
  • @pengdu: You want the table itself in the cache, for fast lookup. If the item you need is not, you can incur the cost of a single cache miss when retrieving it. If you have to incur a cache miss each time you move more than 5/6 entries in your table, it quickly adds up, because access patterns are *random* (more or less) in a hash table. – Matthieu M. May 11 '11 at 07:05

There are two (slightly) orthogonal concerns.

While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:

  • the hash function depends on the data to be stored
  • the backend depends on the requirements of the storage

For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).

For the back-end, there are various concerns, as you noted. Some remarks:

  • Are we talking average or worst-case complexity? Without perfect hashing, achieving O(1) is nigh-impossible as far as I know, though the worst-case frequency and cost can be considerably dampened.
  • Are we talking amortized complexity? Amortized schemes in general offer better throughput at the cost of "spikes". Linear rehashing, at the cost of a slightly lower throughput, will give you a smoother curve.
  • With regard to multi-threading, note that the Read/Write pattern may impact the solution, considering extreme cases, 1 producer and 99 readers is very different from 99 producers and 1 reader. In general writes are harder to parallelize, because they may require modifying the structure. At worst, they might require serialization.
  • Garbage Collection is pretty trivial in the amortized case, with linear-rehashing it's a bit more complicated, but probably the least challenging portion.
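As an illustration of the amortized-vs-linear trade-off above, here is a toy incremental-rehashing set (all names invented, sketch only): it keeps the old and new bucket arrays alive simultaneously and migrates two old buckets per insert, so no single insert pays for a full rehash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

// Incremental ("linear") rehashing sketch for a set of ints. Growth
// does not copy everything at once; instead each subsequent insert
// migrates a couple of buckets from the old array to the new one,
// and lookups consult both arrays until migration finishes.
class IncrementalSet {
public:
    IncrementalSet() : cur_(8), size_(0), migrate_pos_(0) {}

    void insert(int key) {
        step_migration();  // amortize: move a few old buckets first
        if (contains(key)) return;
        cur_[idx(key, cur_.size())].push_back(key);
        ++size_;
        if (old_.empty() && size_ > cur_.size()) {
            // Start a new migration: the full table becomes "old",
            // and a doubled, empty table becomes "current".
            old_.swap(cur_);
            cur_.assign(old_.size() * 2, {});
            migrate_pos_ = 0;
        }
    }

    bool contains(int key) const {
        for (int k : cur_[idx(key, cur_.size())])
            if (k == key) return true;
        if (!old_.empty())  // migration in progress: check old table too
            for (int k : old_[idx(key, old_.size())])
                if (k == key) return true;
        return false;
    }

    std::size_t size() const { return size_; }

private:
    static std::size_t idx(int key, std::size_t n) {
        return std::hash<int>{}(key) % n;
    }

    void step_migration() {
        for (int moved = 0; moved < 2 && !old_.empty(); ++moved) {
            for (int k : old_[migrate_pos_])
                cur_[idx(k, cur_.size())].push_back(k);
            old_[migrate_pos_].clear();
            if (++migrate_pos_ == old_.size()) old_.clear();  // done
        }
    }

    std::vector<std::list<int>> old_, cur_;
    std::size_t size_;
    std::size_t migrate_pos_;
};
```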

You never talked about the amount of data you're about to store. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread it across buckets to avoid contention.
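One common way to realize that "spread writers across buckets" idea is lock striping (a sketch with invented names, not part of the answer: the table is split into shards, each guarded by its own mutex, so writers hitting different shards never contend):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Lock-striping sketch: a fixed number of shards, each an ordinary
// map behind its own mutex. A key's hash picks its shard, so two
// threads writing keys in different shards proceed in parallel.
template <typename K, typename V, std::size_t Shards = 16>
class ShardedMap {
public:
    void put(const K& key, const V& value) {
        Shard& s = shard(key);
        std::lock_guard<std::mutex> lock(s.mtx);
        s.map[key] = value;
    }

    bool get(const K& key, V& out) {
        Shard& s = shard(key);
        std::lock_guard<std::mutex> lock(s.mtx);
        auto it = s.map.find(key);
        if (it == s.map.end()) return false;
        out = it->second;
        return true;
    }

private:
    struct Shard {
        std::mutex mtx;
        std::unordered_map<K, V> map;
    };

    Shard& shard(const K& key) {
        return shards_[std::hash<K>{}(key) % Shards];
    }

    Shard shards_[Shards];
};
```

Note that, as the second comment on the question points out, per-shard locking still does not give the user compound atomicity (e.g. an atomic read-then-write); that remains the caller's problem.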

References:

  • The article on Wikipedia covers lots of different implementations; it's always good to survey the variety
  • This GoogleTalk from Dr. Cliff Click (Azul Systems) shows a hash table designed for heavily multi-threaded systems, in Java.
Matthieu M.