I have some candidate aspects:
- The hash function is important, the hashcode should be unique as far as possible.
- The backend data structure is important, the search, insert and delete operations should all have time complexity O(1).
- The memory management is important, the memory overhead of every hash_table entry should be as least as possible. When the hash_table is expanding, the memory should increase efficiently, and when the hash_table is shrinking, the memory should do garbage collection efficiently. And with these memory operations, the aspect 2 should also be full filled.
- If the hash_table will be used in multi_threads, it should be thread safe and also be efficient.
My questions are:
- Are there any more aspects worth attention?
- How to design the hash_table to full fill these aspects?
- Are there any resources I can refer to?
Many thanks!
After reading some material, update my questions. :)
In a book explaining the source code of SGI STL, I found some useful informations:
- The backend data structure is a bucket of linked list. When search, insert or delete an element in the hash_table:
- Use a hash function to calculate the corresponding position in the bucket, and the elements are stored in the linked list after this position.
- When the size of elements is larger than the size of buckets, the buckets need resize: expand the size to be 2 times larger than the old size. The size of buckets should be prime. Then copy the old buckets and elements to the new one.
- I didn't find the logic of garbage collection when the number of elements is much smaller than the number of buckets. But I think this logic should be considerated when many inserts at first then many deletes later.
- Other data structures such as arrays with linear detection or square detection is not as good as linked list.
- A good hash function can avoid clusters, and double hash can help to resolve clusters.
The question about multi_threads is still open. :D