Hash Table Design

+ +

Overview

+ +

The collision-chaining hash-based container has the following declaration.

+template<
+    typename Key,
+    typename Mapped,
+    typename Hash_Fn = std::hash<Key>,
+    typename Eq_Fn = std::equal_to<Key>,
+    typename Comb_Hash_Fn = direct_mask_range_hashing<>,
+    typename Resize_Policy = /* default explained below */,
+    bool Store_Hash = false,
+     typename Allocator = std::allocator<char> >
+class cc_hash_table;
+

+ +

The parameters have the following meaning:

+ +

Key is the key type.
Mapped is the mapped-policy, and is explained in Tutorial::Associative Containers::Associative Containers Others than Maps.
Hash_Fn is a key hashing functor.
Eq_Fn is a key equivalence functor.
Comb_Hash_Fn is a range-hashing functor; it describes how to translate hash values into positions within the table. This is described in Hash Policies.
Resize_Policy describes how a container object should change its internal size. This is described in Resize Policies.
Store_Hash indicates whether the hash value should be stored with each entry. This is described in Policy Interaction.
Allocator is an allocator type.

+ +

The probing hash-based container has the following declaration.

+template<
+    typename Key,
+    typename Mapped,
+    typename Hash_Fn = std::hash<Key>,
+    typename Eq_Fn = std::equal_to<Key>,
+    typename Comb_Probe_Fn = direct_mask_range_hashing<>,
+    typename Probe_Fn = /* default explained below */,
+    typename Resize_Policy = /* default explained below */,
+    bool Store_Hash = false,
+    typename Allocator =  std::allocator<char> >
+class gp_hash_table;
+

+ +

The parameters are identical to those of the collision-chaining container, except for the following.

+ +

Comb_Probe_Fn describes how to transform a probe sequence into a sequence of positions within the table.
Probe_Fn describes a probe sequence policy.

+ +

Some of the default template values depend on the values of other parameters, and are explained in Policy Interaction.

+ +

Hash Policies

+ +

General Terms

+ +

The following explains some of the functions involved in hashing. Figure Hash functions, ranged-hash functions, and range-hashing functions illustrates the discussion.

+ +

Hash functions, ranged-hash functions, and range-hashing functions.

+ +

Let U be a domain (e.g., the integers, or the strings of 3 characters). A hash-table algorithm needs to map elements of U "uniformly" into the range [0,..., m - 1] (where m is a non-negative integral value, and is, in general, time varying). I.e., the algorithm needs a ranged-hash function

+ +

f : U × Z₊ → Z₊ ,

+ +

such that for any u in U ,

+ +

0 ≤ f(u, m) ≤ m - 1 ,

+ +

and which has "good uniformity" properties [knuth98sorting]. One common solution is to use the composition of the hash function

+ +

h : U → Z₊ ,

+ +

which maps elements of U into the non-negative integrals, and

+ +

g : Z₊ × Z₊ → Z₊ ,

+ +

which maps a non-negative hash value, and a non-negative range upper-bound into a non-negative integral in the range between 0 (inclusive) and the range upper bound (exclusive), i.e., for any r in Z₊,

+ +

0 ≤ g(r, m) ≤ m - 1 .

+ +

The resulting ranged-hash function is

+ +

f(u, m) = g(h(u), m) (1) .

+ +

From the above, it is obvious that given g and h, f can always be composed (however the converse is not true). The STL's hash-based containers allow specifying a hash function, and use a hard-wired range-hashing function; the ranged-hash function is implicitly composed.
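The composition in (1) can be sketched in a few lines. This is only an illustration under stated assumptions, not the pb_ds implementation: the names mod_range_hashing_sketch and composed_ranged_hash are hypothetical, and the range-hashing step is assumed to be a plain modulo.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical range-hashing functor g(r, m): maps a hash value into [0, m - 1].
struct mod_range_hashing_sketch {
    std::size_t operator()(std::size_t r, std::size_t m) const {
        assert(m > 0);  // the range must be non-empty
        return r % m;
    }
};

// Hypothetical ranged-hash functor f(u, m) = g(h(u), m), composed from a
// hash functor h and a range-hashing functor g, as in equation (1).
template <typename Key, typename Hash_Fn = std::hash<Key>,
          typename Comb_Hash_Fn = mod_range_hashing_sketch>
struct composed_ranged_hash {
    Hash_Fn h;
    Comb_Hash_Fn g;

    std::size_t operator()(const Key& u, std::size_t m) const {
        return g(h(u), m);
    }
};
```

A client that supplies only h gets f for free; a client that needs control over m must replace the whole composition, which is the case discussed under Ranged-Hash Functions.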

+ +

The above describes the case where a key is to be mapped into a single position within a hash table, e.g., in a collision-chaining table. In other cases, a key is to be mapped into a sequence of positions within a table, e.g., in a probing table. Similar terms apply in this case: the table requires a ranged probe function, mapping a key into a sequence of positions within the table. This is typically achieved by composing a hash function mapping the key into a non-negative integral type, a probe function transforming the hash value into a sequence of hash values, and a range-hashing function transforming the sequence of hash values into a sequence of positions.

+ +

Range-Hashing Functions

+ +

Some common choices for range-hashing functions are the division, multiplication, and middle-square methods [knuth98sorting], defined as

+ +

g(r, m) = r mod m (2) ,

+ +

g(r, m) = ⌈ u/v ( a r mod v ) ⌉ ,

+ +

and

+ +

g(r, m) = ⌈ u/v ( r² mod v ) ⌉ ,

+ +

respectively, for some positive integrals u and v (typically powers of 2), and some a. Each of these range-hashing functions works best for some different setting.

+ +

The division method (2) is a very common choice. However, even this single method can be implemented in two very different ways. It is possible to implement (2) using the low level % (modulo) operation (for any m), or the low level & (bit-mask) operation (for the case where m is a power of 2), i.e.,

+ +

g(r, m) = r % m (3) ,

+ +

and

+ +

g(r, m) = r & (m - 1), (m = 2^k for some k) (4) ,

+ +

respectively.

+ +

The % (modulo) implementation (3) has the advantage that for m a prime far from a power of 2, g(r, m) is affected by all the bits of r (minimizing the chance of collision). It has the disadvantage of using the costly modulo operation. This method is hard-wired into SGI's implementation [sgi_stl].

+ +

The & (bit-mask) implementation (4) has the advantage of relying on the fast bit-wise and operation. It has the disadvantage that g(r, m) is affected only by the low order bits of r. This method is hard-wired into Dinkumware's implementation [dinkumware_stl].
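The difference between (3) and (4) can be seen in a minimal sketch. The function names below are hypothetical and are not taken from pb_ds; the power-of-2 check is an assumption added for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Equation (3): the modulo implementation, valid for any m > 0.
inline std::size_t mod_range_hash(std::size_t r, std::size_t m) {
    assert(m > 0);
    return r % m;
}

// Equation (4): the bit-mask implementation, valid only when m is a power of 2.
inline std::size_t mask_range_hash(std::size_t r, std::size_t m) {
    assert(m > 0 && (m & (m - 1)) == 0);  // m must be a power of 2
    return r & (m - 1);
}
```

With m = 16, the hash values 0x10 and 0x20 differ only above the four low-order bits, so the mask maps both to position 0. With the prime m = 13, the modulo maps them to 3 and 6.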

+ +

Ranged-Hash Functions

+ +

In some cases it is beneficial to allow the client to directly specify a ranged-hash function. It is true that the writer of the ranged-hash function cannot rely on the values of m having specific numerical properties suitable for hashing (in the sense used in [knuth98sorting]), since the values of m are determined by a resize policy with possibly orthogonal considerations.

+ +

There are two cases where a ranged-hash function can be superior. The first is when using perfect hashing [knuth98sorting]; the second is when the values of m can be used to estimate the "general" number of distinct values required. This is described in the following.

+ +

Let

+ +

s = [ s₀,..., s_{t - 1}]

+ +

be a string of t characters, each of which is from domain S. Consider the following ranged-hash function:

+ +

f₁(s, m) = ∑_{i = 0}^{t - 1} s_i aⁱ mod m (5) ,

+ +

where a is some non-negative integral value. This is the standard string-hashing function used in SGI's implementation (with a = 5) [sgi_stl]. Its advantage is that it takes into account all of the characters of the string.

+ +

Now assume that s is the string representation of a long DNA sequence (and so S = {'A', 'C', 'G', 'T'}). In this case, scanning the entire string might be prohibitively expensive. A possible alternative might be to use only the first k characters of the string, where

+ +

|S|^k ≥ m ,

+ +

i.e., using the hash function

+ +

f₂(s, m) = ∑_{i = 0}^{k - 1} s_i aⁱ mod m (6) ,

+ +

requiring scanning over only

+ +

k = log₄( m )

+ +

characters.
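As a rough sketch of f₂ for the DNA case, the following scans only the first k characters, where k is the smallest integer with 4^k ≥ m. The function names are hypothetical, a = 5 is borrowed from the SGI example above, and the arithmetic assumes that m² fits in std::size_t.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Smallest k with 4^k >= m, i.e. k = ceil(log4(m)) for an alphabet of size 4.
// Computed as the number of base-4 digits of m - 1.
inline std::size_t prefix_length(std::size_t m) {
    assert(m > 0);
    std::size_t k = 0;
    for (std::size_t rest = m - 1; rest > 0; rest /= 4) {
        ++k;
    }
    return k;
}

// Sketch of f2(s, m): the polynomial hash of equation (6), restricted to the
// first k characters (or to the whole string, if it is shorter than k).
inline std::size_t prefix_ranged_hash(const std::string& s, std::size_t m,
                                      std::size_t a = 5) {
    assert(m > 0);
    const std::size_t k = std::min(prefix_length(m), s.size());
    std::size_t sum = 0;
    std::size_t power = 1 % m;  // a^i mod m
    for (std::size_t i = 0; i < k; ++i) {
        sum = (sum + static_cast<unsigned char>(s[i]) % m * power) % m;
        power = power * (a % m) % m;
    }
    return sum;
}
```

Because k depends on m, this function needs the table size as an argument and cannot be written as a hash function followed by a separate range-hashing step.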

+ +

Other more elaborate hash-functions might scan k characters starting at a random position (determined at each resize), or scan k random positions (determined at each resize), i.e., using

+ +

f₃(s, m) = ∑_{i = r₀}^{r₀ + k - 1} s_i aⁱ mod m ,

+ +

f₄(s, m) = ∑_{i = 0}^{k - 1} s_{r_i} a^{r_i} mod m ,

+ +

respectively, for r₀,..., r_{k - 1} each in the (inclusive) range [0,...,t-1].

+ +

It should be noted that the above functions cannot be decomposed as (1) .

+ +

Implementation

+ +

This sub-subsection describes the implementation of the above in pb_ds. It first explains range-hashing functions in collision-chaining tables, then ranged-hash functions in collision-chaining tables, then probing-based tables, and, finally, lists the relevant classes in pb_ds.

+ +

Range-Hashing and Ranged-Hashes in Collision-Chaining Tables

+ +

cc_hash_table is parametrized by Hash_Fn and Comb_Hash_Fn, a hash functor and a combining hash functor, respectively.

+ +

In general, Comb_Hash_Fn is considered a range-hashing functor. cc_hash_table synthesizes a ranged-hash function from Hash_Fn and Comb_Hash_Fn (see (1) above). Figure Insert hash sequence diagram shows an insert sequence diagram for this case. The user inserts an element (point A), the container transforms the key into a non-negative integral using the hash functor (points B and C), and transforms the result into a position using the combining functor (points D and E).

+ +

Insert hash sequence diagram.

+ +

If cc_hash_table's hash-functor, Hash_Fn, is instantiated by null_hash_fn (see Interface::Concepts::Null Policy Classes), then Comb_Hash_Fn is taken to be a ranged-hash function. Figure Insert hash sequence diagram with a null hash policy shows an insert sequence diagram. The user inserts an element (point A), the container transforms the key into a position using the combining functor (points B and C).

+ +

Insert hash sequence diagram with a null hash policy.

+ +

Probing Tables

+ +

gp_hash_table is parametrized by Hash_Fn, Probe_Fn, and Comb_Probe_Fn. As before, if Hash_Fn and Probe_Fn are, respectively, null_hash_fn and null_probe_fn, then Comb_Probe_Fn is a ranged-probe functor. Otherwise, Hash_Fn is a hash functor, Probe_Fn is a functor for offsets from a hash value, and Comb_Probe_Fn transforms a probe sequence into a sequence of positions within the table.
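The three-way split for probing tables can be sketched as follows. The functor names are hypothetical, the offsets i and i² are assumed forms of a linear and a quadratic probe, and the combining step is assumed to be a bit-mask; none of this is meant to reproduce the pb_ds classes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical probe functors: the offset of the i-th probe from the hash value.
struct linear_probe_sketch {
    std::size_t operator()(std::size_t i) const { return i; }
};

struct quadratic_probe_sketch {
    std::size_t operator()(std::size_t i) const { return i * i; }
};

// Position of the i-th probe: the probe functor yields an offset from the
// hash value, and a mask-based combining step maps the sum into the table.
template <typename Probe_Fn>
std::size_t probe_position(std::size_t hash, std::size_t i, std::size_t m,
                           Probe_Fn probe = Probe_Fn()) {
    assert(m > 0 && (m & (m - 1)) == 0);  // mask-based: m is a power of 2
    return (hash + probe(i)) & (m - 1);
}
```

For a hash value of 14 in a table of size 16, the third linear probe lands at (14 + 3) & 15 = 1, and the third quadratic probe at (14 + 9) & 15 = 7.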

+ +

Pre-Defined Policies

+ +

pb_ds contains some pre-defined classes implementing range-hashing and probing functions:

+ +

direct_mask_range_hashing and direct_mod_range_hashing are range-hashing functions based on a bit-mask and a modulo operation, respectively.
linear_probe_fn and quadratic_probe_fn are a linear probe and a quadratic probe function, respectively.

Figure Hash policy class diagram shows a class diagram.

+ +

Hash policy class diagram.

+ +

Resize Policies

+ +

General Terms

+ +

Hash-tables, as opposed to trees, do not naturally grow or shrink. It is necessary to specify policies to determine how and when a hash table should change its size. Usually, resize policies can be decomposed into orthogonal policies:

+ +

A size policy indicating how a hash table should grow (e.g., it should multiply by powers of 2).
A trigger policy indicating when a hash table should grow (e.g., a load factor is exceeded).

+ +

Size Policies

+ +

Size policies determine how a hash table changes size. These policies are simple, and there are relatively few sensible options. An exponential-size policy (with the initial size and growth factors both powers of 2) works well with a mask-based range-hashing function (see Range-Hashing Policies), and is the hard-wired policy used by Dinkumware [dinkumware_stl]. A prime-list based policy works well with a modulo-prime range hashing function (see Range-Hashing Policies), and is the hard-wired policy used by SGI's implementation [sgi_stl].
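The two kinds of size policy can be sketched as plain functions. The names are hypothetical, the growth factor of 2 is an assumption, and the short prime list is illustrative only; a real list would extend much further.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// Exponential policy sketch: double the size, so that a power-of-2 size
// stays a power of 2 and keeps working with a mask-based range hash.
inline std::size_t exponential_larger_size(std::size_t size) {
    assert(size > 0);
    return size * 2;
}

// Prime-list policy sketch: the smallest listed prime larger than the current
// size, for use with a modulo-based range hash.
inline std::size_t prime_larger_size(std::size_t size) {
    static const std::size_t primes[] = {53, 97, 193, 389, 769, 1543, 3079, 6151};
    const std::size_t* it =
        std::upper_bound(std::begin(primes), std::end(primes), size);
    return it != std::end(primes) ? *it : size;  // list exhausted: keep the size
}
```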

+ +

Trigger Policies

+ +

Trigger policies determine when a hash table changes size. Following is a description of two policies: load-check policies, and collision-check policies.

+ +

Load-check policies are straightforward. The user specifies two factors, α_min and α_max, and the hash table maintains the invariant that

+ +

α_min ≤ (number of stored elements) / (hash-table size) ≤ α_max (1) .
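A load-check trigger that enforces invariant (1) can be sketched as below. The struct and its member names are hypothetical; they only illustrate the two comparisons involved.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical load-check trigger: reports when the load factor leaves the
// interval [alpha_min, alpha_max] of invariant (1).
struct load_check_sketch {
    double alpha_min;
    double alpha_max;

    bool grow_needed(std::size_t num_elements, std::size_t table_size) const {
        assert(table_size > 0);
        return static_cast<double>(num_elements) / table_size > alpha_max;
    }

    bool shrink_needed(std::size_t num_elements, std::size_t table_size) const {
        assert(table_size > 0);
        return static_cast<double>(num_elements) / table_size < alpha_min;
    }
};
```

With α_min = 1/8 and α_max = 1/2, a table of size 16 tolerates between 2 and 8 elements; a ninth element asks for growth, and dropping to one element asks for shrinking.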

+ +

Collision-check policies work in the opposite direction of load-check policies. They focus on keeping the number of collisions moderate and hoping that the size of the table will not grow very large, instead of keeping a moderate load-factor and hoping that the number of collisions will be small. A maximal collision-check policy resizes when the longest probe-sequence grows too large.

+ +

Consider Figure Balls and bins. Let the size of the hash table be denoted by m, the length of a probe sequence be denoted by k, and some load factor be denoted by α. We would like to calculate the minimal length of k, such that if there were α m elements in the hash table, a probe sequence of length k would be found with probability at most 1/m.

+ +

Balls and bins.

+ +

Denote the probability that a probe sequence of length k appears in bin i by p_i, the length of the probe sequence of bin i by l_i, and assume uniform distribution. Then

+ +

p₁ = (2)

P( l₁ ≥ k ) =

P( l₁ ≥ α ( 1 + k / α - 1 ) ) ≤ (a)

e ^ ( - ( α ( k / α - 1 )² ) / 2 ) ,

+ +

where (a) follows from the Chernoff bound [motwani95random]. To calculate the probability that some bin contains a probe sequence greater than k, we note that the l_i are negatively-dependent [dubhashi98neg]. Let I(.) denote the indicator function. Then

+ +

P( exists_i l_i ≥ k ) = (3)

P( ∑_{i = 1}^m I(l_i ≥ k) ≥ 1 ) =

P( ∑_{i = 1}^m I(l_i ≥ k) ≥ m p₁ ( 1 + 1 / (m p₁) - 1 ) ) ≤ (a)

e ^ ( ( - m p₁ ( 1 / (m p₁) - 1 )² ) / 2 ) ,

+ +

where (a) follows from the fact that the Chernoff bound can be applied to negatively-dependent variables [dubhashi98neg]. Inserting (2) into (3), and equating with 1/m, we obtain

+ +

k ~ √ ( 2 α ln ( 2 m ln(m) ) ) .
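To get a feel for the magnitude of this estimate, it can be evaluated numerically. The sketch below assumes the estimate is read as k ~ √(2 α ln(2 m ln m)); the function name is hypothetical, and the result is only an order-of-magnitude guide, not a threshold used by any particular policy class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Evaluates sqrt(2 * alpha * ln(2 * m * ln(m))) for load factor alpha and
// table size m. Requires m >= 2 so that the inner logarithm is positive.
inline double max_probe_length_estimate(double alpha, std::size_t m) {
    assert(alpha > 0.0 && m >= 2);
    const double md = static_cast<double>(m);
    return std::sqrt(2.0 * alpha * std::log(2.0 * md * std::log(md)));
}
```

For α = 1/2 and m = 1024 this gives a value slightly above 3, and the value grows very slowly with m.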

+ +

Implementation

+ +

This sub-subsection describes the implementation of the above in pb_ds. It first describes resize policies and their decomposition into trigger and size policies, then describes pre-defined classes, and finally discusses controlled access to the policies' internals.

+ +

Resize Policies and Their Decomposition

+ +

Each hash-based container is parametrized by a Resize_Policy parameter; the container derives publicly from Resize_Policy. For example:

+cc_hash_table<
+    typename Key,
+    typename Mapped,
+    ...
+    typename Resize_Policy
+    ...> :
+        public Resize_Policy
+

+ +

As a container object is modified, it continuously notifies its Resize_Policy base of internal changes (e.g., collisions encountered and elements being inserted). It queries its Resize_Policy base whether it needs to be resized, and if so, to what size.

+ +

Figure Insert resize sequence diagram shows a (possible) sequence diagram of an insert operation. The user inserts an element; the hash table notifies its resize policy that a search has started (point A); in this case, a single collision is encountered - the table notifies its resize policy of this (point B); the container finally notifies its resize policy that the search has ended (point C); it then queries its resize policy whether a resize is needed, and if so, what is the new size (points D to G); following the resize, it notifies the policy that a resize has completed (point H); finally, the element is inserted, and the policy notified (point I).

+ +

Insert resize sequence diagram.

+ +

In practice, a resize policy can usually be decomposed orthogonally into a size policy and a trigger policy. Consequently, the library contains a single class for instantiating a resize policy: hash_standard_resize_policy is parametrized by Size_Policy and Trigger_Policy, derives publicly from both, and acts as a standard delegate [gamma95designpatterns] to these policies.
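The layering described above (a container deriving from its resize policy, and a standard resize policy deriving from a size policy and a trigger policy) can be sketched as follows. All class and member names here are hypothetical and much simpler than the real policy interfaces; the sketch only shows the direction of the delegation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size policy: decides how the table grows.
struct doubling_size_policy_sketch {
    std::size_t larger_size(std::size_t size) const {
        assert(size > 0);
        return size * 2;
    }
};

// Hypothetical trigger policy: decides when the table grows.
struct max_load_trigger_sketch {
    bool resize_needed(std::size_t num_elements, std::size_t size) const {
        return num_elements * 2 > size;  // load factor above 1/2
    }
};

// Resize policy sketch: derives publicly from both policies and delegates.
template <typename Size_Policy, typename Trigger_Policy>
struct standard_resize_policy_sketch : public Size_Policy, public Trigger_Policy {
    bool is_resize_needed(std::size_t n, std::size_t size) const {
        return Trigger_Policy::resize_needed(n, size);
    }
    std::size_t new_size(std::size_t size) const {
        return Size_Policy::larger_size(size);
    }
};

// Container shell sketch: derives publicly from its resize policy and
// queries it after each insertion. Element storage is omitted.
template <typename Resize_Policy>
class table_shell_sketch : public Resize_Policy {
public:
    void insert_one() {
        ++num_elements_;
        if (this->is_resize_needed(num_elements_, size_)) {
            size_ = this->new_size(size_);
        }
    }
    std::size_t table_size() const { return size_; }

private:
    std::size_t num_elements_ = 0;
    std::size_t size_ = 8;
};
```

Swapping either policy changes only one template argument; the container shell itself is untouched.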

+ +

Figures Standard resize policy trigger sequence diagram and Standard resize policy size sequence diagram show sequence diagrams illustrating the interaction between the standard resize policy and its trigger and size policies, respectively.

+ +

Standard resize policy trigger sequence diagram.

+ +

Standard resize policy size sequence diagram.

+ +

Pre-Defined Policies

+ +

The library includes the following instantiations of size and trigger policies:

+ +

hash_load_check_resize_trigger implements a load check trigger policy.
cc_hash_max_collision_check_resize_trigger implements a collision check trigger policy.
hash_exponential_size_policy implements an exponential-size policy (which should be used with mask range hashing).
hash_prime_size_policy implements a size policy based on a sequence of primes [sgi_stl] (which should be used with mod range hashing).

+ +

Figure Resize policy class diagram gives an overall picture of the resize-related classes. basic_hash_table is parametrized by Resize_Policy, which it subclasses publicly. This class is currently instantiated only by hash_standard_resize_policy. hash_standard_resize_policy itself is parametrized by Trigger_Policy and Size_Policy. Currently, Trigger_Policy is instantiated by hash_load_check_resize_trigger, or cc_hash_max_collision_check_resize_trigger; Size_Policy is instantiated by hash_exponential_size_policy, or hash_prime_size_policy.

+ +

Resize policy class diagram.

+ +

Controlled Access to Policies' Internals

+ +

There are cases where (controlled) access to resize policies' internals is beneficial. E.g., it is sometimes useful to query a hash-table for the table's actual size (as opposed to its size() - the number of values it currently holds); it is sometimes useful to set a table's initial size, externally resize it, or change load factors.

+ +

Clearly, supporting such methods both decreases the encapsulation of hash-based containers, and increases the diversity between different associative-containers' interfaces. Conversely, omitting such methods can decrease containers' flexibility.

+ +

In order to avoid, to the extent possible, the above conflict, the hash-based containers themselves do not address any of these questions; this is deferred to the resize policies, which are easier to change or replace. Thus, for example, neither cc_hash_table nor gp_hash_table contains methods for querying the actual size of the table; this is deferred to hash_standard_resize_policy.

+ +

Furthermore, the policies themselves are parametrized by template arguments that determine the methods they support ([alexandrescu01modern] shows techniques for doing so). hash_standard_resize_policy is parametrized by External_Size_Access that determines whether it supports methods for querying the actual size of the table or resizing it. hash_load_check_resize_trigger is parametrized by External_Load_Access that determines whether it supports methods for querying or modifying the loads. cc_hash_max_collision_check_resize_trigger is parametrized by External_Load_Access that determines whether it supports methods for querying the load.

+ +

Some operations, for example, resizing a container at run time, or changing the load factors of a load-check trigger policy, require the container itself to resize. As mentioned above, the hash-based containers themselves do not contain these types of methods, only their resize policies. Consequently, there must be some mechanism for a resize policy to manipulate the hash-based container. As the hash-based container is a subclass of the resize policy, this is done through virtual methods. Each hash-based container has a private virtual method:

+virtual void
+    do_resize
+    (size_type new_size);
+

+ +

which resizes the container. Implementations of Resize_Policy can export public methods for resizing the container externally; these methods internally call do_resize to resize the table.
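The mechanism can be sketched with a pair of classes. The names resizable_policy_sketch and bucket_table_sketch are hypothetical, and the table is reduced to a vector of buckets; only the private virtual do_resize mirrors the declaration above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Resize policy sketch: exports a public resize method that forwards to the
// container through a private virtual hook.
class resizable_policy_sketch {
public:
    virtual ~resizable_policy_sketch() = default;

    // Public entry point exported by the policy.
    void resize(std::size_t new_size) { do_resize(new_size); }

private:
    virtual void do_resize(std::size_t new_size) = 0;
};

// Container sketch: derives publicly from the policy and implements the hook
// privately, so clients can resize only through the policy's interface.
class bucket_table_sketch : public resizable_policy_sketch {
public:
    std::size_t actual_size() const { return buckets_.size(); }

private:
    void do_resize(std::size_t new_size) override {
        assert(new_size > 0);
        buckets_.assign(new_size, 0);  // a real table would rehash its entries
    }

    std::vector<int> buckets_ = std::vector<int>(8);
};
```

The container exposes no resizing method of its own; replacing the policy with one that lacks resize removes the capability without touching the container.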

+ +

Policy Interaction

+ +

Hash-tables are, unfortunately, especially sensitive to the choice of policies. One of the more complicated aspects of this is that poor combinations of good policies can form a poor container. Following are some considerations.

+ +

Probe Policies, Size Policies, and Trigger Policies

+ +

Some combinations do not work well for probing containers. For example, combining a quadratic probe policy with an exponential size policy can yield a poor container: when an element is inserted, a trigger policy might decide that there is no need to resize, as the table still contains unused entries; the probe sequence, however, might never reach any of the unused entries.
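The problem can be made concrete by counting how many distinct positions a probe sequence can ever reach. The sketch below assumes a quadratic probe with offsets i² and a linear probe with offsets i; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the distinct positions visited by the probes offset(i) mod m for
// i = 0, ..., m - 1. The offsets repeat with period m, so later probes add
// nothing new. The starting hash value is taken as 0; any other value only
// shifts the set of positions.
inline std::size_t reachable_positions(std::size_t m, bool quadratic) {
    assert(m > 0);
    std::vector<bool> seen(m, false);
    std::size_t count = 0;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t offset = quadratic ? i * i : i;
        const std::size_t pos = offset % m;
        if (!seen[pos]) {
            seen[pos] = true;
            ++count;
        }
    }
    return count;
}
```

For a table of size 16, a quadratic probe of this form reaches only 4 of the 16 positions, whereas a linear probe reaches all of them; with the prime size 17, the quadratic probe reaches 9 positions.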

+ +

Unfortunately, pb_ds cannot detect such problems at compile time (they are reducible to the halting problem). It therefore defines an exception class, insert_error, which is thrown in this case.

+ +

Hash Policies and Trigger Policies

+ +

Some trigger policies are especially susceptible to poor hash functions. Suppose, as an extreme case, that the hash function transforms each key to the same hash value. After some inserts, a collision detecting policy will always indicate that the container needs to grow.

+ +

The library, therefore, by design, limits each operation to one resize. For each insert, for example, it queries only once whether a resize is needed.

+ +

Equivalence Functors, Storing Hash Values, and Hash Functions

+ +

cc_hash_table and gp_hash_table are parametrized by an equivalence functor and by a Store_Hash parameter. If the latter parameter is true, then the container stores with each entry a hash value, and uses this value in case of collisions to determine whether to apply the equivalence functor. This can lower the cost of collisions for some types, but increase the cost of collisions for other types.
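The effect of storing hash values can be sketched as follows, on the assumption that the stored value is compared first and the equivalence functor is applied only when the hash values agree. The entry layout and names are hypothetical, and the deliberately weak length-based hash is there only to make collisions easy to produce.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Deliberately weak hash functor, so that distinct keys collide easily.
struct length_hash_sketch {
    std::size_t operator()(const std::string& s) const { return s.size(); }
};

// Hypothetical table entry that caches the hash value of its key.
struct stored_hash_entry_sketch {
    std::string key;
    std::size_t hash;
};

// Returns true if the entry holds the key. eq_calls counts how often the
// (possibly expensive) equivalence functor is actually applied.
inline bool entry_matches(const stored_hash_entry_sketch& e,
                          const std::string& key, std::size_t& eq_calls) {
    assert(e.hash == length_hash_sketch()(e.key));  // cached value is consistent
    if (e.hash != length_hash_sketch()(key)) {
        return false;  // different hash values: the keys cannot be equivalent
    }
    ++eq_calls;
    return std::equal_to<std::string>()(e.key, key);
}
```

For long string keys the integer comparison often avoids a full key comparison; for integer keys the extra stored word and comparison are pure overhead, which is the trade-off noted above.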

+ +

If a ranged-hash function or ranged probe function is directly supplied, however, then it makes no sense to store the hash value with each entry. pb_ds's container will fail at compilation, by design, if this is attempted.

+ +

Size Policies and Load-Check Trigger Policies

+ +

Assume a size policy issues an increasing sequence of sizes a, a q, a q², a q³, ... For example, an exponential size policy might issue the sequence of sizes 8, 16, 32, 64, ...

+ +

If a load-check trigger policy is used, with loads α_min and α_max, respectively, then it is a good idea to have:

+ +

α_max ~ 1 / q
α_min < 1 / (2 q)

+ +

This will ensure that the amortized hash cost of each modifying operation is at most approximately 3.

+ +

α_min ~ α_max is, in any case, a bad choice, and α_min > α_max is horrendous.