# Notes on perfect hashing for card-set bitmasks

Given a five-card poker hand represented by a bitmask (i.e., 1 in bit position == corresponding card is present), one way to determine the hand's strength is to perform a table lookup. One project of mine spends a considerable portion of its time doing such lookups, so I was interested in ways to improve the table lookup speed.

GCC hash_map has a few undesirable properties:
1) Poor data locality, each lookup hits at least two memory locations. The key is hashed to a bucket, then the list of nodes in that bucket is traversed. The vector of buckets is about 4M entries, which doesn't fit in cache (~32MB on 64-bit machine.)
2) Hashtable collisions require chaining through potentially many nodes, must perform a key comparison on each one.
3) Inefficient use of memory, each hashtable node is 192 bits since it stores the 64-bit key, a 64-bit pointer, and the 16-bit result, but is then padded out to an 8-byte boundary.

This suggested using a perfect hash instead. In fact, I found references to others doing exactly that. Links:

The perfect hashes generated by the last link (the "Jenkins" algorithm) are of the form A^table[B] where A and B are derived from the key value in such a way that no two keys share the same (A,B). I started by looking at hashes of the form A = key % pa, B = key % pb, where pa is a large prime (bigger than 52C5) and pb is a small prime (around 2^16). Then the intermediate lookup table will fit in memory. Discovering such primes is a matter of brute force, and there are many pairs that work. Unfortunately the Jenkins generator couldn't find perfect hashes that worked, which was a puzzle. (Though I think I now understand part of the reason why. If t contains values < 2^16, then having distinct A^t[B] requires that no more than 2^16 A values hash to the same 2^16-bucket region. The primes I was using didn't have this property.)

I tried reproducing the Jenkins algorithm using my own code, which would use 32-bit table values. Having 2^16 such entries would still fit in cache, while 2^17 would equal the L2 cache size. But it turned out to be pretty hard to find table entries that produced a perfect hash. The Jenkins algorithm works better if the range of B's is larger, but doing so creates a table that no longer fits in L2 cache. Although I returned to this approach, I decided to try a different method first.

(BTW, A little experimentation with integer hashes showed that modulus by a prime was just as efficient as anything else I could find.)

(Semi-?)Perfect Hash Implementation 1A + 1B: I didn't need to store the complete range 0-65535 in the lookup table. In fact, there are only 7461 different hand strengths. So a 16-bit value could indicate whether a collision had occured, and even some information about what to do next. Or, to turn it around, if I had a two-layer hashing scheme, the first-layer hash could contain some of the values.

So this led to the following design: First hash the key to a bucket in table 0. This bucket either contains the desired answer, or an entry indicating which table the answer resided in. Tables 1, 2, 3, etc. all use different hashes. To create the perfect hash, I just used a greedy algorithm that put table 0 collisions into the first table where they did not collide.

I again performed brute force to identify primes which generated hashes with relatively few collisions. I discovered that buckets with 5 bits were a bit overrepresented--- not surprisingly, since every small 5-bit combination is present in the key set and is unchanged by the mod operation. (I later found that XORing some 1's in the low bits eliminated this effect, and could lead to a few percent improvement in collision rates, and eliminates the over-representation of 5-bit values.) So I just picked the 10 "best" primes close to 52C5 and this resulted in a 6-table solution.

I wrote two variations, one that used % with a value from a table, and the other that used a function pointer from a table to jump to a hashing function that used % with a constant value. The latter version allows the compiler to use multiplication instead of division. This is variant 1a and 1b from the previous journal entry, which were 20% to 25% faster than the original hash_map.

Probably by improving the selection of table sizes, a better placement algorithm, etc, the number of tables could be reduced--- or use many more but smaller ones--- and thus the memory footprint. It is about 30MB with the hash I have today, much smaller than the hash_map footprint. I really like this algorithm because for a substantial percentage of keys you only hit memory once, and there is no fixed L2 cache overhead.

```const unsigned int phWidth = 6;
const unsigned int phModulus[6] = {
2599501,
2600341,
2603281,
2603981,
2604439,
2606803,
};
unsigned short phTable[6][2606803]= { {
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
...
```

It would be really cool to get this technique down to just two buckets, but unfortunately the birthday problem/paradox/property makes this extremely unlikely.

Perfect Hash Implementation 2: I thought about creating a version of the Jenkins hash which interleaved the indirect table and final data table, so that we could get situations in which the same cache miss brought both necessary entries into the cache. This would require that the indirection/XOR/'t' table and the data table be about the same size, which broke the bottleneck I was experiencing before. So, back to brute force searching for a pair of 'big' primes which preserved key uniqueness--- (A,B) pairs--- as well as produced many candidates for putting data[A^t[B]] on the same cache line as t[B]. I was using 16 bits for each table.

It was easy to find examples, but I ran into the problem described above. With a maximum t-value of only 65535, each "page" of A-values could contain no more than 65536 entries. So this had to be added to the search. It still was not hard to start finding candidates, and a greedy heuristic for assigning values to the t[] array produced a perfect hash. However, this version has about 120K candidates for cache alignment (out of 2.5M keys) and succesfully placed only about 100K. Again, it's possible that bigger primes (say, around 3M or 4M) would produce better results--- or non-greedy placement--- but this version proved slightly slower than algorithm 1A or 1B. Most lookups are going to require two memory accesses, although the footprint is quite good at about 11MB (2686976 * 4 bytes) so we should see more L2 hits. Providing more alignment opportunities might come at the cost of wasting more L2 space since the table would be less dense.

I thought about applying the 16 bits in the indirection table to different parts of the key (8 bits in the middle, 8 low bits for example) but concluded this wouldn't really have a beneficial effect since the modulus is "mostly" random so I couldn't expect to generate more alignment opportunities that way.

Also, I should really compare this to a Jenkins hash without interleaving and see whether the alignment actually does any good. ;)

```const unsigned int modA = 2651837;
const unsigned long long xorA = 536870911ull;
const unsigned int modB = 2600341;
const unsigned long long xorB = 536870911ull;
const unsigned int tableSize = 2686976;
struct PerfectHashTable {
const unsigned short int tbl;
unsigned short int data;
} phTable[ tableSize ] = {
{0,0},
{1,0},
{1,0},
{13,0},
{35,0},
...
```

Tags:
• Post a new comment

#### Error

default userpic