First hash the key to a bucket (again I used bucket = ( key ^ x ) % a.) This bucket is a 32-byte structure containing a multiplier m, modulus b and 14 data entries forming a secondary hash table. The second bucket is given by (( key * m ) % b) & 0xf. We just need to search for appropriate (a, x) and per-bucket (b, m) such that there are no collisions (and no secondary buckets > 14.)
This is easily accomplished for a ~ 52C5, x = 0, m = 1, which uses two less operations to generate the hashes (eliminating the XOR and multiplication.) With m=1 fixed we can also use a 15th entry in the secondary hash table. That doesn't run any faster than previous efforts despite the secondary hash table fitting into a single cache line.
However, we can lower the table size and increase the cache hit rate. a=600071, x=0 gives a table that comes in around 1.85-1.88 seconds per 10M lookups, about 30% improvement over hash_map. I'm looking for an 'a' value that will allow m=1 (to eliminate the multiply) but haven't found one yet.