Mark Gritter (markgritter) wrote,
Mark Gritter

Curse you, mathematics

I just realized that the Birthday Problem makes one of my current mini-projects impossible. For example, with a 20-million-bucket hashtable, you're 99.995% likely to get a collision by the time you reach 20,000 entries (just .1% load)--- if the keys are randomly distributed. So brute force search among a family of hash functions is unlikely to turn up very many collision-free hashes.

Yes, I've been reading up on perfect hashing. But a lot of the algorithms are painful to implement. Some don't seem to be very cache- or memory-system aware (but in many cases the intended application will be going to disk so that's not so bad.) So far I've developed two working algorithms, but I was trying for something simpler and faster since my current best effort is only about 40% faster than a linked-list keyed hashtable (and about 5/6th of the space in the table is wasted.)

My only notable discovery so far is that modulus by a prime seems superior to all the other 64-bit integer hashes I've found or explored. It's faster than a lot of bit-shifting (if the compiler converts it to multiplication, which gcc does). And it produces fewer collisions than a lot of other examples I've tried on my data set. A secure hash implemented in hardware would be best. ;)
Tags: algorithm, mathematics, programming
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 1 comment