Mark Gritter (markgritter) wrote,
Mark Gritter

Geek miscellany

1. I mostly don't attend cons with Marissa. But I will be attending 4th Street Fantasy Convention. (Conveniently located on 34th avenue!) Maybe you should too.

2. Math philosophy makes me happy. Scott Aaronson muses on the difference between unsolved math questions which might actually impact us and those which seem less likely to do so.

3. I attended a storage conference at UMN on Tuesday and Wednesday this week. Maybe I'll blog about it a bit more later. But, I do have a small rant. One of the topics was deduplication, and the use of hashes for finding identical blocks. The prevailing attitude was, as best I can reproduce: "Doing byte-by-byte checking is a performance hit, but we have to do it, because our bank customers won't accept anything that has some chance of corrupting the data."

Well, this is ridiculous. It may be true, but it's still ridiculous. The chance that the bank's data set will contain two different blocks that hash to the same 128 bits is incredibly much smaller than the chance that the code that checks the blocks byte by byte has a software bug. With a 256-bit hash, the chance of a collision is lower than that of the bank's redundant data centers being simultaneously wiped out by asteroid strikes.

In fact, one number at the conference was that total digital data production was predicted to be 1600 exabytes by 2011. (This may be a completely bogus number, but what the heck, IDC gets paid good money to come up with it.) That's storage exabytes (powers of ten, not two) so it's 1600*10^18 = 1.6*10^21. If we use 4KB blocks, that's about 0.4*10^18 = 4*10^17 blocks per year. But 2^128 = 2^8 * (2^10)^12 * 2^8 =~ 256 * (10^3)^12 = 256 * 10^36 =~ 2*10^38 different hashes. Applying the birthday paradox means that we can expect to see up to about 10^19 blocks before encountering a collision. So, 10 years worth of worldwide data production might, if we were "lucky", produce 1 accidental hash collision. It is to eliminate the possibility of that 1 hash collision that the blocks are compared byte by byte before being "deduplicated."

Moreover, how do you possibly perform a realistic test of the code to handle a false collision? If it ever got triggered, what level of confidence could you have that it was operating correctly?

Naturally, cryptographers would be very interested in seeing a SHA-1 collision. Here's my challenge to the storage community: put in a counter on your byte-by-byte comparison code to check how many times the comparison catches a difference. (Heck, log the offending blocks too, in the interests of science.) I'll give $100 to anybody who can demonstrate that the counter is nonzero because of a valid SHA-1 collision anytime within the next 10 years. I'll pay off any other sufficiently strong 128-bit hash too.

Bonus offer: I'll pay $1000, based on an additional condition: starting today you pay me $1 every time the comparison fails, but because the data got corrupted between taking the hash value and comparing the bytes. (i.e., the blocks being compared byte-by-byte in memory do not actually have identical hash values.)

At best this is innumeracy. At worst this is deliberate obtuseness--- this possibility has to be eliminated because we can put a number on it, despite the system having all sorts of failure possibilities with fuzzy or unknown probabilities. More than one person said "well, it's multiplicative." It's still not worth the performance hit, particularly if you could figure out ways to use that CPU, memory, and/or storage bandwidth to help the real problems.

How many of these systems include quicksort, I wonder?
Tags: conferences, conventions, geek, innumeracy, math, rant, science fiction, storage
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.