Mark Gritter (markgritter) wrote,
Mark Gritter

I'm thinking of retiring this interview question

I have been experimenting with asking the following question during interviews:

Our file system compresses data before storing it to flash. In order to test this, we need to generate data that has a given level of compressibility. A block which has all the same byte will compress down to close to 0% of its original size. A block which has completely random data will stay at 100% of its original size. Can you describe a way to generate blocks that have a tunable amount of compression, within a few percent?

The experience has been one of mutual frustration so far. I thought I was giving a pretty big hint with my explanation of compression ratios. But I guess compression and traffic-generation techniques are arcane arts?

The answer I would like to hear is:

Generate N% of the block randomly, and then fill the remaining portion with all zeros.

This works pretty well; I wrote a quick example and ran it through 'zip':
Desired	Actual
0%	0.3%
10%	11.1%
20%	21.1%
30%	31.0%
40%	40.9%
50%	50.9%
60%	60.9%
70%	70.7%
80%	80.7%
90%	90.6%
100%	100.0%

Most people who try to answer start thinking of some way to adjust the randomness of each byte, by picking it from a nonuniform distribution. I give this partial credit because it shows some level of understanding, but fundamentally I don't think it works all that well. So I did an experiment to find out. Here are two concrete proposals I have gotten.

Instead of picking each byte from the range [0, 255], pick it from the range [0, 255 * N/100]

This doesn't provide a block which zip (or gzip) handles well. This is surprising to me because the "deflate" algorithm used by the Zip format builds a Huffman encoding. I could see why table-based compression algorithms wouldn't fare as well, but with only 25 distinct symbols a Huffman tree should work great!
Desired Actual
0%	0.3%
10%	62.4%
20%	72.8%
30%	79.5%
40%	84.6%
50%	87.9%
60%	91.7%
70%	94.5%
80%	96.9%
90%	98.8%
100%	100.0%

Repeat each randomly-chosen byte an average of 100/N times.

This performed so poorly on my original test that I changed it to (100/N)+1. Two identical bytes aren't enough for run-length encoding to see any benefit. Still, this method also performs poorly, although it much better than previous one:
Desired	Actual
0%	0.3%
10%	17.0%
20%	28.5%
30%	42.5%
40%	52.9%
50%	56.3%
60%	72.3%
70%	83.8%
80%	91.9%
90%	97.0%
100%	97.8%

The last row could be fixed, that just shows how little benefit you get from repeating each random byte twice in a row.

The code for this experiment can be found here:

An answer I thought I might hear, but did not, was

Run the compression algorithm on a large sample of real data, and store 100 examples, one for each percentage value.

Tags: programming
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.