Mark Gritter (markgritter) wrote,
Mark Gritter

I'm thinking of retiring this interview question

I have been experimenting with asking the following question during interviews:

Our file system compresses data before storing it to flash. In order to test this, we need to generate data that has a given level of compressibility. A block which has all the same byte will compress down to close to 0% of its original size. A block which has completely random data will stay at 100% of its original size. Can you describe a way to generate blocks that have a tunable amount of compression, within a few percent?

The experience has been one of mutual frustration so far. I thought I was giving a pretty big hint with my explanation of compression ratios. But I guess compression and traffic-generation techniques are arcane arts?

The answer I would like to hear is:

Generate N% of the block randomly, and then fill the remaining portion with all zeros.

This works pretty well; I wrote a quick example and ran it through 'zip':
Desired	Actual
0%	0.3%
10%	11.1%
20%	21.1%
30%	31.0%
40%	40.9%
50%	50.9%
60%	60.9%
70%	70.7%
80%	80.7%
90%	90.6%
100%	100.0%

Most people who try to answer start thinking of some way to adjust the randomness of each byte, by picking it from a nonuniform distribution. I give this partial credit because it shows some level of understanding, but fundamentally I don't think it works all that well. So I did an experiment to find out. Here are two concrete proposals I have gotten.

Instead of picking each byte from the range [0, 255], pick it from the range [0, 255 * N/100]

This doesn't provide a block which zip (or gzip) handles well. This is surprising to me because the "deflate" algorithm used by the Zip format builds a Huffman encoding. I could see why table-based compression algorithms wouldn't fare as well, but with only 25 distinct symbols a Huffman tree should work great!
Desired Actual
0%	0.3%
10%	62.4%
20%	72.8%
30%	79.5%
40%	84.6%
50%	87.9%
60%	91.7%
70%	94.5%
80%	96.9%
90%	98.8%
100%	100.0%

Repeat each randomly-chosen byte an average of 100/N times.

This performed so poorly on my original test that I changed it to (100/N)+1. Two identical bytes aren't enough for run-length encoding to see any benefit. Still, this method also performs poorly, although it much better than previous one:
Desired	Actual
0%	0.3%
10%	17.0%
20%	28.5%
30%	42.5%
40%	52.9%
50%	56.3%
60%	72.3%
70%	83.8%
80%	91.9%
90%	97.0%
100%	97.8%

The last row could be fixed, that just shows how little benefit you get from repeating each random byte twice in a row.

The code for this experiment can be found here:

An answer I thought I might hear, but did not, was

Run the compression algorithm on a large sample of real data, and store 100 examples, one for each percentage value.

Tags: programming
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded