Our file system compresses data before storing it to flash. In order to test this, we need to generate data that has a given level of compressibility. A block which has all the same byte will compress down to close to 0% of its original size. A block which has completely random data will stay at 100% of its original size. Can you describe a way to generate blocks that have a tunable amount of compression, within a few percent?
The experience has been one of mutual frustration so far. I thought I was giving a pretty big hint with my explanation of compression ratios. But I guess compression and traffic-generation techniques are arcane arts?
The answer I would like to hear is:
Generate N% of the block randomly, and then fill the remaining portion with all zeros.
This works pretty well; I wrote a quick example and ran it through 'zip':
Desired Actual 0% 0.3% 10% 11.1% 20% 21.1% 30% 31.0% 40% 40.9% 50% 50.9% 60% 60.9% 70% 70.7% 80% 80.7% 90% 90.6% 100% 100.0%
Most people who try to answer start thinking of some way to adjust the randomness of each byte, by picking it from a nonuniform distribution. I give this partial credit because it shows some level of understanding, but fundamentally I don't think it works all that well. So I did an experiment to find out. Here are two concrete proposals I have gotten.
Instead of picking each byte from the range [0, 255], pick it from the range [0, 255 * N/100]
This doesn't provide a block which zip (or gzip) handles well. This is surprising to me because the "deflate" algorithm used by the Zip format builds a Huffman encoding. I could see why table-based compression algorithms wouldn't fare as well, but with only 25 distinct symbols a Huffman tree should work great!
Desired Actual 0% 0.3% 10% 62.4% 20% 72.8% 30% 79.5% 40% 84.6% 50% 87.9% 60% 91.7% 70% 94.5% 80% 96.9% 90% 98.8% 100% 100.0%
Repeat each randomly-chosen byte an average of 100/N times.
This performed so poorly on my original test that I changed it to (100/N)+1. Two identical bytes aren't enough for run-length encoding to see any benefit. Still, this method also performs poorly, although it much better than previous one:
Desired Actual 0% 0.3% 10% 17.0% 20% 28.5% 30% 42.5% 40% 52.9% 50% 56.3% 60% 72.3% 70% 83.8% 80% 91.9% 90% 97.0% 100% 97.8%
The last row could be fixed, that just shows how little benefit you get from repeating each random byte twice in a row.
The code for this experiment can be found here: https://gist.github.com/anonymous/985443b83e60f82161be
An answer I thought I might hear, but did not, was
Run the compression algorithm on a large sample of real data, and store 100 examples, one for each percentage value.