Mark Gritter (markgritter) wrote,
@ 2008-07-09 11:06:00
Entry tags:compression, graphs, math, statistics

Your Graph is Bad and You Should Feel Bad
I came across this graph in "Measuring the Compressibility of Metadata and Small Files" by Nathan Edel, Ethan Miller, Karl Brandt, and Scott Brandt. It shows the average compression ratio plotted against file size, for a sample of files from a Linux filesystem.



The graph is hard to read because each of the three data sets is drawn with a different thick dashed line. I'm not sure what Edward Tufte would have to say here. The paper has to be black-and-white (otherwise color would be the obvious choice), but perhaps a thin line for each data set would be sufficient, since they do not overlap all that much.

From the paper: "We averaged files across size groups at 512-byte increments... Files between 32KB and 128KB showed a similar or slightly higher average degree of compressibility in overall, but had a degree of variation between different size groups."

I think this is probably a gross misinterpretation of their results, abetted by poor graph design. In a typical Unix filesystem, there are more small files than there are large files. If the bucket size in bytes remains constant, there are fewer files in each big-file bucket than there are in each small-file bucket. The mean of a bucket with n files has a standard error proportional to 1/sqrt(n), so a mean taken over fewer files is noisier. As a result, the variation at the bucket level will be larger for larger files, even if large files are actually less variable.

Here is a synthetic data set showing this effect. I generated compression ratios for 100K files from a normal distribution centered around 0.4 with standard deviation 0.1. The file sizes follow a Pareto distribution with alpha=1.0 (minimum 16KB). When the mean compression ratio is plotted in 512-byte buckets, it shows the same increase in variability for larger files noted in the paper, even though the underlying variability is the same at every file size.
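A minimal sketch of how such a data set could be generated and bucketed, assuming NumPy. This is not the script behind the plot: the function names, the seed, and the 128KB upper cutoff are all assumptions made for illustration.

```python
import numpy as np

def make_synthetic(n_files=100_000, seed=0):
    """Hypothetical helper: file sizes and compression ratios as described above."""
    rng = np.random.default_rng(seed)
    # Compression ratio is independent of size: Normal(0.4, 0.1) for every file.
    ratios = rng.normal(loc=0.4, scale=0.1, size=n_files)
    # rng.pareto draws from the Lomax form; adding 1 and scaling by the
    # minimum gives a classical Pareto with alpha=1.0 and minimum 16KB.
    sizes = 16 * 1024 * (1.0 + rng.pareto(1.0, size=n_files))
    return sizes, ratios

def fixed_width_bucket_means(sizes, ratios, width=512, max_size=128 * 1024):
    """Mean ratio per fixed-width size bucket (the cutoff is an assumption)."""
    keep = sizes < max_size
    idx = (sizes[keep] // width).astype(np.int64)
    counts = np.bincount(idx)
    sums = np.bincount(idx, weights=ratios[keep])
    nonempty = counts > 0
    starts = np.nonzero(nonempty)[0] * width  # lower edge of each bucket, in bytes
    return starts, sums[nonempty] / counts[nonempty], counts[nonempty]

sizes, ratios = make_synthetic()
starts, means, counts = fixed_width_bucket_means(sizes, ratios)
print("files in smallest-size bucket:", counts[0])
print("files in largest-size bucket: ", counts[-1])
print("spread of means, first 20 buckets:", np.std(means[:20]))
print("spread of means, last 20 buckets: ", np.std(means[-20:]))
```

The bucket counts fall off steeply with file size, so the last few bucket means are averages over far fewer files than the first few, and they scatter more for that reason alone.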



If we create buckets of equal sample size (3000 files, plotted at the size of the smallest file in the bucket), then the effect goes away:
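A sketch of equal-count bucketing on the same kind of synthetic data, again assuming NumPy. The function name and the choice to drop the leftover partial bucket are assumptions, not necessarily what was done for the plot.

```python
import numpy as np

def equal_count_bucket_means(sizes, ratios, per_bucket=3000):
    """Sort files by size, then average the ratio over runs of `per_bucket` files."""
    order = np.argsort(sizes)
    sizes_sorted = sizes[order]
    ratios_sorted = ratios[order]
    n_full = len(sizes_sorted) // per_bucket  # leftover partial bucket is dropped
    used = n_full * per_bucket
    means = ratios_sorted[:used].reshape(n_full, per_bucket).mean(axis=1)
    # Plot position: the size of the smallest file in each bucket.
    x = sizes_sorted[:used:per_bucket]
    return x, means

# Same synthetic setup as described in the post (the seed is an assumption).
rng = np.random.default_rng(0)
ratios = rng.normal(0.4, 0.1, size=100_000)
sizes = 16 * 1024 * (1.0 + rng.pareto(1.0, size=100_000))

x, means = equal_count_bucket_means(sizes, ratios)
print("number of buckets:", len(means))
print("spread of bucket means:", np.std(means))
```

Because every bucket averages the same number of files, every bucket mean has the same standard error, and any trend in variability that remains would have to come from the data rather than from the bucketing.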




Signal Averaging
(Anonymous)
2008-07-11 05:45 pm UTC
I agree; the graph looks like it was made by an undergraduate student. Essentially what you are doing is signal averaging (I've been a spectroscopist in my life).
I am also amused when authors report way too many sig figs. "I'm using a meter stick, but my measurements are good to the nearest micron."


Interesting what one finds ego-googling!
cubicle_hermit
2008-08-05 05:24 pm UTC
Although, heh, it's never bad to be cited even if it's for critique, right?

I don't know at this point whether the graphs for this section were made by Karl or myself (Nate), but yeah, a lot of the work on that part of the project looks half-baked in retrospect. Ah well; I was a first-year MS student, and am back in industry.

Thanks for the tip with averaging buckets by numbers of files.


