Mark Gritter (markgritter) wrote,

Your Graph is Bad and You Should Feel Bad

I came across this graph in "Measuring the Compressibility of Metadata and Small Files" by Nathan Edel, Ethan Miller, Karl Brandt, and Scott Brandt. It shows the average compression ratio plotted against file size, for a sample of files from a Linux filesystem.

The graph is hard to read because each of the three data sets is drawn with a different thick dashed line. I'm not sure what Edward Tufte would have to say here. The paper has to be black-and-white (otherwise color would be the obvious choice), but perhaps a thin line for each data set would be sufficient, since they do not overlap all that much.

From the paper: "We averaged files across size groups at 512-byte increments... Files between 32KB and 128KB showed a similar or slightly higher average degree of compressibility in overall, but had a degree of variation between different size groups."

I think this is probably a gross misinterpretation of their results, abetted by poor graph design. In a typical Unix filesystem, there are more small files than there are large files. If the bucket size in bytes remains constant, there are fewer files in each big-file bucket than there are in each small-file bucket. As a result, the bucket averages will vary more for larger files, simply because each average is computed from fewer samples, even if large files are actually less variable.
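To put a rough number on this, here is the usual standard-error argument. It assumes (my simplification, not anything stated in the paper) that the files in a bucket have independent compression ratios with a common standard deviation \sigma, and that bucket b contains n_b files:

```latex
\bar{r}_b = \frac{1}{n_b} \sum_{i=1}^{n_b} r_i,
\qquad
\operatorname{SD}(\bar{r}_b) = \frac{\sigma}{\sqrt{n_b}}
```

So with purely illustrative counts, a bucket holding 3000 files and one holding 50 files would have bucket means that differ in spread by a factor of \sqrt{3000/50} = \sqrt{60} \approx 7.7, with no difference at all in the per-file variability.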

Here is a synthetic data set showing this effect. I generated compression ratios for 100K files from a normal distribution centered around 0.4 with standard deviation 0.1. The file sizes follow a Pareto distribution with alpha=1.0 (minimum 16KB). When the mean compression ratio is plotted in 512-byte buckets, it shows the same increase in variability for larger files noted in the paper, even though the distribution of compression ratios is the same at every file size.
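For anyone who wants to reproduce something like this, here is a minimal sketch in Python with NumPy. It is not the script I used for the plot; the function names, the random seed, and the 128KB cutoff (chosen to match the range discussed in the paper) are all assumptions made for illustration.

```python
import numpy as np

def synthetic_files(n_files=100_000, seed=0):
    """Draw (size_in_bytes, compression_ratio) pairs; ratio is independent of size."""
    rng = np.random.default_rng(seed)
    ratios = rng.normal(loc=0.4, scale=0.1, size=n_files)
    # NumPy's pareto() samples the Lomax (Pareto II) distribution, so add 1
    # and scale by the minimum to get a classical Pareto with minimum 16KB.
    sizes = 16 * 1024 * (1.0 + rng.pareto(1.0, size=n_files))
    return sizes, ratios

def fixed_width_bucket_means(sizes, ratios, width=512, max_size=128 * 1024):
    """Mean ratio per fixed-width size bucket, for files below max_size (assumed cutoff)."""
    keep = sizes < max_size
    idx = (sizes[keep] // width).astype(int)
    counts = np.bincount(idx)
    sums = np.bincount(idx, weights=ratios[keep])
    nonempty = counts > 0
    starts = np.nonzero(nonempty)[0] * width  # lower edge of each bucket, in bytes
    return starts, sums[nonempty] / counts[nonempty], counts[nonempty]

sizes, ratios = synthetic_files()
starts, means, counts = fixed_width_bucket_means(sizes, ratios)
```

Plotting `means` against `starts` should give a line that gets noisier toward the right, and `counts` shows why: the buckets near 16KB hold far more files than the buckets near 128KB.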

If we create buckets of equal sample size (3000 files, plotted at the size of the smallest file in the bucket), then the effect goes away:
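A minimal sketch of the equal-sample-size bucketing, again in Python with NumPy and again not the original script. The function name, the seed, and the decision to drop the leftover largest files that do not fill a whole bucket are assumptions for illustration.

```python
import numpy as np

def equal_count_bucket_means(sizes, ratios, per_bucket=3000):
    """Mean ratio per bucket of per_bucket files, each placed at its smallest file size."""
    order = np.argsort(sizes)
    sizes_sorted = sizes[order]
    ratios_sorted = ratios[order]
    n_full = len(sizes) // per_bucket
    trimmed = n_full * per_bucket  # assumption: discard the partial last bucket
    starts = sizes_sorted[:trimmed:per_bucket]  # smallest file in each bucket
    means = ratios_sorted[:trimmed].reshape(n_full, per_bucket).mean(axis=1)
    return starts, means

# Same synthetic setup as described above: normal ratios, Pareto sizes.
rng = np.random.default_rng(0)
ratios = rng.normal(loc=0.4, scale=0.1, size=100_000)
# pareto() is Lomax in NumPy; add 1 and scale to get a 16KB minimum.
sizes = 16 * 1024 * (1.0 + rng.pareto(1.0, size=100_000))

starts, means = equal_count_bucket_means(sizes, ratios)
```

Because every bucket now averages the same number of files, every bucket mean has the same standard error, so the line stays equally flat across the whole size range. The trade-off is that the buckets get much wider in bytes as file size grows.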

Tags: compression, graphs, math, statistics