Mark Gritter (markgritter) wrote,

Reality >>> Theory

Bianca Schroeder's talk on DRAM errors is worth watching. (Though it did remind me how annoying EE380 sessions could be: it's generally not students asking the same question over and over, it's faculty or random people off the street.)

Old theory: Increased heat causes errors; soft errors dominate and are randomly distributed. (Lab testing involves pointing a heat gun at a DRAM device...)

Actual measurements from Google's server farm:

* Heat is correlated with increased error rate but does not seem to cause it. Rather, increased utilization drives both.

* Hard errors probably dominate error count and drive most uncorrectable errors.

* Less than 20% (maybe as little as 8%) of DRAM components accounted for 95-98% of errors.

* Initial burn-in works (little infant mortality) but devices start to degrade after 10-16 months.

* Little difference in error rate across manufacturers. Increasing size affects error rate in unpredictable ways, often super-linearly.
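
For a sense of how a concentration figure like the "less than 20% of components" one gets computed, here is a rough sketch. It assumes you have a per-DIMM error count; the function name and the counts are made up for illustration and are not data from the talk.

```python
def smallest_fraction_covering(error_counts, share):
    """Smallest fraction of components whose errors add up to at least
    `share` of all observed errors (hypothetical helper, not from the talk)."""
    total = sum(error_counts)
    if total == 0:
        return 0.0  # no errors at all, so no components are needed
    running = 0
    # Walk components from worst to best, accumulating their error counts.
    for k, count in enumerate(sorted(error_counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return k / len(error_counts)
    return 1.0


# Invented error counts for ten hypothetical DIMMs (illustration only).
counts = [960, 20, 10, 5, 3, 2, 0, 0, 0, 0]

print(smallest_fraction_covering(counts, 0.95))  # 0.1
print(smallest_fraction_covering(counts, 0.98))  # 0.2
```

In this toy data one bad DIMM out of ten already covers 95% of the errors, which is the kind of skew that a "randomly distributed soft errors" model would not produce.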

Dr. Schroeder's other work on reliability is also interesting.
Tags: computers, hardware, measurement