Old theory: increased heat causes errors; soft errors dominate and are randomly distributed. (Lab testing involves pointing a heat gun at a DRAM device...)
Actual measurements from Google's server farm:
* Heat is correlated with increased error rate but does not seem to cause it. Rather, increased utilization drives both.
* Hard errors probably dominate error count and drive most uncorrectable errors.
* Fewer than 20% (perhaps as few as 8%) of DRAM components accounted for 95-98% of errors.
* Initial burn-in works (little infant mortality) but devices start to degrade after 10-16 months.
* Little difference in error rate across manufacturers. Increasing size affects error rate in unpredictable ways, often super-linearly.
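To make the concentration figure above concrete, here is a minimal sketch of how one might compute the smallest share of components that covers a given fraction of all errors. The function name and the sample counts are made up for illustration; they are not data or methods from the study.

```python
def smallest_share_covering(error_counts, target=0.95):
    """Return the smallest fraction of components whose errors add up to
    at least `target` of all observed errors (hypothetical helper)."""
    total = sum(error_counts)
    if total == 0:
        return 0.0  # no errors at all, so no components are needed
    covered = 0
    # Walk components from worst to best, accumulating their error counts.
    for used, count in enumerate(sorted(error_counts, reverse=True), start=1):
        covered += count
        if covered >= target * total:
            return used / len(error_counts)
    return 1.0


# Made-up per-component error counts for 10 hypothetical DIMMs.
counts = [960, 30, 5, 3, 2, 0, 0, 0, 0, 0]
share = smallest_share_covering(counts)
print(f"{share:.0%} of components cover 95% of errors")
```

Sorting from worst to best component is what makes the answer the smallest possible share; any other ordering would need at least as many components to reach the same coverage.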
Dr. Schroeder's other work on reliability is also interesting.