I watched Rob Eastaway's 2019 talk for the Royal Institution today. Everything from the RI is great and worth checking out, but Eastaway delivered a statistic I hadn't come across before: 90% of all spreadsheets contain errors. Eastaway himself had come across the statistic from another source, the European Spreadsheet Risks Interest Group (or EuSpRIG for short).

This is not a trivial issue. EuSpRIG's website has a "horror stories" section that demonstrates the gravity of errors in the wrong kind of spreadsheet. Even if we set aside the few stories involving malware embedded in spreadsheets, like the BlackEnergy power plant shutdown - for many reasons it makes sense to count and study malware separately from unintentional human and formula errors - EuSpRIG lists dozens of separate incidents involving massive financial losses. Tax, criminal, and medical records are all stored in spreadsheets. Single-digit error rates have major repercussions.

Claims putting spreadsheet error rates in the 84% to 90%+ range have been around for many years and across a wide variety of circumstances. University of Hawaii professor Raymond Panko's work in this area is particularly compelling, and his website is worth reviewing even if you find the eye-catching statistic risible.

Even absent further investigation, we could expect spreadsheet error rates to be substantial, because many spreadsheets involve an activity that researchers have long known to be error-prone in humans: repetitive simple tasks, or what the literature describes as "simple nontrivial cognitive tasks". When spreadsheets are created by human input, they often involve repeatedly typing in small strings of text and digits. Error rates for these kinds of task tend to be below 5%.
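Those per-cell rates compound quickly. As a back-of-envelope sketch - assuming independent errors and illustrative numbers of my own choosing, not figures from any study - with a per-cell error rate p and n hand-entered cells, the chance a spreadsheet contains at least one error is 1 - (1 - p)^n:

```python
# Back-of-envelope: probability a spreadsheet contains at least one
# error, assuming each hand-entered cell fails independently.
# The rates and cell counts below are illustrative, not from any study.

def p_any_error(p_cell: float, n_cells: int) -> float:
    """Chance of at least one error among n cells, per-cell error rate p."""
    return 1 - (1 - p_cell) ** n_cells

# Even a 1% per-cell error rate overwhelms a modest 250-cell sheet:
print(round(p_any_error(0.01, 250), 3))  # 0.919 - over 90%
```

On these assumptions, a 90%-of-spreadsheets figure stops looking outlandish: it only takes a few hundred hand-typed cells for a small per-cell error rate to make at least one mistake nearly certain.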

But spreadsheets also have properties that make them particularly sensitive to small errors. A spreadsheet divides input into discrete cells and lets users perform calculations on that input by applying formulas to those cells. This means that a single cell can affect any other cell in the spreadsheet. A human user might mistype only one cell's worth of input, but that mistake can propagate to every cell that depends on it.
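A toy example of that propagation, with all figures made up for illustration: one extra digit in a single input cell corrupts every formula downstream of it.

```python
# Toy example of error propagation: the spreadsheet-style formulas
# below depend on every input cell, so one mistyped value (an extra
# digit in Q3) corrupts the subtotal and the taxed total alike.
# All figures are hypothetical.

correct = [120, 130, 125, 135]     # quarterly costs as intended
typo    = [120, 130, 1250, 135]    # same data with one extra digit

subtotal = sum(correct)            # like =SUM(A1:A4) -> 510
total    = subtotal * 1.2          # like =B1*1.2     -> 612.0

bad_subtotal = sum(typo)           # 1635
bad_total    = bad_subtotal * 1.2  # 1962.0 - one cell, every total wrong
```

The input error here is one keystroke, but the fraction of *incorrect cells* in the sheet is much higher than the fraction of incorrect inputs, which is part of why audited spreadsheets look so bad.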

The ratio of correct to incorrect cells is not the only (or even the preferable) way to judge a spreadsheet's accuracy. A researcher might instead use a pre-established technique for auditing data errors, then count and categorize the errors that technique identifies in a given spreadsheet.

What does all of this mean, exactly? It never hurts to view data with a skeptical eye. We can build more accurate predictive models from spreadsheets if we account for the sorts of inaccuracies we tend to find in them, in addition to other sources of error in the data, such as collection methodology. In this sense, spreadsheets are like any other representation of information that human beings create: useful in many ways, but slightly imperfect.