Real-world through the lens of data-driven models

Sabailabo
4 min read · Nov 15, 2019

Connections between real-world observations and data-driven models are always a hot topic. Allen and Tildesley (1987) describe the connection between experiments, experience, theory and computer simulations, and how together they help us gain and improve knowledge of a real system. The figure below displays a conceptual connection between real-world observations, data-driven models and the learning process.

They argued that the more difficult and interesting the problem, the more desirable it becomes to have as close to an exact solution to the problem as possible. For complex problems, this may only be tractable through computer simulations, and, as the computational power and finesse in modelling have increased, the engineering community has been able to develop extremely accurate and strongly predictive simulation models. The engineers recognized computer experiments (simulations) as a way of connecting the microscopic details of a system to the macroscopic properties of experimental interest.

The revolution of Big Data analytics has shown the efficacy of data-driven models when predicting individual outcomes based on large collections of observed data. This is also a good example of a bridge between individual (microscopic) entities and the real-world macroscopic observations.

Learning, for humans as well as for AI or ML algorithms, follows a cycle: we observe real-world data, construct a model, use the model to predict a real-world outcome, and finally improve the model by comparing its predictions with actual observations. This learning process should follow the principles of the scientific method: the models must be falsifiable, and they must be continuously tested and improved. In this way the current collection of knowledge is advanced.

However, George Box's aphorism that “all models are wrong, but some are useful” remains important when assessing how much confidence to place in the results from our models.

Big data analytics has shown both its worth and its potential for negative impact in several application areas over the last decade. Some of the best-known applications are in targeted marketing, as illustrated in the NY Times article “How Companies Learn Your Secrets” and by the recent Facebook–Cambridge Analytica scandal related to the 2016 US presidential campaign (first reported by The Guardian, 11. Dec. 2015, and more recently by NY Times, 8. Apr. 2018). Other uses include recommendation engines, used by companies like Netflix and Amazon to suggest similar or related goods or services, and, of course, the algorithms that make your Google searches so “accurate” that you seldom need to go beyond page one of your search results!

The engine behind this data-driven revolution is machine learning (ML). ML is by no means a new concept: the term was coined by A.L. Samuel in 1959. It can loosely be thought of as the class of algorithms that build models based on statistical relationships in data (see P. Domingos, 2012 for an informal introduction to ML methods and the statistical background, or T. Hastie et al., 2009 for a more comprehensive treatise). The goal of an ML model in a data-analytics setting is to provide accurate and reliable predictions, and its most prominent and widespread use is in consumer markets, where data are Big.

In these consumer markets, the value captured through the revolutionary use of ML and data analytics typically comes from small gains that accumulate across an entire population of consumers.

The best minds of my generation are thinking about how to make people click ads. That sucks.

Jeff Hammerbacher, interviewed by Ashlee Vance
“This Tech Bubble is Different” — Bloomberg Businessweek, April 14, 2011

For high-risk systems, characterized by high-consequence and low-probability scenarios, the situation is quite different. The potentially negative impact may be huge, and erroneous predictions might lead to catastrophic consequences.

Data-driven decisions are based on three principles in data science: predictability, computability, and stability (B. Yu, 2017). In addition, and particularly important for safety-critical systems, the consequences of erroneous predictions need to be assessed in a decision context.

Prediction is a prerequisite if an AI or ML algorithm is going to be used in a decision context. Together with cross-validation (CV), it provides a human-interpretable validation of the applied model for a given application. To achieve good prediction accuracy, the model must be developed from sufficient relevant data from the data-generating process (DGP), i.e., the process that generates the data one wants to predict. It is also important that the DGP is computationally feasible, i.e., that an algorithm exists that can learn the DGP from a finite set of training data (see, e.g., A. Blum and R.L. Rivest, 1988).
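To make the idea of validating predictions with cross-validation concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the DGP (a straight line plus Gaussian noise), the helper names `fit_line` and `k_fold_mse`, and the choice of five folds are all hypothetical and not taken from the references above.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def k_fold_mse(xs, ys, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Hypothetical DGP (an assumption for illustration): y = 2 + 3x + noise.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

scores = k_fold_mse(xs, ys, k=5)
mean_mse = sum(scores) / len(scores)
print("held-out MSE per fold:", [round(s, 3) for s in scores])
print("mean held-out MSE:", round(mean_mse, 3))
```

Each fold is predicted by a model that never saw it, so the per-fold errors give a rough, readable picture of how the model might perform on new data from the same DGP.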

If the above requisites are met, and an erroneous prediction has a sufficiently negative consequence to be unwanted, the last principle to consider is stability. This is the major problem with safety-critical decisions based on data-driven models. For predictions to be valid, future scenarios need to be stable with respect to the data on which the prediction model was trained. At the same time, data related to high-consequence scenarios are, thankfully, scarce. In line with Cathy O’Neil’s argument above, this makes ordinary data-driven methods, so far, inappropriate for decision making in the context of unwanted and rare scenarios.
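The stability concern can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the nonlinear process `dgp`, the "ordinary" operating range from 0 to 1, and the "rare" range from 3 to 4 are hypothetical and do not represent any real system. A straight line is fitted to data from the ordinary range and then asked to predict in the rare range, where no training data exist.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def mse(a, b, xs, ys):
    """Mean squared error of the line y = a + b * x on the given data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def dgp(x):
    """Hypothetical nonlinear process (an assumption for illustration)."""
    return x ** 2


# Ordinary operating range: plenty of observations between 0 and 1.
train_x = [i / 20 for i in range(21)]
train_y = [dgp(x) for x in train_x]
a, b = fit_line(train_x, train_y)

# Rare scenario: conditions between 3 and 4 that were never observed.
rare_x = [3.0 + i / 20 for i in range(21)]
rare_y = [dgp(x) for x in rare_x]

# Error inside the observed range (measured on the training points) versus
# error in the unobserved, rare range.
in_range_mse = mse(a, b, train_x, train_y)
rare_mse = mse(a, b, rare_x, rare_y)
print("MSE inside the observed range:", round(in_range_mse, 4))
print("MSE in the rare scenario:", round(rare_mse, 2))
```

The model looks accurate where data are plentiful, yet its error grows by orders of magnitude in the rare range, which is exactly where a wrong prediction would matter most.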

By courtesy of Simen Eldevik.
