"I need data."
This is perhaps the phrase that data scientists hear most often from stakeholders. It reflects not only a growing expectation in policy and strategy organizations that precise insights from data will drive their decisions but also uncertainty as to what data is needed and how to use it. It is this juxtaposition of precision and vagueness that makes data science a nebulous discipline for many people to grasp.
Whereas many view data as a product in itself, data scientists often consider it more of an input into a research process. It can be big or small. It can be complete or missing. The key thing is that it is malleable to the purpose. It can be summarized and combined to draw inferences. It can be kept at a granular level to enable smart automation. Ultimately, data scientists must be able to frame a problem in such a way that data can be used to solve it, regardless of the form the data may take. Indeed, data science is a highly technical craft emerging from over a century of tradition and innovation. The knowledge barrier to truly mastering this craft is quite high. It is no wonder that pragmatists, crunched for time and hungry for clarity, are often tempted to dismiss the technical aspects as esoteric and unnecessary.
Hence that cry, “I need data.”
This is one of the core challenges I encountered when researching and writing “Data Science for Public Policy”. Co-authored with Ed Rubin (Assistant Professor of Economics, University of Oregon) and Gary Cornwall (Research Economist, U.S. Bureau of Economic Analysis), the textbook draws on our lived experience working as data scientists supporting macro-level strategies, academic-grade research, and field operations in the public sector. In many ways, the text is an evolved take on the traditional econometric toolkit that is so common in public policy, providing a broader vocabulary and a more expansive toolset needed to meet the needs of modern governance.
The “I need data” situation is inevitable in any organisation finding its way. Regardless of the context, most organisations can view data science through three mathematical symbols: x̄, β̂, and ŷi. Do not let the symbols put you off: before you click off to another article, hear me out.
For most data needs, all that is needed (or wanted) is a descriptive analysis. x̄ (“x-bar”) is the tool of choice – the noble sample mean, such as “average income”, “average life span”, “recidivism rate”, and “default rate”. Descriptive analysis is not strictly limited to the sample mean, but draws on a broad variety of sample statistics to fill in the context surrounding the following questions: What is the size of the problem space? What has changed over time? Are there anomalies? Are there pairs of variables that move together?
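To make the four questions concrete, here is a minimal sketch of a descriptive analysis using only the Python standard library. The yearly figures, the variable names, and the 1.5-standard-deviation anomaly rule are all assumptions made up for illustration; they do not come from any real dataset.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical yearly figures (percent); illustrative values only.
years = [2016, 2017, 2018, 2019, 2020, 2021]
default_rate = [4.1, 4.3, 4.0, 4.2, 7.9, 4.4]
unemployment = [5.0, 4.6, 4.1, 3.9, 8.1, 5.4]

# What is the size of the problem space? The sample mean, x-bar.
x_bar = mean(default_rate)

# What has changed over time? Last year minus first year.
change = default_rate[-1] - default_rate[0]

# Are there anomalies? Flag years far from the mean.
# The 1.5 standard deviation cutoff is an arbitrary illustrative choice.
spread = stdev(default_rate)
anomalies = [yr for yr, v in zip(years, default_rate)
             if abs(v - x_bar) > 1.5 * spread]


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


# Are there pairs of variables that move together?
r = pearson_r(default_rate, unemployment)

print(f"mean={x_bar:.2f} change={change:+.2f} anomalies={anomalies} r={r:.2f}")
```

Each number answers a "what" question and nothing more: the correlation says the two series move together, not that one drives the other.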
Notice these are questions focused on existence – the “whats” of the world – and are valid as long as we are only focused on the “whats”. Since descriptive analyses tend to be communicated through visually appealing graphs, they can arguably leave too much to the imagination. It is not uncommon for ambitious project managers to draw bold policy conclusions from a simple bar chart, attributing a shift in a metric to their efforts. This is precisely the weakness of x̄. Interpreting any value of x̄ is like looking into a mirror – you make of it what you will. It can reflect your state of mind, your personal agenda, or anything else of interest. While descriptive analysis is the most economical and accessible of tasks in data science, there are better tools for quantifying cause and effect.
Enter causal inference and estimation. To be able to attribute a specific outcome to a policy requires expertise in causal inference – the pursuit of “why”. It turns out scientists in the past had a low bar for claiming causality, and thus the process of proving anything was causal was quite arbitrary. A notable example is that of spontaneous generation – the production of living organisms from non-living matter (e.g. that mice can spontaneously grow from dirty clothing). Fortunately, modern-day empiricists have devised tried-and-true methods for inferring causal treatment effects. For one thing, empiricists rely on β̂ (“beta-hat”) in a regression model to quantify how X impacts Y – holding all else constant – and doing so with the added assurance that an estimate of that size is unlikely to have arisen by chance alone (statistical significance).
For an estimate to be causal, three conditions must hold true: (1) the cause is correlated with the effect, (2) the cause preceded the effect, and (3) nothing else could conceivably have caused the effect [read: need random assignment of control and treatment groups]. Finding a causal effect is truly special. A nugget of causality can drive an entire policy agenda – the treatments that we can hold true [nearly] universally. But the bar is high and clearing it is resource intensive, as we have observed from vaccine trials, tax credit programs, minimum wage changes, etc. In practice, this means identifying a truly causal effect is often reserved for the biggest of questions, or at least ones where the stars happen to align to satisfy the three causal conditions.
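For readers who want to see what sits behind β̂, the following is a minimal sketch of a one-variable least-squares regression in plain Python. The training-hours data and the function name `ols_slope` are hypothetical, and the single regressor is a simplification: a real analysis would include the other variables being held constant. The t-statistic speaks only to statistical significance; on its own it establishes condition (1), not conditions (2) or (3).

```python
from math import sqrt
from statistics import mean

# Hypothetical data: hours of job training (x) and earnings gain (y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
gain = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_slope(x, y):
    """Fit y = alpha + beta * x by least squares.

    Returns (beta_hat, standard_error, t_statistic) for the slope.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]
    sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)
    std_err = sqrt(sigma2_hat / s_xx)
    return beta_hat, std_err, beta_hat / std_err


beta_hat, std_err, t_stat = ols_slope(hours, gain)
print(f"beta_hat={beta_hat:.2f} std_err={std_err:.3f} t={t_stat:.1f}")
```

A large t-statistic says the slope is unlikely to be zero in this sample. Whether the slope deserves a causal reading still depends on how the hours of training were assigned.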
The pursuit of ŷi is a personal favourite. Like the red pill that Neo takes from Morpheus in The Matrix, ŷi unlocks an entire realm of possibilities for an empiricist that x̄ and β̂ cannot. Whereas those two are well suited to telling the story at a sample or population level, ŷi is a prediction targeted at a specific instance – a person, organisation, place, time, or thing. Predictions seem ordinary (“It’s Tuesday, so I bet it’s Taco Tuesday at the cafeteria!”), but there is nothing ordinary about anticipating what will happen with a high level of accuracy before the next moment comes to pass. This makes ŷi the tool for acting on who, where, and when. For tech companies, for example, predictions are fundamental to their business – predicting which ads and content to place in front of which customers. In public policy, predictions allow decision-makers to construct programmes and treatments at an individual level rather than treating a broad swath of the populace in the same way (e.g., targeting buildings to mitigate fire risk, getting the word out about a new program, identifying vulnerable populations).
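As an illustration of instance-level prediction, here is a minimal sketch that scores individual buildings for fire risk with a k-nearest-neighbours rule. Everything in it is an assumption for the sake of example: the two features, the training records, the building names, and the choice of k = 3. Real applications would use far more data, rescale the features, and check accuracy on held-out cases.

```python
from math import dist

# Hypothetical history: ((age in decades, past violations), had_fire).
training = [
    ((1.0, 0.0), 0),
    ((2.0, 1.0), 0),
    ((3.0, 0.0), 0),
    ((4.0, 1.0), 0),
    ((6.0, 3.0), 1),
    ((7.0, 4.0), 1),
    ((8.0, 2.0), 1),
    ((9.0, 5.0), 1),
]


def predict_risk(features, k=3):
    """Share of the k most similar past buildings that had a fire."""
    nearest = sorted(training, key=lambda row: dist(row[0], features))[:k]
    return sum(label for _, label in nearest) / k


# Hypothetical buildings that have not yet been inspected.
new_buildings = {"A": (7.5, 3.0), "B": (2.5, 0.0), "C": (5.0, 3.0)}

# One prediction per building: this is the y-hat for instance i.
y_hat = {name: predict_risk(feats) for name, feats in new_buildings.items()}

# Rank buildings so inspectors can visit the riskiest first.
ranked = sorted(y_hat, key=y_hat.get, reverse=True)
print(ranked, y_hat)
```

The output is a score per building rather than one number for the whole city, which is what lets a programme act on who, where, and when.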
This is an approach that is relatively new in government and ripe for exploration. Any strategist or policy advisor can bridge the gap with the data scientists on the other side of the office. The first step is to figure out how to map your problem space to a descriptive analysis, causal inference, or prediction. Rather than expressing your need in those three words or demanding a specific number devoid of context, help the data scientist across the table (or on the Zoom call) understand what you are up against and what you are trying to solve. Give the data scientist space to ask hard questions to size up the problem. Doing so will form the basis of a successful collaboration and build trust in data science as a tool of modern governance.
About the author
Jeff Chen, Affiliated Researcher
Jeff Chen is a computational statistician who has led public-facing data science initiatives in over 40 fields. Currently, he is the Vice President of Data Science at Nordic Entertainment Group where he leads machine learning (ML) and data engineering for personalizing one of Europe’s leading ...