A new policy brief by Diane Coyle and Annabel Manley summarises the various methods being used in practice to value datasets and how they compare.
Talking about economic statistics rarely makes you the most fun person to be with at a party. However, working out how to measure things is a crucial part of good decision-making, and accurate metrics are desirable for both policy and business strategy. As investment in data assets by public and private sector organisations increases across the world, it is therefore important to work out “how much is a dataset worth?”
This is a difficult question because data is a strange beast, with a variety of economic characteristics that make it hard to value. It is non-rival, meaning that many people can use it without affecting its ability to be used by the next person. It has many externalities: positive, in that the value of two datasets combined is often greater than the sum of their separate values, and negative, in that data on one person can decrease the privacy afforded to another. It also has an option value, whereby data can be valuable merely because holding it gives an organisation more flexibility in the future.
In particular, this question is important when decisions are marginal: when organisations are unsure whether to invest in a dataset, or when they are choosing between different datasets. For example, consider the COVID-19 infection survey run by the ONS. During the height of the pandemic in 2020-2021, it was clear that this data was worth collecting and updating to a very high standard. However, as restrictions lifted in 2022, the value of the data declined, and some questioned whether new data was worth the resources spent on it. Quantifying this change well would help policymakers decide how much to continue investing in monitoring case rates.
Our latest policy brief summarises how different organisations have approached the question of how to value data in light of these difficulties, and the methods which have been tried and tested. These are summarised below.
The literature on this question has grown considerably in recent years, and this is likely to continue as the number of data investment decisions and evaluations increases. However, each of the methods currently developed has some drawbacks. Future work should develop methodologies that are better able to account for data’s difficult economic characteristics, though it is unlikely that a single consensus method of valuing data for all use cases will emerge soon.
Instead, the likely outcome for data valuation is a consensus method for each use case. This might mean that the measured value of personal data lost in a security breach is not comparable to the measured value of regional data breakdowns in the national accounts, but both measures will be more useful in the contexts in which they are most typically used.
Based on the literature, there are four key considerations in choosing the best method for each use case. These are best summarised as what, who, when, and why:
What is being valued can range from raw data on its own, such as the price you could sell your data for, up to data-informed insights and the actions that they prompt.
Who values the data matters because data producers and end-users value different aspects of it, whilst society as a whole has different considerations from single individuals or organisations.
When the valuation takes place is particularly relevant due to the non-rival and option characteristics of data. Methods used to project the value of a future investment have much more uncertainty over the potential value of the dataset than an evaluation of a past project.
Why, or what the purpose of the valuation is, also affects which aspects of data should be ignored or included. For example, conservative estimates may be appropriate for national accounts bodies, but not for data trusts.
For policymakers, the key point is to appreciate that data does have considerable economic value, and its potential is much greater if ways can be found to increase access and use through suitable governance arrangements. And while none of the methods currently available is perfect, having even a rough guide to value will help inform decisions about investment in data assets.
Summary of methods for calculating the value of data
- Cost-based: Calculated by identifying the costs of creating or replacing a dataset. This is the recommended method in the UN System of National Accounts when there is no market price available for the dataset. It is well understood, relatively easy to calculate, and relatively objective as a lower bound estimate of value. However, this method excludes the value of the content of data, and recent work shows that identifying costs is not as straightforward as once thought.
- Income-based: Calculated by identifying the expected revenue streams being generated by the data. This method works well when there are clear income streams being derived from the data. However, it does not work well for data used for internal processes. In addition, data insights can improve revenue streams without being solely responsible for them, and, without a good counterfactual, the revenue attributable to the use of data can be subjective.
- Market-based: Calculated using observed market prices, which are the preferred basis for valuation when they are available. Market prices can include prices from data markets, market capitalisations of firms, and global data flows. However, very few prices are available, as few data markets have become ‘thick’ enough to be sustainable. This is because it is difficult to make markets with trustworthy data and contract enforcement.
- Experiments and surveys: Widely used in cases where market prices do not exist or do not reflect the full value of the data, including externalities. In particular, this method is useful where data is unlikely ever to be traded. However, the results can be highly sensitive to survey and experiment design, and much care should be taken to ensure incentive compatibility.
- Impact-based: Calculated by identifying the causal impact of data on outcomes, typically by exploiting natural experiments or simulating counterfactual outcomes. This method is particularly useful for calculating the value of data in a way that is easily communicated to stakeholders. However, counterfactuals can be difficult to calculate, and this method does not separate data from its complementary investments.
- Shapley values: Calculated for each data point used to create an AI model. The Shapley value formalises the impact of each individual data point on the performance of the model. However, these values are computationally expensive to calculate for models that use many data points, and they only generate relative weightings rather than absolute monetary values.
- Stakeholder-based: Calculated by identifying the value of the data to a wide group of stakeholders. This accounts for all sources of value in the data, but involves subjective judgements.
- Real options analysis: This method includes the option value of a dataset, as defined by the “right but not the obligation” to generate and act upon insights from the data. This reflects the observed practice of organisations collecting data when they do not yet know what they will use it for. However, it can be difficult to observe many of the parameters needed for these calculations, leading to proxy use and subjective judgements.
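To make the Shapley values entry above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method taken from the policy brief: the function names, the three-point dataset, and the toy "performance" score (the number of distinct regions covered) are all hypothetical stand-ins for a real model and a real accuracy metric. Each data point's value is its average marginal contribution to the score across every possible subset of the other points.

```python
from itertools import combinations
from math import factorial


def shapley_values(points, utility):
    """Exact Shapley value of each data point under a given utility.

    `utility` maps a list of data points to a performance score and is
    assumed not to depend on the order of the points.
    """
    n = len(points)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for coalitions of this size:
            # |S|! * (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [points[j] for j in subset]
                with_i = without_i + [points[i]]
                # Marginal contribution of point i to this coalition.
                values[i] += weight * (utility(with_i) - utility(without_i))
    return values


def toy_utility(subset):
    """Hypothetical score: how many distinct regions the subset covers."""
    return float(len(set(subset)))


# Hypothetical dataset: two duplicate records and one unique record.
data = ["north", "north", "south"]
scores = shapley_values(data, toy_utility)
print(scores)
```

In this toy case the two duplicate "north" records each receive a value of 0.5 and the unique "south" record receives 1.0, and the three values sum to the score of the full dataset. The sketch also shows both drawbacks noted above: the inner loops visit every subset of the other points, so the cost grows exponentially with the number of data points, and the results are shares of a performance score rather than monetary amounts.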
Policy brief: What is the value of data? A review of empirical methods
The views and opinions expressed in this post are those of the author(s) and not necessarily those of the Bennett Institute for Public Policy.