Recent progress in AI has been dramatic, promising benefits such as productivity gains but also significant disruption. Alan Chan and Herbie Bradley make the case for a National Data Trust to ensure the public can benefit from the corporate use of massive amounts of data to build new AI models.
The stated aim of frontier AI labs such as OpenAI is to build “highly autonomous systems that outperform humans at most economically valuable work,” possibly within the coming decade. Recent innovations such as ‘generative AI’ (artificial intelligence) or ‘large language models’ (LLMs) underpinning OpenAI’s ChatGPT or Google’s Bard may bring short-term societal benefits, such as through improving worker productivity, enabling new ways to create art, and accelerating the processes of scientific discovery.
However, our livelihoods and socio-economic structures may also be threatened if such systems increasingly substitute for human labour at minimal cost. ChatGPT can already be instructed to coherently solve useful tasks at a rate of just $2 per million words. Some businesses such as law firms are already using LLMs to perform previously labour-intensive functions.
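To make the quoted rate concrete, the arithmetic can be sketched in a few lines of Python. Only the figure of $2 per million words comes from the text above; the function name and the idea of pricing a single document by its word count are assumptions made for illustration.

```python
# Rate quoted in the article: about $2 per million words.
USD_PER_MILLION_WORDS = 2.0


def generation_cost_usd(words: int) -> float:
    """Approximate cost in US dollars of producing `words` words at the quoted rate."""
    if words < 0:
        raise ValueError("words must be non-negative")
    return words * USD_PER_MILLION_WORDS / 1_000_000


if __name__ == "__main__":
    # Hypothetical document lengths, chosen only to show the scale involved.
    for words in (1_000, 50_000, 1_000_000):
        print(f"{words:>9,} words: ${generation_cost_usd(words):.4f}")
```

At this rate even a long document costs a fraction of a dollar to generate, which is the sense in which substitution for human labour could come "at minimal cost".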
Critically, the most significant advances in LLMs currently depend upon large amounts of high-quality data. For large language models, training dataset sizes now run into trillions of words, while for image models (like OpenAI’s DALL-E) they now run into billions of images. The source of this data is the internet, a digital commons that comprises the cumulative intellectual and cultural contributions of humanity to be found online. We all contribute to and benefit from this commons, from Wikipedia and discussion forums to open-source software and YouTube videos.
The risk of upheaval to livelihoods and society from AI is a negative externality imposed upon the public. Particularly striking is that all of us who could be at risk of technological unemployment have, through our contributions to the digital commons, also been key contributors to the advances that will cause such disruption. However, this is not reflected through any equity in or control over the technology.
A national data trust for training data
To help distribute the value of AI development, our recent work proposes a national data trust for AI training data. Data trusts are a way of aggregating the bargaining power of individual data holders so they gain some benefit from the use of what they have contributed. For example, UK Biobank manages health data on behalf of more than 500,000 participants, enabling essential health research while also safeguarding their privacy.
Our proposed national data trust would collect and hold data from the digital commons and control access to this data for commercial AI developers, possibly with statutory or regulatory backing. The trust would negotiate with commercial AI developers for royalties, which could be distributed as a “Digital Commons Dividend” back to the broader public as well as specific sectors most disrupted by the transition to an AI-based economy. We propose limiting the gating of data only to commercial AI developers because of clear negative externalities from their work; we are not targeting other uses of the commons such as indexing for search engines that route users to original content.
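One way to picture the "Digital Commons Dividend" is as a simple split of a royalty pool. The sketch below is illustrative only: the proposal does not specify a formula, so the 30% sector share, the sector weights, and all names (`split_royalties`, `DividendSplit`) are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class DividendSplit:
    public_per_person: float        # flat dividend paid to each member of the public
    sector_grants: dict[str, float]  # extra funds for the most disrupted sectors


def split_royalties(
    royalties: float,
    population: int,
    sector_weights: dict[str, float],
    sector_share: float = 0.3,  # assumed share reserved for disrupted sectors
) -> DividendSplit:
    """Split a royalty pool into a flat public dividend and weighted sector grants."""
    if royalties < 0:
        raise ValueError("royalties must be non-negative")
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0.0 <= sector_share <= 1.0:
        raise ValueError("sector_share must be between 0 and 1")

    total_weight = sum(sector_weights.values())
    # With no weighted sectors, the whole pool goes to the public dividend.
    sector_pool = royalties * sector_share if total_weight > 0 else 0.0
    public_pool = royalties - sector_pool

    grants = {
        name: sector_pool * weight / total_weight
        for name, weight in sector_weights.items()
    } if total_weight > 0 else {}
    return DividendSplit(public_pool / population, grants)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the mechanics.
    result = split_royalties(1000.0, 100, {"artists": 1, "software": 3}, sector_share=0.5)
    print(result)
```

How the weights would be set, and who would count as a recipient, are policy questions that the code deliberately leaves open.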
To be able to do this, the data trust would collect training data from the digital commons and create a licensing and verification regime. Obtaining data involves a number of tasks. The data trust would scrape the internet and would also be entrusted with user data currently held by large tech companies like Google and Meta. Data from national sources, such as the BBC and the British Library, would also be entrusted, given the public nature of such data and their usefulness for building models that are tailored to the UK.
The licensing and verification regimes would rely on technical methods to help to ensure that those who claim to have complied with the data trust have only used the data trust’s data, and no other data, in training their models. We propose a number of technical methods in our work while noting that there remain some open problems on how to make the verification work.
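To give a flavour of what one small piece of such a regime might look like, the sketch below compares the content fingerprints of a developer's declared training items against the fingerprints of data the trust has licensed. This is an assumption made for illustration and not a method taken from our work; the function names are hypothetical. It also checks only what a developer declares. Showing that no undeclared data was used is exactly the kind of open problem noted above.

```python
import hashlib
from typing import Iterable


def fingerprint(item: bytes) -> str:
    """Return a SHA-256 content fingerprint for one data item."""
    return hashlib.sha256(item).hexdigest()


def unlicensed_items(
    declared_training_items: Iterable[bytes],
    licensed_fingerprints: set[str],
) -> list[bytes]:
    """Return the declared items whose fingerprints the trust has not licensed."""
    return [
        item
        for item in declared_training_items
        if fingerprint(item) not in licensed_fingerprints
    ]


if __name__ == "__main__":
    # Hypothetical data: the trust has licensed two items.
    licensed = {fingerprint(b"licensed essay"), fingerprint(b"licensed image bytes")}
    declared = [b"licensed essay", b"scraped forum post"]
    print(unlicensed_items(declared, licensed))
```

Exact hashing also breaks as soon as an item is cropped, reformatted, or paraphrased, so a real regime would need more robust techniques than this.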
How might this trust work? Consider a text-to-image model such as DALL-E 2, which generates high-quality images, such as an oil painting or a futuristic cityscape, from a user’s simple text input. DALL-E 2 is trained on the collective artistic output of humanity, including the work of living artists who breathe life into our cultural world. Yet at no point do the profits of DALL-E 2 help pay them. Moreover, those wanting to use images are likely to opt for a lower-cost, faster AI tool to generate outputs of comparable or even superior quality.
Our idealised data trust would gate access to high-quality images used to train models like DALL-E 2. If a company wanted to train such a model and deploy it commercially, it would have to access the data through the trust. The data trust would negotiate for a portion of model revenues to direct to a fund for artists. This fund could operate similarly to how arts councils operate, funding grants for individual artists and larger events to broadly promote the arts, or like existing royalty collection schemes for authors.
To take another example, consider a language model such as GPT-4 that can write high-quality code, provide clear documentation for it, and debug code. To endow GPT-4 and future models with such abilities, OpenAI hires contractors to generate code and documentation to serve as training data. While OpenAI pays for this data rather than obtaining it freely, use of the resulting AI system could stifle the digital commons by decreasing traffic to key discussion forums like Stack Overflow, not to mention accelerating unemployment in the software business.
Our data trust should ideally aim to tackle such crowdsourced data as well. Perhaps it could act as an intermediary between contractors and contracting platforms such as UpWork and Surge. Potentially through regulatory provisions, the data trust could give contractors the option to entrust their generated data. The goal is for the data trust to negotiate on behalf of all sectors likely to be targeted for automation as increasingly capable AI systems are built.
Our proposal to reclaim the commons has an important historical parallel. Henry George advocated for a land value tax as a way to fairly redistribute wealth, arguing that to “take land values for public purposes is not really to impose a tax, but to take for public purposes a value created by the community.” The digital commons is both our natural heritage and common creation. To claim a fair share of its yield is not without precedent.
The broader AI governance landscape
Although we have focused here on governing data to influence the trajectory of AI development, another prominent policy issue is physical compute resources, given the massive amount of computational power necessary to train frontier AI systems. Indeed, empirical analysis suggests that the compute required to build more powerful AI systems increases exponentially, meaning that governments have an opportunity to intervene in the provision of this important resource to ensure that it is used for the public good. We see our proposal as complementary to efforts in these directions.
Although the disruption caused by the dramatic recent progress in AI is unavoidable, proactive policy measures – like our proposed national data trust – can help to facilitate the transition to new economic structures without hindering innovation.
Image generated with Midjourney
The views and opinions expressed in this post are those of the author(s) and not necessarily those of the Bennett Institute for Public Policy.