Published on 7 March 2023

What do large language models mean for productivity?

How much promise do large language models like ChatGPT hold for increasing whole-economy productivity? Lucy Hampton explores which sectors are benefitting most from their use, and whether they'll ever substitute for human labour.

Just two months after its initial release, ChatGPT has over 100 million users and is estimated to produce, every 14 days, a volume of text equal to all the printed works of humanity.

The technology underlying ChatGPT is not new, however. It is based on GPT-3.5, which is a recent version of a large language model (LLM): a type of artificial intelligence trained on a large corpus of text that works by repeatedly predicting the next word in a sequence.
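To make the "next-word prediction" idea concrete, here is a deliberately tiny sketch (not from the original post, and nothing like GPT's internals): a bigram model that counts which word follows which in a toy corpus and predicts the most frequent follower. Real LLMs learn vastly richer statistics with neural networks, but the underlying task is the same.

```python
from collections import Counter, defaultdict

# Toy corpus; an LLM is trained on hundreds of billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count, for each word, the words that follow it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat": it follows "the" in 2 of 4 cases
```

Generating a whole passage is then just repeated prediction: feed the model's own output back in as the next prompt, one word at a time.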

Amidst the recent hype, LLMs are being integrated into a variety of projects. In February 2023, Microsoft launched a new LLM-powered chatbot that it claims will ‘reinvent’ Bing search, and announced plans to incorporate LLMs into Outlook, Word and PowerPoint. GitHub’s Copilot uses Codex, a descendant of GPT-3, to convert natural language into code. The potential uses of LLMs are numerous, and include translation, copywriting, genetic sequencing, automating customer service, co-authoring books, scoping academic literature, and providing assistance to lawyers.

What does this mean for productivity in specific sectors?

As well as the obvious benefits of automated code generation to the ICT (Information, Communications & Technology) sector, and the software sub-sector in particular, the advancement of LLMs may drive productivity improvements in creativity- or interaction-intensive sectors such as customer service, education, advertising and journalism.

Here LLMs are particularly useful as assistants for brainstorming ideas, summarising and evaluating arguments, generating titles, and proofreading. The practice of using LLMs to assist in creative or academic work is, of course, neither uncontroversial nor ethically straightforward. But despite these concerns, even a more conservative application of LLMs is likely to create productivity benefits in these sectors, as the automation of time-consuming ‘micro tasks’ frees up labour that can be allocated elsewhere.

Sectoral vs whole-economy effects

Importantly, however, LLMs may have large impacts in certain sectors but fail to significantly impact measured whole-economy productivity. Even taking large LLM-driven productivity improvements in certain sectors – such as ICT or customer service – as given, the key question for the whole economy is whether these sectors experience rising or falling shares of total expenditures. The latter could occur if the price declines caused by productivity improvements in these sectors do not lead to sufficient substitution in consumption towards these sectors. In this case, the stagnating sectors make up a larger and larger proportion of the economy, a phenomenon known as Baumol stagnation.
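The Baumol mechanism can be made concrete with a two-sector toy calculation (the numbers below are illustrative assumptions, not estimates from the post): one sector's productivity grows 5% a year and its price falls accordingly, the other stagnates, and consumers keep buying the same real quantities of each, i.e. there is no substitution towards the cheaper sector.

```python
# Two-sector toy of Baumol stagnation. Illustrative parameters only.
years = 50
growth = {"dynamic": 1.05, "stagnant": 1.00}   # annual productivity growth
price = {"dynamic": 1.0, "stagnant": 1.0}
quantity = {"dynamic": 1.0, "stagnant": 1.0}   # fixed real consumption

# Productivity growth in the dynamic sector lowers its relative price.
for _ in range(years):
    price["dynamic"] /= growth["dynamic"]

spend = {s: price[s] * quantity[s] for s in price}
total = sum(spend.values())
stagnant_share = spend["stagnant"] / total

# Aggregate productivity growth is (approximately) the expenditure-
# share-weighted average of sectoral growth rates.
agg_growth = sum(spend[s] / total * (growth[s] - 1) for s in spend)

print(f"stagnant sector's share of spending: {stagnant_share:.0%}")
print(f"aggregate productivity growth: {agg_growth:.2%}")
```

After 50 years the stagnant sector absorbs over 90% of spending, and aggregate productivity growth is dragged down towards zero even though the dynamic sector never stops improving: exactly the whole-economy risk described above.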

A similar problem may apply to the supply-side. Innovation means that LLMs become cheaper relative to conventional inputs like labour and energy. However, if they are not sufficiently substitutable with these inputs, their impact on productivity growth may be limited as producers cannot simply switch to using LLMs. The classic example of this is performing arts: while there have been many innovations over the past century (such as better stage lighting), productivity is constrained by the fact that human inputs are essential and relatively hard to improve.
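The supply-side point can be sketched with a standard CES (constant elasticity of substitution) production function, a common way economists model how substitutable two inputs are. All parameter values below are illustrative assumptions: when the inputs are near-substitutes, a tenfold improvement in the AI input raises output almost proportionately; when they are strong complements, as human inputs are in the performing-arts example, the same improvement barely moves output.

```python
# CES production: Y = (a * ai**rho + (1 - a) * labour**rho) ** (1 / rho),
# with elasticity of substitution sigma = 1 / (1 - rho).
# rho close to 1 -> near-perfect substitutes; large negative rho -> complements.

def ces_output(ai, labour, rho, a=0.5):
    return (a * ai**rho + (1 - a) * labour**rho) ** (1 / rho)

labour = 1.0
for rho, label in [(0.9, "near-substitutes"), (-4.0, "strong complements")]:
    base = ces_output(1.0, labour, rho)
    boosted = ces_output(10.0, labour, rho)  # AI input becomes 10x as effective
    print(f"{label:18s}: output rises {boosted / base:.2f}x")
```

With rho = 0.9 the tenfold AI improvement multiplies output several times over; with rho = -4 it yields well under a 1.3x gain, because output is bottlenecked by the fixed labour input. This is the sense in which low substitutability caps the productivity payoff from cheaper LLMs.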

In sum, the whole-economy impact of LLMs depends on two things:

  • How substitutable are LLMs and conventional inputs?
  • If LLMs increase productivity in certain sectors producing specific outputs, how substitutable are these outputs with those of other sectors?

How substitutable are LLMs and conventional inputs?

Research from 2021 indicates the existence of Baumol stagnation effects. However, the underlying degree of substitutability between conventional and AI inputs is changing rapidly. One reason for this is that many of LLMs’ capabilities are ‘emergent’, meaning that they are not at all present in small models but manifest suddenly in larger models.

An example of this is arithmetic: GPT-3 has close to 0% accuracy on basic arithmetic problems until the size of the model reaches approximately 13 billion parameters[1]. After this point, accuracy jumps to around 35%. Many more capabilities may emerge with better prompt engineering. For example, asking ChatGPT to ‘reason step-by-step’ significantly improves performance on reasoning questions. An implication of this is that it is very hard to predict the future capabilities of LLMs.

However, it is unclear how long scaling can continue to generate improvements, given that the cost of training ever-larger models may soon become prohibitive.

And not all capabilities may be emergent. In particular, it is unclear whether ‘hallucination’ – where LLMs produce plausible-sounding but inaccurate responses – can be solved simply by scaling up, or whether a more fundamental shift in approach is needed. ChatGPT, for example, often makes up academic articles, produces incorrect biographical information and fails to explain basic technical concepts.

Other LLM-powered chatbots seem to do better than ChatGPT on this front. The aforementioned Bing chatbot is capable of searching the web and citing its sources, for example. Crucially, however, this shifts the burden of fact-checking onto the user; it is not the LLMs themselves that can currently verify their own output. LLMs are also well known to reproduce the biases present in their training data.

Hallucination and bias are major barriers to the automation of tasks by LLMs in sectors where truth is especially important or where questions are likely to be niche – i.e., for which LLMs have less training data. These include research, education and high-stakes environments like healthcare and law. For tasks in these sectors, it seems that LLMs will (and should) augment human productivity rather than substitute for it – at least in the short term.

Looking forward

Despite the difficulty of making predictions, it is important to keep track of progress in this area. For policymakers interested in whole-economy effects, it matters where LLMs are being used as well as how much. In particular, which sectors are LLMs being used in, and what is happening to their share of total expenditures? What types of production are benefitting from the use of LLMs – goods, services, or the production of ideas themselves? It is perhaps in this last role, as ‘inventions of a method of invention’, that LLMs show the most promise for increasing whole-economy productivity, although it is far from clear that LLMs in their current state can substitute for human thought.

[1] Parameters are variables used by machine learning models to learn from data. The number of parameters correlates closely with the amount of computation used to train a model.

The views and opinions expressed in this post are those of the author(s) and not necessarily those of the Bennett Institute for Public Policy.


Lucy Hampton

Research Assistant

Lucy is a Research Assistant working on the Sectoral Productivity project, which investigates the drivers of productivity in different sectors. Her research interests include the economic impacts of artificial intelligence, research...
