Peter Mawhorter (pmawhort@wellesley.edu)

August 28th, 2025

Large Language Models and Labor

This post is one of my advice & arguments pages about the harms and hazards of the AI Hype Movement.

When we use an “AI” chatbot or image generator, the company giving us access usually wants to pretend that we’re just using computing resources: that the system we’re interacting with is simply software, like a word processor or a web browser. Of course, all programs are built by humans, so when you use a web browser you’re relying to some degree on the team of developers who delivered that software to you; but their work is overwhelmingly consensual and often well-compensated. With large language models (LLMs) and other neural network systems, the story is quite a bit more complicated: you’re actually interacting with a product that has some pretty gross labor abuses baked into each interaction.

Contributors to a Large Model

Who does the work that makes a large generative AI model possible?

To build a GenAI model, you need a few ingredients:

  1. Lots of raw data, whether written words or code or images. This data must have been produced by humans for the model to do anything very useful.
  2. Lots of filtered data. The raw data needs to be filtered by humans to remove the scummy parts before being fed into the training process; otherwise, the end result is dangerously unusable.
  3. Data centers with computers to store and process the data to train the model.
  4. Data centers with computers to run the model whenever someone makes a query and deliver a result.

Ingredients 3 and 4 have a lot of local negative externalities but, as far as I’m aware, fewer labor issues. However, ingredients 1 and 2 are produced in horribly abusive ways:

  1. The labor of the humans who created the training data is being exploited, often explicitly against their will. The fact that the AI Hype Movement seeks to sell AI products as tools that will do these same people’s jobs in their place makes this extra nefarious. This is clearest in artistic domains, where an artist’s or writer’s unique expressions are being repackaged and resold by an AI company, often quite directly. Every single “cool aesthetic” an AI image generator can produce in response to a prompt was learned from an artist somewhere, often a living artist who shared their art under an explicit requirement that any re-use be accompanied by attribution.
  2. The humans who filter the scum from the raw data are compensated very poorly, and they end up accumulating psychological damage from constant exposure to the worst the internet has to offer. Without their exploitation, every user of generative AI systems would occasionally be exposed to things like images of child abuse or graphic descriptions of torture. The (imperfect) safety with which users can interact with AI systems is bought by dumping all of those harms onto a few people. This work is also not really a one-time thing: more filtering is needed both to train newer models and to add newer information to old models. Without continuous investment in filtering, models would soon become out-of-date, and no new models could be developed.

In both cases, it’s entirely possible to imagine acquiring these services via fair compensation; it would just be much more expensive than current practice. The argument here isn’t that one should avoid all use of AI systems (though there are many other arguments for that); it’s an argument that we should avoid exploitatively-produced AI systems, coupled with the observation that all of the big, free, easy-to-use models available right now are produced using intensely exploitative methods.