Large Language Models and Scraping

This post is one of my advice & arguments pages about the harms and hazards of the AI Hype Movement.

One of the reasons not to use large language models hosted by large tech companies (this does NOT apply to large language models in general) is that these companies are currently scraping the web in an unethical manner.

A Metaphor

Imagine the following scenario: You’re a professor, working with a student on an independent study project. Your student is building a web scraper as part of this project to feed data to an AI model. The first version they show you is just the simple recursive link-following solution, ignoring robots.txt and grabbing everything it can find with no sleeps between requests. They happen to be scraping a wiki, and they’re getting talk pages, edit histories, etc. The scraper is set to re-run every hour, to pick up any changes that might have occurred.
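To make the scenario concrete, here is roughly what that first version might look like. The wiki URL is a hypothetical placeholder, and this is just an illustration of the behavior described above, not anyone's actual code:

```python
# Naive recursive link-following: no robots.txt check, no filtering, no delay.
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

BASE = "https://wiki.example.org"   # hypothetical wiki host
seen = set()

def scrape(url: str) -> None:
    if url in seen:
        return
    seen.add(url)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    # ... store `html` as training data ...
    for link in re.findall(r'href="([^"]+)"', html):
        next_url = urljoin(url, link)
        if urlparse(next_url).netloc == urlparse(BASE).netloc:
            scrape(next_url)   # follows everything: talk pages, edit histories, diffs

scrape(BASE)                   # the story has this re-run every hour
```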

So you give them feedback:

  1. You’ll need to follow robots.txt if you don’t want to get banned. Also, it’s good manners to respect the conditions that hosts have for providing you with data.
  2. You probably want to be strategic about which pages you scrape. Add some rules so you don’t get a bunch of edit-history pages and just scrape the main content. Actually, following robots.txt will likely solve this problem for you anyway.
  3. You don’t need to re-run it every hour, and you should wait a bit between requests! Once a week is probably sufficient if you really want to keep up with changes, or at the VERY most once or twice a day. Consider the web traffic impact that your scraper will have and be moderate about it. Again, you’ll probably just get banned if you scrape too fast. (A sketch of what these fixes might look like follows this list.)
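
Here is a minimal sketch of a better-behaved version of the scraper. The wiki host, user agent, crawl delay, and skip patterns are all hypothetical placeholders, and real wiki installs vary, but the shape of the fixes is the same: check robots.txt before every request, skip edit-history/diff pages, and pause between fetches.

```python
import time
import urllib.robotparser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

BASE = "https://wiki.example.org"   # hypothetical wiki host
USER_AGENT = "independent-study-scraper/0.1 (contact: student@example.edu)"
CRAWL_DELAY = 5                     # seconds between requests; be moderate
SKIP_PATTERNS = ("action=history", "action=edit", "diff=", "Special:")

robots = urllib.robotparser.RobotFileParser(urljoin(BASE, "/robots.txt"))
robots.read()

def allowed(url: str) -> bool:
    """Respect robots.txt and skip expensive edit-history/diff pages."""
    if not robots.can_fetch(USER_AGENT, url):
        return False
    return not any(p in url for p in SKIP_PATTERNS)

def fetch(url: str) -> str:
    """Fetch one page, identify ourselves, and pause before the next request."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    time.sleep(CRAWL_DELAY)         # don't hammer the host
    return body

def crawl(start_path: str, limit: int = 50) -> dict:
    """Breadth-first crawl of main-content pages only, up to `limit` pages."""
    queue, seen, pages = [urljoin(BASE, start_path)], set(), {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen or not allowed(url):
            continue
        seen.add(url)
        pages[url] = fetch(url)
        # Link extraction elided; enqueue only same-host links that pass allowed().
    return pages
```

Scheduling is then a matter of running crawl() once a week (or at most once or twice a day) from something like cron, rather than hourly.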

The student turns around and says:

  1. It’s okay, my dad works for Google and can give me unlimited IPs to scrape from, so my scraper won’t be banned. I also have basically unlimited bandwidth and storage, so there’s no reason not to scrape a lot.
  2. It’s simpler to write the code this way and just scrape everything. Maybe I’ll write a filter to get rid of edit-history pages later.
  3. I’m going to go ahead and use this version without any changes.

At this point, you say:

  1. Whoa, you’re missing the point. Sure, you might be able to evade IP bans, but the reason to follow robots.txt is not just to avoid getting banned: the host pays real costs to serve pages and has some notion of what they’d like to keep private, and you should respect the host’s rules because it’s ethical to do so.
  2. A poorly-behaving web scraper is not acceptable to me. It reflects badly on my supervision of this project, and on the educational standards of our whole department and school. Plus, fixing these issues is not that difficult and will benefit the quality of the data you scrape.

The student says, fine, I’ll just do this project on my own without your supervision. I’ll withdraw from the independent study and build this thing myself. I don’t care about your stupid ethics, and I should be able to scrape any page that a host will serve, including evading IP bans.

You’ve done what you can, and the student goes ahead and builds their project, which turns out to have some cool aspects but also some pretty severe limitations. They hype it up to their friends as the next big thing and generate a lot of excitement. Six months later, the student is close to graduating, and they ask you for a letter of recommendation. You tell them they should ask someone else: because of how they ignored your advice, you can’t really write them a good letter. When they say “but check out my project! Once you see how great it is, you’ll be convinced the scraping was worth it!” you reply: “You could have done the same thing, or even better, with a more ethical scraper. It doesn’t matter how cool your application is; as I said when you chose to continue on your own, I care about how poor scraping ethics reflects on me and on this school.”

Discussion

The professor in the scenario above behaves approximately how I would behave in this situation. In particular, I don’t think it’s okay to ignore ethical concerns just because you can, and I don’t want to be associated with people who work that way, both for the sake of my own reputation and because I actually care about the harms caused. Sadly, this imagined scenario is basically real life, if you replace the headstrong student with a multi-billion-dollar tech company. ALL of the big LLM vendors (e.g., Meta, Amazon, Google, OpenAI) are scraping this way to feed their training datasets, and they’re having a real impact on the resources of small websites: especially wikis, git repositories, and other sites with pages that require the server to do non-trivial work to assemble a response, like edit histories or diffs. A few examples:

Because of the real harms being done here, using LLM services from the big players that are doing this scraping is bad. It’s not, after all, necessary to ignore robots.txt, scrape a site four times a day, or evade IP bans. The big players could have chosen to source their data more ethically, but for profit and/or hype reasons decided not to. Here’s an example of a 500-billion-word dataset curated from public domain materials. The ethics here are a good reason not to use ChatGPT, Gemini, Meta AI, Copilot, or Claude. These concerns don’t necessarily apply to every AI model or LLM, and that’s what’s most frustrating here: the big players with popular free models are behaving badly, when they’re the ones with the resources & reach to avoid these problems. But it is revealing of their attitudes: they don’t care about ethics, full stop.