TIL a huge number of AI training images came from one site without clear permission

Read a report from the University of Amsterdam. They found LAION-5B, a massive dataset, used over 5 billion images from Common Crawl. Many were personal photos from Flickr, taken without asking the photographers. Makes you wonder who really owns the data behind these models. Has anyone else seen stats on where their training data actually comes from?

4 comments

taylor.reese 3mo ago

That Common Crawl scrape is a huge mess. I had to check my own portfolio after reading about the Getty case reed.skyler mentioned. Found a few of my old Flickr shots in a dataset audit tool. The best you can do right now is run your URLs through haveibeentrained.com to see what's been scraped.

reed.skyler 3mo ago

Yeah, I saw a piece about how Getty Images is suing over this exact thing!

william864 3mo ago

Wait, they used billions of photos without even asking?

dakota415 1mo ago

@william864 The Common Crawl dataset has around 2.6 billion images in it, and most people had no clue their stuff was in there. I checked my own Flickr account after that audit tool went around and found like 40 of my photos in the training set. It feels creepy that companies just take whatever they want from the internet without asking. I think the worst part is they built billion-dollar tools on our work and we get nothing for it. Even if you opt out now, the damage is already done since the models already trained on your data. It's a raw deal for anyone who ever posted a photo online.