7
TIL a huge number of AI training images came from one site without clear permission
Read a report from the University of Amsterdam. They found LAION-5B, a massive dataset, used over 5 billion images from Common Crawl. Many were personal photos from Flickr, taken without asking the photographers. Makes you wonder who really owns the data behind these models. Has anyone else seen stats on where their training data actually comes from?
4 comments
taylor.reese 2mo ago
That Common Crawl scrape is a huge mess. I had to check my own portfolio after reading about the Getty case reed.skyler mentioned. Found a few of my old Flickr shots in a dataset audit tool. The best you can do right now is run your URLs through haveibeentrained.com to see what's been scraped.
3
william864 2mo ago
Wait, they used billions of photos without even asking?
1
dakota4152d ago
The Common Crawl dataset has around 2.6 billion images in it, @william864, and most people had no clue their stuff was in there. I checked my own Flickr account after that audit tool went around and found like 40 of my photos in the training set. It feels creepy that companies just take whatever they want from the internet without asking. I think the worst part is they built billion-dollar tools on our work and we get nothing for it. Even if you opt out now, the damage is already done, since the models have already trained on your data. It's a raw deal for anyone who ever posted a photo online.
4