TIL a $300 lesson about training a custom AI model on the wrong data

I spent about three weeks trying to build a simple image classifier for my garden project, using a dataset I scraped from random websites. The model kept giving me weird results, like calling a tomato a 'small red car'. Turns out the dataset had a bunch of mislabeled junk mixed in. I wasted around $300 on cloud compute credits running those bad training cycles. I should have cleaned and verified my data first, or just used a smaller, verified set. Has anyone else gotten burned by a bad dataset, and how do you vet yours now?

4 comments

spencer_owens58 3mo ago

I used to skip data cleaning, but a mistake like that would totally change my mind.

dakotab93 3mo ago

Man, that's rough. I read a blog post from a guy who trained a model to spot manufacturing defects, but his training photos had a timestamp in the corner. The AI just learned to look for that timestamp, not the actual cracks. It's crazy how it picks up on the wrong stuff. I'm paranoid about my data now and manually check a random sample before any training run.
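For anyone who wants to do the same, here's roughly what that random-sample check can look like. It's a minimal sketch, not a full pipeline: it assumes you already have a list of (path, label) pairs, and `sample_for_review`, `per_class`, and `seed` are just names made up for this example.

```python
import random
from collections import defaultdict

def sample_for_review(labeled_paths, per_class=20, seed=0):
    """Pick up to `per_class` random items from each label for a manual look."""
    # Group file paths by their (possibly wrong) label.
    by_label = defaultdict(list)
    for path, label in labeled_paths:
        by_label[label].append(path)

    # A seeded generator makes the same sample come back on a re-run.
    rng = random.Random(seed)
    picks = {}
    for label, paths in sorted(by_label.items()):
        k = min(per_class, len(paths))  # small classes: take everything
        picks[label] = rng.sample(paths, k)
    return picks

if __name__ == "__main__":
    # Hypothetical file names, only to show the call.
    data = [(f"img_{i}.jpg", "tomato" if i % 2 == 0 else "pepper") for i in range(100)]
    for label, paths in sample_for_review(data, per_class=5).items():
        print(label, paths)
```

Sampling per label instead of across the whole set means a rare class still gets looked at, which is usually where the mislabeled stuff hides.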

susanb34 3mo ago Most Upvoted

So the AI basically became a super expensive timestamp detector? Did the guy at least get a refund on all that compute time he wasted?

gibson.avery 2mo ago

The timestamp thing is close but not exactly right. In that blog post, the issue was actually that all the defective parts were photographed in the afternoon shift, so the lighting was warm and yellowish, not a timestamp in the corner. The AI learned to associate that warm light with defects instead of the actual cracks or chips in the material. So it was more of a lighting bias than a timestamp issue, but still a super common mistake people make with training data. It's wild how something as simple as different lighting conditions can totally throw off a model if you're not careful.
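One cheap way to catch this kind of lighting bias before training is to compare average color per label. This isn't from the blog post, just a rough sketch: it assumes each image is already loaded as a list of (r, g, b) tuples (for example via an image library), and "red minus blue" is only a crude stand-in for how warm the lighting is. The names `mean_rgb` and `warmth_by_label` are made up for this example.

```python
from statistics import mean

def mean_rgb(pixels):
    """Average (r, g, b) over a sequence of pixel tuples."""
    rs, gs, bs = zip(*pixels)
    return (mean(rs), mean(gs), mean(bs))

def warmth_by_label(images):
    """images: iterable of (pixels, label) pairs.

    Returns label -> average of (mean red - mean blue) across that label's images.
    """
    scores = {}
    for pixels, label in images:
        r, _, b = mean_rgb(pixels)
        scores.setdefault(label, []).append(r - b)
    return {label: mean(vals) for label, vals in scores.items()}

if __name__ == "__main__":
    # Tiny synthetic images: warm yellowish vs. neutral gray.
    warm = [(200, 180, 100)] * 4
    neutral = [(150, 150, 150)] * 4
    print(warmth_by_label([(warm, "defect"), (warm, "defect"), (neutral, "ok")]))
```

If one label comes out much warmer than the other, that's a hint the model could be learning the lighting instead of the cracks, and it's worth eyeballing those photos before paying for a training run.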