c/ai-innovations • wood.eric • 1d ago

My fine-tune of a Llama 2 model crashed after 18 hours

I was running it on a rented cloud server in Virginia, trying to get it to write better product descriptions. The whole thing just stopped and the log file was full of memory errors. I had to go back, cut my training data by half, and start over, which finally worked. Anyone know a good way to guess the right data size before you start a long run like that?
3 comments

the_alice • 1d ago
Wait, you ran it for 18 hours before it crashed?
thomas_torres
Check the memory usage on your server before you start the training run. Run a smaller test with a sample of your data to see how much memory it eats up, then scale that by your full dataset and batch size to get a rough idea. Also, make sure you're using gradient checkpointing and mixed precision training; those can save a ton of memory. It's a pain to guess, but a quick test batch can save you days of wasted time.
1
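For anyone who wants numbers before even spinning up a test batch, here's a rough back-of-envelope sketch of the kind of estimate thomas_torres is describing. Everything here is an assumption, not a measured fact: it assumes an Adam-style optimizer with fp32 master weights, fp16 mixed precision, and a very crude activation term; the function name `estimate_training_gib` is made up for illustration.

```python
def estimate_training_gib(n_params, batch_tokens, hidden_size, n_layers,
                          bytes_per_param=2):
    """Very rough GPU memory estimate (in GiB) for one training step.

    Assumes fp16 weights/grads with fp32 master copies and Adam-style
    optimizer state. Real usage varies a lot with framework overhead,
    attention implementation, and fragmentation -- treat this as a
    lower-bound sanity check, not a guarantee.
    """
    # fp16 weights (bytes_per_param) plus an fp32 master copy (4 bytes each)
    weights = n_params * (bytes_per_param + 4)
    # Adam keeps two fp32 moment buffers per parameter (8 bytes)
    optimizer = n_params * 8
    # gradients in fp16
    grads = n_params * bytes_per_param
    # crude activation term: ~2 bytes per token per hidden unit per layer;
    # gradient checkpointing shrinks this substantially in practice
    activations = batch_tokens * hidden_size * n_layers * 2
    return (weights + optimizer + grads + activations) / 2**30

# Toy 7B-ish configuration: 8 sequences of 4096 tokens, 32 layers
print(estimate_training_gib(7_000_000_000, 4096 * 8, 4096, 32))  # ≈ 112 GiB
```

The point of the test-batch run is to replace the activation guess above with a measured number; the parameter and optimizer terms are fixed per model, so only the activation term scales with batch and sequence length, which is also why memory doesn't scale linearly with dataset size alone.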
wadew51 • 1d ago
Come on, it's never that clean. My "quick test" always ends up taking half a day to set up, and the memory use never scales in a straight line anyway. You just end up guessing regardless.