c/ai-innovations • wood.eric • 1d ago

My fine-tune of a Llama 2 model crashed after 18 hours

I was running it on a rented cloud server in Virginia, trying to get it to write better product descriptions. The whole thing just stopped and the log file was full of memory errors. I had to go back, cut my training data by half, and start over, which finally worked. Anyone know a good way to guess the right data size before you start a long run like that?
3 comments

the_alice • 1d ago
Wait, you ran it for 18 hours before it crashed?
thomas_torres
Check the memory usage on your server before you start the training run. Run a smaller test with a sample of your data to see how much memory it eats up, then scale that by your full dataset and batch size to get a rough idea. Also, make sure you're using gradient checkpointing and mixed precision training; those can save a ton of memory. It's a pain to guess, but a quick test batch can save you days of wasted time.
1
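For anyone who wants numbers before even spinning up a test batch, here's a rough back-of-envelope sketch of the kind of estimate thomas_torres is describing. Everything here is an assumption, not a measured fact: it assumes an Adam-style optimizer with fp32 master weights, fp16 mixed precision, and a very crude activation term; the function name `estimate_training_gib` is made up for illustration.

```python
def estimate_training_gib(n_params, batch_tokens, hidden_size, n_layers,
                          bytes_per_param=2):
    """Very rough GPU memory estimate (in GiB) for one training step.

    Assumes fp16 weights/grads with fp32 master copies and Adam-style
    optimizer state. Real usage varies a lot with framework overhead,
    attention implementation, and fragmentation -- treat this as a
    lower-bound sanity check, not a guarantee.
    """
    # fp16 weights (bytes_per_param) plus an fp32 master copy (4 bytes each)
    weights = n_params * (bytes_per_param + 4)
    # Adam keeps two fp32 moment buffers per parameter (8 bytes)
    optimizer = n_params * 8
    # gradients in fp16
    grads = n_params * bytes_per_param
    # crude activation term: ~2 bytes per token per hidden unit per layer;
    # gradient checkpointing shrinks this substantially in practice
    activations = batch_tokens * hidden_size * n_layers * 2
    return (weights + optimizer + grads + activations) / 2**30

# Toy 7B-ish configuration: 8 sequences of 4096 tokens, 32 layers
print(estimate_training_gib(7_000_000_000, 4096 * 8, 4096, 32))  # ≈ 112 GiB
```

The point of the test-batch run is to replace the activation guess above with a measured number; the parameter and optimizer terms are fixed per model, so only the activation term scales with batch and sequence length, which is also why memory doesn't scale linearly with dataset size alone.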
wadew51 • 1d ago
Come on, it's never that clean. My "quick test" always ends up taking half a day to set up, and the memory use never scales in a straight line anyway. You just end up guessing regardless.