OLCF AI Training Series: AI for Science at Scale - Part 3, Jul 11, 2024
Introduction
Held on July 11, 2024, this session is the third part of the OLCF’s AI for Science at Scale training series and is open to NERSC users.
Training large deep learning models, including large language models, is resource-intensive and requires innovative parallelization and distribution strategies. In earlier workshops, we demonstrated how to train a deep learning model on Frontier in a distributed fashion across multiple GPUs at "small" and "intermediate" scales. For the final part of this training series, we scale up further and demonstrate how to fine-tune pre-trained networks at a larger scale on Frontier. Registered Frontier users will be able to use a system reservation to participate in the hands-on portion of the event.
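For a flavor of the distributed training covered in this series, below is a minimal sketch of a data-parallel fine-tuning loop using PyTorch's DistributedDataParallel. The model, data, and hyperparameters here are placeholders, not the session's actual materials; the real examples live in the GitHub repo linked under Training Materials.

```python
# Minimal sketch of distributed data-parallel fine-tuning with PyTorch DDP.
# Placeholder model and synthetic data stand in for the session's examples.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun (or an srun-based launcher on Frontier) sets RANK, LOCAL_RANK,
    # and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")  # on AMD GPUs, RCCL is exposed as "nccl"
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder layer standing in for a pre-trained network being fine-tuned.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Synthetic batch; a real run would shard a dataset with DistributedSampler.
    inputs = torch.randn(8, 1024, device=local_rank)
    targets = torch.randn(8, 1024, device=local_rank)

    for _ in range(10):  # a few illustrative steps
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single node this could be launched with, for example, `torchrun --nproc_per_node=8 sketch.py`; launch details for Frontier itself are covered in the hands-on portion.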
Although this training is intended for current Frontier users, all are welcome to register and view the presentation. No prior knowledge of Part 1 or 2 is necessary, so you are encouraged to register even if you did not attend previous sessions in this series.
How to Apply
Please visit the training event page for registration information.
| Time (PDT) | Topic | Speaker |
| --- | --- | --- |
| 10:00 am - 10:20 am | Introduction to distributed training of LLMs | Sajal Dash (OLCF, Analytics & AI Methods at Scale) |
| 10:20 am - 10:50 am | Finding the best training strategies for large models | Sajal Dash |
| 10:50 am - 11:20 am | Fine-tuning a pre-trained model | Sajal Dash |
| 11:30 am - 12:00 pm | Hands-on demo using Frontier | Sajal Dash |
Training Materials
- GitHub repo for the training series: https://github.com/olcf/ai-training-series
- Slides
- Recording