Running AI/ML workloads on NAISS systems
Time: 9:00 - 12:00
This session will introduce you to what is different about doing machine learning on a compute cluster, covering performance considerations and best practices.
Code examples will focus on PyTorch and TensorFlow as frameworks, but other utilities will also be covered.
The main relevant NAISS resource is Alvis for this first session; later sessions will use the Arrhenius GPU partition once it is in place.
Note
Click Home (at the top) to see the date.
Prerequisites
To be able to follow along you should have a basic understanding of machine learning and know how to:
- Connect to an HPC cluster
- Use software on an HPC cluster with modules and/or Apptainer containers
- Run jobs on a cluster
- Program in Python
Topics
Note
These topics are currently preliminary and may be subject to change.
- How you run on GPUs with PyTorch and TensorFlow
- Floating point precision and GPU performance
- Performance considerations for data loading on parallel filesystems
- Profiling your ML workload
- Multi-GPU parallelism
    - Conceptual overview
    - Data Parallelism
    - Fully Sharded Data Parallel
    - Basic LLM inference with model parallelism
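As a taste of the first topic, the sketch below shows the usual PyTorch pattern for running on a GPU: select a device, then move the model and each batch of data to it. This is an illustrative example, not course material; the model and tensor shapes are placeholders.

```python
import torch

# Prefer a CUDA GPU if one is available; otherwise fall back to the CPU.
# On a cluster, CUDA is typically only visible inside a GPU job allocation.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)   # model parameters now live on `device`
batch = torch.randn(8, 4, device=device)   # allocate the input on the same device

output = model(batch)                      # computation runs on `device`
print(output.device.type)                  # "cuda" on a GPU node, "cpu" otherwise
```

The same structure carries over to training loops: every tensor that participates in a computation must be on the same device as the model, which is why data loading and host-to-device transfer become performance concerns on a cluster.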