Running AI/ML workloads on NAISS systems

Time: 9:00 - 12:00

This session will introduce you to what is different about doing machine learning on a compute cluster, with a focus on performance considerations and best practices.

Code examples will focus on PyTorch and TensorFlow as frameworks, but other utilities will also be covered.

The main relevant NAISS resource is Alvis for this first session, and later the Arrhenius GPU partition once it is in place.


Prerequisites

To be able to follow along you should have a basic understanding of machine learning and know how to:

Topics

Note

These topics are currently preliminary and may be subject to change.

  • How you run on GPUs with PyTorch and TensorFlow
  • Floating point precision and GPU performance
  • Performance considerations for data loading on parallel filesystems
  • Profiling your ML workload
  • Multi-GPU parallelism
    • Conceptual overview
    • Data Parallelism
    • Fully Sharded Data Parallel
    • Basic LLM inference with model parallelism
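As a taste of the first topic, running on a GPU in PyTorch typically amounts to selecting a device and moving the model and data onto it. The sketch below is a minimal, hypothetical example (the layer sizes are arbitrary); it falls back to the CPU when no GPU is allocated, which is handy when testing outside a GPU job.

```python
import torch

# Use the GPU allocated to the job if one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model parameters to the chosen device.
model = torch.nn.Linear(4, 2).to(device)

# Create the input directly on the same device to avoid an extra host-to-device copy.
x = torch.randn(8, 4, device=device)

y = model(x)
print(tuple(y.shape))  # (8, 2)
```

The same pattern generalizes to full training loops: every tensor that participates in a computation must live on the same device as the model.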