Running AI/ML workloads on NAISS systems

Scope

  • Will be covered:
    • Introduction to running Deep Learning workloads on the main NAISS AI/ML resource
  • Will not be covered:
    • A general introduction to machine learning
    • Running classical ML or GOFAI
    • General HPC intro

NAISS GPU resources overview

  1. Alvis (End-of-life 2026-06-30)
    • NVIDIA GPUs: 332 A40s, 318 A100s, 160 T4s, 44 V100s
    • Only for AI/ML
  2. Arrhenius (to be in operation Q2 2026)
    • 1528 NVIDIA GH200s
  3. Dardel (Probably end-of-life 2026)
    • 248 AMD MI250X
  4. Bianca
    • 20 NVIDIA A100s
    • Only for sensitive data

Alvis specifics

The cluster environment

Connecting - Firewall

  • Firewall limits connections to within SUNET
  • Use a VPN if needed

Cluster firewall only allows connection from within SUNET

Log-in nodes

  • alvis1.c3se.chalmers.se has 4 T4 GPUs for light testing and debugging
  • alvis2.c3se.chalmers.se is a dedicated data transfer node
  • The log-in nodes are restarted from time to time
  • Login nodes are shared resources for all users:
    • don’t run jobs here,
    • don’t use up too much memory,
    • preparing jobs and light testing/debugging is fine

SSH - Secure Shell

  • ssh <CID>@alvis1.c3se.chalmers.se, ssh <CID>@alvis2.c3se.chalmers.se
  • Gives command line access to do anything you could possibly need
  • If used frequently, you can set up a password-protected SSH key for convenience

Alvis Open OnDemand portal

  • https://alvis.c3se.chalmers.se
  • Browse files and see disk and file quota
  • Launch interactive apps on compute nodes
    • Desktop
    • Jupyter notebooks
    • MATLAB proxy
    • RStudio
    • VSCode
  • Launch apps on log-in nodes
    • Desktop
  • See our documentation for more

Remote desktop

  • RDP-based remote desktop solution on shared login nodes (use portal for heavier interactive jobs)
  • In-house-developed web client
  • Can also be accessed via any desktop client supporting RDP at alvis1.c3se.chalmers.se and alvis2.c3se.chalmers.se (standard port 3389).
  • Desktop clients tend to give a better experience.
  • See the documentation for more details

Files and Storage

  • /cephyr/ and /mimer/ are parallel filesystems, accessible from all nodes
  • Backed up home directory at /cephyr/users/<CID>/Alvis
  • Project storage at /mimer/NOBACKUP/groups/<storage-name>
  • The C3SE_quota command shows all your centre storage areas, their usage and quotas.
    • where-are-my-files available on /cephyr
  • File-IO is usually the limiting factor on parallel filesystems
  • Prefer a few large files over many small

Datasets

  • When allowed, we provide popular datasets at /mimer/NOBACKUP/Datasets/
  • Request additional datasets through the support form
  • It is your responsibility to make sure you comply with any licenses and limitations
    • In all cases only for non-commercial research applications
    • Citation often needed
  • Read more on the dataset page and/or the respective README files

Software

GPU hardware details

#GPUs  GPU type  Compute capability  CPU      Note
44     V100      7.0                 Skylake
160    T4        7.5                 Skylake
332    A40       8.6                 Icelake  No InfiniBand
296    A100      8.0                 Icelake  Fast Mimer
32     A100fat   8.0                 Icelake  Fast Mimer

SLURM specifics

  • Main allocatable resource is --gpus-per-node=<GPU type>:<no. gpus>
    • e.g. #SBATCH --gpus-per-node=A40:1
  • Cores and memory are allocated in proportion to the number of GPUs, with the per-GPU amount set by the node type
  • Maximum 7 days walltime
    • Use checkpointing for longer runs
  • Jobs that don’t use allocated GPUs may be automatically cancelled

GPU cost on Alvis

Type     VRAM   System memory per GPU  CPU cores per GPU  Cost
T4       16GB   72 or 192 GB           4                  0.35
A40      48GB   64 GB                  16                 1
V100     32GB   192 or 384 GB          8                  1.31
A100     40GB   64 or 128 GB           16                 1.84
A100fat  80GB   256 GB                 16                 2.2
  • Example: using 2xT4 GPUs for 10 hours costs 7 “GPU hours” (2 x 0.35 x 10).

Monitoring tools

  • You can SSH to nodes where you have an ongoing job
    • From where you can use CLI tools like htop, nvidia-smi, nvtop, …
  • Use job_stats.py <JOBID> to view graphs of usage
  • jobinfo -s can be used to get a summary of currently available resources

Running ML

  • Machine learning in PyTorch and TensorFlow
    • Using GPUs
    • Checkpointing

PyTorch

  • Move tensors or models to the GPU “by hand”
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.Tensor([1, 1, 2, 3]).to(device)
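
  • A minimal checkpointing sketch in plain PyTorch, useful for runs longer than the walltime limit; the model, optimizer and the file name checkpoint.pt are placeholders:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4, 1).to(device)                   # stand-in for your model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Save a checkpoint at regular intervals during training
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Resume from the checkpoint in a later job
checkpoint = torch.load("checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])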

PyTorch Lightning

  • Lightning is a wrapper that hides PyTorch boilerplate
  • Trainer and LightningModule handle moving the data/model to GPUs
import lightning as L
import torch


class LightningTransformer(L.LightningModule):
    def __init__(self, ...):
        super().__init__()
        self.model: torch.nn.Module = ...

    def training_step(self, batch, batch_idx): ...

    def configure_optimizers(self): ...
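
  • A minimal training sketch; the Trainer handles moving the model and batches to the GPU (the constructor arguments and train_loader are placeholders):
model = LightningTransformer(...)                                 # fill in your own arguments
trainer = L.Trainer(accelerator="gpu", devices=1, max_epochs=10)
trainer.fit(model, train_dataloaders=train_loader)                # train_loader: your DataLoader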

PyTorch and PyTorch Lightning Basic Demo

TensorFlow

import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # should list the GPUs allocated to your job
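
  • TensorFlow/Keras places operations on the GPU automatically when one is visible; for checkpointing, a minimal sketch using the Keras ModelCheckpoint callback (the model, data and file name are made up):
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="ckpt-{epoch:02d}.weights.h5",  # hypothetical output location
    save_weights_only=True,                  # save only the weights each epoch
)
model.fit(np.random.rand(128, 64), np.random.rand(128, 1), epochs=3, callbacks=[checkpoint_callback])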

Performance and GPUs

  • What makes GPUs good for AI/ML?
  • And what to think about to get good performance out of it?

General-Purpose computing on GPUs

  • Single Instruction Multiple Threads
    • Massively parallel on 1000s to 10000s of threads
  • Specialised Matrix-Multiply Units (Tensor Cores)
    • Most DL architectures can be reduced to mostly GEneral Matrix Multiplications

Precision and performance (×10¹² OP/s)

Data type  GH200   A100    A40      V100  T4
FP64       34      9.7     0.58     7.8   0.25
FP32       67      19.5    37.4     15.7  8.1
TF32       494*²   156*²   74.8*²   N/A   N/A
FP16       990*²   312*²   149.7*²  125   65
BF16       990*²   312*²   149.7*²  N/A   N/A
FP8        1979*²  N/A     N/A      N/A   N/A
Int8       1979*²  624*²   299.3*²  64    130
Int4       N/A     1248*²  598.7*²  N/A   260

TensorCores for GEMM computations

  • FP32 mixed precision GEMM computations with TF32
  • Tensor dimensions must be a multiple of 8

Automatic Mixed Precision
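
  • A minimal sketch of automatic mixed precision in a PyTorch training step; the model, batch and loss are stand-ins:
import torch

model = torch.nn.Linear(64, 8).cuda()                 # stand-in model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()                  # scales the loss to avoid FP16 underflow

inputs = torch.randn(32, 64, device="cuda")           # dummy batch
targets = torch.randint(0, 8, (32,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_fn(model(inputs), targets)            # forward pass uses FP16/TF32 where safe
scaler.scale(loss).backward()                         # backward pass on the scaled loss
scaler.step(optimizer)                                # unscales gradients before the optimizer step
scaler.update()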

Tensor Core Shape Constraints

  • To use Tensor Cores in FP16 precision, the following should be a multiple of 8 (source):
  1. Mini-batch
  2. Linear layer width/dimension
  3. Convolutional layer channel count
  4. Vocabulary size in classification problems (pad if needed)
  5. Sequence length (pad if needed)

Arithmetic Intensity

  • Computational work in a CUDA kernel per input byte
  • If too low you’re memory bound
  • To increase:
    • Concatenate tensors, where suitable, to give layers larger inputs
    • Use channels-last memory format for conv layers (see the sketch below)
    • Wider layers (but only if it makes sense)
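
  • A minimal sketch of switching a convolutional model and its input batch to channels-last memory format; the layer and batch are stand-ins:
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3).cuda()
model = model.to(memory_format=torch.channels_last)      # weights stored as NHWC

batch = torch.randn(32, 3, 224, 224, device="cuda")
batch = batch.to(memory_format=torch.channels_last)      # inputs stored as NHWC

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(batch)                                    # lets cuDNN pick faster NHWC Tensor Core kernels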

Don’t Forget Non-Tensor Core Operations

  • Non-Tensor-Core operations are up to 10x slower
    • Optimising/reducing these can give the biggest overall improvement
  • Compiling models can help (JIT, XLA), as sketched below
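
  • A minimal sketch of compiling a model with torch.compile (PyTorch ≥ 2.0); the model is a stand-in:
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).cuda()
compiled_model = torch.compile(model)    # traces the model and fuses kernels

x = torch.randn(32, 64, device="cuda")
y = compiled_model(x)                    # first call is slow (compilation), later calls are faster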

GPU monitoring

  • nvtop & nvidia-smi
    • utilization: percent of time any SM is used (not percent of SMs used)
  • job_stats.py JOBID (Alvis/Vera only)
    • power consumption as a proxy for occupancy
  • See profiling section later for more detailed results

Performance and parallel filesystems

  • Performance considerations for data loading on parallel filesystems

The parallel filesystem

Striping on parallel filesystems

Small vs big files

Metadata bound

Performance suggestions

  • Prefer a few large files over many small
    • Many good formats: HDF5, NetCDF, Arrow, Safetensors, … (see the sketch below)
  • Containers make Python environments start up faster (fewer small files to open)
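
  • A minimal sketch of packing many small samples into a single HDF5 file with h5py; the shapes and file names are made up:
import h5py
import numpy as np

# Write: one file holding all samples instead of one file per sample
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=np.random.rand(1000, 64, 64).astype("float32"))
    f.create_dataset("labels", data=np.random.randint(0, 10, size=1000))

# Read: slice individual samples without loading the whole file
with h5py.File("dataset.h5", "r") as f:
    image, label = f["images"][42], f["labels"][42]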

Profiling

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.

Scalene

  • General sampling profiler for Python, covering CPU, GPU and memory
  • Jupyter: %load_ext scalene + %%scalene
  • Lightning: might be buggy when used with Scalene
python -m scalene run my_script.py
python -m scalene view --cli

PyTorch profiler

# Plain PyTorch https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    ...  # run the code you want to profile
print(prof.key_averages().table(sort_by="cuda_time_total"))
prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto

# PyTorch Lightning https://lightning.ai/docs/pytorch/stable/api_references.html#profiler
trainer = Trainer(..., profiler="pytorch")
...

TensorFlow profiler and TensorBoard

# Profile batches 10 to 15 and write a trace viewable in TensorBoard's Profile tab
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(10, 15))
# pass it to model.fit(..., callbacks=[tb_callback])

Multi-GPU parallelism

  • Task parallelism
    • Embarrassingly parallel
  • Data parallelism
    • For speed-up when single-GPU efficiency is already good
  • Flavours of model parallelism
    • When the model doesn’t fit on the GPU

Task parallelism

  • When little to no communication is needed
    • Inference on different data
    • Training with different set-ups (e.g. hyperparameter tuning)
  • Use job-arrays or task farms

Distributed Data Parallelism

  • Copy the model to each GPU and feed them different data
    • Communicate gradient updates (all-reduce), as sketched below
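
  • A minimal sketch of distributed data parallelism in plain PyTorch, assuming the script is launched with torchrun (e.g. torchrun --nproc_per_node=$SLURM_GPUS_ON_NODE train.py, where train.py is a placeholder name):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU, wired up by torchrun
local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun for each process
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(64, 8).to(local_rank)    # stand-in model
ddp_model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks

# ... training loop: give each rank its own shard of the data,
# e.g. with torch.utils.data.distributed.DistributedSampler

dist.destroy_process_group()
  • In Lightning, roughly the same is achieved with Trainer(accelerator="gpu", devices=..., strategy="ddp").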

Data parallelism

Pipeline parallelism

Fully Sharded Data Parallel

  • Not available in TensorFlow
  • All parameter tensors are fully distributed (Fully Sharded)
  • Each GPU computes its own mini-batch (Data Parallel), as sketched below
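
  • A minimal sketch of wrapping a model in PyTorch FSDP, reusing the same torchrun/process-group setup as the data-parallel sketch above; the model is a stand-in:
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Linear(64, 8).cuda()    # stand-in model
fsdp_model = FSDP(model)                 # parameters, gradients and optimizer state are sharded
# shards are gathered/scattered automatically during forward and backward
  • In Lightning, strategy="fsdp" in the Trainer gives similar behaviour.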

Tensor Parallelism

  • Split the layer weights over GPUs i: the first linear layer is split column-wise and the next row-wise, so only one all-reduce is needed per pair of layers:

\[
\begin{aligned}
x_{\cdot i}^{(n+1)} &= \mathrm{Act}\left(x^{(n)} l^{(n)}_{\cdot i} + b^{(n)}_{\cdot i}\right), \\
x^{(n+2)} &= \mathrm{Act}\left(\mathrm{AllReduce}^{\sum}_i\left( x^{(n+1)}_{\cdot i} l^{(n+1)}_{i \cdot}\right) + b^{(n+1)}\right).
\end{aligned}
\]

PyTorch
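
  • A minimal sketch with PyTorch's tensor-parallel API (torch.distributed.tensor.parallel, recent 2.x releases), assuming the process group is initialised as in the data-parallel sketch above; the two-layer model mirrors the equation:
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))  # all ranks in one tensor-parallel group

model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),   # column-wise split: each rank holds a slice of the output features
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),   # row-wise split: partial results are summed with an all-reduce
)
tp_model = parallelize_module(model, mesh, {"0": ColwiseParallel(), "2": RowwiseParallel()})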

TensorFlow

Basic LLM inference

  • The very basics

Aside: Finding Free Ports

  • Needed by a variety of software, including torchrun, vllm and ray
  • find_ports CLI utility available on Alvis
import random
import socket


def get_free_ports(num_ports=1):
    ports = list(range(2**15, 2**16))
    random.shuffle(ports)  # randomize to minimize risk of clashes
    free_ports = []

    for port in ports:
        if len(free_ports) >= num_ports:
            break

        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(('', port))
                free_ports.append(port)  # if successful, add port to list
            except OSError:
                continue  # if port is in use, try another one

    if len(free_ports) < num_ports:
        raise RuntimeError("Not enough free ports.")

    return free_ports

HuggingFace Transformers Set-up

  • Used by most LLM inference engines
  • By default, saves full models in the home directory (which quickly exceeds its quota)
    • Set HF_HOME or, if already downloaded, specify the absolute path to the model snapshot directory (see the sketch below)
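
  • A minimal sketch of redirecting the cache to project storage; set the variable before anything from transformers is imported (the storage path is a placeholder following the project-storage pattern above):
import os

# Can equally be exported in the job script before launching Python
os.environ["HF_HOME"] = "/mimer/NOBACKUP/groups/<storage-name>/hf_home"

from transformers import AutoModelForCausalLM, AutoTokenizer  # models are now cached under HF_HOME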

vLLM Inference Engine

  • vllm serve exposes an LLM endpoint speaking the OpenAI API
    • --tensor-parallel-size="$SLURM_GPUS_ON_NODE"
    • --pipeline-parallel-size="$SLURM_NNODES"
  • Alvis documentation

Further learning on LLMs

  • NAISS LLM Workshop planned for later in 2026
    • To be announced in the NAISS Training Newsletter

Further learning