Running AI/ML workloads on NAISS systems
Scope¶
- Will be covered:
- Introduction for running Deep Learning workloads on the main NAISS AI/ML resource
- Will not be covered:
- A general introduction to machine learning
- Running classical ML or GOFAI
- General HPC intro
NAISS GPU resources overview¶
- Alvis (End-of-life 2026-06-30)
- NVIDIA GPUs: 332 A40s, 318 A100s, 160 T4s, 44 V100s
- Only for AI/ML
- Arrhenius (to be in operation Q2 2026)
- 1528 NVIDIA GH200s
- Dardel (Probably end-of-life 2026)
- 248 AMD MI250X
- Bianca
- 20 NVIDIA A100s
- Only for sensitive data
Alvis specifics¶
- What is potentially different on Alvis?
- See extended version at Alvis introduction material

Connecting - Firewall¶

Log-in nodes¶
- `alvis1.c3se.chalmers.se` has 4 T4 GPUs for light testing and debugging
- `alvis2.c3se.chalmers.se` is a dedicated data transfer node
- Will be restarted from time to time
- Login nodes are shared resources for all users:
  - don't run jobs here
  - don't use up too much memory
  - preparing jobs and light testing/debugging is fine
SSH - Secure Shell¶
- `ssh <CID>@alvis1.c3se.chalmers.se` or `ssh <CID>@alvis2.c3se.chalmers.se`
- Gives command-line access to do anything you could possibly need
- If used frequently, you can set up a password-protected SSH key for convenience
Alvis Open OnDemand portal¶
- https://alvis.c3se.chalmers.se
- Browse files and see disk and file quota
- Launch interactive apps on compute nodes
- Desktop
- Jupyter notebooks
- MATLAB proxy
- RStudio
- VSCode
- Launch apps on log-in nodes
- Desktop
- See our documentation for more
Remote desktop¶
- RDP-based remote desktop solution on shared login nodes (use portal for heavier interactive jobs)
- In-house-developed web client
- Can also be accessed via any desktop client supporting RDP at `alvis1.c3se.chalmers.se` and `alvis2.c3se.chalmers.se` (standard port 3389)
- Desktop clients tend to give a better experience
- See the documentation for more details
Files and Storage¶
- `/cephyr/` and `/mimer/` are parallel filesystems, accessible from all nodes
- Backed-up home directory at `/cephyr/users/<CID>/Alvis`
- Project storage at `/mimer/NOBACKUP/groups/<storage-name>`
- `C3SE_quota` shows you all your centre storage areas, usage and quotas
- `where-are-my-files` available on `/cephyr`
- File-IO is usually the limiting factor on parallel filesystems
- Prefer a few large files over many small
Datasets¶
- When allowed, we provide popular datasets at `/mimer/NOBACKUP/Datasets/`
- Request additional datasets through the support form
- It is your responsibility to make sure you comply with any licenses and limitations
- In all cases only for non-commercial research applications
- Citation often needed
- Read more on the dataset page and/or the respective README files
Software¶
- Containers through Apptainer
- Optimized software in modules
- Flat module scheme, load modules directly
- Read the Python instructions for installing your own Python packages
GPU hardware details¶
| #GPUs | GPU type | Compute capability | CPU | Note |
|---|---|---|---|---|
| 44 | V100 | 7.0 | Skylake | |
| 160 | T4 | 7.5 | Skylake | |
| 332 | A40 | 8.6 | Icelake | No IB |
| 296 | A100 | 8.0 | Icelake | Fast Mimer |
| 32 | A100fat | 8.0 | Icelake | Fast Mimer |
SLURM specifics¶
- Main allocatable resource is `--gpus-per-node=<GPU type>:<no. gpus>`
  - e.g. `#SBATCH --gpus-per-node=A40:1`
  - see the example job script after this list
- Cores and memory are allocated in proportion to the number of GPUs, according to the node type
- Maximum 7 days walltime
- Use checkpointing for longer runs
- Jobs that don't use allocated GPUs may be automatically cancelled
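A minimal job script sketch; the account name and `train.py` are placeholders, adjust them to your own project and environment:

#!/bin/bash
#SBATCH --account=NAISS2025-X-YY        # placeholder project ID
#SBATCH --gpus-per-node=A40:1           # GPU type and count, as above
#SBATCH --time=0-04:00:00               # walltime, at most 7 days

module purge
# load the modules (or start the container) your code needs here
python train.py                         # placeholder training script

Submit with `sbatch`, e.g. `sbatch job.sh`.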
GPU cost on Alvis¶
| Type | VRAM | System memory per GPU | CPU cores per GPU | Cost |
|---|---|---|---|---|
| T4 | 16GB | 72 or 192 GB | 4 | 0.35 |
| A40 | 48GB | 64 GB | 16 | 1 |
| V100 | 32GB | 192 or 384 GB | 8 | 1.31 |
| A100 | 40GB | 64 or 128 GB | 16 | 1.84 |
| A100fat | 80GB | 256 GB | 16 | 2.2 |
- Example: using 2xT4 GPUs for 10 hours costs 7 "GPU hours" (2 x 0.35 x 10).
Monitoring tools¶
- You can SSH to nodes where you have an ongoing job
  - From there you can use CLI tools like `htop`, `nvidia-smi`, `nvtop`, ...
- Use `job_stats.py <JOBID>` to view graphs of usage
- `jobinfo -s` can be used to get a summary of currently available resources
Running ML¶
- Machine learning in PyTorch and TensorFlow
- Using GPUs
- Checkpointing
PyTorch¶
- Move tensors or models to the GPU "by hand"
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.Tensor([1, 1, 2, 3]).to(device)
PyTorch Lightning¶
- Lightning is a wrapper that hides PyTorch boilerplate
- `Trainer` and `LightningModule` handle moving data/models to GPUs
import lightning as L
import torch

class LightningTransformer(L.LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        ...  # compute and return the loss for one batch

    def configure_optimizers(self):
        ...  # return e.g. torch.optim.Adam(self.parameters())
PyTorch and PyTorch Lightning Basic Demo¶
TensorFlow¶
- Automatically tries to use a single GPU
- Will also pre-allocate GPU memory, hiding actual memory usage from external monitoring tools (see the sketch after this list)
- https://www.tensorflow.org/guide/gpu
- Demo
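A short sketch of checking GPU visibility from TensorFlow and, optionally, switching off the up-front pre-allocation so external tools see actual memory usage (uses only standard `tf.config` calls):

import tensorflow as tf

# List the GPUs TensorFlow can see on the allocated node
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Optional: allocate GPU memory on demand instead of pre-allocating it all
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)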
Performance and GPUs¶
- What makes GPUs good for AI/ML?
- And what to think about to get good performance out of it?
General-Purpose computing on GPUs¶
- Single Instruction Multiple Threads
- Massively parallel on 1000s to 10000s of threads
- Specialised Matrix-Multiply Units (Tensor Cores)
- Most DL architectures reduce mostly to GEneral Matrix Multiplications (GEMMs)
Precision and performance (×10¹² OP/s)¶
| Data type | GH200 | A100 | A40 | V100 | T4 |
|---|---|---|---|---|---|
| FP64 | 34 | 9.7 | 0.58 | 7.8 | 0.25 |
| FP32 | 67 | 19.5 | 37.4 | 15.7 | 8.1 |
| TF32 | 494*² | 156*² | 74.8*² | N/A | N/A |
| FP16 | 990*² | 312*² | 149.7*² | 125 | 65 |
| BF16 | 990*² | 312*² | 149.7*² | N/A | N/A |
| FP8 | 1979*² | N/A | N/A | N/A | N/A |
| Int8 | 1979*² | 624*² | 299.3*² | 64 | 130 |
| Int4 | N/A | 1248*² | 598.7*² | N/A | 260 |
TensorCores for GEMM computations¶
- FP32 mixed precision GEMM computations with TF32
- TensorFlow (and Lightning) does this by default
- PyTorch only for convolutions by default
  - `torch.set_float32_matmul_precision('high')` to enable it for matmuls (snippet after this list)
- Tensor dimensions must be a multiple of 8
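In plain PyTorch, TF32 matmuls can be enabled globally; a minimal snippet:

import torch

# Allow TF32 Tensor Core kernels for float32 matrix multiplications
torch.set_float32_matmul_precision("high")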
Automatic Mixed Precision¶
- Calculate with float16 when possible
- Uses loss scaling to avoid losing small gradient values (a training-loop sketch follows after this list)
- Read more: source, NVIDIA, PyTorch, TensorFlow
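A minimal sketch of automatic mixed precision in plain PyTorch; the model, data and optimizer are made-up placeholders:

import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                     # handles loss scaling

for _ in range(10):
    x = torch.randn(64, 1024, device="cuda")             # placeholder batch
    target = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)  # forward in float16 where safe
    scaler.scale(loss).backward()                        # scaled backward preserves small gradients
    scaler.step(optimizer)
    scaler.update()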
Tensor Core Shape Constraints¶
- To use Tensor Cores in FP16 precision, the following should be a multiple of 8 (source):
  - Mini-batch size
  - Linear layer width/dimension
  - Convolutional layer channel count
  - Vocabulary size in classification problems (pad if needed)
  - Sequence length (pad if needed)
Arithmetic Intensity¶
- Computational work in a CUDA kernel per input byte
- If it is too low, you're memory bound
- To increase:
  - Concatenate tensors, when suitable, to give layers larger inputs
  - Use channels-last memory format for conv layers (see the sketch after this list)
  - Wider layers (but only if it makes sense)
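A minimal channels-last sketch for a convolutional layer; the layer and input shapes are arbitrary examples:

import torch
from torch import nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
conv = conv.to(memory_format=torch.channels_last)

x = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
y = conv(x)  # cuDNN can pick NHWC Tensor Core kernels for channels-last tensors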
Don't Forget Non-Tensor Core Operations¶
- Non-Tensor Core operations are up to 10x slower
- Optimising/reducing these can give the largest overall improvement
- Compiling models can help (JIT, XLA); see the sketch below
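For example, in PyTorch >= 2.0 a model can be JIT-compiled with `torch.compile`; the model below is a placeholder:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).cuda()
compiled = torch.compile(model)  # traces, fuses and optimises kernels on first call

y = compiled(torch.randn(8, 512, device="cuda"))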
GPU monitoring¶
- `nvtop` & `nvidia-smi`
  - utilization: percent of time any SM is used (not percent of SMs used)
- `job_stats.py JOBID` (Alvis/Vera only)
  - power consumption as proxy for occupancy
- See profiling section later for more detailed results
Performance and parallel filesystems¶
- Performance considerations for data loading on parallel filesystems
The parallel filesystem¶
Striping on parallel filesystems¶
Small vs big files¶
Performance suggestions¶
- Prefer a few large files over many small
- Many good file formats exist: HDF5, NetCDF, Arrow, Safetensors, ... (see the HDF5 sketch after this list)
- Containers start up faster for Python environments (one large image file instead of many small files)
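A minimal HDF5 sketch with h5py; the file name and array shapes are made up. The point is one large, sliceable file instead of thousands of small ones:

import h5py
import numpy as np

# Write many samples into a single large file
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=np.random.rand(1000, 64, 64), compression="gzip")

# Read back only the slice you need, without loading everything
with h5py.File("dataset.h5", "r") as f:
    batch = f["images"][:32]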
Profiling¶
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
Print "profiling"¶
- First thing to try: print what you want to know (a minimal timing sketch follows after this list)
- Run with `python -u` for unbuffered output
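A minimal timing sketch with plain prints; `train_one_epoch` is a placeholder for your own code:

import time

t0 = time.perf_counter()
train_one_epoch()  # placeholder for the code you want to time
print(f"Epoch took {time.perf_counter() - t0:.1f} s", flush=True)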
Scalene¶
- General sampling Python profiler for CPU, GPU and memory (see the command sketch below)
- Jupyter: `%load_ext scalene` + `%%scalene`
- Lightning: might be buggy with Scalene
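Assuming Scalene is installed in your environment, a script can also be profiled from the command line, e.g.:

python -m scalene train.py   # train.py is a placeholder for your own script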
PyTorch profiler¶
# Plain PyTorch https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    ...  # run the code you want to profile
print(prof.key_averages().table())
prof.export_chrome_trace("trace.json")
# PyTorch Lightning https://lightning.ai/docs/pytorch/stable/api_references.html#profiler
trainer = Trainer(..., profiler="pytorch")
...
- use https://ui.perfetto.dev/ to view JSON trace files
TensorFlow profiler and TensorBoard¶
- Install `tensorboard_plugin_profile`
- Use the TensorBoard callback
# Profile batches 10 to 15
import tensorflow as tf
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(10, 15))
Multi-GPU parallelism¶
- Task parallelism
  - Embarrassingly parallel
- Data parallelism
  - For speed-up when single-GPU efficiency is already good
- Flavours of model parallelism
- When the model doesn't fit on the GPU
Task parallelism¶
- When little to no communication is needed
- Inference on different data
- Training with different set-up (e.g. hyperparameter tuning)
- Use job arrays or task farms (see the job-array sketch below)
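A job-array sketch for embarrassingly parallel runs; the script and config paths are placeholders:

#!/bin/bash
#SBATCH --gpus-per-node=T4:1
#SBATCH --array=0-9                     # ten independent tasks

# each task picks its own configuration via SLURM_ARRAY_TASK_ID
python train.py --config "configs/run_${SLURM_ARRAY_TASK_ID}.yaml"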
Distributed Data Parallelism¶
- Copy the model to each GPU and feed them different data
- Communicate gradient updates (all-reduce); see the Lightning sketch below
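With Lightning, DDP is mostly a `Trainer` setting; a minimal sketch assuming a `LightningModule` and dataloader like the earlier example:

import lightning as L

trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp")
trainer.fit(model, train_dataloader)  # model and dataloader as in the earlier sketch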
Pipeline parallelism¶
Fully Sharded Data Parallel¶
- Not available in TensorFlow
- All parameter tensors are fully distributed (Fully Sharded)
- Each GPU computes its own mini-batch (Data Parallel); see the sketch below
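One way to try this from Lightning is its built-in FSDP strategy; a minimal sketch with the same placeholder model and dataloader:

import lightning as L

trainer = L.Trainer(accelerator="gpu", devices=4, strategy="fsdp")
trainer.fit(model, train_dataloader)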
Tensor Parallelism¶
- The Megatron-LM paper paired row-parallel and column-parallel layers
PyTorch¶
TensorFlow¶
Basic LLM inference¶
- The very basics
Aside: Finding Free Ports¶
- Needed for a variety of software, including `torchrun`, `vllm` and `ray`
- `find_ports` CLI utility available on Alvis
import random
import socket

def get_free_ports(num_ports=1):
    ports = list(range(2**15, 2**16))
    random.shuffle(ports)  # randomize to minimize risk of clashes
    free_ports = []
    for port in ports:
        if len(free_ports) >= num_ports:
            break
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(('', port))
                s.close()
                free_ports.append(port)  # if successful, add port to list
            except OSError:
                continue  # if port is in use, try another one
    if len(free_ports) < num_ports:
        raise RuntimeError("Not enough free ports.")
    return free_ports
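Example usage, e.g. to pick a master port before launching a distributed job; the script name is a placeholder:

port = get_free_ports()[0]
print(port)  # pass it on, e.g. as torchrun --master_port=<port> train.py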
HuggingFace Transformers Set-up¶
- Used by most LLM inference engines
- By default saves full models in the home directory (which can run out of quota)
- Set `HF_HOME` or, if models are already downloaded, specify absolute paths to the model snapshot directories (see the sketch below)
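A sketch of redirecting the Hugging Face cache to project storage; replace `<storage-name>` with your own storage project:

# in your job script or shell, before the first model download
export HF_HOME=/mimer/NOBACKUP/groups/<storage-name>/hf_home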
vLLM Inference Engine¶
- `vllm serve` serves an LLM endpoint speaking the OpenAI API (example invocation below)
  - `--tensor-parallel-size="$SLURM_GPUS_ON_NODE"`
  - `--pipeline-parallel-size="$SLURM_NNODES"`
- Alvis documentation
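An example invocation sketch; the model name and `$PORT` are placeholders:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size "$SLURM_GPUS_ON_NODE" \
    --pipeline-parallel-size "$SLURM_NNODES" \
    --port "$PORT"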
Further learning on LLMs¶
- NAISS LLM Workshop planned for later in 2026
- To be announced in the NAISS Training Newsletter
Further learning¶
- Alvis tutorial (slightly dated)
- NAISS Training Events: https://www.naiss.se/training/
- Questions related to training events: https://supr.naiss.se/support/?problem_type=training_event