

- `alvis1.c3se.chalmers.se` has 4 T4 GPUs for light testing and debugging
- `alvis2.c3se.chalmers.se` is a dedicated data transfer node
- Log in with `ssh <CID>@alvis1.c3se.chalmers.se` or `ssh <CID>@alvis2.c3se.chalmers.se`
- Remote desktop is available on `alvis1.c3se.chalmers.se` and `alvis2.c3se.chalmers.se` (standard port 3389)
- `/cephyr/` and `/mimer/` are parallel filesystems, accessible from all nodes
  - home directory: `/cephyr/users/<CID>/Alvis`
  - project storage: `/mimer/NOBACKUP/groups/<storage-name>`
- `C3SE_quota` shows all your centre storage areas, their usage and quotas
- `where-are-my-files` shows where your files on `/cephyr/` are located
- Datasets are available on `/mimer/NOBACKUP/Datasets/`

| #GPUs | GPU type | Compute capability | CPU | Note |
|---|---|---|---|---|
| 44 | V100 | 7.0 | Skylake | |
| 160 | T4 | 7.5 | Skylake | |
| 332 | A40 | 8.6 | Icelake | No IB |
| 296 | A100 | 8.0 | Icelake | Fast Mimer |
| 32 | A100fat | 8.0 | Icelake | Fast Mimer |
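
As a quick check of what a job was actually allocated, a sketch (assumes PyTorch is available in the job's environment):

```python
import torch

# Name and compute capability of the first GPU allocated to this job
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6) for an A40
```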
Request GPUs with `--gpus-per-node=<GPU type>:<no. gpus>`, e.g.

```bash
#SBATCH --gpus-per-node=A40:1
```

(a minimal job script sketch follows the table below)

| Type | VRAM | System memory per GPU | CPU cores per GPU | Cost (relative to A40) |
|---|---|---|---|---|
| T4 | 16GB | 72 or 192 GB | 4 | 0.35 |
| A40 | 48GB | 64 GB | 16 | 1 |
| V100 | 32GB | 192 or 384 GB | 8 | 1.31 |
| A100 | 40GB | 64 or 128 GB | 16 | 1.84 |
| A100fat | 80GB | 256 GB | 16 | 2.2 |
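
A minimal job script sketch tying the directive and the table together (account, time limit and script name are placeholders):

```bash
#!/bin/bash
#SBATCH --account=<project>          # your project/account name
#SBATCH --time=01:00:00
#SBATCH --gpus-per-node=A40:1        # one A40: 16 cores and 64 GB RAM per the table

# load modules or activate your environment here, then run the workload
python train.py                      # placeholder script
```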
- Monitor usage with `htop`, `nvidia-smi`, `nvtop`, …
- `job_stats.py <JOBID>` to view graphs of usage
- `jobinfo -s` can be used to get a summary of currently available resources
- `Trainer` and `LightningModule` handle moving data/model to GPUs

Peak throughput per data type (TFLOPS; TOPS for the integer types):

| Data type | GH200 | A100 | A40 | V100 | T4 |
|---|---|---|---|---|---|
| FP64 | 34 | 9.7 | 0.58 | 7.8 | 0.25 |
| FP32 | 67 | 19.5 | 37.4 | 15.7 | 8.1 |
| TF32 | 494*² | 156*² | 74.8*² | N/A | N/A |
| FP16 | 990*² | 312*² | 149.7*² | 125 | 65 |
| BF16 | 990*² | 312*² | 149.7*² | N/A | N/A |
| FP8 | 1979*² | N/A | N/A | N/A | N/A |
| Int8 | 1979*² | 624*² | 299.3*² | 64 | 130 |
| Int4 | N/A | 1248*² | 598.7*² | N/A | 260 |
- `torch.set_float32_matmul_precision('high')` to enable TF32 for matmuls
- `nvtop` & `nvidia-smi`
- `job_stats.py JOBID` (Alvis/Vera only)
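
A minimal sketch of the TF32 setting above (assumes a CUDA GPU with TF32 tensor cores, i.e. A40/A100 on Alvis):

```python
import torch

# Trade a little float32 precision for the much higher TF32 matmul throughput
torch.set_float32_matmul_precision("high")

x = torch.randn(8192, 8192, device="cuda")
y = x @ x  # this matmul now runs on TF32 tensor cores
```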
> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
>
> Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
>
> — Donald Knuth, *Structured Programming with go to Statements* (1974)
- `python -u` for unbuffered mode
- `%load_ext scalene` + `%%scalene` to profile a cell in Jupyter

```python
# Plain PyTorch: https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
from torch.profiler import profile

with profile(...) as prof:
    ...  # run the code you want to profile

print(prof.key_averages().table())
prof.export_chrome_trace("trace.json")
```
```python
# PyTorch Lightning: https://lightning.ai/docs/pytorch/stable/api_references.html#profiler
trainer = Trainer(..., profiler="pytorch")
...
```

- View the resulting traces in TensorBoard with `tensorboard_plugin_profile`

Tensor parallelism splits a layer's weight matrix across GPUs: layer \(n\)'s weights are split by columns so each GPU \(i\) computes one slice of the activations, and layer \(n+1\)'s weights are split by rows so the partial products can be summed with an all-reduce:

\[
\begin{aligned}
x_{\cdot i}^{(n+1)} &= \mathrm{Act}\left(x^{(n)} l^{(n)}_{\cdot i} + b^{(n)}_{\cdot i}\right), \\
x^{(n+2)} &= \mathrm{Act}\left(\mathrm{AllReduce}^{\sum}_i\left( x^{(n+1)}_{\cdot i} l^{(n+1)}_{i\cdot}\right) + b^{(n+1)}\right).
\end{aligned}
\]
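
A minimal sketch (hypothetical helper, assuming `torch.distributed` is already initialised with one rank per GPU) of how the two equations above map onto code:

```python
import torch
import torch.distributed as dist

def tensor_parallel_block(x, w1_col, b1_col, w2_row, b2):
    # x:      (batch, d_in), replicated on every rank
    # w1_col: (d_in, d_hidden / world_size), this rank's column block of layer n
    # b1_col: (d_hidden / world_size,), this rank's slice of layer n's bias
    # w2_row: (d_hidden / world_size, d_out), this rank's row block of layer n+1
    # b2:     (d_out,), full bias of layer n+1
    act = torch.nn.functional.gelu                   # stand-in for "Act"
    h_i = act(x @ w1_col + b1_col)                   # x^(n+1)_{.i}: local column slice
    partial = h_i @ w2_row                           # partial product x^(n+1)_{.i} l^(n+1)_{i.}
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # AllReduce^sum over ranks i
    return act(partial + b2)                         # x^(n+2), replicated on every rank
```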
- Launch distributed workloads with `torchrun`, `vllm` and `ray`
- `find_ports` CLI utility available on Alvis

```python
import random
import socket

def get_free_ports(num_ports=1):
    """Return num_ports currently unused TCP ports."""
    ports = list(range(2**15, 2**16))
    random.shuffle(ports)  # randomize to minimize the risk of clashes
    free_ports = []
    for port in ports:
        if len(free_ports) >= num_ports:
            break
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(('', port))
                free_ports.append(port)  # if successful, add port to the list
            except OSError:
                continue  # port is in use, try another one
    if len(free_ports) < num_ports:
        raise RuntimeError("Not enough free ports.")
    return free_ports
```
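
For example, the chosen port can be exported as the rendezvous port before launching (the `MASTER_PORT` name follows `torch.distributed` conventions):

```python
import os

# Pick one free port and hand it to the distributed launcher via the environment
os.environ["MASTER_PORT"] = str(get_free_ports()[0])
```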
- Set `HF_HOME`, or if the model is already downloaded, specify the absolute path to the model snapshot directory
- `vllm serve` serves an LLM endpoint speaking the OpenAI API
- `--tensor-parallel-size="$SLURM_GPUS_ON_NODE"`, `--pipeline-parallel-size="$SLURM_NNODES"`
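
A single-node serving sketch putting these pieces together (GPU type/count, time limit and model path are placeholders; multi-node pipeline parallelism additionally needs a Ray cluster):

```bash
#!/bin/bash
#SBATCH --gpus-per-node=A40:4
#SBATCH --time=02:00:00

# Serve an already-downloaded snapshot over the OpenAI-compatible API
vllm serve /mimer/NOBACKUP/groups/<storage-name>/models/<model-snapshot> \
    --tensor-parallel-size="$SLURM_GPUS_ON_NODE"
```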