First simulations on ORCUS: example with GPU partition
On Orcus, several GPU partitions are available. Here we present the basic commands to run a job on GPU partitions of ORCUS.
For compilation, a tutorial is available on ORCUS (CEA-ISAS-DM2S): compilation on single- and multi-GPU (documentation updated August 28th, 2025) for the partition gpuq_h100, which is equipped with 4 H100 GPUs. For cmake, the flag -DKokkos_ARCH_HOPPER90=ON must be used for H100. The test is also performed with the problem NSAC_Comp.
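As an illustration, a minimal out-of-source configuration for the H100 partition could look like the sketch below; the source and build directory names and the Kokkos CUDA option are assumptions here, so the ORCUS compilation tutorial remains the reference:
$ cd LBM_Saclay                        # assumed location of the sources
$ mkdir -p build_h100 && cd build_h100
$ cmake -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_HOPPER90=ON ..
$ make -j 8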
Connection and disks on ORCUS
For the training session: connection to ORCUS
loginname@is247529:~$ ssh -XC orcusloginamd2
returns
Bienvenue sur le cluster
(ASCII-art banner spelling "ORCUS")
Support : https://codev-tuleap.intra.cea.fr/projects/InfoSc
Doc : evince /product/documentations/users/user_manual.pdf
Guide ACL : vim +"set syntax=markdown" /product/documentations/users/README_ACL.md
Frontale ROCKY Linux 9
Last login: Wed Nov 27 09:10:54 2024 from 132.166.148.113
loginname@orcusloginamd2:~$
For the training session: if this is your first connection to ORCUS
Copy the folder run_training_lbm into your home directory (its content is described in Practice of two-phase flows with the test cases of run_training_lbm):
$ cp -r /tmpformation/LBM_Saclay/LBM_Saclay_Rech-Dev/run_training_lbm .
Disks on ORCUS
Several disks are available:
HOME: configuration files, source files of LBM_Saclay, compilation and binary
$ cd ~
SCRATCH: for running and output files (.vti and .h5)
$ cd $SCRATCH
/tmpformation/LBM_Saclay: shared directory for compiled versions of LBM_Saclay and slurm scripts
$ cd /tmpformation/LBM_Saclay
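For instance, outside the training session a run would typically be prepared on SCRATCH while the sources and the binary stay in HOME; the test case name below is only illustrative:
$ mkdir -p $SCRATCH/TestCase05_Spinodal-Decomposition2D
$ cp ~/run_training_lbm/TestCase05_Spinodal-Decomposition2D/*.ini $SCRATCH/TestCase05_Spinodal-Decomposition2D/
$ cd $SCRATCH/TestCase05_Spinodal-Decomposition2D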
Get information about ORCUS
Check the number of nodes
Check the number of nodes (for CPU and GPU partitions):
loginname@orcusloginamd2:~$ sinfo
returns (only the GPU partitions are shown)
PARTITION  NODES  STATE  CPUS  SOCKETS  CORES  MEMORY  NODELIST
gpuq_v100      2  idle     32        2     16  376000  orcus-n[3001-3002]
gpuq_a100      1  mixed    32        2     16  485000  orcus-n3201
gpuq_a100      1  idle     32        2     16  485000  orcus-n3202
gpuq_h100      1  idle     72        2     36  990000  orcus-n3401
meaning that three GPU partitions (five GPU nodes in total) are available on ORCUS:
gpuq_v100: 2 nodes with one V100 GPU each (2 V100 GPUs)
gpuq_a100: 2 nodes with one A100 GPU each (2 A100 GPUs)
gpuq_h100: 1 node with four H100 GPUs
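To display only one of these partitions, sinfo can be restricted with its -p option; the output format string below (standard sinfo format specifiers) additionally prints the generic resources (GRES), i.e. the GPUs declared on each node:
$ sinfo -p gpuq_h100 -o "%P %n %c %m %G"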
To check the number of graphics cards on that node
You can make a reservation in interactive mode with:
loginname@orcusloginamd2:~$ srun -p gpuq_h100 -n 1 --gres=gpu:1 --pty bash -i
If the node is free, you will be able to connect and open a terminal. The command
loginname@orcus-n3401:~$ nvidia-smi
returns
Tue Nov 26 10:17:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                    0 |
| N/A   39C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:5F:00.0 Off |                    0 |
| N/A   37C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   38C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   38C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
We can check that four NVIDIA H100 80GB HBM3 GPUs are available.
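A more compact check is the standard listing option of nvidia-smi, which prints one line per detected GPU:
$ nvidia-smi -L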
Submit your job with a slurm script
For the LBM training session
Three slurm scripts are available in the folder /tmpformation/LBM_Saclay/ to submit a job on ORCUS (a sketch of their typical content is given below).
JOB_H100_GPU.slurm: to submit a job on gpuq_h100
JOB_A100_GPU.slurm: to submit a job on gpuq_a100
JOB_V100_GPU.slurm: to submit a job on gpuq_v100
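The reference content of these scripts is the one stored in /tmpformation/LBM_Saclay/. As an indication only, a minimal sketch of such a GPU job script is given below; the resource values and the binary path are assumptions, and the input file (.ini) is taken as the first argument passed to sbatch:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=gpuq_h100      # gpuq_a100 / gpuq_v100 in the other two scripts
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1               # reserve one GPU of the node
#SBATCH --time=12:00:00
# Run LBM_Saclay on the input file given as first argument of sbatch
# (assumed binary path, adapt to your own build)
srun $HOME/LBM_Saclay/build/LBM_saclay $1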
For the training session: run on ORCUS
Submit a job and run
In your home directory on ORCUS, go to one test case folder (e.g. TestCase05_Spinodal-Decomposition2D):
$ cd ~/run_training_lbm/TestCase05_Spinodal-Decomposition2D
Submit your test case on one partition: gpuq_h100, gpuq_a100 or gpuq_v100. For example for gpuq_h100:
$ sbatch /tmpformation/LBM_Saclay/JOB_H100_GPU.slurm TestCase05_Spinodal-Decomposition_CH.ini
For the training session: check your submission
Check that your job is running: the command
$ squeue --me
or alternatively (where 1 must be replaced by your own login number)
$ squeue -u S-SAC-DM2S-train1
returns
 JOBID  PARTITION  QOS    PRIORITY  NAME   USER      ST  TIME  START_TIME           TIME_LIMIT  CPUS  NODES  NODELIST(REASON)
533013  gpuq_h100  1jour  0.000000  myjob  ac165432  R   0:01  2025-02-16T11:32:25  12:00:00      18      1  orcus-n3401
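While the job is running, its standard output can be followed from the submission directory, assuming the batch script relies on the sbatch default output file slurm-<JOBID>.out:
$ tail -f slurm-533013.out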
To stop your job
$ scancel 533013
where 533013 corresponds to your JOBID
Transfer your output files to your local computer
For the training session: from your local desktop
Once the job is complete, transfer the output files to your local computer:
$ scp -r S-SAC-DM2S-train1@orcusloginamd2:~/run_training_lbm/TestCase05_Spinodal-Decomposition2D .
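If the transfer is interrupted or has to be repeated, rsync can be used instead of scp with the same source and destination (only new or modified files are then re-sent):
$ rsync -avz S-SAC-DM2S-train1@orcusloginamd2:~/run_training_lbm/TestCase05_Spinodal-Decomposition2D .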
Post-process with ParaView
$ paraview &
Section author: Alain Cartalade