.. _Simulations-GPU:

First simulations on ORCUS: example with GPU partition
========================================================

On ORCUS, several GPU partitions are available. Here we present the basic commands to run a job on the GPU partitions of ORCUS. For compilation, a tutorial is given in :ref:`ORCUS-DM2S` for the partition ``gpuq_h100``, which is equipped with 4 H100 GPUs. For ``cmake``, the flag ``-DKokkos_ARCH_HOPPER90=ON`` is used for the H100 architecture (a minimal configuration sketch is recalled further below, just before job submission). The test is also performed with the problem ``NSAC_Comp``.

Connection and disks on ORCUS
------------------------------

.. tab-set::

    .. tab-item:: Connection

        .. admonition:: For training session: connection on ORCUS
            :class: error

            .. code-block:: shell

                loginname@is247529:~$ ssh -XC orcusloginamd2

            returns

            .. code-block:: shell

                Bienvenue sur le cluster

                  ____  _____   _____ _    _  _____
                 / __ \|  __ \ / ____| |  | |/ ____|
                | |  | | |__) | |    | |  | | (___
                | |  | |  _  /| |    | |  | |\___ \
                | |__| | | \ \| |____| |__| |____) |
                 \____/|_|  \_\\_____|\____/|_____/

                Support   : https://codev-tuleap.intra.cea.fr/projects/InfoSc
                Doc       : evince /product/documentations/users/user_manual.pdf
                Guide ACL : vim +"set syntax=markdown" /product/documentations/users/README_ACL.md

                Frontale ROCKY Linux 9

                Last login: Wed Nov 27 09:10:54 2024 from 132.166.148.113
                loginname@orcusloginamd2:~$

        .. admonition:: For training session: if first connection on ORCUS
            :class: error

            Copy the folder ``run_training_lbm`` into your home directory (its content is described in :ref:`Run_Training-LBM`):

            .. code-block:: shell

                $ cp -r /tmpformation/LBM_Saclay/LBM_Saclay_Rech-Dev/run_training_lbm .

    .. tab-item:: Disks on ORCUS

        .. admonition:: Disks on ORCUS

            Several disks are available:

            - ``HOME``: configuration files, source files of LBM_Saclay, compilation and binary

              .. code-block:: shell

                  $ cd ~

            - ``SCRATCH``: running directory and output files (``.vti`` and ``.h5``)

              .. code-block:: shell

                  $ cd $SCRATCH

            - ``/tmpformation/LBM_Saclay``: shared directory for compiled versions of LBM_Saclay and Slurm scripts

              .. code-block:: shell

                  $ cd /tmpformation/LBM_Saclay

    .. tab-item:: Get information

        **Get information about ORCUS**

        .. admonition:: Check the number of nodes
            :class: hint

            Check the number of nodes (for CPU and GPU partitions):

            .. code-block:: shell

                loginname@orcusloginamd2:~$ sinfo

            returns (only GPU partitions)

            .. code-block:: ruby

                PARTITION  NODES  STATE  CPUS  SOCKETS  CORES  MEMORY  NODELIST
                gpuq_v100      2  idle     32        2     16  376000  orcus-n[3001-3002]
                gpuq_a100      1  mixed    32        2     16  485000  orcus-n3201
                gpuq_a100      1  idle     32        2     16  485000  orcus-n3202
                gpuq_h100      1  idle     72        2     36  990000  orcus-n3401

            meaning that three GPU partitions are available on ORCUS, spread over five nodes:

            - 2 nodes with one V100 GPU each (2 V100 GPUs),
            - 2 nodes with one A100 GPU each (2 A100 GPUs),
            - 1 node with four H100 GPUs.

            The partition ``gpuq_h100`` therefore contains 4 H100 GPUs.

        .. admonition:: To check the number of graphic cards on that node
            :class: hint

            You can make a reservation in interactive mode with:

            .. code-block:: shell

                loginname@orcusloginamd2:~$ srun -p gpuq_h100 -n 1 --gres=gpu:1 --pty bash -i

            If the node is free, you will be able to connect and open a terminal. The command

            .. code-block:: shell

                loginname@orcus-n3401:~$ nvidia-smi

            returns

            .. code-block:: ruby

                Tue Nov 26 10:17:49 2024
                +-----------------------------------------------------------------------------------------+
                | NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
                |-----------------------------------------+------------------------+----------------------+
                | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
                | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
                |                                         |                        |               MIG M. |
                |=========================================+========================+======================|
                |   0  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                    0 |
                | N/A   39C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
                |                                         |                        |             Disabled |
                +-----------------------------------------+------------------------+----------------------+
                |   1  NVIDIA H100 80GB HBM3          On  |   00000000:5F:00.0 Off |                    0 |
                | N/A   37C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
                |                                         |                        |             Disabled |
                +-----------------------------------------+------------------------+----------------------+
                |   2  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
                | N/A   38C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
                |                                         |                        |             Disabled |
                +-----------------------------------------+------------------------+----------------------+
                |   3  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
                | N/A   38C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
                |                                         |                        |             Disabled |
                +-----------------------------------------+------------------------+----------------------+

                +-----------------------------------------------------------------------------------------+
                | Processes:                                                                              |
                |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
                |        ID   ID                                                               Usage      |
                |=========================================================================================|
                |  No running processes found                                                             |
                +-----------------------------------------------------------------------------------------+

            We can check that four ``NVIDIA H100 80GB HBM3`` GPUs are available.
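
.. admonition:: Reminder: ``cmake`` flag for the H100 partition (sketch)
    :class: hint

    Before submitting a job, the LBM_Saclay binary must have been compiled for the target GPU architecture; the full procedure is given in :ref:`ORCUS-DM2S`. The lines below are only a minimal sketch showing where the flag ``-DKokkos_ARCH_HOPPER90=ON`` mentioned in the introduction fits in: the source and build directory names and the option ``-DKokkos_ENABLE_CUDA=ON`` are assumptions, not the exact commands of the tutorial.

    .. code-block:: shell

        # Sketch only: adapt paths and options to the ORCUS-DM2S compilation tutorial.
        # The source and build directory names below are hypothetical.
        $ cd ~/LBM_Saclay_Rech-Dev
        $ mkdir -p build_h100 && cd build_h100
        # Enable the Kokkos CUDA backend and target the H100 (Hopper) architecture
        $ cmake -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_HOPPER90=ON ..
        $ make -j 8
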
Submit your job with a Slurm script
------------------------------------

.. tab-set::

    .. tab-item:: Submit your job

        .. admonition:: For LBM training session
            :class: important

            Three Slurm scripts are available in the folder ``/tmpformation/LBM_Saclay/`` to submit a job on ORCUS:

            - ``JOB_H100_GPU.slurm`` to submit a job on ``gpuq_h100``
            - ``JOB_A100_GPU.slurm`` to submit a job on ``gpuq_a100``
            - ``JOB_V100_GPU.slurm`` to submit a job on ``gpuq_v100``

            An illustrative sketch of such a script is given at the end of this section.

        .. admonition:: For training session: run on ORCUS
            :class: error

            **Submit a job and run**

            In your home directory on ORCUS, go to one of the test case folders (e.g. ``TestCase05_Spinodal-Decomposition2D``):

            .. code-block:: shell

                $ cd ~/run_training_lbm/TestCase05_Spinodal-Decomposition2D

            Submit your test case on one partition: ``gpuq_h100``, ``gpuq_a100`` or ``gpuq_v100``. For example, for ``gpuq_h100``:

            .. code-block:: shell

                $ sbatch /tmpformation/LBM_Saclay/JOB_H100_GPU.slurm TestCase05_Spinodal-Decomposition_CH.ini

    .. tab-item:: Check your submission

        .. admonition:: For training session: check your submission
            :class: hint

            Check that your job is running with:

            .. code-block:: shell

                $ squeue --me

            Alternatively you can write, where ``1`` must be replaced by your own login number:

            .. code-block:: shell

                $ squeue -u S-SAC-DM2S-train1

            which returns

            .. code-block:: ruby

                JOBID   PARTITION  QOS    PRIORITY  NAME   USER      ST  TIME  START_TIME           TIME_LIMIT  CPUS  NODES  NODELIST(REASON)
                533013  gpuq_h100  1jour  0.000000  myjob  ac165432  R   0:01  2025-02-16T11:32:25  12:00:00      18      1  orcus-n3401

            To stop your job:

            .. code-block:: shell

                $ scancel 533013

            where ``533013`` corresponds to your JOBID.
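
.. admonition:: For information: sketch of a GPU Slurm script
    :class: hint

    For the training session, use the provided scripts as they are. For information only, a GPU submission script of this kind typically has the structure sketched below. This is not the actual content of ``JOB_H100_GPU.slurm``: the executable path and the resource values are assumptions (only the partition name, the job name ``myjob`` and the 12-hour time limit appear in the ``squeue`` output above).

    .. code-block:: bash

        #!/bin/bash
        # Illustrative sketch of a Slurm script for the gpuq_h100 partition
        #SBATCH --job-name=myjob
        #SBATCH --partition=gpuq_h100
        #SBATCH --nodes=1
        #SBATCH --ntasks=1
        #SBATCH --gres=gpu:1         # request one H100 GPU
        #SBATCH --time=12:00:00      # matches the TIME_LIMIT shown by squeue

        # The .ini input file is passed as the first argument of sbatch, e.g.
        #   sbatch JOB_H100_GPU.slurm TestCase05_Spinodal-Decomposition_CH.ini
        /path/to/LBM_saclay "$1"     # hypothetical executable path: replace with the actual binary
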
Transfer your output files to your local computer
---------------------------------------------------

.. admonition:: For training session: from your local desktop
    :class: error

    Once the job is complete, transfer the output files to your local computer:

    .. code-block:: shell

        $ scp -r S-SAC-DM2S-train1@orcusloginamd2:~/run_training_lbm/TestCase05_Spinodal-Decomposition2D .

    Post-process the results with ParaView:

    .. code-block:: shell

        $ paraview &

.. sectionauthor:: Alain Cartalade