Material for training held on 20.11.2024
What is Chimera
- University HPC (high-performance computing) cluster
- Anyone with SIS login can access the cluster
- About 50 servers, ~2300 CPU cores, 2 GPUs
- A major extension is planned for 2026!
- Servers are running the Rocky Linux 9 OS (EL9 Linux)
- Documentation:
When to use Chimera
- It is a medium-sized cluster.
- It is unsuitable for jobs requiring enormous resources and is not an alternative to supercomputers or the LHC Grid.
- It is suitable for medium-sized tasks:
- Up to a few thousand parallel processes
- Up to a few Terabytes of data
- Jobs that need fast turn-around, e.g.
- final stages of data analysis
- preparation and debugging of software
- student projects
Logging in to the cluster
In the terminal window (on Mac or Linux) or PowerShell (on Windows), use the ssh command:
ssh <sis_login>@hpc.troja.mff.cuni.cz
- Where <sis_login> is your user name in the university's LDAP database (i.e., the user name you use in SIS, CIS, and other university applications)
- This will log you into the head node of the cluster.
- Your home folder: /home/<sis_login>
- Feel free to look around:
- The command ls lists the content of your home folder
- The command df -h shows you available storage devices
- The command squeue will show you all running jobs on the cluster
- Do not use the head node to run CPU-intensive processes. We will use jobs for this (see the following sections).
Chimera partitions
- The cluster is subdivided into partitions, serving as queues for jobs with different priorities.
- All users have access to the following partitions:
- Free-for-all partitions: ffa, ffa-preempt, ffa-check
- Low priority → it may take some time for your jobs to start, depending on cluster occupancy.
- However, it offers access to the highest number of nodes – most cluster nodes are included in this partition.
- Jobs in the "preempt" and "check" partitions can be killed and resubmitted when competing for resources with a higher-priority job → your jobs must be able to recover from this if you want to use the ffa-preempt / ffa-check partitions (see the sketch at the end of this section).
- Free-for-all partitions are limited to jobs running for up to 1.5 days.
- Education partition: edu
- Meant to be used for teaching
- High-priority partition, but limited to only two cluster nodes (mff-a2-01 and mff-a2-02).
- Limited to 3h jobs
- GPU partition: ffa-gpu
- For running GPU jobs (see later)
- In addition, if you are a member of the "ucjf" account, you gain access to the ucjf partition:
- High priority
- Unlimited time
- Only five server nodes bought by our department: ucjf-asus1, ucjf-asus2, ucjf-asusb1, ucjf-asusb2, ucjf-a4-01
- To be added to the UCJF account, please contact Daniel.Scheirich@matfyz.cuni.cz
IMPORTANT NOTE: You must be associated with an “account” to submit jobs. When you log in for the first time via JupyterHub (see below), you will be automatically added to the “FFA” account, and you will be able to use FFA and edu partitions.
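For illustration, here is a minimal sketch of a job that can survive being killed and resubmitted in the ffa-preempt / ffa-check partitions: it records its progress in a checkpoint file and resumes from it on restart. The file name and the work loop are only illustrative assumptions; batch scripts themselves are covered in the "Submitting batch jobs" section below.
#!/bin/bash
# Illustrative checkpoint file kept in the job's working directory
CKPT=checkpoint.txt
# Resume from the last finished step if the job was preempted and resubmitted
START=0
if [ -f "$CKPT" ]; then
    START=$(cat "$CKPT")
fi
for (( i = START; i < 1000; i++ )); do
    # ... one unit of real work goes here ...
    echo "$i" > "$CKPT"    # record progress after each finished step
done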
Submitting an interactive job
To submit and manage jobs (computing tasks), the cluster uses the SLURM batch system.
- Interactive job: gives you terminal access to the worker node (behaves like an SSH session). When you disconnect your terminal, the job is killed.
- Batch job: your job is executed and runs until it is done. You cannot interact with your job.
Submitting an interactive job:
srun -p ucjf --mem 4G --cpus-per-task 2 --pty bash -i
It should give an output like this. Note that it may take some time for the job to start:
srun: job 29318806 queued and waiting for resources
srun: job 29318806 has been allocated resources
sched3am@ucjf-asusb1:~$
- The interactive job gives you access to the worker node. You can run any CPU-intensive tasks there.
- Option "-p ucjf" specifies the partition where the job is executed. You must be a member of the UCJF account to be allowed to use this partition (see above).
- Option "--mem 4G" specifies how much memory is allocated to your job. If the job runs out of memory, it is killed.
- Option "--cpus-per-task 2" specifies how many CPUs should be given to your job. Note that hyperthreads are counted as CPUs, so for most processors, one core equals two CPUs.
- The options "--reservation ucjf_58 -A ucjf" allow the job to use the reservation created for this training. Please do not use them outside of this training.
- If you are not in the UCJF account, you can try to submit to the edu partition:
srun -p edu --mem 4G --cpus-per-task 2 --pty bash -i
NOTE: Unlike SSH connections, interactive jobs do not provide a tunnel for displaying application windows. You need to use a workaround if you want to run GUI applications.
Running GUI applications
- Sometimes, it is helpful to run programs that produce windows
- For example:
- ROOT produces windows when displaying plots
- Running Mathematica in GUI mode
- Allowing applications to display windows must be done in two steps:
- Start the interactive job following the instructions in the previous section
- Open another terminal window on your laptop and connect to the head node with the "-Y" option. Once you are logged in, create another ssh connection to the server where your interactive job is running:
ssh -Y <sis_login>@hpc.troja.mff.cuni.cz
ssh -Y <name_of_the_node_where_your_job_is_running>
- SLURM will not allow you to ssh to the node where you have no running job. Therefore, your interactive job must run the entire time you work with the GUI application.
- Your laptop OS must be capable of displaying X11 windows forwarded from the Linux OS. If you have Linux on your laptop, it works automatically. If you use a Mac or Windows, you need a third-party application such as XQuartz (Mac) or Xming (Windows).
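As a quick sanity check (a suggestion, not part of the official procedure), you can verify in the second terminal that X forwarding is working before launching a GUI application:
# on the worker node, in the terminal opened with ssh -Y
echo $DISPLAY
# should print something like localhost:10.0; if it is empty, X forwarding is not set up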
Example: running Mathematica in an interactive job
1st terminal window:
ssh <sis_login>@hpc.troja.mff.cuni.cz
srun -p ucjf --mem 4G --cpus-per-task 2 --pty bash -i
2nd terminal window:
ssh -Y <sis_login>@hpc.troja.mff.cuni.cz
ssh -Y <name_of_the_node_where_your_job_is_running>
module load Mathematica; mathematica
Example: running ROOT GUI in an interactive job
Open the two terminals the same way as before. In the 2nd terminal:
source /singularity/ucjf/root-6.28.12-x86_64-el9-gcc13-opt-LCG_104d_ATLAS_22/thisroot.sh
root
# at the ROOT prompt, open the browser:
TBrowser b
Submitting batch jobs
- Batch jobs run without interaction with the user.
- Job outputs are forwarded to the log files.
- One can submit a large number of jobs using job arrays.
- We must create an executable submit shell script to submit a batch job. Here is an example:
- In your home folder, create a "tutorial_chimera" folder and change the directory:
mkdir tutorial_chimera
cd tutorial_chimera
- Using your favourite editor (nano, vim, emacs), create a submit script:
nano submit.sh
#!/bin/bash
#SBATCH --array 0-9
#SBATCH --mem 500M
#SBATCH --cpus-per-task 2
#SBATCH --job-name=test
#SBATCH --error=test.%a.log
#SBATCH --output=test.%a.log
#SBATCH --open-mode append

# Here, do the real work
# Just a simple example:
python test.py $SLURM_ARRAY_TASK_ID
- The SLURM options are specified in the #SBATCH comments. When submitting the job, these options can also be set from the command line.
- Option --array is used when running multiple jobs in parallel. The index of the sub-job is stored in the environment variable $SLURM_ARRAY_TASK_ID and in the %a placeholder.
- Options --error and --output set the name(s) of the output log file(s). Use the %a placeholder when submitting job arrays.
- Option --open-mode append is useful when submitting into the "preempt" queues. When used, the log files are not overwritten by resubmitted jobs.
- Now, create the test.py script. In our example, just a simple "Hello world" in Python:
import sys
print("Hello world", sys.argv[1] if len(sys.argv) > 1 else "")

# wait for 20 seconds so that the job is not too fast
import time
for i in range(20):
    print("Working for", i, "seconds")
    time.sleep(1)
print("Done")
- Finally, we need to make the shell script executable and submit it with the sbatch command:
chmod +x submit.sh
sbatch -p ucjf submit.sh
- Your job has now been submitted to the "ucjf" queue.
- You can monitor the job's progress using the squeue command:
squeue --me
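A few other standard SLURM commands are handy for monitoring and managing jobs (<job_ID> is a placeholder for the ID printed by sbatch):
scontrol show job <job_ID>   # detailed information about a pending or running job
sacct -j <job_ID>            # accounting information, also for finished jobs
scancel <job_ID>             # cancel a single job
scancel --me                 # cancel all your jobs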
Proper use of different queues
Pros and cons of the different queues:
| Partition | Pros | Cons |
| --- | --- | --- |
| ucjf | high priority; unlimited run time; usually short wait times | only five servers; competing with your colleagues for resources |
| ffa | large number of nodes; less competition for resources | lower priority; sometimes longer wait times |
| ffa-preempt | the entire cluster; even less competition for resources | the same priority as ffa; your jobs can be killed and resubmitted → they must be able to cope with this |
Partition decision tree:
Please be nice to your colleagues
- Do not fill up the entire ucjf partition with long jobs.
- If you are submitting many parallel jobs, try to use job arrays. Job arrays can restrict the number of sub-jobs running in parallel (see the sketch below):
#SBATCH --array 0-999%20
- Only 20 jobs from this array will run at once.
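Below is a minimal sketch of such a throttled job array that uses $SLURM_ARRAY_TASK_ID to pick an input file; the input paths and the analyse.py script are only illustrative assumptions:
#!/bin/bash
#SBATCH --array 0-999%20
#SBATCH --mem 1G
#SBATCH --cpus-per-task 1
#SBATCH --output=array.%a.log
#SBATCH --error=array.%a.log
# Hypothetical inputs: input_0.root ... input_999.root
INPUT=/work/<your_user_name>/inputs/input_${SLURM_ARRAY_TASK_ID}.root
# Hypothetical analysis script; each sub-job processes one file
python analyse.py "$INPUT"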
JupyterHub
- An alternative way to access the cluster is via a web browser.
- In your browser, go to the following URL:
https://hpc.troja.mff.cuni.cz:8000
- Note the “:8000” suffix. It must be there!
- You will be asked for credentials. Put in your SIS login name and password:
- Specify the parameters of your interactive job:
- The JupyterHub launches the interactive job according to your specifications.
- Once the job starts, you can use a terminal and jupyter notebooks from your browser window.
Storage
- The command df -h shows you available storage devices and their occupancy.
- /home: your home folder.
- You put your code, work, logs, and small data files here (~few GB).
- Do not put large data here. Like seriously, don’t!
- /work: large storage with magnetic HDDs.
- This is where you should put your large data.
- The storage is connected via a fast network (InfiniBand), but HDDs have limited parallel access capability.
- /archive: long-term storage.
- Put here data you do not want to delete yet but do not use daily.
- Connected only by an ethernet network (slower than IB)
- /scratch: for smaller data (from hundreds of GB up to a few TB)
- Connected via InfiniBand
- Based on solid-state drives (SSD). They are much better at handling parallel access.
- /singularity: a small disk we use for storing software containers and virtual environments; writable from the head node, read-only from the worker nodes.
Where to put your data
- /work/tmp: accessible to everyone, but content can be deleted anytime without warning
- /work/<your_user_name>: accessible to you
- /work/ucjf-atlas: accessible for everyone in the hpc-atlas group.
- If you want to have your own group (e.g. hpc-na64) and your shared folder on the /work storage, don't hesitate to contact the cluster admin and provide a list of users to be included in the group.
- /scratch/tmp: the same as /work/tmp
- /scratch/ucjf-atlas: the same as /work/ucjf-atlas
- There are no automatic user folders on /scratch. You have to request a folder (or a group folder) on /scratch.
Optimizing storage parallel access
- There is no absolute rule; you should test how your jobs perform when many of them read data in parallel.
- You submit, e.g., 10 jobs that all read your data to a single node (make sure it is not fully occupied):
sbatch -p ucjf test_job_array.sh
- Remember the job ID.
- Once the jobs are over, run the "seff" command to measure the CPU efficiency of your jobs.
- If the jobs are running at >50% of CPU, you are probably OK. You can try to increase the number of parallel sub-jobs or keep it the same.
- If the jobs are using <<50%, they spend most of their time waiting in I/O sleep. You should reduce the number of parallel sub-jobs.
- The job efficiency can also depend on other users.
- Please remember you are not the only user of the cluster!
- General advice: It’s usually worth shrinking your data by creating reduced derived datasets (e.g., filtering events that do not pass the selection, removing unused variables, etc.).
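As an illustration of such a reduction, ROOT's RDataFrame can filter events and keep only the variables you need; this is just a sketch, and the file, tree, and branch names as well as the selection are hypothetical:
import ROOT

# Hypothetical input: a tree named "events" in data.root
df = ROOT.RDataFrame("events", "/work/<your_user_name>/data.root")

# Keep only events passing the selection and write out only the needed branches
skim = df.Filter("lep_pt > 30 && n_jets >= 2")
skim.Snapshot("events", "/work/<your_user_name>/data_skim.root",
              ["lep_pt", "lep_eta", "n_jets"])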
Example: optimizing the number of parallel jobs
- From /home/sched3am/tutorial_chimera copy the following two files:
submit_storage_test.sh
storage_test.py
- Open the submit_storage_test.sh file and look at its second line:
#SBATCH --array 0-245%100
It says that the array has 246 sub-jobs (indices 0-245) and that, at most, 100 of them are allowed to run at once. Each sub-job reads one file from some large ROOT dataset.
- Submit the job:
sbatch -p ucjf submit_storage_test.sh
- Check if your jobs are running:
squeue --me
- Use ssh to connect to any worker node where your jobs are running and check how well the jobs are doing:
ssh <your_worker-node>
htop -u <your_user_name>
- You will probably see most of the jobs in the "D" state (uninterruptible sleep), meaning they are stuck waiting for I/O. This is very inefficient. Also, it will block other people from using the /work storage efficiently.
- Kill your job and try again, reducing the number of running jobs to 10:
scancel --me
In submit_storage_test.sh replace
#SBATCH --array 0-245%100
by
#SBATCH --array 0-245%10
- Try again. Is it better?
- Once the jobs are over, you can get an exact estimate of the CPU efficiency by running
seff <job_ID>
Containers and virtual environments
- The cluster is (currently) running Rocky Linux 9 (EL9)
- You can use containers if you need a different OS or additional software not installed on the cluster.
- Chimera uses "apptainer" (formerly Singularity) containers. They have the ".sif" extension.
- Some useful containers are already downloaded in the /singularity/ucjf folder (mostly thanks to Pavel Reznicek):
- cc7.sif: CERN CentOS Linux 7 (legacy Linux used at CERN)
- tensorflow_v2.15.0-gpu.sif: CUDA libraries + TensorFlow 2.15
- tensorflow_v2.17.0-gpu.sif: CUDA libraries + TensorFlow 2.17
- ubuntu_v22.04.5_roots.sif: Ubuntu + ROOT (note that ROOT is also directly installed on the cluster, so you do not need to use containers for it)
- To activate the container, run the following commands (first, start an interactive job):
srun -p ucjf --mem 4G --cpus-per-task 2 --reservation ucjf_58 -A ucjf --pty bash -i
apptainer exec --bind=/home --bind=/work --bind=/scratch --bind /singularity/ucjf:/singularity_ucjf /singularity/ucjf/cc7.sif /bin/bash
- If the pre-installed containers do not suit you, you can get a new container from Docker Hub. For example (may take some time to download):
# We need more memory!
srun -p ucjf --mem 50G --cpus-per-task 2 --reservation ucjf_58 -A ucjf --pty bash -i
export APPTAINER_CACHEDIR=/scratch/tmp/<user_id>/apptainer_tmp
export APPTAINER_TMPDIR=/scratch/tmp/<user_id>/apptainer_tmp
apptainer pull docker://pytorch/pytorch
apptainer exec --bind=/home --bind=/work --bind=/scratch --bind /singularity/ucjf:/singularity_ucjf pytorch_latest.sif /bin/bash
- When using containers in batch mode, you need to execute your program inside the container.
apptainer exec --bind=/home --bind=/work --bind=/scratch --bind /singularity/ucjf:/singularity_ucjf pytorch_latest.sif <script_to_execute>
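For instance, a batch submit script that runs a hypothetical train.py inside the pytorch_latest.sif container pulled above could look roughly like this sketch:
#!/bin/bash
#SBATCH --mem 4G
#SBATCH --cpus-per-task 2
#SBATCH --job-name=container-test
#SBATCH --output=container-test.log
#SBATCH --error=container-test.log
# Run the (hypothetical) payload script inside the container
apptainer exec --bind=/home --bind=/work --bind=/scratch --bind /singularity/ucjf:/singularity_ucjf pytorch_latest.sif python train.py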
NOTE: You cannot install new software into the container unless you have special privileges.
- If you need extra software in your base containers, you can ask Pavel Reznicek for help (he has the special privileges).
- You can create the .sif container on your computer (laptop, office desktop, …) where you have superuser privileges and then copy the container file into the cluster.
Python environments
- Bare Python is available on the cluster, but it does not have advanced libraries installed (e.g., pandas, TensorFlow, Sympy, etc.).
- You can use Python virtual environments to configure your Python:
python -m venv <path_where_venv_is_stored>
# Usually, the folder is named "venv":
python -m venv venv
# activate the environment
source venv/bin/activate
# install your libraries using pip
pip install pandas
- NOTE: "venv" folders cannot be copied (for some reason, all paths inside are absolute). So, if you need to share a Python environment with your colleagues, you can put it in the /singularity/ucjf folder.
- For example:
- /singularity/ucjf/venv_4top is used by the 4-top analysis team
- /singularity/ucjf/venv_htt is used by the H->tautau analysis
- /singularity/ucjf/venv_tf_217 is used in the ML class
- Activating the “shared” venv:
source /singularity/ucjf/venv_htt/bin/activate
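A sketch of how a batch job might use the shared venv (the analysis script name is hypothetical):
#!/bin/bash
#SBATCH --mem 2G
#SBATCH --cpus-per-task 1
#SBATCH --output=venv-test.log
# Activate the shared virtual environment
source /singularity/ucjf/venv_htt/bin/activate
# Run a hypothetical analysis script with the venv's Python
python my_analysis.py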
Integrating Visual Studio Code with the cluster
- Run the Visual Studio Code app (called just “code” on Linux)
- In the left panel, click “Extensions” and install the “Remote-SSH” extension.
- In the bottom left corner click on the icon:
- A menu will pop up. Choose "Connect Current Window to Host…" and then put in the cluster URL:
ssh <your_user_name>@hpc.troja.mff.cuni.cz
- Now, you can work with the files on the cluster as if you have them locally.
- WARNING: Visual Studio Code runs a server application on the cluster's head node. Do not execute Python code (or Jupyter notebooks) directly from Visual Studio Code, because it would run on the head node!
Connecting Visual Studio Code to JupyterHub
- Start the JupyterHub interactive job (see the JupyterHub section)
- Connect the Visual Studio code to the head node (as described in the previous section)
- Install the JupyterHub extension
- Create a new Jupyter Notebook file (e.g. “test.ipynb”)
- In the top-right corner, click on the “Select Kernel” button
- Choose "Existing Jupyter Hub server…" and "Enter the URL of the running JupyterHub Server…"
- Add URL:
https://hpc.troja.mff.cuni.cz:8000
- Enter your username and password
- Name the session (e.g. JupyterHub1)
- Choose the Python kernel (e.g. plain Python 3)
- Now, the content of your notebook will be executed on the worker node allocated by JupyterHub rather than by the head node.
NOTE: in case you have problems connecting, you can try to check the following option in the “settings” menu of the Visual Studio Code:
GPU
- Two NVIDIA L40 GPUs are available on Chimera (more will come next year).
- You must use a special partition and SLURM options to get access to the GPU:
srun -p gpu-ffa --gres "mps:5" --mem 10G --cpus-per-task 2 --reservation ucjf_58 -A ucjf --pty bash -i
- Option -p gpu-ffa specifies the partition with the GPU
- Option --gres "mps:5" specifies that you want to use 5% of the GPU capacity
- To use the full GPU (the typical use case), specify --gres "gpu:1"
- Sharing the GPU among multiple users is practical, e.g., for teaching
- To use the GPU, one needs a special library (CUDA) which is not installed in the base system. You need to use a container to get the libraries.
- There is a pre-installed container & python venv with CUDA + TensorFlow on /singularity/ucjf disk
# note the extra --nv option!
apptainer run --bind=/home --bind=/work --bind=/scratch --bind /singularity/ucjf:/singularity_ucjf --nv /singularity/ucjf/tensorflow_v2.17.0-gpu.sif
source /singularity_ucjf/venv_tf_217/bin/activate
- Now, you can check if the GPU is visible. Execute python and copy the following code into the terminal:
import tensorflow as tf
print("Available GPUs:", tf.config.list_physical_devices('GPU'))
- The output shows whether a GPU is available.
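For batch GPU jobs, the same ingredients can be combined in a submit script. This is only a sketch: the training script name is hypothetical, and it assumes the partition and GPU options shown above are also accepted by sbatch:
#!/bin/bash
#SBATCH -p gpu-ffa
#SBATCH --gres=gpu:1
#SBATCH --mem 10G
#SBATCH --cpus-per-task 2
#SBATCH --output=gpu-test.log
# Run a hypothetical training script inside the TensorFlow container (--nv exposes the GPU)
apptainer exec --bind=/home --bind=/work --bind=/scratch --bind /singularity/ucjf:/singularity_ucjf --nv /singularity/ucjf/tensorflow_v2.17.0-gpu.sif bash -c "source /singularity_ucjf/venv_tf_217/bin/activate && python train_model.py"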
GPU and Jupyter Notebooks
- C.f. section Connecting Visual Studio Code to JupyterHub
- You can choose from several pre-installed kernels when selecting Kernel for your notebook.
- You can add your own kernel into your home folder:
~/.local/share/jupyter/kernels/
- We have prepared a custom kernel that gives you access to CUDA libraries and TensorFlow. You can install it into your home folder easily like this:
cd ~/.local/share/jupyter/kernels/
ln -s /singularity/ucjf/ipykernels/tf/
- Now you can start the JupyterHub job specifying GPU resources:
- You should have a “tf gpu” kernel available both on the JupyterHub launcher web page and when you connect the Visual Studio Code to the JupyterHub.