HPC Clusters
From node ID to xname
Component names (xnames) identify the geolocation for hardware components in the HPE Cray EX system.
On the node, you can run the following command to get the xname:
The xname has the following format:
| Field | Description |
|---|---|
| x | Cabinet number |
| c | Chassis number |
| s | Slot number |
| b | Card number |
| n | Node number |
On the CSCS GH200 systems, the node number is always 0 (n0).
The card number can be either 0 or 1, for the two nodes in the compute blade.
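The fields in the table can be split out of an xname with a short shell function. This is a minimal sketch, assuming the xname is the five field letters in table order, each followed by a decimal number (x<cabinet>c<chassis>s<slot>b<card>n<node>). The function name parse_xname and the sample value x1000c3s5b1n0 are made up for illustration.

```shell
#!/bin/bash
# Split a node xname into its fields.
# Assumption: the format is x<cabinet>c<chassis>s<slot>b<card>n<node>.
parse_xname() {
    fields=$(printf '%s\n' "$1" | sed -nE \
        's/^x([0-9]+)c([0-9]+)s([0-9]+)b([0-9]+)n([0-9]+)$/cabinet=\1 chassis=\2 slot=\3 card=\4 node=\5/p')
    if [ -z "$fields" ]; then
        # sed printed nothing, so the input did not match the assumed format
        echo "not a node xname: $1" >&2
        return 1
    fi
    echo "$fields"
}

parse_xname x1000c3s5b1n0   # hypothetical xname, not a real node
```

The function returns a non-zero status for anything that does not match, so it can be used in a conditional before acting on the fields.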
Power measurements
On Cray EX systems, power measurements can be read from the pm_counters files in /sys/cray/pm_counters/.
pm_counters for a GH200 node on Alps
$ ls -l /sys/cray/pm_counters/
total 0
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel0_energy
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel0_power
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel0_power_cap
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel1_energy
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel1_power
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel1_power_cap
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel2_energy
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel2_power
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel2_power_cap
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel3_energy
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel3_power
-r--r--r-- 1 root root 65536 Nov 21 10:48 accel3_power_cap
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu0_energy
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu0_power
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu0_temp
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu1_energy
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu1_power
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu1_temp
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu2_energy
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu2_power
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu2_temp
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu3_energy
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu3_power
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu3_temp
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu_energy
-r--r--r-- 1 root root 65536 Nov 21 14:38 cpu_power
-r--r--r-- 1 root root 65536 Nov 21 14:38 energy
-r--r--r-- 1 root root 65536 Nov 21 14:38 freshness
-r--r--r-- 1 root root 65536 Nov 21 14:38 generation
-r--r--r-- 1 root root 65536 Nov 21 10:45 power
-r--r--r-- 1 root root 65536 Nov 21 10:45 power_cap
-r--r--r-- 1 root root 65536 Nov 21 14:38 raw_scan_hz
-r--r--r-- 1 root root 65536 Nov 21 14:38 startup
-r--r--r-- 1 root root 65536 Nov 21 14:38 version
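Besides the instantaneous *_power files, the *_energy files make it possible to derive an average power from two readings. The following is a minimal sketch, assuming each counter file starts with its numeric value, that energy is reported in joules, and that the interval between readings is known in seconds. PM_DIR, read_counter and average_power are hypothetical names introduced here, not part of the pm_counters interface.

```shell
#!/bin/bash
# Directory holding the counters. It can be overridden, for example to
# try the functions on a directory of test files.
PM_DIR="${PM_DIR:-/sys/cray/pm_counters}"

# Print the first whitespace-separated field of a counter file.
# Assumption: the file starts with the numeric value (e.g. "1234 J ...").
read_counter() {
    awk 'NR == 1 { print $1 }' "$PM_DIR/$1"
}

# Average power in watts from two energy readings in joules.
# Arguments: start energy, end energy, elapsed seconds.
# Uses integer arithmetic, so the result is truncated.
average_power() {
    echo $(( ($2 - $1) / $3 ))
}
```

For example, storing the output of read_counter energy before and after a 10 second sleep and passing both values to average_power together with 10 gives the node's average power over that window.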
Measuring power consumption
The following script samples the relevant pm_counters:
#!/bin/bash
# Append node, GPU and CPU power readings to log files every 5 seconds.
# The loop ends once a file named stop_monitor exists in the working directory.
while [ ! -f stop_monitor ]; do
    cat /sys/cray/pm_counters/power >> node_power.txt
    cat /sys/cray/pm_counters/accel0_power >> gpu0_power.txt
    cat /sys/cray/pm_counters/accel1_power >> gpu1_power.txt
    cat /sys/cray/pm_counters/accel2_power >> gpu2_power.txt
    cat /sys/cray/pm_counters/accel3_power >> gpu3_power.txt
    cat /sys/cray/pm_counters/cpu0_power >> cpu0_power.txt
    cat /sys/cray/pm_counters/cpu1_power >> cpu1_power.txt
    cat /sys/cray/pm_counters/cpu2_power >> cpu2_power.txt
    cat /sys/cray/pm_counters/cpu3_power >> cpu3_power.txt
    sleep 5
done
It can run alongside an application to log power consumption over time.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --time=00:01:00
#SBATCH --partition=debug
#SBATCH --uenv=prgenv-gnu/25.6:v2
#SBATCH --view=default
rm -f stop_monitor
date
srun --overlap -n1 monitor.sh &
pid=$! # Get PID of most recent background process
srun --overlap -n1 ./bindgpu0.sh hwloc-bind --cpubind core:0-7 -- node-burn/build/burn -ggemm,5000 -cstream,500000 -d30 &
pidj1=$!
srun --overlap -n1 ./bindgpu1.sh hwloc-bind --cpubind core:72-79 -- node-burn/build/burn -ggemm,5000 -cstream,500000 -d30 &
pidj2=$!
wait $pidj1
wait $pidj2
sleep 10
touch stop_monitor
date
trap does not play nice with Slurm
Using trap to stop the monitoring script does not work well with Slurm jobs, and an error is produced.
Instead, a sentinel file (stop_monitor) is used to signal the monitoring script to stop.
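After the job has finished, the log files written by the monitoring script (node_power.txt and the per-device files) can be reduced to a few summary numbers. This is a minimal sketch, assuming every logged line starts with an integer power value in watts. The function name summarize_power is hypothetical.

```shell
#!/bin/bash
# Print the sample count, mean and maximum of the first field of a power log.
# Assumption: each line starts with an integer power value in watts.
# Lines that do not start with an integer are skipped.
summarize_power() {
    awk '$1 ~ /^[0-9]+$/ {
             sum += $1
             if ($1 > max) { max = $1 }
             n++
         }
         END {
             if (n) {
                 printf "samples=%d mean=%.1f max=%d\n", n, sum / n, max
             } else {
                 print "no samples"
             }
         }' "$1"
}
```

Calling summarize_power node_power.txt in the job's working directory prints one line for the node. The same call works for the gpu*_power.txt and cpu*_power.txt files.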