
I am trying to run a benchmark on some family of algorithms.

I have multiple algorithms, each of them with one hyperparameter, and I want to test them with multiple data sizes. Each run takes ~60 seconds, but there is a high cardinality, hence the need for a cluster.

At the moment, I am submitting one job with one task for each benchmark run, but I don't know if that is a good practice. The number of runs is way higher than the number of jobs I can have currently on the queue.

Perhaps I should submit multiple "runs" in one job even if they have different hyperparameters? Should I do it then as multiple tasks in that job?

1 Answer

60 seconds per run is very short-lived for a job; you should probably "pack" benchmark runs together in a single submission, for instance one job per algorithm, with a submission script like this (4 CPUs used for each benchmark):

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=...
#SBATCH ...

module load ...

algorithm="thealgorithm"
hyperparametersvalues=(0.1 1 10 1000)
files=(data/*)

for hyper in "${hyperparametersvalues[@]}"
do
  for file in "${files[@]}"; do
      ./benchmark_script $algorithm --hyperparameter=$hyper $file
  done
done

If you have access to GNU parallel, you can rewrite it like this, which makes it easy to run benchmarks for the same algorithm in parallel (on a single node):

#!/bin/bash

#SBATCH --ntasks=10
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1-1
#SBATCH --mem-per-cpu=...
#SBATCH ...

module load ...

algorithm="thealgorithm"
hyperparametersvalues=(0.1 1 10 1000)
files=(data/*)

parallel -P $SLURM_NTASKS ./benchmark_script $algorithm --hyperparameter={1} {2} ::: "${hyperparametersvalues[@]}" ::: "${files[@]}"

If you do not have parallel, you can achieve the same with & and wait in the loop.
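
Here is a rough sketch of that approach, reusing the same arrays and benchmark_script as above; the throttling with jobs -rp is my own addition to keep at most $SLURM_NTASKS runs in flight at once:

#!/bin/bash

#SBATCH --ntasks=10
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1-1
#SBATCH --mem-per-cpu=...
#SBATCH ...

module load ...

algorithm="thealgorithm"
hyperparametersvalues=(0.1 1 10 1000)
files=(data/*)

for hyper in "${hyperparametersvalues[@]}"; do
  for file in "${files[@]}"; do
      # launch each benchmark in the background
      ./benchmark_script $algorithm --hyperparameter=$hyper $file &
      # do not start more than $SLURM_NTASKS benchmarks at once
      while [ "$(jobs -rp | wc -l)" -ge "$SLURM_NTASKS" ]; do
          sleep 1
      done
  done
done

# block until the remaining background runs have finished
wait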

You can also use multiple nodes and drop the --nodes=1-1 constraint by inserting srun --exact ... in the command passed to parallel.
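
For instance, a multi-node variant could look like the sketch below; the exact srun flags (-N1 -n1 -c "$SLURM_CPUS_PER_TASK") are an assumption and may need adjusting to your cluster:

#!/bin/bash

#SBATCH --ntasks=10
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=...
#SBATCH ...

module load ...

algorithm="thealgorithm"
hyperparametersvalues=(0.1 1 10 1000)
files=(data/*)

# each parallel slot starts one job step; --exact restricts the step to
# the resources it requests so the steps can run concurrently
parallel -P $SLURM_NTASKS \
    srun --exact -N1 -n1 -c "$SLURM_CPUS_PER_TASK" \
    ./benchmark_script $algorithm --hyperparameter={1} {2} \
    ::: "${hyperparametersvalues[@]}" ::: "${files[@]}"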

You can also create a job array with the algorithm as parameter:

#!/bin/bash

#SBATCH --ntasks=10
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1-1
#SBATCH --mem-per-cpu=...
#SBATCH ...
#SBATCH --array=0-2

module load ...

algorithms=(thealgorithm thesecondalgorithm thethirdalgorithm)
algorithm=${algorithms[$SLURM_ARRAY_TASK_ID]}
hyperparametersvalues=(0.1 1 10 1000)
files=(data/*)

parallel -P $SLURM_NTASKS ./benchmark_script $algorithm --hyperparameter={1} {2} ::: "${hyperparametersvalues[@]}" ::: "${files[@]}"

2 Comments

Thank you! I think this is what I was looking for. I tried a job array before, but each array task counts as a separate job toward the "jobs in the queue" limit.
You are welcome. Feel free to accept my answer then so others know this question is answered.
