I’m relatively new to cluster computing and Slurm. As I mentioned in a previous post, I’m working on a project that requires 100,000 runs of a memory-light piece of software. The advice I received was to split the work into several Slurm job arrays (each holding up to 2500 jobs), which I did. Each individual run within the arrays should only take a few minutes, but they are taking hours, sometimes even days, to complete. I would like to know whether I am doing this correctly. Here is one of my Slurm array scripts, along with the accompanying text file (which holds the parameters the array reads).
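For context, I submit each array script separately with sbatch, roughly like this (the script names and the count of about 40 arrays are just how my setup happens to be organized, shown for illustration):
# One sbatch call per array script, each array covering up to ~2500 runs
for n in $(seq 1 40); do
    sbatch "hops_array_${n}.sh"
done
The array script itself is: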
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=300M
#SBATCH --time=0-12:00:00
#SBATCH --array=1-2499
#SBATCH --job-name=hops_run_12
#SBATCH --partition=batch
DATE=$(date)
HOST=$(hostname)
NCORES=$(nproc)
echo " "
echo "running on host: $HOST"
echo "$NCORES cores requested"
echo "job submitted: $DATE"
echo "job STDOUT follows:"
echo " "
# Change to the directory where the job was submitted from
cd "$SLURM_SUBMIT_DIR"
# Activate the Conda environment (sourcing conda.sh is enough; conda init only edits
# shell startup files and does not activate anything inside this script)
source /opt/linux/rocky/8.x/x86_64/pkgs/miniconda3/py39_4.12.0/etc/profile.d/conda.sh
conda activate fsc_wrapper_env
# List the Conda environments and show the active one for debugging
conda info --envs
echo "active env:"
conda info --envs | grep '\*'
# Calculate the parameters based on the array index
INDEX=$SLURM_ARRAY_TASK_ID
PARAMS=$(sed -n "${INDEX}p" params_12.txt)
# Extract individual parameters
output_dir=$(echo $PARAMS | cut -d ' ' -f 1)
project_path=$(echo $PARAMS | cut -d ' ' -f 2)
prefix=$(echo $PARAMS | cut -d ' ' -f 3)
i=$(echo $PARAMS | cut -d ' ' -f 4)
fsc_wrapper_py="${project_path}/cluster_main.py"
# Echo the exact command to the log, then run the Python script with the parameters
echo python3 "$fsc_wrapper_py" "$output_dir" "$project_path" "$prefix" "$i"
python3 "$fsc_wrapper_py" "$output_dir" "$project_path" "$prefix" "$i"
And here is params_12.txt (one line per run):
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 90
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 91
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 92
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 93
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 94
...
I would love to get a second opinion on this and find out whether I’ve set it up correctly.
Additionally, I believe my problem could perhaps be solved by using something like Torque. I just need something that accomplishes the following: if I have 1000 jobs, one right after the other in a single shell script, I want to request a given number of processors and nodes and have the jobs essentially schedule themselves across that allocation.
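To make that concrete, here is a rough sketch of the kind of thing I’m imagining, using GNU parallel inside a single Slurm allocation (assuming GNU parallel is available on the cluster; the CPU count, memory, time limit, and job name are untested placeholders, and I’m reusing the params_12.txt format from above):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=10G
#SBATCH --time=0-12:00:00
#SBATCH --job-name=hops_parallel_12
#SBATCH --partition=batch
cd "$SLURM_SUBMIT_DIR"
source /opt/linux/rocky/8.x/x86_64/pkgs/miniconda3/py39_4.12.0/etc/profile.d/conda.sh
conda activate fsc_wrapper_env
# Run one line of params_12.txt per task, keeping 32 tasks running at once;
# {1}..{4} are the whitespace-separated columns of each line (output_dir,
# project_path, prefix, i), the same layout the array script uses.
parallel --colsep ' ' -j "$SLURM_CPUS_PER_TASK" \
    python3 {2}/cluster_main.py {1} {2} {3} {4} :::: params_12.txt
Is that roughly what a self-scheduling setup would look like, or is there a more standard way to do this with Slurm alone?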
Please let me know if there is anything I can do differently. Thanks.