Problems with Low-Memory Runs

I’m relatively new to cluster computing and Slurm. I’ve previously posted that I’m working on a project that requires 100,000 runs of a memory-light piece of software. The advice I received was to split the runs across several Slurm job arrays (each of which can hold up to 2500 jobs), and I did so. Each individual run within the arrays should only take a few minutes, but the jobs are taking hours, sometimes even days, to complete. Am I doing this correctly? Here is one of my Slurm array scripts, as well as the accompanying text file (which holds the parameters that the array needs).

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=300M
#SBATCH --time=0-12:00:00
#SBATCH --array=1-2499
#SBATCH --job-name=hops_run_12
#SBATCH --partition=batch

DATE=$(date)
HOST=$(hostname)
NCORES=$(nproc)

echo " "
echo "running on host: $HOST"
echo "$NCORES cores requested"
echo "job submitted: $DATE"
echo "job STDOUT follows:"
echo " "

# Change to the directory where the job was submitted from
cd $SLURM_SUBMIT_DIR

# Activate the Conda environment
conda init bash
source /opt/linux/rocky/8.x/x86_64/pkgs/miniconda3/py39_4.12.0/etc/profile.d/conda.sh
conda activate fsc_wrapper_env

# List the Conda environments for debugging
conda info --envs

echo "activating my env"
conda info --envs | grep '^*'

# Calculate the parameters based on the array index
INDEX=$SLURM_ARRAY_TASK_ID
PARAMS=$(sed -n "${INDEX}p" params_12.txt)

# Extract individual parameters
output_dir=$(echo $PARAMS | cut -d ' ' -f 1)
project_path=$(echo $PARAMS | cut -d ' ' -f 2)
prefix=$(echo $PARAMS | cut -d ' ' -f 3)
i=$(echo $PARAMS | cut -d ' ' -f 4)

fsc_wrapper_py="${project_path}/cluster_main.py"

# Run your Python script with the parameters
echo python3 $fsc_wrapper_py $output_dir $project_path $prefix $i
python3 $fsc_wrapper_py $output_dir $project_path $prefix $i

And here is params_12.txt:

/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 90
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 91
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 92
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 93
/bigdata/armstronglab/respl001/output/fsc_output/hops_random_model_275 /rhome/respl001/Projects/fscWrapper hops 94
...

I would love to get a second opinion on this and see whether I’ve set it up correctly.

Additionally, I believe my problem could perhaps be solved by something like TORQUE. I just need something that accomplishes this: if I have 1000 jobs, one right after the other in a single shell script, I want to request X processors and X nodes and have the jobs essentially schedule themselves.

Please let me know if there is anything I can do differently. Thanks.

Looking over the sbatch script, nothing looks obviously misconfigured.

One thing that comes to mind is filesystem I/O across all of the runs. It looks like you’re running ~339 jobs in parallel, and if they are all reading and writing files at the same time, that could be a performance bottleneck (even if the jobs are spread across different nodes).
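If you want to double-check how many of your array tasks are actually running at once, and across how many nodes, something like this should do it (just a sketch using standard squeue options):

# Count your currently running array tasks
squeue -u "$USER" -t RUNNING -h | wc -l

# Count the distinct nodes those tasks are spread across
squeue -u "$USER" -t RUNNING -h -o "%N" | sort -u | wc -l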

While looking into the usage for fsc28 [1], I also noticed that the -B/--numBatches parameter works best when set to the same value as the number of cores (the -c flag). When -c isn’t set it defaults to 1, which is fine here since you allocate only one core, but the default for -B is 12. Something to try would be setting -B 1 so the batch count matches the core count.
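I don’t know how cluster_main.py builds the fsc28 command line, so as a first step here is a quick diagnostic (a sketch using the project path from your post) to see whether it sets those flags at all; the goal would be ending up with -c 1 -B 1 on the fsc28 call:

# Check whether the wrapper already passes core/batch options to fsc28
grep -nE 'numBatches|cores' /rhome/respl001/Projects/fscWrapper/cluster_main.py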

Out of curiosity, it might be worth manually re-running a one-off job that took a long time and seeing how it performs. As a somewhat random example, I see that job 4436748_1480 took 1d14h45m to complete. If possible, try running that task by hand (possibly in an srun session) and timing it. I’m not too sure how fsc28 works internally, but could it be that some inputs simply cause the program to take much longer to run?
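For example, something along these lines (a sketch; params_12.txt is just the example file from your post, and the srun limits should match whatever that array actually requested):

# Grab an interactive shell with the same resources a single array task gets
srun --partition=batch --ntasks=1 --cpus-per-task=1 --mem=300M --time=0-12:00:00 --pty bash

# (activate the conda environment first, as in the batch script)
# Then replay the slow task by hand and time it; 1480 is the array index from the example job
INDEX=1480
PARAMS=$(sed -n "${INDEX}p" params_12.txt)
project_path=$(echo $PARAMS | cut -d ' ' -f 2)
time python3 "${project_path}/cluster_main.py" $PARAMS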

[1] https://cmpg.unibe.ch/software/fastsimcoal2/man/fastsimcoal28.pdf

If each individual run is light on memory and runtime, why not use GNU parallel?

I would consider allocating a job with a large number of processors and driving the runs with parallel, so that each array task handles, say, 1000 runs (depending on what the runtime is; I like to size large-scale job sets so they fit in roughly 2-hour runs). You can often get something like 128 cores on a single node and then have 128 runs going at once.

I can help write a longer example, but the basic shape is something like this:
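A minimal sketch (the CPU count, memory, time limit, and partition are assumptions to adjust for your cluster; it reuses the conda setup and the params_12.txt layout from your script):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=64G
#SBATCH --time=0-02:00:00
#SBATCH --job-name=hops_parallel_12
#SBATCH --partition=batch

cd "$SLURM_SUBMIT_DIR"
source /opt/linux/rocky/8.x/x86_64/pkgs/miniconda3/py39_4.12.0/etc/profile.d/conda.sh
conda activate fsc_wrapper_env

# Each line of params_12.txt is "output_dir project_path prefix i",
# i.e. exactly the four arguments cluster_main.py expects.
# GNU parallel reads the file, splits each line on spaces, and keeps
# one run per allocated CPU going until every line has been processed.
parallel --colsep ' ' -j "$SLURM_CPUS_PER_TASK" \
    python3 {2}/cluster_main.py {1} {2} {3} {4} :::: params_12.txt

That replaces 2499 separate scheduler jobs with a single job that Slurm only has to place once; each run then costs a process fork instead of a scheduling cycle.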

Or you could write a function that gets called by parallel.

That function would be the natural place to do the lookup of the parameter line by index, as in the sketch below.
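Something like this (run_one is a hypothetical name; it assumes the same params_12.txt layout and would replace the parallel line in the sketch above):

# Hypothetical helper: look up one line of the params file and run it
run_one() {
    local index=$1
    local params output_dir project_path prefix i
    params=$(sed -n "${index}p" params_12.txt)
    read -r output_dir project_path prefix i <<< "$params"
    python3 "${project_path}/cluster_main.py" "$output_dir" "$project_path" "$prefix" "$i"
}
export -f run_one

# Feed the indices in; parallel keeps one run per allocated CPU in flight
seq 1 2499 | parallel -j "$SLURM_CPUS_PER_TASK" run_one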

LMK if you need a more concrete example.