Bus error (core dumped) on trying to run a python script

blee098 · February 29, 2024, 7:52am

Hello, I am encountering a strange error on one of my jobs:
Bus error (core dumped)
The error arises on a line that runs a Python script, and it seems to happen randomly. It is a batch job: on one run, for example, jobs 4, 6, and 10 will fail with this error; on the next run, 4, 6, and 10 might succeed while other jobs fail. Any idea what might be happening?
From what I have found online, this error means the process tried to access memory that it cannot access, but I am not sure how that applies, since the error happens even if I allocate far more memory than the job should require.

ejaco020 · February 29, 2024, 3:24pm

Hello,

I believe that this could still be a problem caused by running out of memory, so, in short, try increasing the value of your job's --mem flag.

It’s hard to say for certain without any JobIDs, but looking at past jobs I was able to find a few occurrences of the following error in the node’s logs:

[XX.batch] task/cgroup: task_cgroup_memory_check_oom: StepId=XX.batch hit memory+swap limit at least once during execution. This may or may not result in some failure.

Additionally, running seff JOBID yielded the following:

Memory Utilized: 42.70 GB
Memory Efficiency: 85.40% of 50.00 GB
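
For reference, the efficiency figure is just the recorded peak divided by the requested memory. A minimal Python sketch of that arithmetic, using only the two numbers from the seff output above (the variable names are my own):

```python
# Figures copied from the seff output above.
memory_utilized_gb = 42.70
memory_requested_gb = 50.00

# Efficiency is the recorded usage as a share of the request.
efficiency_pct = memory_utilized_gb / memory_requested_gb * 100

# Headroom is what remains between the recorded usage and the limit.
headroom_gb = memory_requested_gb - memory_utilized_gb

print(f"Memory efficiency: {efficiency_pct:.2f}%")
print(f"Headroom below the limit: {headroom_gb:.2f} GB")
```

So the recorded usage leaves roughly 7.3 GB of headroom, which a short allocation spike could plausibly exceed.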

Even though the reported usage is under the 50 GB limit, brief memory spikes can sometimes go unnoticed by Slurm, but the OS will still catch them and kill your job.
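
If you want to see the peak from inside the script itself, one option is to log the process's peak resident set size. This is only a rough sketch, not something taken from your job: peak_rss_gb and main are hypothetical names, the workload is a placeholder, and it assumes Linux, where ru_maxrss is reported in kilobytes. It only covers the Python process itself (not child processes), and it may not match exactly what the cgroup limit counts.

```python
import resource
import sys


def peak_rss_gb() -> float:
    """Return this process's peak resident set size in GB.

    Assumes Linux, where ru_maxrss is reported in kilobytes.
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / (1024 ** 2)


def main() -> None:
    # Placeholder workload standing in for the real script.
    data = [0] * 1_000_000

    # Log the peak to stderr so it ends up in the job's error file.
    print(f"peak RSS so far: {peak_rss_gb():.2f} GB", file=sys.stderr)


if __name__ == "__main__":
    main()
```

Since a killed job never reaches a final print, it is probably more useful to call peak_rss_gb() after each memory-heavy step and compare the last value logged before the crash against the --mem request.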