
cpu count wrong on slurm compute nodes #2972

@gfdb

Describe the bug

On Slurm compute nodes (at least on Compute Canada), the reported CPU count is that of the whole node (the same value you see on the login node) rather than the number of CPUs allocated to the job. This means that functions like speechbrain.utils.parallel.parallel_map spawn over 100 processes, causing CPU contention. Not sure if this is an "us" problem or not, but it is slowing down the recipes, particularly the CSV prep.

Example:

[gfdb@rorqual4 speechbrain]$ salloc --time=0:30:0 --ntasks=1 --cpus-per-task=4 --mem=2G --account=def-ravanelm --nodes=1
salloc: NOTE: Your memory request of 2048.0M was likely submitted as 2.0G. Please note that Slurm interprets memory requests denominated in G as multiples of 1024M, not 1000M.
salloc: Pending job allocation 1924377
salloc: job 1924377 queued and waiting for resources
salloc: job 1924377 has been allocated resources
salloc: Granted job allocation 1924377
salloc: Nodes rc12503 are ready for job
[gfdb@rc12503 speechbrain]$ source ~/envs/sb-env/bin/activate
(sb-env) [gfdb@rc12503 speechbrain]$ python
Python 3.11.5 (main, Sep 19 2023, 16:07:22) [GCC 12.3.1 20230526] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import multiprocessing
>>> os.cpu_count()
192
>>> multiprocessing.cpu_count()
192
>>> 
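For comparison, an allocation-aware count can be read from the Slurm environment or the process affinity mask. This is only a sketch: whether the affinity mask reflects the 4-CPU request depends on the cluster enforcing CPU binding (cgroups or task/affinity), which I believe Compute Canada clusters do.

import os

# Allocation-aware alternatives to os.cpu_count():
print(len(os.sched_getaffinity(0)))           # CPUs this process may run on; should match the allocation if binding is enforced
print(os.environ.get("SLURM_CPUS_PER_TASK"))  # "4" when the job was submitted with --cpus-per-task=4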

Expected behaviour

In the above example, the expected number of CPUs is 4, but 192 is reported.
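A possible workaround on our side (a minimal sketch; allocated_cpu_count is a hypothetical helper, not an existing SpeechBrain function, and the fallback order is an assumption) would be to prefer the Slurm allocation and the affinity mask over os.cpu_count() when deciding how many worker processes to spawn:

import os

def allocated_cpu_count() -> int:
    """Hypothetical helper: number of CPUs actually usable by this job."""
    # Slurm exports the per-task allocation when --cpus-per-task is set.
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus is not None:
        return int(slurm_cpus)
    # On Linux, the affinity mask reflects cgroup/affinity limits.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Last resort: the full node count.
    return os.cpu_count() or 1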

To Reproduce

No response

Environment Details

No response

Relevant Log Output

Additional Context

No response
