Open
Labels: bug (Something isn't working)
Description
Describe the bug
On Slurm compute nodes (at least on Compute Canada clusters), os.cpu_count() returns the total CPU count of the node rather than the number of CPUs allocated to the job. As a result, functions like speechbrain.utils.parallel.parallel_map spawn over 100 processes, causing CPU contention. Not sure whether this is an "us" problem or not, but it is slowing down the recipes, particularly the CSV preparation.
Example:
[gfdb@rorqual4 speechbrain]$ salloc --time=0:30:0 --ntasks=1 --cpus-per-task=4 --mem=2G --account=def-ravanelm --nodes=1
salloc: NOTE: Your memory request of 2048.0M was likely submitted as 2.0G. Please note that Slurm interprets memory requests denominated in G as multiples of 1024M, not 1000M.
salloc: Pending job allocation 1924377
salloc: job 1924377 queued and waiting for resources
salloc: job 1924377 has been allocated resources
salloc: Granted job allocation 1924377
salloc: Nodes rc12503 are ready for job
[gfdb@rc12503 speechbrain]$ source ~/envs/sb-env/bin/activate
(sb-env) [gfdb@rc12503 speechbrain]$ python
Python 3.11.5 (main, Sep 19 2023, 16:07:22) [GCC 12.3.1 20230526] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import multiprocessing
>>> os.cpu_count()
192
>>> multiprocessing.cpu_count()
192
>>>
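For what it's worth, the CPU count Slurm actually grants to the job can be queried without os.cpu_count(). Below is a minimal sketch (the helper name allocated_cpu_count is just illustrative, not an existing SpeechBrain function); it assumes SLURM_CPUS_PER_TASK is exported when --cpus-per-task is used, and that the cluster pins tasks to their allocated cores so the affinity mask reflects the allocation:
import os


def allocated_cpu_count() -> int:
    """Best-effort count of the CPUs this Slurm job may actually use."""
    # Slurm exports this when --cpus-per-task is given, as in the salloc above.
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus:
        return int(slurm_cpus)

    # On Linux, the affinity mask reflects cgroup/cpuset pinning, so it stays
    # at the allocated cores even when os.cpu_count() reports the whole node.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))

    # Fallback: the whole node, which is exactly what this issue is about.
    return os.cpu_count() or 1


if __name__ == "__main__":
    # In the allocation above this should print 4, not 192 (assuming the
    # cluster restricts the job to its allocated cores).
    print(allocated_cpu_count())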
Expected behaviour
In the above example, the expected number of CPUs is 4, but 192 is reported.
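Until the library takes the allocation into account, one possible workaround is to cap whatever worker pool the caller builds with a value like the one sketched above instead of os.cpu_count(). A rough illustration using the standard multiprocessing module (not SpeechBrain's parallel_map API; prepare_row is a placeholder):
import multiprocessing as mp

# Cap the pool at the Slurm allocation rather than the node-wide count,
# so only 4 workers are spawned in the example allocation above.
n_workers = allocated_cpu_count()  # helper sketched earlier in this issue


def prepare_row(row):
    # Placeholder for the per-item CSV preparation work.
    return row


if __name__ == "__main__":
    rows = range(1000)
    with mp.Pool(processes=n_workers) as pool:
        results = pool.map(prepare_row, rows)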
To Reproduce
No response
Environment Details
No response
Relevant Log Output
Additional Context
No response