Async calls to CLI tool in separate folders

Question

I would like to call a command-line utility from R and parallelise it by running several copies of it in different folders. The calls can have quite different runtimes, so I'd like to do this asynchronously, with the next call starting as soon as one of the previous ones has finished.

Async frameworks like the mirai package seem ideal for this, but I do not know how to "fix" the workers to the individual, predefined folders where each copy of the CLI tool waits for its next call.

For example, if tool1 in folder1 is still running and tool2 in folder2 has just finished, the dispatcher should assign the next call to tool2 in folder2. Maybe it finishes again very quickly and tool1 is still running, so the third call should also go to tool2 in folder2 and so on.

Would anyone have an idea that could get me on track to develop a solution for this?

Ideally, I could just use the purrr::map functions, maybe with the new mirai parallelization in the purrr development version. mirai::mirai_map might also work, but again, how do I dynamically distribute the tasks to the correct tool in the correct folder?

Sounds like you are trying to reinvent a resource manager and launcher like slurm.
– Ada Lovelace, Commented Jun 22 at 19:11
Is each tool# a different executable and/or option set? You mention "tool1 in folder1" and "tool2 in folder2": are tool1 and tool2 different in any way other than the working directory? Is there an order to the tasks that need to be run, or is it a random-access set of things that need to be done?
– r2evans, Commented Jun 22 at 19:39
@Ada Lovelace: Maybe yes, but I do not know such tools or what exactly they do. Are they available for Windows? My CLI tool is Windows-only.
– user3460194, Commented Jun 23 at 5:23
@r2evans: Each tool# is a copy of the same executable stored in a different folder; the folders could be subfolders of the working directory. There is no order to the tasks. As soon as tool2 has finished its computation for a given task, it can take whatever task is waiting next in the queue.
– user3460194, Commented Jun 23 at 5:25
Why copy executables to different directories? The only thing I can think of that might benefit from this is DLL loading, where you are instructing the OS to load out of the working directory (which some consider a security risk). I agree that it seems you are in essence reinventing a resource manager/launcher. Is the subdirectory specific to a particular task, or are you using the subdirs to partition the tool copies?
– r2evans, Commented Jun 23 at 11:59

user3460194 · Accepted Answer · 2025-06-28 11:52:23Z

I think I found a simple way that works for me.

I set up persistent daemons with mirai, ask each one for its PID once, and assign each PID to one of my folders. Each task can then look up its own PID to find the folder from which to call the tool.

Something like this:

library(mirai)
#> Warning: package 'mirai' was built under R version 4.4.3

#set 2 workers
daemons(2)
#> [1] 2

#fn to get pids of workers
myfunc <- function(folder){
  
  pid <- Sys.getpid()
  
  res <- setNames(folder, pid)
  
  return(res)
  
}

#folders to call the tool from
folders <- c("dir1", "dir2")

#dictionary to assign worker pid to specific folder
dict <-
  mirai_map(folders, myfunc)[.flat]

dict
#>   7760  13868 
#> "dir1" "dir2"

#fn to perform task with variable computing times
mytask <- function(id, arg){
  
  pid <- as.character(Sys.getpid())
  folder <- dict[pid]
  
  Sys.sleep(arg)
  
  print(paste0("task", id, " done in folder ", folder))
  
}

#task IDs and corresponding arguments for mytask fn
#task1 will run 5 sec, others 1 sec only
tasks <-
  data.frame(id = 1:5, arg = c(5, rep(1, 4)))

#run all on 2 workers
#5 sec total run time, i.e. load balancing works
#folders correct as well
system.time(print(mirai_map(tasks, mytask, dict = dict)[]))
#> [[1]]
#> [1] "task1 done in folder dir1"
#> 
#> [[2]]
#> [1] "task2 done in folder dir2"
#> 
#> [[3]]
#> [1] "task3 done in folder dir2"
#> 
#> [[4]]
#> [1] "task4 done in folder dir2"
#> 
#> [[5]]
#> [1] "task5 done in folder dir2"
#>    user  system elapsed 
#>     0.0     0.0     5.1
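
Nothing in the snippet above guarantees that the two calls to myfunc land on two different daemons. If one daemon answered both, dict would carry a duplicated PID, and tasks running on the other daemon would look up NA as their folder. A minimal sketch of a sanity check, assuming dict is a character vector of folders named by PID as above; the helper name check_dict is hypothetical and not part of mirai:

```r
# Hypothetical helper: verify a PID -> folder dictionary before using it.
check_dict <- function(dict, folders) {
  pids <- names(dict)
  stopifnot(
    length(dict) == length(folders),  # one entry per folder
    !anyDuplicated(pids),             # no daemon answered twice
    setequal(unname(dict), folders)   # every folder was assigned
  )
  invisible(dict)
}

# Example with the PIDs shown in the output above.
check_dict(c("7760" = "dir1", "13868" = "dir2"), c("dir1", "dir2"))
```

Calling check_dict(dict, folders) right after building dict stops with an error early, instead of letting tasks run silently with an NA folder.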

Created on 2025-06-28 with reprex v2.1.1
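
In the reprex, Sys.sleep() only stands in for the real tool. Below is a minimal sketch of how the actual call might look, assuming the tool can be started with system2() and only needs its folder as the working directory. The helper name run_tool_in is hypothetical, and the demo runs Rscript purely so that the example works without the real executable.

```r
# Hypothetical helper: run a command with `folder` as the working directory.
run_tool_in <- function(folder, command, args = character()) {
  old_wd <- setwd(folder)             # setwd() returns the previous directory
  on.exit(setwd(old_wd), add = TRUE)  # restore it even if the call fails
  system2(command, args = args, stdout = TRUE, stderr = TRUE)
}

# Demo with Rscript standing in for the real CLI tool:
# the child process reports the name of its working directory.
demo_dir <- tempfile("dir")
dir.create(demo_dir)
rscript <- file.path(R.home("bin"), "Rscript")
out <- run_tool_in(demo_dir, rscript,
                   c("-e", shQuote("cat(basename(getwd()))")))
```

Inside mytask, Sys.sleep(arg) could then be replaced by a call such as run_tool_in(folder, ...) with the real executable and its arguments. The helper would need to be passed to the daemons the same way as dict, e.g. mirai_map(tasks, mytask, dict = dict, run_tool_in = run_tool_in).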
