
DPL: optimize to use ipc:// on the same node#2517

Merged
ktf merged 2 commits into AliceO2Group:dev from ktf:ipc-support
Oct 22, 2019

Conversation

@ktf
Member

@ktf ktf commented Oct 21, 2019

No description provided.

@ktf ktf requested a review from a team as a code owner October 21, 2019 09:58
@ktf ktf closed this Oct 21, 2019
@ktf ktf reopened this Oct 21, 2019
@ktf
Member Author

ktf commented Oct 21, 2019

@knopers8 @sawenzel @matthiasrichter @shahor02 this should now optimise local DPL workflows to use (managed) shared memory.

@knopers8 could you try to repeat your QC benchmark?

@ktf ktf mentioned this pull request Oct 21, 2019
@ktf ktf closed this Oct 21, 2019
@ktf ktf reopened this Oct 21, 2019
@knopers8
Collaborator

I will run them. Just to make sure: will this also work for messages created with the standard ctx.outputs().make<char>(...), or do I have to allocate them in some other way?

@ktf
Member Author

ktf commented Oct 21, 2019

It should be completely transparent.

@aalkin
Member

aalkin commented Oct 21, 2019

@ktf I see that different workflows fail in each build, so the issue is most likely with shm in a container. What are the settings for those currently?

@ktf
Member Author

ktf commented Oct 21, 2019

Indeed. I use the defaults, which I think allow for 64 MB of shared memory.

* Optimize to use ipc:// on the same node
* Allow specifying resources on the command line. Currently works
  only for a single host.
* Foundations to be able to run the same workflow in multinode
  distributed manner.
@ktf
Member Author

ktf commented Oct 22, 2019

It actually seems to work with a multinode setup as well!! To try it:

# On host1
stage/tests/o2-test-framework-ParallelPipeline --hostname host1 --resources host1:1:4000:22000:23000,host2:4:4000:24000:25000
# On host2
stage/tests/o2-test-framework-ParallelPipeline --hostname host2 --resources host1:1:4000:22000:23000,host2:4:4000:24000:25000
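Judging from the example values above, each comma-separated `--resources` entry appears to pack five colon-separated fields per host. A minimal sketch of a parser for that layout, assuming the fields mean hostname, CPUs, memory, and a port range (the field semantics are inferred from the example, not from documented behaviour):

```python
# Hypothetical parser for the --resources string shown above.
# Field order (hostname:cpus:memory:startPort:endPort) is an assumption
# based on the example values, not the documented format.
from dataclasses import dataclass

@dataclass
class HostResource:
    hostname: str
    cpus: int
    memory: int
    start_port: int
    end_port: int

def parse_resources(spec: str) -> list:
    """Split a --resources string into per-host resource records."""
    resources = []
    for entry in spec.split(","):
        host, cpus, mem, p0, p1 = entry.split(":")
        resources.append(HostResource(host, int(cpus), int(mem), int(p0), int(p1)))
    return resources

hosts = parse_resources("host1:1:4000:22000:23000,host2:4:4000:24000:25000")
print(hosts[1].cpus)      # 4
print(hosts[0].end_port)  # 23000
```

Note that the same `--resources` string is passed on every host; only `--hostname` selects which entry the local process claims.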

Checking if using it is actually causing the tests to randomly fail.
@ktf
Member Author

ktf commented Oct 22, 2019

@aalkin ok, I think I understand the issue. There is a unique id "--session" which needs to be passed to FairMQ to allow different shared memory pools. We can simply put back uniqueWorkflowId as an argument of prepareArguments.
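The fix described above amounts to threading one unique id through to every device's `--session` option. A hedged sketch of the idea (the `prepare_arguments` name follows the comment above, but the body here is illustrative, not the actual DPL code):

```python
# Sketch: give each workflow a unique FairMQ session id so that devices of
# the same workflow share a shared-memory pool, while concurrent workflows
# get separate pools. The function body is illustrative only.
import uuid

def prepare_arguments(base_args, workflow_id=None):
    """workflow_id plays the role of DPL's uniqueWorkflowId (assumption)."""
    session = workflow_id or uuid.uuid4().hex
    return base_args + ["--session", session]

args = prepare_arguments(["o2-device", "--transport", "shmem"],
                         workflow_id="wf-1234")
print(args)  # ['o2-device', '--transport', 'shmem', '--session', 'wf-1234']
```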

@ktf
Member Author

ktf commented Oct 22, 2019

The gpu error seems unrelated. I will merge this to have the improved resource scheduling and address the actual usage of shmem as transport in a separate PR.

@davidrohr
Collaborator

ok, but could we check what is going on with the GPU CI? It is just failing constantly at random recipes; it doesn't make much sense this way.

@ktf ktf merged commit f4dd053 into AliceO2Group:dev Oct 22, 2019
@ktf ktf deleted the ipc-support branch October 22, 2019 21:53
@ktf
Member Author

ktf commented Oct 22, 2019

I am looking into it.

@knopers8
Collaborator

| payload size | nb prod. | Messages/s (100% dispatched) | Std. Dev. | Messages/s (0% dispatched) | Std. Dev. |
|---|---|---|---|---|---|
| 256 | 1 | 16181 | 141 | 16236 | 326 |
| 256 | 2 | 32741 | 349 | 32768 | 154 |
| 256 | 4 | 60700 | 522 | 64796 | 692 |
| 256 | 8 | 47250 | 4595 | 101797 | 9760 |
| 256 | 16 | 44849 | 316 | 71685 | 3057 |
| 256 | 32 | 42150 | 416 | 64341 | 3123 |
| 2097152 | 1 | 1872 | 66 | 16242 | 104 |
| 2097152 | 2 | 1710 | 232 | 32757 | 233 |
| 2097152 | 4 | 1698 | 157 | 64534 | 280 |
| 2097152 | 8 | 1798 | 49 | 116275 | 2289 |
| 2097152 | 16 | #DIV/0! | #DIV/0! | 95642 | 1274 |
| 2097152 | 32 | #DIV/0! | #DIV/0! | 92389 | #DIV/0! |

| Payload size [B] | Messages/s (100% passed) | std dev 100% | Messages/s (0% passed) | std dev 0% | Data throughput [MB/s] |
|---|---|---|---|---|---|
| 1 | 61292.38 | 1565.82 | 114345.2 | 1353.90 | 0.06 |
| 256 | 53659.42 | 4043.48 | 111420 | 4289.47 | 13.10 |
| 1k | 52589.68 | 3847.85 | 114958.6 | 1209.56 | 51.36 |
| 4k | 49870.56 | 3210.02 | 115072.8 | 2566.04 | 194.81 |
| 16k | 41745.88 | 3582.63 | 114624.6 | 2002.28 | 652.28 |
| 64k | 29352.38 | 1347.49 | 114196 | 1623.91 | 1834.52 |
| 256k | 13368.88 | 851.94 | 115871 | 1784.82 | 3342.22 |
| 1M | 2034.426 | 194.63 | 116179.6 | 630.18 | 2034.43 |
| 4M | 892.0678 | 94.98 | 116134 | 2683.21 | 3568.27 |
| 16M | 224.3374 | 4.83 | 115177.6 | 1459.27 | 3589.40 |
| 64M | 29.82828 | 1.77 | 116270.8 | 1441.24 | 1909.01 |
| 256M | 4.736418 | 0.08 | 114726.4 | 1657.43 | 1212.52 |
| 1G | #DIV/0! | #DIV/0! | #DIV/0! | #DIV/0! | #DIV/0! |

Some of the results for larger payloads and numbers of producers are missing, because something crashes inside FairMQ/shmem/boost. It helps a bit if I reduce the buffer size from the standard 1000 to something smaller, but with high enough rates the problem appears anyway. I guess it is connected with https://alice.its.cern.ch/jira/browse/O2-879

@ktf
Member Author

ktf commented Oct 23, 2019

By default FairMQ has a 2 GB shared memory segment. I suppose that is what is crashing the 1 GB test. You should be able to change it with --shm-segment-size.
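A quick back-of-the-envelope check shows why the default segment is easy to blow through. The sketch below assumes, simplistically, that every queued message occupies the segment at full payload size (FairMQ's actual allocator has additional overhead and bookkeeping):

```python
# Naive shared-memory sizing estimate (assumption: each in-flight message
# occupies the segment at full payload size; real overhead is higher).
DEFAULT_SEGMENT = 2 * 1024**3  # FairMQ's default 2 GiB segment

def required_segment(payload_bytes: int, queue_depth: int) -> int:
    """Rough shm needed to keep queue_depth messages in flight."""
    return payload_bytes * queue_depth

# With the benchmark's default buffer of 1000 messages, 1 GiB payloads
# would need ~1000 GiB, far beyond the default segment:
print(required_segment(1024**3, 1000) > DEFAULT_SEGMENT)      # True
# 2 MiB payloads (2097152 B, as in the first table) just about fit:
print(required_segment(2 * 1024**2, 1000) > DEFAULT_SEGMENT)  # False
```

This is consistent with shrinking the buffer only partially mitigating the crashes: adding producers multiplies the in-flight volume again.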

@aalkin
Member

aalkin commented Oct 23, 2019

There seems to be a clear throughput maximum around 4M-16M payload size, which is far enough from 2GB limit. It would be interesting to look at a heatmap-like throughput plot with number of producers and payload sizes at the axes.
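The claimed maximum can be checked directly against the throughput column of the table above. A minimal sketch using the reported values (100% passed case):

```python
# Throughput [MB/s] values copied from the benchmark table above
# (100% passed case); payload labels kept as in the table.
throughput = {
    "1": 0.06, "256": 13.10, "1k": 51.36, "4k": 194.81, "16k": 652.28,
    "64k": 1834.52, "256k": 3342.22, "1M": 2034.43, "4M": 3568.27,
    "16M": 3589.40, "64M": 1909.01, "256M": 1212.52,
}

# The peak indeed sits in the 4M-16M range, with a notable dip at 1M.
peak = max(throughput, key=throughput.get)
print(peak)  # 16M
```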

@knopers8
Collaborator

knopers8 commented Oct 23, 2019

> By default FairMQ has 2GB shared memory segment. I suppose that is what is crashing the 1GB test. You should be able to change it with --shm-segment-size.

I've been using that for these tests; it helped only to some extent.

> There seems to be a clear throughput maximum around 4M-16M payload size, which is far enough from 2GB limit. It would be interesting to look at a heatmap-like throughput plot with number of producers and payload sizes at the axes.

Yes, that might be interesting to see, I will let it run later.

@knopers8
Collaborator

I have tested it on previous 'force-pushed' version (be1a810). Would the new one have something different regarding these crashes?

@aalkin
Member

aalkin commented Oct 23, 2019

I would suggest testing with 4428d07 (has shmem transport) and 6fab7d9 (shmem transport removed, so the default zeromq is used).

@ktf
Member Author

ktf commented Oct 23, 2019

@knopers8 could you try #2531 to see if numbers improve? This should remove the hash table lookup.

carlos-soncco pushed a commit to carlos-soncco/AliceO2 that referenced this pull request Oct 28, 2019
* Optimise to use ipc:// on the same node. This will allow us to get rid of the FreePortFinder.
* Allow specifying resources on the command line. For the moment devices are allocated to resources in a naive way (each device gets roughly 1/N of the total resources). Notice that there is no actual QoS happening; this just subdivides devices evenly across the available resources.
* Foundations to be able to run the same workflow in a multinode
  distributed manner. It actually works (tested with a two-node setup), however there
  are still issues, like the fact that not all workflows will quit correctly in such a setup (i.e. `QuitRequest::All` cannot be propagated to other hosts, somewhat by design).
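The naive 1/N allocation described above can be sketched as a simple round-robin assignment of devices to resources (names here are illustrative, not the DPL internals):

```python
# Hedged sketch of the naive 1/N split: hand out devices to resources
# round-robin, so each resource ends up with roughly len(devices)/N of them.
# Device and host names below are purely illustrative.
def assign(devices, resources):
    return {dev: resources[i % len(resources)]
            for i, dev in enumerate(devices)}

mapping = assign(["reader", "processor", "writer", "sink"],
                 ["host1", "host2"])
print(mapping)
# {'reader': 'host1', 'processor': 'host2', 'writer': 'host1', 'sink': 'host2'}
```

As the commit message notes, this provides no QoS: a heavy device counts the same as a light one.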