Add celerity blockchain for task divergence checking#217
Add celerity blockchain for task divergence checking#217
Conversation
|
Check-perf-impact results: (5a19ced85f862a00d0114dd241122462) ❓ No new benchmark data submitted. ❓ |
ae68e51 to
af692d0
Compare
af692d0 to
2dd8a09
Compare
2dd8a09 to
6c3f128
Compare
ef86cd1 to
9ae8356
Compare
PeterTh
left a comment
There was a problem hiding this comment.
Regarding the overall design, I mostly like it, but one thing I noticed is that there is a lot of code that's purely for testing which is in the actual "main logic" part. So far, I believe we mostly try to keep testing-only code in testing helpers, which makes it easier to see at a glance what is actually happening in a production run vs. what is done only in testing. We might want to move the testing-related functionality out of the main code.
d5d2e90 to
b88b60f
Compare
bb94e68 to
4af1341
Compare
fknorr
left a comment
There was a problem hiding this comment.
Thanks! This is becoming a valuable on-boarding debug feature.
I have not looked closely at the algorithm so far, but some more documentaiton / clarification on what the workflow around the state and the member functions of abstract_block_chain is would help a lot.
4af1341 to
252ad19
Compare
|
Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72) ❓ No new benchmark data submitted. ❓ |
252ad19 to
8a27177
Compare
psalz
left a comment
There was a problem hiding this comment.
Thanks! I did a first pass and added some comments, mostly about comprehensibility / naming.
388c399 to
b06ad9e
Compare
|
Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17) ❓ No new benchmark data submitted. ❓ |
fc46cfc to
ebf4f9a
Compare
psalz
left a comment
There was a problem hiding this comment.
Things are coming together! GitHub decided to post some of my comments early for some reason... Well, here's the rest :P
By the way, we have a section on diverging execution in ours docs / website, it would be great if you could add a sentence there mentioning this check and how it can be enabled!
| std::transform(div_test.begin(), div_test.end(), std::back_inserter(sizes), [](auto& div) { return div->m_task_records.size(); }); | ||
| auto [min, max] = std::minmax_element(sizes.begin(), sizes.end()); | ||
|
|
||
| std::vector<per_node_task_hashes> extended_lifetime_hashes; |
There was a problem hiding this comment.
I don't understand what this is for..?
There was a problem hiding this comment.
The extended_lifetime_hashes are needed, because of how the single node divergence_test_communicator works. The idea is following:
- each simulated node calls allgather sequentially. The send/receive buffer is saved internally as simple pointers.
- When all simulated nodes called the allgather the data is copied.
To prevent those send/receive buffers for per_node_task_hashes from going out of scope we need to "extend" their lifetime. For this, I included the extended_lifetime_hashes vector.
There was a problem hiding this comment.
Okay so I took a look at this again and this does not work: divergence_block_chain::collect_hashes returns a per_node_task_hashes by value, which itself contains a std::vector<task_hash>. So extended_lifetime_hashes[0].data() is no longer the same pointer as stored in divergence_test_communicator::m_gather_data, and you're effectively reading already free'd memory. If you run the tests with AddressSanitizier (and without mimalloc), you will get an error for it.
I think the only way the collective operations can be properly mocked (that is, to have them block until all "ranks" have called them), is to put each block chain into a separate thread. The upside is that you won't need any of the pre_check / post_check hackery. I would suggest to abstract all of that into a divergence_check_test_context (which would subsume the divergence_test_communicator_provider) that also wraps a number of divergence_block_chains, each in their own thread, and possibly even the task_test_context for each.
ebf4f9a to
3fe7ae2
Compare
3fe7ae2 to
3adceaa
Compare
3adceaa to
6eb38e4
Compare
6eb38e4 to
1a6e3ee
Compare
| std::unique_ptr<communicator> m_communicator; | ||
|
|
||
| void divergence_out(const divergence_map& check_map, const int task_num); | ||
| void reprot_divergence(const divergence_map& check_map, const int task_num); |
There was a problem hiding this comment.
| void reprot_divergence(const divergence_map& check_map, const int task_num); | |
| void report_divergence(const divergence_map& check_map, const int task_num); |
| MPI_Comm_dup(MPI_COMM_WORLD, &comm); | ||
| m_divergence_check = std::make_unique<divergence_checker>(*m_task_recorder, std::make_unique<mpi_communicator>(comm), m_test_mode); | ||
| #endif | ||
| // if (m_cfg->is_recording()) { |
| // // Sychronize all nodes before reseting shuch that we don't get into a deadlock | ||
| // // With this barrier we can be shure that every node is fully finished and all mpi operations are done. (Sending ...) | ||
| // MPI_Barrier(MPI_COMM_WORLD); | ||
| // m_divergence_check.reset(); |
PeterTh
left a comment
There was a problem hiding this comment.
LGTM, other than a few typos and commented out code.
| } | ||
|
|
||
| #if CELERITY_DIVERGENCE_CHECK | ||
| // Sychronize all nodes before reseting shuch that we don't get into a deadlock |
There was a problem hiding this comment.
A few typos "Sychronize" "reseting" "shuch"
| CELERITY_FEATURE_UNNAMED_KERNELS=$<BOOL:${CELERITY_FEATURE_UNNAMED_KERNELS}> | ||
| CELERITY_DETAIL_HAS_NAMED_THREADS=$<BOOL:${CELERITY_DETAIL_HAS_NAMED_THREADS}> | ||
| CELERITY_ACCESSOR_BOUNDARY_CHECK=$<BOOL:${CELERITY_ACCESSOR_BOUNDARY_CHECK}> | ||
| CELERITY_DIVERGENCE_CHECK=$<BOOL:${CELERITY_DIVERGENCE_CHECK}> |
There was a problem hiding this comment.
Also needs to be added to cmake/celerity-config.cmake.in!
| #if CELERITY_DIVERGENCE_CHECK | ||
| // divergence checker needs recording | ||
| m_recording = true; | ||
| #else | ||
| m_recording = parsed_and_validated_envs.get_or(env_recording, false); | ||
| #endif |
There was a problem hiding this comment.
Im not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.
There was a problem hiding this comment.
Agreed, maybe we should just get rid of it in a small follow-up PR.
fknorr
left a comment
There was a problem hiding this comment.
A few more comments!
Also since the clang-tidy check for this seems broken in CI: Please go over all new function definitions and make sure parameters that can be const are const.
|
|
||
| private: | ||
| std::thread m_thread; | ||
| bool m_is_running = false; |
There was a problem hiding this comment.
m_is_running must be protected by a mutex!
| divergence_block_chain& operator=(const divergence_block_chain&) = delete; | ||
| divergence_block_chain& operator=(divergence_block_chain&&) = delete; | ||
|
|
||
| bool check_for_divergence(); |
There was a problem hiding this comment.
Needs a comment on what a true / false return value means. It reads like this would return true when there was divergence, but the function actually throws in that case!
| m_thread = std::thread(&divergence_checker::run, this); | ||
| m_is_running = true; |
There was a problem hiding this comment.
It feels like there is a race between setting m_is_running = true here and the check for m_is_running in run(). I suggest you reverse the order to fix this.
| m_thread = std::thread(&divergence_checker::run, this); | |
| m_is_running = true; | |
| m_is_running = true; | |
| m_thread = std::thread(&divergence_checker::run, this); |
| #if CELERITY_DIVERGENCE_CHECK | ||
| // divergence checker needs recording | ||
| m_recording = true; | ||
| #else | ||
| m_recording = parsed_and_validated_envs.get_or(env_recording, false); | ||
| #endif |
There was a problem hiding this comment.
Im not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.
| if(min_hash_count == 0) { | ||
| if(max_hash_count != 0 && m_local_nid == 0) { | ||
| check_for_deadlock(); | ||
| } else if(max_hash_count == 0) { | ||
| return true; | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Comment: What is happening here?
| m_hashes_added = m_task_records.size(); | ||
| } | ||
|
|
||
| void divergence_block_chain::clear(const int min_progress) { |
There was a problem hiding this comment.
Naming: This does not clear the chain (similar to what vector::clear would do. Maybe something along the lines of erase_front, prune_leading, or similar?
|
Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:
|
This pull request adds a divergence checking mechanism for tasks.
It does so by periodically gathering hashes of all tasks from task_recording and comparing them. When a divergence is detected an error containing the diverged tasks and their full task record is printed like:
Additionally it also includes a rudimentary deadlock detection for nodes which are stuck by printing a warning after a given amount of time (eg 10 seconds):
All of this is automatically turned on by running the program with task recording enabled.