I want to set up a watchdog that checks whether the io_context workers can pick up tasks within a reasonable time and are not stuck running long or blocking operations.

To achieve this, I've implemented a check that verifies the io_context queue is functioning properly by scheduling a task every 30 seconds. This task simply sets a flag to true so we can confirm that the queue is still responsive.

using PeriodicTask = BasicScheduledTask<boost::asio::steady_timer, true>;

// Note: is_context_alive_ is written here and read by the watchdog thread,
// so it is assumed to be a std::atomic<bool>.
std::shared_ptr<PeriodicTask> io_context_alive_task_ = std::make_shared<PeriodicTask>(
          io_context_,
          [this](const auto& ec) {
            if (ec) {
              print_error("Could not report io_context as alive: {}", ec.message());
              return;
            }
            print_debug("Marking io_context as alive");
            is_context_alive_ = true;
          }, 30s);

The watchdog runs in its own independent thread outside io_context and checks every 2 minutes whether the flag has been set to true.

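std::unique_ptr<std::thread> context_watchdog_ = std::make_unique<std::thread>([this] {
        // The snippet did not show where `lock` comes from; the mutex name here is assumed.
        std::unique_lock<std::mutex> lock(io_context_alive_mutex_);
        while (!io_context_.stopped()) {
                // Returns after 2 minutes, or earlier if notified once the io_context has stopped.
                io_context_alive_cv_.wait_for(lock, 2min, [this] { return io_context_.stopped(); });
                if (!is_context_alive_) {
                        print_critical("io_context is not responding");
                        std::abort();
                }
                print_debug("io_context is ok, setting back to false");
                is_context_alive_ = false;
        }
        print_debug("io_context stopped. stopping thread");
});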

I've noticed that I get some false alarms when the system wakes from sleep.

This happens because the 30-second periodic task that marks the context as alive does not run for more than 2 minutes. As a result, the watchdog assumes the io_context is unresponsive and attempts to abort the service.

I wonder whether io_context_alive_cv_, which is of type std::condition_variable, keeps ticking during sleep mode while the Boost-based timer of the keep-alive task is idle during that time. If so, could you suggest a way to resolve this?

Thanks

5 Comments
  • It's not a false alarm. When you suspend the computer it will not be able to process handlers on the io_context. Commented Nov 29 at 2:00
  • Exactly. I want to solve the case where the condition variable's timeout keeps ticking but the periodic task (handler) does not, which leads to the indication that the io_context is stuck. I want to ignore idleness during sleep. Commented Nov 29 at 7:23
  • Better: avoid any blocking handlers. That should not be very hard. And you will never get stuck. Commented Nov 29 at 12:07
  • @sehe, what if I'm not the owner of the entire code? I need to detect cases where the io_context is not completely stuck but is unresponsive for a rather long time (this can indicate, for example, that I need to add more threads running the io_context). So I want to get an indication of such a case. Any idea whether io_context contains some internal watchdog, or do you have a simpler way to check this scenario? Commented Nov 30 at 15:34
  • That's still the same design problem. I'd change the interface so that you control what's posted, make sure that "untrusted" handlers run on separate worker threads, which you then monitor the queue depth for. Of course, if you're doing this purely as a "courtesy" to users of your library/framework that should know what they're doing, it should be optional (#define ENABLE_MONITOR_EXECUTION_CONTEXT_PROGRESS 1) and your "best effort" diagnostics are fine! If they put their system into sleep during testing, they will figure it out? Commented Nov 30 at 18:33

1 Answer

Check the actual time spent in wait_for. If it is significantly more than what you asked for, assume the system slept and ignore the flag.

4 Comments

Thanks, do you mean I should do something like auto start = std::chrono::steady_clock::now(); io_context_alive_cv_.wait_for(lock, 2min, [this] { return io_context_.stopped(); }); auto end = std::chrono::steady_clock::now(); auto actual_wait = end - start; ?
Essentially, yes.
But what if the sleep period was about 2 minutes? Could I still tell whether a sleep happened in that case?
@Zohar81 presumably the PeriodicTask which sets the flag will execute immediately upon waking (since the delay has expired while sleeping). And so even if the watchdog doesn't notice the sleep, it will see the flag is set and carry on. The only way you can get a false positive is if the system sleeps for almost exactly the same duration which you use as the threshold for deciding it was sleeping, and the scheduler decides to resume the watchdog before the setter.
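
The approach discussed in this answer and its comments could be sketched roughly as follows. This is only a minimal illustration, not the asker's actual code: the helper names waited_much_longer and should_raise_alarm, and the 10-second tolerance, are assumptions made for the example. The idea is to raise the alarm only when the flag was not set and the wait did not take noticeably longer than requested.

```cpp
#include <atomic>
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical helper: true when the wait took much longer than requested,
// which suggests the machine was suspended while the watchdog was waiting.
// `tolerance` is an assumed slack value; tune it for your system.
inline bool waited_much_longer(std::chrono::nanoseconds actual,
                               std::chrono::nanoseconds requested,
                               std::chrono::nanoseconds tolerance = 10s) {
    return actual > requested + tolerance;
}

// Hypothetical verdict for one watchdog cycle. Reads and clears the flag.
// Returns true only when the flag was not set AND the wait was not overlong.
inline bool should_raise_alarm(std::atomic<bool>& alive_flag,
                               std::chrono::nanoseconds actual,
                               std::chrono::nanoseconds requested) {
    const bool was_alive = alive_flag.exchange(false);
    if (was_alive) {
        return false;  // the periodic handler ran, so the io_context is responsive
    }
    if (waited_much_longer(actual, requested)) {
        return false;  // probably slept; skip this cycle instead of aborting
    }
    return true;
}
```

In the watchdog loop, one would take a timestamp before and after wait_for and pass the difference as `actual`, with `2min` as `requested`. The clock used for those timestamps has to advance while the machine is suspended. On some platforms the monotonic clock behind std::chrono::steady_clock does not, so that is worth verifying on the target system; std::chrono::system_clock follows wall time but can jump when the clock is adjusted.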
