changeset 6565:2c2dbfc332ba

Try to handle multiple connections better. The session database is a hot spot. When multiple requests (e.g. 20) come in at the same time, session database contention can become severe. The original code didn't retry session database access when the open failed. This resulted in errors at the client. The second pass delayed 0.01 seconds and retried. It was better, but we still had multi-second stalls. I think the first request got in, everybody else backed up and then retried at the same time, so they stepped on each other again. With logging I would see many counters go all the way down to low single digits, or to -1 indicating failure. This pass uses random.randint to generate delays from 0 to 0.125 seconds in 5ms increments. This performs better in testing. I rarely saw a counter less than 13 (2 failed retries). Current logging starts after 6 failures and counts down until success or failure.
author John Rouillard <rouilj@ieee.org>
date Thu, 16 Dec 2021 20:02:00 -0500
parents 21c7c2041a4b
children 8f1fddb71422
files CHANGES.txt roundup/backends/sessions_dbm.py
diffstat 2 files changed, 11 insertions(+), 4 deletions(-)
--- a/CHANGES.txt	Wed Dec 15 23:52:25 2021 -0500
+++ b/CHANGES.txt	Thu Dec 16 20:02:00 2021 -0500
@@ -59,6 +59,10 @@
 - handle configparser.InterpolationSyntaxError raised if value
   has a single %. Seems to afect python 3 only. Reported by
   nomicon on IRC. (John Rouillard)
+- add random delay to session database retry code between 0 and .125
+  seconds. This seems to help reduce stalled connections when a
+  number of connections are made at the same time. Log remaining
+  retries once 5 of them have been used. (John Rouillard)
 
 Features:
 
--- a/roundup/backends/sessions_dbm.py	Wed Dec 15 23:52:25 2021 -0500
+++ b/roundup/backends/sessions_dbm.py	Thu Dec 16 20:02:00 2021 -0500
@@ -6,7 +6,7 @@
 """
 __docformat__ = 'restructuredtext'
 
-import os, marshal, time
+import os, marshal, time, logging, random
 
 from roundup.anypy.html import html_escape as escape
 
@@ -132,21 +132,24 @@
         dbm = __import__(db_type)
 
         retries_left = 15
+        logger = logging.getLogger('roundup.hyperdb.backend.sessions')
         while True:
             try:
                 handle = dbm.open(path, mode)
                 break
-            except OSError:
+            except OSError as e:
                 # Primarily we want to catch and retry:
                 #   [Errno 11] Resource temporarily unavailable retry
                 # FIXME: make this more specific
+                if retries_left < 10:
+                    logger.warning('dbm.open failed, retrying %s left: %s'%(retries_left,e))
                 if retries_left < 0:
                     # We have used up the retries. Reraise the exception
                     # that got us here.
                     raise
                 else:
-                    # delay retry a bit
-                    time.sleep(0.01)
+                    # stagger retry to try to get around thundering herd issue.
+                    time.sleep(random.randint(0,25)*.005)
                     retries_left = retries_left - 1
                     continue  # the while loop
         return handle
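
The retry logic in the hunk above can be sketched in isolation as follows. This is a minimal illustration, not the code in sessions_dbm.py: the names `jittered_delay`, `open_with_retry`, and the `opener` callable are hypothetical and were introduced only so the loop can be exercised without a real dbm file. The retry budget (15) and the delay range (0 to 0.125 seconds in 5ms steps) mirror the values in the changeset.

```python
import random
import time


def jittered_delay(max_steps=25, step=0.005, rng=random):
    """Return a random delay between 0 and max_steps*step seconds.

    With the defaults this is 0 to 0.125 seconds in 5ms increments,
    matching random.randint(0, 25) * .005 from the changeset.
    """
    return rng.randint(0, max_steps) * step


def open_with_retry(opener, retries=15, sleep=time.sleep, rng=random):
    """Call opener() until it succeeds or the retry budget runs out.

    `opener` is a hypothetical zero-argument callable standing in for
    dbm.open(path, mode). `sleep` is injectable so the loop can be
    tested without actually waiting.
    """
    retries_left = retries
    while True:
        try:
            return opener()
        except OSError:
            if retries_left < 0:
                # Budget exhausted: reraise the error that got us here.
                raise
            # Stagger the retry so that callers which failed together
            # do not all come back at the same moment.
            sleep(jittered_delay(rng=rng))
            retries_left -= 1
```

With a budget of 15 and the `< 0` check, the loop sleeps up to 16 times before the exception is reraised, so a caller waits at most 16 x 0.125 = 2.0 seconds in total.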

Roundup Issue Tracker: http://roundup-tracker.org/