diff roundup/backends/sessions_dbm.py @ 6565:2c2dbfc332ba

Try to handle multiple connections better. The session database is a hot spot: when many requests (e.g. 20) arrive at the same time, contention for the session database can become severe. The original code didn't retry session database access when the open failed, which resulted in errors at the client. The second pass delayed 0.01 seconds and retried. That was better, but we still had multi-second stalls. I think the first request got in, everybody else backed up and then retried at the same time, so they stepped on each other again. With logging I would see many counters go all the way down to low single digits, or to -1 indicating failure. This pass uses random.randint to generate delays from 0 to 0.125 seconds in 5 ms increments. This performs better in testing: I rarely saw a counter less than 13 (2 failed retries). Logging currently starts after 6 failures and counts down until success or failure.
author John Rouillard <rouilj@ieee.org>
date Thu, 16 Dec 2021 20:02:00 -0500
parents bef1e42be04c
children b4d0b48b3096
--- a/roundup/backends/sessions_dbm.py	Wed Dec 15 23:52:25 2021 -0500
+++ b/roundup/backends/sessions_dbm.py	Thu Dec 16 20:02:00 2021 -0500
@@ -6,7 +6,7 @@
 """
 __docformat__ = 'restructuredtext'
 
-import os, marshal, time
+import os, marshal, time, logging, random
 
 from roundup.anypy.html import html_escape as escape
 
@@ -132,21 +132,24 @@
         dbm = __import__(db_type)
 
         retries_left = 15
+        logger = logging.getLogger('roundup.hyperdb.backend.sessions')
         while True:
             try:
                 handle = dbm.open(path, mode)
                 break
-            except OSError:
+            except OSError as e:
                 # Primarily we want to catch and retry:
                 #   [Errno 11] Resource temporarily unavailable retry
                 # FIXME: make this more specific
+                if retries_left < 10:
+                    logger.warning('dbm.open failed, retrying %s left: %s'%(retries_left,e))
                 if retries_left < 0:
                     # We have used up the retries. Reraise the exception
                     # that got us here.
                     raise
                 else:
-                    # delay retry a bit
-                    time.sleep(0.01)
+                    # stagger retry to try to get around thundering herd issue.
+                    time.sleep(random.randint(0,25)*.005)
                     retries_left = retries_left - 1
                     continue  # the while loop
         return handle
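
The staggered retry in this change can be sketched in isolation as follows. This is a minimal illustration, not Roundup code: the helper name `open_with_jitter` and its `open_fn`, `steps`, `step`, `sleep` and `rng` parameters are hypothetical, introduced only so the loop can be exercised without a real dbm file. The defaults mirror the values in the diff (15 retries, delays of 0 to 25 steps of 5 ms).

```python
import random
import time


def open_with_jitter(open_fn, retries=15, steps=25, step=0.005,
                     sleep=time.sleep, rng=random):
    """Call open_fn() until it succeeds, sleeping a random delay between tries.

    Hypothetical helper for illustration only; not part of Roundup.
    """
    retries_left = retries
    while True:
        try:
            return open_fn()
        except OSError:
            if retries_left < 0:
                # Retries are used up: re-raise the error that got us here.
                # This mirrors the "< 0" check in the diff, so the call is
                # attempted up to retries + 2 times in total.
                raise
            # Sleep a random multiple of `step` seconds, between 0 and
            # steps * step (0 to 0.125 s with the defaults), so that
            # concurrent callers are unlikely to retry at the same instant.
            sleep(rng.randint(0, steps) * step)
            retries_left -= 1
```

A caller would pass something like `lambda: dbm.open(path, mode)` as `open_fn`. Injecting `sleep` makes it possible to record the chosen delays in a test instead of actually waiting.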

Roundup Issue Tracker: http://roundup-tracker.org/