refactor the global analytics bus to use a generic async batching util #13279
base: main
Conversation
Test Results - Preflight, Unit
22 363 tests ±0   20 613 ✅ ±0   16m 15s ⏱️ +29s
Results for commit b89dd37. ± Comparison against base commit b4c510a.
This pull request removes 14 and adds 14 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
Force-pushed from a9fd893 to 3037d97
Test Results (amd64) - Integration, Bootstrap
5 files   5 suites   2h 39m 23s ⏱️
Results for commit b89dd37.
♻️ This comment has been updated with latest results.
Force-pushed from 3037d97 to 080a8b4
Thanks for the addition and refactor here!
Just some questions regarding usage of the utility, suggestions for how we could integrate the current Batcher into the AsyncBatcher, and some additional context around the initial implementation.
```python
return list(filter(lambda _: condition, items))


def iter_chunks(items: list[_E], chunk_size: int) -> Generator[list[_E], None, None]:
```
Does this not have identical behaviour to itertools.batched?
(sidenote: now that we're on Python 3.13 we can finally use this utility which makes me so happy 🤓)
cool, i did not know this existed. i added this before 3.13 :-) i think it's worth keeping until the only thing we support is 3.13+ or until we use this code exclusively for the runtime
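For reference, a small sketch of the near-equivalence. The `iter_chunks` body below is an assumed implementation based only on its signature, and `itertools.batched` (available since Python 3.12) yields tuples rather than lists:

```python
import itertools
from typing import Generator, TypeVar

_E = TypeVar("_E")


def iter_chunks(items: list[_E], chunk_size: int) -> Generator[list[_E], None, None]:
    """Yield successive chunks of ``chunk_size`` items from ``items`` (assumed implementation)."""
    for i in range(0, len(items), chunk_size):
        yield items[i : i + chunk_size]


assert list(iter_chunks([1, 2, 3, 4, 5], 2)) == [[1, 2], [3, 4], [5]]
# itertools.batched yields tuples, so the results differ only in element type
assert [list(b) for b in itertools.batched([1, 2, 3, 4, 5], 2)] == [[1, 2], [3, 4], [5]]
```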
```python
max_batch_size: int
handler: BatchHandler[T]

_buffer: list[T]
```
Could we not use the original Batcher utility to collect the batches instead of re-implementing the logic? There are quite a few edge cases that the original batcher handles that could be useful here IMO,
i.e.:
```python
batcher: Batcher[T]

def add(self, item: T):
    with self._flush_lock:
        if self._closed:
            raise ValueError("Batcher is stopped, can no longer add items")
        # This call triggers the batcher's policy, meaning we should flush.
        if self.batcher.add(item):
            self._flush_lock.notify_all()
```
my main reason was to not entangle the two implementations. the batching logic is ultimately relatively simple and this way it's clearer when reading the code what exactly happens.
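For readers following along, a minimal sketch of what that self-contained batching logic can look like, using the attribute names visible in the diff. The background flush thread and interval handling are omitted; this is an illustration, not the actual implementation:

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class AsyncBatcher(Generic[T]):
    """Collect items into a buffer and signal a flush thread when a batch is full (sketch)."""

    def __init__(self, handler: Callable[[list[T]], None], max_batch_size: int):
        self.handler = handler
        self.max_batch_size = max_batch_size
        self._buffer: list[T] = []
        self._flush_lock = threading.Condition()
        self._closed = False

    def add(self, item: T):
        with self._flush_lock:
            if self._closed:
                raise ValueError("Batcher is stopped, can no longer add items")
            self._buffer.append(item)
            if len(self._buffer) >= self.max_batch_size:
                # wake up the background flush thread (not shown), which drains the
                # buffer and invokes the handler outside the lock
                self._flush_lock.notify_all()
```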
```python
self._flush_lock = threading.Condition()
self._closed = False

def add(self, item: T):
```
suggestion: I think it'd be useful to know whether a call has triggered a batch, or perhaps some signal indicating that a batch has been sent. Else, how would I know whether the function has been called?
```python
vals = []
buffer = AsyncBatcher(lambda x: vals.append(x), max_batch_size=2, max_flush_interval=1000)

for data in stream_generator():
    if not (is_triggered := buffer.add(data)):
        continue
    # The BatchHandler should have been called
    do_stuff(vals)
```
> some signal indicating that a batch has been sent
the invocation of the handler is the signal. this is fundamentally the difference to Batcher. AsyncBatcher calls the handler for you.
in your case, i would just pass do_stuff as a handler, as it seems that's what you want. no need to collect values first.
```python
AsyncBatcher(do_stuff, max_batch_size=2, max_flush_interval=1000)
```

```python
# we can call the processor outside the lock so we can continue adding items into the next batch without
# waiting on the processor to return.
try:
    self.handler(batch)
```
If the handler fails, that means we lose the entire batch of data. Is this desirable?
i suppose an exception handler would be a nice mechanism to give more flexibility on what should happen with batches that fail. this is not so easy to generalize in a way that is flexible, and would introduce a lot of additional complexity that there's currently no use case for. the assumption therefore is currently that fault tolerance is the handler's responsibility.
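As an illustration of that assumption, a minimal sketch of a handler wrapper that implements its own fault tolerance. The wrapper name and retry/logging behavior are hypothetical, not part of the utility:

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
LOG = logging.getLogger(__name__)


def with_fault_tolerance(handler: Callable[[list[T]], None], retries: int = 2) -> Callable[[list[T]], None]:
    """Wrap a batch handler so failed batches are retried and, as a last resort, logged
    instead of being silently dropped by the batcher (hypothetical helper)."""

    def _handle(batch: list[T]) -> None:
        for attempt in range(retries + 1):
            try:
                handler(batch)
                return
            except Exception:
                LOG.warning("batch handler failed (attempt %d/%d)", attempt + 1, retries + 1, exc_info=True)
        # keep at least a record of the batch we are about to lose
        LOG.error("dropping batch of %d items after %d failed attempts", len(batch), retries + 1)

    return _handle
```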
localstack-core/localstack/services/lambda_/event_source_mapping/pollers/stream_poller.py
LGTM!
I'm going to see if I can use this batcher for some of our ESM pollers. Since it's quite a complicated use case, it'll aid some of our discussions on usage of the new utility 🙂
Motivation
While in the process of fixing an issue with the way AWS request events are batched from the pro version, I cleaned up some of the implementation by pulling out a generic batching utility and re-using it across both the GlobalAnalyticsBus and the ProServiceRequestAggregator. I also noticed that the CLI tests would break when importing the batcher because of the pydantic requirement, which I found was not really necessary if we just used a regular constructor for Batcher. Basically just cleaning up the implementation and de-duplicating some code, nothing major.
Changes
- Extracted the PublisherBuffer concept from the analytics client into an AsyncBatcher utility
- Renamed batch_policy.py to batching.py to be more consistent with our utils naming
- Gave Batcher a regular constructor and removed the pydantic dependency
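A minimal usage sketch of the new utility as described above. The import path and the time unit of max_flush_interval are assumptions based on the discussion, not confirmed by this PR:

```python
# assumed import path; the PR only states that batch_policy.py was renamed to batching.py
from localstack.utils.batching import AsyncBatcher


def publish(batch: list[dict]) -> None:
    # called by the batcher with a full batch, or with whatever accumulated when the flush interval elapses
    print(f"publishing {len(batch)} analytics events")


# flush after 10 events or after the (assumed) flush interval, whichever comes first
batcher = AsyncBatcher(publish, max_batch_size=10, max_flush_interval=5)

for i in range(25):
    batcher.add({"name": "api_call", "i": i})
```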