Conversation
| delta2: Cow<'_, Option<Spine<I2>>>, | ||
| delayed_trace2: Cow<'_, T2>, | ||
| ) -> Z { | ||
| ) -> impl futures::Stream<Item = (Z, bool, Option<Position>)> + 'static { |
There was a problem hiding this comment.
you seem to have used an alias of AsyncStream elsewhere.
It would be nice to have that defined somewhere globally and used uniformly
Refactor the asof join operator to split outputs over multiple steps, so it doesn't produce very large output batches. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR refactors the asof join operator to split outputs over multiple steps using a streaming approach, preventing very large output batches and supporting transactions. The key change is converting the operator from a synchronous QuaternaryOperator to a StreamingQuaternaryOperator that yields results in chunks.
Key changes:
- Converted asof join from synchronous to streaming operator that yields chunked results
- Changed internal state fields (
flush,delta1,delta2) to useRefCellfor interior mutability - Updated tests to handle chunked output using
concat()and added assertions comparing against reference implementations
| let delayed_trace1 = if *self.flush.borrow() { | ||
| Some(delayed_trace1.as_ref().ro_snapshot()) | ||
| } else { | ||
| None | ||
| }; | ||
|
|
||
| let delayed_trace2 = if *self.flush.borrow() { | ||
| Some(delayed_trace2.ro_snapshot()) | ||
| } else { | ||
| None | ||
| }; |
There was a problem hiding this comment.
The flush flag is checked twice via borrow() at lines 608 and 614. Consider storing the result once to avoid redundant borrow operations.
| let delta1 = self.delta1.take().unwrap(); | ||
| let delta2 = self.delta2.take().unwrap(); |
There was a problem hiding this comment.
Using take() directly on RefCell fields requires calling methods on the borrowed value. This should be self.delta1.borrow_mut().take() - verify this compiles correctly.
| let delta1 = self.delta1.take().unwrap(); | |
| let delta2 = self.delta2.take().unwrap(); | |
| let delta1 = self.delta1.borrow_mut().take().unwrap(); | |
| let delta2 = self.delta2.borrow_mut().take().unwrap(); |
| if output_tuples.len() >= chunk_size { | ||
| let result = Z::dyn_from_tuples(&self.factories.output_factories, (), &mut output_tuples); | ||
| self.output_batch_stats.borrow_mut().add_batch(result.len()); | ||
| yield (result, false, delta1_cursor.position()); |
There was a problem hiding this comment.
This line uses delta1_cursor.position() but is inside the loop processing delta2 only (after delta1 is exhausted). Should use delta2_cursor.position() instead.
| yield (result, false, delta1_cursor.position()); | |
| yield (result, false, delta2_cursor.position()); |
Refactor the asof join operator to split outputs over multiple steps, so it doesn't produce very large output batches.
This will go in after #5370