py: implement foreach_chunk method for streaming HTTP output #1792
| .. note:: | ||
| - The callback must be thread-safe as it will be run in a separate thread. | ||
| - This method must be called before calling :meth:`.run_to_completion`, or :meth:`.start`. |
It should be possible to call it at runtime. Of course, this means that the callback will only be invoked for new chunks that show up after the connector got attached to the pipeline.
Okay.
I will update the docs to say that it should be called before `run_to_completion`, and that if it is called after `start`, the callback will only be invoked for new chunks.
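For illustration, a thread-safe callback of the kind the docstring asks for might look like the sketch below. `ChunkCollector` is a hypothetical helper, not part of this PR, and plain strings stand in for the `pd.DataFrame` chunks so that the example needs only the standard library.

```python
import threading
from typing import Any, List, Tuple


class ChunkCollector:
    """Hypothetical helper: accumulates (seq_no, chunk) pairs under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._chunks: List[Tuple[int, Any]] = []

    def __call__(self, chunk: Any, seq_no: int) -> None:
        # The callback runs on a separate thread, so guard the shared list.
        with self._lock:
            self._chunks.append((seq_no, chunk))

    def snapshot(self) -> List[Tuple[int, Any]]:
        # Return a copy ordered by sequence number.
        with self._lock:
            return sorted(self._chunks, key=lambda pair: pair[0])


collector = ChunkCollector()

# Stand-in for the background thread that would deliver chunks.
worker = threading.Thread(
    target=lambda: [collector(f"chunk-{i}", i) for i in range(3)]
)
worker.start()
worker.join()
```

A callable of this shape matches the `(chunk, seq_no)` argument order of the `Callable[[pd.DataFrame, int], None]` callback in the diff. If it is attached after the pipeline has started, it would only see chunks produced from that point on.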
| view_name: str, | ||
| callback: Callable[[pd.DataFrame, int], None], | ||
| queue: Queue, | ||
| ): |
This code is almost identical to output_handler.rs. I understand that that code assembles the result in a single dataframe, while here we invoke the callback for each batch. But I would expect the former to be implemented on top of the latter, so we shouldn't need two near-identical implementations.
Yes. I am working on finding an elegant way to merge the two.
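One way to read the suggestion is that the "collect everything" path becomes a thin wrapper around the per-batch callback path. The sketch below shows that layering in isolation. `foreach_batch` and `collect_batches` are hypothetical names, and the batches are plain Python objects rather than the DataFrames used in the actual code.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def foreach_batch(batches: Iterable[T], callback: Callable[[T, int], None]) -> None:
    # Invoke the callback once per batch, passing a running sequence number.
    for seq_no, batch in enumerate(batches):
        callback(batch, seq_no)


def collect_batches(batches: Iterable[T]) -> List[T]:
    # Built on top of foreach_batch: the callback simply appends each batch.
    collected: List[T] = []
    foreach_batch(batches, lambda batch, _seq_no: collected.append(batch))
    return collected
```

With this shape there is a single loop that reads batches, and assembling a combined result is just one particular callback.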
python/feldera/_callback_runner.py
| if data is not None: | ||
| self.callback(dataframe_from_response([data]), seq_no) | ||
|
|
||
| try: |
Can you please add comments explaining what's going on here?
|
I also realized that this still doesn't break input into lines in |
|
Ran into another issue: |
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
Use `iter_lines` to read HTTP response line-by-line.

This way we don't need to worry about incomplete chunks. The previous implementation also ran out of memory on large outputs. I did not figure out what was going on exactly, but it used up 20GB of RAM while parsing a few thousand records. This implementation does not seem to have that problem.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Co-authored-by: Leonid Ryzhyk <leonid@feldera.com>
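A minimal sketch of the line-by-line approach, under two assumptions that this conversation does not confirm: that the stream is newline-delimited JSON, and that the lines come from something like `requests`' `Response.iter_lines()`. `parse_json_lines` and the `sequence_number` field are hypothetical names used only for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_json_lines(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    # Each item is expected to be one complete line, so no record is ever
    # split across two reads (unlike fixed-size chunks).
    for line in lines:
        if not line:
            # Skip empty keep-alive lines.
            continue
        yield json.loads(line)


# In real use the iterable could be `response.iter_lines()` on a streamed
# response (an assumption); a list of bytes stands in for it here.
sample = [b'{"sequence_number": 0}', b"", b'{"sequence_number": 1}']
records = list(parse_json_lines(sample))
```

Because the function is a generator, only one line is held in memory at a time, which is in keeping with the memory behavior described in the commit message.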
Is this a user-visible change (yes/no): yes