Skip to content

py: implement foreach_chunk method for streaming HTTP output#1792

Merged
gz merged 5 commits intomainfrom
foreach_chunk
May 28, 2024
Merged

py: implement foreach_chunk method for streaming HTTP output#1792
gz merged 5 commits intomainfrom
foreach_chunk

Conversation

@abhizer
Copy link
Copy Markdown
Contributor

@abhizer abhizer commented May 27, 2024

Is this a user-visible change (yes/no): yes

@abhizer abhizer added the python-sdk Issues related to the feldera python sdk label May 27, 2024
@abhizer abhizer requested a review from ryzhyk May 27, 2024 13:55
@abhizer abhizer mentioned this pull request May 27, 2024
16 tasks

.. note::
- The callback must be thread-safe as it will be run in a separate thread.
- This method must be called before calling :meth:`.run_to_completion`, or :meth:`.start`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to call it at runtime. Of course, this means that the callback will only be invoked for new chunks that show up after the connector got attached to the pipeline.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

I will update the docs to say that should be called before run_to_completion, and if called after start will only be invoked for new chunks.

view_name: str,
callback: Callable[[pd.DataFrame, int], None],
queue: Queue,
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is almost identical to output_handler.rs. I understand that that code assembles the result in a single dataframe, while here we invoke the callback for each batch. But I would expect the former to be implemented on top of the latter, so we shouldn't need two near-identical implementations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I am working on finding an elegant way to merge the two.

if data is not None:
self.callback(dataframe_from_response([data]), seq_no)

try:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add comments explaining what's going on here?

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented May 27, 2024

I also realized that this still doesn't break input into lines in listen_to_pipeline.

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented May 28, 2024

Ran into another issue: SQLContext.start doesn't process self.views_tx and self.http_input_buffer, so foreach_chunk doesn't work when using it.

abhizer and others added 5 commits May 28, 2024 12:42
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
Use `iter_lines` to read HTTP response line-by-line.  This way we don't
need to worry about incomplete chunks.  The previous implementation also
ran out of memory on large outputs.  I did not figure out what was
going on exactly, but it used up 20GB of RAM while parsing a few
thousand records.  This implementation does not seem to have that
problem.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Co-authored-by: Leonid Ryzhyk <leonid@feldera.com>
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@gz gz force-pushed the foreach_chunk branch from 320c284 to 32a5fc2 Compare May 28, 2024 19:42
@gz gz merged commit bc3337c into main May 28, 2024
@gz gz deleted the foreach_chunk branch May 28, 2024 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python-sdk Issues related to the feldera python sdk

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants