
py: implement SQLContext.wait_for_completion #1872

Merged: ryzhyk merged 7 commits into main from wait_for_completion, Jun 24, 2024

Conversation

abhizer (Contributor) commented Jun 14, 2024

Is this a user-visible change (yes/no): yes

Also adds an enum PipelineStatus to represent the current state of the pipeline.
Adds a flush parameter to connect_source_pandas that, when set, immediately sends the data to the backend.
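As a rough illustration of what the change enables, here is a minimal, self-contained sketch of a status enum and a polling wait loop. The member names and the polling strategy are assumptions for illustration, not the SDK's actual implementation:

```python
import time
from enum import Enum


class PipelineStatus(Enum):
    # Illustrative states; the real SDK enum may differ in members and values.
    NOT_FOUND = 1
    PROVISIONING = 2
    PAUSED = 3
    RUNNING = 4
    SHUTDOWN = 5
    FAILED = 6


def wait_for_completion(get_status, poll_interval_s=0.1):
    """Block until the pipeline reports completion (modeled here as
    reaching SHUTDOWN), raising if it fails first."""
    while True:
        status = get_status()
        if status is PipelineStatus.SHUTDOWN:
            return
        if status is PipelineStatus.FAILED:
            raise RuntimeError("pipeline failed")
        time.sleep(poll_interval_s)
```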

abhizer requested a review from snkas, June 14, 2024 13:47
```python
    Failed
    """

    UNINITIALIZED = 1
```
snkas (Contributor) commented Jun 14, 2024

NON_EXISTENT / DOES_NOT_EXIST / NOT_CREATED / NOT_FOUND (or similar) would be more appropriate naming, as both SHUTDOWN and PROVISIONING are also states that are "uninitialized".

```diff
@@ -1,33 +1,177 @@
 from enum import Enum
```
Contributor commented:

I don't see much need for adding extra code specifically to document the enumeration variants. They can just be specified in the docstring of the enumeration as a list like:

- SERVER_DEFAULT: ...
- DEV: ...
- UNOPTIMIZED: ...
- OPTIMIZED: ...

This would still keep the documentation nearby (when someone adds or edits a variant, the corresponding documentation is right there), and the enumerations will likely not see much change over time.
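The suggestion in concrete form, with the variant names taken from the list above; the values and one-line descriptions are illustrative, not the project's actual definitions:

```python
from enum import Enum


class CompilationProfile(Enum):
    """Compilation profile for a pipeline program.

    Variants:

    - SERVER_DEFAULT: defer to the server's configured default profile.
    - DEV: fast compilation for quick iteration.
    - UNOPTIMIZED: compile without optimizations.
    - OPTIMIZED: full optimizations; slower builds, faster pipelines.
    """

    SERVER_DEFAULT = 0
    DEV = 1
    UNOPTIMIZED = 2
    OPTIMIZED = 3
```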

```diff
         self.tables[name] = SQLTable(name, ddl)

-    def connect_source_pandas(self, table_name: str, df: pandas.DataFrame):
+    def connect_source_pandas(self, table_name: str, df: pandas.DataFrame, flush: bool = False):
```
Contributor commented:

In hindsight, this design, where we connect Pandas inputs to the pipeline before it's running, was a mistake. I suggest that we nuke the run_to_completion method. Instead, the user must call start, then feed their dataframes, and finally call run_to_completion. Feeding a dataframe to a pipeline that is not running is an error. Do you see any downsides to this approach?

Contributor replied:

I think so as well; then there are a setup phase and a data phase of a pipeline. It also helps that not everything needs to be loaded in advance, especially if it turns out that the pipeline does not start. I can't think of a particular downside, except maybe losing some ability to schedule how the different data is pushed: one table at a time, or all tables in parallel; which one makes sense depends on the query.
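The proposed lifecycle (start, then feed, then wait) can be mocked in a few lines. This is a toy stand-in, not the real SQLContext, and `input_pandas` here accepts any object in place of a DataFrame:

```python
class SQLContext:
    """Toy model of the proposed lifecycle: feeding data before start()
    is an error."""

    def __init__(self):
        self.running = False
        self.fed = []

    def start(self):
        self.running = True

    def input_pandas(self, table_name, df):
        if not self.running:
            raise RuntimeError("cannot feed data: pipeline is not running")
        self.fed.append((table_name, df))

    def wait_for_completion(self):
        # The real SDK blocks here until all input records are processed.
        self.running = False
```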

```python
        :param flush: If True, the data will be pushed to the pipeline immediately. Defaults to False.
        """

        if flush and self.pipeline_status() != PipelineStatus.RUNNING:
```
Contributor commented:

We may want to allow pushing data in the PAUSED state with the force flag. This could be a good debugging tool: pause the pipeline, push a small change manually, see what happens.

You could add an optional force argument to this function for that. Ok to make that a separate future PR.
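A sketch of such a guard, with a local stand-in for the status enum (member names assumed for illustration):

```python
from enum import Enum


class PipelineStatus(Enum):
    # Local stand-in for the SDK's status enum.
    RUNNING = 1
    PAUSED = 2
    SHUTDOWN = 3


def ensure_can_push(status, force=False):
    """Allow pushing data while RUNNING; allow PAUSED only when force=True,
    supporting the pause-then-push debugging workflow described above."""
    if status is PipelineStatus.RUNNING:
        return
    if status is PipelineStatus.PAUSED and force:
        return
    raise RuntimeError(f"cannot push data while pipeline is {status.name}")
```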

```python
        .. _run_to_completion:

        Runs the pipeline to completion, waiting for all input records to be processed.
        Will block indefinitely if the source is streaming.
```
Contributor commented:

Suggested change:

```diff
-Will block indefinitely if the source is streaming.
+Will block indefinitely if one of the input connectors is a streaming connector that does not emit an end-of-input notification, e.g., Kafka.
```

```python
        :param delete_connectors: If True, also deletes the connectors associated with the pipeline. False by default.
        """

        if self.pipeline_status() != PipelineStatus.SHUTDOWN:
```
Contributor commented:

Shouldn't it also be OK to delete a failed pipeline?
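Relaxing the check to also accept a failed pipeline might look like this (enum members assumed for illustration):

```python
from enum import Enum


class PipelineStatus(Enum):
    # Local stand-in for the SDK's status enum.
    RUNNING = 1
    SHUTDOWN = 2
    FAILED = 3


def can_delete(status):
    # A pipeline that shut down cleanly or failed can be deleted;
    # a running one cannot.
    return status in (PipelineStatus.SHUTDOWN, PipelineStatus.FAILED)
```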

ryzhyk mentioned this pull request Jun 18, 2024
abhizer added 4 commits June 24, 2024 17:41
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>

* `input_pandas` must now be called after starting a pipeline
abhizer force-pushed the wait_for_completion branch from 4ac96e1 to 10940d8, June 24, 2024 12:33
abhizer (Contributor, Author) commented Jun 24, 2024

The CI fails with:

```
no field `skip_schema_id` on type `pipeline_types::format::avro::AvroEncoderConfig`
```

But it exists?

```rust
pub skip_schema_id: bool,
```

snkas (Contributor) left a comment:

Looks good! Separating start() and waiting for completion makes sense.


```diff
 .. warning::
-    Kafka is a streaming data source, therefore running :meth:`.SQLContext.run_to_completion` will run forever.
+    Kafka is a streaming data source, therefore running :meth:`.SQLContext.wait_for_completion` will block forever.
```
Contributor commented:

Ideally (not in this PR), an error should be thrown if a streaming data source is defined and the user tries to call wait_for_completion().

```python
        To listen for response from feldera, in the form of DataFrames
        call :meth:`.SQLContext.listen`.
        To ensure all data is received start listening before calling
        :meth:`.SQLContext.start`.
```
Contributor commented:

This could also be enforced via a state check in the listen() command?

abhizer (Contributor, Author) replied:

You can also start listening at some arbitrary point after starting a pipeline.

Or listen to an already running pipeline.

Contributor replied:

True, that's possible; my question is more whether a Python user of SQLContext ever specifically wants that, since there are no guarantees about the point in the stream at which listening will start. Probably out of scope for this PR.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
abhizer requested a review from ryzhyk, June 24, 2024 13:25
ryzhyk (Contributor) left a comment:

I'll make small changes to comments and merge.

Leonid Ryzhyk added 2 commits June 24, 2024 10:42
Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk merged commit 18dd9fb into main Jun 24, 2024
ryzhyk deleted the wait_for_completion branch June 24, 2024 19:04