
py: implement SQLContext.wait_for_completion #1872

Merged: ryzhyk merged 7 commits into main from wait_for_completion, Jun 24, 2024

Conversation

abhizer (Contributor) commented Jun 14, 2024

Is this a user-visible change (yes/no): yes

Also adds an enum PipelineStatus to represent the current state of the pipeline.
Adds a flush parameter to connect_source_pandas that, when set, immediately sends the data to the backend.
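As a rough illustration of what the change enables, here is a minimal, self-contained sketch of a status enum and a polling wait loop. The member names and the polling strategy are assumptions for illustration, not the SDK's actual implementation:

```python
import time
from enum import Enum


class PipelineStatus(Enum):
    # Illustrative states; the real SDK enum may differ in members and values.
    NOT_FOUND = 1
    PROVISIONING = 2
    PAUSED = 3
    RUNNING = 4
    SHUTDOWN = 5
    FAILED = 6


def wait_for_completion(get_status, poll_interval_s=0.1):
    """Block until the pipeline reports completion (modeled here as
    reaching SHUTDOWN), raising if it fails first."""
    while True:
        status = get_status()
        if status is PipelineStatus.SHUTDOWN:
            return
        if status is PipelineStatus.FAILED:
            raise RuntimeError("pipeline failed")
        time.sleep(poll_interval_s)
```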

abhizer requested a review from snkas, June 14, 2024 13:47
```python
    Failed
    """

    UNINITIALIZED = 1
```
snkas (Contributor) commented Jun 14, 2024

NON_EXISTENT / DOES_NOT_EXIST / NOT_CREATED / NOT_FOUND (or similar) would be more appropriate naming, as both SHUTDOWN and PROVISIONING are also states that are "uninitialized".

```diff
@@ -1,33 +1,177 @@
 from enum import Enum
```
Contributor commented:

I don't see much need for adding extra code specifically to document the enumeration variants. They can just be specified in the docstring of the enumeration as a list like:

- SERVER_DEFAULT: ...
- DEV: ...
- UNOPTIMIZED: ...
- OPTIMIZED: ...

This would still keep the documentation nearby (when someone adds or edits a variant, the corresponding documentation is right there), and the enumerations will likely not see much change over time.
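The suggestion in concrete form, with the variant names taken from the list above; the values and one-line descriptions are illustrative, not the project's actual definitions:

```python
from enum import Enum


class CompilationProfile(Enum):
    """Compilation profile for a pipeline program.

    Variants:

    - SERVER_DEFAULT: defer to the server's configured default profile.
    - DEV: fast compilation for quick iteration.
    - UNOPTIMIZED: compile without optimizations.
    - OPTIMIZED: full optimizations; slower builds, faster pipelines.
    """

    SERVER_DEFAULT = 0
    DEV = 1
    UNOPTIMIZED = 2
    OPTIMIZED = 3
```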

```diff
         self.tables[name] = SQLTable(name, ddl)

-    def connect_source_pandas(self, table_name: str, df: pandas.DataFrame):
+    def connect_source_pandas(self, table_name: str, df: pandas.DataFrame, flush: bool = False):
```
Contributor commented:

In hindsight, this design, where we connect Pandas inputs to the pipeline before it's running, was a mistake. I suggest that we nuke the run_to_completion method. Instead, the user must call start, then feed their dataframes, and finally call run_to_completion. Feeding a dataframe to a pipeline that is not running is an error. Do you see any downsides to this approach?

Contributor replied:

I think so as well; then there are a setup phase and a data phase of a pipeline. It also helps that not everything needs to be loaded in advance, especially if it turns out that the pipeline does not start. I can't think of a particular downside, except maybe losing some ability to schedule how the different data is pushed: one table at a time, or all tables in parallel; which one makes sense depends on the query.
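The proposed lifecycle (start, then feed, then wait) can be mocked in a few lines. This is a toy stand-in, not the real SQLContext, and `input_pandas` here accepts any object in place of a DataFrame:

```python
class SQLContext:
    """Toy model of the proposed lifecycle: feeding data before start()
    is an error."""

    def __init__(self):
        self.running = False
        self.fed = []

    def start(self):
        self.running = True

    def input_pandas(self, table_name, df):
        if not self.running:
            raise RuntimeError("cannot feed data: pipeline is not running")
        self.fed.append((table_name, df))

    def wait_for_completion(self):
        # The real SDK blocks here until all input records are processed.
        self.running = False
```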

```python
        :param flush: If True, the data will be pushed to the pipeline immediately. Defaults to False.
        """

        if flush and self.pipeline_status() != PipelineStatus.RUNNING:
```
Contributor commented:

We may want to allow pushing data in the PAUSED state with the force flag. This could be a good debugging tool: pause the pipeline, push a small change manually, see what happens.

You could add an optional force argument to this function for that. Ok to make that a separate future PR.
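A sketch of such a guard, with a local stand-in for the status enum (member names assumed for illustration):

```python
from enum import Enum


class PipelineStatus(Enum):
    # Local stand-in for the SDK's status enum.
    RUNNING = 1
    PAUSED = 2
    SHUTDOWN = 3


def ensure_can_push(status, force=False):
    """Allow pushing data while RUNNING; allow PAUSED only when force=True,
    supporting the pause-then-push debugging workflow described above."""
    if status is PipelineStatus.RUNNING:
        return
    if status is PipelineStatus.PAUSED and force:
        return
    raise RuntimeError(f"cannot push data while pipeline is {status.name}")
```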

```python
        .. _run_to_completion:

        Runs the pipeline to completion, waiting for all input records to be processed.
        Will block indefinitely if the source is streaming.
```
Contributor commented:

Suggested change:

```diff
-Will block indefinitely if the source is streaming.
+Will block indefinitely if one of the input connectors is a streaming connector that does not emit an end-of-input notification, e.g., Kafka.
```

```python
        :param delete_connectors: If True, also deletes the connectors associated with the pipeline. False by default.
        """

        if self.pipeline_status() != PipelineStatus.SHUTDOWN:
```
Contributor commented:

Shouldn't it also be OK to delete a failed pipeline?
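Relaxing the check to also accept a failed pipeline might look like this (enum members assumed for illustration):

```python
from enum import Enum


class PipelineStatus(Enum):
    # Local stand-in for the SDK's status enum.
    RUNNING = 1
    SHUTDOWN = 2
    FAILED = 3


def can_delete(status):
    # A pipeline that shut down cleanly or failed can be deleted;
    # a running one cannot.
    return status in (PipelineStatus.SHUTDOWN, PipelineStatus.FAILED)
```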

ryzhyk mentioned this pull request Jun 18, 2024
abhizer added 4 commits June 24, 2024 17:41
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>

* `input_pandas` must now be called after starting a pipeline
abhizer force-pushed the wait_for_completion branch from 4ac96e1 to 10940d8, June 24, 2024 12:33
abhizer (Contributor, Author) commented Jun 24, 2024

The CI fails with:

```
no field `skip_schema_id` on type `pipeline_types::format::avro::AvroEncoderConfig`
```

But it exists?

```rust
pub skip_schema_id: bool,
```

snkas (Contributor) left a comment:

Looks good! Separating start() and waiting for completion makes sense.


```diff
 .. warning::
-    Kafka is a streaming data source, therefore running :meth:`.SQLContext.run_to_completion` will run forever.
+    Kafka is a streaming data source, therefore running :meth:`.SQLContext.wait_for_completion` will block forever.
```
Contributor commented:

Ideally (not in this PR), an error should be thrown if a streaming data source is defined and the user tries to call wait_for_completion().

```python
        To listen for response from feldera, in the form of DataFrames
        call :meth:`.SQLContext.listen`.
        To ensure all data is received start listening before calling
        :meth:`.SQLContext.start`.
```
Contributor commented:

This could also be enforced via a state check in the listen() command?

abhizer (Contributor, Author) replied:

You can also start listening at some arbitrary point after starting a pipeline.

Or listen to an already running pipeline.

Contributor replied:

True, that's possible; my question is more whether a Python user of SQLContext ever specifically wants that, since there are no guarantees about the point in the stream at which listening will start. Probably out of scope for this PR.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
abhizer requested a review from ryzhyk, June 24, 2024 13:25
ryzhyk (Contributor) left a comment:

I'll make small changes to comments and merge.

Leonid Ryzhyk added 2 commits June 24, 2024 10:42
Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
ryzhyk merged commit 18dd9fb into main Jun 24, 2024
ryzhyk deleted the wait_for_completion branch June 24, 2024 19:04