py: implement SQLContext.wait_for_completion #1872
python/feldera/enums.py
    Failed
    """

    UNINITIALIZED = 1
NON_EXISTENT / DOES_NOT_EXIST / NOT_CREATED / NOT_FOUND or so is more appropriate naming, as both SHUTDOWN and PROVISIONING are states that are "uninitialized".
@@ -1,33 +1,177 @@
from enum import Enum
I don't see much need for adding extra code specifically to document the enumeration variants. They can just be specified in the docstring of the enumeration as a list like:
- SERVER_DEFAULT: ...
- DEV: ...
- UNOPTIMIZED: ...
- OPTIMIZED: ...
Which would still be nearby (e.g., when someone adds/edits an enumeration, editing the corresponding documentation is nearby), and the enumerations will likely not see much change over time.
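The docstring approach suggested above could look like the following sketch. The variant names come from the list in this comment; the enum name `CompilationProfile` and the numeric values are assumptions for illustration, not the actual feldera code:

```python
from enum import Enum


class CompilationProfile(Enum):
    """The compilation profile to use when compiling the program.

    Variants:
        - SERVER_DEFAULT: let the server pick its default profile.
        - DEV: fast compilation, few optimizations.
        - UNOPTIMIZED: no optimizations.
        - OPTIMIZED: full optimizations, slower compilation.
    """

    SERVER_DEFAULT = 0
    DEV = 1
    UNOPTIMIZED = 2
    OPTIMIZED = 3
```

This keeps the documentation nearby when a variant is added or edited, without a separate documentation structure.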
python/feldera/sql_context.py
  self.tables[name] = SQLTable(name, ddl)

- def connect_source_pandas(self, table_name: str, df: pandas.DataFrame):
+ def connect_source_pandas(self, table_name: str, df: pandas.DataFrame, flush: bool = False):
In hindsight, this design where we connect Pandas inputs to the pipeline before it's running was a mistake. I suggest that we nuke the run_to_completion method. Instead the user must call start, then feed their dataframes, and finally call run_to_completion. Feeding a dataframe to a pipeline that is not running is an error. Do you see any downsides to this approach?
I think so as well; that way there is a clear setup phase and a data phase of a pipeline. It also helps that not everything needs to be loaded in advance, especially if it turns out that the pipeline does not start. I can't think of a particular downside, except maybe losing some ability to schedule the different data pushes, e.g., one table at a time versus all tables in parallel; which one makes sense depends on the query.
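The lifecycle being discussed can be sketched with a stdlib-only stand-in. This is a mock illustrating the proposed contract (start, then feed dataframes, then wait), not the real feldera SQLContext; the class name, status values, and method bodies are assumptions for illustration:

```python
from enum import Enum


class PipelineStatus(Enum):
    SHUTDOWN = 1
    RUNNING = 2


class MockSQLContext:
    """Stand-in for feldera's SQLContext, only to illustrate the lifecycle."""

    def __init__(self):
        self.status = PipelineStatus.SHUTDOWN
        self.pushed = []

    def start(self):
        # the setup phase ends here; the pipeline begins accepting data
        self.status = PipelineStatus.RUNNING

    def input_pandas(self, table_name, df):
        # feeding a dataframe to a pipeline that is not running is an error
        if self.status != PipelineStatus.RUNNING:
            raise RuntimeError("pipeline is not running; call start() first")
        self.pushed.append((table_name, df))

    def wait_for_completion(self):
        # in the real API this would block until all input records are
        # processed; the mock has nothing to wait for
        pass
```

Under this contract, data pushed before `start()` fails fast instead of being silently buffered.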
python/feldera/sql_context.py
:param flush: If True, the data will be pushed to the pipeline immediately. Defaults to False.
"""

if flush and self.pipeline_status() != PipelineStatus.RUNNING:
We may want to allow pushing data in the PAUSED state with the force flag. This could be a good debugging tool: pause the pipeline, push a small change manually, see what happens.
You could add an optional force argument to this function for that. Ok to make that a separate future PR.
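The force-push behavior described above could be captured in a small predicate. This is a sketch, not the feldera implementation; the function name `can_push` and the exact set of status values are assumptions:

```python
from enum import Enum


class PipelineStatus(Enum):
    SHUTDOWN = 1
    PROVISIONING = 2
    RUNNING = 3
    PAUSED = 4
    FAILED = 5


def can_push(status, force=False):
    """Pushing data is allowed while RUNNING; with force=True, also while PAUSED."""
    if status == PipelineStatus.RUNNING:
        return True
    # the debugging workflow: pause the pipeline, push a small change, observe
    return bool(force) and status == PipelineStatus.PAUSED
```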
python/feldera/sql_context.py
.. _run_to_completion:

Runs the pipeline to completion, waiting for all input records to be processed.
Will block indefinitely if the source is streaming.
- Will block indefinitely if the source is streaming.
+ Will block indefinitely if one of the input connectors is a streaming connector that does not emit an end-of-input notification, e.g., Kafka.
python/feldera/sql_context.py
:param delete_connectors: If True, also deletes the connectors associated with the pipeline. False by default.
"""

if self.pipeline_status() != PipelineStatus.SHUTDOWN:
It should be ok to delete a failed pipeline?
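One way to allow deleting failed pipelines is to check against a set of "deletable" states rather than SHUTDOWN alone. A sketch, assuming a simplified `PipelineStatus` enum; the helper name `ensure_deletable` is hypothetical:

```python
from enum import Enum


class PipelineStatus(Enum):
    SHUTDOWN = 1
    RUNNING = 2
    FAILED = 3


# states in which deletion is safe: the pipeline holds no running resources
DELETABLE_STATES = {PipelineStatus.SHUTDOWN, PipelineStatus.FAILED}


def ensure_deletable(status):
    if status not in DELETABLE_STATES:
        raise RuntimeError(
            f"cannot delete a pipeline in state {status.name}; shut it down first"
        )
```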
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
* `input_pandas` must now be called after starting a pipeline

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
Force-pushed from 4ac96e1 to 10940d8
The CI fails with:

But it exists?
snkas left a comment
Looks good! Separating start() and waiting for completion makes sense.
  .. warning::
-     Kafka is a streaming data source, therefore running :meth:`.SQLContext.run_to_completion` will run forever.
+     Kafka is a streaming data source, therefore running :meth:`.SQLContext.wait_for_completion` will block forever.
Ideally (not in this PR), an error should be thrown if a streaming data source is defined and the user tries to call wait_for_completion().
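Such a guard could look like the following sketch. The connector representation as `(name, is_streaming)` pairs is an assumption for illustration; the real SDK would derive this from the pipeline's connector metadata:

```python
def check_no_streaming_sources(connectors):
    """Raise if any input connector streams without an end-of-input signal.

    `connectors` is assumed to be an iterable of (name, is_streaming) pairs.
    """
    streaming = [name for name, is_streaming in connectors if is_streaming]
    if streaming:
        raise RuntimeError(
            "wait_for_completion() would block forever; streaming sources: "
            + ", ".join(streaming)
        )
```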
To listen for response from feldera, in the form of DataFrames
call :meth:`.SQLContext.listen`.
To ensure all data is received start listening before calling
:meth:`.SQLContext.start`.
This could also be enforced via a state check in the listen() command?
You can also start listening at some arbitrary point after starting a pipeline.
Or listen to an already running pipeline.
True, that's possible; my question is more whether a Python user of SQLContext ever specifically wants that, as there are no guarantees about the point in the stream at which listening will start. Probably out of scope for this PR.
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
ryzhyk left a comment
I'll make small changes to comments and merge.
Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Is this a user-visible change (yes/no): yes
Also adds an enum `PipelineStatus` to represent the current state of the pipeline.

Adds a `flush` parameter to `connect_source_pandas` that, when set, immediately sends the data to the backend.