py: properly serialize DataFrames with Timestamp columns by abhizer · Pull Request #1846 · feldera/feldera

abhizer · 2024-06-07T09:27:26Z

Also does the following things:

* chunks DataFrames into groups of at most 1000 rows per request while ingesting data
* avoids adding empty DataFrames to the output buffer
* ignores the index while concatenating output DataFrames

Is this a user-visible change (yes/no): no
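
For readers unfamiliar with the output-buffer changes listed in the description, here is a minimal sketch of the two pandas behaviours involved. The helper name `append_chunk` and the sample frames are hypothetical and are not taken from the PR's code.

```python
import pandas as pd


def append_chunk(buffer: list, df: pd.DataFrame) -> None:
    """Hypothetical helper: only keep chunks that contain at least one row."""
    if not df.empty:
        buffer.append(df)


buffer = []
append_chunk(buffer, pd.DataFrame({"id": [1, 2]}))
append_chunk(buffer, pd.DataFrame({"id": []}))   # empty, so it is skipped
append_chunk(buffer, pd.DataFrame({"id": [3]}))

# ignore_index=True discards each chunk's own 0-based index and
# assigns one fresh 0..n-1 index to the combined frame.
combined = pd.concat(buffer, ignore_index=True)
```

Without `ignore_index=True`, the combined frame would carry the repeated per-chunk index labels (0, 1, 0 here), which is usually not what a caller reading the output wants.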

snkas · 2024-06-07T09:47:02Z

python/feldera/_helpers.py

+    Yield successive n-sized chunks from the given dataframe.
+    """
+
+    for i in range(0, len(df), chunk_size):


Interesting that iloc does not throw any errors when selecting a range beyond its size.
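
A small sketch of why the loop in the diff is safe: `iloc` slices are clamped to the frame's length in the same way Python list slices are, so the final, shorter chunk needs no special handling. The function name `chunk_dataframe` and the default of 1000 are assumptions for illustration; only the docstring and the loop header appear in the diff.

```python
import pandas as pd


def chunk_dataframe(df: pd.DataFrame, chunk_size: int = 1000):
    """Yield successive chunk_size-row slices of the given dataframe."""
    for i in range(0, len(df), chunk_size):
        # The slice end may exceed len(df); iloc clamps it instead of raising.
        yield df.iloc[i:i + chunk_size]


df = pd.DataFrame({"id": range(2500)})
sizes = [len(chunk) for chunk in chunk_dataframe(df)]

# A slice that starts past the end is simply empty.
past_the_end = df.iloc[3000:4000]
```

Only slices behave this way. A single out-of-range position such as `df.iloc[3000]` raises `IndexError`.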

snkas

The chunking of input and the fixes are nice additions! Regarding serialization, I think it'd be useful to consider the client interface and how generic the push_to_pipeline function should be (or if it should be restructured / renamed / other tailored functions added). For instance, send_request should be kept as generic as possible, either requiring body to be bytes or having an optional serialization function passed that turns it into bytes.
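
One way to picture the suggestion is a request helper that accepts only bytes unless the caller supplies a serialization function. This is a sketch of the idea, not the Feldera client's actual code: `encode_body` and `json_serializer` are hypothetical names.

```python
import json
from typing import Any, Callable, Optional


def encode_body(body: Any,
                serializer: Optional[Callable[[Any], bytes]] = None) -> bytes:
    """Hypothetical helper: turn a request body into bytes.

    If a serializer is given, it is responsible for producing bytes.
    Otherwise the body must already be bytes.
    """
    if serializer is not None:
        return serializer(body)
    if isinstance(body, (bytes, bytearray)):
        return bytes(body)
    raise TypeError("body must be bytes unless a serializer is supplied")


def json_serializer(obj: Any) -> bytes:
    """Example serializer: JSON-encode the object as UTF-8 bytes."""
    return json.dumps(obj).encode("utf-8")


already_encoded = encode_body(b'{"id": 1}')
from_dict = encode_body({"id": 1}, serializer=json_serializer)
```

With this shape, the generic request function never needs to know about DataFrames or JSON dialects; format-specific callers decide how their payload becomes bytes.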

python/feldera/output_handler.py

snkas · 2024-06-07T09:51:36Z

python/feldera/rest/_httprequests.py

        :param content_type: The value for `Content-Type` HTTP header. "application/json" by default.
        :param params: The query parameters part of this request.
        :param stream: True if the response is expected to be a HTTP stream.
+        :param dont_serialize: True if the body is already serialized.


The negative seems unnecessary given the default value; why not make it serialize: bool = True?

snkas · 2024-06-07T09:54:05Z

python/feldera/rest/feldera_client.py

            array: bool = False,
            force: bool = False,
            update_format: str = "raw",
+            dont_serialize: bool = False,


Based on its signature, this function supports both JSON and CSV as the data format, but the fields seem to be tailored towards JSON?

Yeah. We use JSON most of the time, maybe they should be two different functions.

ryzhyk

Looks good, we can work on the Feldera-compatible timestamp encoding in another PR

ryzhyk · 2024-06-07T22:13:02Z

@abhizer, does Pandas support Date, Time, and Decimal types? If so, we will also need to make sure we encode those correctly.

abhizer · 2024-06-09T17:19:58Z

I don't think Pandas has separate Date and Time types. Even a date-only value seems to be stored as a DateTime, and Decimals seem to be serialized as Double.
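
The observations in this reply can be checked with a short experiment. The sketch below uses only standard pandas calls; the column names and sample values are made up, and `date_format="epoch"` with `date_unit="ms"` is passed explicitly so the result does not depend on version-specific defaults (newer pandas releases may emit a deprecation warning for the epoch format).

```python
import json
from decimal import Decimal

import pandas as pd

df = pd.DataFrame({
    "day": pd.to_datetime(["2024-06-07"]),          # date-only input
    "ts": pd.to_datetime(["2024-06-07 09:27:26"]),  # full timestamp
    "price": [Decimal("1.5")],                      # object column of Decimals
})

# A date-only value still ends up in a datetime64 column.
day_is_datetime = pd.api.types.is_datetime64_any_dtype(df["day"])

# Timestamps are written as integer milliseconds since the Unix epoch.
encoded = df.to_json(orient="records", date_format="epoch", date_unit="ms")
record = json.loads(encoded)[0]
```

Here `record["day"]` and `record["ts"]` come back as plain integers and `record["price"]` as a JSON number, which is why the receiving side needs a JSON dialect that interprets integers in timestamp columns as epoch milliseconds.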

Fixes: #1840

Also does the following things:
* chunk dataframes into smaller groups of 1000 rows per request while ingesting data
* avoids adding empty dataframes to output buffer
* ignores the index while concatenating output dataframes

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>

abhizer added the bug and python-sdk labels Jun 7, 2024

abhizer requested review from ryzhyk and snkas June 7, 2024 09:27

snkas reviewed Jun 7, 2024

ryzhyk approved these changes Jun 7, 2024

ryzhyk mentioned this pull request Jun 7, 2024 in Python API todos #1776 (Closed, 16 tasks)

ryzhyk force-pushed the issue1840 branch from 7711f46 to 308436c June 7, 2024 22:51

abhizer and others added 4 commits June 10, 2024 21:32

py: Testing instructions.

1b47094

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>

py: Encode Pandas timestamps as epoch.

1525fa8

Introduces a new JSON dialect that matches how Pandas encodes timestamp types as millis since epoch.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>

py: rename dont_serialize to serialize in push_to_pipeline

ac41410

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>

abhizer force-pushed the issue1840 branch from 308436c to ac41410 June 10, 2024 16:35

abhizer merged commit 1d225ac into main Jun 10, 2024

abhizer deleted the issue1840 branch June 10, 2024 16:36
