Skip to content

py: properly serialize DataFrames with Timestamp columns#1846

Merged
abhizer merged 4 commits intomainfrom
issue1840
Jun 10, 2024
Merged

py: properly serialize DataFrames with Timestamp columns#1846
abhizer merged 4 commits intomainfrom
issue1840

Conversation

@abhizer
Copy link
Copy Markdown
Contributor

@abhizer abhizer commented Jun 7, 2024

Fixes: #1840

Also does the following things:

  • chunk dataframes into smaller groups of 1000 rows per request while ingesting data
  • avoids adding empty dataframes to output buffer
  • ignores the index while concatenating output dataframes

Is this a user-visible change (yes/no): no

@abhizer abhizer added bug Something isn't working python-sdk Issues related to the feldera python sdk labels Jun 7, 2024
@abhizer abhizer requested review from ryzhyk and snkas June 7, 2024 09:27
Yield successive n-sized chunks from the given dataframe.
"""

for i in range(0, len(df), chunk_size):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting that iloc does not throw any errors when selecting a range beyond its size.

Copy link
Copy Markdown
Contributor

@snkas snkas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunking of input and the fixes are nice additions! Regarding serialization, I think it'd be useful to consider the client interface and how generic the push_to_pipeline function should be (or if it should be restructured / renamed / other tailored functions added). For instance, send_request should be kept as generic as possible, either requiring body to be bytes or having an optional serialization function passed that turns it into bytes.

:param content_type: The value for `Content-Type` HTTP header. "application/json" by default.
:param params: The query parameters part of this request.
:param stream: True if the response is expected to be a HTTP stream.
:param dont_serialize: True if the body is already serialized.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The negative seems unnecessary with the default value, why not have it serialize: bool = True?

array: bool = False,
force: bool = False,
update_format: str = "raw",
dont_serialize: bool = False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function based on signature supports both JSON and CSV as the data format, but it seems the fields are tailored towards JSON?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We use JSON most of the time, maybe they should be two different functions.

Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, we can work on the Feldera-compatible timestamp encoding in another PR

@ryzhyk ryzhyk mentioned this pull request Jun 7, 2024
16 tasks
@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented Jun 7, 2024

@abhizer , does Pandas support Date, Time, and Decimal types? If so, we will also need to make sure we encode those correctly.

@abhizer
Copy link
Copy Markdown
Contributor Author

abhizer commented Jun 9, 2024

I don't think there are Date and Time separate types in Pandas. Even if it is only just the date, it seems to be DateTime and Decimals seem to be serialized as Double.

abhizer and others added 4 commits June 10, 2024 21:32
Fixes: #1840

Also does the following things:

* chunk dataframes into smaller groups of 1000 rows per request while
  ingesting data
* avoids adding empty dataframes to output buffer
* ignores the index while concatenating output dataframes

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Introduces a new JSON dialect that matches how Pandas encodes timestamp
types as millis since epoch.

Signed-off-by: Leonid Ryzhyk <leonid@feldera.com>
Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@abhizer abhizer merged commit 1d225ac into main Jun 10, 2024
@abhizer abhizer deleted the issue1840 branch June 10, 2024 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working python-sdk Issues related to the feldera python sdk

Projects

None yet

Development

Successfully merging this pull request may close these issues.

py: DataFrames with Timestamp columns fail in connect_source_pandas

3 participants