Description
Expected Behavior
Currently, we have a push source with a Redshift offline store and a DynamoDB online store.
Our feature view has more than 500 columns (around 750).
We expect data to be ingested into both DynamoDB and Redshift when we run
fs.push("push_source", df, to=PushMode.ONLINE_AND_OFFLINE)
Current Behavior
The push command raises an error like [ERROR] ValueError: The input dataframe has columns ..
The issue comes from the get_table_column_names_and_types method, which is called by write_to_offline_store.
There, we check if set(input_columns) != set(source_columns) and raise the error above if they differ.
With more than 500 columns we always get a diff, because source_columns is built from the result of get_table_column_names_and_types, and that result is capped by the MaxResults parameter of the underlying Redshift Data API describe_table call.
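To illustrate, here is a minimal sketch of the failing comparison; the variable names and error text are assumptions reconstructed from the traceback, not Feast's verbatim code:

# Simplified sketch of the check in write_to_offline_store (names assumed):
source_columns = [
    name for name, _ in batch_source.get_table_column_names_and_types(config)
]  # describe_table returns at most MaxResults columns, so this list is truncated
input_columns = list(df.columns)
if set(input_columns) != set(source_columns):
    # With ~750 columns, source_columns is missing entries, so this always fires:
    raise ValueError(f"The input dataframe has columns {set(input_columns)} ...")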
Steps to reproduce
from feast import Entity, FeatureView, Field, PushSource, RedshiftSource, ValueType, types
from feast.data_source import PushMode

entity = Entity(
    name="entity",
    join_keys=["entity_id"],
    value_type=ValueType.INT64,
)
push_source = PushSource(
    name="push_source",
    batch_source=RedshiftSource(
        table="fs_push_view",
        timestamp_field="datecreation",
        created_timestamp_column="created_at",
    ),
)
besoin_embedding_push_view = FeatureView(
    name="push_view",
    entities=[entity],
    # 768 feature columns, well above the 500-column threshold
    schema=[Field(name=f"field_{dim}", dtype=types.Float64) for dim in range(768)],
    source=push_source,
)

# fs is the project FeatureStore; df is a pandas DataFrame whose columns match
# the feature view schema plus the entity and timestamp columns.
fs.push("push_source", df, to=PushMode.ONLINE_AND_OFFLINE)
Specifications
- Version: 0.25.0
- Platform: AWS
- Subsystem:
Possible Solution
In my mind, we have two solutions:
- Set higher MaxResults in describe_table method
- Use NextToken to iterate through results
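A rough sketch of the second option, using the boto3 redshift-data client; the function name and argument wiring are illustrative, not Feast's actual code:

import boto3

def get_all_column_names_and_types(cluster_id, database, db_user, schema, table):
    # Paginate describe_table with NextToken so wide tables (>500 columns)
    # are fully enumerated instead of being truncated to a single page.
    client = boto3.client("redshift-data")
    columns = []
    next_token = ""
    while True:
        kwargs = dict(
            ClusterIdentifier=cluster_id,
            Database=database,
            DbUser=db_user,
            Schema=schema,
            Table=table,
        )
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.describe_table(**kwargs)
        columns.extend(response["ColumnList"])
        next_token = response.get("NextToken", "")
        if not next_token:
            break
    return [(col["name"], col["typeName"]) for col in columns]

Raising MaxResults (option 1) would also work up to the API's page-size limit, but pagination handles arbitrarily wide tables.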