Open
Labels
community-backlog, data, enhancement, question, triage, usability
Description
Ray Data supports checkpointing by using a primary key to filter out rows that have already been processed.
Currently, however, the checkpoint filter can only be injected on the source side (plan_op_read injects it into the read tasks).
In some environments the primary key is not available on the source side. Letting users choose where the checkpoint filter is injected would be useful in those scenarios.
Use case
In our offline inference scenario, we select files according to a sampling strategy and then split each compressed file into many samples. The primary key is therefore file_name + sample_id, and the checkpoint filter should be injected after the samples are extracted.
Similarly, in a RAG pipeline a PDF file may be split into many chunks, and file_name + chunk_id can be used to identify a chunk.
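To make the requested ordering concrete, here is a minimal plain-Python sketch of filtering on a composite key after sample extraction. It does not use Ray Data's checkpoint API. The names extract_samples, primary_key, checkpoint_filter and run_pipeline, and the row layout, are all hypothetical and only illustrate why the filter has to run downstream of the step that creates sample_id.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Row = Dict[str, object]
Key = Tuple[str, int]


def extract_samples(file_row: Row) -> Iterator[Row]:
    """Hypothetical stand-in for splitting one compressed file into samples."""
    for sample_id, payload in enumerate(file_row["samples"]):
        yield {
            "file_name": file_row["file_name"],
            "sample_id": sample_id,
            "payload": payload,
        }


def primary_key(row: Row) -> Key:
    """Composite key: only available once sample_id exists."""
    return (row["file_name"], row["sample_id"])


def checkpoint_filter(rows: Iterable[Row], processed: Set[Key]) -> Iterator[Row]:
    """Drop rows whose composite key was already recorded as processed."""
    for row in rows:
        if primary_key(row) not in processed:
            yield row


def run_pipeline(files: List[Row], processed: Set[Key]) -> List[Row]:
    # The filter is applied after extraction, not at the file-level source.
    samples = (s for f in files for s in extract_samples(f))
    return list(checkpoint_filter(samples, processed))


files = [
    {"file_name": "a.tar", "samples": ["x", "y"]},
    {"file_name": "b.tar", "samples": ["z"]},
]
# Assumed checkpoint state from an earlier, interrupted run.
processed = {("a.tar", 0), ("b.tar", 0)}
remaining = run_pipeline(files, processed)
```

A source-side filter sees only file_name here, so it cannot tell that a.tar is half done. That is the gap this request is about.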