[Data] Support injecting the checkpointfilter/checkpointwriter at a custom place. #60704

@my-vegetable-has-exploded

Description

Ray Data supports checkpointing by using a primary key to filter out rows that have already been processed.

Currently, however, the mechanism only supports injecting the checkpointfilter on the source side (plan_op_read injects it into the read tasks).

In some environments, the primary key is not available on the source side. Allowing users to choose where the checkpointfilter is injected would be useful in those scenarios.

Use case

In our offline inference scenario, we select files based on a sampling strategy and then split each compressed file into many samples. The primary key is therefore file_name + sample_id, and the checkpointfilter should be injected after the samples are extracted.

Similarly, in a RAG pipeline, a PDF file may be split into many chunks, and file_name + chunk_id may be used to identify a chunk.
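
To make the request concrete, here is a minimal plain-Python sketch of the idea, not Ray Data's actual checkpoint implementation. All names (`extract_samples`, `primary_key`, `checkpoint_filter`, `run_pipeline`) are hypothetical, and the in-memory `processed` set stands in for whatever store the checkpointwriter would persist to. The point is only that the filter runs after the expansion step, once the composite key exists:

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Row = Dict[str, object]
Key = Tuple[str, int]


def extract_samples(file_row: Row) -> Iterator[Row]:
    """Hypothetical stand-in for splitting one compressed file into samples."""
    for sample_id, payload in enumerate(file_row["samples"]):
        yield {
            "file_name": file_row["file_name"],
            "sample_id": sample_id,
            "payload": payload,
        }


def primary_key(row: Row) -> Key:
    """Composite key that only exists after extraction: file_name + sample_id."""
    return (row["file_name"], row["sample_id"])


def checkpoint_filter(rows: Iterable[Row], processed: Set[Key]) -> Iterator[Row]:
    """Drop rows whose primary key was already recorded as processed."""
    for row in rows:
        if primary_key(row) not in processed:
            yield row


def run_pipeline(files: List[Row], processed: Set[Key]) -> List[Row]:
    """Read -> extract samples -> checkpoint filter -> inference -> checkpoint write."""
    outputs: List[Row] = []
    for file_row in files:
        # The filter is placed after extraction, not at the source (file) level.
        for row in checkpoint_filter(extract_samples(file_row), processed):
            # Placeholder "inference": upper-case the payload.
            outputs.append({**row, "result": str(row["payload"]).upper()})
            # Checkpoint write: record the key once the row is done.
            processed.add(primary_key(row))
    return outputs
```

For example, with `files = [{"file_name": "a.tar", "samples": ["x", "y"]}]` and `processed = {("a.tar", 0)}`, only the sample with key `("a.tar", 1)` is processed. A source-side filter could not make that distinction, because at read time only `file_name` is known. The RAG case is the same shape with `chunk_id` in place of `sample_id`.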

Metadata

Labels: community-backlog, data, enhancement, question, triage, usability
