I have a clustered ClickHouse instance and I'm observing the following behavior. I delete data on every node with a statement such as alter table db.tb on cluster cl1 delete where event_date = 20231212. The delete statement itself returns very quickly, but when I then select from the table with event_date = 20231212, the data is still there (the row count slowly shrinks each time I repeat the check). Right after the deletion I need to load data with the same event_date = 20231212, but I don't see any mechanism that protects me from rows that are being deleted but are not deleted yet, so I could end up with a mix of deleted and newly inserted data. Is there a way to avoid this potential problem?

1 Answer

Assuming you are using a *MergeTree table engine, deletions and updates (aka mutations) in ClickHouse are performed asynchronously, by background processes similar to merges [https://clickhouse.com/docs/en/sql-reference/statements/alter#mutations]

If you need the ALTER TABLE ... DELETE to be finished by the time the statement returns, the mutations_sync setting is probably what you want [https://clickhouse.com/docs/en/operations/settings/settings#mutations_sync]

Add SETTINGS mutations_sync = 1 (wait for the mutation to complete on the current server) or SETTINGS mutations_sync = 2 (wait for it to complete on all replicas) to your DELETE statement so that the deletion is performed synchronously.
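
As a rough sketch, this is how the setting might be attached to the statement from the question. The table db.tb, the cluster cl1 and the event_date literal are copied from the question unchanged. The value 2 assumes a replicated table, and how the setting behaves together with ON CLUSTER is worth verifying on your ClickHouse version.

```sql
-- Sketch only: block until the DELETE mutation has been applied.
-- mutations_sync = 2 waits for all replicas; 1 waits for the current server only.
ALTER TABLE db.tb ON CLUSTER cl1
    DELETE WHERE event_date = 20231212
    SETTINGS mutations_sync = 2;
```

With this in place, the INSERT for the same event_date would be issued only after the statement above has returned.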

If running asynchronous mutations, you can also check the system.mutations table to determine if the mutations are complete:

SELECT mutation_id,*
FROM clusterAllReplicas('default',system.mutations)
WHERE is_done = 0;
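
To narrow the check to a single table, a variant along these lines may be more practical. The cluster name cl1 and the table db.tb are assumptions taken from the question; the selected columns are regular columns of system.mutations.

```sql
-- Sketch only: unfinished mutations for one table, on every replica of the cluster.
SELECT
    hostName() AS host,
    mutation_id,
    command,
    create_time,
    parts_to_do,          -- number of parts that still have to be rewritten
    latest_fail_reason    -- non-empty when the mutation keeps failing
FROM clusterAllReplicas('cl1', system.mutations)
WHERE database = 'db'
  AND table = 'tb'
  AND is_done = 0
ORDER BY create_time;
```

An empty result suggests that no mutation is pending for that table, so the reload can start.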

4 Comments

DELETE FROM removes rows from the table [db.]table that match the expression expr. The deleted rows are marked as deleted immediately and will be automatically filtered out of all subsequent queries. Cleanup of data happens asynchronously in the background. This feature is only available for the MergeTree table engine family.
This is from the official docs. When I run DELETE * from T1 ON CLUSTER C1, the statement completes successfully and quickly. When I then run SELECT COUNT(*) FROM T1, I still see the deleted rows; the count goes down slowly. Per the official docs, these rows should be marked as "deleted" and filtered out of all subsequent queries, but as far as I can see this does not work. As I understand it, I don't need mutations_sync, because it doesn't matter to me when the rows marked as "deleted" are physically removed; the main goal is to keep deleted data from being picked up by any subsequent query.
And yes, I use the standard MergeTree engine.
The "DELETE FROM..." statement is for Lightweight Deletes (clickhouse.com/docs/en/sql-reference/statements/delete); the actual deletes still happen asynchronously in the background. If your cluster is behind on mutations, the operation that marks the rows as deleted may still be processing. Checking the system.mutations table as described above should help determine this, since you may still have mutations in the queue. You can also test by running a count() or a SELECT using the PREWHERE _row_exists method described here: clickhouse.com/docs/en/sql-reference/statements/…
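
A minimal sketch of the test suggested in the last comment, using the table, cluster and date from the question. It assumes a ClickHouse version with lightweight deletes in which the hidden _row_exists column can be referenced explicitly in a query; that may not hold on older releases.

```sql
-- Sketch only: lightweight delete, which masks rows instead of rewriting parts right away.
DELETE FROM db.tb ON CLUSTER cl1 WHERE event_date = 20231212;

-- Count only the rows that are still marked as existing for that date.
SELECT count()
FROM db.tb
PREWHERE _row_exists
WHERE event_date = 20231212;
```

If this count is still non-zero while system.mutations shows unfinished entries for the table, the masking step has most likely not been applied to every part yet.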
