feat: optimize truncate(col) == value to startsWith predicate#289
feat: optimize truncate(col) == value to startsWith predicate#289shangxinli wants to merge 1 commit intoapache:mainfrom
Conversation
a831452 to
5cbeca0
Compare
5cbeca0 to
ab63064
Compare
|
There are some merge conflicts... |
dfb4504 to
8076d44
Compare
|
Fixed |
src/iceberg/expression/predicate.cc
Outdated
| /// This matches the behavior of TruncateUtils::TruncateUTF8. | ||
| /// | ||
| /// NOTE: This counts code points, not grapheme clusters (user-perceived characters). | ||
| /// Per the Iceberg spec, combining marks count as separate code points. |
There was a problem hiding this comment.
Iceberg spec does not have this. We should not mislead readers.
src/iceberg/expression/predicate.cc
Outdated
| /// | ||
| /// \param str The UTF-8 encoded string | ||
| /// \return The number of code points (not bytes, not graphemes) | ||
| inline int32_t CountUTF8CodePoints(std::string_view str) { |
There was a problem hiding this comment.
Perhaps this should be moved to string_util.h or truncate_util.h
| auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get()); | ||
| if (!transform_term) { | ||
| // Should never happen after kind check, but be defensive | ||
| return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term), | ||
| std::move(literal)); | ||
| } |
There was a problem hiding this comment.
| auto* transform_term = dynamic_cast<BoundTransform*>(bound_term.get()); | |
| if (!transform_term) { | |
| // Should never happen after kind check, but be defensive | |
| return std::make_shared<BoundLiteralPredicate>(BASE::op(), std::move(bound_term), | |
| std::move(literal)); | |
| } | |
| auto* transform_term = internal::checked_cast<BoundTransform*>(bound_term.get()); |
Let's reuse the existing pattern for cast.
src/iceberg/expression/predicate.cc
Outdated
| !literal.IsNull()) { // Null safety: skip null literals | ||
|
|
||
| // Extract width parameter using type-safe API | ||
| auto width_opt = transform_term->transform()->param(); |
There was a problem hiding this comment.
I think we can greatly simplify this line and below.
We can add transform_func like below.
class ICEBERG_EXPORT BoundTransform : public BoundTerm {
public:
const std::shared_ptr<TransformFunction>& transform_func() const { return transform_func_; }
};Then we just need to call transform_term->transform_func()->Transform(literal) and check if its return value is same as literal. We don't even need CountUTF8CodePoints above.
src/iceberg/transform.h
Outdated
| /// For non-parameterized transforms (identity, year, etc.), returns std::nullopt. | ||
| /// | ||
| /// \return The parameter if present, otherwise std::nullopt | ||
| std::optional<int32_t> param() const { |
There was a problem hiding this comment.
Can we remove this function? This makes it harder to evolve the variant in the future.
0aad922 to
1d01343
Compare
Implements optimization to rewrite truncate equality predicates as startsWith for better predicate pushdown and index usage. The optimization applies when: - Operation is equality - Term is a truncate transform on string column - Literal has exactly the truncate width in UTF-8 code points Implementation uses transform_func()->Transform() to validate that: 1. truncate(literal) == literal (literal is compatible) 2. truncate(literal + 'x') == literal (literal has exact width) This approach leverages the transform function without duplicating UTF-8 code point counting logic.
1d01343 to
7c51a61
Compare
Rewrite truncate(col) == "value" predicates to col STARTS_WITH "value" for string columns. This enables better predicate pushdown to storage formats and efficient use of prefix indexes.
The optimization only applies when:
Added tests to verify correct application and edge cases.