Added a join-free query to compute cumulative statistics #4001
💡 Codex Review
Here are some automated review suggestions for this pull request.
ArraysByVariant AS (
    SELECT
        variant_name,
        groupArray(period_end) AS periods,
        groupArray(merged_mean_state) AS mean_states,
        groupArray(merged_var_state) AS var_states,
        groupArray(period_count) AS counts
    FROM AggregatedFilteredFeedbackByVariantStatistics
Order period arrays before cumulative reduction
The new cumulative query relies on groupArray(...) to build a time series per variant and then uses arrayEnumerate with arraySlice(..., 1, i) to compute cumulative means, variances, and counts. However, groupArray does not guarantee element order: values are accumulated in whatever order the execution engine processes the rows of each group. If ClickHouse delivers a variant's rows out of chronological order (for example, across partitions or after parallel aggregation), each slice accumulates statistics over an arbitrary set of periods, and the final rows, although sorted by period_end, report incorrect cumulative values. Sort the arrays by period_end before applying the cumulative reductions, and keep all four arrays aligned while doing so. One option is to pass periods as the sort key to arraySort for each array, or to collect tuples and sort them once. Note that ClickHouse, as far as I am aware, does not accept an ORDER BY clause inside an aggregate call such as groupArray(period_end ORDER BY period_end), and groupArraySorted orders each array by its own values, so applying it to each column separately would break the alignment between periods and their statistics.
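The ordering concern can be illustrated outside ClickHouse. The following is a minimal Python sketch, not the PR's query: it assumes each period is summarized as a count, a mean, and a sum of squared deviations, and it merges summaries with the standard pairwise mean/variance update. The names `PeriodStats`, `merge`, `cumulative`, and `cumulative_sorted` are hypothetical, and every period is assumed to have a positive count.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PeriodStats:
    """Per-period summary for one variant (hypothetical shape, not the PR's schema)."""
    period_end: str  # ISO date, so string order matches chronological order
    count: int       # assumed to be positive
    mean: float
    m2: float        # sum of squared deviations from the period's own mean


def merge(a: Tuple[int, float, float], b: Tuple[int, float, float]) -> Tuple[int, float, float]:
    """Merge two (count, mean, m2) summaries with the pairwise update formula."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def cumulative(rows: List[PeriodStats]) -> List[Tuple[str, int, float, Optional[float]]]:
    """Prefix-reduce rows in the order given, like slicing the first i array elements.

    Returns (period_end, cumulative count, cumulative mean, sample variance) per row.
    """
    out = []
    acc: Optional[Tuple[int, float, float]] = None
    for row in rows:
        current = (row.count, row.mean, row.m2)
        acc = current if acc is None else merge(acc, current)
        n, mean, m2 = acc
        variance = m2 / (n - 1) if n > 1 else None
        out.append((row.period_end, n, mean, variance))
    return out


def cumulative_sorted(rows: List[PeriodStats]) -> List[Tuple[str, int, float, Optional[float]]]:
    """Sort by period_end first so every prefix covers exactly the earlier periods."""
    return cumulative(sorted(rows, key=lambda r: r.period_end))


rows = [
    PeriodStats("2024-01-01", 2, 1.0, 0.0),
    PeriodStats("2024-01-02", 2, 3.0, 0.0),
    PeriodStats("2024-01-03", 4, 5.0, 0.0),
]
shuffled = [rows[2], rows[0], rows[1]]
print(cumulative(shuffled)[1][:2])         # ('2024-01-01', 6): the prefix already holds January 3
print(cumulative_sorted(shuffled)[0][:2])  # ('2024-01-01', 2)
```

The final totals agree in both orders because the merge is order-independent; only the intermediate rows, which are what a time-series chart plots, depend on the order of the input.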
* removed variant disabling from prepare_candidate_variants
* wip
* wip
* set up new variant config loading
* refactored initialization to set up samplers
* prod implementation seems correct, need to refactor tests too
* forgot a merge
* refactored tests into `experimentation`
* small fix to `prepare_candidate_variants`
* improved error handling for experimentation
* fixed tag version in experimentation
* refactored VariantSampler trait to be simpler
* fixed clippy
* cleanup
* fixed typing issues
* added test that samples from config
* config test should sample many times
* Add draft function for estimating optimal sampling probabilities
* Add function to check stopping condition
* Fix constraint bugs, set solver to non-verbose. Two constraint bugs:
  - For the simplex constraint, had 1s everywhere instead of just at indices corresponding to the weights
  - Signs were backwards in the SOCP constraints
* Add guard for edge case of equal means and variances
* Add unit tests
* Add TODOs to comments
* added a comment
* Add test cases with known solutions
* Add test cases to check_stopping
* added more informative comments
* added validation
* Change function signatures, refactor constraint construction. Change functions to accept vectors of FeedbackByVariants structs, rather than a struct of vectors. Construct the quadratic coefficient P matrix directly as a sparse matrix rather than by creating a dense matrix and then converting it to sparse format. Update tests to match new signatures.
* Removed unneeded argument struct
* refactored inference handler to pull infer_variant into a separate function
* refactored batch handler as well
* short circuited inference in the pinned / dynamic case
* fixed failing tests due to error handling improvement
* run merge queue tests
* removed merge queue tests
* wip
* set up config
* sketch is done
* added spawn
* wip
* added config loading logic
* Return probabilities in hashmap with variant names
* wip
* oops
* Slight change to avoid String cloning
* wip
* pulled sampling into helper function
* added postgres handling
* built bindings
* forgot a file
* Refactor function args into a struct, add tie handling for leader arm, add docstrings
* Refactor to use struct for function arg, add docstrings
* Rename ridge_variance to variance_floor for clarity
* Add TODO for choosing epsilon
* Log warning when arms are tied
* Rename arg and error structs with full function name
* Raise floor on pairwise info rate to avoid degeneracy, add tests to catch degeneracy
* Add comprehensive unit tests, log info and warnings about experiment state
* Add arguments validation for track and stop config
* Make sleep duration configurable
* Add integration test files
* Add helper functions and initial tests
* Clean up imports
* Add unit tests for convergence of estimated optimal probabilities. As sample means and variances converge, the estimated optimal probabilities should converge to the true optimum. This convergence may be nonmonotone in the convergence of the sample statistics due to the nonlinear optimization problem which is sensitive to the ordering of the sample means, so we average over multiple random runs to yield monotonicity with high probability.
* wip on integration tests
* Fix bug in sampling behavior in NurseryAndBandits state. sample_with_probabilities() in this state required a fresh `uniform_sample`, but this branch of the code was only being reached when `uniform_sample` was >= `nursery_probability`, leading to incorrect sampling.
* Filter out feedback variants so they don't enter bandit experiment. Also fix test that was failing due to previous bugfix.
* Create test helper to build embedded gateway with postgres and clean clickhouse database
* Add comprehensive integration tests
* made migrations automatic in test docker composes
* removed stray changes
* wip
* fixed issue with entrypoint
* fixed issue with file write
* fixed buildkite CI flow
* fixed chc test
* fixed database names
* fixed typing
* changed database name to tensorzero_e2e_tests in Python to match previous behavior
* Remove tests of test helper functions
* Fix incorrect merge conflict resolution
* Move static weights sampling function and accompanying tests
* Move import to top of file, remove unneeded line
* Add comments and docstrings
* Disable feedback validation to avoid test failures when feedback precedes inference logging
* Remove convergence tests, remove global locks, enable parallel test runs. Convergence tests require too much parallelism for clickhouse and postgres to handle, so we rely instead on unit tests in `estimate_optimal_probabilities.rs`. The global locks are no longer necessary due to a change elsewhere in the repo, so now tests can be run concurrently.
* Add support for optimize='min' in Track-and-Stop
* Remove test for optimize=min direction for now since that's not currently handled
* Change epsilon and delta test structure to speed them up
* Set global constant for sleep period when spawning a new client
* Revert changes in test_helpers
* Remove 'no_stopping' test: takes too long and this functionality is essentially already tested elsewhere
* Make VariantSampler setup argument generic for future use
* Change lifetime declaration for older version of clippy
* Add test for optimize=min direction
* Add unit tests for optimize=min direction
* wip: add sql query for variant feedback time series
* Change to supported clickhouse function name
* Change period_start to period_end, add comments for clarity
* Add tests for timeseries sql query
* Rebuild typescript bindings
* Fix CI clippy errors
* Fix field name: period_end -> period_start
* Remove in-progress work that was accidentally included
* Simplify sql query
* Added a join-free query to compute cumulative statistics (#4001)
* fixed performance issue in time series sql query
* fixed formatting issue
* fixed issue with sorting groupArray
* Change to parametric sql query
* Remove old commented out sql query
* Update tests to only expect data in periods where variants have new data
* Change time period to support aggregation by minute, hour, day, week, month
* Remove unnecessary to_string() calls
* Build new node bindings
* Update old argument name
* Rename get_feedback_timeseries to get_cumulative_feedback_timeseries for clarity
* Rename FeedbackTimeSeriesPoint to CumulativeFeedbackTimeSeriesPoint for clarity

---------

Co-authored-by: Viraj Mehta <viraj@tensorzero.com>
Co-authored-by: Viraj Mehta <virajmehta@users.noreply.github.com>
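The `NurseryAndBandits` fix in the log above describes a general pitfall: a uniform draw that has already been compared against a threshold is no longer uniform on [0, 1) inside the branch it selected. The following is a small Python sketch of that pitfall, written for illustration only (the gateway itself is Rust). The function names other than `sample_with_probabilities` are hypothetical, and the body of `sample_with_probabilities` is an assumed inverse-CDF implementation, not the project's code.

```python
import random
from typing import Dict


def sample_with_probabilities(probs: Dict[str, float], u: float) -> str:
    """Inverse-CDF sampling; unbiased only if u is uniform on [0, 1)."""
    if not probs:
        raise ValueError("probs must not be empty")
    cumulative = 0.0
    for name, p in probs.items():
        cumulative += p
        if u < cumulative:
            return name
    return name  # rounding fallback when the probabilities sum to slightly under 1


def sample_reusing_draw(nursery: str, bandit_probs: Dict[str, float],
                        nursery_probability: float, u: float) -> str:
    """Buggy pattern: the bandit branch reuses a draw known to be >= nursery_probability."""
    if u < nursery_probability:
        return nursery
    return sample_with_probabilities(bandit_probs, u)


def sample_with_fresh_draw(nursery: str, bandit_probs: Dict[str, float],
                           nursery_probability: float, u: float,
                           rng: random.Random) -> str:
    """Corrected pattern: the bandit branch takes a fresh uniform draw."""
    if u < nursery_probability:
        return nursery
    return sample_with_probabilities(bandit_probs, rng.random())


rng = random.Random(0)
bandit = {"a": 0.5, "b": 0.5}
draws = [rng.random() for _ in range(1000)]
buggy = [sample_reusing_draw("new", bandit, 0.5, u) for u in draws]
fixed = [sample_with_fresh_draw("new", bandit, 0.5, u, rng) for u in draws]
print(buggy.count("a"), fixed.count("a"))  # the first number is always 0
```

With a nursery probability of 0.5, the reused draw is never below 0.5 in the bandit branch, so variant `a` is never chosen. Rescaling the draw to `(u - nursery_probability) / (1 - nursery_probability)` would be another way to restore uniformity; the log only states that a fresh sample was required.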
* removed variant disabling from prepare_candidate_variants
* wip
* wip
* set up new variant config loading
* refactored initialization to set up samplers
* prod implementation seems correct, need to refactor tests too
* forgot a merge
* refactored tests into `experimentation`
* small fix to `prepare_candidate_variants`
* improved error handling for experimentation
* fixed tag version in experimentation
* refactored VariantSampler trait to be simpler
* fixed clippy
* cleanup
* fixed typing issues
* added test that samples from config
* config test should sample many times
* Add draft function for estimating optimal sampling probabilities
* Add function to check stopping condition
* Fix constraint bugs, set solver to non-verbose. Two constraint bugs:
  - For the simplex constraint, had 1s everywhere instead of just at indices corresponding to the weights
  - Signs were backwards in the SOCP constraints
* Add guard for edge case of equal means and variances
* Add unit tests
* Add TODOs to comments
* added a comment
* Add test cases with known solutions
* Add test cases to check_stopping
* added more informative comments
* added validation
* Change function signatures, refactor constraint construction. Change functions to accept vectors of FeedbackByVariants structs, rather than a struct of vectors. Construct the quadratic coefficient P matrix directly as a sparse matrix rather than by creating a dense matrix and then converting it to sparse format. Update tests to match new signatures.
* Removed unneeded argument struct
* refactored inference handler to pull infer_variant into a separate function
* refactored batch handler as well
* short circuited inference in the pinned / dynamic case
* fixed failing tests due to error handling improvement
* run merge queue tests
* removed merge queue tests
* wip
* set up config
* sketch is done
* added spawn
* wip
* added config loading logic
* Return probabilities in hashmap with variant names
* wip
* oops
* Slight change to avoid String cloning
* wip
* pulled sampling into helper function
* added postgres handling
* built bindings
* forgot a file
* Refactor function args into a struct, add tie handling for leader arm, add docstrings
* Refactor to use struct for function arg, add docstrings
* Rename ridge_variance to variance_floor for clarity
* Add TODO for choosing epsilon
* Log warning when arms are tied
* Rename arg and error structs with full function name
* Raise floor on pairwise info rate to avoid degeneracy, add tests to catch degeneracy
* Add comprehensive unit tests, log info and warnings about experiment state
* Add arguments validation for track and stop config
* Make sleep duration configurable
* Add integration test files
* Add helper functions and initial tests
* Clean up imports
* Add unit tests for convergence of estimated optimal probabilities. As sample means and variances converge, the estimated optimal probabilities should converge to the true optimum. This convergence may be nonmonotone in the convergence of the sample statistics due to the nonlinear optimization problem which is sensitive to the ordering of the sample means, so we average over multiple random runs to yield monotonicity with high probability.
* wip on integration tests
* Fix bug in sampling behavior in NurseryAndBandits state. sample_with_probabilities() in this state required a fresh `uniform_sample`, but this branch of the code was only being reached when `uniform_sample` was >= `nursery_probability`, leading to incorrect sampling.
* Filter out feedback variants so they don't enter bandit experiment. Also fix test that was failing due to previous bugfix.
* Create test helper to build embedded gateway with postgres and clean clickhouse database
* Add comprehensive integration tests
* made migrations automatic in test docker composes
* removed stray changes
* wip
* fixed issue with entrypoint
* fixed issue with file write
* fixed buildkite CI flow
* fixed chc test
* fixed database names
* fixed typing
* changed database name to tensorzero_e2e_tests in Python to match previous behavior
* Remove tests of test helper functions
* Fix incorrect merge conflict resolution
* Move static weights sampling function and accompanying tests
* Move import to top of file, remove unneeded line
* Add comments and docstrings
* Disable feedback validation to avoid test failures when feedback precedes inference logging
* Remove convergence tests, remove global locks, enable parallel test runs. Convergence tests require too much parallelism for clickhouse and postgres to handle, so we rely instead on unit tests in `estimate_optimal_probabilities.rs`. The global locks are no longer necessary due to a change elsewhere in the repo, so now tests can be run concurrently.
* Add support for optimize='min' in Track-and-Stop
* Remove test for optimize=min direction for now since that's not currently handled
* Change epsilon and delta test structure to speed them up
* Set global constant for sleep period when spawning a new client
* Revert changes in test_helpers
* Remove 'no_stopping' test: takes too long and this functionality is essentially already tested elsewhere
* Make VariantSampler setup argument generic for future use
* Change lifetime declaration for older version of clippy
* Add test for optimize=min direction
* Add unit tests for optimize=min direction
* wip: add sql query for variant feedback time series
* Change to supported clickhouse function name
* Change period_start to period_end, add comments for clarity
* Add tests for timeseries sql query
* Rebuild typescript bindings
* Initial implementation of asymptotic confidence sequences
* Fix CI clippy errors
* Fix field name: period_end -> period_start
* Make asymptotic cs computation correct and efficient
* Remove in-progress work that was accidentally included
* Wrap return type in Result, compute asympCS automatically when retrieving feedback time series
* Simplify sql query
* Fix handling of optional rho
* Added a join-free query to compute cumulative statistics (#4001)
* fixed performance issue in time series sql query
* fixed formatting issue
* fixed issue with sorting groupArray
* Change to parametric sql query
* Remove old commented out sql query
* Update tests to only expect data in periods where variants have new data
* Change time period to support aggregation by minute, hour, day, week, month
* Remove unnecessary to_string() calls
* Build new node bindings
* Build new TypeScript bindings
* Expose estimate_optimal_probabilities function to node
* Update old argument name
* Rename get_feedback_timeseries to get_cumulative_feedback_timeseries for clarity
* Rename FeedbackTimeSeriesPoint to CumulativeFeedbackTimeSeriesPoint for clarity
* Small tweak to avoid cloning
* Fix numeric types in integration tests
* Add node test for getFeedbackByVariant
* Remove some comments
* Move `<div>` outside `<p>` so server-rendered html doesn't get discarded
* Add pie chart with estimated optimal probabilities
* Add Postgres to docker, add track-and-stop to fixtures
* Change struct and function names to include TrackAndStop
* Update function name to include TrackAndStop
* Update track-and-stop pie chart to include probabilities for nursery variants
* wip - add feedback time series chart
* Replace TimeWindowUnit with TimeWindow
* wip - add feedback timeseries chart
* Change metric, add minutes and hours to dropdown
* Remove useState for URL parameters. Read metric_name and time granularities directly from searchParams instead of maintaining local state, ensuring URL is single source of truth.
* Make variance field nullable to accommodate ClickHouse using sample variance
* Add new fixture with feedback data distributed in time
* Make pie chart fall back to uniform weights if estimated optimal sampling probabilities are undefined
* Use memo and probability sorting to prevent needless pie chart re-rendering
* Propagate cumulative values forward in time so feedback samples chart renders correctly
* Remove year from x-axis labels on Hourly chart, remove unnecessary string normalization
* Add chart title and dropdown menu width constraint
* Revert docker file
* Fix docker compose file (merged wrong version)
* Add postgres to tensorzero-node ui test config
* Pass postgres environment variable through
* Cast feedback count to number for type consistency
* Use pre-existing utility function to format chart ticks
* Factor date/time functionality out into new util functions
* Remove unneeded exports, add docstrings, clean up sorting behavior
* Remove default export and unnecessary error, clean up feedbackTimeSeriesPromise
* Rename variable to something more informative
* Factor out optimal probability computation into separate utility function
* Fix typo in docstring

---------

Co-authored-by: Viraj Mehta <viraj@tensorzero.com>
Co-authored-by: Viraj Mehta <virajmehta@users.noreply.github.com>
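Two entries in the log above interact: the query returns rows only for periods in which a variant has new data, and the chart then propagates cumulative values forward in time. The following is a minimal Python sketch of that forward fill, for illustration only (the UI code is not shown in this document). The function name `forward_fill` and the period labels are hypothetical.

```python
from typing import Dict, List, Optional


def forward_fill(periods: List[str], sparse: Dict[str, int]) -> List[Optional[int]]:
    """Carry the latest cumulative value forward across periods with no new data.

    `periods` must be in chronological order. Periods before the first
    observation stay None because no cumulative value exists yet.
    """
    filled: List[Optional[int]] = []
    last: Optional[int] = None
    for period in periods:
        if period in sparse:
            last = sparse[period]
        filled.append(last)
    return filled


periods = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
cumulative_counts = {"2024-01-02": 3, "2024-01-04": 7}
print(forward_fill(periods, cumulative_counts))  # [None, 3, 3, 7]
```

Because the values are cumulative, carrying the last one forward is exact rather than an approximation: a period without new feedback leaves the running count unchanged.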
* removed variant disabling from prepare_candidate_variants * wip * wip * set up new variant config loading * refactored initialization to set up samplers * prod implementation seems correct, need to refactor tests too * forgot a merge * refactored tests into `experimentation` * small fix to `prepare_candidate_variants` * improved error handling for experimentation * fixed tag version in experimentation * refactored VariantSampler trait to be simpler * fixed clippy * cleanup * fixed typing issues * added test that samples from config * config test should sample many times * Add draft function for estimating optimal sampling probabilities * Add function to check stopping condition * Fix constraint bugs, set solver to non-verbose. Two constraint bugs: - For the simplex constraint, had 1s everywhere instead of just at indices corresponding to the weights - Signs were backwards in the SOCP constraints * Add guard for edge case of equal means and variances * Add unit tests * Add TODOs to comments * added a comment * Add test cases with known solutions * Add test cases to check_stopping * added more informative comments * added validation * Change function signatures, refactor constraint construction. Change functions to accept vectors of FeedbackByVariants structs, rather than a struct of vectors. Construct the quadratic coefficient P matrix directly as a sparse matrix rather than by creating a dense matrix and then converting it to sparse format. Update tests to match new signatures. 
* Removed unneeded argument struct
* refactored inference handler to pull infer_variant into a separate function
* refactored batch handler as well
* short circuited inference in the pinned / dynamic case
* fixed failing tests due to error handling improvement
* run merge queue tests
* removed merge queue tests
* wip
* set up config
* sketch is done
* added spawn
* wip
* added config loading logic
* Return probabilities in hashmap with variant names
* wip
* oops
* Slight change to avoid String cloning
* wip
* pulled sampling into helper function
* added postgres handling
* built bindings
* forgot a file
* Refactor function args into a struct, add tie handling for leader arm, add docstrings
* Refactor to use struct for function arg, add docstrings
* Rename ridge_variance to variance_floor for clarity
* Add TODO for choosing epsilon
* Log warning when arms are tied
* Rename arg and error structs with full function name
* Raise floor on pairwise info rate to avoid degeneracy, add tests to catch degeneracy
* Add comprehensive unit tests, log info and warnings about experiment state
* Add arguments validation for track and stop config
* Make sleep duration configurable
* Add integration test files
* Add helper functions and initial tests
* Clean up imports
* Add unit tests for convergence of estimated optimal probabilities.

  As sample means and variances converge, the estimated optimal probabilities should converge to the true optimum. This convergence may be nonmonotone in the convergence of the sample statistics due to the nonlinear optimization problem which is sensitive to the ordering of the sample means, so we average over multiple random runs to yield monotonicity with high probability.
* wip on integration tests
* Fix bug in sampling behavior in NurseryAndBandits state.

  sample_with_probabilities() in this state required a fresh `uniform_sample`, but this branch of the code was only being reached when `uniform_sample` was >= `nursery_probability`, leading to incorrect sampling.
* Filter out feedback variants so they don't enter bandit experiment.

  Also fix test that was failing due to previous bugfix.
* Create test helper to build embedded gateway with postgres and clean clickhouse database
* Add comprehensive integration tests
* made migrations automatic in test docker composes
* removed stray changes
* wip
* fixed issue with entrypoint
* fixed issue with file write
* fixed buildkite CI flow
* fixed chc test
* fixed database names
* fixed typing
* changed database name to tensorzero_e2e_tests in Python to match previous behavior
* Remove tests of test helper functions
* Fix incorrect merge conflict resolution
* Move static weights sampling function and accompanying tests
* Move import to top of file, remove unneeded line
* Add comments and docstrings
* Disable feedback validation to avoid test failures when feedback precedes inference logging
* Remove convergence tests, remove global locks, enable parallel test runs.

  Convergence tests require too much parallelism for clickhouse and postgres to handle, so we rely instead on unit tests in `estimate_optimal_probabilities.rs`. The global locks are no longer necessary due to a change elsewhere in the repo, so now tests can be run concurrently.
* Add support for optimize='min' in Track-and-Stop
* Remove test for optimize=min direction for now since that's not currently handled
* Change epsilon and delta test structure to speed them up
* Set global constant for sleep period when spawning a new client
* Revert changes in test_helpers
* Remove 'no_stopping' test: takes too long and this functionality is essentially already tested elsewhere
* Make VariantSampler setup argument generic for future use
* Change lifetime declaration for older version of clippy
* Add test for optimize=min direction
* Add unit tests for optimize=min direction
* wip: add sql query for variant feedback time series
* Change to supported clickhouse function name
* Change period_start to period_end, add comments for clarity
* Add tests for timeseries sql query
* Rebuild typescript bindings
* Initial implementation of asymptotic confidence sequences
* Fix CI clippy errors
* Fix field name: period_end -> period_start
* Make asymptotic cs computation correct and efficient
* Remove in-progress work that was accidentally included
* Wrap return type in Result, compute asympCS automatically when retrieving feedback time series
* Simplify sql query
* Fix handling of optional rho
* Added a join-free query to compute cumulative statistics (#4001)
* fixed performance issue in time series sql query
* fixed formatting issue
* fixed issue with sorting groupArray
* Change to parametric sql query
* Remove old commented out sql query
* Update tests to only expect data in periods where variants have new data
* Change time period to support aggregation by minute, hour, day, week, month
* Remove unnecessary to_string() calls
* Build new node bindings
* Build new TypeScript bindings
* Expose estimate_optimal_probabilities function to node
* Update old argument name
* Rename get_feedback_timeseries to get_cumulative_feedback_timeseries for clarity
* Rename FeedbackTimeSeriesPoint to CumulativeFeedbackTimeSeriesPoint for clarity
* Small tweak to avoid cloning
* Fix numeric types in integration tests
* Add node test for getFeedbackByVariant
* Remove some comments
* Move <div> outside <p> so server-rendered html doesn't get discarded
* Add pie chart with estimated optimal probabilities
* Add Postgres to docker, add track-and-stop to fixtures
* Change struct and function names to include TrackAndStop
* Update function name to include TrackAndStop
* Update track-and-stop pie chart to include probabilities for nursery variants
* wip - add feedback time series chart
* Replace TimeWindowUnit with TimeWindow
* wip - add feedback timeseries chart
* Change metric, add minutes and hours to dropdown
* Remove useState for URL parameters

  Read metric_name and time granularities directly from searchParams instead of maintaining local state, ensuring URL is single source of truth.
* Make variance field nullable to accommodate ClickHouse using sample variance
* Add new fixture with feedback data distributed in time
* Make pie chart fall back to uniform weights if estimated optimal sampling probabilities are undefined
* Use memo and probability sorting to prevent needless pie chart re-rendering
* Propagate cumulative values forward in time so feedback samples chart renders correctly
* Remove year from x-axis labels on Hourly chart, remove unnecessary string normalization
* Add chart title and dropdown menu width constraint
* wip: add rewards inference time series chart
* Revert unnecessary change
* Version with just means now rendering lines
* Remove temporary filler for null values
* Remove debugging artifacts
* Remove extra postgres from merge
* Add confidence sequences to mean rewards
* fixed import issue
* wip
* refactored feedbackTimeSeries to be part of FunctionExperimentation
* use a hook for time granularity in URL state
* separated charts into independent components
* added missing files
* pulled time selection and data transformation out of each chart and avoided duplicating logic
* added tab layout for track and stop charts
* added better descriptions
* added playwright test
* changed mean reward -> mean feedback estimates
* added handling for minute-level charts
* consolidate timeseries axis formatting
* cleaned up tooltip
* sort variants so they get consistent colors
* Update copy/formatting
* Update copy/formatting
* fixed playwright test

---------

Co-authored-by: Alan Mishler <alan@tensorzero.com>
Co-authored-by: Gabriel Bianconi <1275491+GabrielBianconi@users.noreply.github.com>
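The NurseryAndBandits sampling fix described in the commit message above can be illustrated with a small sketch. This is not the gateway's Rust code: `pick_by_cumulative`, `sample_buggy`, and `sample_fixed` are hypothetical names, and only `uniform_sample` and `nursery_probability` come from the commit text. It assumes inverse-CDF selection over the bandit probabilities and shows why reusing a draw that is already known to be >= `nursery_probability` skews the result. The fixed variant rescales the draw so the example stays deterministic; the actual fix may simply draw a fresh sample instead.

```python
from typing import Dict


def pick_by_cumulative(probs: Dict[str, float], u: float) -> str:
    """Inverse-CDF selection: return the first variant whose cumulative mass exceeds u."""
    if not probs:
        raise ValueError("probs must not be empty")
    total = 0.0
    name = ""
    for name, p in probs.items():
        total += p
        if u < total:
            return name
    return name  # guard against rounding when u is very close to 1


def sample_buggy(u: float, nursery_probability: float, nursery: str,
                 bandit_probs: Dict[str, float]) -> str:
    """Hypothetical reconstruction of the bug: the same draw is used twice."""
    if u < nursery_probability:
        return nursery
    # Here u is conditioned on u >= nursery_probability, so variants whose
    # cumulative mass lies below nursery_probability can never be chosen.
    return pick_by_cumulative(bandit_probs, u)


def sample_fixed(u: float, nursery_probability: float, nursery: str,
                 bandit_probs: Dict[str, float]) -> str:
    """One possible fix (an assumption): map the conditioned draw back onto [0, 1)."""
    if u < nursery_probability:
        return nursery
    # Only reached when u >= nursery_probability, and u < 1, so the divisor is positive.
    rescaled = (u - nursery_probability) / (1.0 - nursery_probability)
    return pick_by_cumulative(bandit_probs, rescaled)
```

With `bandit_probs = {"a": 0.5, "b": 0.5}` and `nursery_probability = 0.6`, the buggy version can only ever return `"b"` from the bandit branch, while the rescaled version reaches both variants.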
Important
Replaces the cumulative statistics query with a window function-based implementation in `select_queries.rs` for improved performance and accuracy.
- Removes the `time_buckets` CTE and self-join logic.
- Uses `AggregatedFilteredFeedbackByVariantStatistics`, `ArraysByVariant`, `AllCumulativeStats`, and `FilteredCumulativeStats` for efficient data processing.

This description was created for a4ef154.
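As a rough illustration of what a join-free cumulative computation can look like outside SQL, the sketch below sorts per-period aggregates by `period_end` and folds them left to right with the standard pairwise merge for counts, means, and sums of squared deviations. It is not the query in `select_queries.rs`: `PeriodStats`, `merge`, and `cumulative_stats` are hypothetical names, and the per-period inputs are assumed to be already aggregated. The sample variance is reported as `None` until at least two observations have accumulated.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PeriodStats:
    """Per-period aggregate for one variant (hypothetical shape)."""
    period_end: str  # ISO-8601 timestamp, so string order is chronological
    count: int
    mean: float
    m2: float  # sum of squared deviations from the period mean


def merge(a: PeriodStats, b: PeriodStats) -> PeriodStats:
    """Combine two aggregates without revisiting the raw observations."""
    n = a.count + b.count
    if n == 0:
        return PeriodStats(b.period_end, 0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return PeriodStats(b.period_end, n, mean, m2)


def cumulative_stats(
    periods: Iterable[PeriodStats],
) -> List[Tuple[str, int, float, Optional[float]]]:
    """Return (period_end, cumulative count, cumulative mean, sample variance) rows."""
    # Sort explicitly: the running fold is only meaningful in chronological order.
    ordered = sorted(periods, key=lambda p: p.period_end)
    rows: List[Tuple[str, int, float, Optional[float]]] = []
    acc: Optional[PeriodStats] = None
    for period in ordered:
        acc = period if acc is None else merge(acc, period)
        variance = acc.m2 / (acc.count - 1) if acc.count > 1 else None
        rows.append((period.period_end, acc.count, acc.mean, variance))
    return rows
```

Each period is visited once, so the cost is linear in the number of periods after sorting, whereas a self-join over time buckets pairs every period with all earlier ones. The explicit sort plays the same role as ordering the arrays before the cumulative reduction, which the review comment above asks for.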