Add function for asymptotic confidence sequences for bandit feedback#3998
Merged
…viraj/experimentation-config
Two constraint bugs:
- For the simplex constraint, had 1s everywhere instead of just at indices corresponding to the weights
- Signs were backwards in the SOCP constraints
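The simplex fix described above can be illustrated with a small sketch. It assumes a decision vector that holds the sampling weights plus some auxiliary variables; the function names and the variable layout are hypothetical and are not taken from the PR's code. The point is only that the equality row for "weights sum to 1" carries a 1 at weight positions and 0 elsewhere.

```rust
/// Build the coefficient row for the simplex equality constraint sum_i w_i = 1.
/// `num_vars` is the total number of decision variables (weights plus auxiliaries);
/// `weight_indices` lists the positions of the weights. Both are illustrative assumptions.
fn simplex_constraint_row(num_vars: usize, weight_indices: &[usize]) -> Vec<f64> {
    // Start from all zeros so auxiliary variables do not enter the constraint.
    let mut row = vec![0.0; num_vars];
    for &i in weight_indices {
        assert!(i < num_vars, "weight index out of range");
        row[i] = 1.0;
    }
    row
}

/// Evaluate the left-hand side of the constraint, row · x.
fn dot(row: &[f64], x: &[f64]) -> f64 {
    row.iter().zip(x).map(|(a, b)| a * b).sum()
}
```

With three weights followed by one auxiliary variable, `simplex_constraint_row(4, &[0, 1, 2])` leaves the last coefficient at zero, so the auxiliary value cannot distort the sum. A row of all 1s would instead force the weights and the auxiliary variable together to sum to 1.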
…zero/tensorzero into alan/bandits-confidence-sequences
virajmehta reviewed Oct 17, 2025
virajmehta reviewed Oct 17, 2025
virajmehta reviewed Oct 17, 2025
…nfidence-sequences
…nfidence-sequences
virajmehta approved these changes Oct 20, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 20, 2025
…3998)
* removed variant disabling from prepare_candidate_variants
* wip
* wip
* set up new variant config loading
* refactored initialization to set up samplers
* prod implementation seems correct, need to refactor tests too
* forgot a merge
* refactored tests into `experimentation`
* small fix to `prepare_candidate_variants`
* improved error handling for experimentation
* fixed tag version in experimentation
* refactored VariantSampler trait to be simpler
* fixed clippy
* cleanup
* fixed typing issues
* added test that samples from config
* config test should sample many times
* Add draft function for estimating optimal sampling probabilities
* Add function to check stopping condition
* Fix constraint bugs, set solver to non-verbose.
  Two constraint bugs:
  - For the simplex constraint, had 1s everywhere instead of just at indices corresponding to the weights
  - Signs were backwards in the SOCP constraints
* Add guard for edge case of equal means and variances
* Add unit tests
* Add TODOs to comments
* added a comment
* Add test cases with known solutions
* Add test cases to check_stopping
* added more informative comments
* added validation
* Change function signatures, refactor constraint construction.
  Change functions to accept vectors of FeedbackByVariants structs, rather than a struct of vectors. Construct the quadratic coefficient P matrix directly as a sparse matrix rather than by creating a dense matrix and then converting it to sparse format. Update tests to match new signatures.
* Removed unneeded argument struct
* refactored inference handler to pull infer_variant into a separate function
* refactored batch handler as well
* short circuited inference in the pinned / dynamic case
* fixed failing tests due to error handling improvement
* run merge queue tests
* removed merge queue tests
* wip
* set up config
* sketch is done
* added spawn
* wip
* added config loading logic
* Return probabilities in hashmap with variant names
* wip
* oops
* Slight change to avoid String cloning
* wip
* pulled sampling into helper function
* added postgres handling
* built bindings
* forgot a file
* Refactor function args into a struct, add tie handling for leader arm, add docstrings
* Refactor to use struct for function arg, add docstrings
* Rename ridge_variance to variance_floor for clarity
* Add TODO for choosing epsilon
* Log warning when arms are tied
* Rename arg and error structs with full function name
* Raise floor on pairwise info rate to avoid degeneracy, add tests to catch degeneracy
* Add comprehensive unit tests, log info and warnings about experiment state
* Add arguments validation for track and stop config
* Make sleep duration configurable
* Add integration test files
* Add helper functions and initial tests
* Clean up imports
* Add unit tests for convergence of estimated optimal probabilities.
  As sample means and variances converge, the estimated optimal probabilities should converge to the true optimum. This convergence may be nonmonotone in the convergence of the sample statistics due to the nonlinear optimization problem which is sensitive to the ordering of the sample means, so we average over multiple random runs to yield monotonicity with high probability.
* wip on integration tests
* Fix bug in sampling behavior in NurseryAndBandits state.
  sample_with_probabilities() in this state required a fresh `uniform_sample`, but this branch of the code was only being reached when `uniform_sample` was >= `nursery_probability`, leading to incorrect sampling.
* Filter out feedback variants so they don't enter bandit experiment. Also fix test that was failing due to previous bugfix.
* Create test helper to build embedded gateway with postgres and clean clickhouse database
* Add comprehensive integration tests
* made migrations automatic in test docker composes
* removed stray changes
* wip
* fixed issue with entrypoint
* fixed issue with file write
* fixed buildkite CI flow
* fixed chc test
* fixed database names
* fixed typing
* changed database name to tensorzero_e2e_tests in Python to match previous behavior
* Remove tests of test helper functions
* Fix incorrect merge conflict resolution
* Move static weights sampling function and accompanying tests
* Move import to top of file, remove unneeded line
* Add comments and docstrings
* Disable feedback validation to avoid test failures when feedback precedes inference logging
* Remove convergence tests, remove global locks, enable parallel test runs.
  Convergence tests require too much parallelism for clickhouse and postgres to handle, so we rely instead on unit tests in `estimate_optimal_probabilities.rs`. The global locks are no longer necessary due to a change elsewhere in the repo, so now tests can be run concurrently.
* Add support for optimize='min' in Track-and-Stop
* Remove test for optimize=min direction for now since that's not currently handled
* Change epsilon and delta test structure to speed them up
* Set global constant for sleep period when spawning a new client
* Revert changes in test_helpers
* Remove 'no_stopping' test: takes too long and this functionality is essentially already tested elsewhere
* Make VariantSampler setup argument generic for future use
* Change lifetime declaration for older version of clippy
* Add test for optimize=min direction
* Add unit tests for optimize=min direction
* wip: add sql query for variant feedback time series
* Change to supported clickhouse function name
* Change period_start to period_end, add comments for clarity
* Add tests for timeseries sql query
* Rebuild typescript bindings
* Initial implementation of asymptotic confidence sequences
* Fix CI clippy errors
* Fix field name: period_end -> period_start
* Make asymptotic cs computation correct and efficient
* Remove in-progress work that was accidentally included
* Wrap return type in Result, compute asympCS automatically when retrieving feedback time series
* Simplify sql query
* Fix handling of optional rho
* Added a join-free query to compute cumulative statistics (#4001)
* fixed performance issue in time series sql query
* fixed formatting issue
* fixed issue with sorting groupArray
* Change to parametric sql query
* Remove old commented out sql query
* Update tests to only expect data in periods where variants have new data
* Change time period to support aggregation by minute, hour, day, week, month
* Remove unnecessary to_string() calls
* Build new node bindings
* Build new TypeScript bindings
* Update old argument name
* Rename get_feedback_timeseries to get_cumulative_feedback_timeseries for clarity
* Rename FeedbackTimeSeriesPoint to CumulativeFeedbackTimeSeriesPoint for clarity
* Small tweak to avoid cloning

---------

Co-authored-by: Viraj Mehta <viraj@tensorzero.com>
Co-authored-by: Viraj Mehta <virajmehta@users.noreply.github.com>
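The NurseryAndBandits sampling bug mentioned in the commit log can be reproduced in a few lines. The sketch below is illustrative only: `pick_index`, `sample_two_stage`, `sample_reusing_draw`, and the `Choice` enum are hypothetical names, not the PR's actual code. It assumes a two-stage scheme in which one uniform draw decides between the nursery and the bandit, and a second, independent draw picks the bandit arm.

```rust
/// Outcome of one sampling step in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
enum Choice {
    Nursery,
    Bandit(usize),
}

/// Inverse-CDF sampling: return the first index whose cumulative probability exceeds `u`.
fn pick_index(probs: &[f64], u: f64) -> usize {
    assert!(!probs.is_empty(), "need at least one arm");
    let mut cumulative = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1 // guard against floating-point rounding
}

/// Two independent draws: `u_stage` picks nursery vs. bandit, `u_bandit` picks the arm.
fn sample_two_stage(u_stage: f64, u_bandit: f64, nursery_probability: f64, probs: &[f64]) -> Choice {
    if u_stage < nursery_probability {
        Choice::Nursery
    } else {
        Choice::Bandit(pick_index(probs, u_bandit))
    }
}

/// Biased variant: reuses the stage draw for the arm. On the bandit branch `u` is
/// confined to [nursery_probability, 1), so low-index arms are under-sampled.
fn sample_reusing_draw(u: f64, nursery_probability: f64, probs: &[f64]) -> Choice {
    if u < nursery_probability {
        Choice::Nursery
    } else {
        Choice::Bandit(pick_index(probs, u))
    }
}
```

With `nursery_probability = 0.5` and two equally likely arms, the reusing variant can never return arm 0, because every draw that reaches the bandit branch is already at least 0.5. Drawing a fresh uniform value for the second stage is one way to remove that bias; rescaling the conditional draw back to [0, 1) would be another.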
No description provided.