Add UI elements for autoevals by virajmehta · Pull Request #7064 · tensorzero/tensorzero

virajmehta · 2026-03-24T22:57:30Z

Renders auto-eval example labeling events in the autopilot session footer, allowing users to label inference input/output pairs to calibrate auto-evaluations.

- New component renders inference data using InputElement/ChatOutputElement with smart detection of data shapes (inference input vs chat output vs raw JSON)
- Integrates into pending event queue alongside user_questions events
- Submits responses through existing answer-questions API endpoint
- Includes 9 Storybook stories covering edge cases (single/multi example, markdown context, many options, long conversations, loading state, etc.)
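
The "smart detection of data shapes" mentioned above could look roughly like the sketch below. The PR page does not show the real types, so `InferenceInput`, `ChatOutput`, `DetectedShape`, and `detectShape` are hypothetical stand-ins; only the three categories (inference input, chat output, raw JSON) come from the description.

```typescript
// Hypothetical shapes: the real types are not shown in this PR, so these
// are stand-ins used only to illustrate the detection order.
type InferenceInput = { messages: unknown[]; system?: unknown };
type ChatOutput = { type: string }[];

type DetectedShape =
  | { kind: "inference_input"; value: InferenceInput }
  | { kind: "chat_output"; value: ChatOutput }
  | { kind: "raw_json"; value: unknown };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Assumption: an inference input is an object with a `messages` array.
function isInferenceInput(value: unknown): value is InferenceInput {
  return isRecord(value) && Array.isArray(value["messages"]);
}

// Assumption: a chat output is a non-empty array of blocks with a string `type`.
function isChatOutput(value: unknown): value is ChatOutput {
  return (
    Array.isArray(value) &&
    value.length > 0 &&
    value.every((block) => isRecord(block) && typeof block["type"] === "string")
  );
}

// Try the most specific shapes first and fall back to raw JSON.
function detectShape(value: unknown): DetectedShape {
  if (isInferenceInput(value)) return { kind: "inference_input", value };
  if (isChatOutput(value)) return { kind: "chat_output", value };
  return { kind: "raw_json", value };
}
```

Falling back to raw JSON means unrecognized payloads still render instead of failing.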

The addon bundled its own react-router, creating duplicate instances. Context from the preview decorator's Router didn't reach useLocation in app components because they resolved to different copies. Replace the addon and partial provider stack with createMemoryRouter + RouterProvider + AppProviders — mirrors the real app provider tree and stays in sync automatically.

Fix capitalization to match the AutoEval convention everywhere: file names, component names, type names, imports.

- Extract OptionButton from duplicated button styling in MultipleChoiceStep and AutoEvalExampleLabeling
- Add exhaustive default: never branch to ContextBlock switch
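
For readers unfamiliar with the exhaustive `never` pattern, here is a minimal sketch. The `ContextBlock` variants and the `describeBlock` function are hypothetical; the actual union in the component is not shown on this page.

```typescript
// Hypothetical block kinds; the real ContextBlock union is not shown in the PR.
type ContextBlock =
  | { type: "markdown"; text: string }
  | { type: "json"; data: unknown };

function describeBlock(block: ContextBlock): string {
  switch (block.type) {
    case "markdown":
      return `markdown (${block.text.length} chars)`;
    case "json":
      return `json: ${JSON.stringify(block.data)}`;
    default: {
      // If a new variant is added to ContextBlock and not handled above,
      // `block` is no longer `never` here and this assignment fails to compile.
      const unhandled: never = block;
      throw new Error(`Unhandled context block: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

The compile-time check is the point of the pattern. The `throw` only covers values that bypass the type system at runtime.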

Both side-by-side context blocks now stretch to the taller one's height (capped at max) instead of shrinking to the shorter one. Reduces CONTEXT_MAX_HEIGHT from 150 to 120.

- Add ContentOverflow discriminated union (scroll | expandable) for InputElement and ChatOutputElement
- ScrollFadeContainer: remove negative margin hack, use gradient elements as natural vertical padding instead
- Auto-resizing textarea for rationale (1-3 rows, then scroll)
- Side-by-side context blocks now match heights via flex-1
- Flesh out storybook fixtures with realistic content and add edge-case size combination stories
- Fix nested scrollable regions with descendant selector overrides
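
A discriminated union like the `ContentOverflow` mentioned above might be shaped as follows. Only the two variant names (scroll, expandable) come from the commit message; the `mode` discriminant, the pixel fields, and the `overflowStyle` helper are assumptions made for illustration.

```typescript
// Hypothetical field names; the commit message only names the two variants.
type ContentOverflow =
  | { mode: "scroll"; maxHeightPx: number }
  | { mode: "expandable"; collapsedHeightPx: number };

type OverflowStyle = {
  maxHeight: string | null; // null means "no height cap"
  overflowY: "auto" | "hidden" | "visible";
};

// Map an overflow mode (plus the expanded flag, which only matters for the
// expandable variant) to the style values a container would apply.
function overflowStyle(overflow: ContentOverflow, expanded: boolean): OverflowStyle {
  switch (overflow.mode) {
    case "scroll":
      return { maxHeight: `${overflow.maxHeightPx}px`, overflowY: "auto" };
    case "expandable":
      return expanded
        ? { maxHeight: null, overflowY: "visible" }
        : { maxHeight: `${overflow.collapsedHeightPx}px`, overflowY: "hidden" };
  }
}
```

Because each variant carries only its own fields, a caller cannot pass a collapsed height to a scrolling container by mistake.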

…ts and cross-type stories

- Replace InputElement/ChatOutputElement with simpler CodeEditor for JSON blocks
- Add monospace font to markdown blocks
- Add keyboard shortcuts (1-9 for options, Enter to advance)
- Require all examples answered before submit, show progress counter
- Add auto-resizing textarea (1-3 rows) with overscroll-none
- Fix t0- prefix in fixture IDs
- Add 6 cross-type storybook stories (JSON+Markdown combinations)
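
The "require all examples answered before submit, show progress counter" rule can be expressed as a small pure function. This is a sketch under assumptions: the `Answers` map, the `LabelingProgress` shape, and the label wording are hypothetical and not taken from the component.

```typescript
// Hypothetical answer map: example id -> selected option id
// (undefined until the example has been answered).
type Answers = Record<string, string | undefined>;

interface LabelingProgress {
  answered: number;
  total: number;
  canSubmit: boolean;
  label: string;
}

function labelingProgress(
  exampleIds: readonly string[],
  answers: Answers,
): LabelingProgress {
  const total = exampleIds.length;
  const answered = exampleIds.filter((id) => answers[id] !== undefined).length;
  return {
    answered,
    total,
    // Submitting is allowed only once every example has an answer.
    canSubmit: total > 0 && answered === total,
    label: `${answered}/${total} labeled`,
  };
}
```

Keeping this logic outside the component makes the submit gate easy to unit test without rendering anything.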

…ex-fill layout

- Card border/header/close button use neutral tokens instead of orange
- Option buttons use orange selected state with equal-width flex layout
- Code blocks use CodeEditor with ScrollFadeContainer for both JSON and markdown
- Input/Output labels use MessageWrapper-style colored left-border pattern
- Card flexes to fill available viewport height (max-h-[70vh])
- Code blocks are the flex elements that grow/shrink as rationale appears
- Rationale textarea only appears after selecting an answer
- Remove visual testing stub from route file

…xt blocks

- Switch card color scheme to purple (border, header, labels, option buttons)
- Replace scroll overflow with ExpandableElement for code blocks
- Use ExpandableElement from input_output for consistent show more/less UX
- Question labels black, header purple, option buttons neutral with purple hover
- Lighten StepTab active background, keep completed tabs green
- Purple footer controls (back, labeled count, next button)

…onstants, drop invisible CM override

- White card bg with lighter purple border (matches autoeval card)
- Purple header text, matching X button and back/skip styling
- Black submit button, purple next button (consistent footer treatment)
- Wider footer gap between controls

- Extract QuestionCard component from duplicated card chrome in PendingQuestionCard and AutoEvalExampleLabeling (header, dismiss, step tabs, animated height wrapper, footer slot)
- Fix useAnimatedHeight: reset to height:auto after CSS transition so ExpandableElement and other dynamically-resizing content isn't clipped
- Both cards now get animated step transitions via the shared shell
- Trim stories from 21 to 10, removing redundant size-combination variants
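
The useAnimatedHeight fix described above (pin a pixel height for the transition, then release it to `auto`) can be sketched without React. This is not the hook from the PR: `HeightAnimatable` and `animateToContentHeight` are hypothetical names, and a real hook would also handle the starting height and cleanup.

```typescript
// Minimal structural stand-in for the parts of an element used here,
// so the sketch can run outside a browser.
interface HeightAnimatable {
  style: { height: string };
  scrollHeight: number;
  addEventListener(
    type: "transitionend",
    listener: () => void,
    options?: { once?: boolean },
  ): void;
}

function animateToContentHeight(el: HeightAnimatable): void {
  // Pin an explicit pixel height so the CSS transition has a concrete target.
  el.style.height = `${el.scrollHeight}px`;

  // Once the transition finishes, release the fixed height so content that
  // grows later (for example an expanded code block) is not clipped.
  el.addEventListener(
    "transitionend",
    () => {
      el.style.height = "auto";
    },
    { once: true },
  );
}
```

Without the reset, the container keeps the pixel height measured at transition time, which matches the clipping the commit message describes.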

- StepTab: change aria-label from "Go to question" to "Go to step" since it's now shared between questions and examples
- AutoEvalExampleLabeling: add placeholder on explanation textarea to match FreeResponseStep's "Type your response..."

virajmehta · 2026-03-25T15:38:08Z

/autopilot-e2e

github-actions · 2026-03-25T15:41:13Z

🚀 Autopilot E2E tests triggered!

View the run: https://github.com/tensorzero/autopilot/actions/runs/23549701841

amishler

I think it would be useful to have the option to have a help circle (CircleHelp I think) next to the multiple choice prompt and the free-response prompt. I'm picturing tooltip text like this:

Multiple choice:
"Only rate the output with respect to the current target behavior, not with respect to overall quality or correctness.

Correct --> The output is correct with respect to the target behavior.
Incorrect --> The output is incorrect with respect to the target behavior.
Irrelevant --> This example is not relevant to the target behavior.
"

Free response:
"Add information that helps clarify the target behavior, including the boundary between correct vs incorrect behavior. For example, explain why the output is correct or incorrect or why this example is or is not relevant to the target behavior."

amishler · 2026-03-27T14:03:26Z

ui/app/components/autopilot/AutoEvalExampleLabeling.tsx

+        <ScrollFadeContainer maxHeight="60vh">
+          <PromptResponseDisplay example={example} />
+        </ScrollFadeContainer>


Could the Input and Output blocks be independently scrollable? That way for example if one is long and the other is short you don't have to scroll back and forth to reference both.

amishler · 2026-03-27T14:05:55Z

ui/app/components/autopilot/AutoEvalExampleLabeling.stories.tsx

+    isLoading: false,
+    onSubmit: () => {},
+  },
+};


When Next or Back is clicked, the box collapses and then re-expands. Is there a way to prevent this or at least make the transition smoother?

amishler · 2026-03-27T14:15:12Z

ui/app/components/autopilot/AutoEvalExampleLabeling.stories.tsx

+
+// ── Stories ───────────────────────────────────────────────────────────
+
+export const SingleExample: Story = {


I know this is just for illustration, but could we change the prompt for this story to "Rate this output with respect to the target behavior." and change "Yes"/"No" to "Correct"/"Incorrect"? This is the text I'm envisioning for the actual labeling task.

amishler · 2026-03-27T14:16:28Z

ui/app/components/autopilot/AutoEvalExampleLabeling.stories.tsx

+export const SingleJsonBlock: Story = {
+  args: {
+    payload: singleJsonBlockPayload,
+    isLoading: false,
+    onSubmit: () => {},
+  },
+};


Could we use the intended set of radio response buttons (Correct/Incorrect/Irrelevant) + the "Explain your rating (optional)" free response field here? Just to see what this looks like with a single block.

amishler · 2026-03-27T14:22:13Z

ui/app/components/autopilot/AutoEvalExampleLabeling.tsx

+          <InferenceButton
+            inferenceId={example.source.id}
+            tooltipText="View source inference"
+          />


For synthetic examples, I think we might want to display something like [This example was synthetically generated] so users don't get confused about why the Source Inference button only appears sometimes.

amishler · 2026-03-27T14:43:29Z

ui/app/components/autopilot/AutoEvalExampleLabeling.stories.tsx

+export default meta;
+type Story = StoryObj<typeof meta>;
+
+// ── Stories ───────────────────────────────────────────────────────────


I assume the variety in terms of the response options and the question prompts is just to illustrate the type? For actual usage I'm envisioning the same set of radio buttons and prompts every time, though I guess that could change in the future.

simeonlee and others added 18 commits March 11, 2026 17:47

Rename AutoevalExampleLabeling to AutoEvalExampleLabeling

4a05aae

Extract shared OptionButton, add exhaustive never check

d11594f

Backtick-wrap technical terms in comment

7a6f8af

Match context block heights to taller block, reduce max height

86842c1

Merge branch 'main' into simeonlee/autoeval-example-labeling

8984927

Scope flex-1 to scroll mode only, extract ScrollCard, hoist constant

9e51603

Remove keyboard shortcuts from autoeval labeling card

170b50a

Clean up AutoEvalExampleLabeling: remove dead dark:bg, inline color c…

fe38b19

…onstants, drop invisible CM override

merged

b2dc18e

virajmehta self-assigned this Mar 24, 2026

merged main again

3467409

virajmehta force-pushed the viraj/autoevals-block branch from 2ac671f to 3467409 on March 25, 2026 15:37

fixed types

0b15ac0

virajmehta force-pushed the viraj/autoevals-block branch from e51676b to 3eda9ce on March 25, 2026 18:03

added tests

502596f

virajmehta force-pushed the viraj/autoevals-block branch from 3eda9ce to 502596f on March 25, 2026 18:15

fixed smaller issues

56f83ce

virajmehta changed the title ~~Viraj/autoevals block~~ Add UI elements for autoevals Mar 25, 2026

virajmehta assigned amishler Mar 26, 2026

virajmehta removed their assignment Mar 26, 2026

amishler reviewed Mar 27, 2026

virajmehta wants to merge 22 commits into main from viraj/autoevals-block


3 participants
