Add UI elements for autoevals#7064

Draft
virajmehta wants to merge 22 commits into main from viraj/autoevals-block

Conversation

@virajmehta
Member

Renders auto-eval example labeling events in the autopilot session footer,
allowing users to label inference input/output pairs to calibrate auto-evaluations.

  • New component renders inference data using InputElement/ChatOutputElement
    with smart detection of data shapes (inference input vs chat output vs raw JSON)
  • Integrates into pending event queue alongside user_questions events
  • Submits responses through existing answer-questions API endpoint
  • Includes 9 Storybook stories covering edge cases (single/multi example,
    markdown context, many options, long conversations, loading state, etc.)

simeonlee and others added 18 commits March 11, 2026 17:47
The addon bundled its own react-router, creating duplicate instances.
Context from the preview decorator's Router didn't reach useLocation
in app components because they resolved to different copies.

Replace the addon and partial provider stack with createMemoryRouter +
RouterProvider + AppProviders — mirrors the real app provider tree and
stays in sync automatically.
Fix capitalization to match the AutoEval convention everywhere:
file names, component names, type names, imports.
- Extract OptionButton from duplicated button styling in
  MultipleChoiceStep and AutoEvalExampleLabeling
- Add exhaustive default: never branch to ContextBlock switch
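A minimal sketch of what an exhaustive `never` default can look like. The `ContextBlock` variants and the `assertNever` helper below are hypothetical, since the PR text does not show the real type. The idea is that adding a new variant without handling it becomes a compile-time error.

```typescript
// Hypothetical block shapes; the real ContextBlock type is not shown in the PR.
type ContextBlock =
  | { type: "json"; data: unknown }
  | { type: "markdown"; text: string };

// Accepts only `never`, so it compiles only when every variant is handled above.
// At runtime it throws if an unexpected value slips through.
function assertNever(value: never): never {
  throw new Error(`Unhandled context block: ${JSON.stringify(value)}`);
}

function describeContextBlock(block: ContextBlock): string {
  switch (block.type) {
    case "json":
      return `JSON: ${JSON.stringify(block.data)}`;
    case "markdown":
      return `Markdown: ${block.text}`;
    default:
      // `block` is narrowed to `never` here because all cases are covered.
      return assertNever(block);
  }
}
```

If a third variant were added to the union, the `default` branch would stop type-checking until a matching `case` is written.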
Both side-by-side context blocks now stretch to the taller one's
height (capped at max) instead of shrinking to the shorter one.
Reduces CONTEXT_MAX_HEIGHT from 150 to 120.
- Add ContentOverflow discriminated union (scroll | expandable) for
  InputElement and ChatOutputElement
- ScrollFadeContainer: remove negative margin hack, use gradient
  elements as natural vertical padding instead
- Auto-resizing textarea for rationale (1-3 rows, then scroll)
- Side-by-side context blocks now match heights via flex-1
- Flesh out storybook fixtures with realistic content and add
  edge-case size combination stories
- Fix nested scrollable regions with descendant selector overrides
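For illustration, a `ContentOverflow` discriminated union along these lines might look as follows. Only the two variant names (`scroll` and `expandable`) come from the commit message; the `kind` tag, the `maxHeight` and `collapsedHeight` fields, and the `overflowStyle` helper are assumptions made for this sketch.

```typescript
// Assumed shape: the commit only names the two variants.
type ContentOverflow =
  | { kind: "scroll"; maxHeight: number }
  | { kind: "expandable"; collapsedHeight: number };

interface OverflowStyle {
  maxHeight: string;
  overflowY: "auto" | "hidden";
}

// Hypothetical helper that maps the overflow mode to inline style values.
function overflowStyle(overflow: ContentOverflow, expanded: boolean): OverflowStyle {
  switch (overflow.kind) {
    case "scroll":
      // Scroll mode: fixed cap, content scrolls inside the container.
      return { maxHeight: `${overflow.maxHeight}px`, overflowY: "auto" };
    case "expandable":
      // Expandable mode: clipped until the user expands it, never scrolls.
      return expanded
        ? { maxHeight: "none", overflowY: "hidden" }
        : { maxHeight: `${overflow.collapsedHeight}px`, overflowY: "hidden" };
    default: {
      const unreachable: never = overflow;
      return unreachable;
    }
  }
}
```

Because each variant carries only its own fields, a caller cannot pass a collapsed height to a scrolling block or the reverse.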
…ts and cross-type stories

- Replace InputElement/ChatOutputElement with simpler CodeEditor for JSON blocks
- Add monospace font to markdown blocks
- Add keyboard shortcuts (1-9 for options, Enter to advance)
- Require all examples answered before submit, show progress counter
- Add auto-resizing textarea (1-3 rows) with overscroll-none
- Fix t0- prefix in fixture IDs
- Add 6 cross-type storybook stories (JSON+Markdown combinations)
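The keyboard shortcuts in this commit (1-9 for options, Enter to advance) could be modeled as a pure key-to-action mapping, sketched below. `ShortcutAction` and `resolveShortcut` are hypothetical names, and gating Enter on an answer having been selected is an assumption of this sketch, not something the commit message states.

```typescript
// Hypothetical action type for the labeling card.
type ShortcutAction =
  | { type: "select"; optionIndex: number }
  | { type: "advance" }
  | { type: "none" };

// Maps a KeyboardEvent.key value to an action.
// Assumption: Enter advances only once the current example has an answer.
function resolveShortcut(
  key: string,
  optionCount: number,
  hasAnswer: boolean,
): ShortcutAction {
  if (key === "Enter") {
    return hasAnswer ? { type: "advance" } : { type: "none" };
  }
  if (/^[1-9]$/.test(key)) {
    // "1" selects the first option, so convert to a zero-based index.
    const optionIndex = Number(key) - 1;
    if (optionIndex < optionCount) {
      return { type: "select", optionIndex };
    }
  }
  return { type: "none" };
}
```

Keeping the mapping free of DOM access means it can be exercised in a unit test without rendering the component; a `keydown` handler would call it with `event.key`.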
…ex-fill layout

- Card border/header/close button use neutral tokens instead of orange
- Option buttons use orange selected state with equal-width flex layout
- Code blocks use CodeEditor with ScrollFadeContainer for both JSON and markdown
- Input/Output labels use MessageWrapper-style colored left-border pattern
- Card flexes to fill available viewport height (max-h-[70vh])
- Code blocks are the flex elements that grow/shrink as rationale appears
- Rationale textarea only appears after selecting an answer
- Remove visual testing stub from route file
…xt blocks

- Switch card color scheme to purple (border, header, labels, option buttons)
- Replace scroll overflow with ExpandableElement for code blocks
- Use ExpandableElement from input_output for consistent show more/less UX
- Question labels black, header purple, option buttons neutral with purple hover
- Lighten StepTab active background, keep completed tabs green
- Purple footer controls (back, labeled count, next button)
- White card bg with lighter purple border (matches autoeval card)
- Purple header text, matching X button and back/skip styling
- Black submit button, purple next button (consistent footer treatment)
- Wider footer gap between controls
- Extract QuestionCard component from duplicated card chrome in
  PendingQuestionCard and AutoEvalExampleLabeling (header, dismiss,
  step tabs, animated height wrapper, footer slot)
- Fix useAnimatedHeight: reset to height:auto after CSS transition so
  ExpandableElement and other dynamically-resizing content isn't clipped
- Both cards now get animated step transitions via the shared shell
- Trim stories from 21 to 10, removing redundant size-combination variants
- StepTab: change aria-label from "Go to question" to "Go to step"
  since it's now shared between questions and examples
- AutoEvalExampleLabeling: add placeholder on explanation textarea
  to match FreeResponseStep's "Type your response..."
@virajmehta self-assigned this on Mar 24, 2026
@virajmehta force-pushed the viraj/autoevals-block branch from 2ac671f to 3467409 on March 25, 2026 15:37
@virajmehta
Member Author

/autopilot-e2e

@github-actions
Contributor

🚀 Autopilot E2E tests triggered!

View the run: https://github.com/tensorzero/autopilot/actions/runs/23549701841

@virajmehta force-pushed the viraj/autoevals-block branch from e51676b to 3eda9ce on March 25, 2026 18:03
@virajmehta force-pushed the viraj/autoevals-block branch from 3eda9ce to 502596f on March 25, 2026 18:15
@virajmehta changed the title from "Viraj/autoevals block" to "Add UI elements for autoevals" on Mar 25, 2026
@virajmehta removed their assignment on Mar 26, 2026
Member

@amishler amishler left a comment

I think it would be useful to optionally show a help circle (CircleHelp, I think) next to the multiple-choice prompt and the free-response prompt. I'm picturing tooltip text like this:

Multiple choice:
"Only rate the output with respect to the current target behavior, not with respect to overall quality or correctness.

Correct --> The output is correct with respect to the target behavior.
Incorrect --> The output is incorrect with respect to the target behavior.
Irrelevant --> This example is not relevant to the target behavior.
"

Free response:
"Add information that helps clarify the target behavior, including the boundary between correct vs incorrect behavior. For example, explain why the output is correct or incorrect or why this example is or is not relevant to the target behavior."

Comment on lines +268 to +270
<ScrollFadeContainer maxHeight="60vh">
  <PromptResponseDisplay example={example} />
</ScrollFadeContainer>
Member

Could the Input and Output blocks be independently scrollable? That way, if one is long and the other is short, you don't have to scroll back and forth to reference both.

    isLoading: false,
    onSubmit: () => {},
  },
};
Member

When Next or Back is clicked, the box collapses and then re-expands. Is there a way to prevent this or at least make the transition smoother?


// ── Stories ───────────────────────────────────────────────────────────

export const SingleExample: Story = {
Member

@amishler amishler Mar 27, 2026

I know this is just for illustration, but could we change the prompt for this story to "Rate this output with respect to the target behavior." and change "Yes"/"No" to "Correct"/"Incorrect"? This is the text I'm envisioning for the actual labeling task.

Comment on lines +885 to +891
export const SingleJsonBlock: Story = {
  args: {
    payload: singleJsonBlockPayload,
    isLoading: false,
    onSubmit: () => {},
  },
};
Member

Could we use the intended set of radio response buttons (Correct/Incorrect/Irrelevant) + the "Explain your rating (optional)" free response field here? Just to see what this looks like with a single block.

Comment on lines +84 to +87
<InferenceButton
  inferenceId={example.source.id}
  tooltipText="View source inference"
/>
Member

For synthetic examples, I think we might want to display something like [This example was synthetically generated] so users don't get confused about why the Source Inference button only appears sometimes.
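One possible way to sketch this suggestion is to decide what to show from a tagged source type. The diff only shows `example.source.id`, so the `kind` discriminator, the `ExampleSource` and `SourceDisplay` types, and the `sourceDisplay` helper below are all hypothetical.

```typescript
// Hypothetical source shape; only `source.id` appears in the diff.
type ExampleSource =
  | { kind: "inference"; id: string }
  | { kind: "synthetic" };

// What the card would render in the source slot.
type SourceDisplay =
  | { kind: "button"; inferenceId: string; tooltipText: string }
  | { kind: "note"; text: string };

function sourceDisplay(source: ExampleSource): SourceDisplay {
  if (source.kind === "inference") {
    // Mirrors the props passed to InferenceButton in the snippet above.
    return {
      kind: "button",
      inferenceId: source.id,
      tooltipText: "View source inference",
    };
  }
  // Synthetic examples have no inference to link to, so show a note instead.
  return { kind: "note", text: "[This example was synthetically generated]" };
}
```

With this shape, the component always renders something in the source slot, so the button no longer appears and disappears without explanation.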

export default meta;
type Story = StoryObj<typeof meta>;

// ── Stories ───────────────────────────────────────────────────────────
Member

I assume the variety in terms of the response options and the question prompts is just to illustrate the type? For actual usage I'm envisioning the same set of radio buttons and prompts every time, though I guess that could change in the future.

3 participants