Conversation
Renders auto-eval example labeling events in the autopilot session footer, allowing users to label inference input/output pairs to calibrate auto-evaluations.
- New component renders inference data using InputElement/ChatOutputElement with smart detection of data shapes (inference input vs chat output vs raw JSON)
- Integrates into the pending event queue alongside user_questions events
- Submits responses through the existing answer-questions API endpoint
- Includes 9 Storybook stories covering edge cases (single/multi example, markdown context, many options, long conversations, loading state, etc.)
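The "smart detection of data shapes" can be sketched as a small discriminator. The field checks below (`messages`, `role`/`content`) are illustrative assumptions, not the component's actual heuristics:

```typescript
// Hypothetical sketch of the shape detection used to pick a renderer.
// The concrete field checks are assumptions, not the real implementation.
type DataShape = "inference_input" | "chat_output" | "raw_json";

function detectShape(value: unknown): DataShape {
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    const obj = value as Record<string, unknown>;
    // An inference input typically carries a message list.
    if (Array.isArray(obj.messages)) return "inference_input";
    // A chat output typically looks like a single role/content message.
    if (typeof obj.role === "string" && "content" in obj) return "chat_output";
  }
  // Anything unrecognized falls back to raw JSON rendering.
  return "raw_json";
}
```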
The addon bundled its own copy of react-router, creating duplicate instances. Context from the preview decorator's Router never reached useLocation in app components because the two resolved to different copies. Replace the addon and the partial provider stack with createMemoryRouter + RouterProvider + AppProviders, which mirrors the real app provider tree and stays in sync automatically.
Fix capitalization to match the AutoEval convention everywhere: file names, component names, type names, imports.
- Extract OptionButton from duplicated button styling in MultipleChoiceStep and AutoEvalExampleLabeling
- Add exhaustive default: never branch to ContextBlock switch
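The exhaustive `default: never` pattern looks roughly like this; the `ContextBlock` variants here are simplified stand-ins, not the real type:

```typescript
// Simplified stand-in for the real ContextBlock union.
type ContextBlock =
  | { kind: "json"; value: string }
  | { kind: "markdown"; text: string };

function renderLabel(block: ContextBlock): string {
  switch (block.kind) {
    case "json":
      return `JSON (${block.value.length} chars)`;
    case "markdown":
      return `Markdown (${block.text.length} chars)`;
    default: {
      // If a new variant is added to ContextBlock without a case above,
      // this assignment stops type-checking, surfacing the omission.
      const _exhaustive: never = block;
      throw new Error(`Unhandled block: ${JSON.stringify(_exhaustive)}`);
    }
  }
}
```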
Both side-by-side context blocks now stretch to the taller one's height (capped at max) instead of shrinking to the shorter one. Reduces CONTEXT_MAX_HEIGHT from 150 to 120.
- Add ContentOverflow discriminated union (scroll | expandable) for InputElement and ChatOutputElement
- ScrollFadeContainer: remove negative margin hack, use gradient elements as natural vertical padding instead
- Auto-resizing textarea for rationale (1-3 rows, then scroll)
- Side-by-side context blocks now match heights via flex-1
- Flesh out Storybook fixtures with realistic content and add edge-case size combination stories
- Fix nested scrollable regions with descendant selector overrides
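The 1-3 row auto-resize reduces to a clamp on the line count. This helper is an illustrative sketch under that simplification; a real component would more likely measure scrollHeight to account for soft wrapping:

```typescript
// Sketch: derive a textarea row count from the rationale text,
// growing from 1 to 3 rows and scrolling beyond that.
function rationaleRows(text: string, minRows = 1, maxRows = 3): number {
  const lineCount = text.split("\n").length;
  return Math.min(maxRows, Math.max(minRows, lineCount));
}
```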
…ts and cross-type stories
- Replace InputElement/ChatOutputElement with simpler CodeEditor for JSON blocks
- Add monospace font to markdown blocks
- Add keyboard shortcuts (1-9 for options, Enter to advance)
- Require all examples answered before submit, show progress counter
- Add auto-resizing textarea (1-3 rows) with overscroll-none
- Fix t0- prefix in fixture IDs
- Add 6 cross-type Storybook stories (JSON+Markdown combinations)
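The 1-9 / Enter shortcuts map naturally onto a small pure function; the action shape below is an assumption for illustration, not the PR's actual handler:

```typescript
// Sketch of the keyboard mapping: digits 1-9 select an option
// (when one exists at that position), Enter advances to the next example.
type KeyAction =
  | { type: "select"; index: number }
  | { type: "advance" }
  | { type: "none" };

function mapShortcut(key: string, optionCount: number): KeyAction {
  if (key === "Enter") return { type: "advance" };
  const digit = Number(key);
  if (Number.isInteger(digit) && digit >= 1 && digit <= 9 && digit <= optionCount) {
    return { type: "select", index: digit - 1 }; // zero-based option index
  }
  return { type: "none" };
}
```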
…ex-fill layout
- Card border/header/close button use neutral tokens instead of orange
- Option buttons use orange selected state with equal-width flex layout
- Code blocks use CodeEditor with ScrollFadeContainer for both JSON and markdown
- Input/Output labels use MessageWrapper-style colored left-border pattern
- Card flexes to fill available viewport height (max-h-[70vh])
- Code blocks are the flex elements that grow/shrink as the rationale appears
- Rationale textarea only appears after selecting an answer
- Remove visual testing stub from route file
…xt blocks
- Switch card color scheme to purple (border, header, labels, option buttons)
- Replace scroll overflow with ExpandableElement for code blocks
- Use ExpandableElement from input_output for consistent show more/less UX
- Question labels black, header purple, option buttons neutral with purple hover
- Lighten StepTab active background, keep completed tabs green
- Purple footer controls (back, labeled count, next button)
…onstants, drop invisible CM override
- White card bg with lighter purple border (matches autoeval card)
- Purple header text, matching X button and back/skip styling
- Black submit button, purple next button (consistent footer treatment)
- Wider footer gap between controls
- Extract QuestionCard component from the duplicated card chrome in PendingQuestionCard and AutoEvalExampleLabeling (header, dismiss, step tabs, animated height wrapper, footer slot)
- Fix useAnimatedHeight: reset to height:auto after the CSS transition so ExpandableElement and other dynamically-resizing content isn't clipped
- Both cards now get animated step transitions via the shared shell
- Trim stories from 21 to 10, removing redundant size-combination variants
- StepTab: change aria-label from "Go to question" to "Go to step" since it's now shared between questions and examples
- AutoEvalExampleLabeling: add a placeholder on the explanation textarea to match FreeResponseStep's "Type your response..."
Force-pushed 2ac671f to 3467409
/autopilot-e2e

🚀 Autopilot E2E tests triggered! View the run: https://github.com/tensorzero/autopilot/actions/runs/23549701841
Force-pushed e51676b to 3eda9ce
Force-pushed 3eda9ce to 502596f
amishler left a comment
I think it would be useful to have the option to have a help circle (CircleHelp I think) next to the multiple choice prompt and the free-response prompt. I'm picturing tooltip text like this:
Multiple choice:
"Only rate the output with respect to the current target behavior, not with respect to overall quality or correctness.
Correct --> The output is correct with respect to the target behavior.
Incorrect --> The output is incorrect with respect to the target behavior.
Irrelevant --> This example is not relevant to the target behavior.
"
Free response:
"Add information that helps clarify the target behavior, including the boundary between correct vs incorrect behavior. For example, explain why the output is correct or incorrect or why this example is or is not relevant to the target behavior."
<ScrollFadeContainer maxHeight="60vh">
  <PromptResponseDisplay example={example} />
</ScrollFadeContainer>
Could the Input and Output blocks be independently scrollable? That way for example if one is long and the other is short you don't have to scroll back and forth to reference both.
    isLoading: false,
    onSubmit: () => {},
  },
};
When Next or Back is clicked, the box collapses and then re-expands. Is there a way to prevent this or at least make the transition smoother?
// ── Stories ───────────────────────────────────────────────────────────

export const SingleExample: Story = {
I know this is just for illustration, but could we change the prompt for this story to "Rate this output with respect to the target behavior." and change "Yes"/"No" to "Correct"/"Incorrect"? This is the text I'm envisioning for the actual labeling task.
export const SingleJsonBlock: Story = {
  args: {
    payload: singleJsonBlockPayload,
    isLoading: false,
    onSubmit: () => {},
  },
};
Could we use the intended set of radio response buttons (Correct/Incorrect/Irrelevant) + the "Explain your rating (optional)" free response field here? Just to see what this looks like with a single block.
<InferenceButton
  inferenceId={example.source.id}
  tooltipText="View source inference"
/>
For synthetic examples, I think we might want to display something like [This example was synthetically generated] so users don't get confused about why the Source Inference button only appears sometimes.
export default meta;
type Story = StoryObj<typeof meta>;

// ── Stories ───────────────────────────────────────────────────────────
I assume the variety in terms of the response options and the question prompts is just to illustrate the type? For actual usage I'm envisioning the same set of radio buttons and prompts every time, though I guess that could change in the future.