OTel Instrumentation Improvement: derive gh-aw.run.status from observable failure signals
Analysis Date: 2026-05-18
Priority: High
Effort: Small (< 2h)
Problem
The conclusion span attribute gh-aw.run.status and the OTLP span status.code are computed exclusively from process.env.GH_AW_AGENT_CONCLUSION (and the rarer workflow_run.conclusion) in actions/setup/js/send_otlp_span.cjs:1670-1683. Because GitHub Actions only exposes needs.<job>.result to downstream jobs, that env var is empty for the agent job's own post-step — and from the live data, it appears to be empty for downstream jobs too. The result is that every conclusion span in Tempo over the last 7 days carries gh-aw.run.status="success" and status.code=STATUS_CODE_OK, even on runs where the agent emitted errors.
A DevOps engineer cannot answer the most basic operational question — "which gh-aw runs failed in the last hour?" — by filtering on either span status or gh-aw.run.status in Grafana. The data exists (gh-aw.error_count, exception events) but is not surfaced through the conventional channels dashboards and alerting rules rely on.
Why This Matters (DevOps Perspective)
- Alerting is blocked: Any rule of the form count_over_time({status=error}[5m]) > N returns 0 today. The only working failure signal is an in-payload exception event, which most TraceQL/PromQL dashboards do not aggregate cleanly.
- MTTR increases: On-call engineers cannot triage by failure status; they must drill into individual traces or fall back to the GitHub Actions UI.
- Failure-rate SLOs are unmeasurable: success_rate = ok / total is stuck at 100% in every backend, masking real regressions.
- The fix is local: the data needed to derive the correct status (outputErrors.length, hasNoReadableAgentOutput) is already read in the same function; it is just not consulted for runStatus.
Current Behavior
actions/setup/js/send_otlp_span.cjs:1670-1683:
let runStatus = "success";
const rawRunStatus = agentConclusion || workflowRunConclusion;
if (rawRunStatus === "cancelled") {
  runStatus = "cancelled";
} else if (rawRunStatus === "failure" || rawRunStatus === "timed_out") {
  runStatus = "failure";
}
if (isAgentFailure && errorMessages.length > 0) {
  statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
}
const attributes = [..., buildAttr("gh-aw.run.status", runStatus), ...];
And the span status code at line 1651:
const isAgentNonOK = isAgentFailure || isAgentCancelled;
const statusCode = isAgentNonOK ? 2 : 1;
Both are driven by agentConclusion (from GH_AW_AGENT_CONCLUSION); runStatus additionally consults workflowRunConclusion, which is rarely set. When the env var is empty, which is the observed reality for all spans in Tempo, runStatus stays "success" and statusCode stays 1 (STATUS_CODE_OK), regardless of what the agent actually did.
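To make the failure mode concrete, here is a minimal standalone sketch of the current derivation. The function name currentRunStatus and its two-argument signature are hypothetical; the real logic is inline in send_otlp_span.cjs. The sketch only shows that an empty conclusion always falls through to "success".

```javascript
// Hypothetical standalone copy of the current runStatus logic, for illustration only.
function currentRunStatus(agentConclusion, workflowRunConclusion) {
  let runStatus = "success";
  const rawRunStatus = agentConclusion || workflowRunConclusion;
  if (rawRunStatus === "cancelled") {
    runStatus = "cancelled";
  } else if (rawRunStatus === "failure" || rawRunStatus === "timed_out") {
    runStatus = "failure";
  }
  return runStatus;
}

// An empty conclusion (the case observed in Tempo) yields "success",
// no matter how many errors the agent wrote to its output.
console.log(currentRunStatus("", "")); // success
console.log(currentRunStatus("failure", "")); // failure
```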
Proposed Change
Fall back to observable failure signals (errors written to agent_output.json, or a missing/unreadable agent_output.json) when the env var path did not yield a non-success status. The same outputErrors and hasNoReadableAgentOutput values are already computed a few lines above.
// actions/setup/js/send_otlp_span.cjs, after the existing rawRunStatus block (~line 1676)
// Fallback: GH_AW_AGENT_CONCLUSION is empty on the agent job's own post step
// (GitHub Actions does not expose needs.<self>.result), and is often empty on
// downstream jobs as well. Derive the failure status from observable signals
// that this function already has in hand so dashboards and alerts can use
// gh-aw.run.status and span status_code as authoritative failure indicators.
if (runStatus === "success" && (outputErrors.length > 0 || hasNoReadableAgentOutput)) {
  runStatus = "failure";
}
// Re-derive the OTLP status from the (possibly upgraded) runStatus so the two stay in sync.
// 1 = STATUS_CODE_OK, 2 = STATUS_CODE_ERROR.
const statusCode = runStatus === "success" ? 1 : 2;
// NOTE: if statusMessage is already declared earlier in the function, assign to it
// here instead of redeclaring it with `let`.
let statusMessage;
if (runStatus === "failure") {
  statusMessage = errorMessages[0]
    ? `agent failure: ${errorMessages[0]}`.slice(0, 256)
    : (agentConclusion ? `agent ${agentConclusion}` : "agent failure");
} else if (runStatus === "cancelled") {
  statusMessage = "agent cancelled";
}
The earlier const statusCode = isAgentNonOK ? 2 : 1; assignment and the if (isAgentFailure && errorMessages.length > 0) statusMessage block are removed; the new derivation subsumes both.
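The proposed derivation can also be read as a single pure function, which makes the precedence explicit: an explicit conclusion wins, then the observable signals, and the OTLP status follows from the result. This is only a sketch: the name deriveRunStatus, the shape of its input object, and the shape of the outputErrors entries are assumptions for illustration, not the existing API of send_otlp_span.cjs.

```javascript
// Sketch of the proposed derivation as a pure function. The function name and
// the shape of `input` are assumptions made for this example.
function deriveRunStatus(input) {
  const {
    agentConclusion = "",
    workflowRunConclusion = "",
    outputErrors = [],
    errorMessages = [],
    hasNoReadableAgentOutput = false,
  } = input;

  // Step 1: trust an explicit conclusion when one is present.
  let runStatus = "success";
  const rawRunStatus = agentConclusion || workflowRunConclusion;
  if (rawRunStatus === "cancelled") {
    runStatus = "cancelled";
  } else if (rawRunStatus === "failure" || rawRunStatus === "timed_out") {
    runStatus = "failure";
  }

  // Step 2: otherwise fall back to observable failure signals.
  if (runStatus === "success" && (outputErrors.length > 0 || hasNoReadableAgentOutput)) {
    runStatus = "failure";
  }

  // Step 3: derive the OTLP status from runStatus so the two cannot diverge.
  // 1 = STATUS_CODE_OK, 2 = STATUS_CODE_ERROR.
  const statusCode = runStatus === "success" ? 1 : 2;

  let statusMessage;
  if (runStatus === "failure") {
    if (errorMessages[0]) {
      statusMessage = `agent failure: ${errorMessages[0]}`.slice(0, 256);
    } else {
      statusMessage = agentConclusion ? `agent ${agentConclusion}` : "agent failure";
    }
  } else if (runStatus === "cancelled") {
    statusMessage = "agent cancelled";
  }

  return { runStatus, statusCode, statusMessage };
}

// The failing run from the live data: no conclusion env var, one recorded error.
const message = "Line 2: Too many items of type 'create_discussion'. Maximum allowed: 1.";
const result = deriveRunStatus({ outputErrors: [{ message }], errorMessages: [message] });
console.log(result.runStatus, result.statusCode); // failure 2
```

Keeping the derivation free of environment and file access, as in this sketch, would also make it straightforward to unit test without fixtures.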
Expected Outcome
- In Grafana / Tempo: {resource.service.name="gh-aw" && status=error} returns the real failing traces. {span.gh-aw.run.status="failure"} becomes a usable filter. The attribute-values index for gh-aw.run.status gains failure (and cancelled when relevant) instead of being stuck on a single success value.
- In the JSONL mirror: failed runs are visibly different from successful ones at the top-level span status field, not just inside nested exception events.
- For on-call engineers: a single TraceQL filter or alert rule is enough to find and page on agent failures. The existing gh-aw.error.messages attribute (already emitted) becomes immediately useful as a tooltip on those filtered spans.
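Implementation Steps
- Add the fallback block to actions/setup/js/send_otlp_span.cjs around line 1676 as shown above.
- Remove the earlier statusCode = isAgentNonOK ? 2 : 1 assignment (line ~1651) and the standalone if (isAgentFailure && errorMessages.length > 0) statusMessage block (lines ~1678-1680).
- Add tests to actions/setup/js/send_otlp_span.test.cjs to assert: (a) gh-aw.run.status="failure" and status.code=2 when agent_output.json contains errors but GH_AW_AGENT_CONCLUSION is empty, (b) same when agent_output.json is missing on the agent job, (c) existing agentConclusion=success path still emits status.code=1.
- Run cd actions/setup/js && npx vitest run send_otlp_span.test.cjs to confirm tests pass.
- Run make fmt and make test-unit.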
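The three scenarios the new tests need to cover (errors recorded while GH_AW_AGENT_CONCLUSION is empty, a missing agent_output.json, and an explicit success conclusion with clean output) can be laid out as a small case table. The sketch below is framework-agnostic: the input field names and the findMismatches helper are hypothetical, and derive stands for whatever function or wrapper ends up producing the status.

```javascript
// Expected outcomes for the three scenarios. Field names on `input` are
// assumptions for this sketch, not the real test fixtures.
const runStatusCases = [
  {
    name: "errors in agent_output.json, empty GH_AW_AGENT_CONCLUSION",
    input: { agentConclusion: "", outputErrors: ["boom"], hasNoReadableAgentOutput: false },
    expected: { runStatus: "failure", statusCode: 2 },
  },
  {
    name: "agent_output.json missing on the agent job",
    input: { agentConclusion: "", outputErrors: [], hasNoReadableAgentOutput: true },
    expected: { runStatus: "failure", statusCode: 2 },
  },
  {
    name: "explicit success conclusion with clean output",
    input: { agentConclusion: "success", outputErrors: [], hasNoReadableAgentOutput: false },
    expected: { runStatus: "success", statusCode: 1 },
  },
];

// Runs `derive` (any function returning { runStatus, statusCode }) against the
// table and returns the names of the cases whose result does not match.
function findMismatches(derive) {
  return runStatusCases
    .filter(({ input, expected }) => {
      const actual = derive(input);
      return actual.runStatus !== expected.runStatus || actual.statusCode !== expected.statusCode;
    })
    .map(({ name }) => name);
}
```

In the real suite each row would more likely become its own vitest case that drives the span builder through environment variables and a fixture file, rather than calling a pure function directly.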
Evidence from Live Grafana Data
Queried Tempo datasource grafanacloud-traces over 2026-05-11T00:00:00Z → 2026-05-18T07:00:00Z:
- tempo_get-attribute-values name="span.gh-aw.run.status" returns exactly one value: "success" across the entire 7-day window. No failure, no cancelled.
- {resource.service.name="gh-aw" && status=error} returns 0 traces, despite the same data showing exception events on multiple traces.
- {span.gh-aw.error_count=1} returns traces with real agent errors. Inspecting trace 5b3a7917f205e61028bd3d6b0f921c72 (gh-aw.copilot-cli-deep-research):
job=agent span=gh-aw.agent.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
EVENT: exception type=gh-aw.AgentError message="Line 2: Too many items of type 'create_discussion'. Maximum allowed: 1."
job=detection span=gh-aw.detection.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
job=safe_outputs span=gh-aw.safe_outputs.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
job=conclusion span=gh-aw.conclusion.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
Every conclusion span in that failing run reports status_code=STATUS_CODE_OK and gh-aw.run.status=success. The gh-aw.agent.conclusion attribute is absent on every span (it is also missing from the Tempo attribute-name index, confirming the env var path never set it in production).
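The gap between the two signals can be shown with the four conclusion spans from the trace above. The sketch below copies their status_code, run_status, and error_count values into plain objects (the camelCase field names are chosen for this example) and counts failures both ways.

```javascript
// The four conclusion spans from the trace above, reduced to the fields that matter.
const spans = [
  { job: "agent", statusCode: "STATUS_CODE_OK", runStatus: "success", errorCount: 1 },
  { job: "detection", statusCode: "STATUS_CODE_OK", runStatus: "success", errorCount: 1 },
  { job: "safe_outputs", statusCode: "STATUS_CODE_OK", runStatus: "success", errorCount: 1 },
  { job: "conclusion", statusCode: "STATUS_CODE_OK", runStatus: "success", errorCount: 1 },
];

// What a status-based dashboard or alert rule sees today.
const failedByStatus = spans.filter((s) => s.statusCode === "STATUS_CODE_ERROR").length;

// What the error_count attribute says actually happened.
const failedByErrorCount = spans.filter((s) => s.errorCount > 0).length;

console.log(failedByStatus, failedByErrorCount); // 0 4
```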
Related Files
- actions/setup/js/send_otlp_span.cjs (primary change)
- actions/setup/js/send_otlp_span.test.cjs (test additions)
- actions/setup/js/action_conclusion_otlp.cjs (caller; no change expected)
- actions/setup/js/generate_observability_summary.cjs (consumer; verify summary reflects new status)
Generated by the Daily Grafana OTel Instrumentation Advisor workflow