[grafana-otel-advisor] OTel improvement: gh-aw.run.status silently reports 'success' on real agent failures #32958

### OTel Instrumentation Improvement: derive `gh-aw.run.status` from observable failure signals

**Analysis Date**: 2026-05-18
**Priority**: High
**Effort**: Small (< 2h)

### Problem

The conclusion span attribute `gh-aw.run.status` and the OTLP span `status.code` are computed exclusively from `process.env.GH_AW_AGENT_CONCLUSION` (and the rarer `workflow_run.conclusion`) in `actions/setup/js/send_otlp_span.cjs:1670-1683`. Because GitHub Actions only exposes `needs.<job>.result` to **downstream** jobs, that env var is empty for the agent job's *own* post-step — and from the live data, it appears to be empty for downstream jobs too. The result is that **every conclusion span in Tempo over the last 7 days carries `gh-aw.run.status="success"` and `status.code=STATUS_CODE_OK`**, even on runs where the agent emitted errors.

A DevOps engineer cannot answer the most basic operational question — *"which gh-aw runs failed in the last hour?"* — by filtering on either span status or `gh-aw.run.status` in Grafana. The data exists (`gh-aw.error_count`, exception events) but is not surfaced through the conventional channels dashboards and alerting rules rely on.

<details>
<summary>Why This Matters (DevOps Perspective)</summary>

- **Alerting is blocked**: Any rule of the form `count_over_time({status=error}[5m]) > N` returns 0 today. The only working failure signal is an in-payload exception event, which most TraceQL/PromQL dashboards do not aggregate cleanly.
- **MTTR increases**: On-call engineers cannot triage by failure status; they must drill into individual traces or fall back to GitHub Actions UI.
- **Failure-rate SLOs are unmeasurable**: `success_rate = ok / total` is stuck at 100% in every backend, masking real regressions.
- **The fix is local**: the data needed to derive the correct status (`outputErrors.length`, `hasNoReadableAgentOutput`) is already read in the same function — it is just not consulted for `runStatus`.

</details>

<details>
<summary>Current Behavior</summary>

`actions/setup/js/send_otlp_span.cjs:1670-1683`:

```javascript
let runStatus = "success";
const rawRunStatus = agentConclusion || workflowRunConclusion;
if (rawRunStatus === "cancelled") {
  runStatus = "cancelled";
} else if (rawRunStatus === "failure" || rawRunStatus === "timed_out") {
  runStatus = "failure";
}

if (isAgentFailure && errorMessages.length > 0) {
  statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);
}

// Surrounding attributes elided in this excerpt.
const attributes = [/* ... */ buildAttr("gh-aw.run.status", runStatus) /* ... */];
```

And the span status code at line 1651:

```javascript
const isAgentNonOK = isAgentFailure || isAgentCancelled;
const statusCode = isAgentNonOK ? 2 : 1;
```

Both depend entirely on `agentConclusion` (from `GH_AW_AGENT_CONCLUSION`). When that env var is empty — which is the observed reality for all spans in Tempo — `runStatus` stays `"success"` and `statusCode` stays `1` (OK), regardless of what the agent actually did.
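The behavior described above can be reproduced in isolation. The following is a minimal sketch, assuming the derivation is reduced to a stand-alone function; `currentRunStatus` is a hypothetical name for illustration and does not exist in `send_otlp_span.cjs`.

```javascript
// Hypothetical stand-alone reduction of the current derivation (not the real module).
function currentRunStatus(agentConclusion, workflowRunConclusion) {
  let runStatus = "success";
  const rawRunStatus = agentConclusion || workflowRunConclusion;
  if (rawRunStatus === "cancelled") {
    runStatus = "cancelled";
  } else if (rawRunStatus === "failure" || rawRunStatus === "timed_out") {
    runStatus = "failure";
  }
  return runStatus;
}

// Both inputs empty is the case observed in Tempo: the default is never overridden.
const observed = currentRunStatus("", "");
// Only a populated conclusion can move the status away from "success".
const withConclusion = currentRunStatus("failure", "");

console.log(observed, withConclusion); // prints "success failure"
```

Nothing in this path looks at the agent's errors, so an empty conclusion always produces `"success"`.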

</details>

<details>
<summary>Proposed Change</summary>

Fall back to observable failure signals (errors written to `agent_output.json`, or a missing/unreadable `agent_output.json`) when the env var path did not yield a non-success status. The same `outputErrors` and `hasNoReadableAgentOutput` values are already computed a few lines above.

```javascript
// actions/setup/js/send_otlp_span.cjs — after the existing rawRunStatus block (~line 1676)

// Fallback: GH_AW_AGENT_CONCLUSION is empty on the agent job's own post step
// (GitHub Actions does not expose needs.<self>.result), and is often empty on
// downstream jobs as well. Derive the failure status from observable signals
// that this function already has in hand so dashboards and alerts can use
// gh-aw.run.status and span status_code as authoritative failure indicators.
if (runStatus === "success" && (outputErrors.length > 0 || hasNoReadableAgentOutput)) {
  runStatus = "failure";
}

// Re-derive the OTLP status from the (possibly upgraded) runStatus so the two stay in sync.
const statusCode = runStatus === "success" ? 1 : 2;
let statusMessage;
if (runStatus === "failure") {
  statusMessage = errorMessages[0]
    ? `agent failure: ${errorMessages[0]}`.slice(0, 256)
    : agentConclusion
      ? `agent ${agentConclusion}`
      : "agent failure";
} else if (runStatus === "cancelled") {
  statusMessage = "agent cancelled";
}
```

With this change, remove the earlier `const statusCode = isAgentNonOK ? 2 : 1;` declaration and the `if (isAgentFailure && errorMessages.length > 0)` statusMessage assignment. The new derivation subsumes both, and leaving the old `const statusCode` in place would redeclare the same name.
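One way to make the proposed derivation easy to unit-test is to pull it into a pure function. The sketch below assumes such an extraction; `deriveRunOutcome`, its options-object signature, and the two status constants are hypothetical and are not part of the existing module.

```javascript
// Hypothetical helper; the name and signature are illustrative only.
const OTLP_STATUS_OK = 1; // STATUS_CODE_OK in the OTLP Status message
const OTLP_STATUS_ERROR = 2; // STATUS_CODE_ERROR in the OTLP Status message

function deriveRunOutcome({
  agentConclusion = "",
  workflowRunConclusion = "",
  outputErrors = [],
  errorMessages = [],
  hasNoReadableAgentOutput = false,
} = {}) {
  // Step 1: the existing env-var path.
  let runStatus = "success";
  const rawRunStatus = agentConclusion || workflowRunConclusion;
  if (rawRunStatus === "cancelled") {
    runStatus = "cancelled";
  } else if (rawRunStatus === "failure" || rawRunStatus === "timed_out") {
    runStatus = "failure";
  }

  // Step 2: the proposed fallback to observable failure signals.
  if (runStatus === "success" && (outputErrors.length > 0 || hasNoReadableAgentOutput)) {
    runStatus = "failure";
  }

  // Step 3: span status follows runStatus so the two cannot diverge.
  const statusCode = runStatus === "success" ? OTLP_STATUS_OK : OTLP_STATUS_ERROR;
  let statusMessage;
  if (runStatus === "failure") {
    statusMessage = errorMessages[0]
      ? `agent failure: ${errorMessages[0]}`.slice(0, 256)
      : agentConclusion
        ? `agent ${agentConclusion}`
        : "agent failure";
  } else if (runStatus === "cancelled") {
    statusMessage = "agent cancelled";
  }
  return { runStatus, statusCode, statusMessage };
}

// Usage: the error from the trace cited in this issue, with an empty conclusion.
const message = "Line 2: Too many items of type 'create_discussion'. Maximum allowed: 1.";
const outcome = deriveRunOutcome({ outputErrors: [message], errorMessages: [message] });
console.log(outcome.runStatus, outcome.statusCode); // prints "failure 2"
```

Because the fallback condition only checks `runStatus === "success"`, a run whose conclusion is explicitly `success` but whose output contains errors is also upgraded to `failure`. If that is not intended, the condition would need to distinguish an empty conclusion from an explicit one.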

</details>

<details>
<summary>Expected Outcome</summary>

- **In Grafana / Tempo**: `{resource.service.name="gh-aw" && status=error}` returns the real failing traces. `{span.gh-aw.run.status="failure"}` becomes a usable filter. The attribute-values index for `gh-aw.run.status` gains `failure` (and `cancelled` when relevant) instead of being stuck on a single `success` value.
- **In the JSONL mirror**: failed runs are visibly different from successful ones at the top-level span status field, not just inside nested exception events.
- **For on-call engineers**: a single TraceQL filter or alert rule is enough to find and page on agent failures. The existing `gh-aw.error.messages` attribute (already emitted) becomes immediately useful as a tooltip on those filtered spans.
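Once the top-level span status is trustworthy, a success rate can be computed from the JSONL mirror without opening exception events. The sketch below assumes each mirror line is an OTLP/JSON export payload (`resourceSpans` → `scopeSpans` → `spans`) with an integer `status.code`; the mirror's actual line format is not shown in this issue, and `summarizeSpanStatus` is a hypothetical name.

```javascript
// Hypothetical helper; assumes one OTLP/JSON export payload per line.
function summarizeSpanStatus(jsonlText) {
  const counts = { ok: 0, error: 0, unset: 0 };
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // tolerate blank lines
    const payload = JSON.parse(line);
    for (const resourceSpan of payload.resourceSpans ?? []) {
      for (const scopeSpan of resourceSpan.scopeSpans ?? []) {
        for (const span of scopeSpan.spans ?? []) {
          const code = span.status?.code ?? 0; // 0 = unset, 1 = OK, 2 = error
          if (code === 2) counts.error += 1;
          else if (code === 1) counts.ok += 1;
          else counts.unset += 1;
        }
      }
    }
  }
  const total = counts.ok + counts.error + counts.unset;
  // success_rate = ok / total, as in the SLO formula used in this issue.
  return { ...counts, total, successRate: total === 0 ? null : counts.ok / total };
}

// Usage with two synthetic lines: one OK span and one error span.
const makeLine = (name, code) =>
  JSON.stringify({ resourceSpans: [{ scopeSpans: [{ spans: [{ name, status: { code } }] }] }] });
const sample = [
  makeLine("gh-aw.agent.conclusion", 1),
  makeLine("gh-aw.agent.conclusion", 2),
].join("\n");

const summary = summarizeSpanStatus(sample);
console.log(summary.successRate); // prints 0.5
```

With today's data every span would land in `ok`, which is the stuck-at-100% success rate described earlier.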

</details>

<details>
<summary>Implementation Steps</summary>

- [ ] Edit `actions/setup/js/send_otlp_span.cjs` around line 1676 as shown above.
- [ ] Remove the now-superseded `statusCode = isAgentNonOK ? 2 : 1` assignment (line ~1651) and the standalone `if (isAgentFailure && errorMessages.length > 0)` statusMessage block (lines ~1678-1680).
- [ ] Update `actions/setup/js/send_otlp_span.test.cjs` to assert: (a) `gh-aw.run.status="failure"` and `status.code=2` when `agent_output.json` contains errors but `GH_AW_AGENT_CONCLUSION` is empty, (b) same when `agent_output.json` is missing on the agent job, (c) existing `agentConclusion=success` path still emits `status.code=1`.
- [ ] Run `cd actions/setup/js && npx vitest run send_otlp_span.test.cjs` to confirm tests pass.
- [ ] Run `make fmt` and `make test-unit`.
- [ ] Open a PR referencing this issue.
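
The three assertions listed for `send_otlp_span.test.cjs` can be expressed as a case table before wiring them into vitest. This is a sketch under assumptions: the input field names mirror the variables quoted in this issue, while `statusCases`, `findMismatches`, and the `derive` callback are hypothetical stand-ins for however the real test invokes the span builder.

```javascript
// Hypothetical case table mirroring assertions (a), (b), and (c) from the steps above.
const statusCases = [
  {
    name: "(a) errors in agent_output.json, GH_AW_AGENT_CONCLUSION empty",
    input: { agentConclusion: "", outputErrors: ["boom"], hasNoReadableAgentOutput: false },
    expected: { runStatus: "failure", statusCode: 2 },
  },
  {
    name: "(b) agent_output.json missing on the agent job",
    input: { agentConclusion: "", outputErrors: [], hasNoReadableAgentOutput: true },
    expected: { runStatus: "failure", statusCode: 2 },
  },
  {
    name: "(c) agentConclusion=success with clean output",
    input: { agentConclusion: "success", outputErrors: [], hasNoReadableAgentOutput: false },
    expected: { runStatus: "success", statusCode: 1 },
  },
];

// Runs every case through `derive` and returns the names of the cases that fail.
function findMismatches(derive) {
  return statusCases
    .filter(({ input, expected }) => {
      const actual = derive(input);
      return actual.runStatus !== expected.runStatus || actual.statusCode !== expected.statusCode;
    })
    .map(({ name }) => name);
}

// Stand-in for today's env-var-only logic, to show which cases fail before the change.
const envOnly = ({ agentConclusion }) =>
  agentConclusion === "failure" || agentConclusion === "timed_out"
    ? { runStatus: "failure", statusCode: 2 }
    : { runStatus: "success", statusCode: 1 };

console.log(findMismatches(envOnly).length); // prints 2: cases (a) and (b) fail today
```

In the real test file each row would likely become an `it.each` entry with `expect` assertions on the emitted span attributes and status.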

</details>

<details>
<summary>Evidence from Live Grafana Data</summary>

Queried Tempo datasource `grafanacloud-traces` over `2026-05-11T00:00:00Z` → `2026-05-18T07:00:00Z`:

- `tempo_get-attribute-values name="span.gh-aw.run.status"` returns **exactly one value: `"success"`** across the entire 7-day window. No `failure`, no `cancelled`.
- `{resource.service.name="gh-aw" && status=error}` returns **0 traces**, despite the same data showing exception events on multiple traces.
- `{span.gh-aw.error_count=1}` returns traces with real agent errors. Inspecting trace `5b3a7917f205e61028bd3d6b0f921c72` (`gh-aw.copilot-cli-deep-research`):

```
job=agent span=gh-aw.agent.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
 EVENT: exception type=gh-aw.AgentError message="Line 2: Too many items of type 'create_discussion'. Maximum allowed: 1."
job=detection span=gh-aw.detection.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
job=safe_outputs span=gh-aw.safe_outputs.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
job=conclusion span=gh-aw.conclusion.conclusion status_code=STATUS_CODE_OK run_status=success error_count=1 agent_conclusion=None
```

Every single conclusion span in that failing run reports `status_code=OK` and `gh-aw.run.status=success`. The `gh-aw.agent.conclusion` attribute is absent on every span (it is also missing from the Tempo attribute-name index, confirming the env var path never set it in production).

</details>

<details>
<summary>Related Files</summary>

- `actions/setup/js/send_otlp_span.cjs` (primary change)
- `actions/setup/js/send_otlp_span.test.cjs` (test additions)
- `actions/setup/js/action_conclusion_otlp.cjs` (caller — no change expected)
- `actions/setup/js/generate_observability_summary.cjs` (consumer — verify summary reflects new status)

</details>

---

*Generated by the [Daily Grafana OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/26015813823) workflow*

> Generated by [📊 Daily Grafana OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/26015813823) · ● 19.8M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-grafana-otel-instrumentation-advisor%22&type=issues)
> - [x] expires on May 25, 2026, 5:54 AM UTC
