Graceful degradation when MCP server is unavailable #5025

## 🔴 Required Information

### Is your feature request related to a specific problem?

When an MCP server configured in an agent's toolset is unreachable (not started, network failure, crash), the entire agent invocation fails with an unrecoverable `ConnectionError`. The agent cannot continue operating with its remaining tools or built-in knowledge.

`McpToolset.get_tools()` is called during every LLM step via `_preprocess_async` → `_process_agent_tools` → `_convert_tool_union_to_tools` → `get_tools_with_prefix`. There is no `try/except` anywhere in this chain, so a single unavailable MCP server takes down the entire agent, even when the agent has other tools or could answer from its own knowledge.

```
base_llm_flow.py _preprocess_async
  → _process_agent_tools (no try/except)
    → _convert_tool_union_to_tools (no try/except)
      → base_toolset.py get_tools_with_prefix (no try/except)
        → mcp_toolset.py get_tools
          → _execute_with_session → create_session
            → ConnectionError: Failed to create MCP session
```

### Describe the Solution You'd Like

Add an `optional` parameter to `McpToolset` (default `False` for backward compatibility). When it is `True`, a connection failure in `get_tools()` returns an empty list instead of raising, so the agent can continue with its remaining tools.

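```python
# Import paths as of google-adk 1.x; adjust if your version differs.
from google.adk.agents import LlmAgent
from google.adk.tools.mcp_tool.mcp_session_manager import SseConnectionParams
from google.adk.tools.mcp_tool.mcp_toolset import McpToolset

toolset = McpToolset(
    connection_params=SseConnectionParams(url="http://mcp-server:5031/mcp"),
    tool_filter=["search"],
    optional=True,  # Proposed parameter: agent continues if this server is down
)

agent = LlmAgent(
    model="gemini-2.0-flash",
    name="assistant",
    tools=[toolset],  # Agent works even if MCP server is unavailable
)
```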

### Impact on your work

In production environments, MCP servers are deployed as independent services and can go down for maintenance, scaling events, or unexpected failures. Currently, an agent with multiple tools from multiple MCP servers becomes completely non-functional if any single MCP server is temporarily unavailable. This severely impacts service reliability.

Agents configured with various MCP tool combinations should not have their entire experience broken by a single MCP server outage.

### Willingness to contribute

Yes — happy to submit a PR if the team agrees on an approach.

---

## 🟡 Recommended Information

### Describe Alternatives You've Considered

There is no clean way to handle this externally:

- **`before_tool_callback` plugin approach** does not work because `get_tools()` fails during tool discovery (before any specific tool is called), so the callback is never reached.
- **Catching errors at agent construction time** and skipping unavailable MCP servers prevents the agent from ever discovering tools if the server comes back online mid-conversation.
- **Subclassing `McpToolset`** and overriding `get_tools()` works as a temporary workaround, but it relies on internal implementation details and may break with future ADK changes.
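
For reference, the subclassing workaround from the last bullet looks roughly like the sketch below. It is self-contained for illustration only: `UnreachableToolset` is a hypothetical stand-in for an `McpToolset` whose server is down, and `OptionalToolset` is an illustrative name, not part of ADK. In a real project the subclass would derive from `McpToolset` itself.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class UnreachableToolset:
    """Hypothetical stand-in for McpToolset whose server is down."""

    async def get_tools(self, readonly_context=None):
        raise ConnectionError("Failed to create MCP session")


class OptionalToolset(UnreachableToolset):
    """Workaround subclass: treat a connection failure as 'no tools'."""

    async def get_tools(self, readonly_context=None):
        try:
            return await super().get_tools(readonly_context)
        except ConnectionError as exc:
            logger.warning("MCP toolset unavailable, returning no tools: %s", exc)
            return []


tools = asyncio.run(OptionalToolset().get_tools())
print(tools)  # []
```

The override has to mirror the signature of `get_tools()` exactly, which is why this approach is fragile across ADK releases.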

None of these are ideal. A first-class `optional` parameter would be the cleanest solution.

### Proposed API / Implementation

**Option A: `optional` flag on `McpToolset` (minimal, recommended)**

Add `optional: bool = False` to `McpToolset.__init__()`. In `get_tools()`, catch `ConnectionError` when `optional=True`:

```python
# In mcp_toolset.py
class McpToolset(BaseToolset):
    def __init__(self, *, connection_params, optional=False, **kwargs):
        super().__init__(**kwargs)
        self._optional = optional
        # ... existing init ...

    @retry_on_errors
    async def get_tools(self, readonly_context=None):
        try:
            tools_response = await self._execute_with_session(
                lambda session: session.list_tools(),
                "Failed to get tools from MCP server",
                readonly_context,
            )
        except ConnectionError:
            if self._optional:
                logger.warning("Optional MCP toolset unavailable, returning empty tools")
                return []
            raise
        # ... rest of method ...
```

**Option B: Error handling in `_process_agent_tools` (broader)**

Wrap toolset resolution inside `_process_agent_tools` (in `base_llm_flow.py`):

```python
try:
    tools = await _convert_tool_union_to_tools(tool_union, ...)
except ConnectionError as e:
    logger.warning("Toolset %s unavailable, skipping: %s", tool_union, e)
    continue
```

This is broader but changes behavior for all toolsets without opt-in.
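
To make the Option B behavior concrete, here is a minimal, self-contained sketch of a resolution loop that skips unreachable toolsets. Everything in it is hypothetical: `StaticToolset`, `DownToolset`, and `resolve_tools` are stand-ins invented for this example and do not reflect the actual ADK internals.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class StaticToolset:
    """Hypothetical toolset that always resolves to a fixed tool list."""

    def __init__(self, names):
        self._names = list(names)

    async def get_tools(self):
        return list(self._names)


class DownToolset:
    """Hypothetical toolset whose backing server is unreachable."""

    async def get_tools(self):
        raise ConnectionError("Failed to create MCP session")


async def resolve_tools(toolsets):
    """Collect tools from every toolset, skipping any that cannot connect."""
    resolved = []
    for toolset in toolsets:
        try:
            resolved.extend(await toolset.get_tools())
        except ConnectionError as exc:
            logger.warning("Toolset %r unavailable, skipping: %s", toolset, exc)
            continue
    return resolved


available = asyncio.run(
    resolve_tools([StaticToolset(["search"]), DownToolset(), StaticToolset(["fetch"])])
)
print(available)  # ['search', 'fetch']
```

Because the skip applies to every toolset, a server the agent genuinely depends on would also be dropped with only a warning. That is the trade-off compared with the opt-in flag in Option A.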

### Additional Context

- Tested on `google-adk` 1.27.4; also verified that the issue is not addressed in 1.28.0.
- The `@retry_on_errors` decorator retries once, but both attempts fail when the server is truly down, adding a ~20s delay before the final `ConnectionError`.
- Python 3.13, using `StreamableHTTPConnectionParams` for MCP connections.
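
One side effect of where the `try/except` sits in Option A: an exception caught inside the decorated method never reaches the retry wrapper, so an optional toolset would make one connection attempt rather than two. The sketch below counts attempts to show this. `retry_once` is a hypothetical stand-in for `@retry_on_errors`, whose real implementation may differ, and `connect` merely simulates a server that is down.

```python
import asyncio
import functools


def retry_once(func):
    """Hypothetical stand-in for @retry_on_errors: retry a failed call one time."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Exception:
            return await func(*args, **kwargs)

    return wrapper


attempts = {"required": 0, "optional": 0}


async def connect(kind):
    """Simulate a session attempt against a server that is down."""
    attempts[kind] += 1
    raise ConnectionError("Failed to create MCP session")


@retry_once
async def get_tools_required():
    return await connect("required")


@retry_once
async def get_tools_optional():
    try:
        return await connect("optional")
    except ConnectionError:
        return []  # swallowed here, so retry_once never sees a failure


async def main():
    try:
        await get_tools_required()
    except ConnectionError:
        pass  # the required toolset still fails after its retry
    return await get_tools_optional()


optional_tools = asyncio.run(main())
print(attempts)  # {'required': 2, 'optional': 1}
```

Whether skipping the retry is desirable for optional toolsets is a design question for the maintainers. The sketch only shows the ordering effect, not the actual timing.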
