Graceful degradation when MCP server is unavailable #5025

@xianhong1208

🔴 Required Information

Is your feature request related to a specific problem?

When an MCP server configured in an agent's toolset is unreachable (not started, network failure, crash), the entire agent invocation fails with an unrecoverable ConnectionError. The agent cannot continue operating with its remaining tools or built-in knowledge.

McpToolset.get_tools() is called during every LLM step via _preprocess_async → _process_agent_tools → _convert_tool_union_to_tools. Nothing in this chain catches exceptions, so a single unavailable MCP server takes down the entire agent, even when the agent has other tools or could answer from its own knowledge.

base_llm_flow.py _preprocess_async
  → _process_agent_tools (no try/except)
    → _convert_tool_union_to_tools (no try/except)
      → base_toolset.py get_tools_with_prefix (no try/except)
        → mcp_toolset.py get_tools
          → _execute_with_session → create_session
            → ConnectionError: Failed to create MCP session

Describe the Solution You'd Like

Add a boolean parameter named optional to McpToolset (default False for backward compatibility). When it is True, connection failures in get_tools() return an empty list instead of raising, so the agent can continue with its remaining tools.

toolset = McpToolset(
    connection_params=SseConnectionParams(url="http://mcp-server:5031/mcp"),
    tool_filter=["search"],
    optional=True,  # Agent continues if this server is down
)

agent = LlmAgent(
    model="gemini-2.0-flash",
    name="assistant",
    tools=[toolset],  # Agent works even if MCP server is unavailable
)

Impact on your work

In production environments, MCP servers are deployed as independent services and can go down for maintenance, scaling events, or unexpected failures. Currently, an agent with multiple tools from multiple MCP servers becomes completely non-functional if any single MCP server is temporarily unavailable. This severely impacts service reliability.

Agents configured with various MCP tool combinations should not have their entire experience broken by a single MCP server outage.

Willingness to contribute

Yes — happy to submit a PR if the team agrees on an approach.


🟡 Recommended Information

Describe Alternatives You've Considered

There is no clean way to handle this externally:

  • A before_tool_callback plugin does not work because get_tools() fails during tool discovery (before any specific tool is called), so the callback is never reached.
  • Catching errors at agent construction time and skipping unavailable MCP servers prevents the agent from ever discovering tools if the server comes back online mid-conversation.
  • Subclassing McpToolset and overriding get_tools() works as a temporary workaround, but it relies on internal implementation details and may break with future ADK changes.

None of these are ideal. A first-class optional parameter would be the cleanest solution.
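
For reference, the subclassing workaround from the last bullet can be sketched roughly as follows. This is a minimal, self-contained illustration: FakeMcpToolset is a hypothetical stand-in for McpToolset (so the snippet runs without ADK or a live server), and OptionalToolset is a name made up for this example. A real workaround would subclass the actual McpToolset and depend on its internals, which is exactly the fragility described above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeMcpToolset:
    """Hypothetical stand-in for McpToolset; not the real ADK class."""

    def __init__(self, tools, available=True):
        self._tools = tools
        self.available = available

    async def get_tools(self, readonly_context=None):
        # Mimics the failure from the trace above when the server is down.
        if not self.available:
            raise ConnectionError("Failed to create MCP session")
        return list(self._tools)


class OptionalToolset(FakeMcpToolset):
    """Workaround subclass: swallow connection failures during tool discovery."""

    async def get_tools(self, readonly_context=None):
        try:
            return await super().get_tools(readonly_context)
        except ConnectionError as exc:
            logger.warning("MCP toolset unavailable, returning no tools: %s", exc)
            return []


async def demo():
    toolset = OptionalToolset(["search"], available=False)
    while_down = await toolset.get_tools()
    toolset.available = True  # server comes back mid-conversation
    after_recovery = await toolset.get_tools()
    return while_down, after_recovery


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because discovery is retried on every call, the tools reappear once the server recovers, which the construction-time approach in the second bullet cannot do.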

Proposed API / Implementation

Option A: optional flag on McpToolset (minimal, recommended)

Add optional: bool = False to McpToolset.__init__(). In get_tools(), catch ConnectionError when optional=True:

# In mcp_toolset.py
class McpToolset(BaseToolset):
    def __init__(self, *, connection_params, optional=False, **kwargs):
        super().__init__(**kwargs)
        self._optional = optional
        # ... existing init ...

    @retry_on_errors
    async def get_tools(self, readonly_context=None):
        try:
            tools_response = await self._execute_with_session(
                lambda session: session.list_tools(),
                "Failed to get tools from MCP server",
                readonly_context,
            )
        except ConnectionError:
            if self._optional:
                logger.warning("Optional MCP toolset unavailable, returning empty tools")
                return []
            raise
        # ... rest of method ...

Option B: Error handling in _process_agent_tools (broader)

Wrap toolset resolution in base_llm_flow.py _process_agent_tools:

try:
    tools = await _convert_tool_union_to_tools(tool_union, ...)
except ConnectionError as e:
    logger.warning("Toolset %s unavailable, skipping: %s", tool_union, e)
    continue

This is broader but changes behavior for all toolsets without opt-in.
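
To make the trade-off concrete, here is a small standalone sketch of the Option B behavior. StubToolset and resolve_tools are hypothetical names used only for this illustration; they are not ADK APIs, and the real _process_agent_tools has a different signature. The sketch only shows the "skip on ConnectionError, keep everything else" shape.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class StubToolset:
    """Hypothetical toolset; raises ConnectionError when its server is down."""

    def __init__(self, name, tools, available=True):
        self.name = name
        self._tools = tools
        self.available = available

    async def get_tools(self):
        if not self.available:
            raise ConnectionError(f"Failed to create MCP session for {self.name}")
        return list(self._tools)


async def resolve_tools(toolsets):
    """Collect tools from every reachable toolset, skipping unreachable ones."""
    resolved = []
    for toolset in toolsets:
        try:
            resolved.extend(await toolset.get_tools())
        except ConnectionError as exc:
            # Applies to every toolset, with no per-toolset opt-in.
            logger.warning("Toolset %s unavailable, skipping: %s", toolset.name, exc)
            continue
    return resolved


if __name__ == "__main__":
    toolsets = [
        StubToolset("search-server", ["search"]),
        StubToolset("fetch-server", ["fetch"], available=False),
        StubToolset("calc-server", ["calc"]),
    ]
    print(asyncio.run(resolve_tools(toolsets)))
```

Note that in this shape a toolset the agent genuinely requires is skipped just as silently as an optional one, which is the main argument for the opt-in flag in Option A.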

Additional Context

  • Tested on google-adk 1.27.4; also verified that the issue is not addressed in 1.28.0
  • The @retry_on_errors decorator retries once, but both attempts fail when the server is truly down, adding ~20s delay before the final ConnectionError
  • Python 3.13, using StreamableHTTPConnectionParams for MCP connections
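
One possible side effect of Option A on the retry delay, sketched under assumptions: if the decorator simply wraps get_tools() and retries when an exception propagates out of it, then catching ConnectionError inside the method means the retry never fires for optional toolsets. retry_once and DownToolset below are hypothetical stand-ins written for this illustration; the real retry_on_errors may behave differently, so this would need to be checked against the ADK source.

```python
import asyncio
import functools


def retry_once(func):
    """Hypothetical retry-once decorator; not ADK's retry_on_errors."""

    @functools.wraps(func)
    async def wrapper(self, *args, **kwargs):
        try:
            return await func(self, *args, **kwargs)
        except Exception:
            # Second and final attempt; any error now propagates to the caller.
            return await func(self, *args, **kwargs)

    return wrapper


class DownToolset:
    """Hypothetical toolset whose server is always unreachable."""

    def __init__(self, optional=False):
        self.optional = optional
        self.attempts = 0

    async def _list_tools(self):
        self.attempts += 1
        raise ConnectionError("Failed to create MCP session")

    @retry_once
    async def get_tools(self):
        try:
            return await self._list_tools()
        except ConnectionError:
            if self.optional:
                # Caught inside the decorated method, so no retry happens.
                return []
            raise


async def count_attempts(optional):
    toolset = DownToolset(optional=optional)
    try:
        await toolset.get_tools()
    except ConnectionError:
        pass
    return toolset.attempts


if __name__ == "__main__":
    print("required:", asyncio.run(count_attempts(False)))
    print("optional:", asyncio.run(count_attempts(True)))
```

In this sketch the required toolset makes two connection attempts before failing, while the optional one makes a single attempt and returns an empty list.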
