Add dedicated streaming server for fast GGUF SSE inference #4635
danielhanchen wants to merge 2 commits into main
Conversation
Adds a separate streaming server (Option B) that bypasses the main
app's asyncio event loop for GGUF token streaming.
When UNSLOTH_FAST_SSE=1 or UNSLOTH_STREAM_SERVER=1 is set, a
lightweight FastAPI app starts in a daemon thread on a random port.
The frontend negotiates a one-time token via /api/inference/stream-url
and streams directly from this server, falling back to the baseline
/v1/chat/completions transparently on failure.
Architecture:
- streaming_server.py: standalone FastAPI app with three paths:
- Path A (hot): async httpx.AsyncClient streaming direct to
llama-server. No cumulative-to-delta conversion needed since
llama-server sends delta tokens natively via OpenAI SSE.
- Path B: tool calling via asyncio.to_thread (tool execution
is the bottleneck, not the streaming proxy).
- Path C: non-streaming one-shot JSON response.
- stream_token_store.py: thread-safe one-time token store (10s TTL)
shared between the main app and streaming server.
- routes/inference.py: /stream-url endpoint for token issuance,
/direct-stream for Option C, /internal/consume-stream-token for
future subprocess mode.
- llama_cpp.py: --api-key support so the streaming server can
authenticate to llama-server directly.
- chat-api.ts: transparent fast-path negotiation with silent fallback.
Handles: vision (image_url + legacy image_base64), thinking models
(reasoning_content -> <think> tags), CORS preflight, friendly error
messages, stream_options usage/timings passthrough.
Only sends repeat_penalty when the client explicitly provides it,
avoiding the 24% TPS penalty from repetition scanning.
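The one-time token store described above can be sketched as a small thread-safe class. This is a hypothetical minimal version, not the actual `stream_token_store.py`: it assumes a monotonic-clock TTL and removal-on-first-read, which is what "one-time token, 10s TTL" implies.

```python
import secrets
import threading
import time
from typing import Dict, Optional, Tuple


class OneTimeTokenStore:
    """Thread-safe store of single-use stream tokens with a short TTL.

    Illustrative sketch only; the real stream_token_store.py may differ.
    """

    def __init__(self, ttl_seconds: float = 10.0):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        # token -> (subject, absolute expiry on the monotonic clock)
        self._tokens: Dict[str, Tuple[str, float]] = {}

    def create(self, subject: str) -> str:
        token = secrets.token_urlsafe(32)
        with self._lock:
            self._tokens[token] = (subject, time.monotonic() + self._ttl)
        return token

    def consume(self, token: str) -> Optional[str]:
        """Return the subject and invalidate the token; None if unknown or expired."""
        with self._lock:
            entry = self._tokens.pop(token, None)  # one-time: always removed
        if entry is None:
            return None
        subject, expiry = entry
        return subject if time.monotonic() <= expiry else None
```

Because the main app and the streaming server run in the same process (the server lives in a daemon thread), sharing a single instance of such a store is enough; no IPC is needed until the future subprocess mode.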
```python
yield f"data: {json.dumps({'error': {'message': _friendly_error(e), 'type': 'server_error'}})}\n\n"
# ...
return StreamingResponse(
    sse_generator(), media_type = "text/event-stream", headers = _SSE_HEADERS
```
Check warning: Code scanning / CodeQL
Information exposure through an exception (Medium)

Copilot Autofix (AI, 2 days ago)
In general, to fix information exposure through exceptions, never send raw exception messages or stack traces to clients. Instead, log detailed error information on the server side, and return a generic, non-sensitive message to the client. If you need user-friendly messages, base them on known-safe patterns rather than arbitrary exception text.
For this file, the best targeted fix is to modify _friendly_error so that it no longer returns text dynamically constructed from str(e) (such as interpolating token counts from a backend error message). Instead, it should inspect the message only to decide which predefined generic text to return. That way, even if the backend error message contains sensitive details, these never flow back to the client. Concretely:
- Keep using `msg = str(e)` and regex checks to detect specific error cases, but avoid returning `m.group(1)` or `m.group(2)` to the user.
- Replace the current context-window error string with a static, generic explanation without embedding numeric values from the backend.
- Keep the existing generic lost-connection message; it already does not reuse raw backend text.
- Keep the default fallback `"An internal error occurred"`.

All other logic stays the same. No changes are required at the call site (the `except Exception as e:` block and the `yield` with `_friendly_error(e)` remain). No new imports or helper methods are needed.
```diff
@@ -43,15 +43,15 @@
 def _friendly_error(e):
     """Convert raw exception messages to user-readable strings."""
     msg = str(e)
-    m = re.search(
+    # Detect specific known error patterns, but do not expose raw values from the
+    # backend error message to the client. Only return generic, pre-defined text.
+    if re.search(
         r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
         msg,
-    )
-    if m:
+    ):
         return (
-            f"Message too long: {m.group(1)} tokens exceeds the {m.group(2)}-token "
-            f"context window. Try increasing the Context Length in Model settings, "
-            f"or shorten the conversation."
+            "Message too long for the current context window. "
+            "Try increasing the Context Length in Model settings, or shorten the conversation."
         )
     if "Lost connection to llama-server" in msg:
         return "Lost connection to the model server. It may have crashed -- try reloading the model."
```
```python
yield f"data: {json.dumps({'error': {'message': _friendly_error(e), 'type': 'server_error'}})}\n\n"
# ...
return StreamingResponse(
    tool_sse(), media_type = "text/event-stream", headers = _SSE_HEADERS
```
Check warning: Code scanning / CodeQL
Information exposure through an exception (Medium)

Copilot Autofix (AI, 2 days ago)
In general terms, the fix is to ensure that user-facing error messages are generic and do not depend on the raw exception text. The exception (and any stack trace or detailed message) should be logged server-side only, while the client receives a high-level, non-sensitive description. Helper functions such as _friendly_error should avoid calling str(e) for arbitrary exceptions when constructing client-visible messages.
The best focused fix here is to change _friendly_error so that it only uses raw exception messages for very narrow, known-safe patterns (or not at all), and otherwise always returns a generic string without leaking str(e). In this code, _friendly_error already returns a generic "An internal error occurred" for the default case; the problematic bit is using str(e) unconditionally to do regex and substring matches. To avoid exposing arbitrary exception details while retaining functionality, we can still call str(e) internally but constrain what we return: for the context-window regex we only surface the numeric token counts (derived via regex groups), and for the connection-loss case we return a static string that does not echo the original message. We should not fall back to returning msg itself anywhere.
Concretely, in studio/backend/streaming_server.py, we keep the structure of _friendly_error but ensure it never returns raw or formatted exception messages or stack traces, only predefined strings. We do not need to touch the tool_sse function or the StreamingResponse construction; once _friendly_error is safe, the streamed JSON error notice is also safe. No new imports or helper methods are required.
```diff
@@ -41,20 +41,23 @@
 def _friendly_error(e):
-    """Convert raw exception messages to user-readable strings."""
+    """Convert raw exception messages to user-readable strings without leaking internals."""
     msg = str(e)
     m = re.search(
         r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
         msg,
     )
     if m:
+        # Only expose the numeric limits, not the original error text.
         return (
             f"Message too long: {m.group(1)} tokens exceeds the {m.group(2)}-token "
             f"context window. Try increasing the Context Length in Model settings, "
             f"or shorten the conversation."
         )
     if "Lost connection to llama-server" in msg:
+        # Return a generic connectivity message, without including the raw exception.
         return "Lost connection to the model server. It may have crashed -- try reloading the model."
+    # For all other errors, avoid exposing any exception details.
     return "An internal error occurred"
```
Code Review
This pull request implements a high-performance "fast-path" streaming architecture for the inference backend. It introduces a dedicated FastAPI streaming server (Option B) that runs in a separate thread to eliminate asyncio contention, supported by a short-lived, one-time token store for authentication. Additionally, it adds support for direct client access to the underlying llama-server via API keys (Option C). The frontend is updated to automatically negotiate these fast-paths with a silent fallback to the standard OpenAI-compatible endpoint. Review feedback identifies a logic flaw in the streaming server's state management for thinking tags and suggests consistent use of None for optional authorization headers in the backend.
studio/backend/streaming_server.py (254-258)

The logic for handling the end of the stream when `in_thinking` is true seems flawed:

- The condition `if has_content_tokens:` inside `if in_thinking:` is unreachable, because `in_thinking` is set to `False` as soon as a content token is processed.
- The `else` block incorrectly re-yields the entire `reasoning_text`, causing duplicated content in the output.

When the stream ends and only reasoning tokens have been sent (`in_thinking` is true), the correct action is to simply close the tag:

```python
if in_thinking:
    # Only reasoning tokens were received, so close the <think> tag.
    yield f"data: {json.dumps({'id': completion_id, 'object': 'chat.completion.chunk', 'created': created, 'model': model_name, 'choices': [{'index': 0, 'delta': {'content': '</think>'}, 'finish_reason': None}]}, separators=_SEP)}\n\n"
```

studio/backend/core/inference/llama_cpp.py (2098)

For consistency with other parts of this file (e.g., line 1564), it's better to use `None` instead of an empty dictionary `{}` when `self._api_key` is not present. This pattern is also repeated on line 2217. While httpx handles both correctly, using `None` is more idiomatic and consistent with the other changes in this PR.

```python
_auth_headers = {"Authorization": f"Bearer {self._api_key}"} if self._api_key else None
```

References: using `None` for optional dictionary parameters is more idiomatic in Python and maintains consistency within the codebase.
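The reasoning-to-`<think>` conversion that the first comment targets can be modeled as a small generator. This is a hypothetical sketch, not the PR's actual code: deltas are simplified to plain dicts instead of full SSE chunks, and the field name `reasoning_content` follows the llama-server delta shape described in the PR. It shows why `in_thinking` is cleared on the first content token and why the closing tag must be emitted exactly once at end-of-stream.

```python
def wrap_thinking_deltas(deltas):
    """Yield content deltas with reasoning tokens wrapped in <think>...</think>.

    Sketch of the streaming server's state machine: `in_thinking` tracks
    whether we are inside an open <think> tag.
    """
    in_thinking = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content")
        content = delta.get("content")
        if reasoning:
            if not in_thinking:
                in_thinking = True
                yield {"content": "<think>" + reasoning}  # open tag on first reasoning token
            else:
                yield {"content": reasoning}
        if content:
            if in_thinking:
                in_thinking = False  # first content token closes the tag...
                yield {"content": "</think>" + content}
            else:
                yield {"content": content}
    if in_thinking:
        # ...so reaching end-of-stream with in_thinking still True means only
        # reasoning tokens arrived: just close the tag, re-yielding nothing.
        yield {"content": "</think>"}
```

Because `in_thinking` goes `False` as soon as any content arrives, the end-of-stream branch can never see both `in_thinking` and prior content tokens, which is exactly the unreachable condition the review flagged.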
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e945f43652
```ts
const response = useFastPath
  ? await fetch(fastUrl, {
      method: "POST",
```
Fall back to baseline when fast stream request fails
When useFastPath is true, the fast-path POST is awaited directly and any network error or non-2xx response bubbles out as a hard failure instead of retrying /v1/chat/completions. This breaks chats whenever the dedicated server is briefly unavailable (startup race, token expiry, port refusal, etc.), even though the fast path is meant to be optional and silently degradable.
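The silent-degradation pattern this comment asks for can be sketched language-agnostically; the following is a hypothetical Python rendering of what the `chat-api.ts` logic should do (names like `stream_with_fallback` and `do_request` are illustrative stand-ins for the actual `fetch` calls):

```python
def stream_with_fallback(fast_url, baseline_url, payload, do_request):
    """Try the fast-path endpoint first; on ANY failure, retry the baseline
    endpoint instead of surfacing a hard error.

    `do_request` stands in for the real HTTP call and may raise on network
    errors or non-2xx responses.
    """
    if fast_url is not None:
        try:
            return do_request(fast_url, payload)
        except Exception:
            pass  # startup race, token expiry, port refusal: degrade silently
    return do_request(baseline_url, payload)
```

The key point is that the fast-path request sits inside its own try/except, so a refused connection or expired one-time token never escapes to the user; only a failure of the baseline path itself is a real error.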
```python
token = create_stream_token(current_subject)
return {
    "supported": True,
    "stream_url": f"http://127.0.0.1:{stream_port}/stream",
```
Build stream URL from request host instead of 127.0.0.1
Returning a hardcoded loopback URL makes the browser connect to its own localhost, not the backend host, whenever Studio is accessed remotely (or through a proxy). In those environments the fast path becomes unreachable (and can also hit mixed-content issues due to hardcoded http), so enabling fast SSE can regress normal streaming behavior.
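One way to address this, sketched below under the assumption that the route handler can see the incoming request's Host header (the function name and parameters are illustrative, not the PR's actual code), is to rebuild the URL from that header instead of hardcoding the loopback address:

```python
def build_stream_url(request_host_header: str, stream_port: int, scheme: str = "http") -> str:
    """Derive the fast-path stream URL from the incoming request's Host header
    so remote or proxied clients reach the backend host, not their own localhost.

    Note: bracketed IPv6 literals and X-Forwarded-Host / X-Forwarded-Proto
    (for proxies and https) would need extra handling in real code.
    """
    host = request_host_header.split(":")[0]  # drop any port from the Host header
    return f"{scheme}://{host}:{stream_port}/stream"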
```python
else:
    self._api_key = None

logger.info(f"Starting llama-server: {' '.join(cmd)}")
```
Redact direct-stream API key from startup logs
When UNSLOTH_DIRECT_STREAM=1, cmd includes --api-key <secret> and this log statement prints the full command line, exposing the bearer key in logs. Anyone with log access can reuse that key to call llama-server directly, which defeats the purpose of protecting the direct endpoint with an API key.
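A minimal fix, sketched here as a hypothetical helper (not the PR's actual code), is to redact the value that follows any secret-bearing flag before the command line reaches the logger:

```python
from typing import List, Tuple


def redact_cmd(cmd: List[str], secret_flags: Tuple[str, ...] = ("--api-key",)) -> List[str]:
    """Return a copy of a command line with values following secret flags
    replaced by a placeholder, so logging ' '.join(cmd) never prints the key.
    """
    out = []
    redact_next = False
    for arg in cmd:
        if redact_next:
            out.append("<redacted>")
            redact_next = False
        else:
            out.append(arg)
            if arg in secret_flags:
                redact_next = True
    return out
```

The log line would then become `logger.info(f"Starting llama-server: {' '.join(redact_cmd(cmd))}")`, leaving the real `cmd` untouched for the subprocess launch.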
Summary

- When `UNSLOTH_FAST_SSE=1` is set, the frontend transparently negotiates a fast path via one-time tokens
- Falls back to `/v1/chat/completions` silently on any failure (no user-visible changes)

How it works

- `main.py` spawns a lightweight FastAPI app on a random port in a daemon thread
- The frontend calls `GET /api/inference/stream-url` to get a one-time token (10s TTL)
- The frontend streams from `http://127.0.0.1:{port}/stream` with the token in the `X-Stream-Token` header
- The streaming server proxies to llama-server via `httpx.AsyncClient`
- On any failure, the frontend falls back to the baseline `/v1/chat/completions` endpoint

Three streaming paths

- Path A (hot): async streaming direct to llama-server
- Path B: tool calling via `asyncio.to_thread` (tool execution is the bottleneck)
- Path C: non-streaming one-shot JSON response

Feature parity

- Vision: `image_url` content parts and legacy `image_base64` field, PNG conversion
- Thinking models: `reasoning_content` in delta wrapped in `<think>` tags
- Tool calling: `tool_status`, `tool_start`, `tool_end` SSE events
- Usage/timings: `stream_options.include_usage` passthrough from llama-server

Files changed

- `streaming_server.py`: standalone FastAPI streaming app
- `stream_token_store.py`: one-time token store
- `main.py`: starts the streaming server when `UNSLOTH_FAST_SSE=1`
- `routes/inference.py`: `/stream-url`, `/direct-stream`, `/internal/consume-stream-token` endpoints
- `core/inference/llama_cpp.py`: `--api-key` support, auth header passthrough in all streaming methods
- `chat-api.ts`: transparent fast-path negotiation with silent fallback

Key design decision: `repeat_penalty`

The streaming server only sends `repeat_penalty` to llama-server when the client explicitly provides `repetition_penalty`. Without it, llama-server uses its own default (1.0, no penalty). This avoids a 24% TPS cost from repetition scanning that was previously always applied. See #4634 for the related Pydantic default fix.
Test plan

- With `UNSLOTH_FAST_SSE=1`, verify streaming works through the fast path
- Without `UNSLOTH_FAST_SSE`, verify baseline streaming still works (no regression)
- With `stream: false`, verify JSON response
- Verify thinking-model reasoning streams correctly (wrapped in `<think>` tags)