[model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching #15515
Conversation
Code Review
This pull request introduces significant performance optimizations to the WASM runtime by implementing instance pooling and component caching. These are excellent improvements that will reduce per-request overhead and latency. The code is well-structured and the changes are clearly explained. I have a couple of suggestions to further enhance the implementation: one is a minor optimization to avoid an unnecessary data clone during caching, and the other is a recommendation for a more robust cache eviction strategy to handle high-load scenarios more gracefully. Overall, this is a very valuable contribution.
```rust
})?;

// Insert into cache
cache.insert(wasm_bytes.clone(), comp.clone());
```
The wasm_bytes vector is being cloned here, which can be inefficient if the WASM component is large. Since wasm_bytes is an owned value in this function and is not used after this point, you can avoid this clone by moving it directly into the cache. This will prevent an unnecessary memory allocation and copy operation.
Suggested change:

```diff
- cache.insert(wasm_bytes.clone(), comp.clone());
+ cache.insert(wasm_bytes, comp.clone());
```
```rust
if cache.len() >= config.module_cache_size {
    debug!(
        target: "sgl_model_gateway::wasm::runtime",
        "Module cache full ({} items), clearing.",
        cache.len()
    );
    cache.clear();
}
```
The current cache eviction strategy of clearing the entire cache (cache.clear()) is simple, but it can be inefficient under certain workloads. If the set of actively used WASM modules is larger than module_cache_size, this could lead to cache thrashing, where modules are repeatedly recompiled, negating some of the benefits of caching.
For a more robust solution, consider using a Least Recently Used (LRU) eviction policy. This would ensure that the most frequently accessed components are retained. You could achieve this by using a crate like lru, which provides lru::LruCache as a near drop-in replacement for HashMap.
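A minimal sketch of that approach, assuming the cache lives in the worker loop and `module_cache_size` comes from the existing runtime config; the helper names (`new_component_cache`, `get_or_compile`) are hypothetical, not part of this PR:

```rust
use std::num::NonZeroUsize;

use lru::LruCache;
use wasmtime::component::Component;
use wasmtime::Engine;

// Hypothetical constructor: `module_cache_size` mirrors the config field
// referenced in this PR.
fn new_component_cache(module_cache_size: usize) -> LruCache<Vec<u8>, Component> {
    LruCache::new(NonZeroUsize::new(module_cache_size).expect("cache size must be > 0"))
}

// Look up a compiled component by its module bytes, compiling on a miss.
fn get_or_compile(
    engine: &Engine,
    cache: &mut LruCache<Vec<u8>, Component>,
    wasm_bytes: Vec<u8>,
) -> wasmtime::Result<Component> {
    if let Some(comp) = cache.get(&wasm_bytes) {
        // Hit: `get` also marks the entry as most recently used.
        return Ok(comp.clone());
    }
    let comp = Component::new(engine, &wasm_bytes)?;
    // `put` evicts only the least recently used entry once the cache is
    // full, rather than clearing everything.
    cache.put(wasm_bytes, comp.clone());
    Ok(comp)
}
```

With this shape, a working set slightly larger than `module_cache_size` degrades gracefully instead of triggering a full recompile of every module.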
Motivation
I identified a per-request overhead in the current WASM middleware implementation within `sgl-model-gateway`, which acts as a bottleneck for high-throughput serving. The two primary performance issues addressed in this PR are:

- Allocating a fresh `wasmtime::Store` and linear memory (via `mmap`) for every single request.
- Recompiling the WASM component inside the hot loop on every request.

These operations add milliseconds of latency to every request. This PR introduces Instance Pooling to reuse memory slots and Component Caching to skip redundant compilation, ensuring middleware execution remains near-zero cost.
Modifications
I updated `sgl-model-gateway/src/wasm/runtime.rs` to implement the following optimizations:

Instance Pooling:

- Integrated `wasmtime::PoolingAllocationConfig` into the worker loop (see the sketch after this list).
- Aligned the existing limits (`max_memory_size`, `max_component_instance_size`) with the new pooling strategy.

Component Caching:

- Added a `HashMap<Vec<u8>, Component>` within the `worker_loop`, so cache hits skip the expensive `Component::new` compilation.
- Bounded the cache size (`module_cache_size`) to prevent memory leaks.
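A minimal sketch of the pooling setup, assuming a recent wasmtime with the component model enabled; the numeric limits below are placeholders for illustration, not the values used in this PR:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

// Build an Engine with the pooling allocator so each request reuses a
// pre-allocated memory slot instead of paying for a fresh mmap.
fn build_engine() -> wasmtime::Result<Engine> {
    let mut pooling = PoolingAllocationConfig::default();
    pooling.total_memories(64);                    // concurrent linear memories
    pooling.max_memory_size(64 * 1024 * 1024);     // cap per linear memory
    pooling.total_component_instances(64);
    pooling.max_component_instance_size(1 << 20);  // cap per instance's metadata

    let mut config = Config::new();
    config.wasm_component_model(true);
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
    Engine::new(&config)
}
```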
Accuracy Tests
Benchmarking and Profiling
I performed a local micro-benchmark simulating 5000 sequential instantiations to isolate the impact of the Instance Pooling strategy.
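For reference, a hedged sketch of the kind of loop such a benchmark uses; the 5000-iteration count comes from the description above, while `bench_instantiation` and the pre-loaded component bytes are assumptions:

```rust
use std::time::Instant;

use wasmtime::component::{Component, Linker};
use wasmtime::{Engine, Store};

// Hypothetical micro-benchmark: instantiate the same compiled component
// 5000 times sequentially, isolating instantiation cost from compilation.
fn bench_instantiation(engine: &Engine, wasm_bytes: &[u8]) -> wasmtime::Result<()> {
    // Compiled once up front; the loop measures instantiation only.
    let component = Component::new(engine, wasm_bytes)?;
    let linker: Linker<()> = Linker::new(engine);

    let start = Instant::now();
    for _ in 0..5000 {
        let mut store = Store::new(engine, ());
        let _instance = linker.instantiate(&mut store, &component)?;
        // Dropping the Store returns its memory slot to the pool.
    }
    println!("5000 instantiations: {:?}", start.elapsed());
    Ok(())
}
```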
Benchmark Configuration:
Local Results:
Production Impact:
While the micro-benchmark demonstrates a 1.6x speedup on instantiation alone, the effective production speedup is expected to be **significantly higher**. This is because the micro-benchmark isolates allocation; in production, this PR also enables Component Caching, eliminating the JIT compilation step (typically 10ms–100ms per request) that previously occurred inside the hot loop.
Checklist