@ppraneth
Contributor

Motivation

I identified a per-request overhead in the current WASM middleware implementation within sgl-model-gateway, which acts as a bottleneck for high-throughput serving.

The two primary performance issues addressed in this PR are:

  1. Memory Allocation Overhead: The runtime currently allocates a new wasmtime::Store and linear memory (via mmap) for every single request.
  2. Compilation Overhead: The WASM component is re-compiled from raw bytes (JIT) on every request inside the worker loop.

These operations add milliseconds of latency to every request. This PR introduces Instance Pooling to reuse pre-allocated memory slots and Component Caching to skip redundant compilation, so that per-request middleware overhead stays small.

Modifications

I updated sgl-model-gateway/src/wasm/runtime.rs to implement the following optimizations:

  1. Instance Pooling:

    • Integrated wasmtime::PoolingAllocationConfig into the worker loop.
    • The system now pre-allocates memory slots (configured to 20 per worker thread) to avoid expensive OS memory allocation calls during request processing.
    • Aligned memory limits (max_memory_size, max_component_instance_size) with the new pooling strategy.
  2. Component Caching:

    • Introduced a local HashMap<Vec<u8>, Component> within the worker_loop.
    • Implemented logic to check the cache for existing compiled components before triggering Component::new.
    • Added a simple eviction strategy (clearing the cache if the size exceeds module_cache_size) to prevent memory leaks.
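The caching steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in runtime.rs: `FakeComponent`, `ComponentCache`, and `get_or_compile` are hypothetical names, the compile step is a stand-in for `Component::new`, and `capacity` plays the role of `module_cache_size`.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled component. The real runtime would store a
/// wasmtime `Component` here (assumption made for this sketch).
#[derive(Clone, Debug, PartialEq)]
pub struct FakeComponent {
    pub byte_len: usize,
}

/// Per-worker cache keyed by the raw WASM bytes, with clear-on-full eviction.
pub struct ComponentCache {
    entries: HashMap<Vec<u8>, FakeComponent>,
    capacity: usize,
    /// Number of times the expensive compile step actually ran.
    pub compile_count: usize,
}

impl ComponentCache {
    pub fn new(capacity: usize) -> Self {
        Self { entries: HashMap::new(), capacity, compile_count: 0 }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns the cached component, compiling only on a cache miss.
    pub fn get_or_compile(&mut self, wasm_bytes: Vec<u8>) -> FakeComponent {
        if let Some(hit) = self.entries.get(&wasm_bytes) {
            return hit.clone();
        }
        // Miss: run the stand-in for the real compilation call.
        self.compile_count += 1;
        let comp = FakeComponent { byte_len: wasm_bytes.len() };
        // Simple eviction: drop every entry once the cache is full.
        if self.entries.len() >= self.capacity {
            self.entries.clear();
        }
        // The key is moved into the map, so the bytes are not copied.
        self.entries.insert(wasm_bytes, comp.clone());
        comp
    }
}
```

With a capacity of 2, requesting the same bytes twice compiles once; inserting a third distinct module clears the map first, so a later request for the first module compiles again.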

Accuracy Tests

Benchmarking and Profiling

I performed a local micro-benchmark simulating 5000 sequential instantiations to isolate the impact of the Instance Pooling strategy.

Benchmark Configuration:

  • Module: Simple WASM module requiring 1 Memory Page (64KB).
  • Iterations: 5000.
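For readers who want to reproduce this kind of measurement, a timing harness along the following lines would do. It is a sketch under assumptions: `time_iterations` is a hypothetical helper, not the harness used for the numbers in this PR, and the operation being timed is supplied by the caller.

```rust
use std::time::{Duration, Instant};

/// Runs `op` sequentially `iterations` times and returns
/// (total elapsed time, average time per iteration).
pub fn time_iterations<F: FnMut()>(iterations: u32, mut op: F) -> (Duration, Duration) {
    let start = Instant::now();
    for _ in 0..iterations {
        op();
    }
    let total = start.elapsed();
    // Guard against division by zero when `iterations` is 0.
    (total, total / iterations.max(1))
}
```

Passing a closure that instantiates the module once with each allocator, with `iterations` set to 5000, would yield the two totals to compare.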

Local Results:

| Metric | Standard Allocator (Baseline) | Pooled Allocator (Optimized) | Improvement |
|---|---|---|---|
| Total Time | 47.28 ms | 29.40 ms | 1.61x faster |
| Avg Latency | 9.455 µs | 5.880 µs | 38% lower |
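The derived figures can be checked from the two totals and the iteration count. The sketch below is only arithmetic; `Comparison` and `compare` are hypothetical names. Recomputing the baseline average from the rounded total gives roughly 9.456 µs, which differs from the reported 9.455 µs only in the last digit.

```rust
/// Figures derived from a before/after timing comparison.
pub struct Comparison {
    /// baseline_total / optimized_total
    pub speedup: f64,
    /// (1 - optimized_total / baseline_total) * 100
    pub latency_reduction_pct: f64,
    /// Average time per iteration in microseconds (baseline).
    pub baseline_avg_us: f64,
    /// Average time per iteration in microseconds (optimized).
    pub optimized_avg_us: f64,
}

pub fn compare(baseline_total_ms: f64, optimized_total_ms: f64, iterations: u32) -> Comparison {
    let n = f64::from(iterations);
    Comparison {
        speedup: baseline_total_ms / optimized_total_ms,
        latency_reduction_pct: (1.0 - optimized_total_ms / baseline_total_ms) * 100.0,
        // 1 ms = 1000 µs
        baseline_avg_us: baseline_total_ms * 1000.0 / n,
        optimized_avg_us: optimized_total_ms * 1000.0 / n,
    }
}
```

Calling `compare(47.28, 29.40, 5000)` reproduces the speedup of about 1.61x and the latency reduction of about 38%.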

Production Impact:
While the micro-benchmark demonstrates a 1.6x speedup on instantiation alone, the effective production speedup is expected to be **significantly higher**. The micro-benchmark isolates allocation; in production, this PR also enables Component Caching, which eliminates the JIT compilation step (typically 10ms–100ms per request) that previously ran inside the hot loop.

Checklist

@ppraneth requested a review from slin1237 as a code owner on December 20, 2025, 06:12

@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant performance optimizations to the WASM runtime by implementing instance pooling and component caching. These are excellent improvements that will reduce per-request overhead and latency. The code is well-structured and the changes are clearly explained. I have a couple of suggestions to further enhance the implementation: one is a minor optimization to avoid an unnecessary data clone during caching, and the other is a recommendation for a more robust cache eviction strategy to handle high-load scenarios more gracefully. Overall, this is a very valuable contribution.

})?;

// Insert into cache
cache.insert(wasm_bytes.clone(), comp.clone());

high

The wasm_bytes vector is being cloned here, which can be inefficient if the WASM component is large. Since wasm_bytes is an owned value in this function and is not used after this point, you can avoid this clone by moving it directly into the cache. This will prevent an unnecessary memory allocation and copy operation.

Suggested change
- cache.insert(wasm_bytes.clone(), comp.clone());
+ cache.insert(wasm_bytes, comp.clone());
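The difference between the two forms can be observed by comparing heap pointers. In the sketch below, `key_buffer_reused` is a hypothetical helper and the `&'static str` value stands in for the compiled component. Moving the `Vec<u8>` into the map keeps the original buffer, while cloning allocates a second one.

```rust
use std::collections::HashMap;

/// Inserts a small byte vector as a cache key, either by move or by clone,
/// and reports whether the stored key still points at the original buffer.
pub fn key_buffer_reused(by_move: bool) -> bool {
    // The four-byte WASM magic number, used here only as sample data.
    let wasm_bytes: Vec<u8> = vec![0x00, 0x61, 0x73, 0x6d];
    let original_ptr = wasm_bytes.as_ptr();
    let mut cache: HashMap<Vec<u8>, &'static str> = HashMap::new();

    if by_move {
        // Ownership moves into the map: no allocation, no copy of the bytes.
        cache.insert(wasm_bytes, "compiled");
    } else {
        // A clone allocates a new buffer and copies every byte into it.
        cache.insert(wasm_bytes.clone(), "compiled");
    }

    let (key, _) = cache.iter().next().expect("exactly one entry was inserted");
    key.as_ptr() == original_ptr
}
```

The move is only valid when `wasm_bytes` is not used again after the insert, which is the condition the review comment relies on.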

Comment on lines +354 to +361
if cache.len() >= config.module_cache_size {
    debug!(
        target: "sgl_model_gateway::wasm::runtime",
        "Module cache full ({} items), clearing.",
        cache.len()
    );
    cache.clear();
}

medium

The current cache eviction strategy of clearing the entire cache (cache.clear()) is simple, but it can be inefficient under certain workloads. If the set of actively used WASM modules is larger than module_cache_size, this could lead to cache thrashing, where modules are repeatedly recompiled, negating some of the benefits of caching.

For a more robust solution, consider using a Least Recently Used (LRU) eviction policy. When the cache is full, it evicts only the entry that has gone unused the longest, so the most recently used components are retained. You could achieve this by using a crate like lru, which provides lru::LruCache as a near drop-in replacement for HashMap.
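To illustrate the policy itself, here is a hand-rolled sketch using only the standard library. It is not the `lru` crate's API: `TinyLru` and its methods are hypothetical, and eviction scans all entries, which is acceptable only for small caches.

```rust
use std::collections::HashMap;

/// Minimal LRU cache: each entry carries the tick of its last access, and
/// the entry with the smallest tick is evicted when the cache is full.
pub struct TinyLru<V> {
    capacity: usize,
    tick: u64,
    map: HashMap<Vec<u8>, (u64, V)>,
}

impl<V: Clone> TinyLru<V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, tick: 0, map: HashMap::new() }
    }

    /// Looks up `key` and marks the entry as most recently used.
    pub fn get(&mut self, key: &[u8]) -> Option<V> {
        self.tick += 1;
        let tick = self.tick;
        self.map.get_mut(key).map(|entry| {
            entry.0 = tick;
            entry.1.clone()
        })
    }

    /// Inserts `key`, evicting the least recently used entry if needed.
    pub fn put(&mut self, key: Vec<u8>, value: V) {
        self.tick += 1;
        if !self.map.contains_key(&key) && self.map.len() >= self.capacity {
            // Ticks are unique, so this removes exactly one entry.
            let oldest = self.map.values().map(|(t, _)| *t).min();
            if let Some(oldest) = oldest {
                self.map.retain(|_, (t, _)| *t != oldest);
            }
        }
        self.map.insert(key, (self.tick, value));
    }

    pub fn contains(&self, key: &[u8]) -> bool {
        self.map.contains_key(key)
    }
}
```

With a capacity of 2, inserting modules 1 and 2, reading module 1, and then inserting module 3 evicts only module 2, whereas `cache.clear()` would have dropped both existing entries.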
