[model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching #15515
Conversation
Code Review
This pull request introduces significant performance optimizations to the WASM runtime by implementing instance pooling and component caching. These are excellent improvements that will reduce per-request overhead and latency. The code is well-structured and the changes are clearly explained. I have a couple of suggestions to further enhance the implementation: one is a minor optimization to avoid an unnecessary data clone during caching, and the other is a recommendation for a more robust cache eviction strategy to handle high-load scenarios more gracefully. Overall, this is a very valuable contribution.
```rust
})?;

// Insert into cache
cache.insert(wasm_bytes.clone(), comp.clone());
```
The wasm_bytes vector is being cloned here, which can be inefficient if the WASM component is large. Since wasm_bytes is an owned value in this function and is not used after this point, you can avoid this clone by moving it directly into the cache. This will prevent an unnecessary memory allocation and copy operation.
Suggested change:

```diff
- cache.insert(wasm_bytes.clone(), comp.clone());
+ cache.insert(wasm_bytes, comp.clone());
```
```rust
if cache.len() >= config.module_cache_size {
    debug!(
        target: "sgl_model_gateway::wasm::runtime",
        "Module cache full ({} items), clearing.",
        cache.len()
    );
    cache.clear();
}
```
The current cache eviction strategy of clearing the entire cache (cache.clear()) is simple, but it can be inefficient under certain workloads. If the set of actively used WASM modules is larger than module_cache_size, this could lead to cache thrashing, where modules are repeatedly recompiled, negating some of the benefits of caching.
For a more robust solution, consider using a Least Recently Used (LRU) eviction policy. This would ensure that the most frequently accessed components are retained. You could achieve this by using a crate like lru, which provides lru::LruCache as a near drop-in replacement for HashMap.
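A minimal sketch of that approach, assuming the cache lives in the worker loop and `module_cache_size` comes from the existing runtime config; the helper names (`new_component_cache`, `get_or_compile`) are hypothetical, not part of this PR:

```rust
use std::num::NonZeroUsize;

use lru::LruCache;
use wasmtime::component::Component;
use wasmtime::Engine;

// Hypothetical constructor: `module_cache_size` mirrors the config field
// referenced in this PR.
fn new_component_cache(module_cache_size: usize) -> LruCache<Vec<u8>, Component> {
    LruCache::new(NonZeroUsize::new(module_cache_size).expect("cache size must be > 0"))
}

// Look up a compiled component by its module bytes, compiling on a miss.
fn get_or_compile(
    engine: &Engine,
    cache: &mut LruCache<Vec<u8>, Component>,
    wasm_bytes: Vec<u8>,
) -> wasmtime::Result<Component> {
    if let Some(comp) = cache.get(&wasm_bytes) {
        // Hit: `get` also marks the entry as most recently used.
        return Ok(comp.clone());
    }
    let comp = Component::new(engine, &wasm_bytes)?;
    // `put` evicts only the least recently used entry once the cache is
    // full, rather than clearing everything.
    cache.put(wasm_bytes, comp.clone());
    Ok(comp)
}
```

With this shape, a working set slightly larger than `module_cache_size` degrades gracefully instead of triggering a full recompile of every module.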
Motivation
I identified a per-request overhead in the current WASM middleware implementation within `sgl-model-gateway`, which acts as a bottleneck for high-throughput serving. The two primary performance issues addressed in this PR are:

- Allocating a fresh `wasmtime::Store` and linear memory (via `mmap`) for every single request.
- Recompiling the WASM component inside the hot loop on every request.

These operations add milliseconds of latency to every request. This PR introduces Instance Pooling to reuse memory slots and Component Caching to skip redundant compilation, ensuring middleware execution remains near-zero cost.
Modifications
I updated `sgl-model-gateway/src/wasm/runtime.rs` to implement the following optimizations:

Instance Pooling:

- Integrated `wasmtime::PoolingAllocationConfig` into the worker loop (see the sketch after this list).
- Aligned the existing limits (`max_memory_size`, `max_component_instance_size`) with the new pooling strategy.

Component Caching:

- Added a `HashMap<Vec<u8>, Component>` within the `worker_loop`, so cache hits skip the expensive `Component::new` compilation.
- Bounded the cache size (`module_cache_size`) to prevent memory leaks.
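A minimal sketch of the pooling setup, assuming a recent wasmtime with the component model enabled; the numeric limits below are placeholders for illustration, not the values used in this PR:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

// Build an Engine with the pooling allocator so each request reuses a
// pre-allocated memory slot instead of paying for a fresh mmap.
fn build_engine() -> wasmtime::Result<Engine> {
    let mut pooling = PoolingAllocationConfig::default();
    pooling.total_memories(64);                    // concurrent linear memories
    pooling.max_memory_size(64 * 1024 * 1024);     // cap per linear memory
    pooling.total_component_instances(64);
    pooling.max_component_instance_size(1 << 20);  // cap per instance's metadata

    let mut config = Config::new();
    config.wasm_component_model(true);
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
    Engine::new(&config)
}
```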
Accuracy Tests
Benchmarking and Profiling
I performed a local micro-benchmark simulating 5000 sequential instantiations to isolate the impact of the Instance Pooling strategy.
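For reference, a hedged sketch of the kind of loop such a benchmark uses; the 5000-iteration count comes from the description above, while `bench_instantiation` and the pre-loaded component bytes are assumptions:

```rust
use std::time::Instant;

use wasmtime::component::{Component, Linker};
use wasmtime::{Engine, Store};

// Hypothetical micro-benchmark: instantiate the same compiled component
// 5000 times sequentially, isolating instantiation cost from compilation.
fn bench_instantiation(engine: &Engine, wasm_bytes: &[u8]) -> wasmtime::Result<()> {
    // Compiled once up front; the loop measures instantiation only.
    let component = Component::new(engine, wasm_bytes)?;
    let linker: Linker<()> = Linker::new(engine);

    let start = Instant::now();
    for _ in 0..5000 {
        let mut store = Store::new(engine, ());
        let _instance = linker.instantiate(&mut store, &component)?;
        // Dropping the Store returns its memory slot to the pool.
    }
    println!("5000 instantiations: {:?}", start.elapsed());
    Ok(())
}
```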
Benchmark Configuration:
Local Results:
Production Impact:
While the micro-benchmark demonstrates a 1.6x speedup on instantiation alone, the effective production speedup is expected to be **significantly higher**. This is because the micro-benchmark isolates allocation; in production, this PR also enables Component Caching, eliminating the JIT compilation step (typically 10ms–100ms per request) that previously occurred inside the hot loop.
Checklist