A high-performance, cost-effective AWS Lambda function written in Rust that generates text embeddings using EmbeddingGemma.
This repository was created as a companion project for the article *EmbeddingGemma Inference on AWS Lambda: Rust, Quantization, and Graviton Performance*.
This project provides a serverless REST API for generating semantic text embeddings. By running inference directly inside AWS Lambda using ONNX Runtime, it offers a private and predictable alternative to external embedding APIs such as OpenAI or Amazon Bedrock.
- Blazing Fast: Leverages Rust and ARM64 (Graviton2) optimizations for low latency and fast cold starts.
- Matryoshka Representation Learning: Supports variable embedding dimensions (128, 256, 512, or 768) without retraining (see the truncation sketch below).
- Cost-Effective: Billed per request, not per token. Significant savings for long-form content.
- Privacy-First: Data never leaves your AWS VPC.
- ONNX Optimized: Uses `ort` (ONNX Runtime) with vectorized mean pooling for efficient inference (see the pooling sketch below).
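Matryoshka truncation is simply keeping the leading `size` dimensions of the full 768-dim vector and re-normalizing it. A minimal sketch of the idea (the helper name is illustrative, not from this repo):

```rust
/// Keep the first `size` dimensions (128, 256, 512, or 768) of a full
/// 768-dim embedding and re-normalize to unit length, so cosine
/// similarity still behaves on the shortened vector.
fn truncate_embedding(full: &[f32], size: usize) -> Vec<f32> {
    let mut v = full[..size].to_vec();
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```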
- Runtime: Rust 1.92+ on AWS Lambda (ARM64).
- Core Engine: `ort` (ONNX Runtime) bindings.
- Model: EmbeddingGemma (quantized).
- Format: `provided.al2023` custom runtime (ARM64) via Docker container.
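The vectorized mean pooling mentioned above is a masked average of the model's last hidden state over real (non-padding) tokens. A plain-Rust sketch under that assumption; the repository's actual implementation lives in `src/embedder.rs` and may differ:

```rust
/// Masked mean pooling: average the token vectors whose attention-mask
/// entry is 1. `hidden` is the last hidden state flattened row-major to
/// [seq_len, dim]; `mask` has one entry per token.
fn mean_pool(hidden: &[f32], mask: &[i64], dim: usize) -> Vec<f32> {
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (t, &m) in mask.iter().enumerate() {
        if m == 1 {
            count += 1.0;
            let row = &hidden[t * dim..(t + 1) * dim];
            for (p, h) in pooled.iter_mut().zip(row) {
                *p += *h;
            }
        }
    }
    if count > 0.0 {
        for p in &mut pooled {
            *p /= count;
        }
    }
    pooled
}
```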
```
embedding-lambda/
├── src/
│   ├── main.rs            # Lambda entry point and lifecycle management
│   ├── embedder.rs        # ML inference logic and pooling
│   └── http_handler.rs    # API request/response processing
├── scripts/               # Deployment and utility scripts
├── model/                 # model_quantized.onnx and tokenizer.json
└── Dockerfile             # Multi-stage ARM64 build configuration
```
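For orientation, `src/main.rs` can be read as the standard `lambda_http` wiring: initialize once, then serve requests from an async handler. The following is a minimal sketch assuming the `lambda_http` and `tokio` crates, not the repository's exact code (the response body here is a placeholder):

```rust
use lambda_http::{run, service_fn, Body, Error, Request, Response};

async fn handler(_req: Request) -> Result<Response<Body>, Error> {
    // The real handler parses {"text", "size"}, tokenizes, runs ONNX
    // inference, mean-pools, truncates to `size`, and returns JSON.
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(Body::from(r#"{"embedding": []}"#))?)
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    run(service_fn(handler)).await
}
```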
Before running or deploying, you must download the model and tokenizer files into the `model/` directory:
- Download from Hugging Face:
  Visit [onnx-community/embeddinggemma-300m-ONNX](https://huggingface.co/onnx-community/embeddinggemma-300m-ONNX) and download the following files:
  - `onnx/model_quantized.onnx` (save as `model/model_quantized.onnx`)
  - `onnx/model_quantized.onnx_data` (save as `model/model_quantized.onnx_data`)
  - `tokenizer.json` (save as `model/tokenizer.json`)
Alternatively, use `huggingface-cli`:

```sh
mkdir -p model
huggingface-cli download onnx-community/embeddinggemma-300m-ONNX onnx/model_quantized.onnx --local-dir model --local-dir-use-symlinks False
huggingface-cli download onnx-community/embeddinggemma-300m-ONNX onnx/model_quantized.onnx_data --local-dir model --local-dir-use-symlinks False
huggingface-cli download onnx-community/embeddinggemma-300m-ONNX tokenizer.json --local-dir model --local-dir-use-symlinks False
mv model/onnx/model_quantized.onnx model/ && mv model/onnx/model_quantized.onnx_data model/ && rm -rf model/onnx
```
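With the files in place, loading them from Rust is a few lines. A rough sketch assuming the `ort` 2.x and `tokenizers` crates (paths match the layout above; not verbatim from `embedder.rs`):

```rust
use ort::session::Session;
use tokenizers::Tokenizer;

fn load_model() -> Result<(Session, Tokenizer), Box<dyn std::error::Error>> {
    // ONNX Runtime resolves the external-weights file
    // (model_quantized.onnx_data) relative to the .onnx file's directory.
    let session = Session::builder()?.commit_from_file("model/model_quantized.onnx")?;
    let tokenizer = Tokenizer::from_file("model/tokenizer.json")?;
    Ok((session, tokenizer))
}
```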
- Start the local Lambda server:

  ```sh
  cargo lambda watch
  ```

- Invoke the function:

  ```sh
  cargo lambda invoke --data-file scripts/test_data_short.json
  ```

- Direct HTTP call:

  ```sh
  curl -X POST http://localhost:9000 \
    -H "Content-Type: application/json" \
    -d '{"text": "Rust is amazing", "size": 256}'
  ```
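Returned embeddings are typically compared with cosine similarity; if the vectors are L2-normalized (as in the truncation sketch above), this reduces to a dot product. A small helper for the general case (name illustrative):

```rust
/// Cosine similarity between two embeddings of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```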
For a complete setup from scratch (IAM roles, ECR, and Lambda function), run the consolidated install script:
```sh
./scripts/install.sh
```

To update existing code and infrastructure:

```sh
./scripts/deploy_function.sh
```

This project is licensed under the MIT License.