This repository contains examples and tools for multi-model serving on Backend.AI, demonstrating how to deploy, test, and compare multiple vision-language models (VLMs) simultaneously.
This repository showcases a complete multi-model serving setup with:
- Two Vision-Language Models (Qwen3-VL) in different sizes
- Interactive Testing Interface (Gradio-based) for comparing and testing multiple models
- Real-world deployment patterns including primary/fallback configurations
- Model Comparison: Compare responses from different model sizes side-by-side
- Fallback Testing: Test automatic failover when primary model becomes unavailable
- Performance Benchmarking: Measure and compare latency across different models
- Multi-model Orchestration: Learn patterns for serving multiple models with different characteristics
```
backend.ai-examples-Multi-serving/
├── models/
│   ├── qwen-3-vl-8b-instruct-fp8/   # 8B parameter VLM (Primary)
│   ├── qwen-3-vl-4b-instruct-fp8/   # 4B parameter VLM (Fallback)
│   └── multi-model-modal-test/      # Testing interface
│       ├── app.py                   # Main Gradio application
│       ├── utils.py                 # Shared utilities
│       ├── tab_compare.py           # Tab 1: Side-by-side comparison
│       ├── tab_fallback.py          # Tab 2: Fallback mechanism test
│       ├── tab_individual.py        # Tab 3: Individual model testing
│       ├── requirements.txt         # Python dependencies
│       └── model-definition.yml     # Backend.AI model definition
└── data/                            # Sample images for testing
```
Location: models/qwen-3-vl-8b-instruct-fp8/
A larger, more capable vision-language model optimized for multimodal instruction following.
- Parameters: 8 Billion
- Quantization: FP8 (8-bit floating point)
- Context Length: 8192 tokens
- Capabilities: Text + Image understanding, detailed reasoning
- Use Case: Primary model for high-quality responses
Location: models/qwen-3-vl-4b-instruct-fp8/
A smaller, faster vision-language model for efficient inference.
- Parameters: 4 Billion
- Quantization: FP8 (8-bit floating point)
- Context Length: 4096 tokens
- Capabilities: Text + Image understanding, faster responses
- Use Case: Fallback model or cost-effective alternative
Location: models/multi-model-modal-test/
A comprehensive Gradio web interface for testing and comparing multiple models with three specialized testing modes.
Features:
- Compare Tab: Send identical input to both models and compare responses side-by-side
- Fallback Tab: Test automatic failover when primary model fails
- Individual Tab: Send different inputs to each model independently
- Parallel Execution: Models run concurrently for faster results
- Performance Metrics: Real-time latency measurement and comparison
- Flexible Configuration: Environment variables or UI-based setup
Key Capabilities:
- Text + Image multimodal inputs
- Configurable model parameters (max_tokens, temperature, timeout)
- Visual performance indicators (faster/slower annotations)
- Error handling and detailed reporting
- Backend.AI account with access to model serving
- Python 3.8+
- Sufficient storage for model weights (~8GB per model)
Follow the setup instructions for each model:
```bash
# For 8B model (Primary)
cd models/qwen-3-vl-8b-instruct-fp8
# Follow README.md instructions

# For 4B model (Fallback)
cd models/qwen-3-vl-4b-instruct-fp8
# Follow README.md instructions
```

Both models need to be:
- Created as model folders in Backend.AI
- Downloaded via batch sessions
- Deployed as model services with OpenAI-compatible endpoints
Deploy each model as a service to get their endpoint URLs:
- Deploy `qwen3-vl-8b-instruct-fp8` as the primary service
- Deploy `qwen3-vl-4b-instruct-fp8` as the fallback service
- Note the endpoint URLs (e.g., `PRIMARY_BASE_URL`)
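Once a service is running, it can help to verify its endpoint before wiring up the testing interface. The snippet below is a minimal sketch using the `requests` package; it assumes the service exposes the standard OpenAI-compatible `/v1/models` route (vLLM-based deployments do), and the URL and key are placeholders for your own values.

```python
# Sketch: quick endpoint check for an OpenAI-compatible model service.
import requests

BASE_URL = "<PRIMARY_BASE_URL>"   # endpoint URL shown by Backend.AI (placeholder)
API_KEY = "your-api-key"          # placeholder

resp = requests.get(
    f"{BASE_URL}/v1/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should include the served model name
```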
```bash
cd models/multi-model-modal-test

# Install dependencies
pip install -r requirements.txt

# Option A: Use environment variables
export PRIMARY_BASE_URL="<PRIMARY_BASE_URL>"
export PRIMARY_MODEL="<PRIMARY_MODEL_NAME>"
export PRIMARY_API_KEY="your-api-key"
export FALLBACK_BASE_URL="<FALLBACK_BASE_URL>"
export FALLBACK_MODEL="<FALLBACK_MODEL_NAME>"
export FALLBACK_API_KEY="your-api-key"
python app.py

# Option B: Configure via UI
# Simply run without env vars and configure in the web interface
python app.py
```

The interface will be available at http://localhost:7860.
This repository demonstrates a common production pattern:
```
User Request
     ↓
Primary Model (8B)    ← Try first (higher quality)
     ↓ (on failure)
Fallback Model (4B)   ← Use if primary fails (availability)
```
Benefits:
- High quality when primary is available
- Graceful degradation on primary failure
- Cost optimization (fallback is smaller/cheaper)
- Improved overall reliability
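A minimal sketch of this pattern using the `openai` client library (any OpenAI-compatible client works). The environment-variable names match the configuration table below; the simple exception-based failover is illustrative rather than the exact logic in `tab_fallback.py`.

```python
# Sketch: try the primary endpoint first, fall back to the secondary on any failure.
import os
from openai import OpenAI

PRIMARY = dict(base_url=os.environ["PRIMARY_BASE_URL"], api_key=os.environ["PRIMARY_API_KEY"],
               model=os.environ["PRIMARY_MODEL"])
FALLBACK = dict(base_url=os.environ["FALLBACK_BASE_URL"], api_key=os.environ["FALLBACK_API_KEY"],
                model=os.environ["FALLBACK_MODEL"])

def chat_with_fallback(messages, max_tokens=1024, temperature=0.7):
    for cfg in (PRIMARY, FALLBACK):
        try:
            client = OpenAI(base_url=f"{cfg['base_url']}/v1", api_key=cfg["api_key"])
            resp = client.chat.completions.create(
                model=cfg["model"], messages=messages,
                max_tokens=max_tokens, temperature=temperature, timeout=30,
            )
            return cfg["model"], resp.choices[0].message.content
        except Exception as exc:   # primary failed: log it and move on to the fallback
            print(f"{cfg['model']} failed: {exc}")
    raise RuntimeError("Both primary and fallback models are unavailable")
```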
The testing interface also demonstrates parallel execution:
```
        User Request
             ↓
    ├─→ Primary Model (8B)  ─┐
    └─→ Fallback Model (4B) ─┤
                             ↓
                     Compare Results
```
Benefits:
- 2x faster than sequential execution
- Side-by-side quality comparison
- Latency benchmarking
- Model selection insights
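A sketch of the parallel pattern with `concurrent.futures`. Here `primary_fn` and `fallback_fn` are stand-ins for whatever single-model call you use (for example, the fallback sketch above without the retry loop); the per-call timing is what drives the faster/slower annotations in the UI.

```python
# Sketch: query both models concurrently and time each response.
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(name, chat_fn, messages):
    start = time.perf_counter()
    reply = chat_fn(messages)                  # blocking HTTP call to one model
    return name, reply, time.perf_counter() - start

def compare(messages, primary_fn, fallback_fn):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(timed_call, "primary (8B)", primary_fn, messages),
            pool.submit(timed_call, "fallback (4B)", fallback_fn, messages),
        ]
        results = [f.result() for f in futures]
    for name, reply, latency in results:
        print(f"{name}: {latency:.2f}s\n{reply}\n")
    return results
```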
All configuration can be provided via environment variables or the web UI:
| Variable | Description | Example |
|---|---|---|
| `PRIMARY_BASE_URL` | Primary model endpoint | `<PRIMARY_BASE_URL>` |
| `PRIMARY_MODEL` | Primary model name | `<PRIMARY_MODEL_NAME>` |
| `PRIMARY_API_KEY` | Primary API key | `your-api-key` |
| `FALLBACK_BASE_URL` | Fallback model endpoint | `<FALLBACK_BASE_URL>` |
| `FALLBACK_MODEL` | Fallback model name | `<FALLBACK_MODEL_NAME>` |
| `FALLBACK_API_KEY` | Fallback API key | `your-api-key` |
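One way this precedence can be handled in code is sketched below: environment variables supply defaults, and anything typed into the UI overrides them. The helper name is illustrative, not the exact implementation in `utils.py`.

```python
# Sketch: environment variables as defaults, UI fields as overrides.
import os
from typing import Optional

def resolve_config(ui_value: Optional[str], env_name: str, default: str = "") -> str:
    """Prefer the value typed into the UI, else the environment variable, else a default."""
    if ui_value:                      # a non-empty UI field wins
        return ui_value.strip()
    return os.getenv(env_name, default)

# Example: nothing entered in the UI, so PRIMARY_BASE_URL from the environment is used
primary_url = resolve_config(None, "PRIMARY_BASE_URL")
```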
Adjustable per model via UI sliders:
- Max Tokens: Output length limit (Primary: up to 7680, Fallback: up to 3840)
- Temperature: Sampling randomness (0.0 to 2.0)
- Timeout: Request timeout in seconds (5 to 120)
Goal: Compare response quality between models
- Navigate to Compare tab
- Enter a prompt and optional image
- Click "Compare Models"
- Review side-by-side responses
- Note quality differences and latency
Best for: Evaluating which model better suits your use case
Goal: Verify fallback mechanism works
- Navigate to Fallback tab
- Test with primary working (normal operation)
- Stop primary service or enter invalid URL
- Test again to see automatic fallback
- Verify status messages and error handling
Best for: Production readiness testing
Goal: Test models with different inputs
- Navigate to Individual tab
- Enter different prompts/images for each model
- Click "Test Both Models"
- Compare how each handles its specific input
Best for: Specialized testing or different use cases per model
Based on typical deployments:
| Model | Size | Latency (avg) | Quality | Use Case |
|---|---|---|---|---|
| Qwen3-VL-8B-FP8 | 8B params | ~2-4s | High | Detailed analysis, complex reasoning |
| Qwen3-VL-4B-FP8 | 4B params | ~1-2s | Good | Quick responses, simple tasks |
Notes:
- Actual latency depends on hardware, input size, and load
- Parallel execution achieves ~2x speedup over sequential
- FP8 quantization provides ~2x speedup vs FP16 with minimal quality loss
Both models expose OpenAI-compatible endpoints:
```
POST {BASE_URL}/v1/chat/completions
Content-Type: application/json
Authorization: Bearer {API_KEY}
```

```json
{
  "model": "Qwen/Qwen3-VL-8B-Instruct-FP8",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": 4096,
  "temperature": 0.7
}
```

This allows integration with any OpenAI-compatible client library.
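For example, with the official `openai` Python package the same request might look like the sketch below; the endpoint, API key, and image path are placeholders for your own values.

```python
# Sketch: calling the served model through the openai client library.
import base64
from openai import OpenAI

client = OpenAI(base_url="<PRIMARY_BASE_URL>/v1", api_key="your-api-key")

# Encode a local image as a base64 data URL (JPEG assumed; file name is illustrative)
with open("data/sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=1024,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```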
Issue: "ERROR: base_url is empty"
- Solution: Set environment variables or configure via Global Configuration in UI
Issue: "400 Bad Request: max_tokens too large"
- Solution: Reduce max_tokens slider value (remember to reserve space for input)
Issue: Models respond slowly or time out
- Solution:
- Increase timeout slider
- Check Backend.AI service status
- Verify network connectivity
- Consider using smaller model (4B) for faster responses
Issue: Image upload fails or produces errors
- Solution:
- Ensure image is in supported format (JPEG, PNG)
- Check image file size (large images auto-resize to 1280px; see the sketch below)
- Verify model endpoint is accessible
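The resize behavior mentioned above can be reproduced with Pillow. This is a sketch of the idea (longest side capped at 1280px before base64 encoding), not necessarily the exact code in `utils.py`.

```python
# Sketch: cap an image's longest side at 1280px before sending it to the model.
import base64
from io import BytesIO
from PIL import Image

def encode_image(path: str, max_side: int = 1280) -> str:
    img = Image.open(path).convert("RGB")
    if max(img.size) > max_side:
        img.thumbnail((max_side, max_side))   # resizes in place, preserving aspect ratio
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()
```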
Issue: Fallback never triggers
- Solution: Check that primary URL is actually failing (try invalid URL to test)
Enable detailed logging:
```python
# In utils.py, enable debug-level logging
import logging
logging.basicConfig(level=logging.DEBUG)
```

The codebase is modular for easy maintenance:
- app.py: Main entry point, orchestrates UI and global configuration
- utils.py: Shared functions (API calls, image processing, configuration)
- tab_*.py: Individual tab implementations with isolated logic
To add a new testing mode:
- Create `tab_newmode.py` in `models/multi-model-modal-test/`
- Define the main function and a `create_newmode_tab()` function
- Import and add it to `app.py`:

```python
from tab_newmode import create_newmode_tab

with gr.Tab("New Mode"):
    create_newmode_tab(global_primary_url, ...)
```
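As a rough skeleton, `tab_newmode.py` could look like the following; the function signature, widget layout, and argument names are assumptions for illustration, not the project's actual API.

```python
# tab_newmode.py -- skeleton for a new testing tab (names and layout are illustrative).
import gradio as gr

def run_newmode(prompt, base_url, model, api_key):
    # Call the model endpoint here, e.g. via the shared helpers in utils.py;
    # the argument list is an assumption -- adjust it to match the real call_model().
    return f"Would query {model} at {base_url} with: {prompt}"

def create_newmode_tab(global_primary_url, global_primary_model, global_primary_key):
    # The global_* arguments are the shared Gradio components passed in from app.py.
    prompt = gr.Textbox(label="Prompt", lines=3)
    output = gr.Textbox(label="Response", lines=8)
    run_btn = gr.Button("Run New Mode")
    run_btn.click(
        fn=run_newmode,
        inputs=[prompt, global_primary_url, global_primary_model, global_primary_key],
        outputs=output,
    )
```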
Common customizations:
- Add more models: Extend the global configuration with additional URLs
- Custom metrics: Modify `call_model()` to track additional data
- Different layouts: Edit tab files to change UI structure
- Export results: Add CSV/JSON export functionality in tabs (a small helper is sketched below)
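For instance, a small helper along these lines (names and fields are illustrative) could be wired to a file or download component in any tab:

```python
# Sketch: dump comparison results to CSV so they can be downloaded from a tab.
import csv

def export_results_csv(rows, path="comparison_results.csv"):
    """rows: list of dicts such as {"model": ..., "latency_s": ..., "response": ...}."""
    fieldnames = ["model", "latency_s", "response"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path  # hand this path to a Gradio file/download component
```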
When contributing to this repository:
- Follow the existing code structure
- Update relevant README files
- Test all three tabs thoroughly
- Verify both environment variable and UI configuration paths
- Ensure parallel execution works correctly
This repository contains example code for Backend.AI platform usage. Check individual model licenses:
- Qwen models: Apache 2.0 License (Alibaba Cloud)
- Gradio: Apache 2.0 License
For issues related to:
- Backend.AI platform: Contact Backend.AI support
- Models: Refer to Hugging Face model pages
- This repository: Open an issue or refer to individual component READMEs