This repository contains examples and tools for multi-model serving on Backend.AI, demonstrating how to deploy, test, and compare multiple vision-language models (VLMs) simultaneously.
This repository showcases a complete multi-model serving setup with:
- Two Vision-Language Models (Qwen3-VL) in different sizes
- Interactive Testing Interface (Gradio-based) for comparing and testing multiple models
- Real-world deployment patterns including primary/fallback configurations
- Model Comparison: Compare responses from different model sizes side-by-side
- Fallback Testing: Test automatic failover when primary model becomes unavailable
- Performance Benchmarking: Measure and compare latency across different models
- Multi-model Orchestration: Learn patterns for serving multiple models with different characteristics
```
backend.ai-examples-Multi-serving/
├── models/
│   ├── qwen-3-vl-8b-instruct-fp8/   # 8B parameter VLM (Primary)
│   ├── qwen-3-vl-4b-instruct-fp8/   # 4B parameter VLM (Fallback)
│   └── multi-model-modal-test/      # Testing interface
│       ├── app.py                   # Main Gradio application
│       ├── utils.py                 # Shared utilities
│       ├── tab_compare.py           # Tab 1: Side-by-side comparison
│       ├── tab_fallback.py          # Tab 2: Fallback mechanism test
│       ├── tab_individual.py        # Tab 3: Individual model testing
│       ├── requirements.txt         # Python dependencies
│       └── model-definition.yml     # Backend.AI model definition
└── data/                            # Sample images for testing
```
Location: models/qwen-3-vl-8b-instruct-fp8/
A larger, more capable vision-language model optimized for multimodal instruction following.
- Parameters: 8 Billion
- Quantization: FP8 (8-bit floating point)
- Context Length: 8192 tokens
- Capabilities: Text + Image understanding, detailed reasoning
- Use Case: Primary model for high-quality responses
Location: models/qwen-3-vl-4b-instruct-fp8/
A smaller, faster vision-language model for efficient inference.
- Parameters: 4 Billion
- Quantization: FP8 (8-bit floating point)
- Context Length: 4096 tokens
- Capabilities: Text + Image understanding, faster responses
- Use Case: Fallback model or cost-effective alternative
Location: models/multi-model-modal-test/
A comprehensive Gradio web interface for testing and comparing multiple models with three specialized testing modes.
Features:
- Compare Tab: Send identical input to both models and compare responses side-by-side
- Fallback Tab: Test automatic failover when primary model fails
- Individual Tab: Send different inputs to each model independently
- Parallel Execution: Models run concurrently for faster results
- Performance Metrics: Real-time latency measurement and comparison
- Flexible Configuration: Environment variables or UI-based setup
Key Capabilities:
- Text + Image multimodal inputs
- Configurable model parameters (max_tokens, temperature, timeout)
- Visual performance indicators (faster/slower annotations)
- Error handling and detailed reporting
- Backend.AI account with access to model serving
- Python 3.8+
- Sufficient storage for model weights (~8GB per model)
Follow the setup instructions for each model:
```bash
# For 8B model (Primary)
cd models/qwen-3-vl-8b-instruct-fp8
# Follow README.md instructions

# For 4B model (Fallback)
cd models/qwen-3-vl-4b-instruct-fp8
# Follow README.md instructions
```

Both models need to be:
- Created as model folders in Backend.AI
- Downloaded via batch sessions
- Deployed as model services with OpenAI-compatible endpoints
Deploy each model as a service to get their endpoint URLs:
- Deploy `qwen3-vl-8b-instruct-fp8` as the primary service
- Deploy `qwen3-vl-4b-instruct-fp8` as the fallback service
- Note the endpoint URLs (e.g., `PRIMARY_BASE_URL`)
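Once a service is running, it can help to verify its endpoint before wiring up the testing interface. The snippet below is a minimal sketch using the `requests` package; it assumes the service exposes the standard OpenAI-compatible `/v1/models` route (vLLM-based deployments do), and the URL and key are placeholders for your own values.

```python
# Sketch: quick endpoint check for an OpenAI-compatible model service.
import requests

BASE_URL = "<PRIMARY_BASE_URL>"   # endpoint URL shown by Backend.AI (placeholder)
API_KEY = "your-api-key"          # placeholder

resp = requests.get(
    f"{BASE_URL}/v1/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should include the served model name
```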
```bash
cd models/multi-model-modal-test

# Install dependencies
pip install -r requirements.txt

# Option A: Use environment variables
export PRIMARY_BASE_URL="<PRIMARY_BASE_URL>"
export PRIMARY_MODEL="<PRIMARY_MODEL_NAME>"
export PRIMARY_API_KEY="your-api-key"
export FALLBACK_BASE_URL="<FALLBACK_BASE_URL>"
export FALLBACK_MODEL="<FALLBACK_MODEL_NAME>"
export FALLBACK_API_KEY="your-api-key"
python app.py

# Option B: Configure via UI
# Simply run without env vars and configure in the web interface
python app.py
```

The interface will be available at http://localhost:7860.
This repository demonstrates a common production pattern:
```
User Request
     ↓
Primary Model (8B)    ← Try first (higher quality)
     ↓ (on failure)
Fallback Model (4B)   ← Use if primary fails (availability)
```
Benefits:
- High quality when primary is available
- Graceful degradation on primary failure
- Cost optimization (fallback is smaller/cheaper)
- Improved overall reliability
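A minimal sketch of this pattern using the `openai` client library (any OpenAI-compatible client works). The environment-variable names match the configuration table below; the simple exception-based failover is illustrative rather than the exact logic in `tab_fallback.py`.

```python
# Sketch: try the primary endpoint first, fall back to the secondary on any failure.
import os
from openai import OpenAI

PRIMARY = dict(base_url=os.environ["PRIMARY_BASE_URL"], api_key=os.environ["PRIMARY_API_KEY"],
               model=os.environ["PRIMARY_MODEL"])
FALLBACK = dict(base_url=os.environ["FALLBACK_BASE_URL"], api_key=os.environ["FALLBACK_API_KEY"],
                model=os.environ["FALLBACK_MODEL"])

def chat_with_fallback(messages, max_tokens=1024, temperature=0.7):
    for cfg in (PRIMARY, FALLBACK):
        try:
            client = OpenAI(base_url=f"{cfg['base_url']}/v1", api_key=cfg["api_key"])
            resp = client.chat.completions.create(
                model=cfg["model"], messages=messages,
                max_tokens=max_tokens, temperature=temperature, timeout=30,
            )
            return cfg["model"], resp.choices[0].message.content
        except Exception as exc:   # primary failed: log it and move on to the fallback
            print(f"{cfg['model']} failed: {exc}")
    raise RuntimeError("Both primary and fallback models are unavailable")
```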
The testing interface also demonstrates parallel execution:
```
        User Request
             ↓
    ├─→ Primary Model (8B)  ─┐
    └─→ Fallback Model (4B) ─┤
                             ↓
                     Compare Results
```
Benefits:
- 2x faster than sequential execution
- Side-by-side quality comparison
- Latency benchmarking
- Model selection insights
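A sketch of the parallel pattern with `concurrent.futures`. Here `primary_fn` and `fallback_fn` are stand-ins for whatever single-model call you use (for example, the fallback sketch above without the retry loop); the per-call timing is what drives the faster/slower annotations in the UI.

```python
# Sketch: query both models concurrently and time each response.
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(name, chat_fn, messages):
    start = time.perf_counter()
    reply = chat_fn(messages)                  # blocking HTTP call to one model
    return name, reply, time.perf_counter() - start

def compare(messages, primary_fn, fallback_fn):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(timed_call, "primary (8B)", primary_fn, messages),
            pool.submit(timed_call, "fallback (4B)", fallback_fn, messages),
        ]
        results = [f.result() for f in futures]
    for name, reply, latency in results:
        print(f"{name}: {latency:.2f}s\n{reply}\n")
    return results
```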
All configuration can be provided via environment variables or the web UI:
| Variable | Description | Example |
|---|---|---|
| `PRIMARY_BASE_URL` | Primary model endpoint | `<PRIMARY_BASE_URL>` |
| `PRIMARY_MODEL` | Primary model name | `<PRIMARY_MODEL_NAME>` |
| `PRIMARY_API_KEY` | Primary API key | `your-api-key` |
| `FALLBACK_BASE_URL` | Fallback model endpoint | `<FALLBACK_BASE_URL>` |
| `FALLBACK_MODEL` | Fallback model name | `<FALLBACK_MODEL_NAME>` |
| `FALLBACK_API_KEY` | Fallback API key | `your-api-key` |
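One way this precedence can be handled in code is sketched below: environment variables supply defaults, and anything typed into the UI overrides them. The helper name is illustrative, not the exact implementation in `utils.py`.

```python
# Sketch: environment variables as defaults, UI fields as overrides.
import os
from typing import Optional

def resolve_config(ui_value: Optional[str], env_name: str, default: str = "") -> str:
    """Prefer the value typed into the UI, else the environment variable, else a default."""
    if ui_value:                      # a non-empty UI field wins
        return ui_value.strip()
    return os.getenv(env_name, default)

# Example: nothing entered in the UI, so PRIMARY_BASE_URL from the environment is used
primary_url = resolve_config(None, "PRIMARY_BASE_URL")
```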
Adjustable per model via UI sliders:
- Max Tokens: Output length limit (Primary: up to 7680, Fallback: up to 3840)
- Temperature: Sampling randomness (0.0 to 2.0)
- Timeout: Request timeout in seconds (5 to 120)
Goal: Compare response quality between models
- Navigate to Compare tab
- Enter a prompt and optional image
- Click "Compare Models"
- Review side-by-side responses
- Note quality differences and latency
Best for: Evaluating which model better suits your use case
Goal: Verify fallback mechanism works
- Navigate to Fallback tab
- Test with primary working (normal operation)
- Stop primary service or enter invalid URL
- Test again to see automatic fallback
- Verify status messages and error handling
Best for: Production readiness testing
Goal: Test models with different inputs
- Navigate to Individual tab
- Enter different prompts/images for each model
- Click "Test Both Models"
- Compare how each handles its specific input
Best for: Specialized testing or different use cases per model
Based on typical deployments:
| Model | Size | Latency (avg) | Quality | Use Case |
|---|---|---|---|---|
| Qwen3-VL-8B-FP8 | 8B params | ~2-4s | High | Detailed analysis, complex reasoning |
| Qwen3-VL-4B-FP8 | 4B params | ~1-2s | Good | Quick responses, simple tasks |
Notes:
- Actual latency depends on hardware, input size, and load
- Parallel execution achieves ~2x speedup over sequential
- FP8 quantization provides ~2x speedup vs FP16 with minimal quality loss
Both models expose OpenAI-compatible endpoints:
```
POST {BASE_URL}/v1/chat/completions
Content-Type: application/json
Authorization: Bearer {API_KEY}
```

```json
{
  "model": "Qwen/Qwen3-VL-8B-Instruct-FP8",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": 4096,
  "temperature": 0.7
}
```

This allows integration with any OpenAI-compatible client library.
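For example, with the official `openai` Python package the same request might look like the sketch below; the endpoint, API key, and image path are placeholders for your own values.

```python
# Sketch: calling the served model through the openai client library.
import base64
from openai import OpenAI

client = OpenAI(base_url="<PRIMARY_BASE_URL>/v1", api_key="your-api-key")

# Encode a local image as a base64 data URL (JPEG assumed; file name is illustrative)
with open("data/sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=1024,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```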
Issue: "ERROR: base_url is empty"
- Solution: Set environment variables or configure via Global Configuration in UI
Issue: "400 Bad Request: max_tokens too large"
- Solution: Reduce max_tokens slider value (remember to reserve space for input)
Issue: Models respond slowly or time out
- Solution:
- Increase timeout slider
- Check Backend.AI service status
- Verify network connectivity
- Consider using smaller model (4B) for faster responses
Issue: Image upload fails or produces errors
- Solution:
- Ensure image is in supported format (JPEG, PNG)
- Check image file size (large images auto-resize to 1280px; see the sketch below)
- Verify model endpoint is accessible
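The resize behavior mentioned above can be reproduced with Pillow. This is a sketch of the idea (longest side capped at 1280px before base64 encoding), not necessarily the exact code in `utils.py`.

```python
# Sketch: cap an image's longest side at 1280px before sending it to the model.
import base64
from io import BytesIO
from PIL import Image

def encode_image(path: str, max_side: int = 1280) -> str:
    img = Image.open(path).convert("RGB")
    if max(img.size) > max_side:
        img.thumbnail((max_side, max_side))   # resizes in place, preserving aspect ratio
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()
```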
Issue: Fallback never triggers
- Solution: Check that primary URL is actually failing (try invalid URL to test)
Enable detailed logging:
```python
# In utils.py, enable debug-level logging
import logging
logging.basicConfig(level=logging.DEBUG)
```

The codebase is modular for easy maintenance:
- app.py: Main entry point, orchestrates UI and global configuration
- utils.py: Shared functions (API calls, image processing, configuration)
- tab_*.py: Individual tab implementations with isolated logic
To add a new testing mode:
- Create `tab_newmode.py` in `models/multi-model-modal-test/`
- Define the main function and a `create_newmode_tab()` function
- Import and add it to `app.py`:

```python
from tab_newmode import create_newmode_tab

with gr.Tab("New Mode"):
    create_newmode_tab(global_primary_url, ...)
```
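As a rough skeleton, `tab_newmode.py` could look like the following; the function signature, widget layout, and argument names are assumptions for illustration, not the project's actual API.

```python
# tab_newmode.py -- skeleton for a new testing tab (names and layout are illustrative).
import gradio as gr

def run_newmode(prompt, base_url, model, api_key):
    # Call the model endpoint here, e.g. via the shared helpers in utils.py;
    # the argument list is an assumption -- adjust it to match the real call_model().
    return f"Would query {model} at {base_url} with: {prompt}"

def create_newmode_tab(global_primary_url, global_primary_model, global_primary_key):
    # The global_* arguments are the shared Gradio components passed in from app.py.
    prompt = gr.Textbox(label="Prompt", lines=3)
    output = gr.Textbox(label="Response", lines=8)
    run_btn = gr.Button("Run New Mode")
    run_btn.click(
        fn=run_newmode,
        inputs=[prompt, global_primary_url, global_primary_model, global_primary_key],
        outputs=output,
    )
```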
Common customizations:
- Add more models: Extend the global configuration with additional URLs
- Custom metrics: Modify `call_model()` to track additional data
- Different layouts: Edit tab files to change UI structure
- Export results: Add CSV/JSON export functionality in tabs (a small helper is sketched below)
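For instance, a small helper along these lines (names and fields are illustrative) could be wired to a file or download component in any tab:

```python
# Sketch: dump comparison results to CSV so they can be downloaded from a tab.
import csv

def export_results_csv(rows, path="comparison_results.csv"):
    """rows: list of dicts such as {"model": ..., "latency_s": ..., "response": ...}."""
    fieldnames = ["model", "latency_s", "response"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path  # hand this path to a Gradio file/download component
```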
When contributing to this repository:
- Follow the existing code structure
- Update relevant README files
- Test all three tabs thoroughly
- Verify both environment variable and UI configuration paths
- Ensure parallel execution works correctly
This repository contains example code for Backend.AI platform usage. Check individual model licenses:
- Qwen models: Apache 2.0 License (Alibaba Cloud)
- Gradio: Apache 2.0 License
For issues related to:
- Backend.AI platform: Contact Backend.AI support
- Models: Refer to Hugging Face model pages
- This repository: Open an issue or refer to individual component READMEs