
Backend.AI Examples - Multi-Serving

This repository contains examples and tools for multi-model serving on Backend.AI, demonstrating how to deploy, test, and compare multiple vision-language models (VLMs) simultaneously.

Overview

This repository showcases a complete multi-model serving setup with:

  1. Two Vision-Language Models (Qwen3-VL) in different sizes
  2. Interactive Testing Interface (Gradio-based) for comparing and testing multiple models
  3. Real-world deployment patterns including primary/fallback configurations

Use Cases

  • Model Comparison: Compare responses from different model sizes side-by-side
  • Fallback Testing: Test automatic failover when primary model becomes unavailable
  • Performance Benchmarking: Measure and compare latency across different models
  • Multi-model Orchestration: Learn patterns for serving multiple models with different characteristics

Repository Structure

backend.ai-examples-Multi-serving/
├── models/
│   ├── qwen-3-vl-8b-instruct-fp8/     # 8B parameter VLM (Primary)
│   ├── qwen-3-vl-4b-instruct-fp8/     # 4B parameter VLM (Fallback)
│   └── multi-model-modal-test/        # Testing interface
│       ├── app.py                     # Main Gradio application
│       ├── utils.py                   # Shared utilities
│       ├── tab_compare.py             # Tab 1: Side-by-side comparison
│       ├── tab_fallback.py            # Tab 2: Fallback mechanism test
│       ├── tab_individual.py          # Tab 3: Individual model testing
│       ├── requirements.txt           # Python dependencies
│       └── model-definition.yml       # Backend.AI model definition
└── data/                              # Sample images for testing

Components

1. Qwen3-VL-8B-Instruct-FP8 (Primary Model)

Location: models/qwen-3-vl-8b-instruct-fp8/

A larger, more capable vision-language model optimized for multimodal instruction following.

  • Parameters: 8 Billion
  • Quantization: FP8 (8-bit floating point)
  • Context Length: 8192 tokens
  • Capabilities: Text + Image understanding, detailed reasoning
  • Use Case: Primary model for high-quality responses

View Setup Instructions

2. Qwen3-VL-4B-Instruct-FP8 (Fallback Model)

Location: models/qwen-3-vl-4b-instruct-fp8/

A smaller, faster vision-language model for efficient inference.

  • Parameters: 4 Billion
  • Quantization: FP8 (8-bit floating point)
  • Context Length: 4096 tokens
  • Capabilities: Text + Image understanding, faster responses
  • Use Case: Fallback model or cost-effective alternative

View Setup Instructions

3. Multi-Model Testing Interface

Location: models/multi-model-modal-test/

A comprehensive Gradio web interface for testing and comparing multiple models with three specialized testing modes.

Features:

  • Compare Tab: Send identical input to both models and compare responses side-by-side
  • Fallback Tab: Test automatic failover when primary model fails
  • Individual Tab: Send different inputs to each model independently
  • Parallel Execution: Models run concurrently for faster results
  • Performance Metrics: Real-time latency measurement and comparison
  • Flexible Configuration: Environment variables or UI-based setup

Key Capabilities:

  • Text + Image multimodal inputs
  • Configurable model parameters (max_tokens, temperature, timeout)
  • Visual performance indicators (faster/slower annotations)
  • Error handling and detailed reporting

View Detailed Documentation

Quick Start

Prerequisites

  • Backend.AI account with access to model serving
  • Python 3.8+
  • Sufficient storage for model weights (~8GB per model)

Setup Steps

1. Download Models

Follow the setup instructions for each model:

# For 8B model (Primary)
cd models/qwen-3-vl-8b-instruct-fp8
# Follow README.md instructions

# For 4B model (Fallback)
cd models/qwen-3-vl-4b-instruct-fp8
# Follow README.md instructions

Both models need to be:

  1. Created as model folders in Backend.AI
  2. Downloaded via batch sessions
  3. Deployed as model services with OpenAI-compatible endpoints

2. Deploy Models on Backend.AI

Deploy each model as a service to get their endpoint URLs:

  • Deploy qwen3-vl-8b-instruct-fp8 as primary service
  • Deploy qwen3-vl-4b-instruct-fp8 as fallback service
  • Note each endpoint URL; these become PRIMARY_BASE_URL and FALLBACK_BASE_URL below

3. Run Testing Interface

cd models/multi-model-modal-test

# Install dependencies
pip install -r requirements.txt

# Option A: Use environment variables
export PRIMARY_BASE_URL="<PRIMARY_BASE_URL>"
export PRIMARY_MODEL="<PRIMARY_MODEL_NAME>"
export PRIMARY_API_KEY="your-api-key"

export FALLBACK_BASE_URL="<FALLBACK_BASE_URL>"
export FALLBACK_MODEL="<FALLBACK_MODEL_NAME>"
export FALLBACK_API_KEY="your-api-key"

python app.py

# Option B: Configure via UI
# Simply run without env vars and configure in the web interface
python app.py

The interface will be available at http://localhost:7860

Architecture Patterns

Primary-Fallback Pattern

This repository demonstrates a common production pattern:

User Request
    ↓
Primary Model (8B) ← Try first (higher quality)
    ↓ (on failure)
Fallback Model (4B) ← Use if primary fails (availability)

Benefits:

  • High quality when primary is available
  • Graceful degradation on primary failure
  • Cost optimization (fallback is smaller/cheaper)
  • Improved overall reliability
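
As a rough illustration, here is a minimal Python sketch of the pattern (not the code used by the testing interface), assuming OpenAI-compatible endpoints and the environment variable names from the Configuration section:

import os
import requests

def chat_with_fallback(messages, max_tokens=1024, temperature=0.7, timeout=60):
    """Try the primary endpoint first; on any failure, retry against the fallback."""
    endpoints = [
        ("primary", os.environ["PRIMARY_BASE_URL"], os.environ["PRIMARY_MODEL"], os.environ["PRIMARY_API_KEY"]),
        ("fallback", os.environ["FALLBACK_BASE_URL"], os.environ["FALLBACK_MODEL"], os.environ["FALLBACK_API_KEY"]),
    ]
    last_error = None
    for name, base_url, model, api_key in endpoints:
        try:
            resp = requests.post(
                f"{base_url}/v1/chat/completions",
                headers={"Authorization": f"Bearer {api_key}"},
                json={"model": model, "messages": messages,
                      "max_tokens": max_tokens, "temperature": temperature},
                timeout=timeout,
            )
            resp.raise_for_status()
            return name, resp.json()["choices"][0]["message"]["content"]
        except Exception as exc:  # any failure (timeout, HTTP error) triggers the fallback
            last_error = exc
    raise RuntimeError(f"Both endpoints failed; last error: {last_error}")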

Parallel Comparison Pattern

The testing interface also demonstrates parallel execution:

User Request
    ↓
    ├─→ Primary Model (8B)  ─┐
    └─→ Fallback Model (4B) ─┤
                             ↓
                    Compare Results

Benefits:

  • Up to ~2x faster than sequential execution (bounded by the slower model)
  • Side-by-side quality comparison
  • Latency benchmarking
  • Model selection insights
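
The concurrent execution can be sketched with Python's standard concurrent.futures; this is a simplified illustration of the idea, not the exact code in tab_compare.py:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

def timed_query(name, base_url, model, api_key, messages, timeout=60):
    """Send one chat-completion request and return (name, response text, latency in seconds)."""
    start = time.perf_counter()
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": messages, "max_tokens": 1024},
        timeout=timeout,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return name, text, time.perf_counter() - start

def compare(messages, primary, fallback):
    """Run both endpoints concurrently; primary and fallback are (name, url, model, key) tuples."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(timed_query, *cfg, messages) for cfg in (primary, fallback)]
        return [f.result() for f in futures]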

Configuration

Environment Variables

All configuration can be provided via environment variables or the web UI:

Variable             Description                Example
PRIMARY_BASE_URL     Primary model endpoint     <PRIMARY_BASE_URL>
PRIMARY_MODEL        Primary model name         <PRIMARY_MODEL_NAME>
PRIMARY_API_KEY      Primary API key            your-api-key
FALLBACK_BASE_URL    Fallback model endpoint    <FALLBACK_BASE_URL>
FALLBACK_MODEL       Fallback model name        <FALLBACK_MODEL_NAME>
FALLBACK_API_KEY     Fallback API key           your-api-key

Model Parameters

Adjustable per model via UI sliders (a small clamping sketch follows this list):

  • Max Tokens: Output length limit (Primary: up to 7680, Fallback: up to 3840)
  • Temperature: Sampling randomness (0.0 to 2.0)
  • Timeout: Request timeout in seconds (5 to 120)
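
As an illustration of how the caps are used, here is a hypothetical guard that mirrors the slider limits and keeps room in the context window for the prompt (see also the "max_tokens too large" entry under Troubleshooting); the limits below come from the UI, not from code in this repository:

# Hypothetical guard mirroring the documented slider caps.
MAX_OUTPUT_TOKENS = {"primary": 7680, "fallback": 3840}

def clamp_max_tokens(requested: int, role: str) -> int:
    """Cap the requested output length so the prompt still fits in the context window."""
    return max(1, min(requested, MAX_OUTPUT_TOKENS[role]))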

Testing Workflows

Workflow 1: Quality Comparison

Goal: Compare response quality between models

  1. Navigate to Compare tab
  2. Enter a prompt and optional image
  3. Click "Compare Models"
  4. Review side-by-side responses
  5. Note quality differences and latency

Best for: Evaluating which model better suits your use case

Workflow 2: Reliability Testing

Goal: Verify fallback mechanism works

  1. Navigate to Fallback tab
  2. Test with primary working (normal operation)
  3. Stop primary service or enter invalid URL
  4. Test again to see automatic fallback
  5. Verify status messages and error handling

Best for: Production readiness testing

Workflow 3: Independent Testing

Goal: Test models with different inputs

  1. Navigate to Individual tab
  2. Enter different prompts/images for each model
  3. Click "Test Both Models"
  4. Compare how each handles its specific input

Best for: Specialized testing or different use cases per model

Performance Characteristics

Based on typical deployments:

Model              Size        Latency (avg)   Quality   Use Case
Qwen3-VL-8B-FP8    8B params   ~2-4s           High      Detailed analysis, complex reasoning
Qwen3-VL-4B-FP8    4B params   ~1-2s           Good      Quick responses, simple tasks

Notes:

  • Actual latency depends on hardware, input size, and load
  • Parallel execution achieves ~2x speedup over sequential
  • FP8 quantization provides ~2x speedup vs FP16 with minimal quality loss

API Compatibility

Both models expose OpenAI-compatible endpoints:

POST {BASE_URL}/v1/chat/completions
Content-Type: application/json
Authorization: Bearer {API_KEY}

{
  "model": "Qwen/Qwen3-VL-8B-Instruct-FP8",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": 4096,
  "temperature": 0.7
}

This allows integration with any OpenAI-compatible client library.
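
For example, a minimal sketch using the official openai Python package (the endpoint URL, API key, and image path below are placeholders; any OpenAI-compatible client works the same way):

import base64
from openai import OpenAI

# Placeholders: substitute the deployed service's endpoint URL and API key.
client = OpenAI(base_url="<PRIMARY_BASE_URL>/v1", api_key="your-api-key")

# Encode a local test image (e.g., one of the samples under data/) as base64.
with open("data/sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)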

Troubleshooting

Common Issues

Issue: "ERROR: base_url is empty"

  • Solution: Set environment variables or configure via Global Configuration in UI

Issue: "400 Bad Request: max_tokens too large"

  • Solution: Reduce max_tokens slider value (remember to reserve space for input)

Issue: Models respond slowly or time out

  • Solution:
    • Increase timeout slider
    • Check Backend.AI service status
    • Verify network connectivity
    • Consider using smaller model (4B) for faster responses

Issue: Image upload fails or produces errors

  • Solution:
    • Ensure image is in supported format (JPEG, PNG)
    • Check image file size (large images auto-resize to 1280px)
    • Verify model endpoint is accessible

Issue: Fallback never triggers

  • Solution: Check that primary URL is actually failing (try invalid URL to test)

Debug Mode

Enable detailed logging:

# In utils.py, enable debug-level logging
import logging
logging.basicConfig(level=logging.DEBUG)

Development

Project Structure

The codebase is modular for easy maintenance:

  • app.py: Main entry point, orchestrates UI and global configuration
  • utils.py: Shared functions (API calls, image processing, configuration)
  • tab_*.py: Individual tab implementations with isolated logic

Adding New Tabs

To add a new testing mode:

  1. Create tab_newmode.py in models/multi-model-modal-test/
  2. Define main function and create_newmode_tab() function
  3. Import and add to app.py:
from tab_newmode import create_newmode_tab

with gr.Tab("New Mode"):
    create_newmode_tab(global_primary_url, ...)
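
A hypothetical skeleton for the new module (the function and argument names follow the convention implied above; the exact components passed from app.py depend on your global configuration):

# tab_newmode.py -- hypothetical skeleton for a new testing mode
import gradio as gr

def run_newmode(prompt, primary_url):
    # Placeholder logic: call the model(s) here, e.g. via the shared helpers in utils.py.
    return f"Would query {primary_url} with: {prompt}"

def create_newmode_tab(global_primary_url, *_):
    # Assumes global_primary_url is a Gradio component created by app.py's global configuration.
    prompt = gr.Textbox(label="Prompt")
    output = gr.Textbox(label="Response")
    run = gr.Button("Run")
    run.click(run_newmode, inputs=[prompt, global_primary_url], outputs=output)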

Customization

Common customizations:

  • Add more models: Extend global configuration with additional URLs
  • Custom metrics: Modify call_model() to track additional data (see the sketch after this list)
  • Different layouts: Edit tab files to change UI structure
  • Export results: Add CSV/JSON export functionality in tabs
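
For the custom-metrics case, a hedged sketch of wrapping a model-calling helper to record latency; the call_model signature is assumed here, so adapt it to the actual helper in utils.py:

import time
from functools import wraps

def with_metrics(call_model):
    """Wrap a model-calling helper so every call also reports its latency and output size."""
    @wraps(call_model)
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        result = call_model(*args, **kwargs)
        latency = time.perf_counter() - start
        # Assumed: the helper returns the response text; extend for richer return types.
        print(f"latency={latency:.2f}s, chars={len(str(result))}")
        return result
    return wrapped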

Contributing

When contributing to this repository:

  1. Follow the existing code structure
  2. Update relevant README files
  3. Test all three tabs thoroughly
  4. Verify both environment variable and UI configuration paths
  5. Ensure parallel execution works correctly

License

This repository contains example code for Backend.AI platform usage. Check individual model licenses:

  • Qwen models: Apache 2.0 License (Alibaba Cloud)
  • Gradio: Apache 2.0 License

Resources

Support

For issues related to:

  • Backend.AI platform: Contact Backend.AI support
  • Models: Refer to Hugging Face model pages
  • This repository: Open an issue or refer to individual component READMEs
