🎙️ StarConnect Voice Cloning Agent System

🚀 Cutting-Edge 2026 Multi-Agent Architecture

This system combines locally hosted Ollama LLMs with TTS models (VoxCPM, F5-TTS, XTTS) to clone voices: the LLMs plan tasks and analyze results, and the TTS models synthesize the audio.


📦 Available Agents

| Agent | File | Purpose |
|---|---|---|
| 🎯 Orchestrator | orchestrator.py | Master agent for pipeline coordination |
| 🎤 Zero-Shot Cloning | zero_shot_cloning.py | Instant voice cloning with VoxCPM/F5-TTS |
| 🔬 Quality Agent | quality_agent.py | Audio quality assessment & comparison |
| 🧠 Ensemble Agent | ensemble_agent.py | Multi-model synthesis & selection |
| ☁️ Colab Agent | colab_agent.py | Google Colab GPU orchestration |
| 🖥️ GCP Agent | gcp_agent.py | Google Cloud Platform deployment |
| 🚀 RunPod Agent | runpod_agent.py | RunPod GPU pod management |
| 🎮 NVIDIA API Agent | nvidia_api_agent.py | NVIDIA API integration |

🦙 Ollama Models Used

| Model | Size | Best For |
|---|---|---|
| qwen3:8b | 5.2 GB | Reasoning & Planning |
| deepseek-r1:7b | 4.7 GB | Code & Technical |
| qwen2.5:3b | 1.9 GB | Fast, Simple Tasks |
| phi3:14b | 7.9 GB | Best Quality |
| llama3.2:3b | 2.0 GB | Balanced |
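
One way to read the table above is as a routing map from task type to model. The sketch below is illustrative only: the task labels, `pick_model`, and `build_generate_request` are hypothetical names, not taken from the project. It assumes an Ollama server on the default local address and builds a request for Ollama's `/api/generate` endpoint without sending it.

```python
import json
import urllib.request

# Hypothetical task labels mapped to the models listed in the table above.
MODEL_FOR_TASK = {
    "planning": "qwen3:8b",
    "code": "deepseek-r1:7b",
    "simple": "qwen2.5:3b",
    "quality": "phi3:14b",
    "balanced": "llama3.2:3b",
}

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def pick_model(task: str) -> str:
    """Return the model for a task label, falling back to the balanced model."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK["balanced"])


def build_generate_request(task: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": pick_model(task), "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the request to `urllib.request.urlopen` should return a JSON body whose `response` field holds the generated text, provided the chosen model has already been pulled.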

🎯 Quick Start

1. Analyze Your Dataset

python starconnect.py analyze --dataset ./StarConnect

2. Create Voice Profile

python starconnect.py profile --dataset ./StarConnect --name MyVoice

3. Clone a Voice

python starconnect.py clone \
  --text "Bonjour, je suis votre assistant vocal." \
  --reference ./StarConnect/segment_001.wav \
  --ref-text "Bonjour, c'est Mani, producteur et manager."

4. Use Ensemble (Best Quality)

python starconnect.py ensemble \
  --text "Bonjour, je suis votre assistant vocal." \
  --reference ./StarConnect/segment_001.wav \
  --language fr

5. Assess Audio Quality

python starconnect.py assess --audio ./output.wav

6. Compare Original vs Cloned

python starconnect.py assess \
  --audio ./cloned.wav \
  --compare ./original.wav

🔧 CLI Commands

starconnect.py <command> [options]

Commands:
  analyze   - Analyze voice dataset with LLM
  select    - Select best reference samples
  profile   - Create voice profile
  clone     - Clone voice (single text)
  ensemble  - Use multi-model ensemble
  assess    - Assess audio quality
  batch     - Batch process multiple texts
  models    - List available TTS models
  ollama    - Check Ollama status
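
For readers who want to see how a command layout like this is typically wired up, here is a minimal `argparse` sketch. It is not the actual `starconnect.py`: it covers only the subcommands whose flags appear in this README, and which flags are required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="starconnect.py")
    sub = parser.add_subparsers(dest="command", required=True)

    analyze = sub.add_parser("analyze", help="Analyze voice dataset with LLM")
    analyze.add_argument("--dataset", required=True)

    profile = sub.add_parser("profile", help="Create voice profile")
    profile.add_argument("--dataset", required=True)
    profile.add_argument("--name", required=True)

    clone = sub.add_parser("clone", help="Clone voice (single text)")
    clone.add_argument("--text", required=True)
    clone.add_argument("--reference", required=True)
    clone.add_argument("--ref-text")  # transcript of the reference audio
    clone.add_argument("--emotion")   # e.g. "auto" or a named emotion

    ensemble = sub.add_parser("ensemble", help="Use multi-model ensemble")
    ensemble.add_argument("--text", required=True)
    ensemble.add_argument("--reference", required=True)
    ensemble.add_argument("--language")

    assess = sub.add_parser("assess", help="Assess audio quality")
    assess.add_argument("--audio", required=True)
    assess.add_argument("--compare")  # optional original to compare against

    return parser
```

With this layout, `build_parser().parse_args(["assess", "--audio", "./output.wav"])` yields a namespace with `command == "assess"`, which a dispatcher could then route to the matching agent.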

🏗️ Architecture

┌──────────────────────────────────────────────────────────┐
│                    USER REQUEST                          │
└─────────────────────┬────────────────────────────────────┘
                      │
                      ▼
┌──────────────────────────────────────────────────────────┐
│              🎯 ORCHESTRATOR AGENT                       │
│  ┌───────────────────────────────────────────────────┐   │
│  │  Ollama LLM (qwen3:8b)                            │   │
│  │  - Task planning                                  │   │
│  │  - Model selection                                │   │
│  │  - Quality reasoning                              │   │
│  └───────────────────────────────────────────────────┘   │
└─────────────────────┬────────────────────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ 🎤 Zero-Shot  │ │ 🧠 Ensemble   │ │ 🔬 Quality    │
│    Cloning    │ │    Agent      │ │    Agent      │
│               │ │               │ │               │
│ - VoxCPM      │ │ - VoxCPM      │ │ - SNR Analysis│
│ - F5-TTS      │ │ - F5-TTS      │ │ - Similarity  │
│ - Emotional   │ │ - XTTS        │ │ - LLM Report  │
└───────────────┘ └───────────────┘ └───────────────┘
        │             │             │
        └─────────────┼─────────────┘
                      ▼
┌──────────────────────────────────────────────────────────┐
│                   OUTPUT AUDIO                           │
│  - Cloned voice .wav                                     │
│  - Quality report                                        │
│  - LLM analysis                                          │
└──────────────────────────────────────────────────────────┘
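
The diagram does not say how the ensemble agent chooses among its models. A minimal sketch of one plausible approach follows, assuming each TTS backend can be wrapped as a callable that returns audio bytes and that a quality scorer returns a single number where higher is better. `Candidate`, `select_best`, and the callables are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Candidate:
    model: str    # name of the backend that produced the audio
    audio: bytes  # synthesized audio
    score: float  # quality score assigned by the scorer


def select_best(
    text: str,
    synthesizers: Dict[str, Callable[[str], bytes]],
    scorer: Callable[[bytes], float],
) -> Candidate:
    """Synthesize `text` with every backend and keep the highest-scoring result."""
    if not synthesizers:
        raise ValueError("at least one synthesizer is required")
    candidates = []
    for name, synthesize in synthesizers.items():
        audio = synthesize(text)
        candidates.append(Candidate(name, audio, scorer(audio)))
    # Assumption: a higher score means better quality.
    return max(candidates, key=lambda c: c.score)
```

In a real pipeline the callables would wrap VoxCPM, F5-TTS, and XTTS, and the scorer would be backed by the quality agent; the selection step itself stays this simple.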

📊 StarConnect Dataset

  • Total Segments: 703 audio files
  • Total Duration: ~64 minutes
  • Language: French
  • Speaker: Single speaker
  • Format: WAV + JSON transcriptions
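
Figures like these can be re-derived from the files themselves. The sketch below uses only the standard library and assumes a flat directory in which each `.wav` segment has a transcription in a `.json` file of the same name; that pairing is an assumption, since the README only states "WAV + JSON". `summarize_dataset` is a hypothetical helper.

```python
import wave
from pathlib import Path


def summarize_dataset(root: str) -> dict:
    """Count WAV segments, sum their duration, and count same-named JSON files."""
    segments = 0
    seconds = 0.0
    with_transcript = 0
    for wav_path in sorted(Path(root).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            # Duration of one file = number of frames / frames per second.
            seconds += wav.getnframes() / wav.getframerate()
        segments += 1
        if wav_path.with_suffix(".json").exists():
            with_transcript += 1
    return {
        "segments": segments,
        "minutes": seconds / 60.0,
        "with_transcript": with_transcript,
    }
```

Calling `summarize_dataset("./StarConnect")` returns the segment count, total minutes, and the number of segments that have a matching JSON file, which makes missing transcriptions easy to spot.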

🎭 Emotional TTS

Generate speech with automatic emotion detection:

python starconnect.py clone \
  --text "Je suis tellement content de te voir!" \
  --reference ./StarConnect/segment_001.wav \
  --emotion auto

Supported emotions:

  • neutral, happy, sad, angry, surprised
  • fearful, disgusted, professional, excited
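
The README does not describe how `--emotion auto` picks a label. One plausible design, sketched below, hands the text to a classifier (for example an LLM prompt) and accepts the answer only if it is in the supported list. `resolve_emotion` and the `classify` callable are hypothetical; only the emotion names come from the list above.

```python
from typing import Callable

SUPPORTED_EMOTIONS = (
    "neutral", "happy", "sad", "angry", "surprised",
    "fearful", "disgusted", "professional", "excited",
)


def resolve_emotion(requested: str, text: str, classify: Callable[[str], str]) -> str:
    """Return a supported emotion label for `text`.

    `classify` is any callable that maps text to a label; it is only
    consulted when `requested` is "auto".
    """
    label = requested.strip().lower()
    if label == "auto":
        guess = classify(text).strip().lower()
        # A classifier may answer outside the list; fall back to neutral.
        return guess if guess in SUPPORTED_EMOTIONS else "neutral"
    if label not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {requested!r}")
    return label
```

Falling back to `neutral` on an unexpected classifier answer keeps synthesis running, while an explicit but unsupported request fails loudly so that typos are caught early.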

🔄 Training on Colab

  1. Upload dataset to Google Drive
  2. Open notebooks/F5_TTS_Colab_Training.ipynb
  3. Run the polling cell
  4. Trigger training from your local machine:
python agents/colab_agent.py trigger \
  --dataset /content/drive/MyDrive/f5_tts_datasets/starconnect_f5tts \
  --epochs 200

📈 Quality Metrics

The quality agent calculates:

  • SNR (Signal-to-Noise Ratio)
  • Dynamic Range
  • Clipping Detection
  • Silence Ratio
  • Intelligibility Score
  • Naturalness Score
  • Similarity Score (vs reference)
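
To make a few of these metrics concrete, here is a small pure-Python sketch using their textbook definitions. It is not the quality agent's code: it assumes samples are floats in [-1.0, 1.0], that a noise-only segment is available for the SNR estimate, and that similarity is a cosine between two feature vectors (such as speaker embeddings) computed elsewhere. The thresholds are illustrative defaults. Intelligibility and naturalness scores usually need dedicated models and are not covered.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude; 0.0 for an empty sequence."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB: 20 * log10(rms(signal) / rms(noise))."""
    noise_rms = rms(noise)
    signal_rms = rms(signal)
    if noise_rms == 0.0:
        return math.inf
    if signal_rms == 0.0:
        return -math.inf
    return 20.0 * math.log10(signal_rms / noise_rms)


def clipping_ratio(samples: Sequence[float], threshold: float = 0.999) -> float:
    """Fraction of samples at or beyond the clipping threshold."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= threshold) / len(samples)


def silence_ratio(samples: Sequence[float], frame_size: int = 400,
                  threshold_db: float = -40.0) -> float:
    """Fraction of fixed-size frames whose RMS is below `threshold_db` dBFS."""
    limit = 10.0 ** (threshold_db / 20.0)
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    if not frames:
        return 0.0
    return sum(1 for frame in frames if rms(frame) < limit) / len(frames)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For example, a signal with RMS 1.0 measured against noise with RMS 0.1 gives an SNR of about 20 dB, since 20 * log10(10) = 20.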

🔗 Related Files

  • STARCONNECT_TRAINING_GUIDE.md - Colab training guide
  • F5_TTS_INTEGRATION.md - F5-TTS documentation
  • GPU_ORCHESTRATION_GUIDE.md - Cloud GPU options
  • NVIDIA_FREE_OPTIONS.md - Free NVIDIA resources

✅ Status

  • Orchestrator agent with LLM reasoning
  • Zero-shot cloning agent
  • Emotional TTS agent
  • Audio quality assessment
  • Multi-model ensemble
  • Unified CLI
  • Colab orchestration
  • 12 Ollama models available
  • 703 StarConnect segments processed

Built with cutting-edge 2026 AI techniques 🚀