This repository contains the demo for the audio-to-video synchronisation network (SyncNet). This network can be used for audio-visual synchronisation tasks including:
- Removing temporal lags between the audio and visual streams in a video;
- Determining who is speaking amongst multiple faces in a video.
Please cite the paper below if you make use of the software.
```
pip install -r requirements.txt
```
In addition, ffmpeg is required.
SyncNet demo:
```
python demo_syncnet.py --videofile data/example.avi --tmp_dir /path/to/temp/directory
```
Check that this script returns:
```
AV offset:  3
Min dist:   5.353
Confidence: 10.021
```
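If you want to drive the demo from another script, one option is to shell out and parse these lines. A minimal sketch, assuming the demo prints exactly the `AV offset` and `Confidence` lines shown above:

```python
# Minimal sketch: run demo_syncnet.py and parse its stdout.
# Assumes the "AV offset" and "Confidence" lines shown above;
# adjust the patterns if the output format differs.
import re
import subprocess

result = subprocess.run(
    ["python", "demo_syncnet.py",
     "--videofile", "data/example.avi",
     "--tmp_dir", "/tmp/syncnet"],
    capture_output=True, text=True, check=True,
)

offset = int(re.search(r"AV offset:\s*(-?\d+)", result.stdout).group(1))
confidence = float(re.search(r"Confidence:\s*([\d.]+)", result.stdout).group(1))
print(f"offset={offset} frames, confidence={confidence:.3f}")
```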
Full pipeline:
```
sh download_model.sh
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```
- `--min_face_size`: Minimum face size in pixels (default: 100). Reduce this value for videos with smaller faces.
- `--facedet_scale`: Scale factor for face detection (default: 0.25)
- `--crop_scale`: Scale bounding box (default: 0.40)
- `--min_track`: Minimum facetrack duration (default: 100 frames)
Example with smaller faces:
```
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output --min_face_size 50
```
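To batch the full three-step pipeline (`run_pipeline.py`, `run_syncnet.py`, `run_visualise.py`) over many videos, the scripts can be chained from Python. A minimal sketch, reusing the arguments shown above:

```python
# Minimal sketch: run the three pipeline stages in order for one video.
# The arguments mirror the commands shown above.
import subprocess

video, ref, out = "/path/to/video.mp4", "name_of_video", "/path/to/output"
for script in ("run_pipeline.py", "run_syncnet.py", "run_visualise.py"):
    subprocess.run(
        ["python", script,
         "--videofile", video, "--reference", ref, "--data_dir", out],
        check=True,
    )
```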
For video preprocessing, chunking, and analysis, use the video utilities:
```
# Get video information
python utils/video_utils.py info --input data/video.mp4

# Split video into 30-second chunks
python utils/video_utils.py chunk-time --input data/video.mp4 --output_dir chunks/ --duration 30

# Split video at silence boundaries (ideal for speech)
python utils/video_utils.py chunk-silence --input data/conversation.mp4 --output_dir chunks/

# Extract audio for processing
python utils/video_utils.py extract-audio --input data/video.mp4 --output audio/extracted.wav
```

📖 See VIDEO_UTILS_GUIDE.md for comprehensive usage examples and best practices.
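For reference, time-based chunking can also be reproduced with plain ffmpeg. A minimal sketch using ffmpeg's segment muxer (not necessarily the utility's actual implementation); stream copy splits only on keyframes, so chunk boundaries are approximate:

```python
# Minimal sketch: 30-second chunks via ffmpeg's segment muxer.
# "-c copy" avoids re-encoding but can only cut on keyframes.
import os
import subprocess

os.makedirs("chunks", exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "data/video.mp4",
     "-f", "segment", "-segment_time", "30",
     "-c", "copy", "-reset_timestamps", "1",
     "chunks/chunk_%03d.mp4"],
    check=True,
)
```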
Symptoms:
- The `$DATA_DIR/pycrop/$REFERENCE/` directory is empty
- No bounding boxes appear in the output video
- Face detection appears to work (faces are detected in console output)

Cause: The detected faces are smaller than the minimum face size threshold.

Solution:
- Check your detected face sizes by examining the face detection output
- Reduce the `--min_face_size` parameter in `run_pipeline.py`
- For videos with small faces (< 100 pixels), try `--min_face_size 50` or lower
Example fix:
```
# Instead of default parameters
python run_pipeline.py --videofile data/chunk_003.mp4 --reference chunk_003 --data_dir data/test/

# Use lower minimum face size
python run_pipeline.py --videofile data/chunk_003.mp4 --reference chunk_003 --data_dir data/test/ --min_face_size 50
```

Process multiple videos and automatically filter them based on audio-visual synchronization quality:
```
# Basic filtering with default thresholds
python filter_videos_by_sync_score.py --input_dir /path/to/videos --output_dir /path/to/filtered_results

# Using quality presets
python filter_videos_by_sync_score.py --input_dir /path/to/videos --output_dir /path/to/results --preset high

# Custom quality thresholds
python filter_videos_by_sync_score.py \
    --input_dir /path/to/videos \
    --output_dir /path/to/results \
    --min_confidence 6.0 \
    --max_abs_offset 3 \
    --min_face_size 40 \
    --max_workers 4
```

Quality Presets:
- `--preset strict`: confidence ≥ 8.0, |offset| ≤ 2 (publication ready)
- `--preset high`: confidence ≥ 6.0, |offset| ≤ 3 (training data quality)
- `--preset medium`: confidence ≥ 4.0, |offset| ≤ 5 (balanced filtering)
- `--preset relaxed`: confidence ≥ 2.0, |offset| ≤ 8 (keep most usable videos)
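The pass/fail rule behind these presets reduces to two comparisons. A minimal sketch of that logic, with the threshold values copied from the list above (the function name is illustrative, not the tool's internal API):

```python
# Minimal sketch of the preset decision rule described above:
# keep a video if confidence is high enough and |offset| small enough.
PRESETS = {
    "strict":  {"min_confidence": 8.0, "max_abs_offset": 2},
    "high":    {"min_confidence": 6.0, "max_abs_offset": 3},
    "medium":  {"min_confidence": 4.0, "max_abs_offset": 5},
    "relaxed": {"min_confidence": 2.0, "max_abs_offset": 8},
}

def passes(confidence: float, offset: int, preset: str = "high") -> bool:
    p = PRESETS[preset]
    return confidence >= p["min_confidence"] and abs(offset) <= p["max_abs_offset"]

print(passes(confidence=10.021, offset=3, preset="high"))  # True
```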
Output Structure:
```
output_dir/
├── good_quality/                      # Videos that pass quality thresholds
│   ├── video1.mp4                     # Original videos
│   ├── video2.mp4
│   └── syncnet_outputs/               # SyncNet processing results (NEW!)
│       ├── video1/
│       │   ├── cropped_faces/         # Cropped face track videos
│       │   ├── video1_with_bboxes.avi # Video with bounding boxes
│       │   └── analysis/              # Analysis results
│       │       ├── offsets.txt        # Frame offset values
│       │       └── tracks.pkl         # Face tracking data
│       └── video2/
│           └── ...
├── poor_quality/                      # Videos filtered out for low quality
│   ├── rejected_video.mp4             # Original videos
│   └── syncnet_outputs/               # SyncNet outputs for poor-quality videos
│       └── rejected_video/
│           └── ...
├── syncnet_outputs/                   # All SyncNet processing results
│   ├── video1/                        # Individual video results
│   └── video2/
└── sync_filter_results.json           # Detailed analysis results
```
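The summary JSON can be post-processed with the standard library. A minimal sketch; the field names used here are illustrative assumptions, so check the actual file for its schema:

```python
# Minimal sketch: load sync_filter_results.json and list each video.
# The keys ("videos", "name", "confidence", "offset", "passed") are
# assumed for illustration; inspect the real file for the schema.
import json

with open("output_dir/sync_filter_results.json") as f:
    results = json.load(f)

for entry in results.get("videos", []):
    print(entry.get("name"), entry.get("confidence"),
          entry.get("offset"), entry.get("passed"))
```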
🎯 Enhanced Output Preservation (NEW!):
The filtering tool now preserves valuable SyncNet processing outputs for further analysis:
- Cropped Face Videos: Individual face tracks as separate video files (`cropped_faces/*.avi`)
- Bounding Box Visualizations: Original video with face detection overlays (`*_with_bboxes.avi`)
- Analysis Data: Frame offsets, confidence scores, and tracking information
- Quality-Organized: All outputs copied to `good_quality/` and `poor_quality/` folders
This is especially useful for:
- Active Speaker Detection: Use cropped faces and bounding boxes for annotation
- Training Data Preparation: Access to pre-processed face tracks and metadata
- Quality Analysis: Compare outputs between good and poor quality videos
Testing the Enhanced Filter:
Test the enhanced filtering functionality:
```
python test_enhanced_filter.py
```

This will verify that:
- All required dependencies are present
- Output preservation works correctly
- Quality directories are created with proper structure
- SyncNet outputs are preserved in organized folders
Organize SyncNet results into structured directories for easy access:
```
# Organize filtered results into 4 directories
python utils/directory_prepare.py \
    --input_dir results/video1/good_quality \
    --output_dir organized_output
```

This creates:
- `video_normal/`: Original chunk videos
- `video_bbox/`: Bounding box visualizations (converted to MP4)
- `video_cropped/`: Cropped face videos (converted to MP4)
- `audio/`: Extracted audio files (16 kHz mono WAV)
See `utils/DIRECTORY_PREPARE_README.md` for a detailed usage guide.
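For reference, the `audio/` conversion corresponds to a standard ffmpeg resample. A minimal sketch producing the 16 kHz mono WAV format listed above (the input filename is a hypothetical placeholder):

```python
# Minimal sketch: extract 16 kHz mono WAV audio with ffmpeg,
# matching the audio/ format described above. The input path
# is a placeholder.
import os
import subprocess

os.makedirs("audio", exist_ok=True)
subprocess.run(
    ["ffmpeg", "-y", "-i", "video_normal/chunk_001.mp4",
     "-vn",           # drop the video stream
     "-ac", "1",      # mono
     "-ar", "16000",  # 16 kHz sample rate
     "audio/chunk_001.wav"],
    check=True,
)
```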
Parameters:
- `--min_confidence`: Minimum SyncNet confidence score required to keep a video
- `--max_abs_offset`: Maximum absolute frame offset allowed to keep a video
- `--keep_all`: Analyze quality but don't copy files to separate folders
Outputs:
```
$DATA_DIR/pycrop/$REFERENCE/*.avi        - cropped face tracks
$DATA_DIR/pywork/$REFERENCE/offsets.txt  - audio-video offset values
$DATA_DIR/pyavi/$REFERENCE/video_out.avi - output video
```
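To inspect these outputs from Python, a minimal sketch, assuming the directory layout above and treating `offsets.txt` as plain text (its exact per-line format may vary by version):

```python
# Minimal sketch: list the cropped face tracks and print the
# offset values for one reference. Paths follow the layout above.
import glob
import os

data_dir, reference = "/path/to/output", "name_of_video"

crops = glob.glob(os.path.join(data_dir, "pycrop", reference, "*.avi"))
print(f"{len(crops)} cropped face track(s)")

with open(os.path.join(data_dir, "pywork", reference, "offsets.txt")) as f:
    print(f.read())
```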
## Todo
- add duplicate remover
Citation:

```
@InProceedings{Chung16a,
  author    = "Chung, J.~S. and Zisserman, A.",
  title     = "Out of time: automated lip sync in the wild",
  booktitle = "Workshop on Multi-view Lip-reading, ACCV",
  year      = "2016",
}
```

