This repository contains the demo for the audio-to-video synchronisation network (SyncNet). This network can be used for audio-visual synchronisation tasks including:
- Removing temporal lags between the audio and visual streams in a video;
- Determining who is speaking amongst multiple faces in a video.
The model can be used for non-commercial research purposes under the Creative Commons Attribution License. Please cite the paper below if you make use of the software.
The following packages are required to run the SyncNet demo:
python (2.7.12)
pytorch (0.4.0)
numpy (1.14.3)
scipy (1.0.1)
opencv-python (3.4.0) - install via the opencv-contrib-python package
python_speech_features (0.6)
cuda (8.0)
ffmpeg (3.4.2)
In addition to the above, the following packages are required to run the full pipeline:
tensorflow (1.2, 1.4)
pyscenedetect (0.3.5) - does not work with version 0.4
The demo has been tested with the package versions shown above, but may also work on other versions.
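For convenience, one possible way to install the Python dependencies with pip is sketched below; the PyPI package names (torch, opencv-contrib-python, python_speech_features, scenedetect) and version pins are assumptions and may need adjusting for your environment, and CUDA and ffmpeg must be installed separately.
# Sketch only: Python packages for the demo (pin versions as needed, e.g. torch==0.4.0)
pip install torch numpy scipy opencv-contrib-python python_speech_features
# Additional packages for the full pipeline
pip install tensorflow scenedetect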
SyncNet demo:
python demo_syncnet.py --videofile data/example.avi --tmp_dir /path/to/temp/directory
Check that this script returns:
AV offset: 4
Min dist: 6.568
Confidence: 9.889
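The AV offset is reported in video frames; assuming 25 fps video, an offset of 4 frames corresponds to 4/25 = 0.16 s. One possible way to compensate for such a lag with ffmpeg is sketched below; the output file name is hypothetical, and the sign of the shift (whether the audio should be delayed or advanced) should be checked against your own results.
# Sketch only: delay the audio of example.avi by 0.16 s (4 frames at 25 fps) and remux
ffmpeg -i data/example.avi -itsoffset 0.16 -i data/example.avi -map 0:v -map 1:a -c copy data/example_shifted.avi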
Full pipeline:
sh download_model.sh
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
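The three steps above can be chained over a directory of videos; the shell sketch below is one possible wrapper, with the input and output paths as placeholders and each video's base name used as its reference.
# Sketch only: run the full pipeline on every .mp4 in a directory
DATA_DIR=/path/to/output
for f in /path/to/videos/*.mp4; do
  ref=$(basename "$f" .mp4)
  python run_pipeline.py  --videofile "$f" --reference "$ref" --data_dir "$DATA_DIR"
  python run_syncnet.py   --videofile "$f" --reference "$ref" --data_dir "$DATA_DIR"
  python run_visualise.py --videofile "$f" --reference "$ref" --data_dir "$DATA_DIR"
done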
Outputs:
$DATA_DIR/pycrop/$REFERENCE/*.avi - cropped face tracks
$DATA_DIR/pywork/$REFERENCE/offsets.txt - audio-video offset values
$DATA_DIR/pyavi/$REFERENCE/video_out.avi - output video
Citation:
@InProceedings{Chung16a,
  author = "Chung, J.~S. and Zisserman, A.",
  title = "Out of time: automated lip sync in the wild",
  booktitle = "Workshop on Multi-view Lip-reading, ACCV",
  year = "2016",
}

