😭 SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang
CVPR 2023
TL;DR: A realistic and stylized talking head video generation method from a single image and audio
- 2023.03.06 Fixed some bugs in the code and errors in installation.
- 2023.03.03 Released the test code for audio-driven single-image animation!
- 2023.02.28 SadTalker has been accepted by CVPR 2023!
- Generating 2D faces from a single image.
- Generating 3D faces from audio.
- Generating 4D free-view talking examples from audio and a single image.
- Gradio/Colab demo.
- Integration with stable-diffusion-web-ui. (Stay tuned!)
sadtalker_demo_short.mp4
- Training code for each component.
- Python
- PyTorch
- ffmpeg
git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker
conda create -n sadtalker python=3.8
source activate sadtalker
pip3 install torch torchvision torchaudio
conda config --add channels conda-forge
conda install ffmpeg
pip install ffmpy
pip install Cmake
pip install boost
conda install dlib
pip install -r requirements.txt
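After installing, a quick sanity check can confirm that PyTorch and ffmpeg are visible from the environment. This snippet is only illustrative and not part of the SadTalker repo; a CUDA-enabled PyTorch build is assumed only if you intend to run on GPU.

```python
# Minimal post-install sanity check (illustrative; not part of the SadTalker codebase).
import shutil
import torch
import torchaudio

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())  # CPU-only inference will be much slower
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```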
Please download our pre-trained models and put them in ./checkpoints.
| Model | Description |
|---|---|
| checkpoints/auido2exp_00300-model.pth | Pre-trained ExpNet in SadTalker. |
| checkpoints/auido2pose_00140-model.pth | Pre-trained PoseVAE in SadTalker. |
| checkpoints/mapping_00229-model.pth.tar | Pre-trained MappingNet in SadTalker. |
| checkpoints/facevid2vid_00189-model.pth.tar | Pre-trained face-vid2vid model from the unofficial reimplementation of face-vid2vid. |
| checkpoints/epoch_20.pth | Pre-trained 3DMM extractor from Deep3DFaceReconstruction. |
| checkpoints/wav2lip.pth | Highly accurate lip-sync model from Wav2lip. |
| checkpoints/shape_predictor_68_face_landmarks.dat | Face landmark model used in dlib. |
| checkpoints/BFM | 3DMM library file. |
| checkpoints/hub | Face detection models used in face alignment. |
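Before running inference, it can help to verify that everything in the table above actually ended up under ./checkpoints. The helper below is not part of the repo, just a small sketch built from the file names listed in the table.

```python
# Illustrative check (not part of the SadTalker repo) that the downloaded
# checkpoints listed above are present under ./checkpoints.
from pathlib import Path

EXPECTED = [
    "auido2exp_00300-model.pth",
    "auido2pose_00140-model.pth",
    "mapping_00229-model.pth.tar",
    "facevid2vid_00189-model.pth.tar",
    "epoch_20.pth",
    "wav2lip.pth",
    "shape_predictor_68_face_landmarks.dat",
    "BFM",  # directory with 3DMM library files
    "hub",  # directory with face-detection models
]

root = Path("checkpoints")
missing = [name for name in EXPECTED if not (root / name).exists()]
if missing:
    print("Missing checkpoints:", ", ".join(missing))
else:
    print("All checkpoints found.")
```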
python inference.py --driven_audio <audio.wav> --source_image <video.mp4 or picture.png> --result_dir <folder to store results>
To do ...
We use `camera_yaw`, `camera_pitch`, and `camera_roll` to control the camera pose. For example, `--camera_yaw -20 30 10` means the camera yaw angle changes from -20 to 30 degrees and then from 30 back to 10 degrees.
python inference.py --driven_audio <audio.wav> \
--source_image <video.mp4 or picture.png> \
--result_dir <folder to store results> \
--camera_yaw -20 30 10
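For intuition, the keyframe list passed to `--camera_yaw` can be thought of as control points that are interpolated across the generated frames. The sketch below only illustrates piecewise-linear keyframe expansion; the `expand_keyframes` helper and the even spacing of keyframes are assumptions, not SadTalker's actual implementation.

```python
# Hypothetical sketch of how a yaw keyframe list such as "-20 30 10" could be
# expanded into per-frame angles; SadTalker's actual interpolation may differ.
import numpy as np

def expand_keyframes(keyframes, num_frames):
    """Piecewise-linearly interpolate keyframes across num_frames output frames."""
    keyframes = np.asarray(keyframes, dtype=float)
    # Positions of the keyframes, spread evenly over [0, num_frames - 1].
    key_pos = np.linspace(0, num_frames - 1, num=len(keyframes))
    frame_pos = np.arange(num_frames)
    return np.interp(frame_pos, key_pos, keyframes)

yaw = expand_keyframes([-20, 30, 10], num_frames=100)
print(yaw[:5], yaw[-5:])  # yaw rises from -20 toward 30, then falls back to 10
```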
If you find our work useful in your research, please consider citing:
@article{zhang2022sadtalker,
title={SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation},
author={Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei},
journal={arXiv preprint arXiv:2211.12194},
year={2022}
}
Facerender code borrows heavily from zhanglonghao's reimplementation of face-vid2vid and from PIRender. We thank the authors for sharing their wonderful code. In the training process, we also use models from Deep3DFaceReconstruction and Wav2lip, and we thank them for their wonderful work.
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN (ECCV 2022)
- CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior (CVPR 2023)
- VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild (SIGGRAPH Asia 2022)
- DPE: Disentanglement of Pose and Expression for General Video Portrait Editing (CVPR 2023)
- 3D GAN Inversion with Facial Symmetry Prior (CVPR 2023)
- T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations (CVPR 2023)


