Repository adapted from https://github.com/skit-ai/SpeechLLM/tree/main
This repository provides code for training and testing Speech LLMs for speaker characterization and summarization, adapted for the CLSP grid.
The following datasets have been adapted and can be used for training; their splits and fields are listed below:
| dataset | train split | dev split | test split | fields |
|---|---|---|---|---|
| CREMA-D | train | dev | test | 6 categorical emotions, gender, transcript |
| Common Voice EN v11 | train | dev | test | nationality, age (decade), gender, accent (16 nationalities), transcript |
| IEMOCAP | ses01-03 | ses04 | ses05 | 4 categorical emotions, gender |
| LibriSpeech | train-clean-100, train-clean-360, train-other-500 | dev-clean, dev-other | test-clean, test-other | gender, transcript |
| MSP-Podcast | train | validation | test | 8 categorical emotions, gender |
| Switchboard | train | validation | test | transcript, summary |
| VoxCeleb1 | dev | test | test | gender, accent (from nationality) |
| VoxCeleb2-AE | dev | test | test | gender, age, accent (from nationality) |
| WSJ0 | si_tr_s | si_dt_05 | si_et_05 | gender, transcript |
Most datasets use their original splits.
The VoxCeleb datasets use the same data for validation and test, since model selection is based on the validation loss anyway.
To add a new CSV, the necessary functions are in local/data_csv.
If you want access to the data, copy the contents of /home/tthebau1/EDART/SpeechLLM/data/ into your own data/ folder.
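For illustration, writing a CSV for a new dataset could look like the sketch below. The column names are assumptions based on the fields in the table above, not the loaders' actual expected format; check `local/data_csv` for the real helpers.

```python
# Hypothetical sketch of writing a dataset CSV. Column names are assumptions
# based on the fields in the table above; check local/data_csv for the
# actual format expected by the loaders.
import csv

rows = [
    {"audio_path": "data/my_corpus/utt001.wav", "split": "train",
     "gender": "female", "emotion": "happy", "transcript": "hello world"},
]
with open("data/my_corpus.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```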
In general, all parameters and their default values can be adjusted in the get_model_config() function in utils.py.
The base system currently uses:

- A WavLM base-plus feature encoder, with 768-dimensional output features. It can be replaced by any Hugging Face encoder by modifying the parameters: `--encoder 'microsoft/wavlm-base-plus'` (currently accepts `facebook/hubert-xlarge-ll60k`, `microsoft/wavlm-large`, `microsoft/wavlm-base-plus`, and `MFCC`; the list can be expanded in `models/encoder.py`) and `--encoder-dim 768` to adjust the desired output dimension.
- A windowed mean-pooling layer, with a ratio `--meanpool 5`.
- A CNN connector, which uses the following parameters: `--connector 'cnn'` for the type of connector (more types and architectures can be added in the `models/connector.py` file), `--connector-k 2` for the stride, `--connector-layers 2` for the number of layers in the case of an MLP, and `--connector-dim 1024` for the output dimension of the features.
- An LLM. It currently uses TinyLlama, which you can change by adjusting the parameter `--llm 'TinyLlama-1.1B-Chat-v1.0'`. A shape-level sketch of this pipeline is given after this list.
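As a minimal, shape-level sketch of the encoder → mean-pooling → connector flow: it only mirrors the default dimensions and strides listed above; the kernel size and exact layer structure are assumptions, and the real implementations live in `models/encoder.py` and `models/connector.py`.

```python
# Shape-level sketch of the feature flow (encoder -> meanpool -> connector).
# Kernel size and layer structure are assumptions; see models/encoder.py and
# models/connector.py for the actual implementations.
import torch
import torch.nn as nn

B, T, D_enc, D_llm = 2, 500, 768, 1024     # batch, frames, encoder dim, connector output dim

feats = torch.randn(B, T, D_enc)           # WavLM base-plus features (--encoder-dim 768)

# Windowed mean pooling with ratio 5 (--meanpool 5): average every 5 frames.
pooled = nn.functional.avg_pool1d(feats.transpose(1, 2), kernel_size=5).transpose(1, 2)
# pooled: (B, T // 5, 768)

# CNN connector (--connector 'cnn') with stride 2 (--connector-k 2),
# projecting to 1024 dims (--connector-dim 1024). Kernel size 3 is an assumption.
connector = nn.Conv1d(D_enc, D_llm, kernel_size=3, stride=2, padding=1)
llm_inputs = connector(pooled.transpose(1, 2)).transpose(1, 2)
print(llm_inputs.shape)                    # (B, T // 10, 1024): tokens fed to the LLM
```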
There is one learning rate for the LoRA adapters of the LLM and for the connector (`--lr 0.0001`), and one learning rate for the feature encoder (`--encoder-lr 0.000001`, by default 100 times lower than the base learning rate).
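Conceptually, this corresponds to a two-group optimizer along these lines; the modules below are placeholders, and the optimizer choice is an assumption, not the repository's actual setup.

```python
# Hedged sketch of the two learning-rate groups implied by --lr and
# --encoder-lr. The modules are stand-ins, and AdamW is an assumption.
import torch
import torch.nn as nn

encoder = nn.Linear(768, 768)          # stand-in for the WavLM feature encoder
connector = nn.Linear(768, 1024)       # stand-in for the CNN connector
lora_adapters = nn.Linear(1024, 1024)  # stand-in for the LLM's LoRA adapters

optimizer = torch.optim.AdamW([
    {"params": list(connector.parameters())
               + list(lora_adapters.parameters()), "lr": 1e-4},  # --lr
    {"params": encoder.parameters(), "lr": 1e-6},                # --encoder-lr
])
```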
If `--no-lora` is passed, the LLM is frozen.
If `--ft-encoder` is passed, the encoder is fine-tuned.
If `--use-text` is passed, transcripts are added as inputs when available, with probability `--prob-text` during training (see the sketch below).
If `--no-audio` is passed, only the transcripts are used; neither the encoder nor the connector is initialized or used.
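The `--use-text`/`--prob-text` behavior can be sketched as follows; only the probabilistic gating mirrors the description above, and the prompt format is an assumption.

```python
# Sketch of the --use-text / --prob-text behavior: during training, the
# transcript is appended to the model input with probability prob_text.
# The prompt format here is an assumption.
import random
from typing import Optional

def maybe_add_text(prompt: str, transcript: Optional[str], prob_text: float) -> str:
    if transcript is not None and random.random() < prob_text:
        return f"{prompt}\nTranscript: {transcript}"
    return prompt
```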
Training configurations are defined in the `configs` folder; they can be passed as arguments, e.g. `--use-config summarize_switchboard.json`.
Each configuration defines which datasets should be used for training, validation, and testing, and which tasks should be used for each.
Training runs for up to `--total-training-epoch` epochs, and the top 3 checkpoints are saved in `checkpoints/`. Use `--epoch-to-test` to test a specific epoch.
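For reference, keeping the top 3 checkpoints could look like the sketch below, assuming a PyTorch Lightning-style trainer (an assumption about the repository's trainer, as is the monitored metric name).

```python
# Hedged sketch of top-3 checkpointing, assuming a PyTorch Lightning-style
# trainer; the monitored metric name is an assumption.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="checkpoints/",   # matches the checkpoints/ folder mentioned above
    save_top_k=3,             # keep the 3 best models
    monitor="val_loss",       # assumed metric name
    mode="min",
)
```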
`--group` is used by wandb to put the experiment in a given group.
`--nickname` is used to differentiate experiments and models that share a similar architecture but vary in configuration.
- Conda: a conda environment is available in `environment.yml`; use `conda env create -f environment.yml`.
- Pip: the dependencies are listed in `requirements.txt`; use `pip install -r requirements.txt`.
To train a network, use `sbatch launch/$expe_series/train/$your_script.sh`.
To test it, use `sbatch launch/$expe_series/test/$your_script.sh`.
The experiments with a simple linear layer for speaker characterization are available in `launch/ASRU2025`, allowing partial reproduction of the article "Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM".
Please cite it if you use those experiments:
```
@article{thebaud2025enhancing,
  title={Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM},
  author={Thebaud, Thomas and Lu, Yen-Ju and Wiesner, Matthew and Viechnicki, Peter and Dehak, Najim},
  journal={arXiv preprint arXiv:2508.04795},
  year={2025}
}
```
The experiments with a CNN connector for audio summarization are available in `launch/ICASSP2025`. The corresponding article was not submitted to ICASSP; it will be submitted to a different venue.
If you have any questions, please contact Thomas Thebaud on Slack, or use tthebau1@jhu.edu.