Training large language models (LLMs) typically requires massive volumes of high-quality data, which are often distributed across multiple organizations and cannot be centrally aggregated due to privacy regulations. Federated learning offers a promising paradigm for collaboratively training LLMs without sharing raw data from each participant. However, existing federated LLM training systems face challenges: (1) limited and heterogeneous computational resources that prevent many participants from training large-scale models, (2) inefficient training pipelines due to the heavy coupling of computation and communication, and (3) potential privacy leakage through gradient exchanges during model aggregation.
To address these challenges, we present FedScaleLLM, the first system that jointly addresses resource heterogeneity, pipeline inefficiency, and privacy leakage in federated LLM training. First, a Resource-Aware Model Management mechanism partitions model states across clients and dynamically loads layers on demand, significantly reducing the GPU memory footprint at each participant. Second, a Pipelined Parallel Training Engine overlaps computation and communication through asynchronous pipelined execution and clustered parallel training, substantially improving system throughput. Third, an Anonymous Routing mechanism forwards gradient updates through dynamically constructed multi-hop paths, breaking the cross-round linkage between client identities and transmitted updates to mitigate privacy leakage risks. Extensive experiments on three benchmarks under different heterogeneous environments show that FedScaleLLM reduces GPU memory usage by up to 6x, lowers end-to-end training time by 17x, and achieves 18x higher throughput compared with state-of-the-art methods, while demonstrating privacy protection capability.
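The Anonymous Routing idea described above can be illustrated with a toy sketch. This is not FedScaleLLM's implementation: the names `build_route` and `forward`, the fixed hop count, and the uniform sampling of relays are all assumptions made for illustration. The point is only that a fresh multi-hop path per round means the receiver sees the last relay rather than the originating client.

```python
import random
from typing import List, Optional


def build_route(sender: str, clients: List[str], num_hops: int,
                rng: Optional[random.Random] = None) -> List[str]:
    """Pick `num_hops` distinct relay clients (never the sender) in random order."""
    if num_hops < 1:
        raise ValueError("need at least one relay hop")
    rng = rng or random.Random()
    relays = [c for c in clients if c != sender]
    if num_hops > len(relays):
        raise ValueError("not enough relay candidates for the requested hop count")
    return rng.sample(relays, num_hops)


def forward(update: dict, route: List[str]) -> dict:
    """Simulate delivery: the receiver learns only the last relay, not the origin."""
    return {"payload": update, "received_from": route[-1]}


# Hypothetical usage: a new route is drawn every round (assumed policy).
clients = [f"client{i}" for i in range(1, 6)]
for round_id in range(3):
    route = build_route("client1", clients, num_hops=3)
    message = forward({"round": round_id, "grad": [0.1, -0.2]}, route)
    print(round_id, route, message["received_from"])
```

A real deployment would also need encryption between hops and protection against colluding relays, which this sketch does not model.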
Python version: 3.9.21
Other dependencies are listed in requirements.txt.
Experiments are conducted in a heterogeneous federated environment consisting of 10 physical machines interconnected via 10 Gbps links, including 7 servers equipped with dual NVIDIA GeForce RTX 3090 GPUs and 3 servers with dual RTX 2080 Ti GPUs. Different tasks instantiate different numbers of logical clients to reflect realistic deployments: 9 clients for Code Generation, 8 clients for Question Answering, and 3 clients for Math Problem Solving, prioritizing RTX 3090 servers when available. For scalability evaluation, we scale to 16 logical clients by partitioning the 3090 servers and incorporating additional 2080 Ti-based clients.
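One possible reading of this placement policy is sketched below. The server names, the `Slot` type, and the rule that a whole dual-GPU server hosts one client unless the RTX 3090 servers are split per GPU are assumptions, not details confirmed by the setup description; only the server counts and the 3090-first priority come from the text.

```python
from typing import Dict, List, Tuple

Slot = Tuple[str, str, Tuple[int, ...]]  # (server name, GPU model, GPU indices)


def build_slots(split_3090: bool) -> List[Slot]:
    """List client slots with RTX 3090 servers first (hypothetical server names)."""
    slots: List[Slot] = []
    for i in range(1, 8):  # 7 dual-3090 servers
        name = f"server{i}"
        if split_3090:
            # Assumed scalability setting: one logical client per 3090 GPU.
            slots += [(name, "RTX 3090", (0,)), (name, "RTX 3090", (1,))]
        else:
            slots.append((name, "RTX 3090", (0, 1)))
    for i in range(8, 11):  # 3 dual-2080 Ti servers, kept whole in this sketch
        slots.append((f"server{i}", "RTX 2080 Ti", (0, 1)))
    return slots


def place_clients(num_clients: int, split_3090: bool = False) -> Dict[str, Slot]:
    """Assign logical clients to slots in priority order."""
    slots = build_slots(split_3090)
    if num_clients > len(slots):
        raise ValueError("more clients than available slots")
    return {f"client{i + 1}": slots[i] for i in range(num_clients)}


for task, n in [("Code Generation", 9), ("Question Answering", 8), ("Math Problem Solving", 3)]:
    models = [gpu for _, gpu, _ in place_clients(n).values()]
    print(task, models.count("RTX 3090"), models.count("RTX 2080 Ti"))
```

Under these assumptions, calling `place_clients(16, split_3090=True)` fills the 14 individual RTX 3090 GPUs first and then uses two RTX 2080 Ti servers.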
We adopt the benchmark datasets released with FederatedScope-LLM. As summarized in the following table, the tasks span three representative domains (code generation, question answering, and math problem solving), each exhibiting a distinct data heterogeneity pattern.
| Task | Training Dataset | # training samples | Partition | # clients | Test Dataset | # test samples |
|---|---|---|---|---|---|---|
| Code Generation | Fed-CodeAlpaca | 7954 | Non-IID | 9 | HumanEval | 656 |
| Question Answering | Fed-Dolly | 15015 | Non-IID | 8 | HELM | 1600 |
| Math Problem Solving | Fed-GSM8K-3 | 7473 | IID | 3 | GSM8K | 1319 |
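The difference between the Non-IID and IID partitions in the table can be sketched as follows. This is an illustration only: the grouping field (`"category"` here), the helper names, and the round-robin dealing are assumptions, and the released datasets may be partitioned differently.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_key(samples: List[dict], key: str) -> Dict[str, List[dict]]:
    """Non-IID style split: every distinct value of `key` becomes one client."""
    clients = defaultdict(list)
    for sample in samples:
        clients[sample[key]].append(sample)
    return dict(clients)


def split_uniform(samples: List[dict], num_clients: int) -> List[List[dict]]:
    """IID style split: deal (already shuffled) samples round-robin to clients."""
    return [samples[i::num_clients] for i in range(num_clients)]


# Toy data with a hypothetical "category" field.
toy = [{"id": i, "category": ["qa", "summarize", "classify"][i % 3]} for i in range(12)]
non_iid = split_by_key(toy, "category")
iid = split_uniform(toy, 3)
print({name: len(part) for name, part in non_iid.items()})
print([len(part) for part in iid])
```

With a key-based split, each client sees only one kind of sample, so local distributions differ sharply; with the uniform split, every client receives a similar mix.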
This paper uses five LLMs: DeepSeek-Qwen-1.5B, DeepSeek-Qwen-7B, GPT3, DeepSeek-Llama-8B, and DeepSeek-Qwen-14B.
We evaluate FedScaleLLM against a comprehensive suite of federated LLM training baselines covering the three dominant paradigms in the field: gradient approximation, model compression, and split learning. For each paradigm, the most competitive state-of-the-art method is chosen as the representative baseline. In addition, we compare our framework with two state-of-the-art industrial federated LLM training systems. The baselines are shown below:
| Baseline | Year | Conference | Paper |
|---|---|---|---|
| SPRY | 2024 | NeurIPS | Thinking forward: memory-efficient federated finetuning of language models |
| FedBiOT | 2024 | KDD | FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model |
| M2FedSA | 2024 | ICML | Enhancing Storage and Computational Efficiency in Federated Multimodal Learning for Large-Scale Models |
| FederatedScope-LLM | 2024 | KDD | FederatedScope-LLM: A comprehensive package for fine-tuning large language models in federated learning |
| FATE-LLM | 2023 | arXiv | FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models |
An example command for launching FedScaleLLM, here with the client1 configuration, is as follows.
HF_HUB_OFFLINE=1 CUDA_VISIBLE_DEVICES=0 python federatedscope/main.py --cfg federatedscope/llm/baseline/client1.yaml 2>&1 | tee logs/client1.log
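For scripted launches, the same invocation can be assembled from Python. The helper `build_launch` below is hypothetical and not part of the repository; it only reuses the script path, the `--cfg` flag, and the two environment variables from the command above. `HF_HUB_OFFLINE=1` tells the Hugging Face Hub client not to make network calls, and `CUDA_VISIBLE_DEVICES` restricts which GPU the process can see.

```python
import shlex
from typing import Dict, List, Tuple


def build_launch(cfg: str, gpu: int, offline: bool = True) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment overrides, argv) mirroring the example command."""
    env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
    if offline:
        env["HF_HUB_OFFLINE"] = "1"
    argv = ["python", "federatedscope/main.py", "--cfg", cfg]
    return env, argv


env, argv = build_launch("federatedscope/llm/baseline/client1.yaml", gpu=0)
# Print the equivalent shell command for inspection.
print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(argv))

# To actually start the process (not executed here):
#   import os, subprocess
#   subprocess.run(argv, env={**os.environ, **env}, check=True)
```

Unlike the shell version, this sketch does not tee output to `logs/client1.log`; redirecting `stdout` in `subprocess.run` would be one way to add that.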
