iNeil77/AWS_DistTraining_Tutorial

Multi-node reward model training on AWS with EFA and FSx storage

This tutorial walks through setting up a multi-node reward model training environment on AWS using Elastic Fabric Adapter (EFA) networking and FSx for Lustre storage. It is organized into chapters, each covering one part of the setup. As a practical walkthrough, we fine-tune an already post-trained language model on the pointwise reward modeling task using the Bradley-Terry loss objective. The same setup applies to a wide range of other training tasks and models.
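For readers new to the Bradley-Terry objective, the following is a minimal sketch of the loss in plain Python. It assumes the reward model assigns one scalar score to each response and that the training data consists of (chosen, rejected) pairs. The function name `bradley_terry_loss` and its arguments are hypothetical, chosen for illustration; this is not the code Axolotl uses.

```python
import math


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean Bradley-Terry loss over preference pairs.

    Per pair, the loss is -log(sigmoid(r_chosen - r_rejected)).
    Hypothetical helper for illustration only, not Axolotl's implementation.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("chosen and rejected reward lists must have equal length")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) == log(1 + exp(-m)); this form avoids overflow
        # when the margin is a large negative number.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


if __name__ == "__main__":
    # Scalar scores a reward model might assign to two preference pairs.
    print(bradley_terry_loss([1.2, 0.3], [0.4, 0.9]))
```

The loss shrinks as the chosen response is scored further above the rejected one, which is how pairwise preference data can train a model that scores each response on its own.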

This practical tutorial uses a modified version of the Axolotl Framework v0.8.1, with changes to support Qwen3 chat models. At the end of the tutorial, our infrastructure setup will look as follows:

[Figure: Multi-node training setup]

Contents

  1. Chapter 0: Set up test EC2, create final AMI (optional), and compile Docker images
  2. Chapter 1: Create security groups and cluster placement groups
  3. Chapter 2: Create shared FSx for Lustre storage
  4. Chapter 3: Create launch template, maximize network bandwidth, and configure swap and FSx mounting
  5. Chapter 4: Launch EC2 instances, assign public IPs, and verify EFA and FSx connectivity
  6. Chapter 5: Run distributed training of the reward model and evaluate it on RewardBench

Prerequisites

  • An AWS account with permissions to create and manage EC2 instances, security groups, EFA interfaces, and FSx storage. The account also needs service quotas high enough for the resources used (e.g., EC2 instances, EFA interfaces, FSx storage).
  • Basic knowledge of AWS services, particularly EC2, security groups, and EFA.
  • Familiarity with deep learning frameworks and distributed training concepts is beneficial but not required.

Where to go from here

This tutorial demonstrates how to set up a multi-node training environment on AWS with EFA and FSx storage, using the fine-tuning of a language model on a reward modeling task as a practical example. The setup relies mainly on the AWS console GUI, however. For large, repeatable experiments, we recommend automating it with an infrastructure-as-code tool such as Terraform or AWS CloudFormation so that it is easier to reproduce. It may also help to run a Slurm cluster on top of the EC2 instances to simplify the management of distributed training jobs.

About

A walkthrough on multi-node training a scalar RM in AWS with Axolotl
