TinyTransformer

AI Workshop: ROCm Tools for PyTorch AI Workload Profiling

README.md from HPCTrainingExamples/MLExamples/TinyTransformer in the Training Examples repository

Workshop Overview

This hands-on workshop provides a comprehensive guide to profiling AI workloads using AMD ROCm tools and PyTorch. Through progressive optimization of a Tiny LLaMA transformer implementation, participants will master the complete profiling ecosystem from framework-level tools to hardware-specific profilers.

Learning Objectives

By the end of this workshop, participants will be able to:

Configure deterministic execution environments for reproducible profiling
Use PyTorch native profiling tools for performance characterization
Integrate DeepSpeed FLOPS profiler for computational intensity analysis
Apply ROCm profiling tools (rocprofv3, rocprof-sys, rocprof-compute) for kernel-level optimization
Implement progressive optimization techniques from kernel fusion to custom GPU programming
Perform roofline analysis and bottleneck identification for production AI workloads

Workshop Structure

This workshop follows a progressive optimization methodology with four implementation versions, each building upon the previous with enhanced profiling capabilities and performance improvements.

Version Progression

Small Configuration (Quick Start)

Config: Hidden=512, Layers=8, SeqLen=128, Batch=8

Version	Speed (samples/sec)	Batch Time (ms)	Forward (ms)	Backward (ms)	Memory (MB)	Speedup
V1 Baseline	372.9	21.7	10.8	9.2	522.3	1.0x
V3 Triton	2,065.0	3.9	3.2	0.3	281.8	5.5x

Medium Configuration (Production Scale)

Config: Hidden=1024, Layers=12, SeqLen=512, Batch=16

Version	Throughput (tok/s)	Batch (ms)	Forward (ms)	Backward (ms)	Optimizer (ms)	Memory (MB)	Speedup
V1 Baseline	50,017	163.8	50.3	107.4	6.1	2,358.7	1.0x
V2 Fused	60,192	136.1	44.8	85.6	5.8	2,358.9	1.20x
V3 Triton	156,652	52.3	51.3	0.6	0.4	916.2	3.13x
V4 Ultra	157,169	52.1	51.1	0.6	0.4	916.5	3.14x

See PERFORMANCE_RESULTS.md for complete analysis

Profiling Tools Progression

Each version introduces additional profiling capabilities:

PyTorch Profiler: Framework-level performance analysis
DeepSpeed FLOPS Profiler: Computational efficiency metrics
rocprofv3: GPU hotspots, device activity tracing and hardware counter collection
rocprof-sys: System-level performance monitoring
rocprof-compute: Advanced kernel-level analysis and optimization

Prerequisites

Hardware Requirements

AMD GPU with ROCm support (MI100, MI200, MI300 series, or RX 6000/7000 series)
Minimum 16GB system memory
ROCm 6.0+ installed and configured

Software Requirements

Python 3.10+
PyTorch with ROCm support
ROCm profiling tools suite
DeepSpeed (for FLOPS profiler)
Triton (for advanced versions)

Quick Start

0. Set up environment

On the training cluster's compute node, the required environment may be set up using the following commands:

module load rocm pytorch openmpi rocprofiler-compute rocprofiler-systems/develop

1. Verify Environment

# Check ROCm installation
rocminfo

# Verify GPU is detected
rocm-smi

# Check PyTorch + ROCm
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"

2. Run Version 1 (Baseline) - 5 minutes

cd version1_pytorch_baseline/
python tiny_llama_v1.py --batch-size 8 --seq-len 128 --num-steps 20

# Expected output:
# Loss: ~7.0
# Speed: ~373 samples/sec
# Memory: ~522 MB

For a deeper analysis with the PyTorch profiler, and visualizing the output in TensorBoard, please follow the workshop exercises in version1_pytorch_baseline/README.md.

3. Run Version 2 (Fused) - 5 minutes

cd version2_pytorch_fused
python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30

# Expected output:
# Loss: 6.9310 
# Speed: 187.6 samples/sec (2x faster)
# Memory: 370.4 MB

To compare the baseline version1 to the fused version2 performance, follow instructions in version2_pytorch_fused/README.md.

Try profiling this workload with ROCm profilers using commands listed in version2_pytorch_fused/README.md. An example of using rocprofv3 on this example is provided below:

rocprofv3 --kernel-trace --stats --truncate-kernels -- python tiny_llama_v2.py --batch-size 8 --seq-len 128 --num-steps 30

The above command produces a hotspot list of GPU kernels. The --truncate-kernels option helps remove arguments from the kernel name for better readability.

3. Run Version 3 (Optimized) - 5 minutes

cd version3_triton/
python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 20

# Expected output:
# Loss: ~7.0 (same correctness!)
# Speed: ~2065 samples/sec (5.5x faster!)
# Memory: ~282 MB (46% less!)

An exercise similar to the one you did for version2 is recommended for this version as well using ROCm profiling tools. As an example, you can collect a comprehensive timeline trace with host and device activity with rocprof-sys using the command below:

rocprof-sys-run --profile --trace -- python tiny_llama_v3.py --batch-size 8 --seq-len 128 --num-steps 30

View the trace at https://ui.perfetto.dev.

4. Run Version 4 (Ultra optimized) - 5 minutes

cd version4_pytorch_sdpa/
python3 tiny_llama_v4.py

Directory Structure

ai-workshop-training/
 README.md                              # This overview
 setup/                                 # Environment and prerequisites
    environment_setup.md               # Detailed setup instructions
    environment_setup.sh               # Automated setup script
    requirements.txt                   # Python dependencies
    validation_scripts/                # Environment validation
        test_environment.py            # Comprehensive environment test
        test_rocm_installation.py      # ROCm stack validation
        test_profiling_tools.py        # Profiling tools validation
 version1_pytorch_baseline/             # Standard PyTorch implementation
    README.md                          # Detailed guided instructions
    tiny_llama_v1.py                   # Enhanced baseline implementation
    run_pytorch_profiler.py            # PyTorch profiler integration
    run_deepspeed_flops.py            # DeepSpeed FLOPS profiler
    run_all_profilers.sh              # Orchestrated profiling script
    exercises/                         # Hands-on exercises and analysis
        exercise_1_baseline_analysis.md
        exercise_2_memory_analysis.md
        exercise_3_bottleneck_identification.md
 version2_pytorch_fused/                # Fused operations optimization
    README.md                          # Fusion optimization guide
    tiny_llama_v2.py                   # Fused implementation
    run_pytorch_profiler.py            # Enhanced PyTorch profiling
    run_deepspeed_flops.py            # FLOPS analysis
    run_rocprofv3.sh                   # rocprofv3 integration
    run_rocprof_sys.sh                # System profiling
    run_rocprof_compute.sh             # Kernel-level profiling
    run_all_profilers.sh              # Complete profiling suite
    exercises/                         # Advanced profiling exercises
        exercise_1_fusion_analysis.md
        exercise_2_flash_attention.md
        exercise_3_rocm_tools_intro.md
 version3_triton/                       # Triton kernel integration
    README.md                          # Triton optimization guide
    tiny_llama_v3.py                   # Triton-enhanced implementation
    triton_kernels.py                  # Custom Triton kernels
    run_pytorch_profiler.py            # Framework profiling
    run_deepspeed_flops.py            # Computational analysis
    run_rocprofv3.sh                   # Legacy profiling
    run_rocprof_sys.sh                # System monitoring
    run_rocprof_compute.sh             # Advanced kernel analysis
    run_all_profilers.sh              # Complete profiling
    exercises/                         # Triton development exercises
        exercise_1_triton_basics.md
        exercise_2_custom_kernels.md
        exercise_3_performance_tuning.md
 version4_pytorch_sdpa/                 # Ultra-fused implementation
    README.md                          # Ultra-optimization guide
    tiny_llama_v4.py                   # Ultra-fused implementation
    triton_ultra_kernels.py            # Ultra-fused kernels
    [profiling scripts]                # Complete profiling suite
    exercises/                         # Advanced optimization
        exercise_1_ultra_fusion.md
        exercise_2_register_optimization.md
        exercise_3_production_deployment.md
 analysis_tools/                        # Performance analysis utilities
    compare_versions.py                # Cross-version performance comparison
    roofline_analysis.py               # Roofline model implementation
    performance_dashboard.py           # Interactive performance dashboard
    regression_tester.py               # Automated regression testing
    report_generator.py                # Comprehensive report generation
 slides/                                # Presentation materials
     luka_presentation_materials/        # AI workshop slides
         workshop_overview.pptx
         profiling_methodology.pptx
         optimization_techniques.pptx
         results_analysis.pptx

Workshop Execution Timeline

Session 1: Foundation (45 minutes)

Environment setup and validation
Version 1 baseline profiling
PyTorch profiler introduction
Performance characterization methodology

Session 2: Optimization (60 minutes)

Version 2 kernel fusion techniques
ROCm tools introduction
Memory optimization analysis
Comparative performance analysis

Session 3: Advanced Techniques (60 minutes)

Version 3 Triton kernel development
Custom GPU programming
Advanced profiling techniques
Production optimization strategies

Session 4: Mastery (45 minutes)

Version 4 ultra-fusion implementation
Complete profiling suite utilization
Roofline analysis and bottleneck resolution
Workshop wrap-up and next steps

Key Performance Insights

Actual Performance Results (AMD MI325X, ROCm 6.4.4, PyTorch 2.7.1)

Test Configuration: Batch=8, SeqLen=128, Hidden=512, Layers=8, Heads=8

Metric	V1 Baseline	V3 Optimized	Improvement
Training Speed	372.9 samples/sec	2065.0 samples/sec	5.5x faster
Batch Time	21.7 ms	3.9 ms	5.6x faster
Forward Pass	10.8 ms	3.2 ms	3.4x faster
Memory Usage	522.3 MB	281.8 MB	46% reduction
Throughput	47,735 tokens/sec	264,320 tokens/sec	5.5x faster

Key Optimization Techniques Applied

Flash Attention (Memory-Efficient Attention)
- V3: Custom Triton Flash Attention kernel
- V4: PyTorch SDPA (hardware-accelerated)
- Both achieve ~3.1x speedup through memory-efficient attention
- Result: 46% memory reduction, 61% less memory bandwidth
Tensor Contiguity (.contiguous() after GQA operations)
- Ensures optimal memory layout for Triton kernels
- Fixes stride-related performance issues
- Result: 20x speedup over non-contiguous version
Hybrid Kernel Strategy
- Use Triton for: RMSNorm, Flash Attention (memory-bound ops)
- Use PyTorch/rocBLAS for: Matrix multiplies (compute-bound ops)
- Don't write custom Triton kernels for matmuls - rocBLAS is already optimal
- Result: 3.1x overall speedup
Proper Weight Initialization (std=0.02)
- Critical for correct logits scale
- Prevents exploding/vanishing gradients
- Result: Loss goes from 942 → 7.0

V3 vs V4: Two Paths to the Same Performance

V3 (Triton Custom Kernels): Custom Triton RMSNorm + Triton Flash Attention
V4 (PyTorch Optimized): PyTorch ops + PyTorch SDPA
Both achieve 3.1x speedup - demonstrates that highly-optimized PyTorch operations can match custom kernels

Profiling Tool Capabilities

PyTorch Profiler: Framework overhead, operator timing, memory tracking
rocprofv3: Kernel execution stats, device activity and runtime API timeline tracing, hardware counter collection
Manual Timing: CUDA synchronization for accurate GPU timing

Contributing

This workshop is designed for continuous improvement. Contributions are welcome:

Additional optimization techniques
Enhanced profiling methodologies
Extended GPU architecture support
Advanced analysis tools

Support and Resources

Workshop Issues: Submit GitHub issues for technical problems
AMD ROCm Documentation: ROCm Developer Portal
rocprofv3 tool usage: Using rocprofv3
rocprof-sys Guide: rocprof-sys documentation
rocprof-compute Guide: rocprof-compute Documentation
PyTorch ROCm Support: PyTorch ROCm Installation

Authors and Acknowledgments

Developed for the CASTIEL AI Workshop (October 16, 2024) by HPC/AI performance engineers with extensive experience optimizing production ML workloads on AMD GPU infrastructure.

License

MIT License - See LICENSE file for details

Ready to start profiling? Begin with the Environment Setup Guide

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

AI Workshop: ROCm Tools for PyTorch AI Workload Profiling

Workshop Overview

Learning Objectives

Workshop Structure

Version Progression

Small Configuration (Quick Start)

Medium Configuration (Production Scale)

Profiling Tools Progression

Prerequisites

Hardware Requirements

Software Requirements

Quick Start

0. Set up environment

1. Verify Environment

2. Run Version 1 (Baseline) - 5 minutes

3. Run Version 2 (Fused) - 5 minutes

3. Run Version 3 (Optimized) - 5 minutes

4. Run Version 4 (Ultra optimized) - 5 minutes

Directory Structure

Workshop Execution Timeline

Session 1: Foundation (45 minutes)

Session 2: Optimization (60 minutes)

Session 3: Advanced Techniques (60 minutes)

Session 4: Mastery (45 minutes)

Key Performance Insights

Actual Performance Results (AMD MI325X, ROCm 6.4.4, PyTorch 2.7.1)

Key Optimization Techniques Applied

V3 vs V4: Two Paths to the Same Performance

Profiling Tool Capabilities

Contributing

Support and Resources

Authors and Acknowledgments

License

Name		Name	Last commit message	Last commit date
parent directory ..
version1_pytorch_baseline		version1_pytorch_baseline
version2_pytorch_fused		version2_pytorch_fused
version3_triton		version3_triton
version4_pytorch_sdpa		version4_pytorch_sdpa
README.md		README.md
TECHNICAL_APPENDICES.md		TECHNICAL_APPENDICES.md
TINY_LLAMA_ARCHITECTURE.md		TINY_LLAMA_ARCHITECTURE.md

FilesExpand file tree

TinyTransformer

Directory actions

More options

Directory actions

More options

Latest commit

History

TinyTransformer

Folders and files

parent directory

README.md

AI Workshop: ROCm Tools for PyTorch AI Workload Profiling

Workshop Overview

Learning Objectives

Workshop Structure

Version Progression

Small Configuration (Quick Start)

Medium Configuration (Production Scale)

Profiling Tools Progression

Prerequisites

Hardware Requirements

Software Requirements

Quick Start

0. Set up environment

1. Verify Environment

2. Run Version 1 (Baseline) - 5 minutes

3. Run Version 2 (Fused) - 5 minutes

3. Run Version 3 (Optimized) - 5 minutes

4. Run Version 4 (Ultra optimized) - 5 minutes

Directory Structure

Workshop Execution Timeline

Session 1: Foundation (45 minutes)

Session 2: Optimization (60 minutes)

Session 3: Advanced Techniques (60 minutes)

Session 4: Mastery (45 minutes)

Key Performance Insights

Actual Performance Results (AMD MI325X, ROCm 6.4.4, PyTorch 2.7.1)

Key Optimization Techniques Applied

V3 vs V4: Two Paths to the Same Performance

Profiling Tool Capabilities

Contributing

Support and Resources

Authors and Acknowledgments

License