STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems

This repository provides a complete workflow to reproduce the key results of STARC, including:

  1. The implementation of STARC’s selective token access with KV remapping and online clustering.
  2. Evaluation scripts to reproduce:
    • accuracy results on LongBench and RULER, and
    • perplexity results on PG-19.
  3. The simulator setup to reproduce system-level performance/energy results on GPU–PIM platforms based on the AttAcc simulator (Ramulator-based).

What’s in this artifact

  • Algorithm: The STARC algorithm, which enables efficient long-context LLM inference by selectively accessing and remapping KV cache entries via online clustering under a fixed KV-cache budget.
  • Program: The STARC artifact running public long-context benchmarks: LongBench (16 datasets) and RULER (13 datasets).
  • Models: LongChat-7B v1.5-32K; LLaMA-3.1-8B-Instruct; Mistral-7B-Instruct-v0.3 (publicly available via Hugging Face).
  • Datasets: LongBench (16 datasets; e.g., HotpotQA, QASPER, GovReport); PG-19; RULER (13 datasets; e.g., NIAH Single, Multi-key NIAH, Multi-value NIAH). All are publicly available on Hugging Face.
  • Metrics: LongBench task scores; PG-19 perplexity; RULER task scores; and system metrics such as latency and energy.
  • Outputs: LongBench/RULER scores, PG-19 perplexity traces, and system-level performance/energy metrics with breakdowns.
  • Availability: Publicly available.
  • License: MIT license.

Requirements

Hardware

  • LLM accuracy evaluation (LongBench / PG-19 / RULER): Compatible with commonly used NVIDIA GPUs. We recommend NVIDIA H100 or L40 with sufficient GPU memory (e.g., at least 48 GB per GPU).
  • System-level simulation: CPU-only execution is sufficient. Experiments in the paper were conducted on a dual-socket AMD EPYC 9334 system with 64 CPU cores in total (2×32 cores).

Software

  • Python: 3.10
  • CUDA: 12.8
  • Python dependencies: see pyproject.toml

Resource estimate

  • Disk space: ~80 GB total
  • Setup time: ~20 minutes
  • Experiment time:
    • Model accuracy experiments: ~12 hours (excluding additional appendix results)
    • System-level performance experiments: ~24 hours

Getting started

1) Get the code

git clone --recurse-submodules https://github.com/EPIC-RPI/STARC
cd STARC

2) Create the environment

To reproduce the results and avoid dependency conflicts, we recommend Python 3.10 and CUDA 12.8.

Run the following commands to create the conda environment and install the STARC packages:

conda create -yn STARC python=3.10
conda activate STARC
pip install ninja==1.11.1.1 packaging
pip install -e .
pip install flash-attn==2.3.0 --no-build-isolation
conda install -c conda-forge cupy
conda install numpy scikit-learn
conda install cmake
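
After installation, a quick check that the environment matches the recommendation can save a failed run later. The sketch below is not part of the artifact: the import names in REQUIRED_MODULES are assumptions inferred from the install commands above, and pyproject.toml remains the authoritative list of dependencies.

```python
import importlib.util
import sys

# Import names are assumptions inferred from the install commands above;
# adjust the list to match pyproject.toml.
REQUIRED_MODULES = ["flash_attn", "cupy", "numpy", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment(names=REQUIRED_MODULES, expected_python=(3, 10)):
    """Collect human-readable problems without importing heavy packages."""
    problems = []
    if sys.version_info[:2] != expected_python:
        problems.append(
            "Python %d.%d found, %d.%d recommended"
            % (sys.version_info[0], sys.version_info[1], *expected_python)
        )
    problems.extend("module not found: " + name for name in missing_modules(names))
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The script prints nothing when the Python version matches and every listed module can be located. It only checks that modules are discoverable; it does not verify CUDA or GPU availability.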

3) Set up the PIM system simulator

This artifact builds on the AttAcc simulator:

cd simulator_starc
git submodule update --init --recursive

4) Build Ramulator2

bash set_pim_ramulator.sh
cd ramulator2
mkdir -p build
cd build
cmake .. -DCMAKE_POLICY_VERSION_MINIMUM=3.5
make -j
cp ramulator2 ../ramulator2
cd ../../

Reproducing paper results

This section describes how to reproduce the key results reported in the paper.

E1: LongBench accuracy

To reproduce the LongBench accuracy results:

cd <Your Path>/STARC/scripts/
sh longbench.sh

If you want to evaluate more models, the corresponding model paths are defined in:

STARC/evaluation/LongBench/config/model2path.json

To evaluate STARC with a different model from the paper, replace the model name in longbench.sh.

E2: PG-19 perplexity

To reproduce the perplexity results on PG-19:

cd <Your Path>/STARC/scripts/
sh ppl_eval.sh
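
For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows how a running perplexity trace can be computed from such values; it illustrates the standard definition only, and whether ppl_eval.sh computes its trace in exactly this way is an assumption. The input values are made up for illustration.

```python
import math


def running_perplexity(token_nlls):
    """Yield perplexity after each token: exp(mean negative log-likelihood so far)."""
    total = 0.0
    for count, nll in enumerate(token_nlls, start=1):
        total += nll
        yield math.exp(total / count)


# Hypothetical per-token negative log-likelihoods (natural log), not real results.
example_nlls = [2.0, 1.0, 3.0]
trace = list(running_perplexity(example_nlls))
```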

E3: RULER (32K context)

The RULER test data for LLaMA-3.1-8B-Instruct are already included in the STARC/ruler directory.

To reproduce the RULER results under a 32K context length:

cd <Your Path>/STARC/scripts/
sh RULER.sh

E4: GPU–PIM system simulation

The system-level simulation experiments are conducted using the AttAcc-based simulator.

Full attention

To reproduce the results for full attention:

python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
  --lin 2048 --lout 32000 --batch 16 --pim bank \
  --powerlimit --ffopt --pipeopt

Sparse attention configurations

To reproduce the results for configurations with sparse attention methods:

python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
  --lin 2048 --lout 32000 --batch 16 --pim bank \
  --powerlimit --ffopt --pipeopt \
  --sparsity --kv_budget_table kv_budget_Mistral_STARC.txt

Each sparse attention method and model uses its own .txt file, passed via the --kv_budget_table option. These files are derived from the attention masks that each method produces at every decoding step in real inference tasks (e.g., LongBench). The masks are mapped to the row-level granularity of the PIM architecture, where each DRAM row activation fetches 16 key/value vectors in parallel. Each file therefore specifies how many memory rows are activated at each decoding step, and the simulator uses these counts to model memory accesses.
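
To make the row-level mapping concrete, the sketch below counts how many DRAM rows a set of selected tokens would activate. It is an illustration only: it assumes token i is stored in row i // 16 (a simple contiguous layout), which may not match the artifact's actual placement, and it does not reproduce the format of the kv_budget_table files. The example token positions are hypothetical.

```python
VECTORS_PER_ROW = 16  # from the text: one DRAM row activation fetches 16 K/V vectors


def rows_activated(selected_tokens, vectors_per_row=VECTORS_PER_ROW):
    """Count distinct DRAM rows touched by the selected token positions.

    Assumes token i is stored in row i // vectors_per_row (contiguous layout);
    the artifact's real mapping may differ, e.g. after STARC's remapping.
    """
    return len({token // vectors_per_row for token in selected_tokens})


# Hypothetical masks: the same number of tokens, scattered vs. clustered.
scattered = [0, 17, 34, 51]      # one token in each of four rows
clustered = [32, 33, 34, 35]     # four tokens sharing a single row
print(rows_activated(scattered), rows_activated(clustered))
```

Under this assumed layout, the scattered selection touches four rows while the clustered one touches a single row, even though both select four tokens. A per-step count of this kind is what the budget tables are described as recording.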

Note 1: Please always start simulations from the maximum context length (e.g., 32K). For all methods except STARC, the simulator cache (ramulator.out) generated for the longest context length can be reused for shorter context lengths. In this case, the simulator will directly read the cached results without rerunning the simulation.

Note 2: When switching the evaluated method (Full attention / STARC / SparQ / Quest), please delete the previously generated ramulator.out; otherwise, cached results from the last run may be reused.
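
A small helper can make the cleanup in Note 2 harder to forget when scripting several runs. This is a sketch, not part of the artifact: the default path is an assumption, so pass the actual location of ramulator.out in your checkout.

```python
import shutil
from pathlib import Path


def clear_simulator_cache(cache_path="ramulator.out"):
    """Remove the cached simulator output; return True if something was deleted.

    The default path is an assumption: pass the actual location of
    ramulator.out in your checkout.
    """
    path = Path(cache_path)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    if path.exists():
        path.unlink()
        return True
    return False
```

Calling clear_simulator_cache() before each change of method forces the next run to simulate from scratch rather than read stale cached results.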

Note 3: When reproducing methods other than STARC, comment out the following two lines in STARC/simulator_starc/src/ramulator_wrapper.py to avoid introducing clustering overhead:

if l == l_target - 1:
    trace_args += " --add_cluster"

Note 4: The simulator does not yet report the clustering overhead of STARC separately. To isolate it, follow the steps below:

  1. Keep the code in Note 3 enabled:
    if l == l_target - 1:
        trace_args += " --add_cluster"
    and run the corresponding command to obtain the simulation result.
  2. Delete the generated ramulator.out.
  3. Comment out the above code and rerun the same command.
  4. The difference between the two results corresponds to the clustering overhead.

We apologize for this inconvenience. We will optimize the code in future updates to directly report the clustering overhead.
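
The subtraction in step 4 can be scripted. The sketch below assumes that output.csv is comma-separated with a header row, that the run of interest is the last row, and that each run's output.csv has been saved under its own name; the file names shown are hypothetical.

```python
import csv

# Columns named in this README; the file names used below are hypothetical.
METRICS = ["g_time (ms)", "g_energy (nJ)"]


def load_last_row(csv_path):
    """Return the last data row of a comma-separated file as a dict."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("no data rows in " + str(csv_path))
    return rows[-1]


def clustering_overhead(with_cluster, without_cluster, metrics=METRICS):
    """Subtract the run without --add_cluster from the run with it."""
    return {
        name: float(with_cluster[name]) - float(without_cluster[name])
        for name in metrics
    }


# Usage, assuming each run's output.csv was saved under a separate name:
# overhead = clustering_overhead(load_last_row("with_cluster.csv"),
#                                load_last_row("without_cluster.csv"))
```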


Outputs

  • Model accuracy experiments:
    For LongBench, the evaluation generates one .jsonl file per model and task, containing the ground-truth answers and the model predictions. The final results are summarized in result.json.
    For RULER, evaluation results are printed directly to the terminal.

  • PG-19 perplexity:
    A log_PG19.txt file is generated to record the evolution of perplexity during evaluation.

  • Simulation experiments:
    The simulator produces an output.csv file that records the breakdown of end-to-end latency and energy consumption. An example format is shown below.

model dtype xpu cap bw sys_opb hw cores pipe_level is_parallel power_constraint gqa_size Lin Lout bs required_cap s_flops g_flops s_time s_matmul s_fc s_comm s_softmax s_act s_lnorm g_time (ms) g_matmul g_fc g_comm g_etc g_qkv_time g_prj_time g_ff_time g2g_comm c2g_comm g_softmax g_act g_lnorm g_energy (nJ) g_dram_energy g_l2_energy g_l1_energy g_reg_energy g_alu_energy g_fc_mem_energy g_fc_comp_energy g_attn_mem_energy g_attn_comp_energy g_etc_mem_energy g_etc_comp_energy g_comm_energy
Mistral-7B W16A16 GPU 1280 9 295.1670644 BA 8 TRUE TRUE TRUE 0 2048 16000 16 1.63247E+11 26440397457 2.74317E+11 137.0683392 27.98165465 62.93091672 11.61364915 24.87258191 2.506916933 7.162619808 0.916930348 0.075737878 0.404029795 0.346612651 0.090550025 0.137451554 0.030404956 0.236173285 0.344865024 0.001747627 0 0.033255129 0.057294896 551880956.3 412383134.3 51675503 30652644.72 19149650.66 37147608.35 344396314.7 130849869.7 61361497.02 6874799.8 6625322.598 900737.2698 872415.232

Since we focus only on the decoding stage, the following columns are used for analysis:

  • Latency breakdown:
    g_matmul, g_fc, g_comm, g_etc, g_time (ms)

  • Energy breakdown:
    g_fc_mem_energy, g_fc_comp_energy,
    g_attn_mem_energy, g_attn_comp_energy,
    g_etc_mem_energy, g_etc_comp_energy,
    g_comm_energy, g_energy (nJ)

Energy consumption is further decomposed into compute (comp) and memory (mem) components.
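
As a sanity check on how these columns relate, the sketch below uses the values from the example row above. In that row, the four latency components add up to g_time (ms) and the seven energy components add up to g_energy (nJ), within rounding. Whether this holds for every configuration is an assumption worth re-checking on your own output.

```python
import math

# Values copied from the example row above.
latency = {"g_matmul": 0.075737878, "g_fc": 0.404029795,
           "g_comm": 0.346612651, "g_etc": 0.090550025}
g_time_ms = 0.916930348

energy = {"g_fc_mem_energy": 344396314.7, "g_fc_comp_energy": 130849869.7,
          "g_attn_mem_energy": 61361497.02, "g_attn_comp_energy": 6874799.8,
          "g_etc_mem_energy": 6625322.598, "g_etc_comp_energy": 900737.2698,
          "g_comm_energy": 872415.232}
g_energy_nj = 551880956.3


def shares(parts):
    """Return each component as a fraction of the components' sum."""
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


# In this row the components add up to the reported totals (within rounding).
assert math.isclose(sum(latency.values()), g_time_ms, rel_tol=1e-6)
assert math.isclose(sum(energy.values()), g_energy_nj, rel_tol=1e-6)

for name, fraction in shares(latency).items():
    print(f"{name}: {fraction:.1%}")
```

In this example row, g_fc is the largest latency component at about 44% of g_time (ms).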


License

This project is released under the MIT License.


Acknowledgements

This repository incorporates and builds upon open-source implementations from the following projects. We thank the authors for making their code publicly available.
