STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems
This repository provides a complete workflow to reproduce the key results of STARC, including:
- The implementation of STARC’s selective token access with KV remapping and online clustering.
- Evaluation scripts to reproduce:
  - accuracy results on LongBench and RULER, and
  - perplexity results on PG-19.
- The simulator setup to reproduce system-level performance/energy results on GPU–PIM platforms based on the AttAcc simulator (Ramulator-based).
- Algorithm: The STARC algorithm, which enables efficient long-context LLM inference by selectively accessing and remapping KV cache entries via online clustering under a fixed KV-cache budget.
- Program: The STARC artifact running public long-context benchmarks: LongBench (16 datasets) and RULER (13 datasets).
- Models: LongChat-7B v1.5-32K; LLaMA-3.1-8B-Instruct; Mistral-7B-Instruct-v0.3 (publicly available via Hugging Face).
- Datasets: LongBench (16 datasets, e.g., HotpotQA, QASPER, GovReport); PG-19; RULER (13 datasets, e.g., NIAH Single, Multi-key NIAH, Multi-value NIAH). All are publicly available on Hugging Face.
- Metrics: LongBench task scores; PG-19 perplexity; RULER task scores; and system metrics such as latency and energy.
- Outputs: LongBench/RULER scores, PG-19 perplexity traces, and system-level performance/energy metrics with breakdowns.
- Availability: Publicly available.
- License: MIT license.
- LLM accuracy evaluation (LongBench / PG-19 / RULER): Compatible with commonly used NVIDIA GPUs. We recommend NVIDIA H100 or L40 with sufficient GPU memory (e.g., at least 48 GB per GPU).
- System-level simulation: CPU-only execution is sufficient. Experiments in the paper were conducted on a dual-socket AMD EPYC 9334 system with 64 CPU cores in total (2×32 cores).
- Python: 3.10
- CUDA: 12.8
- Python dependencies: see pyproject.toml
- Disk space: ~80 GB total
- Setup time: ~20 minutes
- Experiment time:
  - Model accuracy experiments: ~12 hours (excluding additional appendix results)
  - System-level performance experiments: ~24 hours
git clone --recurse-submodules https://github.com/EPIC-RPI/STARC
cd STARC

To better reproduce the results and avoid potential conflicts, we recommend using Python 3.10 and CUDA 12.8.
We provide scripts for the recommended environment setup. Please follow the instructions below to create the conda environment and install the STARC packages:
conda create -yn STARC python=3.10
conda activate STARC
pip install ninja==1.11.1.1 packaging
pip install -e .
pip install flash-attn==2.3.0 --no-build-isolation
conda install -c conda-forge cupy
conda install numpy scikit-learn
conda install cmake

This artifact builds on the AttAcc simulator:
cd simulator_starc
git submodule update --init --recursive
bash set_pim_ramulator.sh
cd ramulator2
mkdir build
cd build
cmake .. -DCMAKE_POLICY_VERSION_MINIMUM=3.5
make -j
cp ramulator2 ../ramulator2
cd ../../

This section describes how to reproduce the key results reported in the paper.
To reproduce the LongBench accuracy results:
cd <Your Path>/STARC/scripts/
sh longbench.sh

If you want to evaluate more models, the corresponding model paths are defined in:
STARC/evaluation/LongBench/config/model2path.json
By replacing the model name in longbench.sh, you can evaluate STARC under different models reported in the paper.
To reproduce the perplexity results on PG-19:
cd <Your Path>/STARC/scripts/
sh ppl_eval.sh

The RULER test data needed to reproduce the LLaMA-3.1-8B-Instruct results are already included in the STARC/ruler directory.
To reproduce the RULER results under a 32K context length:
cd <Your Path>/STARC/scripts/
sh RULER.sh

The system-level simulation experiments are conducted using the AttAcc-based simulator.
To reproduce the results for full attention:
python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
--lin 2048 --lout 32000 --batch 16 --pim bank \
--powerlimit --ffopt --pipeopt

To reproduce the results for configurations with sparse attention methods:
python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
--lin 2048 --lout 32000 --batch 16 --pim bank \
--powerlimit --ffopt --pipeopt \
--sparsity --kv_budget_table kv_budget_Mistral_STARC.txt

Different sparse attention methods and models use different .txt files, selected with the --kv_budget_table option. Each file is derived from the attention masks that the method produces at each decoding step in real inference tasks (e.g., LongBench), mapped to the row-level granularity of the PIM architecture, where each DRAM row activation fetches 16 key/value vectors in parallel. The file specifies how many memory rows are activated at each decoding step, and the simulator uses these counts to drive the simulation.
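The row-level mapping described above can be illustrated with a short sketch. It assumes that token `i` is stored in row `i // 16`. That layout and the function name `rows_activated` are assumptions made for illustration. They are not the artifact's actual conversion script, and the format of the .txt files is not shown here.

```python
from typing import Iterable

# From the text: one DRAM row activation fetches 16 key/value vectors.
VECTORS_PER_ROW = 16


def rows_activated(selected_tokens: Iterable[int],
                   vectors_per_row: int = VECTORS_PER_ROW) -> int:
    """Count the distinct rows touched by the selected token indices.

    Assumes token i is stored in row i // vectors_per_row (hypothetical layout).
    """
    return len({t // vectors_per_row for t in selected_tokens})


# 32 selected tokens, one in each of 32 different rows.
scattered = range(0, 32 * VECTORS_PER_ROW, VECTORS_PER_ROW)
# The same number of selected tokens stored contiguously.
contiguous = range(0, 32)

print(rows_activated(scattered), rows_activated(contiguous))
```

Under this assumed layout, the same token budget can cost very different numbers of row activations depending on where the selected tokens sit, which is why the budget is expressed in rows rather than tokens.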
Note 1: Please always start simulations from the maximum context length (e.g., 32K). For all methods except STARC, the simulator cache (ramulator.out) generated for the longest context length can be reused for shorter context lengths. In this case, the simulator will directly read the cached results without rerunning the simulation.
Note 2: When switching the evaluated method (Full attention / STARC / SparQ / Quest), please delete the previously generated ramulator.out; otherwise, cached results from the last run may be reused.
Note 3: When reproducing methods other than STARC, comment out the following two lines in STARC/simulator_starc/src/ramulator_wrapper.py to avoid introducing clustering overhead:

    if l == l_target - 1:
        trace_args += " --add_cluster"
Note 4: When reproducing STARC results, the simulator does not yet report the clustering overhead separately. To isolate the clustering overhead, follow the steps below:
- Keep the code in Note 3 enabled:

      if l == l_target - 1:
          trace_args += " --add_cluster"

  and run the corresponding command to obtain the simulation result.
- Delete the generated ramulator.out.
- Comment out the above code and rerun the same command.
- The difference between the two results corresponds to the clustering overhead.
We apologize for this inconvenience. We will optimize the code in future updates to directly report the clustering overhead.
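The subtraction in the last step of Note 4 can be sketched as follows. This is a minimal sketch, not part of the artifact. It assumes you saved a copy of the simulator's CSV output after each of the two runs and that the file is comma-separated with a header row. The helper names and the numbers are made up for illustration.

```python
import csv
import io

# Decoding-stage totals named in this README.
COLUMNS = ("g_time (ms)", "g_energy (nJ)")


def read_totals(csv_file) -> dict:
    """Return the decoding latency and energy totals from the first data row."""
    row = next(csv.DictReader(csv_file))
    return {name: float(row[name]) for name in COLUMNS}


def clustering_overhead(with_cluster: dict, without_cluster: dict) -> dict:
    """Subtract the run without --add_cluster from the run with it."""
    return {name: with_cluster[name] - without_cluster[name] for name in COLUMNS}


# Made-up values standing in for the two saved CSV files.
run_with = io.StringIO("g_time (ms),g_energy (nJ)\n1.25,600.0\n")
run_without = io.StringIO("g_time (ms),g_energy (nJ)\n1.0,550.0\n")

overhead = clustering_overhead(read_totals(run_with), read_totals(run_without))
print(overhead)
```

To use it on real results, replace the two `io.StringIO` objects with `open(...)` calls on the saved files.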
- Model accuracy experiments:
  For LongBench, the evaluation generates a corresponding .jsonl file for each model and each task. These files contain the ground-truth answers and model predictions. The final results are summarized in result.json.
  For RULER, evaluation results are printed directly to the terminal.
- PG-19 perplexity:
  A log_PG19.txt file is generated to record the evolution of perplexity during evaluation.
- Simulation experiments:
  The simulator produces an output.csv file that records the breakdown of end-to-end latency and energy consumption. An example format is shown below.
| model | dtype | xpu | cap | bw | sys_opb | hw | cores | pipe_level | is_parallel | power_constraint | gqa_size | Lin | Lout | bs | required_cap | s_flops | g_flops | s_time | s_matmul | s_fc | s_comm | s_softmax | s_act | s_lnorm | g_time (ms) | g_matmul | g_fc | g_comm | g_etc | g_qkv_time | g_prj_time | g_ff_time | g2g_comm | c2g_comm | g_softmax | g_act | g_lnorm | g_energy (nJ) | g_dram_energy | g_l2_energy | g_l1_energy | g_reg_energy | g_alu_energy | g_fc_mem_energy | g_fc_comp_energy | g_attn_mem_energy | g_attn_comp_energy | g_etc_mem_energy | g_etc_comp_energy | g_comm_energy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B | W16A16 | GPU | 1280 | 9 | 295.1670644 | BA | 8 | TRUE | TRUE | TRUE | 0 | 2048 | 16000 | 16 | 1.63247E+11 | 26440397457 | 2.74317E+11 | 137.0683392 | 27.98165465 | 62.93091672 | 11.61364915 | 24.87258191 | 2.506916933 | 7.162619808 | 0.916930348 | 0.075737878 | 0.404029795 | 0.346612651 | 0.090550025 | 0.137451554 | 0.030404956 | 0.236173285 | 0.344865024 | 0.001747627 | 0 | 0.033255129 | 0.057294896 | 551880956.3 | 412383134.3 | 51675503 | 30652644.72 | 19149650.66 | 37147608.35 | 344396314.7 | 130849869.7 | 61361497.02 | 6874799.8 | 6625322.598 | 900737.2698 | 872415.232 |
Since we focus only on the decoding stage, the following columns are used for analysis:
- Latency breakdown: g_matmul, g_fc, g_comm, g_etc, g_time (ms)
- Energy breakdown: g_fc_mem_energy, g_fc_comp_energy, g_attn_mem_energy, g_attn_comp_energy, g_etc_mem_energy, g_etc_comp_energy, g_comm_energy, g_energy (nJ)
Energy consumption is further decomposed into compute (comp) and memory (mem) components.
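As an illustration of how these columns relate, the sketch below computes each component's share of the decoding latency and energy, using the values from the example row above. In that row, the four latency components add up to g_time (ms) and the seven energy components add up to g_energy (nJ), to within rounding. The helper name `shares` is hypothetical and not part of the artifact.

```python
import math

LATENCY_PARTS = ("g_matmul", "g_fc", "g_comm", "g_etc")
ENERGY_PARTS = (
    "g_fc_mem_energy", "g_fc_comp_energy",
    "g_attn_mem_energy", "g_attn_comp_energy",
    "g_etc_mem_energy", "g_etc_comp_energy",
    "g_comm_energy",
)

# Values copied from the example output.csv row in this README.
example = {
    "g_time (ms)": 0.916930348,
    "g_matmul": 0.075737878, "g_fc": 0.404029795,
    "g_comm": 0.346612651, "g_etc": 0.090550025,
    "g_energy (nJ)": 551880956.3,
    "g_fc_mem_energy": 344396314.7, "g_fc_comp_energy": 130849869.7,
    "g_attn_mem_energy": 61361497.02, "g_attn_comp_energy": 6874799.8,
    "g_etc_mem_energy": 6625322.598, "g_etc_comp_energy": 900737.2698,
    "g_comm_energy": 872415.232,
}


def shares(row: dict, parts: tuple, total_column: str) -> dict:
    """Return each component as a fraction of the reported total."""
    total = row[total_column]
    # Sanity check: the components should add up to the reported total.
    assert math.isclose(sum(row[p] for p in parts), total, rel_tol=1e-6)
    return {p: row[p] / total for p in parts}


latency_share = shares(example, LATENCY_PARTS, "g_time (ms)")
energy_share = shares(example, ENERGY_PARTS, "g_energy (nJ)")

for name, value in {**latency_share, **energy_share}.items():
    print(f"{name}: {value:.1%}")
```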
This project is released under the MIT License.
This repository incorporates and builds upon open-source implementations from the following projects. We thank the authors for making their code publicly available.
- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
  Codebase: https://github.com/mit-han-lab/quest
- AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference
  Simulator: https://github.com/scale-snu/attacc_simulator