STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems

This repository provides a complete workflow to reproduce the key results of STARC, including:

- The implementation of STARC’s selective token access with KV remapping and online clustering.
- Evaluation scripts to reproduce:
  - accuracy results on LongBench and RULER, and
  - perplexity results on PG-19.
- The simulator setup to reproduce system-level performance/energy results on GPU–PIM platforms, based on the AttAcc simulator (Ramulator-based).

What’s in this artifact

Algorithm: The STARC algorithm, which enables efficient long-context LLM inference by selectively accessing and remapping KV cache entries via online clustering under a fixed KV-cache budget.
Program: The STARC artifact running public long-context benchmarks: LongBench (16 datasets) and RULER (13 datasets).
Models: LongChat-7B v1.5-32K; LLaMA-3.1-8B-Instruct; Mistral-7B-Instruct-v0.3 (publicly available via Hugging Face).
Datasets: LongBench (16 datasets, e.g., HotpotQA, QASPER, GovReport); PG-19; RULER (13 datasets, e.g., NIAH Single, Multi-key NIAH, Multi-value NIAH). All are publicly available on Hugging Face.
Metrics: LongBench task scores; PG-19 perplexity; RULER task scores; and system metrics such as latency and energy.
Outputs: LongBench/RULER scores, PG-19 perplexity traces, and system-level performance/energy metrics with breakdowns.
Availability: Publicly available.
License: MIT license.

Requirements

Hardware

LLM accuracy evaluation (LongBench / PG-19 / RULER): Compatible with commonly used NVIDIA GPUs. We recommend NVIDIA H100 or L40 with sufficient GPU memory (e.g., at least 48 GB per GPU).
System-level simulation: CPU-only execution is sufficient. Experiments in the paper were conducted on a dual-socket AMD EPYC 9334 system with 64 CPU cores in total (2×32 cores).

Software

Python: 3.10
CUDA: 12.8
Python dependencies: see pyproject.toml

Resource estimate

Disk space: ~80 GB total
Setup time: ~20 minutes
Experiment time:
- Model accuracy experiments: ~12 hours (excluding additional appendix results)
- System-level performance experiments: ~24 hours

Getting started

1) Get the code

git clone --recurse-submodules https://github.com/EPIC-RPI/STARC
cd STARC

2) Create the environment

To better reproduce the results and avoid potential conflicts, we recommend using Python 3.10 and CUDA 12.8.

We provide scripts for the recommended environment setup. Please follow the instructions below to create the conda environment and install the STARC packages:

conda create -yn STARC python=3.10
conda activate STARC
pip install ninja==1.11.1.1 packaging
pip install -e .
pip install flash-attn==2.3.0 --no-build-isolation
conda install -c conda-forge cupy
conda install numpy scikit-learn
conda install cmake

3) Set up the PIM system simulator

This artifact builds on the AttAcc simulator:

cd simulator_starc
git submodule update --init --recursive

4) Build Ramulator2

bash set_pim_ramulator.sh
cd ramulator2
mkdir -p build
cd build
cmake .. -DCMAKE_POLICY_VERSION_MINIMUM=3.5
make -j
cp ramulator2 ../ramulator2
cd ../../

Reproducing paper results

This section describes how to reproduce the key results reported in the paper.

E1: LongBench accuracy

To reproduce the LongBench accuracy results:

cd <Your Path>/STARC/scripts/
sh longbench.sh

The model paths used by the evaluation are defined in:

STARC/evaluation/LongBench/config/model2path.json

To evaluate STARC with another model reported in the paper, replace the model name in longbench.sh.

E2: PG-19 perplexity

To reproduce the perplexity results on PG-19:

cd <Your Path>/STARC/scripts/
sh ppl_eval.sh

E3: RULER (32K context)

The RULER test data for LLaMA-3.1-8B-Instruct are already included in the STARC/ruler directory.

To reproduce the RULER results under a 32K context length:

cd <Your Path>/STARC/scripts/
sh RULER.sh

E4: GPU–PIM system simulation

The system-level simulation experiments are conducted using the AttAcc-based simulator.

Full attention

To reproduce the results for full attention:

python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
  --lin 2048 --lout 32000 --batch 16 --pim bank \
  --powerlimit --ffopt --pipeopt

Sparse attention configurations

To reproduce the results for configurations with sparse attention methods:

python main.py --system dgx-attacc --gpu H100 --ngpu 8 --model Mistral-7B \
  --lin 2048 --lout 32000 --batch 16 --pim bank \
  --powerlimit --ffopt --pipeopt \
  --sparsity --kv_budget_table kv_budget_Mistral_STARC.txt

Each sparse attention method and model uses its own .txt file, passed through the --kv_budget_table option. These files are derived from the attention masks that each method produces at every decoding step in real inference tasks (e.g., LongBench), mapped to the row-level granularity of the PIM architecture, where each DRAM row activation fetches 16 key/value vectors in parallel. Each file specifies how many memory rows are activated at each decoding step, and the simulator uses these counts to drive the simulation.
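As a rough illustration of the row-level mapping described above, the sketch below counts how many DRAM rows one decoding step would activate for a given set of selected token positions. It assumes tokens are stored contiguously in position order with 16 key/value vectors per row. The function name and this layout are illustrative assumptions; this is not the format of the kv_budget_*.txt files or the artifact's actual conversion code.

```python
VECTORS_PER_ROW = 16  # from the text: one DRAM row activation fetches 16 key/value vectors


def rows_activated(selected_tokens, vectors_per_row=VECTORS_PER_ROW):
    """Count distinct DRAM rows touched by the selected token positions.

    Assumption (hypothetical layout): token position t lives in row t // vectors_per_row.
    """
    return len({t // vectors_per_row for t in selected_tokens})


# 32 selected tokens spread 16 positions apart: every token sits in a different row.
scattered = range(0, 512, 16)
# 32 selected tokens stored next to each other: they share two rows.
contiguous = range(32)

print(rows_activated(scattered), rows_activated(contiguous))
```

Under this assumed layout, the same number of selected tokens can cost 32 row activations or only 2, which is why the budget tables are expressed in rows rather than tokens.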

Note 1: Please always start simulations from the maximum context length (e.g., 32K). For all methods except STARC, the simulator cache (ramulator.out) generated for the longest context length can be reused for shorter context lengths. In this case, the simulator will directly read the cached results without rerunning the simulation.

Note 2: When switching the evaluated method (Full attention / STARC / SparQ / Quest), please delete the previously generated ramulator.out; otherwise, cached results from the last run may be reused.

Note 3: When reproducing methods other than STARC, comment out the following two lines in STARC/simulator_starc/src/ramulator_wrapper.py to avoid introducing clustering overhead:
if l == l_target - 1:
    trace_args += " --add_cluster"

Note 4: Due to time constraints, the simulator does not yet report the clustering overhead of STARC separately. To isolate the clustering overhead, follow the steps below:
1) Keep the code from Note 3 enabled:
if l == l_target - 1:
    trace_args += " --add_cluster"
and run the corresponding command to obtain the simulation result.
2) Delete the generated ramulator.out.
3) Comment out the above code and rerun the same command.

The difference between the two results corresponds to the clustering overhead.
We apologize for the inconvenience; a future update will report the clustering overhead directly.
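The subtraction in Note 4 can be scripted. The sketch below is a minimal example under several assumptions: a copy of output.csv was saved after each of the two runs, the row of interest is the last row of each file, and the files are comma-separated (pass delimiter="\t" if yours are tab-separated). The helper names are hypothetical; only the column name g_time (ms) comes from the simulator output.

```python
import csv
import io


def last_row(csv_file, delimiter=","):
    """Return the last data row of a simulator output file as a dict."""
    rows = list(csv.DictReader(csv_file, delimiter=delimiter))
    if not rows:
        raise ValueError("no data rows found")
    return rows[-1]


def clustering_overhead(with_cluster, without_cluster, column="g_time (ms)"):
    """Metric of the run with --add_cluster minus the run without it."""
    return float(with_cluster[column]) - float(without_cluster[column])


# Made-up numbers for illustration only; these are not results from the paper.
run_with = last_row(io.StringIO("model,g_time (ms)\nMistral-7B,1.25\n"))
run_without = last_row(io.StringIO("model,g_time (ms)\nMistral-7B,1.00\n"))
print(clustering_overhead(run_with, run_without))
```

With real runs, replace the io.StringIO objects with open file handles for the two saved copies. The same function works for energy by passing column="g_energy (nJ)".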

Outputs

Model accuracy experiments:
- For LongBench, the evaluation generates a corresponding .jsonl file for each model and each task. These files contain the ground-truth answers and model predictions. The final results are summarized in result.json.
- For RULER, evaluation results are printed directly to the terminal.
PG-19 perplexity:
- A log_PG19.txt file is generated to record the evolution of perplexity during evaluation.
Simulation experiments:
- The simulator produces an output.csv file that records the breakdown of end-to-end latency and energy consumption. An example format is shown below.

model	dtype	xpu	cap	bw	sys_opb	hw	cores	pipe_level	is_parallel	power_constraint	gqa_size	Lin	Lout	bs	required_cap	s_flops	g_flops	s_time	s_matmul	s_fc	s_comm	s_softmax	s_act	s_lnorm	g_time (ms)	g_matmul	g_fc	g_comm	g_etc	g_qkv_time	g_prj_time	g_ff_time	g2g_comm	c2g_comm	g_softmax	g_act	g_lnorm	g_energy (nJ)	g_dram_energy	g_l2_energy	g_l1_energy	g_reg_energy	g_alu_energy	g_fc_mem_energy	g_fc_comp_energy	g_attn_mem_energy	g_attn_comp_energy	g_etc_mem_energy	g_etc_comp_energy	g_comm_energy
Mistral-7B	W16A16	GPU	1280	9	295.1670644	BA	8	TRUE	TRUE	TRUE	0	2048	16000	16	1.63247E+11	26440397457	2.74317E+11	137.0683392	27.98165465	62.93091672	11.61364915	24.87258191	2.506916933	7.162619808	0.916930348	0.075737878	0.404029795	0.346612651	0.090550025	0.137451554	0.030404956	0.236173285	0.344865024	0.001747627	0	0.033255129	0.057294896	551880956.3	412383134.3	51675503	30652644.72	19149650.66	37147608.35	344396314.7	130849869.7	61361497.02	6874799.8	6625322.598	900737.2698	872415.232

Since we focus only on the decoding stage, the following columns are used for analysis:

Latency breakdown:
g_matmul, g_fc, g_comm, g_etc, g_time (ms)
Energy breakdown:
g_fc_mem_energy, g_fc_comp_energy,
g_attn_mem_energy, g_attn_comp_energy,
g_etc_mem_energy, g_etc_comp_energy,
g_comm_energy, g_energy (nJ)

Energy consumption is further decomposed into compute (comp) and memory (mem) components.
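In the example row above, the four latency columns add up to g_time (ms) and the seven energy columns add up to g_energy (nJ), to the precision shown. The sketch below turns one row into percentage shares using those columns. The values are copied from the example row; the function name and the dictionary representation of a row are illustrative assumptions.

```python
LATENCY_PARTS = ["g_matmul", "g_fc", "g_comm", "g_etc"]
ENERGY_PARTS = [
    "g_fc_mem_energy", "g_fc_comp_energy",
    "g_attn_mem_energy", "g_attn_comp_energy",
    "g_etc_mem_energy", "g_etc_comp_energy",
    "g_comm_energy",
]


def shares(row, parts, total):
    """Return each component as a fraction of the total column."""
    total_value = float(row[total])
    return {name: float(row[name]) / total_value for name in parts}


# Decoding-stage columns of the example output.csv row shown above.
example_row = {
    "g_time (ms)": "0.916930348",
    "g_matmul": "0.075737878", "g_fc": "0.404029795",
    "g_comm": "0.346612651", "g_etc": "0.090550025",
    "g_energy (nJ)": "551880956.3",
    "g_fc_mem_energy": "344396314.7", "g_fc_comp_energy": "130849869.7",
    "g_attn_mem_energy": "61361497.02", "g_attn_comp_energy": "6874799.8",
    "g_etc_mem_energy": "6625322.598", "g_etc_comp_energy": "900737.2698",
    "g_comm_energy": "872415.232",
}

latency_shares = shares(example_row, LATENCY_PARTS, "g_time (ms)")
energy_shares = shares(example_row, ENERGY_PARTS, "g_energy (nJ)")
for name, value in {**latency_shares, **energy_shares}.items():
    print(f"{name}: {value:.1%}")
```

When reading a real output.csv, a row from csv.DictReader can be passed in place of example_row, since the function only looks up columns by name.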

License

This project is released under the MIT License.

Acknowledgements

This repository incorporates and builds upon open-source implementations from the following projects. We thank the authors for making their code publicly available.

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Codebase: https://github.com/mit-han-lab/quest
AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference
Simulator: https://github.com/scale-snu/attacc_simulator
