<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="description"
content="BitDelta">
<meta name="keywords" content="Nerfies, D-NeRF, NeRF">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>BitDelta</title>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'G-PYVRSFMDRL');
</script>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<!-- <link rel="icon" href="./static/images/favicon.svg"> -->
<link rel="icon" href="data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=" type="image/gif">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
<!-- mathjax -->
<script>
window.MathJax = {
tex: {
packages: {'[+]': ['ams', 'color']}
},
loader: {
load: ['[tex]/ams']
}
};
</script>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-2 publication-title">BitDelta: Your Fine-Tune May Only Be Worth One Bit</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
James Liu<sup>1*</sup>,</span>
<span class="author-block">
Guangxuan Xiao<sup>1</sup>,</span>
<span class="author-block">
Kai Li<sup>2</sup>,</span>
<span class="author-block">
Jason D. Lee<sup>2</sup>,</span>
<span class="author-block">
Song Han<sup>1,3</sup>,</span>
<span class="author-block">
Tri Dao<sup>2,4</sup>,</span>
<span class="author-block">
Tianle Cai<sup>2,4*</sup></span>
</div>
<div class="is-size-6 publication-authors">
<span class="author-block">* indicates equal contribution</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>MIT,</span>
<span class="author-block"><sup>2</sup>Princeton University,</span>
<span class="author-block"><sup>3</sup>NVIDIA,</span>
<span class="author-block"><sup>4</sup>Together AI</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2402.10193.pdf"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- ArXiv Link. -->
<span class="link-block">
<a href="https://arxiv.org/abs/2402.10193"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/FasterDecoding/BitDelta"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<img src="./static/images/BitDelta.png" alt="BitDelta overview figure">
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Abstract. -->
<!-- <div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Large Language Models (LLMs) are typically trained in two phases:
pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
Given the higher computational demand of pre-training, it's intuitive to assume that
fine-tuning adds less new information to the model, and is thus more compressible.
We explore this assumption by decomposing the weights of fine-tuned models into their
pre-trained components and an additional delta. We introduce a
simple method, BitDelta, which successfully quantizes this delta down to 1 bit without
compromising performance.
</p>
<p>
This interesting finding not only highlights the potential redundancy of information added during
fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant
storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied
by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x,
which can also be translated to enhanced generation latency in multi-tenant settings. We validate
BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters,
showcasing minimal performance degradation over all tested settings.
</p>
</div>
</div>
</div> -->
<!-- Abstract. -->
<!-- Intro. -->
<div class="columns is-centered">
<div class="column is-full-width">
<!-- <h2 class="title is-3">Header</h2>
<h3 class="title is-4">Subheader</h3> -->
<p style="margin-bottom: 20px;">
The <i>pretrain-finetune</i> paradigm has revolutionized machine learning: through fine-tuning, LLMs
can be aligned with distinct user preferences or specialized
task requirements, showing an unprecedented level of adaptability.
Thus, the prospect of serving <i>millions of uniquely fine-tuned models</i>,
each tailored to individual tasks and user needs, presents a promising vision
for the future of machine learning.
</p>
<p style="margin-bottom: 20px">
This application is known as <b>multi-tenant serving</b>, an architectural practice in which
a single instance of software serves multiple customers. Ideally, many customers could each
efficiently use their own fine-tuned model, hosted on one centralized service.
</p>
<p style="margin-bottom: 20px;">
However, multi-tenant serving is challenging for two key reasons:
1) <b>Expensive Storage.</b> Each new fine-tuned model is large, even if we have
relatively few base models, making fine-tuned models expensive to store and challenging to manage on disk.
2) <b>Expensive Serving.</b> Distinct fine-tuned models each demand significant GPU memory,
making it difficult and expensive to concurrently serve such models without noticeable downtime.
</p>
<h2 class="title is-4">Insight: Information Disparity in Pre-training vs. Fine-tuning </h2>
<p style="margin-bottom: 20px;">
Given the higher computational demand of pre-training, it makes sense to assume that
fine-tuning adds less new information to the model. This implies that fine-tuned models
that are derived from the same base model may share a significant amount of redundant
information. Can we exploit this to address the above storage and serving challenges?
</p>
<div style="margin-bottom: 20px;">
<p style="margin-bottom: 20px; width: 80%; margin-left: auto; margin-right: auto; text-align: center">
Quantization results for <code>Vicuna-7B v1.5</code> with base model
<code>Llama 2-7B</code>. The adjusted average is over ARC, BBH, HellaSwag, and Winogrande.
We highlight TruthfulQA, GSM8K, and MT-Bench because the base model struggles on these tasks,
so they show whether BitDelta effectively retains the fine-tune information.
</p>
$$
\begin{array}{lccccc}
\hline
\textbf{Model/Method} & \textbf{Train Loss} & \textbf{TruthfulQA} & \textbf{GSM8K} & \textbf{MT-Bench} & \textbf{Adjusted Average} \uparrow \\
\hline
\textit{Llama 2-7B} & -- & 38.96 & 13.57 & -- & 60.53 \\
\textit{Vicuna-7B v1.5} & -- & 50.36 & 19.03 & 6.04 & 60.51 \\
\hline
\text{BitDelta-Initial} & 0.41 & 47.63 & 19.56 & 5.67 & 60.99 \\
\text{BitDelta} & 0.052 & 49.97 & 20.17 & 5.99 & 60.68 \\
\hline
\end{array}
$$
</div>
<p style="margin-bottom: 20px;">
<b>It turns out that we can!</b> We introduce BitDelta, which decomposes the weights of fine-tuned
models into their pre-trained
components and an additional delta: \(W_\text{fine} = W_\text{base} + \Delta \). Drawing from
this insight, we find that we can quantize this delta, which encodes the fine-tuning
information, down to <b>1 bit</b> without compromising performance. We conduct experiments
on 17 popular fine-tuned models across the Llama-2 and Mistral families, and show that BitDelta
is quite general. BitDelta is fast (compression takes minutes), works for models across a
wide range of sizes (we test models between 7B and 70B parameters), and can retain all sorts of
fine-tuning information (we test SFT, RLHF, DPO, and RoPE-based context extension). Check out
our paper for more details!
</p>
<p style="margin-bottom: 20px;">
By representing multiple
fine-tuned models as a single high-precision base model accompanied by multiple 1-bit deltas,
we can drastically reduce GPU memory requirements. This addresses the <b>storage challenge</b>.
Since LLM inference is memory-bound,
we can also translate this memory reduction into <b>faster inference</b> (about 2x for now)
in multi-tenant settings, using an efficient 1-bit matrix multiplication kernel!
This addresses the <b>serving challenge</b>.
</p>
<p style="margin-bottom: 20px;">
Past work (GPT-Zip, DeltaZip) has also explored quantization of the weight delta, achieving
quantization levels as low as 2 bits by applying methods introduced by GPTQ. We find that
the weight delta is extremely compressible: we achieve <b>1-bit quantization</b>
with minimal performance degradation using a simpler methodology.
</p>
<h2 class="title is-4">BitDelta Overview</h2>
<h2 class="title is-5">1-bit quantization</h2>
<p style="margin-bottom: 20px;">
Let \(W_\text{base}, W_\text{fine} \in \mathbb{R}^{n \times m}\) be weight matrices
from the base model and fine-tuned model, respectively. We define the weight delta
as \(\Delta = W_\text{fine} - W_\text{base}\), representing the modification in
weights post-fine-tuning. For efficient representation, we aim to obtain a binarized
estimator of this weight delta, denoted as \(\hat{\Delta}\), by encoding its sign bits:
$$
\hat{\Delta} = \alpha \odot \text{Sign}(\Delta),
$$
where
$$
\text{Sign}(\Delta_{ij}) =
\begin{cases}
+1, & \text{if } \Delta_{ij} > 0, \\
-1, & \text{if } \Delta_{ij} \leq 0,
\end{cases}
$$
and \(\alpha\) is a high-precision scaling factor for the entire matrix. To minimize the approximation error
in \(L_2\) norm:
$$
\|\Delta - \hat{\Delta}\|_2^2 = \sum_{ij}(|\Delta_{ij}|-\alpha)^2,
$$
we take
$$
\alpha = \frac{1}{nm} \sum_{ij} |\Delta_{ij}|.
$$
Surprisingly, we find that the above quantization approach already does quite well
and retains most of the fine-tuned models' performance.
</p>
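<p style="margin-bottom: 20px;">
As a rough illustration, here is a minimal PyTorch sketch of this per-matrix binarization
(not the optimized implementation; function and variable names are illustrative, and in practice
the sign mask is bit-packed to realize the memory savings):
</p>
<pre><code>import torch

def binarize_delta(w_base: torch.Tensor, w_fine: torch.Tensor):
    """Compress one layer's weight delta into a 1-bit sign mask and a scalar scale."""
    delta = w_fine - w_base            # high-precision weight delta
    alpha = delta.abs().mean()         # per-matrix scale: mean absolute delta
    mask = delta > 0                   # 1-bit sign mask (True -> +1, False -> -1)
    return mask, alpha

def apply_delta(w_base: torch.Tensor, mask: torch.Tensor, alpha: torch.Tensor):
    """Reconstruct the approximate fine-tuned weight: W_base + alpha * Sign(Delta)."""
    sign = mask.to(w_base.dtype) * 2 - 1
    return w_base + alpha * sign
</code></pre>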
<h2 class="title is-5">Scale distillation</h2>
<p style="margin-bottom: 20px;">
Intuitively, the scaling factor \(\alpha\) plays a more significant role
in the low-bit regime, so we further optimize these scales by performing
model distillation to align the output logits of the quantized model with those
of the original fine-tuned model. More concretely, we freeze the model
weights and optimize the following objective:
$$
\boldsymbol{\alpha}^* = \arg\min_{\boldsymbol{\alpha}} \mathbb{E}_{x \sim \mathbf{X}}\left[ \left\| \mathbf{Z}_{\text{fine}}(x) - \mathbf{Z}_{\text{bin}}(x; \boldsymbol{\alpha}) \right\|^2 \right]
$$
where \(\mathbf{X}\) is a calibration dataset and \(\mathbf{Z}(\cdot)\) are the logits of the
respective models. We find that scale distillation is fairly insensitive to the choice of \(\mathbf{X}\),
because 1) the process is extremely parameter-efficient, and 2) what matters is
matching the logits of the fine-tuned model, regardless of the actual text content. We denote the method
without scale distillation as BitDelta-Initial, and the method with scale distillation as BitDelta.
As seen in the table above, scale distillation is effective in further recovering fine-tune
performance.
</p>
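<p style="margin-bottom: 20px;">
A condensed sketch of the scale distillation loop is shown below. It assumes Hugging Face-style
models that return <code>.logits</code>, a <code>quantized_model</code> whose per-matrix scales
are its only trainable parameters, and a <code>calib_loader</code> yielding tokenized calibration
batches; these names and the "scale" naming convention are illustrative rather than taken from our codebase:
</p>
<pre><code>import torch
import torch.nn.functional as F

def distill_scales(fine_model, quantized_model, calib_loader, steps=200, lr=1e-4):
    """Optimize only the scales so the quantized model's logits match the fine-tuned model's."""
    for p in quantized_model.parameters():
        p.requires_grad_(False)
    scales = [p for n, p in quantized_model.named_parameters() if "scale" in n]
    for s in scales:
        s.requires_grad_(True)
    opt = torch.optim.AdamW(scales, lr=lr)

    fine_model.eval()
    for _, input_ids in zip(range(steps), calib_loader):
        with torch.no_grad():
            z_fine = fine_model(input_ids).logits       # target logits (fine-tuned model)
        z_bin = quantized_model(input_ids).logits       # logits of base + 1-bit delta
        loss = F.mse_loss(z_bin, z_fine)                # || Z_fine - Z_bin ||^2
        opt.zero_grad()
        loss.backward()
        opt.step()
</code></pre>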
<h2 class="title is-5">Inference speedup</h2>
<div style="margin-bottom: 20px;">
<p style="margin-bottom: 20px; width: 80%; margin-left: auto; margin-right: auto; text-align: center">
BitDelta achieves over 10\(\times\) compression. We can further compress the embedding and LM head layers,
but leave this to future work due to inconsistencies in tokenizer vocabularies.
</p>
$$
\begin{array}{lccc}
\hline
\textbf{Base Model} & \textbf{Size} & \Delta \textbf{Size} & \textbf{Comp. Factor} \\
\hline
\textit{Llama 2-7B} & 13.48 \text{ GB} & 1.24 \text{ GB} & 10.87 \\
\textit{Llama 2-13B} & 26.03 \text{ GB} & 2.09 \text{ GB} & 12.45 \\
\textit{Llama 2-70B} & 137.95 \text{ GB} & 8.95 \text{ GB} & 15.41 \\
\textit{Mistral-7B v0.1} & 14.48 \text{ GB} & 1.30 \text{ GB} & 11.14 \\
\hline
\end{array}
$$
</div>
<p style="margin-bottom: 20px;">
Since LLM inference is memory-bound, with generation latency roughly proportional to the GPU memory
occupied by the model weights, this reduced memory consumption also opens up an opportunity to improve
serving latency. For example, Punica and S-LoRA exploit
LoRA's structure and memory savings by developing a CUDA kernel that efficiently calculates
the batched delta-activation product for low-rank deltas. Similarly, we decompose the forward pass
of each linear layer as follows:
$$
X'_i = W_{\text{fine}, i}X_i \approx W_{\text{base}}X_i +
\underbrace{ \hat{\Delta}_iX_i}_\textbf{Kernel}
$$
where \(X_i\) and \(X'_i\) are the input and output features of the \(i\)-th fine-tuned model,
and the base model weight and the delta are computed separately. For a batch of requests,
\(W_{\text{base}}X_i\) can be computed with the classic batched GEMM kernel.
We implement a fused binary GEMM kernel in Triton that allows us to calculate
\(\hat{\Delta}_iX_i\) in a batched setting while keeping the 1-bit deltas quantized until
they are transferred to the GPU cache. This kernel fuses the dequantization operation
with the GEMM calculation, greatly reducing data movement overhead!
</p>
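<p style="margin-bottom: 20px;">
In PyTorch-level pseudocode (the actual kernel is written in Triton, operates on bit-packed masks,
and fuses dequantization with the GEMM; shapes and names here are illustrative), the decomposed
forward pass of one linear layer looks roughly like this:
</p>
<pre><code>import torch

def multitenant_linear(x, w_base, masks, alphas):
    """
    x:      (B, seq, in_features)           one request per fine-tuned model
    w_base: (out_features, in_features)     shared base weight
    masks:  (B, out_features, in_features)  boolean sign masks, one per model
    alphas: (B,)                            per-model, per-matrix scales
    """
    # Shared part: a single batched GEMM against the one base weight.
    y = x @ w_base.T
    # Per-model part: batched delta-activation product (what the Triton kernel accelerates).
    signs = masks.to(x.dtype) * 2 - 1                  # {0, 1} to {-1, +1}
    y = y + alphas.view(-1, 1, 1) * torch.einsum("boi,bsi->bso", signs, x)
    return y
</code></pre>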
<p style="margin-bottom: 20px">
To illustrate the speedup, we benchmark the decoding latency of our kernel
(a batched linear operation over multiple deltas sharing a single base model, as in
the decomposed forward pass above) against naively computing the forward pass
separately for each model. We ablate across the batch size and hidden size dimensions
and find that our kernel consistently achieves a ~2\(\times\) speedup.
</p>
<div style="text-align: center;margin-bottom: 20px">
<div style="display: inline-block; width: 47%;">
<img src="./static/images/kernel_hidden_size.png" style="width: 100%;" alt="Decoding latency vs. hidden size">
<p>Decoding latency vs. hidden size, assuming \(N=M\). Batch size of 8.</p>
</div>
<div style="display: inline-block; width: 44.8%;">
<img src="./static/images/kernel_batch_size.png" style="width: 100%;" alt="Decoding latency vs. batch size">
<p>Decoding latency vs. batch size \(B\), assuming \(N=M=8192\).</p>
</div>
<p>Decoding latency of a linear layer with and without BitDelta. Blue: Naive forward pass with
\(B\) distinct fine-tuned models. Yellow: Batched forward pass with BitDelta,
corresponding to one base model and \(B\) 1-bit deltas, utilizing a Triton kernel.</p>
</div>
<div class="container is-max-desktop">
<h2 class="title is-4">Ablation Studies</h2>
<div class="columns is-centered">
<!-- Visual Effects. -->
<div class="column">
<div class="content">
<h2 class="title is-5">Quantized base models</h2>
<p style="margin-bottom: 10px;">
We apply BitDelta to <code>Llama 2-7B Chat</code>, and find it holds up when the
underlying base model is quantized at various levels. Because 8-bit RTN and GPTQ
work with 16-bit activations, we can keep the fine-tune weights \(W_\text{fine}\)
and scaling factors \(\alpha\) in high precision, only quantizing the base weights
\(W_\text{base}\).
</p>
<p style="margin-bottom: 10px;">
FP16 + \(\Delta\) outperforms GPTQ across the board.
In the performance engineering context of multi-tenant serving,
we would rather store a single high-precision
base model with many 1-bit deltas than store many quantized fine-tuned models.
This result implies that the same preference also holds in the
<i>model quality</i> context of multi-tenant serving.
</p>
<p>
We try using <code>Llama 2-7B Chat</code> as both the base model and the fine-tuned model,
quantizing the base model using GPTQ, and find that we're able to outperform baseline
GPTQ on many evaluations. We hypothesize this is because we can diffuse 16-bit
information into the model through the high-precision scaling factors, at the cost
of including a 1-bit delta.
</p>
<div>
$$
\begin{array}{llcccc}
\hline
\textbf{Base Model} & \textbf{Method} & \textbf{TruthfulQA} & \textbf{GSM8K} & \textbf{MT-Bench} & \textbf{Adjusted Average} \uparrow \\
\hline
& \text{FP16} & 45.32 & 22.74 & 6.56 & 59.81 \\
\text{Baseline} & \text{INT8 RTN} & 45.02 & 22.29 & 6.28 & 59.63 \\
& \text{GPTQ} & 44.92 & 19.48 & 5.90 & 58.67 \\
\hline
& \text{FP16 +} \Delta & 44.95 & 20.24 & 6.47 & 59.88 \\
\textit{Llama 2-7B} & \text{INT8 RTN +} \Delta & 44.71 & 19.86 & 6.16 & 59.85 \\
& \text{GPTQ +} \Delta & 42.52 & 19.94 & 6.02 & 59.22 \\
\hline
\textit{Llama 2-7B Chat} & \text{GPTQ +} \Delta & 44.63 & 22.14 & 6.11 & 59.17 \\
\hline
\end{array}
$$
</div>
</div>
</div>
<!--/ Visual Effects. -->
<!-- Matting. -->
<div class="column">
<h2 class="title is-5">Varying fidelity of \(\Delta\)</h2>
<div class="columns is-centered">
<div class="column content">
<div class="columns is-centered">
<p>
By successively applying BitDelta, treating the compressed model from the
previous iteration as our base model, we can vary the granularity of the delta,
associating it with multiple 1-bit masks. One advantage of doing this is the
ability to assign an arbitrary scale factor to each 1-bit mask; in contrast,
simply increasing the bit width fixes the scale factors implicitly with respect
to each other. The figure shows how the TruthfulQA score of <code>Llama 2-7B</code>
plus an increasingly granular delta approaches that of <code>Vicuna-7B v1.5</code>.
</p>
<div style="display: inline-block; width: 100%;">
<img src="./static/images/nbit.png" alt="TruthfulQA of Llama 2-7B with an increasingly granular delta, approaching Vicuna-7B v1.5">
</div>
</div>
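<p style="margin-bottom: 10px;">
A sketch of this iterative refinement for a single weight matrix is shown below; it is purely
illustrative, and each pass simply binarizes whatever residual the previous masks have not yet
captured, with its own scale factor:
</p>
<pre><code>import torch

def multi_mask_delta(w_base, w_fine, num_masks=3):
    """Approximate the delta with several 1-bit masks, each with its own scale."""
    approx = w_base.clone()
    masks, alphas = [], []
    for _ in range(num_masks):
        residual = w_fine - approx            # what the current approximation still misses
        alpha = residual.abs().mean()         # scale for this mask
        sign = (residual > 0).to(w_fine.dtype) * 2 - 1
        masks.append(residual > 0)
        alphas.append(alpha)
        approx = approx + alpha * sign        # fold in this mask; the next pass sees a smaller residual
    return masks, alphas, approx
</code></pre>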
</div>
</div>
</div>
</div>
<h2 class="title is-4">Future Work</h2>
<p style="margin-bottom: 20px">
There are many exciting directions for future work. On the model quality side, we can
incorporate saliency-aware quantization into the weight deltas, similar to <a href="https://arxiv.org/pdf/2306.00978.pdf">
AWQ (Lin et al.)</a>. On the compression side, we can investigate sub-1-bit quantization methods
that maintain hardware-friendliness. On the serving side, we can further optimize the Triton kernel;
it is still fairly slow compared to the theoretical upper bound, considering the
small memory footprint of the weight deltas. With further optimization, it should be possible to
achieve a ~4-8\(\times\) speedup. Finally, the idea of calibrating certain scale factors
through distillation may apply more generally to PTQ methods, which we hope will
make low-bit quantized LLMs more robust.
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@misc{liu2024bitdelta,
title={BitDelta: Your Fine-Tune May Only Be Worth One Bit},
author={James Liu and Guangxuan Xiao and Kai Li and Jason D. Lee and Song Han and Tri Dao and Tianle Cai},
year={2024},
eprint={2402.10193},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
</code></pre>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p style="text-align: center">
This website is adapted from the <a href="https://github.com/nerfies/nerfies.github.io">Nerfies</a> template.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>