Use more parallelism in attention block in prefill mode. #177
Merged
copybara-service[bot] merged 3 commits into google:dev on May 3, 2024
Conversation
Move the loop over the tokens inside the attention block and
then create kHeads * num_tokens threads.
This improves multi-threaded speed only for the 2B Gemma model, but for
consistency we move the loop over the tokens inside the Griffin recurrent
layer and the FFW layer as well. This is also a preparation for using the
MatMul operation later.
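A minimal sketch of the fused-loop idea, not the actual gemma.cpp code: `AttendOneHeadOneToken` is a hypothetical stand-in for the per-(head, token) attention work, and the pool assumed is Highway's `hwy::ThreadPool`, which gemma.cpp builds on.

```
#include <cstddef>
#include <cstdint>

#include "hwy/contrib/thread_pool/thread_pool.h"

// Hypothetical stand-in for the per-(head, token) attention work
// (QK MatVec, softmax, weighted sum over the KV cache).
void AttendOneHeadOneToken(size_t head, size_t token) {
  (void)head;
  (void)token;
}

// BEFORE: for each token, run a parallel loop over kHeads heads,
// i.e. at most kHeads tasks in flight at a time.
// AFTER (this PR's approach): fuse the two loops so the pool
// schedules kHeads * num_tokens independent tasks at once.
void PrefillAttention(size_t kHeads, size_t num_tokens,
                      hwy::ThreadPool& pool) {
  pool.Run(0, kHeads * num_tokens,
           [&](uint64_t task, size_t /*thread*/) {
             const size_t head = task % kHeads;
             const size_t token = task / kHeads;
             AttendOneHeadOneToken(head, token);
           });
}
```

With 1600 prefill tokens, the fused loop exposes far more tasks than there are threads, which is why the speedup grows with the thread count.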
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
Prefill speed
Num threads   BEFORE      AFTER
32            61.76 t/s   65.08 t/s
64            89.46 t/s   98.62 t/s
```
jan-wassenberg (Member) requested changes on May 3, 2024 and left a comment:
Nice, loop + MatVec is starting to look a lot like a matmul!
One small fix and a question:
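The observation in that comment, as a hedged sketch: a loop of MatVec calls over the tokens computes exactly the same result as a single MatMul. Names, layout, and dimensions here are illustrative, not gemma.cpp's.

```
#include <cstddef>

// Calling MatVec once per token, with the num_tokens input vectors
// stacked as rows of X, is the same computation as Out = X * W^T,
// i.e. one matrix-matrix product.
void MatVecLoop(const float* W,   // rows x cols, row-major
                const float* X,   // num_tokens x cols, row-major
                float* Out,       // num_tokens x rows, row-major
                size_t rows, size_t cols, size_t num_tokens) {
  for (size_t t = 0; t < num_tokens; ++t) {  // token loop
    for (size_t r = 0; r < rows; ++r) {      // one MatVec per token
      float sum = 0.0f;
      for (size_t c = 0; c < cols; ++c) {
        sum += W[r * cols + c] * X[t * cols + c];
      }
      Out[t * rows + r] = sum;
    }
  }
}
```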
jan-wassenberg approved these changes on May 3, 2024.