If you read the previous two sections, you'll have realized that this approach can get complicated. The complication stems from the fact that the naive way has to process every row separately. But what if we grouped the inputs into chunks that are easier to work with? At first this may appear to be a pessimisation, because such an approach has to consider an input non-zero if any value in its group is non-zero.
Anyway, let's consider a group size of 4. With basic probability math we can calculate that if the chance of a single input being zero is `x`, then the chance of all 4 inputs in a group being zero is `x^4` (assuming independence). For example, if `x = 0.9` then `x^4 ~= 0.66`, so the fraction of non-zero entries rises from `1 - x = 0.1` to `1 - x^4 ~= 0.34` - about 3.5 times as many... BUT. There are also 4 times fewer candidate indices now, because they are grouped! With, say, 4096 inputs, that's roughly 410 individual non-zero indices versus roughly 352 non-zero groups, so the index count actually drops slightly. What about the amount of work required per non-zero index? Normally it would be 4 times as much, because 4 inputs are processed per index. However, two things align to help our cause. First, as we saw, the implementation is clunky for a group size of 1, and requires either more memory for the weights or additional work to unpack them. Second, a group size of 1 prevents us from efficiently handling a small number of outputs, because we're limited by the SIMD register width. Stockfish, for example, uses only 16 outputs after the large layer, so processing multiple (4) inputs at a time is a natural optimization.
So, overall, we have the following tradeoffs:
1. 4 times fewer indices to calculate
2. 4 times more inputs to process per index
3. but cheaper processing per index: we can go back to simple and fast handling of each input chunk (no weight unpacking, no int16 weights), which works especially well with few outputs
Combined, these tradeoffs give a sizable speedup for larger networks.
Let's see the rough code.
```cpp
int lsb(std::uint32_t v) {
    // returns the least significant set bit in v
    // implementation detail
    // can be implemented for example using compiler intrinsics
    // (assumes v != 0; __builtin_ctz is the GCC/Clang intrinsic)
    return __builtin_ctz(v);
}
```
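Using the `lsb` helper above, here is a minimal sketch of the whole forward pass. It is a sketch, not the exact Stockfish code, and it makes several assumptions: AVX2 is available, there are exactly 16 outputs, the inputs are unsigned 8-bit values (as produced by a ClippedReLU), the weights are signed 8-bit and stored chunk-major so that the 4 weights of each output for a given chunk are contiguous, and the number of inputs is divisible by 32. The function and parameter names are illustrative.

```cpp
#include <immintrin.h>
#include <cstdint>

constexpr int NumOutputs = 16; // the small layer after the big one

void linear_sparse_input_forward(
    const std::uint8_t*  input,      // [num_inputs], mostly zero
    int                  num_inputs, // divisible by 32
    const std::int8_t*   weights,    // [num_inputs / 4][NumOutputs][4]
    const std::int32_t*  biases,     // [NumOutputs]
    std::int32_t*        output)     // [NumOutputs]
{
    const __m256i ones = _mm256_set1_epi16(1);

    // 16 int32 accumulators in two registers, seeded with the biases.
    __m256i acc0 = _mm256_loadu_si256((const __m256i*)(biases + 0));
    __m256i acc1 = _mm256_loadu_si256((const __m256i*)(biases + 8));

    // View the inputs as 32-bit chunks of 4 bytes each.
    const std::uint32_t* input32 = (const std::uint32_t*)input;
    const int num_chunks = num_inputs / 4;

    for (int base = 0; base < num_chunks; base += 8)
    {
        // Compare 8 chunks against zero at once and turn the result into
        // a bitmask of the chunks that contain at least one non-zero input.
        const __m256i v       = _mm256_loadu_si256((const __m256i*)(input32 + base));
        const __m256i is_zero = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
        std::uint32_t nnz =
            ~(std::uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(is_zero)) & 0xFF;

        // Process only the non-zero chunks.
        while (nnz)
        {
            const int i = base + lsb(nnz);
            nnz &= nnz - 1; // clear the bit we just consumed

            // Broadcast the 4 input bytes of this chunk to all lanes.
            const __m256i in = _mm256_set1_epi32(input32[i]);

            // 64 weights for this chunk: 4 per output, 16 outputs.
            const std::int8_t* col = weights + i * NumOutputs * 4;
            const __m256i w0 = _mm256_loadu_si256((const __m256i*)(col +  0));
            const __m256i w1 = _mm256_loadu_si256((const __m256i*)(col + 32));

            // maddubs multiplies u8 inputs by i8 weights and sums adjacent
            // pairs into i16 (this saturates, so weights must be bounded);
            // madd with ones widens the two halves of each dot product to i32.
            acc0 = _mm256_add_epi32(acc0,
                _mm256_madd_epi16(_mm256_maddubs_epi16(in, w0), ones));
            acc1 = _mm256_add_epi32(acc1,
                _mm256_madd_epi16(_mm256_maddubs_epi16(in, w1), ones));
        }
    }

    _mm256_storeu_si256((__m256i*)(output + 0), acc0);
    _mm256_storeu_si256((__m256i*)(output + 8), acc1);
}
```

Each `maddubs`/`madd` pair computes, for 8 outputs at a time, the full 4-element dot product of the chunk with the corresponding weights, so a non-zero chunk costs just a handful of instructions regardless of how many of its 4 inputs are actually zero.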
The math of `x^4` assumes a uniform distribution of non-zero inputs. We, however, help it a little bit by reordering the weights such that inputs that are more likely to be non-zero are grouped together (say, at the beginning). The ordering can be found empirically. This is a minor (~2%), but essentially free, speedup!
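As a hedged sketch of the idea (the helper name and its activation-count input are assumptions for illustration, not the actual tooling), such a permutation could be derived offline from a sample of positions and then applied to the weight rows and the preceding layer's outputs:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Sort inputs by how often they were observed to be non-zero over some
// sample, so that frequently active inputs land in the same chunks.
// activation_count[i] = number of times input i was non-zero in the sample.
std::vector<int> make_input_permutation(
    const std::vector<std::uint64_t>& activation_count)
{
    std::vector<int> perm(activation_count.size());
    std::iota(perm.begin(), perm.end(), 0);
    // Most frequently active inputs first; the weight rows (and the
    // previous layer's outputs) must then be permuted accordingly.
    std::sort(perm.begin(), perm.end(), [&](int a, int b) {
        return activation_count[a] > activation_count[b];
    });
    return perm;
}
```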
#### Linear layer with sparse input and blocked sparse output
Let's go one step further. So far all linear layers have had dense outputs, but we can consider a layer where each input is connected only to a subset of outputs. We can consider the weights to be 0 where no connection is present. To make this efficient to implement with vectorization in mind, we have to zero out whole blocks of weights. A 16x128 weight matrix with 2 non-zero 1x16 blocks per input may look like this, for example:
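(Schematic reading: rows are the 16 inputs; each column is an aligned block of 16 outputs, so 8 blocks cover the 128 outputs. `X` marks a potentially non-zero 1x16 block, `.` a zeroed one. The exact pattern below is illustrative.)

```
X . X . . . . .
. X . . X . . .
. . X . . X . .
X . . . . . X .
. . . X . . . X
. X . . . X . .
. . . . X . X .
X . . X . . . .
. . X . . . . X
. X . . . . X .
. . . X X . . .
X . . . . X . .
. . X . . . X .
. X . . . . . X
. . . X . X . .
X . . . X . . .
```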