Conversation

@umar456 umar456 commented Mar 20, 2021

Description

This optimization dynamically sets the block size based on the dimensions of
the output array. Previously the kernel always launched with 32x8 threads per
block. That configuration was not ideal when indexing into a long array with
few columns and many rows. The new approach chooses blocks of 256x1, 128x2,
64x4, or 32x8 threads to better accommodate outputs with a small second
dimension.
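
For illustration only, here is a minimal sketch of the kind of block-size selection described above (the function name and exact cutoffs are assumptions based on this description, not the actual kernel code in the PR):

#include <cuda_runtime.h>

// Sketch: pick the block shape from the output's second dimension so that
// tall, narrow outputs get more threads along dim0. The total stays at
// 256 threads per block; threads.x walks rows (dim0), threads.y walks
// columns (dim1).
static dim3 pickIndexBlock(int out_cols) {
    if (out_cols == 1) return dim3(256, 1);
    if (out_cols == 2) return dim3(128, 2);
    if (out_cols <= 4) return dim3(64, 4);
    return dim3(32, 8);  // the original fixed configuration
}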

Changes to Users

None

Checklist

  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • [ ] Functions added to unified API
  • [ ] Functions documented

@umar456 umar456 added the perf label Mar 20, 2021
@umar456 umar456 requested a review from 9prady9 March 20, 2021 04:09
@9prady9 9prady9 commented Mar 20, 2021

What is the speedup range?

@umar456 umar456 commented Mar 20, 2021

#include <arrayfire.h>
#include <gtest/gtest.h>

using namespace af;

// Benchmark: index a tall, narrow array so the output has many rows and few columns.
TEST(Index, blah) {
    array a   = randu(20000000, 10);  // 20,000,000 x 10 random input
    array idx = seq(20000000);        // 20,000,000 linear indices

    array b = a(idx);                 // gather -> 20,000,000 x 1 output
    b.eval();                         // launch the index kernel
    af::sync();                       // wait for the GPU before reading the profile
}

Master:

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
    49.7       10,107,916          1  10,107,916.0  10,107,916  10,107,916  void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int) 
    38.4        7,811,232          1   7,811,232.0   7,811,232   7,811,232  void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
     8.1        1,645,554          1   1,645,554.0   1,645,554   1,645,554  KER9745534647381087054                                                                              
     3.9          791,546          1     791,546.0     791,546     791,546  void cuda::range<float>(cuda::Param<float>, int, int, int) 

This PR:

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name                                                
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    60.6        7,809,308          1  7,809,308.0  7,809,308  7,809,308  void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
    20.3        2,615,049          1  2,615,049.0  2,615,049  2,615,049  void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int) 
    12.8        1,646,066          1  1,646,066.0  1,646,066  1,646,066  KER9745534647381087054                                                                              
     6.3          812,761          1    812,761.0    812,761    812,761  void cuda::range<float>(cuda::Param<float>, int, int, int) 

The index kernel is about 3.8x faster (10,107,916 ns down to 2,615,049 ns).
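
For intuition on where that gain comes from, a back-of-the-envelope sketch (assuming threads.x maps to dim0 and threads.y to dim1 as in the description; these numbers are illustrative, not profiler output):

// The output of the test above is effectively 20,000,000 x 1.
dim3 oldBlock(32, 8);   // 256 threads, but only the 32 with threadIdx.y == 0
                        // map to the single output column (~12.5% useful work)
dim3 newBlock(256, 1);  // all 256 threads map to valid rows of that one column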

@9prady9 9prady9 merged commit d56c3bc into arrayfire:master Mar 23, 2021
@9prady9 9prady9 deleted the index_opt branch March 23, 2021 02:59