Conversation

@umar456 umar456 commented Mar 20, 2021

Description

This optimization dynamically sets the block size based on the dimensions of
the output array. Previously the kernel always launched with 32x8 threads per
block. That configuration was not ideal when indexing into a long array with
few columns and many rows. The new approach chooses blocks of 256x1, 128x2,
64x4, or 32x8 threads to better accommodate outputs with a small second
dimension.
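
For illustration only, here is a minimal sketch of the kind of block-size selection described above (the function name and exact cutoffs are assumptions based on this description, not the actual kernel code in the PR):

#include <cuda_runtime.h>

// Sketch: pick the block shape from the output's second dimension so that
// tall, narrow outputs get more threads along dim0. The total stays at
// 256 threads per block; threads.x walks rows (dim0), threads.y walks
// columns (dim1).
static dim3 pickIndexBlock(int out_cols) {
    if (out_cols == 1) return dim3(256, 1);
    if (out_cols == 2) return dim3(128, 2);
    if (out_cols <= 4) return dim3(64, 4);
    return dim3(32, 8);  // the original fixed configuration
}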

Changes to Users

None

Checklist

  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • [ ] Functions added to unified API
  • [ ] Functions documented

@umar456 umar456 added the perf label Mar 20, 2021
@umar456 umar456 requested a review from 9prady9 March 20, 2021 04:09
@9prady9 9prady9 commented Mar 20, 2021

What is the speedup range?

@umar456 umar456 commented Mar 20, 2021

#include <arrayfire.h>
#include <gtest/gtest.h>

using namespace af;

// Benchmark: index a tall, narrow array so the output has many rows and few columns.
TEST(Index, blah) {
    array a   = randu(20000000, 10);  // 20,000,000 x 10 random input
    array idx = seq(20000000);        // 20,000,000 linear indices

    array b = a(idx);                 // gather -> 20,000,000 x 1 output
    b.eval();                         // launch the index kernel
    af::sync();                       // wait for the GPU before reading the profile
}

Master:

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                                                    Name                                                
 -------  ---------------  ---------  ------------  ----------  ----------  ----------------------------------------------------------------------------------------------------
    49.7       10,107,916          1  10,107,916.0  10,107,916  10,107,916  void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int) 
    38.4        7,811,232          1   7,811,232.0   7,811,232   7,811,232  void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
     8.1        1,645,554          1   1,645,554.0   1,645,554   1,645,554  KER9745534647381087054                                                                              
     3.9          791,546          1     791,546.0     791,546     791,546  void cuda::range<float>(cuda::Param<float>, int, int, int) 

This PR:

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name                                                
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    60.6        7,809,308          1  7,809,308.0  7,809,308  7,809,308  void cuda::kernel::uniformPhilox<float>(float*, unsigned int, unsigned int, unsigned int, unsigned …
    20.3        2,615,049          1  2,615,049.0  2,615,049  2,615,049  void cuda::index<float>(cuda::Param<float>, cuda::CParam<float>, cuda::AssignKernelParam, int, int) 
    12.8        1,646,066          1  1,646,066.0  1,646,066  1,646,066  KER9745534647381087054                                                                              
     6.3          812,761          1    812,761.0    812,761    812,761  void cuda::range<float>(cuda::Param<float>, int, int, int) 

The index kernel is about 3.8x faster (10,107,916 ns down to 2,615,049 ns).
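
For intuition on where that gain comes from, a back-of-the-envelope sketch (assuming threads.x maps to dim0 and threads.y to dim1 as in the description; these numbers are illustrative, not profiler output):

// The output of the test above is effectively 20,000,000 x 1.
dim3 oldBlock(32, 8);   // 256 threads, but only the 32 with threadIdx.y == 0
                        // map to the single output column (~12.5% useful work)
dim3 newBlock(256, 1);  // all 256 threads map to valid rows of that one column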

@9prady9 9prady9 merged commit d56c3bc into arrayfire:master Mar 23, 2021
@9prady9 9prady9 deleted the index_opt branch March 23, 2021 02:59