
Conversation

@chintak (Contributor) commented Apr 18, 2018

Add support for torch.bincount and t.bincount(). Solves #4570.

TODO:

  • CPU support
  • GPU support

Example::
>>> input = torch.randint(0, 8, (5,), dtype=torch.int64)
>>> weights = torch.randint(10, (5,), dtype=torch.float32) / 10
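The new function and Tensor method added by this PR would then be called as below (outputs are omitted here since input and weights above are random):

>>> torch.bincount(input)
>>> input.bincount(weights)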

@chintak (Contributor, Author) commented Apr 21, 2018

I'm trying to implement the CUDA version. I see at::cuda::CUDA_tensor_apply2, which requires the input and output sizes to be equal since the indexing is 1:1, a[i] <-> b[i] (see CUDAApplyUtils.cuh#L152).

For implementing bincount, however, I need more flexibility in terms of indexing. Specifically, I need to be able to do atomicAdd(&a[b[i]], 1). Further, I want to create an intermediary 2D device tensor that can hold a partial histogram per block, followed by a reduceDim with a sum op to get the final histogram (reference: vanilla CUDA implementation).

Are there wrappers for doing a similar kernel launch? Any pointers on how I could create such a wrapper, or other approaches to calculating bincount?
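(For illustration, the partial-histogram scheme I have in mind looks roughly like this when sketched with plain PyTorch ops instead of a kernel; nbins, num_blocks, and the chunking below are only stand-ins for the real launch configuration.)

```python
import torch

def partial_histogram_bincount(input, nbins, num_blocks=4):
    # Each chunk plays the role of one CUDA block building its own partial histogram.
    chunks = input.chunk(num_blocks)
    partial = torch.zeros(len(chunks), nbins, dtype=torch.long)
    for i, chunk in enumerate(chunks):
        # Stands in for the per-block atomicAdd(&partial[i][b[j]], 1).
        partial[i].index_add_(0, chunk, torch.ones_like(chunk))
    # The reduceDim-with-sum step: collapse the per-block partials into one histogram.
    return partial.sum(dim=0)

x = torch.randint(0, 8, (1000,), dtype=torch.int64)
hist = partial_histogram_bincount(x, nbins=8)  # shape (8,); hist.sum() == 1000
```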

@ssnl (Collaborator) left a comment

This should be differentiable wrt weights btw.

@acgtyrant (Contributor) commented Jun 6, 2018

I have not read the source code of this pull request, but I noticed that bincount can be implemented with scatter_add_; see #5405. Any ideas?
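Something along these lines, perhaps (a rough sketch of the scatter_add_ approach; weights are ignored and the helper name is only for illustration):

```python
import torch

def bincount_scatter(input, minlength=0):
    # One bin per value 0..max(input), or at least minlength bins.
    nbins = max(int(input.max()) + 1, minlength)
    out = torch.zeros(nbins, dtype=torch.long, device=input.device)
    # scatter_add_ accumulates a 1 into out[input[i]] for every element i.
    return out.scatter_add_(0, input, torch.ones_like(input))

x = torch.tensor([0, 3, 2, 1, 3])
bincount_scatter(x)  # tensor([1, 1, 1, 2])
```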

@acgtyrant (Contributor) commented Jun 7, 2018

I tried using scatter_add_ to implement bincount, but it is too slow to be usable. Moving the tensor from GPU to CPU and then using np.bincount is faster.

@acgtyrant (Contributor) commented

I am amazed to find that moving the tensor from GPU to CPU and then using scatter_add_ is faster too! It is about 1500x faster than before. I think this may be because indexing on the GPU is very slow.

@zou3519 (Contributor) commented Jun 7, 2018

@acgtyrant What sizes have you tried?

@acgtyrant (Contributor) commented

@zou3519 About 8x1024x2048.

@chintak (Contributor, Author) commented Jun 8, 2018

Benchmark with sync

Following an offline discussion with @zou3519: there was an error in my earlier benchmarking, namely that I wasn't explicitly synchronizing the GPU after the torch.bincount call. This has now been fixed. Here is the benchmark.py file used.
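The gist of the fix (a sketch, not the actual benchmark.py; sizes and iteration counts here are only illustrative): CUDA kernels launch asynchronously, so the timer has to be bracketed by torch.cuda.synchronize().

```python
import time
import torch

def time_bincount(input, iters=100):
    torch.bincount(input)        # warm-up
    torch.cuda.synchronize()     # make sure pending GPU work is finished
    start = time.time()
    for _ in range(iters):
        torch.bincount(input)
    torch.cuda.synchronize()     # wait for the kernels before stopping the clock
    return (time.time() - start) / iters

x = torch.randint(0, 5000, (10000000,), device='cuda')
print(time_bincount(x))
```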

CPU

| input size | time (s) |
| ---------- | -------- |
| 100K       | 0.0055   |
| 1M         | 0.052    |
| 10M        | 0.52     |

Shared memory implementation

| input size \ bins | 10     | 100    | 500    | 1000   | 2500   | 5000   | 7500 | 10000 |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ---- | ----- |
| 100K              | 0.0135 | 0.0007 | 0.0006 | 0.0005 | 0.0004 | 0.0004 | NA   | NA    |
| 1M                | 0.0013 | 0.0073 | 0.0048 | 0.003  | 0.0014 | 0.0016 | NA   | NA    |
| 10M               | 0.0087 | 0.0826 | 0.0454 | 0.0268 | 0.0101 | 0.0125 | NA   | NA    |

Multi-block memory implementation

| input size \ bins | 10     | 100    | 500    | 1000   | 2500   | 5000   | 7500   | 10000  |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 100K              | 0.0148 | 0.0006 | 0.0007 | 0.0006 | 0.0005 | 0.0006 | 0.0007 | 0.0008 |
| 1M                | 0.004  | 0.004  | 0.0049 | 0.0028 | 0.0018 | 0.0023 | 0.003  | 0.0038 |
| 10M               | 0.0315 | 0.0308 | 0.0421 | 0.0246 | 0.0134 | 0.0183 | 0.0248 | 0.032  |

Global memory implementation

| input size \ bins | 10      | 100    | 500    | 1000   | 2500   | 5000   | 7500   | 10000  |
| ----------------- | ------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 100K              | 0.1677  | 0.005  | 0.0008 | 0.0004 | 0.0003 | 0.0002 | 0.0002 | 0.0002 |
| 1M                | 1.8555  | 0.0625 | 0.0071 | 0.0022 | 0.0011 | 0.0007 | 0.0006 | 0.0006 |
| 10M               | 18.5349 | 0.6453 | 0.0694 | 0.019  | 0.0086 | 0.0048 | 0.0038 | 0.0038 |

Conclusion

  1. For a small number of bins (10-100), the shared memory implementation performs best.
  2. For an intermediate number of bins (100-500), the multi-block implementation wins over the other three.
  3. For a larger number of bins (1000+), the global memory implementation performs best.

cc @zou3519, @ssnl

P.S.:

  1. All times are reported in seconds.
  2. Experiments were performed on a Tesla M40 (12 GB); the K40 seems to be having a bad day (out-of-memory errors even though nothing was running).

@ssnl (Collaborator) commented Jun 8, 2018 via email

@chintak (Contributor, Author) commented Jun 8, 2018

Builds were failing due to AT_ASSERT being passed 2 arguments. Changed it to AT_ASSERTM.

Heads up: aten/doc/Tensor.h and aten/doc/Functions.h still use AT_ASSERT with 2 arguments. These should be changed to AT_ASSERTM as well.

@zou3519 zou3519 self-assigned this Jun 11, 2018

@zou3519 (Contributor) left a comment

Went over the CUDA bits in detail. Those look good, but I had a few questions.

@zou3519 (Contributor) left a comment

lgtm, thank you for the contribution @chintak!

For the record, there are three points that we might want to revisit later:

  1. Right now, if integral weights are passed, they are cast to double rather than float to match numpy. Numpy's default floating-point type is double, but PyTorch's is really float. It might be better to deviate from numpy in this respect and instead cast the weights to float or keep them as integral types (see the small example after this list).
  2. It might be nice for bincount to be differentiable w.r.t. weights.
  3. The check for whether we go into the multi-block case is very conservative because we have the THCCachingAllocator. This means that sometimes the code will be slower than it has to be, but based on the above benchmarks it shouldn't matter too much.
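To illustrate point 1 (values are arbitrary; the dtype shown is simply what the point above describes for the behavior at merge time):

```python
import torch

x = torch.tensor([0, 1, 1, 2])
w_int = torch.tensor([1, 2, 3, 4])        # integral weights

# Per point 1 above, integral weights are cast to double,
# so the result dtype is expected to be float64.
torch.bincount(x, weights=w_int).dtype
```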

@soumith (Contributor) commented Jun 14, 2018

thanks Chintak!

@soumith soumith merged commit 21609e0 into pytorch:master Jun 14, 2018
@chintak chintak deleted the bincount branch June 16, 2018 18:59