
Conversation

@chintak (Contributor) commented Apr 18, 2018

Add support for torch.bincount and t.bincount(). Solves #4570.

TODO:

  • CPU support
  • GPU support

Example::
>>> input = torch.randint(0, 8, (5,), dtype=torch.int64)
>>> weights = torch.randint(10, (5,), dtype=torch.float32) / 10
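The new function and Tensor method added by this PR would then be called as below (outputs are omitted here since input and weights above are random):

>>> torch.bincount(input)
>>> input.bincount(weights)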

@chintak (Contributor, Author) commented Apr 21, 2018

I'm trying to implement the CUDA version. I see at::cuda::CUDA_tensor_apply2, which requires the input and output sizes to be equal since the indexing is 1:1, a[i] <-> b[i] (see CUDAApplyUtils.cuh#L152).

For implementing bincount, however, I need more flexibility in terms of indexing. Specifically, I need to be able to do atomicAdd(&a[b[i]], 1). Further, I want to create an intermediary 2D device tensor that can hold a partial histogram per block, followed by a reduceDim with a sum op to get the final histogram (reference: vanilla CUDA implementation).

Are there wrappers for doing a similar kernel launch? Any pointers on how I could create such a wrapper, or other approaches to calculating bincount?
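(For illustration, the partial-histogram scheme I have in mind looks roughly like this when sketched with plain PyTorch ops instead of a kernel; nbins, num_blocks, and the chunking below are only stand-ins for the real launch configuration.)

```python
import torch

def partial_histogram_bincount(input, nbins, num_blocks=4):
    # Each chunk plays the role of one CUDA block building its own partial histogram.
    chunks = input.chunk(num_blocks)
    partial = torch.zeros(len(chunks), nbins, dtype=torch.long)
    for i, chunk in enumerate(chunks):
        # Stands in for the per-block atomicAdd(&partial[i][b[j]], 1).
        partial[i].index_add_(0, chunk, torch.ones_like(chunk))
    # The reduceDim-with-sum step: collapse the per-block partials into one histogram.
    return partial.sum(dim=0)

x = torch.randint(0, 8, (1000,), dtype=torch.int64)
hist = partial_histogram_bincount(x, nbins=8)  # shape (8,); hist.sum() == 1000
```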

@ssnl (Collaborator) left a comment

This should be differentiable wrt weights btw.

@acgtyrant (Contributor) commented Jun 6, 2018

I have not read the source code of this pull request, but I noticed that bincount can be implemented with scatter_add_; see #5405. Any ideas?
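Something along these lines, perhaps (a rough sketch of the scatter_add_ approach; weights are ignored and the helper name is only for illustration):

```python
import torch

def bincount_scatter(input, minlength=0):
    # One bin per value 0..max(input), or at least minlength bins.
    nbins = max(int(input.max()) + 1, minlength)
    out = torch.zeros(nbins, dtype=torch.long, device=input.device)
    # scatter_add_ accumulates a 1 into out[input[i]] for every element i.
    return out.scatter_add_(0, input, torch.ones_like(input))

x = torch.tensor([0, 3, 2, 1, 3])
bincount_scatter(x)  # tensor([1, 1, 1, 2])
```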

@acgtyrant (Contributor) commented Jun 7, 2018

I tried using scatter_add_ to implement bincount, but it is too slow to be usable. Moving the tensor from GPU to CPU and then using np.bincount is faster.

@acgtyrant (Contributor) commented

I am amazed to find that moving the tensor from GPU to CPU and then using scatter_add_ is faster too! It is about 1500x faster than before. I think this may be because indexing on the GPU is very slow.

@zou3519 (Contributor) commented Jun 7, 2018

@acgtyrant What sizes have you tried?

@acgtyrant (Contributor) commented

@zou3519 About 8x1024x2048.

@chintak (Contributor, Author) commented Jun 8, 2018

Benchmark with sync

Following an offline discussion with @zou3519: there was an error in my earlier benchmarking, namely that I wasn't explicitly synchronizing the GPU after the torch.bincount call. This has now been fixed. Here is the benchmark.py file used.
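The gist of the fix (a sketch, not the actual benchmark.py; sizes and iteration counts here are only illustrative): CUDA kernels launch asynchronously, so the timer has to be bracketed by torch.cuda.synchronize().

```python
import time
import torch

def time_bincount(input, iters=100):
    torch.bincount(input)        # warm-up
    torch.cuda.synchronize()     # make sure pending GPU work is finished
    start = time.time()
    for _ in range(iters):
        torch.bincount(input)
    torch.cuda.synchronize()     # wait for the kernels before stopping the clock
    return (time.time() - start) / iters

x = torch.randint(0, 5000, (10000000,), device='cuda')
print(time_bincount(x))
```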

CPU

| input size | time (s) |
| ---------- | -------- |
| 100K       | 0.0055   |
| 1M         | 0.052    |
| 10M        | 0.52     |

Shared memory implementation

| input size \ bins | 10     | 100    | 500    | 1000   | 2500   | 5000   | 7500 | 10000 |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ---- | ----- |
| 100K              | 0.0135 | 0.0007 | 0.0006 | 0.0005 | 0.0004 | 0.0004 | NA   | NA    |
| 1M                | 0.0013 | 0.0073 | 0.0048 | 0.003  | 0.0014 | 0.0016 | NA   | NA    |
| 10M               | 0.0087 | 0.0826 | 0.0454 | 0.0268 | 0.0101 | 0.0125 | NA   | NA    |

Multi-block memory implementation

| input size \ bins | 10     | 100    | 500    | 1000   | 2500   | 5000   | 7500   | 10000  |
| ----------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 100K              | 0.0148 | 0.0006 | 0.0007 | 0.0006 | 0.0005 | 0.0006 | 0.0007 | 0.0008 |
| 1M                | 0.004  | 0.004  | 0.0049 | 0.0028 | 0.0018 | 0.0023 | 0.003  | 0.0038 |
| 10M               | 0.0315 | 0.0308 | 0.0421 | 0.0246 | 0.0134 | 0.0183 | 0.0248 | 0.032  |

Global memory implementation

| input size \ bins | 10      | 100    | 500    | 1000   | 2500   | 5000   | 7500   | 10000  |
| ----------------- | ------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 100K              | 0.1677  | 0.005  | 0.0008 | 0.0004 | 0.0003 | 0.0002 | 0.0002 | 0.0002 |
| 1M                | 1.8555  | 0.0625 | 0.0071 | 0.0022 | 0.0011 | 0.0007 | 0.0006 | 0.0006 |
| 10M               | 18.5349 | 0.6453 | 0.0694 | 0.019  | 0.0086 | 0.0048 | 0.0038 | 0.0038 |

Conclusion

  1. For a small number of bins (10-100), the shared memory implementation performs best.
  2. For an intermediate number of bins (100-500), the multi-block implementation wins over the other three.
  3. For a larger number of bins (1000+), the global memory implementation performs best.

cc @zou3519, @ssnl

P.S.:

  1. All times are reported in seconds.
  2. Experiments were performed on a Tesla M40 (12 GB); the K40 seems to be having a bad day (out-of-memory errors even though nothing was running).

@ssnl (Collaborator) commented Jun 8, 2018 via email

@chintak (Contributor, Author) commented Jun 8, 2018

Builds were failing due to AT_ASSERT being passed 2 arguments. Changed it to AT_ASSERTM.

Heads up: aten/doc/Tensor.h and aten/doc/Functions.h still use AT_ASSERT with 2 arguments. These should be changed to AT_ASSERTM as well.

@zou3519 zou3519 self-assigned this Jun 11, 2018

@zou3519 (Contributor) left a comment

Went over the CUDA bits in detail. Those look good, but I had a few questions.

@zou3519 (Contributor) left a comment

lgtm, thank you for the contribution @chintak!

For the record, there are three points that we might want to revisit later:

  1. Right now, if integral weights are passed, they are cast to double rather than float to match numpy. Numpy's default floating-point type is double, but PyTorch's is really float. It might be better to deviate from numpy in this respect and instead cast the weights to float or keep them as integral types (see the small example after this list).
  2. It might be nice for bincount to be differentiable w.r.t. weights.
  3. The check for whether we go into the multi-block case is very conservative because we have the THCCachingAllocator. This means that sometimes the code will be slower than it has to be, but based on the above benchmarks it shouldn't matter too much.
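To illustrate point 1 (values are arbitrary; the dtype shown is simply what the point above describes for the behavior at merge time):

```python
import torch

x = torch.tensor([0, 1, 1, 2])
w_int = torch.tensor([1, 2, 3, 4])        # integral weights

# Per point 1 above, integral weights are cast to double,
# so the result dtype is expected to be float64.
torch.bincount(x, weights=w_int).dtype
```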

@soumith (Contributor) commented Jun 14, 2018

thanks Chintak!

@soumith soumith merged commit 21609e0 into pytorch:master Jun 14, 2018
@chintak chintak deleted the bincount branch June 16, 2018 18:59