bincount feature implementation #6688
Conversation
torch/functional.py (outdated)
    Example::
        >>> input = torch.randint(0, 8, (5,), dtype=torch.int64)
        >>> weights = torch.randint(10, (5,), dtype=torch.float32) / 10
torch/functional.py (outdated)
        0.4000
    [torch.DoubleTensor of size (8,)]
    """
    return torch._bincount(input, weights, minlength)
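For readers skimming the thread, here is a minimal usage sketch of the semantics being documented above. The input and weight values are illustrative (chosen to be consistent with the counts shown in the final docs further down), not the randint example from the diff:

```python
import torch

# Illustrative values only; `weights` is hypothetical.
input = torch.tensor([3, 4, 4, 3, 6])
weights = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5])

print(torch.bincount(input))           # tensor([0, 0, 0, 2, 2, 0, 1])
# With weights, bin i accumulates the weights at positions where input == i.
print(torch.bincount(input, weights))  # tensor([0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.5])
```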
test/test_torch.py (outdated)
    def test_bincount_cpu(self):
        # test weights
        byte_counts = torch.bincount(torch.ByteTensor([0, 1, 1, 1, 4]),
aten/src/ATen/native/Bincount.cpp (outdated)
namespace at { namespace native {

#define AT_DISPATCH_BINCOUNT_INT_CASE(scalar, type) \
test/test_torch.py (outdated)
        long_counts = torch.LongTensor([0, 3, 2, 1, 3]).bincount()
        self.assertEqual(torch.LongTensor([1, 1, 1, 2]), long_counts)
        # test minlength functionality
        int_counts = torch.bincount(torch.IntTensor([1, 1, 1, 1]), minlength=5)
aten/src/ATen/native/Bincount.cpp (outdated)
#include <tuple>

namespace at { namespace native {
aten/src/ATen/native/Bincount.cpp (outdated)
template <typename integral_t>
Tensor _bincount_cpu_template(
    const Tensor& self,
    const Tensor& weights = {},
aten/src/ATen/native/Bincount.cpp (outdated)
  integral_t* self_p = self.contiguous().data<integral_t>();
  if (isWeights) {
    output = zeros(CPU(kDouble), {nbins});
aten/src/ATen/native/Bincount.cpp (outdated)
        "bincount only supports 1-d non-negative integral inputs.");
  }

  bool isWeights = weights.defined();
aten/src/ATen/native/Bincount.cpp (outdated)
    const Tensor& weights = {},
    int64_t minlength = 0) {
  if (minlength < 0) {
    throw std::domain_error("minlength should be >= 0");
aten/src/ATen/native/SummaryOps.cpp (outdated)
Tensor
_bincount_cpu(const Tensor& self, const Tensor& weights, int64_t minlength) {
  return AT_DISPATCH_INTEGRAL_TYPES(self.type(), "bincount", [&] {
    if (weights.type().scalarType() == ScalarType::Float)
aten/src/ATen/native/SummaryOps.cpp (outdated)
  if (minlength < 0) {
    throw std::domain_error("minlength should be >= 0");
  }
  if (self.dim() != 1 || self.numel() == 0 ||
I'm trying to implement the CUDA version. For implementing bincount, however, I need more flexibility in terms of indexing than the existing kernel-launch helpers seem to provide. Are there wrappers for doing a similar kernel launch? Any pointers on how I can create such a wrapper?
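For context, the access pattern bincount needs is essentially a scatter-add. The loop below is only a naive Python restatement of the semantics under discussion, not the CUDA kernel from this PR:

```python
import torch

def bincount_reference(input, weights=None, minlength=0):
    """Naive reference for bincount semantics: out[input[i]] += weights[i]."""
    # The data-dependent write index (input[i]) is exactly the kind of
    # indexing the CUDA kernel has to express.
    nbins = max(int(input.max()) + 1, minlength)
    out_dtype = weights.dtype if weights is not None else torch.long
    out = torch.zeros(nbins, dtype=out_dtype)
    for i in range(input.numel()):
        out[input[i]] += weights[i] if weights is not None else 1
    return out
```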
        Should be of same size as input tensor.
    minlength (int): optional, min number of bins in the output tensor.
        Should be non-negative.
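As a small illustration of minlength (mirroring the CPU test earlier in this thread; a sketch, not new behavior):

```python
import torch

# minlength pads the output with empty bins even though the largest
# value in the input is only 1.
print(torch.bincount(torch.tensor([1, 1, 1, 1]), minlength=5))
# tensor([0, 4, 0, 0, 0])
```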
torch/_torch_docs.py (outdated)
     6
     3
     4
    [torch.LongTensor of size (5,)]
test/test_torch.py (outdated)
        self.assertEqual(torch.tensor([0, 4, 0, 0, 0], dtype=long_t), int_counts)

        # negative input throws
        with self.assertRaises(RuntimeError):
test/test_cuda.py (outdated)
        TestTorch._test_random_neg_values(self, use_cuda=True)

    def test_bincount_cuda(self):
        cuda = torch.device('cuda')
test/test_cuda.py (outdated)
        # test large input size
        big_exp = torch.zeros(2).cuda()
        big_exp[1] = 1000000
        big_out = torch.ones(1000000, dtype=torch.int8).cuda().bincount()
test/test_cuda.py (outdated)
        self.assertEqual(
            torch.tensor([0.1, 0.9, 0, 0, 0.5], device=cuda), byte_counts)
        # test large number of bins - global memory use
        big_exp = torch.zeros(100000000).cuda()
aten/src/ATen/native/SummaryOps.cpp (outdated)
///////////////// bincount /////////////////
namespace {
template <typename integral_t, typename T>
aten/src/ATen/native/SummaryOps.cpp (outdated)
  }

  bool has_weights = weights.defined();
  if (has_weights && weights.numel() != self.numel()) {
///////////////// bincount /////////////////
namespace {
template <typename weights_t, typename integral_t>
  if (self.dim() != 1 || self.numel() == 0 ||
      !isIntegralType(self.type().scalarType()) ||
      (!std::is_same<integral_t, uint8_t>::value &&
       *self.min().toBackend(kCPU).data<integral_t>() < 0)) {
  auto sharedMem = nbins * sizeof(scalar1) + 8; // 8 guard bytes

#define GET_WEIGHTS_DUMMY_OP [] __device__(IndexType) { return 1L; }
#define GET_WEIGHTS_OP \
namespace at {

namespace cuda { namespace detail {
ssnl left a comment:
This should be differentiable wrt weights btw.
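As a side note on that point, a weights-differentiable equivalent can be sketched with scatter_add in recent PyTorch; this is only an illustrative workaround, not something this PR implements:

```python
import torch

input = torch.tensor([3, 4, 4, 3, 6])
weights = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], requires_grad=True)

nbins = int(input.max()) + 1
# scatter_add is differentiable w.r.t. its source tensor, so gradients
# flow back into `weights`, unlike the bincount op itself.
counts = torch.zeros(nbins).scatter_add(0, input, weights)
counts.sum().backward()
print(weights.grad)  # tensor([1., 1., 1., 1., 1.])
```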
I have not read the source code of this pull, but I noticed that the […]

I have tried to use the […]

I am amazed to find that moving the tensor from GPU to CPU and then using […]

@acgtyrant What sizes have you tried?

@zou3519 About 8x1024x2048.
Benchmark with sync

Following offline discussion with @zou3519, there was an error with my earlier benchmarking, in that I wasn't explicitly synchronizing the GPU after the torch.bincount call. This has now been fixed. Here is the benchmark.py file used: https://pastebin.com/pci0nQtv

CPU

| input size | time   |
|------------|--------|
| 100K       | 0.0055 |
| 1M         | 0.052  |
| 10M        | 0.52   |

Shared memory usage

| input size \ #bins | 10     | 100    | 500    | 1000   | 2500   | 5000   | 7500 | 10000 |
|--------------------|--------|--------|--------|--------|--------|--------|------|-------|
| 100K               | 0.0135 | 0.0007 | 0.0006 | 0.0005 | 0.0004 | 0.0004 | NA   | NA    |
| 1M                 | 0.0013 | 0.0073 | 0.0048 | 0.003  | 0.0014 | 0.0016 | NA   | NA    |
| 10M                | 0.0087 | 0.0826 | 0.0454 | 0.0268 | 0.0101 | 0.0125 | NA   | NA    |

Multi-block memory usage

| input size \ #bins | 10     | 100    | 500    | 1000   | 2500   | 5000   | 7500   | 10000  |
|--------------------|--------|--------|--------|--------|--------|--------|--------|--------|
| 100K               | 0.0148 | 0.0006 | 0.0007 | 0.0006 | 0.0005 | 0.0006 | 0.0007 | 0.0008 |
| 1M                 | 0.004  | 0.004  | 0.0049 | 0.0028 | 0.0018 | 0.0023 | 0.003  | 0.0038 |
| 10M                | 0.0315 | 0.0308 | 0.0421 | 0.0246 | 0.0134 | 0.0183 | 0.0248 | 0.032  |

Global memory usage

| input size \ #bins | 10      | 100    | 500    | 1000   | 2500   | 5000   | 7500   | 10000  |
|--------------------|---------|--------|--------|--------|--------|--------|--------|--------|
| 100K               | 0.1677  | 0.005  | 0.0008 | 0.0004 | 0.0003 | 0.0002 | 0.0002 | 0.0002 |
| 1M                 | 1.8555  | 0.0625 | 0.0071 | 0.0022 | 0.0011 | 0.0007 | 0.0006 | 0.0006 |
| 10M                | 18.5349 | 0.6453 | 0.0694 | 0.019  | 0.0086 | 0.0048 | 0.0038 | 0.0038 |

Conclusion

1. For a small number of bins (10-100), the shared-memory implementation performs best.
2. For the intermediate range of bins (100-500), the multi-block implementation wins over the other three.
3. For a larger number of bins (1000+), the global-memory implementation performs best.

cc @zou3519, @ssnl

P.S.:

1. All times are reported in seconds.
2. Experiments were performed on a Tesla M40 (12 GB); the K40 seems to be having a bad day (out-of-memory errors even though nothing was running).
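For reference, the timing pattern with explicit synchronization looks roughly like the sketch below (the linked benchmark.py is the actual script; function and variable names here are made up):

```python
import time
import torch

def time_gpu_bincount(n, nbins):
    x = torch.randint(nbins, (n,), dtype=torch.int64, device='cuda')
    torch.cuda.synchronize()   # finish setup before starting the clock
    start = time.time()
    x.bincount()
    torch.cuda.synchronize()   # wait for the kernel itself, not just the launch
    return time.time() - start
```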
Cool! I'll do another pass tomorrow.
About your K40, out of curiosity, do you see memory occupied in nvidia-smi?
Builds failing due to […]. Heads up: files […]
          ->multiProcessorCount *
          AT_APPLY_BLOCKS_PER_SM,
      grid.x);
#endif
      at::globalContext().getCurrentDeviceProperties()->sharedMemPerBlock;
  auto sharedMem = nbins * sizeof(output_t) + 8; // 8 guard bytes
  // determine memory type to use in the kernel
  if (nbins < block.x && sharedMem < maxSharedMem) {
  detail::TensorInfo<output_t, IndexType> pInfo = aInfo;
  Tensor partial_output;
  if (memType == CUDAHistogramMemoryType::MULTI_BLOCK) {
    partial_output = a.type().zeros({grid.x, nbins});
torch/_torch_docs.py (outdated)
    minlength (int): optional, min number of bins. Should be non-negative.
Shape:
    output (Tensor): ``Size([argmax(input) + 1])``
  using IndexType = int64_t;
  auto aInfo = detail::getTensorInfo<output_t, IndexType>(a);
  auto bInfo = detail::getTensorInfo<input_t, IndexType>(b);
  detail::TensorInfo<output_t, IndexType> pInfo = aInfo;
zou3519 left a comment:
Went over the CUDA bits in detail. Those look good, but I had a few questions.
lgtm, thank you for the contribution @chintak!
For the record, there are three points that we might want to revisit later:
- Right now, if integral weights are passed, they are cast to double rather than float to match NumPy. NumPy's default floating-point type is double, but PyTorch's is really float. It might be better to deviate from NumPy in this respect and instead cast the weights to float or keep them as integral types (a quick check of this behavior is sketched below).
- It might be nice for bincount to be differentiable w.r.t. weights.
- The check for whether we go into the multi-block case is very conservative because we have the THCCachingAllocator. This means that sometimes the code will be slower than it has to be, but based on the benchmarks above it shouldn't matter too much.
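As a quick check of the first point (a sketch of the behavior described above; the dtype policy may change if this is revisited later):

```python
import torch

input = torch.tensor([0, 1, 1, 2])
int_weights = torch.tensor([1, 2, 3, 4])  # integral weights

counts = torch.bincount(input, int_weights)
# Per the point above, integral weights are accumulated in double precision
# to match NumPy, so the result dtype is expected to be torch.float64.
print(counts.dtype)
```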
thanks Chintak!
    >>> torch.bincount(input)
    tensor([0, 0, 0, 2, 2, 0, 1])
    >>> input.bincount(weights)
Add support for torch.bincount and t.bincount(). Solves #4570.

TODO: