[ready] Move bernoulli into ATen #10273
Conversation
Force-pushed from ed451c1 to 70ae05c
Force-pushed from 3350db8 to af1686a
ping @ssnl

yeah I'll fix windows error
Force-pushed from 8b50449 to aefa656
Force-pushed from 56eff49 to a3fa723
facebook-github-bot left a comment
SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
+ pytorch/pytorch#10236 : torch.bernoulli's out kwarg is broken; fixed by moving `bernoulli_out` to ATen
+ pytorch/pytorch#9917 : BUG: torch.bernoulli(p.expand(shape)) is broken; fixed by moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ pytorch/pytorch#10357 : torch.bernoulli inconsistent gpu/cpu results; fixed by adding CUDA asserts

In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, indicating that we want to process `step` values at a time for each of the `N` tensors. The calling convention for `step = 1` (the default) is unchanged. But if `step > 1`, the given lambda `op` must take `int n` as its first argument, representing the number of valid values, because there may be fewer than `step` values left at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like:

```cpp
// The template argument `4` below indicates that we want to operate on four
// elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
    ret, p,
    [seeds] __device__(
        int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
        const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
      curandStatePhilox4_32_10_t state;
      curand_init(
          seeds.first,
          blockIdx.x * blockDim.x + threadIdx.x,
          seeds.second,
          &state);
      float4 rand = curand_uniform4(&state);
      // Intentional fall-through: only the first `n` outputs are valid and filled.
      switch (n) {
        case 4: {
          assert(0 <= p4 && p4 <= 1);
          v4 = static_cast<scalar_t>(rand.w <= p4);
        }
        case 3: {
          assert(0 <= p3 && p3 <= 1);
          v3 = static_cast<scalar_t>(rand.z <= p3);
        }
        case 2: {
          assert(0 <= p2 && p2 <= 1);
          v2 = static_cast<scalar_t>(rand.y <= p2);
        }
        case 1: {
          assert(0 <= p1 && p1 <= 1);
          v1 = static_cast<scalar_t>(rand.x <= p1);
        }
      }
    }
);
```

Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:

post patch
```
➜  ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)   6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)  0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()       0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()      0.02167394384741783 +- 2.3818030967959203e-05
```

pre-patch
```
➜  ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)   12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)  0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()       1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()      0.058352887630462646 +- 0.003094920190051198
```

Pull Request resolved: pytorch/pytorch#10273

Differential Revision: D9831294

Pulled By: SsnL

fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088
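The `bern.py` script used for these timings isn't included in the PR. A minimal sketch of what such a script could look like, assuming only the tensor shape, the 20x20 repeat/loop structure, and the four timed expressions from the summary (the `bench` helper and all other names are hypothetical):

```python
# Hypothetical reconstruction of a bern.py-style timing script; only the tensor
# shape, the 20x20 repeat/loop structure, and the four timed expressions are
# taken from the summary above.
import timeit
import torch

x = torch.rand(200, 300, 400)   # CPU probability tensor
xc = x.cuda()                   # CUDA probability tensor

def bench(label, fn, repeats=20, loops=20):
    # Run with CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and
    # wall-clock time reflects actual kernel time.
    times = [timeit.timeit(fn, number=loops) for _ in range(repeats)]
    mean = sum(times) / len(times)
    std = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5
    print(label, mean, "+-", std)

bench("torch.bernoulli(x)", lambda: torch.bernoulli(x))
bench("torch.bernoulli(xc)", lambda: torch.bernoulli(xc))
bench("x.bernoulli_()", lambda: x.bernoulli_())
bench("xc.bernoulli_()", lambda: xc.bernoulli_())
```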