
Conversation

@emcastillo (Collaborator) commented Jun 4, 2020

Fixes #38716 and #37234

In some operations, a plain running reduction over a large array can cause precision issues that lead to incorrect results. This happens in the outer-reduction case, where the reduction is parallelized across columns and no partial results are kept.

This PR batches these operations so that several intermediate results are kept, which gives more stable results and fewer precision issues. The changes are minimal, and I believe it is better to do it here than in a specific reduction routine such as sum.
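To make the idea concrete, here is a minimal 1-D sketch of the batching scheme (my own illustration, not the PR's actual TensorIterator code; `batched_sum` and `chunk_size` are hypothetical names). Each chunk gets a fresh partial accumulator, and the partials are combined afterwards:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical illustration of the batched reduction: a fresh partial
// accumulator per chunk keeps intermediate results small, which reduces
// the rounding error of a long float32 running sum.
float batched_sum(const float* data, std::size_t n, std::size_t chunk_size = 256) {
  float total = 0.0f;
  for (std::size_t start = 0; start < n; start += chunk_size) {
    const std::size_t end = std::min(start + chunk_size, n);
    float partial = 0.0f;            // fresh accumulator for this batch
    for (std::size_t i = start; i < end; ++i) {
      partial += data[i];
    }
    total += partial;                // combine per-batch partial results
  }
  return total;
}
```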

TODO:
- Decide on a proper granularity; the current value of 256 was chosen almost arbitrarily.
- Measure performance.
- Test.

This PR is mostly for discussing this alternative; please share your thoughts and I will keep working on it.

@emcastillo (Collaborator, Author)

This needs to be discussed

@dr-ci (bot) commented Jun 4, 2020

💊 CI failures summary and remediations

As of commit 69d910d (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@peterbell10 (Collaborator)

Hey @emcastillo, I tried to include your version in the error comparison for #39516, but found that your non-vectorized loop has exactly the same error as master. basic_loop accumulates directly into the output, so you are sharing the same accumulator between batches, and it performs the same additions as before.
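To spell out why that happens (hypothetical code, not the actual basic_loop): if every batch writes straight into the shared output accumulator, the floating-point additions are performed in exactly the same order as an unbatched running sum, so the rounding error cannot change.

```cpp
#include <algorithm>
#include <cstddef>

// Illustration of the issue: the loop is batched, but all batches share
// one accumulator, so this performs the same additions in the same order
// as a plain running sum and therefore has the same rounding error.
float batched_but_shared(const float* data, std::size_t n, std::size_t chunk) {
  float out = 0.0f;                  // single accumulator shared by all batches
  for (std::size_t start = 0; start < n; start += chunk) {
    const std::size_t end = std::min(start + chunk, n);
    for (std::size_t i = start; i < end; ++i) {
      out += data[i];                // identical addition sequence as before
    }
  }
  return out;
}
```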

@emcastillo (Collaborator, Author)

Yeah, you are right; let me rework it.

@mruberry requested a review from VitalyFedyunin June 12, 2020 04:18
@mruberry added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jun 12, 2020
@emcastillo (Collaborator, Author)

I added some code that removes basic_loop and instead adds a simplified, non-vectorized version of reduction_128. This also makes it possible to get rid of the code that adapts the reduction parameters to the element-wise loop that basic_loop expects.

I still have to measure performance.

@emcastillo force-pushed the batch-reduction branch 2 times, most recently from 1cbcb40 to 483ed1c, June 12, 2020 07:00
@emcastillo (Collaborator, Author)

It seems that some tests are affected; I will try to fix it.

@emcastillo (Collaborator, Author)

I don't feel happy with this change, so I am open to suggestions.

@peterbell10 (Collaborator)

Here is the error comparison from your latest commit.
Errors (collapsed details)

@emcastillo (Collaborator, Author)

Thanks!
Tuning the block_size will have a great impact on the error measurements; I selected 256 without giving it much thought.
This PR is a general solution that might apply to routines other than sum (although there don't seem to be other ones that could be greatly affected by this issue).
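For intuition about how much the block size (and batching at all) can matter, here is a small standalone demo, my own sketch rather than code from this PR, that sums 2^25 copies of 1.0f in float32 with different chunk sizes:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Chunked float32 sum with a fresh partial accumulator per chunk.
float chunked_sum(const float* data, std::size_t n, std::size_t chunk) {
  float total = 0.0f;
  for (std::size_t start = 0; start < n; start += chunk) {
    const std::size_t end = std::min(start + chunk, n);
    float partial = 0.0f;
    for (std::size_t i = start; i < end; ++i) {
      partial += data[i];
    }
    total += partial;
  }
  return total;
}

int main() {
  const std::size_t n = std::size_t{1} << 25;   // true sum = 33554432
  std::vector<float> v(n, 1.0f);
  // chunk == n degenerates into a plain running sum, which stalls at 2^24
  // because adding 1.0f to 16777216.0f rounds back down; smaller chunks
  // keep every partial exactly representable and recover the exact total.
  for (std::size_t chunk : {n, std::size_t{4096}, std::size_t{256}}) {
    std::printf("chunk=%zu  sum=%.1f\n", chunk, chunked_sum(v.data(), n, chunk));
  }
  return 0;
}
```

This is only a toy input, of course; for real data the best chunk size also trades error against how well the inner loop vectorizes.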

@ngimel (Collaborator) commented Oct 21, 2020

#39516, which achieves the same goal, has been merged. Can we close this?

@emcastillo (Collaborator, Author)

I am fine with closing it. I was just wondering whether, since this is a generic solution at the level of the reduction algorithm, other reduction routines could benefit from it, but I can't think of any other than sum/prod.

@emcastillo emcastillo closed this Oct 27, 2020

Labels

open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch.mean returns a wildly incorrect result 0.3277 on YCbCr version of CIFAR10 on CPU with dtype=float32

5 participants