Conversation

@mfkasim1
Contributor

Fixes #92043.
I'm following numpy's implementation, as suggested by @min-jean-cho.
I found that this implementation still overflows when working with numbers greater than finfo.max / 2, but that is still much better than the previous implementation, which overflowed for numbers greater than finfo.max ** 0.5.
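For illustration, here is a minimal scalar sketch of the scaled ("Smith-style") division that numpy's implementation is based on. The actual ATen kernel is written differently (and vectorised), so this only conveys the idea:

import torch

finfo = torch.finfo(torch.complex128)
big = finfo.max / 4
x = torch.tensor(complex(big, big), dtype=torch.complex128)
y = torch.tensor(complex(big, big), dtype=torch.complex128)
a, b = x.real, x.imag
c, d = y.real, y.imag

# Naive formula: c*c + d*d overflows once |c|, |d| exceed roughly
# finfo.max ** 0.5, which is what the previous implementation suffered from.
naive = torch.complex((a * c + b * d) / (c * c + d * d),
                      (b * c - a * d) / (c * c + d * d))

# Smith-style scaling: divide through by the larger of |c|, |d| so the
# intermediate products stay representable.
if c.abs() >= d.abs():
    r = d / c
    den = c + d * r
    scaled = torch.complex((a + b * r) / den, (b - a * r) / den)
else:
    r = c / d
    den = c * r + d
    scaled = torch.complex((a * r + b) / den, (b * r - a) / den)

print(naive)   # nan+nanj
print(scaled)  # 1+0j, correct since x == y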

@pytorch-bot

pytorch-bot bot commented Jan 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92539

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 6 Pending

As of commit 01e738b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@mfkasim1
Contributor Author

I found a very strange behaviour related to the errors on this PR. The tests pass for complex64 but fail for complex128, even though complex64 got no special treatment. Further investigation shows the strange behaviour below:

import torch

device = torch.device("cpu")
dtype = torch.complex128
finfo = torch.finfo(dtype)
nom = torch.tensor([complex(finfo.min / 2, finfo.min / 2),
                    complex(finfo.max / 2, finfo.max / 2),
                    complex(finfo.tiny, finfo.tiny),
                    complex(finfo.tiny, 0.0),
                    complex(0.0, 0.0)], dtype=dtype, device=device)
denom = torch.tensor([complex(finfo.min / 2, finfo.min / 2),
                      complex(finfo.max / 2, finfo.max / 2),
                      complex(finfo.tiny, finfo.tiny),
                      complex(0.0, finfo.tiny),
                      complex(finfo.tiny, finfo.tiny)], dtype=dtype, device=device)

print(nom[:1] / denom[:1])  # works fine
print(nom[:2] / denom[:2])  # works fine
print(nom[:3] / denom[:3])  # works fine
print(nom[:4] / denom[:4])  # WRONG: all nans!
print(nom[3:4] / denom[3:4])  # works fine
print(nom[2:4] / denom[2:4])  # works fine
print(nom[1:4] / denom[1:4])  # works fine
print(nom[1:5] / denom[1:5])  # WRONG: all nans

It seems that, with complex128, the results are wrong if the length of the operands is 4 or more. Any idea why this happens?
@min-jean-cho @lezcano

@lezcano
Collaborator

lezcano commented Jan 19, 2023

That path is vectorised, so it uses vectorised CPU operations. Have a look at how they are implemented within aten/src/ATen/cpu/vec. Fixing those while keeping performance reasonable is going to be trickier, though. If you want, add a test for tensors of length 1, and then submit a follow-up PR fixing the vectorised path and extending the tests. That way we can keep the size of the PRs reasonably small.
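A rough illustration of what such a length-1 check could exercise (a hedged sketch, not the actual test added in this PR): compare the batched result against element-by-element division, which is expected to avoid the SIMD path.

import torch

def compare_scalar_vs_batched(nom, denom):
    # Batched division may take the vectorised kernel; dividing 0-dim
    # elements one by one should go through the scalar code path.
    batched = nom / denom
    elementwise = torch.stack([n / d for n, d in zip(nom, denom)])
    return batched, elementwise

# With the nom/denom tensors from the snippet above, elementwise stays finite
# while batched can turn into all-nan once the length reaches the SIMD width.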

@mfkasim1
Contributor Author

Thanks, @lezcano!
Another alternative is to use hypot:

import torch

def div(lhs, rhs):
    # (a + i * b) / (c + i * d)
    a, b = lhs.real, lhs.imag
    c, d = rhs.real, rhs.imag
    inv_denom_sqrt = torch.hypot(c, d) ** (-1)
    a2 = a * inv_denom_sqrt
    b2 = b * inv_denom_sqrt
    c2 = c * inv_denom_sqrt
    d2 = d * inv_denom_sqrt
    real = a2 * c2 + b2 * d2
    imag = b2 * c2 - a2 * d2
    return real + 1j * imag

device = torch.device("cpu")
dtype = torch.complex128
finfo = torch.finfo(dtype)
nom = torch.tensor([complex(finfo.min / 2, finfo.min / 2),
                    complex(finfo.max / 2, finfo.max / 2),
                    complex(finfo.tiny, finfo.tiny),
                    complex(finfo.tiny, 0.0),
                    complex(0.0, 0.0)], dtype=dtype, device=device)
denom = torch.tensor([complex(finfo.min / 2, finfo.min / 2),
                      complex(finfo.max / 2, finfo.max / 2),
                      complex(finfo.tiny, finfo.tiny),
                      complex(0.0, finfo.tiny),
                      complex(finfo.tiny, finfo.tiny)], dtype=dtype, device=device)

div(nom, denom)  # works fine!

This way, we don't need to worry about the vectorization path. hypot is a pretty standard function and it's available as std::hypot. I also checked that hypot exists for the 256-bit vectorization (here). What do you think?
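A quick sanity check (not part of the PR) of why hypot helps at the extremes: the naive magnitude overflows while hypot rescales internally.

import torch

finfo = torch.finfo(torch.float64)
c = torch.tensor(finfo.max / 2)
d = torch.tensor(finfo.max / 2)

print(torch.sqrt(c * c + d * d))  # inf: c*c and d*d exceed finfo.max
print(torch.hypot(c, d))          # finite, ~1.27e308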

@lezcano
Collaborator

lezcano commented Jan 19, 2023

You'd still need to fix the AVX2 and AVX512 implementations of div accordingly. And sure, you can use hypot there; that may be faster.
When you do that, it'd still be good to throw in some benchmarks for good measure.
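A hedged sketch of how such a benchmark could be set up with torch.utils.benchmark, comparing the built-in division against the composite hypot-based div() defined earlier in the thread (sizes are arbitrary and purely illustrative):

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(1_000_000, dtype=torch.complex128)
y = torch.randn(1_000_000, dtype=torch.complex128)

# `div` refers to the hypot-based composite implementation shown above.
t_builtin = benchmark.Timer(stmt="x / y", globals={"x": x, "y": y}).blocked_autorange()
t_composite = benchmark.Timer(stmt="div(x, y)",
                              globals={"x": x, "y": y, "div": div}).blocked_autorange()
print(t_builtin)
print(t_composite)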

@lezcano
Collaborator

lezcano commented Jan 19, 2023

At any rate, I'd still suggest first merging this PR, and then fixing the vectorised path on a follow-up PR.

If what you are proposing is to implement div as a composite operation, I don't think that's going to cut it. This is too basic of a building block to be able to afford that perf-wise. We need to fix these issues in the actual vectorised implementation of this operation.

@mfkasim1
Contributor Author

Thanks. I agree that perf might be a problem if we're using hypot for such a basic operation here.

@soulitzer added the triaged label Jan 20, 2023
@mfkasim1
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased compldiv onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout compldiv && git pull --rebase)

@lezcano
Collaborator

lezcano left a comment

Cool! Looking forward to the vectorisation fix!

@mfkasim1
Contributor Author

Thanks, @lezcano!

@pytorchbot merge

pytorch-bot added the ciflow/trunk label Jan 20, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed (Rule Core Reviewers). The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team. Raised by workflow job.

@lezcano
Collaborator

lezcano commented Jan 20, 2023

This PR already fixed some operations, so you can remove the xfails on those! Also, it seems to be failing on Windows, so we need to fix that.

@lezcano
Collaborator

lezcano commented Jan 25, 2023

There's still at least one xfail that needs to be removed (there's an "unexpected success" in a test) but otherwise this is ready to go!

@lezcano
Collaborator

lezcano commented Jan 25, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed (Rule Core Reviewers). The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team. Raised by workflow job.

@mfkasim1
Contributor Author

It seems that the unexpected pass on softsign only happens on Mac, and the others are still failing. I suspect this might have something to do with the vectorization on the CPU.

template <>
template <typename T>
bool SoftsignFunctor<CPUContext>::operator()(
    const int N, const T* X, T* Y, CPUContext* /* context */) const {
  ConstEigenVectorArrayMap<T> X_arr(X, N);
  EigenVectorMap<T>(Y, N) = (T(1) + X_arr.abs()).inverse() * X_arr;
  return true;
}
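For reference, the test in question stresses softsign, x / (1 + |x|), at extreme complex magnitudes, which internally exercises complex division. A hedged way to poke at the ATen path from Python (the printed result differs by platform and build, which is exactly the issue being discussed):

import torch
import torch.nn.functional as F

finfo = torch.finfo(torch.float32)
z = torch.tensor([complex(finfo.max / 2, finfo.max / 2)], dtype=torch.complex64)
# May print a value near 0.707+0.707j or nan, depending on whether the
# platform's (vectorised) complex division overflows on the intermediates.
print(F.softsign(z))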

@lezcano
Collaborator

lezcano commented Jan 25, 2023

That code you found is from caffe. I don't think that code is tested in CI.
So, it seems that it was passing on CUDA (and MPS I guess as it's not failing) but not on CPU.
As such, adding a device="cpu" in the xfail should do the trick.

@mfkasim1
Contributor Author

mfkasim1 commented Jan 25, 2023

It fails (i.e. produces nan) only on CPU on non-Mac platforms (based on what I see here: https://hud.pytorch.org/pytorch/pytorch/pull/92539?sha=6effd2f1e5d6aa64aeb9dc1d729eb214cc52c592). So I guess we should also add active_if=not IS_MACOS.

@lezcano
Collaborator

lezcano commented Jan 25, 2023

It also passes on CUDA. See https://github.com/pytorch/pytorch/actions/runs/4004960425/jobs/6876076243 (or see how there are no failing CUDA jobs when you removed the xfail).

Comment on lines 12739 to 12740
"test_reference_numerics_large", dtypes=(torch.complex64,), device_type='cpu',
active_if=not IS_MACOS),),
Collaborator

I think the active_if is not necessary, as on Mac device_type == "mps", I believe. In any case, no need to change it for this PR. You can try removing it in the next PR if you feel like it; otherwise it's alright as well.

@lezcano
Collaborator

lezcano commented Jan 25, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed (Rule Core Reviewers). The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team. Raised by workflow job.

@lezcano
Collaborator

lezcano commented Jan 25, 2023

The test also seems to pass on CPU on Windows, lol.
I'm sorry that this is happening. Let's just wait for the CI to finish to see if there's any other case where it may be passing, and add those to the active_if=not ...

@mfkasim1
Contributor Author

No need to be sorry. I appreciate the complexity of pytorch (testing on a lot of platforms) and the fact that complex division is not a rarely used function. It seems that only Windows has the unexpected success, so I guess active_if=not IS_MACOS and not IS_WINDOWS should work (I hope my logic is still intact).

Do you know about the other error? RuntimeError: test_jit_fuser_te failed! Received signal: SIGIOT
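Roughly what the combined active_if condition suggested above could look like in the OpInfo skip list, extending the snippet quoted earlier in the thread. This is a hedged sketch: the DecorateInfo wrapper, import paths, and test-class name are reconstructed from memory and may not match the file exactly.

import unittest
import torch
from torch.testing._internal.common_utils import IS_MACOS, IS_WINDOWS
from torch.testing._internal.common_methods_invocations import DecorateInfo  # import path may differ

skips = (
    DecorateInfo(unittest.expectedFailure, 'TestUnaryUfuncs',
                 "test_reference_numerics_large", dtypes=(torch.complex64,),
                 device_type='cpu',
                 active_if=not IS_MACOS and not IS_WINDOWS),
)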

@lezcano
Collaborator

lezcano commented Jan 26, 2023

@pytorchbot merge

13th's a charm

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@mfkasim1
Contributor Author

13th's a charm

It really is! Thank you, @lezcano! I'll do the vectorisation next.

pytorchmergebot pushed a commit that referenced this pull request Feb 3, 2023
Fixes #92043 and completes #92539 by implementing the vectorized, more stable complex division.
I implement this using the internal `abs_` function to avoid branching. I also re-implement the internal `abs_` to make it more stable.

Pull Request resolved: #93277
Approved by: https://github.com/peterbell10, https://github.com/lezcano
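For context, a hedged Python sketch of the scaling idea behind a more stable complex abs; the follow-up PR implements this inside the vectorised C++ kernels, not in Python.

import torch

def stable_complex_abs(z):
    # Factor out the larger of |re|, |im| so the squares cannot overflow;
    # the ratio r is at most 1, so 1 + r*r stays small.
    a, b = z.real.abs(), z.imag.abs()
    hi = torch.maximum(a, b)
    lo = torch.minimum(a, b)
    r = torch.where(hi == 0, torch.zeros_like(hi), lo / hi)
    return hi * torch.sqrt(1 + r * r)

big = torch.finfo(torch.float64).max / 2
z = torch.tensor(complex(big, big), dtype=torch.complex128)
print(stable_complex_abs(z))  # finite (~1.27e308) where a naive sqrt(a*a + b*b) would give inf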

Labels: ciflow/trunk, Merged, open source, triaged

Linked issue: Wrongly returns nan for complex numbers division

6 participants