Conversation

@ngimel ngimel commented Jul 1, 2020

Per title, fixes #40768

@ngimel ngimel requested review from pearu and vincentqb July 1, 2020 06:22
dr-ci bot commented Jul 1, 2020

💊 CI failures summary and remediations

As of commit 7184385 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 01 07:19:25 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Jul 01 07:19:25 processing existing schema:  __str__(__torch__.torch.classes._TorchScriptTesting._StackString _0) -> (str _0) 
Jul 01 07:19:25 processing existing schema:  __init__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0, int[] _1) -> (None _0) 
Jul 01 07:19:25 processing existing schema:  __getstate__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int[] _0) 
Jul 01 07:19:25 processing existing schema:  __setstate__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0, int[] _1) -> (None _0) 
Jul 01 07:19:25 processing existing schema:  top(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int _0) 
Jul 01 07:19:25 processing existing schema:  pop(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int _0) 
Jul 01 07:19:25 processing existing schema:  get(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0, Tensor _1) -> (str _0) 
Jul 01 07:19:25 processing existing schema:  __getstate__(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0) -> (int _0) 
Jul 01 07:19:25 processing existing schema:  __setstate__(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0, int _1) -> (None _0) 
Jul 01 07:19:25 processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (None _0) 
Jul 01 07:19:25 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Jul 01 07:19:25  
Jul 01 07:19:25 Broken ops: [ 
Jul 01 07:19:25 	aten::view_as(Tensor self, Tensor other) -> (Tensor) 
Jul 01 07:19:25 	aten::reshape_as(Tensor self, Tensor other) -> (Tensor) 
Jul 01 07:19:25 	aten::pin_memory(Tensor self) -> (Tensor) 
Jul 01 07:19:25 	aten::reshape(Tensor self, int[] shape) -> (Tensor) 
Jul 01 07:19:25 	aten::detach(Tensor self) -> (Tensor) 
Jul 01 07:19:25 	aten::expand_as(Tensor self, Tensor other) -> (Tensor) 
Jul 01 07:19:25 	aten::flatten.DimnameList(Tensor self, str[] dims, str out_dim) -> (Tensor) 
Jul 01 07:19:25 	aten::flatten.named_out_dim(Tensor self, int start_dim, int end_dim, str out_dim) -> (Tensor) 


@pearu pearu left a comment

LGTM, thanks for this simplification!

@pearu pearu left a comment

Suggestion to use stride tricks to avoid creating possible large temporary tensors.

Comment on lines +270 to +271
C = A.mean(dim=(-2,), keepdim=True)
return _svd_lowrank(A - C, q, niter=niter, M=None)
@pearu pearu commented

Could you change this code to:

        C = A.mean(dim=(-2,), keepdim=True)
        M_strides = [C.stride(i) for i in range(len(C.shape))]
        M_strides[-2] = 0
        M = C.as_strided(A.shape, M_strides)
        return _svd_lowrank(A, q, niter=niter, M=M)

that will avoid creating a possibly large A - C.
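
For illustration only (not part of the PR): a self-contained sketch of this stride trick, with made-up shapes. `C.expand(A.shape)` is the idiomatic spelling of the same zero-stride broadcast:

    import torch

    A = torch.randn(1000, 50)
    C = A.mean(dim=(-2,), keepdim=True)       # column means, shape (1, 50)

    M_strides = list(C.stride())
    M_strides[-2] = 0                         # step 0 along rows: repeat the mean row
    M = C.as_strided(A.shape, M_strides)      # shape (1000, 50), no copy made

    assert M.data_ptr() == C.data_ptr()       # M is a view into C's storage
    assert torch.equal(M, C.expand(A.shape))  # expand() builds the same view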

@ngimel ngimel (author) commented

Will do that, but I suspect it will just postpone the inevitable: when matmul sees discontiguous arguments, it blows them up. I'll have to double-check the matmul code, but I vaguely remember this from when I looked at it previously.

@ngimel ngimel (author) commented

Do you ever backprop through pca_lowrank? If not, and A is otherwise not needed, A - C can be computed in place, avoiding a large memory allocation.
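
A minimal sketch of the in-place variant being suggested (assuming no autograd history is needed on A; shapes made up):

    import torch

    A = torch.randn(1000, 50)
    C = A.mean(dim=(-2,), keepdim=True)   # shape (1, 50)

    # If A itself is not needed afterwards, subtract the mean in place:
    # C broadcasts over A's rows and A's storage is reused, so no new
    # (1000, 50) tensor is allocated for the centered matrix.
    A.sub_(C)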

@pearu pearu commented

> Will do that, but I suspect it will just postpone the inevitable: when matmul sees discontiguous arguments, it blows them up.

I tested the code locally; all PCA tests passed OK.

@ngimel ngimel (author) commented

Yeah, sorry, by "blow up" I meant it will materialize expanded contiguous tensors. So yeah, the tests will pass, but memory usage is possibly still suboptimal.

@ngimel ngimel (author) commented

`a`'s memory is contiguous, but the tensor `a` is not contiguous. In a contiguous tensor, no two logically different elements point to the same memory location, whereas in `a` all elements point to the same memory location.
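
A minimal sketch of the distinction being drawn, assuming `a` came from `expand` (the quoted example is not visible in this thread):

    import torch

    a = torch.zeros(1).expand(5)    # 5 logical elements backed by 1 physical element

    print(a.is_contiguous())        # False: stride is (0,), not (1,)
    print(a.stride())               # (0,): every element aliases the same location
    print(a.storage().size())       # 1: contiguous memory, non-contiguous tensor

    b = a.contiguous()              # the "blow up": materializes 5 real elements
    print(b.storage().size())       # 5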

@pearu pearu commented Jul 2, 2020

Right, the whole point of the expand-to-contiguous conversion is to ensure that the tensor data is C-contiguous and can be passed to a low-level routine that is not stride-aware, that is, one that accesses the data memory according to the shape information, not the strides. That said, matmul low-level routines ought to be stride-aware (thinking of LAPACK), which would make the expand-to-contiguous conversion unnecessary. So there could be some ideas for matmul optimization here.

@ngimel ngimel (author) commented

I hear what you are saying; however, typically matmul (let's talk about 2D tensors for simplicity) will be dispatched to a BLAS library that implements the gemm API. As you can see here, the lda/ldb parameters that provide stride-awareness are explicitly constrained to be positive, not 0. That said, instead of dispatching calls with stride-0 arguments to BLAS, doing a pointwise multiplication of the arguments followed by a reduction and expanding back to the necessary size is likely to be faster. Double points if the multiplication could be done on the fly during the reduction, so that the pointwise product A*B never has to be materialized, but unfortunately PyTorch cannot do that; you'd need KeOps for that.
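
An illustrative sketch (not library code) of the pointwise-multiply-plus-reduction rewrite, for the simplest case of a matmul whose right operand is a single row expanded with stride 0:

    import torch

    m, n, k = 512, 256, 64
    A = torch.randn(m, n)
    b = torch.randn(1, k)
    B = b.expand(n, k)              # stride-0 view: every row of B is b

    # BLAS gemm cannot take ldb = 0, so A @ B forces a contiguous copy of B.
    # Mathematically, (A @ B)[i, j] = (sum_p A[i, p]) * b[j]: a reduction
    # followed by a pointwise scale, with no (n, k) operand materialized.
    out = A.sum(dim=1, keepdim=True) * b

    assert torch.allclose(out, A @ B, atol=1e-4)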

A contributor commented

Surprised that BLAS doesn't support zero stride :( Do we need to write specialized kernels for these cases?

@ngimel ngimel (author) commented

The joys of '80s libraries for you :-) The primitive we need here to avoid materialization is "broadcasted multiplication followed by a reduction" #32591 (comment); the einsum people have been talking about it for a long time. It sounds easy, but given how varied the broadcasting patterns can be, and how generally hard it is to write an efficient reduction for various access patterns, it turns out to be pretty hard. There are libraries doing it with varying degrees of success (KeOps and cuTENSOR are the ones I know of).
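
A toy sketch of that primitive, for the one broadcasting pattern PyTorch does fuse (einsum contractions that map to matmul); arbitrary patterns are where it gets hard:

    import torch

    m, n, k = 128, 64, 32
    A = torch.randn(m, n)
    B = torch.randn(n, k)

    # Naive "broadcasted multiplication followed by a reduction":
    # materializes an (m, n, k) intermediate before summing it away.
    naive = (A[:, :, None] * B[None, :, :]).sum(dim=1)

    # The same contraction via einsum dispatches to matmul for this
    # pattern, so the (m, n, k) product is never materialized.
    fused = torch.einsum('ij,jk->ik', A, B)

    assert torch.allclose(naive, fused, atol=1e-4)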

@pearu pearu left a comment

Approving the PR as it is because my suggestion was invalid.

@facebook-github-bot facebook-github-bot left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot commented

@ngimel merged this pull request in 6095808.

@ngimel ngimel deleted the pca_lowrank branch July 6, 2020 04:18

Successfully merging this pull request may close these issues.

pca_lowrank memory allocation
