
Conversation


@IvanKobzarev IvanKobzarev commented Feb 3, 2022

Stack from ghstack (oldest at bottom):

Differential Revision: D33960933

Result:
mean inference time for the quantized segmentation model drops from 49 ms to 22 ms, matching the original model :)

The ultimate goal is to eliminate memcpy in external_functions for ATen operators that have no out variants (the caller has no control over where the output tensor will be placed).

  1. Introducing ExternalCall2, which has bufs_out and bufs_in.
    The size of buf_ptrs is bufs_out + bufs_in + bufs_out:
    • the first bufs_out slots hold the buffer pointers of the result tensors,
    • the next bufs_in slots hold the buffer pointers of the input arguments,
    • the last bufs_out slots store each result's TensorImpl* as a void*, so the buffer pointers can be released after the buf_out pointers have been used.

In the external_functions implementation we call c10::intrusive_ptr::inc_ref to keep the result tensor alive; it is later freed by a FreeExt IR statement that calls the external function nnc_aten_free.

  2. Changing the memReuse logic:
  • do not allocate Bufs that will be allocated inside external function calls,
  • add FreeExt for Bufs allocated by external function calls.
  3. Because output buffers are preallocated, we cannot return externally allocated buffers on which we did inc_ref/dec_ref. The ExternalCall -> ExternalCall2 transformation therefore happens in codegen, and ExternalCalls whose result buffer is an output buffer are not converted to ExternalCall2.

Before:
![Screen Shot 2022-02-11 at 11 57 09 AM](https://user-images.githubusercontent.com/6638825/153664484-3c8b8708-cf52-4f1a-afa7-49af0f972725.png)

After (no memcpy):
<img width="1791" alt="Screen Shot 2022-02-11 at 11 34 49 AM" src="https://user-images.githubusercontent.com/6638825/153664528-edaf1205-b98c-40e3-b4bb-c9815bff24fe.png">

pytorch-bot bot commented Feb 3, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/46d8031dd345668028b5d08baea1fe8561d05b92/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflow | Labels (bold = enabled) | Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk, ciflow/xla ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-bionic-rocm4.5-py3.7 ciflow/linux, ciflow/rocm 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped

facebook-github-bot commented Feb 3, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 854ac78 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


IvanKobzarev added a commit that referenced this pull request Feb 3, 2022
ghstack-source-id: 40f3d37
Pull Request resolved: #72225
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Feb 3, 2022

@IvanKobzarev has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

IvanKobzarev added a commit that referenced this pull request Feb 3, 2022
ghstack-source-id: 6dcf4de
Pull Request resolved: #72225

IvanKobzarev added a commit that referenced this pull request Feb 5, 2022
ghstack-source-id: 4395cc5
Pull Request resolved: #72225

IvanKobzarev added a commit that referenced this pull request Feb 7, 2022
ghstack-source-id: 24da694
Pull Request resolved: #72225

IvanKobzarev added a commit that referenced this pull request Feb 8, 2022
ghstack-source-id: de7025f
Pull Request resolved: #72225
IvanKobzarev added a commit that referenced this pull request Mar 2, 2022
ghstack-source-id: 4544776
Pull Request resolved: #72225

std::vector<ExprPtr> args_;
};

class TORCH_API ExternalCall2 : public StmtNode<ExternalCall2> {

Can we call this ExternalCallMemReuse or something like that instead of ExternalCall2, and add a comment explaining how it differs from ExternalCall?


Thanks, let it be ExternalCallWithAlloc

IvanKobzarev added a commit that referenced this pull request Mar 8, 2022
ghstack-source-id: 9eaa8d6
Pull Request resolved: #72225

IvanKobzarev added a commit that referenced this pull request Mar 8, 2022
ghstack-source-id: 758a665
Pull Request resolved: #72225



@IvanKobzarev IvanKobzarev changed the title [tensorexp] ExternalCall2 without memcpy [tensorexp] ExternalCallWithAlloc (take ownership of aten Tensor, no memcpy ) Mar 8, 2022

IvanKobzarev added a commit that referenced this pull request Mar 8, 2022
ghstack-source-id: dd5240d
Pull Request resolved: #72225

facebook-github-bot pushed a commit that referenced this pull request Mar 9, 2022
Summary: Pull Request resolved: #72225

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33960933

Pulled By: IvanKobzarev

fbshipit-source-id: fc73a3de9e5150919e3806516065b4a6c8316000
github-actions bot commented Mar 9, 2022

Hey @IvanKobzarev.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@facebook-github-bot facebook-github-bot deleted the gh/ivankobzarev/111/head branch March 13, 2022 14:17
Labels: cla signed, oncall: jit