Skip to content

Conversation

@christophe-murphy
Copy link
Contributor

A bug was found in the jit calculation tree where under certain circumstances the incorrect result would be found when both an array and a transpose of the same array are used in the same evaluation. When the transpose method is used on an array the treatment depends on the structure of the jit tree of that array. In the particular case that the array is linear (e.g. not a sub-array) and is not simply a buffer (i.e. a combination of buffers and/or scalars with some operators) a moddim node will be added to the root of the tree with the new dimensions. When the tree is evaluated the dimensions of the child buffer(s) of a moddim node are changed and the moddim node is deleted. If an evaluation happens to include both an array and the transpose of that same array then when the moddim node is applied it changes the dimensions of the children for the remaining lifetime of the evaluation including any subsequent use of these nodes. As a result if the transpose of an array is used in an expression followed by the array itself the second instance will also be transposed.

In this fix instead of creating a moddims node a deep copy of the calculation tree is made at that point with the transpose applied to the child buffer nodes. These buffer node copies will have the transposed dimensions and strides but will still point to the same memory location for the data so no copy of the underlying data needs to be made.

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass

A bug was found in the jit calculation tree where under certain circumstances the incorrect result would be found when both an array and a transpose of the same array are used in the same evaluation. When the transpose method is used on an array the treatment depends on the structure of the jit tree of that array. In the particular case that the array is linear (e.g. not a sub-array) and is not simply a buffer (i.e. a combination of buffers and/or scalars with some operators) a moddim node will be added to the root of the tree with the new dimensions. When the tree is evaluated the dimensions of the child buffer(s) of a moddim node are changed and the moddim node is deleted. If an evaluation happens to include both an array and the transpose of that same array then when the moddim node is applied it changes the dimensions of the children for the remaining lifetime of the evaluation including any subsequent use of these nodes. As a result if the transpose of an array is used in an expression followed by the array itself the second instance will also be transposed.

In this fix instead of creating a moddims node a deep copy of the calculation tree is made at that point with the transpose applied to the child buffer nodes. These buffer node copies will have the transposed dimensions and strides but will still point to the same memory location for the data so no copy of the underlying data needs to be made.
@christophe-murphy christophe-murphy linked an issue Jun 23, 2025 that may be closed by this pull request
@christophe-murphy christophe-murphy added this to the 3.10 milestone Jun 24, 2025
}

template<typename T>
Array<T> moddimOp(const Array<T> &in, af::dim4 outDim) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like moddims is not a jittable operation anymore with this change. what are the performance implications of this?

@edwinsolisf
Copy link
Contributor

edwinsolisf commented Jul 13, 2025

I tested the effects of this PR on the OpenCL backend on the Intel Arc B580 by comparing the output of the jit kernels from the moddims and the jit test. (Note: I removed the cache kernels to make sure they are regenerated from scratch)

These are the results for moddims:
test_output_after_changes.txt
test_output_before_changes.txt

These are the results for jit:
before_jit.txt
after_jit.txt

Things to note:

  • The tests pass in both cases
  • Some of the jit kernels are different after the changes. The main difference observed in the moddims test is that more __noop operations are added which from the rough test runtime seem to get optimize or have minimal impact in the Release build. (More test to confirm optimization is needed). However, on the jit test, specifically the added test evaluateBothArrayAndItsTranspose, with the changes two more concise kernels are created instead of three.
  • The size of the compiled kernels are very similar but very slightly larger (which is expected?). While I could not compare individual kernels due to the name hashing, I compared the total space occupied by all the cached kernels generated. In my case I observed that it increased from 31.8 MB (33,386,496 bytes) to 31.9 MB (33,488,896 bytes), an increase of 0.3%.

I also tested these changes on the oneAPI backend and they also pass the tests (no jit kernels generated)

@christophe-murphy
Copy link
Contributor Author

I tested the effects of this PR on the OpenCL backend on the Intel Arc B580 by comparing the output of the jit kernels from the moddims and the jit test. (Note: I removed the cache kernels to make sure they are regenerated from scratch)

These are the results for moddims: test_output_after_changes.txt test_output_before_changes.txt

These are the results for jit: before_jit.txt after_jit.txt

Things to note:

* The tests pass in both cases

* Some of the jit kernels are different after the changes. The main difference observed in the moddims test is that more __noop operations are added which from the rough test runtime seem to get optimize or have minimal impact in the Release build. (More test to confirm optimization is needed). However, on the jit test, specifically the added test evaluateBothArrayAndItsTranspose, with the changes two more concise kernels are created instead of three.

* The size of the compiled kernels are very similar but very slightly larger (which is expected?). While I could not compare individual kernels due to the name hashing, I compared the total space occupied by all the cached kernels generated. In my case I observed that it increased from 31.8 MB (33,386,496 bytes) to 31.9 MB (33,488,896 bytes), an increase of 0.3%.

I also tested these changes on the oneAPI backend and they also pass the tests (no jit kernels generated)

Yeah, the __noop operations were a result of the moddims nodes added to the JIT tree but those are gone now.

@christophe-murphy christophe-murphy merged commit 3ae9f04 into master Jul 17, 2025
1 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] sum two array result is not correct

4 participants