-
Notifications
You must be signed in to change notification settings - Fork 552
lookup disgarded idx offset in OpenCL #3611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I noticed that in many tests, some basics are missing, like
If desired, I can go over all tests and make sure those base tests are included. |
src/backend/cpu/lookup.cpp
Outdated
| Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, | ||
| const unsigned dim) { | ||
| const dim4 &iDims = input.dims(); | ||
| assert(dim <= 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you replace the asserts with the ARG_ASSERT macro to make it consistent with our current argument checking? For example, for this one: ARG_ASSERT(1, (dim <= 3))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes done.
Rebased to master and force pushed.
edwinsolisf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works, tested on Windows
src/backend/cpu/lookup.cpp
Outdated
| Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, | ||
| const unsigned dim) { | ||
| const dim4 &iDims = input.dims(); | ||
| ARG_ASSERT(1, (af_dtype)dtype_traits<idx_t>::af_type != c32 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks are already performed in src/api/c/index.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
- These checks are in place for the end-user, and now extended to block crashes and buffer (af_array) leaks.
- They are not for the arrayfire developer, who is working in the 3th layer (1=C++, 2=C, 3=Array). Here we have multiple problems:
- The context which is prepared in the C layer now has to be presented by the calling Array layer. Last year, I already corrected twice such errors.
- The requirements are not documented for this layer, so it is logical that errors as in point 1 occur.
- The requirements are not enforced, so the developer is not informed of his error during testing. Many violations just return a result, although not always the correct one.
In my original proposal, I used asserts so that above problems were solved without performance impact. Those checks are not needed for the end-user, since those has to be covered in the C-layer.
I can convert them all to asserts again. Let me know the common preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will discuss with @umar456 how we should go about this.
src/api/c/index.cpp
Outdated
| ARG_ASSERT(2, (idxType != b8)); | ||
|
|
||
| af_array output = 0; | ||
| // Try to release ASAP in the hope that the data buffer, if any, will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen a performance improvement by doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
af_array output=0; is replaced by af_array output=nullptr;
In the snapshot above, only the upper section of the update is shown.
Code wise, I assume both will have the similar instructions and certainly the same timing.
Type wise, the first is an assignment of an integer to a pointer. In my eyes a nullptr needs special attention/treatment, because it is frequently the cause of many crashes (seg faults). Therefor they should be clearly visible iso hidden behind a standard integer initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, should have been more clear. I see you added extra logic to try to reuse a buffer where possible. I was wondering if you saw any benefit from doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While testing, I fell myself into the trap by reusing the out array in the next loop. As result, I ran out of GPU memory.
I called however af_release_array at the end of the function, although not inside the loop, so I was sure that all memory was correctly freed. I see regularly people complaining about the high memory usage of arrayfire, perhaps this is a possible cause. By the way, nowhere in the documentation is mentioned that you should provided a nullptr and not the address of a real array.
Searching further for the cause of my out of memory situation, I remarked that the provided out array could be released at the start of the af_function, resulting in possible reuse if size permits.
There is 1 downside. If the provided out array is not initialized (only af_array arr;) you will get an error because a random address will be used in the internal releasing. Those using the C++ interface, will never have this problem since the array object always exists, even when empty.
PS: I discovered some cases of non-initialization in other sections of arrayfire. In case you guys, agree with my reasoning I will go over all functions including checks and extra tests to increase the robustness and certainty of result. I already discovered 2 other cases where random results are returned (missing offset in OCL & sync error with JIT array in CPU). More after this commit, since it is a further development on this.
|
The initial fix looks good. With regard to the af_lookup changes. I don't think we can take that as the way you have it. Even though you are not changing the API, the contract we have with the user changes. With most other ArrayFire C functions we accept uninitialized af_array objects(i.e. With the new function you are accepting an already allocated array. This is fine but you are changing its meaning and it would break user code. I suggest we make a new function if you want this type of functionality. Either af_lookup_v2 or af_lookup_inplace. This function willl write to an already allocated array and will not allocate internally. This would require you to make sure that the types and size/shape are acceptable. I am okay if you want to make this change but I would prefer it in a different PR. |
|
OK. I drop the release_array inside the function and return back to the original version. |
|
To move forward with this PR we have decided to keep the solution to the bug and the tests for it, so could just leave in the fix commit [e4de910](e4de910 and the tests specific to this problem? I see 3e28fff modifies some of our previous tests/setup, could you point which specific test checks this problem? Regarding the other changes, you may submit them in another PR so we can discuss such changes separately. |
|
The existing tests are not modified. It generates different types of arrays to be overwritten by the function. I extended the concept for input arrays, since multiple functions fail when a none common array is provided as input array. The most common errors are with sub arrays, empty arrays, JIT arrays, etc. I added some methods to TestOutputArrayInfo for reporting purposes, like current array type generated. On my local version, I started including the tests for other functions and already found following errors as a result:
I did not continue, because I wanted first feedback on the lookup case (sub array problems on OpenCL). I will submit a separate PR, after the new release is finished. |
|
@willyborn Yeah, we prefer that being its own PR. Sounds good, I'll take that commit. I will also add some tests for lookup with different offsets. |
edwinsolisf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests pass on all backends, checked machine learning examples using lookup as well
|
Hey @willyborn, I mistakenly closed this PR, could you reopen it again? I apologize for the incovenience. Thanks! |
|
@edwinsolisf . I like to reopen, although I do not see any option allowing me. I hope this comment will reopen as a consequence. |
|
I have recreated this PR in #3650 that includes and references your fix . I apologize for the incovenience this caused. |
|
Thanks.
The important thing is that it is fixed.
…On Wed, Apr 2, 2025, 05:34 Edwin Lester Solís Fuentes < ***@***.***> wrote:
I have recreated this PR in #3650
<#3650> that includes and
references your fix . I apologize for the incovenience this caused.
—
Reply to this email directly, view it on GitHub
<#3611 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPHZUMPKULZUNAKKDSD2XNLENAVCNFSM6AAAAABPKQDPA6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONZRGI2TMMZXGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
[image: edwinsolisf]*edwinsolisf* left a comment
(arrayfire/arrayfire#3611)
<#3611 (comment)>
I have recreated this PR in #3650
<#3650> that includes and
references your fix . I apologize for the incovenience this caused.
—
Reply to this email directly, view it on GitHub
<#3611 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AQ2WGPHZUMPKULZUNAKKDSD2XNLENAVCNFSM6AAAAABPKQDPA6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONZRGI2TMMZXGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The OpenCL code did not take the idx offset into account resulting in the usage of the wrong data set, without reporting any error.
This was the reason for the differences of the bagging example, on the different platforms.
Description
Following commits are included
-->
Fixes: [BUG] Results for bagging different for opencl #3606
Changes to Users
No changes for programs using the FULL_ARRAY (offset = 0). The correct result will be given for all other case as well now.
Checklist