lookup disgarded idx offset in OpenCL #3611

willyborn · 2024-10-03T19:57:17Z

The OpenCL code did not take the idx offset into account resulting in the usage of the wrong data set, without reporting any error.
This was the reason for the differences of the bagging example, on the different platforms.

Description

Following commits are included

e4de910 Fixed missing offset handling in lookup

This contains the fix in the OpenCL code

80dc4ed Extends ARG checking & buffer leak for af_lookup

For all platforms
Extra checks included on arguments: nullptr as address of output
ASAP release of output array if previously allocated
Returning NULL_ARRAY when input array or idx array are a NULL_ARRAY

98631b8 Adds argument checking for internal debugging

For all platforms
Internal code calls directly the back-end code, without passing a check on their arguments. This check is performed with asserts so that performance of Release code is unaffected.

3e28fff Extends test cases for lookup

For all platforms
All tests on lookup are now including a check on buffer leaks
Arrays of type: NULL_ARRAY, FULL_ARRAY, SUB_ARRAY and REORDERED_ARRAY are used as input type
Check for nullptr for address of output array
Check for assignment of an argument array to the output array

Is this a new feature or a bug fix? Bug
Why these changes are necessary. To produce a correct result in all situations
Potential impact on specific hardware, software or backends. Only OpenCL is impacted
New functions and their functionality. None.
Can this PR be backported to older versions? Yes
Future changes not implemented in this PR.
-->
Fixes: [BUG] Results for bagging different for opencl #3606

Changes to Users

No changes for programs using the FULL_ARRAY (offset = 0). The correct result will be given for all other case as well now.

Checklist

willyborn · 2024-10-03T20:14:51Z

I noticed that in many tests, some basics are missing, like

nullptr checks for addresses
support for all types of arrays
buffer leak checks

If desired, I can go over all tests and make sure those base tests are included.

edwinsolisf · 2025-01-09T21:34:16Z

src/backend/cpu/lookup.cpp

 Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
                   const unsigned dim) {
-    const dim4 &iDims = input.dims();
+    assert(dim <= 3);


Could you replace the asserts with the ARG_ASSERT macro to make it consistent with our current argument checking? For example, for this one: ARG_ASSERT(1, (dim <= 3))

Changes done.
Rebased to master and force pushed.

edwinsolisf

Works, tested on Windows

christophe-murphy · 2025-01-21T01:38:47Z

src/backend/cpu/lookup.cpp

 Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
                   const unsigned dim) {
-    const dim4 &iDims = input.dims();
+    ARG_ASSERT(1, (af_dtype)dtype_traits<idx_t>::af_type != c32 &&


These checks are already performed in src/api/c/index.cpp

Correct.

These checks are in place for the end-user, and now extended to block crashes and buffer (af_array) leaks.

They are not for the arrayfire developer, who is working in the 3th layer (1=C++, 2=C, 3=Array). Here we have multiple problems:

The context which is prepared in the C layer now has to be presented by the calling Array layer. Last year, I already corrected twice such errors.

The requirements are not documented for this layer, so it is logical that errors as in point 1 occur.

The requirements are not enforced, so the developer is not informed of his error during testing. Many violations just return a result, although not always the correct one.

In my original proposal, I used asserts so that above problems were solved without performance impact. Those checks are not needed for the end-user, since those has to be covered in the C-layer.
I can convert them all to asserts again. Let me know the common preference.

OK, I will discuss with @umar456 how we should go about this.

christophe-murphy · 2025-01-21T01:39:32Z

src/api/c/index.cpp

        ARG_ASSERT(2, (idxType != b8));

-        af_array output = 0;
+        // Try to release ASAP in the hope that the data buffer, if any, will be


Have you seen a performance improvement by doing this?

af_array output=0; is replaced by af_array output=nullptr;
In the snapshot above, only the upper section of the update is shown.

Code wise, I assume both will have the similar instructions and certainly the same timing.
Type wise, the first is an assignment of an integer to a pointer. In my eyes a nullptr needs special attention/treatment, because it is frequently the cause of many crashes (seg faults). Therefor they should be clearly visible iso hidden behind a standard integer initialization.

Sorry, should have been more clear. I see you added extra logic to try to reuse a buffer where possible. I was wondering if you saw any benefit from doing so.

While testing, I fell myself into the trap by reusing the out array in the next loop. As result, I ran out of GPU memory.
I called however af_release_array at the end of the function, although not inside the loop, so I was sure that all memory was correctly freed. I see regularly people complaining about the high memory usage of arrayfire, perhaps this is a possible cause. By the way, nowhere in the documentation is mentioned that you should provided a nullptr and not the address of a real array.

Searching further for the cause of my out of memory situation, I remarked that the provided out array could be released at the start of the af_function, resulting in possible reuse if size permits.

There is 1 downside. If the provided out array is not initialized (only af_array arr;) you will get an error because a random address will be used in the internal releasing. Those using the C++ interface, will never have this problem since the array object always exists, even when empty.

PS: I discovered some cases of non-initialization in other sections of arrayfire. In case you guys, agree with my reasoning I will go over all functions including checks and extra tests to increase the robustness and certainty of result. I already discovered 2 other cases where random results are returned (missing offset in OCL & sync error with JIT array in CPU). More after this commit, since it is a further development on this.

umar456 · 2025-01-21T21:57:50Z

The initial fix looks good.

With regard to the af_lookup changes. I don't think we can take that as the way you have it. Even though you are not changing the API, the contract we have with the user changes. With most other ArrayFire C functions we accept uninitialized af_array objects(i.e. af_array blah = 0;) and this implies that ArrayFire will allocate a new af_array.

With the new function you are accepting an already allocated array. This is fine but you are changing its meaning and it would break user code. I suggest we make a new function if you want this type of functionality. Either af_lookup_v2 or af_lookup_inplace. This function willl write to an already allocated array and will not allocate internally. This would require you to make sure that the types and size/shape are acceptable. I am okay if you want to make this change but I would prefer it in a different PR.

willyborn · 2025-01-21T22:04:06Z

OK.

I drop the release_array inside the function and return back to the original version.

edwinsolisf · 2025-03-21T00:08:56Z

To move forward with this PR we have decided to keep the solution to the bug and the tests for it, so could just leave in the fix commit [e4de910](e4de910 and the tests specific to this problem? I see 3e28fff modifies some of our previous tests/setup, could you point which specific test checks this problem?

Regarding the other changes, you may submit them in another PR so we can discuss such changes separately.

willyborn · 2025-03-21T21:49:21Z

The existing tests are not modified. It generates different types of arrays to be overwritten by the function.

I extended the concept for input arrays, since multiple functions fail when a none common array is provided as input array. The most common errors are with sub arrays, empty arrays, JIT arrays, etc. I added some methods to TestOutputArrayInfo for reporting purposes, like current array type generated.
To remain consistent with the current way of testing I thought that it would be best to extend the same idea by creating a new class for TestInputArrayInfo although with same usage as the existing TestOutputArrayInfo class.

On my local version, I started including the tests for other functions and already found following errors as a result:

reduce is having problems with JIT arrays on CPU and sub arrays on OpenCL.

I did not continue, because I wanted first feedback on the lookup case (sub array problems on OpenCL).

I will submit a separate PR, after the new release is finished.
You can include only e4de910 and leave all the other changes on tests out for 3.10 .
Later for 3.11 you can decide for a more general approach (also for lookup). As mentioned for reduce, I am convinced that more errors will be discovered. The same go for the argument checking at the internal level, which also generated multiple errors in the past (most recent join wrongly called by field example: PR3605).

edwinsolisf · 2025-03-21T22:43:58Z

@willyborn Yeah, we prefer that being its own PR. Sounds good, I'll take that commit. I will also add some tests for lookup with different offsets.

edwinsolisf

Tests pass on all backends, checked machine learning examples using lookup as well

edwinsolisf · 2025-03-31T23:29:01Z

Hey @willyborn, I mistakenly closed this PR, could you reopen it again? I apologize for the incovenience. Thanks!

willyborn · 2025-04-01T06:47:59Z

@edwinsolisf . I like to reopen, although I do not see any option allowing me. I hope this comment will reopen as a consequence.
Perhaps lack of permissions?

edwinsolisf · 2025-04-01T19:32:38Z

I see the problem, it seems I overwrote your commits on your branch, so no changes are present. Could you try pulling my lookup branch and pushing it to your remote loookup branch? I think that should allow you to reopen the PR.

edwinsolisf · 2025-04-02T03:34:07Z

I have recreated this PR in #3650 that includes and references your fix . I apologize for the incovenience this caused.

willyborn · 2025-04-02T05:54:35Z

Thanks. The important thing is that it is fixed.

…

On Wed, Apr 2, 2025, 05:34 Edwin Lester Solís Fuentes < ***@***.***> wrote: I have recreated this PR in #3650 <#3650> that includes and references your fix . I apologize for the incovenience this caused. — Reply to this email directly, view it on GitHub <#3611 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPHZUMPKULZUNAKKDSD2XNLENAVCNFSM6AAAAABPKQDPA6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONZRGI2TMMZXGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***> [image: edwinsolisf]*edwinsolisf* left a comment (arrayfire/arrayfire#3611) <#3611 (comment)> I have recreated this PR in #3650 <#3650> that includes and references your fix . I apologize for the incovenience this caused. — Reply to this email directly, view it on GitHub <#3611 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQ2WGPHZUMPKULZUNAKKDSD2XNLENAVCNFSM6AAAAABPKQDPA6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDONZRGI2TMMZXGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

willyborn force-pushed the loookup branch from 5aafcf3 to c4a3917 Compare October 8, 2024 09:06

edwinsolisf reviewed Jan 9, 2025

View reviewed changes

willyborn force-pushed the loookup branch from c4a3917 to 40844d0 Compare January 13, 2025 21:39

edwinsolisf self-requested a review January 14, 2025 05:01

edwinsolisf previously approved these changes Jan 14, 2025

View reviewed changes

christophe-murphy reviewed Jan 21, 2025

View reviewed changes

willyborn dismissed edwinsolisf’s stale review via 3e28fff January 22, 2025 15:25

willyborn force-pushed the loookup branch from 40844d0 to 3e28fff Compare January 22, 2025 15:25

melonakos added this to the 3.10 milestone Feb 5, 2025

edwinsolisf force-pushed the loookup branch from 3e28fff to 01da29f Compare March 27, 2025 00:32

edwinsolisf approved these changes Mar 27, 2025

View reviewed changes

edwinsolisf closed this Mar 31, 2025

edwinsolisf force-pushed the loookup branch from 01da29f to d63c391 Compare March 31, 2025 23:15

edwinsolisf mentioned this pull request Apr 1, 2025

Fix for lookup offset in opencl and non-column indices #3650

Merged

5 tasks

lookup disgarded idx offset in OpenCL #3611

lookup disgarded idx offset in OpenCL #3611

Uh oh!

Conversation

willyborn commented Oct 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes to Users

Checklist

Uh oh!

willyborn commented Oct 3, 2024

Uh oh!

edwinsolisf Jan 9, 2025

Choose a reason for hiding this comment

Uh oh!

willyborn Jan 13, 2025

Choose a reason for hiding this comment

Uh oh!

edwinsolisf left a comment

Choose a reason for hiding this comment

Uh oh!

christophe-murphy Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

willyborn Jan 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

christophe-murphy Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

christophe-murphy Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

willyborn Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

christophe-murphy Jan 21, 2025

Choose a reason for hiding this comment

Uh oh!

willyborn Jan 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

umar456 commented Jan 21, 2025

Uh oh!

willyborn commented Jan 21, 2025

Uh oh!

edwinsolisf commented Mar 21, 2025

Uh oh!

willyborn commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edwinsolisf commented Mar 21, 2025

Uh oh!

edwinsolisf left a comment

Choose a reason for hiding this comment

Uh oh!

edwinsolisf commented Mar 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

willyborn commented Apr 1, 2025

Uh oh!

edwinsolisf commented Apr 1, 2025

Uh oh!

edwinsolisf commented Apr 2, 2025

Uh oh!

willyborn commented Apr 2, 2025 via email

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

willyborn commented Oct 3, 2024 •

edited

Loading

willyborn Jan 21, 2025 •

edited

Loading

willyborn Jan 21, 2025 •

edited

Loading

willyborn commented Mar 21, 2025 •

edited

Loading

edwinsolisf commented Mar 31, 2025 •

edited

Loading