Skip to content

Conversation

@willyborn
Copy link
Contributor

@willyborn willyborn commented Oct 3, 2024

The OpenCL code did not take the idx offset into account resulting in the usage of the wrong data set, without reporting any error.
This was the reason for the differences of the bagging example, on the different platforms.

Description

Following commits are included

  • e4de910 Fixed missing offset handling in lookup

This contains the fix in the OpenCL code

  • 80dc4ed Extends ARG checking & buffer leak for af_lookup

For all platforms
Extra checks included on arguments: nullptr as address of output
ASAP release of output array if previously allocated
Returning NULL_ARRAY when input array or idx array are a NULL_ARRAY

  • 98631b8 Adds argument checking for internal debugging

For all platforms
Internal code calls directly the back-end code, without passing a check on their arguments. This check is performed with asserts so that performance of Release code is unaffected.

  • 3e28fff Extends test cases for lookup

For all platforms
All tests on lookup are now including a check on buffer leaks
Arrays of type: NULL_ARRAY, FULL_ARRAY, SUB_ARRAY and REORDERED_ARRAY are used as input type
Check for nullptr for address of output array
Check for assignment of an argument array to the output array

  • Is this a new feature or a bug fix? Bug
  • Why these changes are necessary. To produce a correct result in all situations
  • Potential impact on specific hardware, software or backends. Only OpenCL is impacted
  • New functions and their functionality. None.
  • Can this PR be backported to older versions? Yes
  • Future changes not implemented in this PR.
    -->
    Fixes: [BUG] Results for bagging different for opencl #3606

Changes to Users

No changes for programs using the FULL_ARRAY (offset = 0). The correct result will be given for all other case as well now.

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass
  • Functions added to unified API
  • Functions documented

@willyborn
Copy link
Contributor Author

I noticed that in many tests, some basics are missing, like

  • nullptr checks for addresses
  • support for all types of arrays
  • buffer leak checks

If desired, I can go over all tests and make sure those base tests are included.

Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
const unsigned dim) {
const dim4 &iDims = input.dims();
assert(dim <= 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you replace the asserts with the ARG_ASSERT macro to make it consistent with our current argument checking? For example, for this one: ARG_ASSERT(1, (dim <= 3))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes done.
Rebased to master and force pushed.

edwinsolisf
edwinsolisf previously approved these changes Jan 14, 2025
Copy link
Contributor

@edwinsolisf edwinsolisf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works, tested on Windows

Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
const unsigned dim) {
const dim4 &iDims = input.dims();
ARG_ASSERT(1, (af_dtype)dtype_traits<idx_t>::af_type != c32 &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks are already performed in src/api/c/index.cpp

Copy link
Contributor Author

@willyborn willyborn Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

  • These checks are in place for the end-user, and now extended to block crashes and buffer (af_array) leaks.
  • They are not for the arrayfire developer, who is working in the 3th layer (1=C++, 2=C, 3=Array). Here we have multiple problems:
    1. The context which is prepared in the C layer now has to be presented by the calling Array layer. Last year, I already corrected twice such errors.
    2. The requirements are not documented for this layer, so it is logical that errors as in point 1 occur.
    3. The requirements are not enforced, so the developer is not informed of his error during testing. Many violations just return a result, although not always the correct one.

In my original proposal, I used asserts so that above problems were solved without performance impact. Those checks are not needed for the end-user, since those has to be covered in the C-layer.
I can convert them all to asserts again. Let me know the common preference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will discuss with @umar456 how we should go about this.

ARG_ASSERT(2, (idxType != b8));

af_array output = 0;
// Try to release ASAP in the hope that the data buffer, if any, will be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen a performance improvement by doing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

af_array output=0; is replaced by af_array output=nullptr;
In the snapshot above, only the upper section of the update is shown.

Code wise, I assume both will have the similar instructions and certainly the same timing.
Type wise, the first is an assignment of an integer to a pointer. In my eyes a nullptr needs special attention/treatment, because it is frequently the cause of many crashes (seg faults). Therefor they should be clearly visible iso hidden behind a standard integer initialization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should have been more clear. I see you added extra logic to try to reuse a buffer where possible. I was wondering if you saw any benefit from doing so.

Copy link
Contributor Author

@willyborn willyborn Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing, I fell myself into the trap by reusing the out array in the next loop. As result, I ran out of GPU memory.
I called however af_release_array at the end of the function, although not inside the loop, so I was sure that all memory was correctly freed. I see regularly people complaining about the high memory usage of arrayfire, perhaps this is a possible cause. By the way, nowhere in the documentation is mentioned that you should provided a nullptr and not the address of a real array.

Searching further for the cause of my out of memory situation, I remarked that the provided out array could be released at the start of the af_function, resulting in possible reuse if size permits.

There is 1 downside. If the provided out array is not initialized (only af_array arr;) you will get an error because a random address will be used in the internal releasing. Those using the C++ interface, will never have this problem since the array object always exists, even when empty.

PS: I discovered some cases of non-initialization in other sections of arrayfire. In case you guys, agree with my reasoning I will go over all functions including checks and extra tests to increase the robustness and certainty of result. I already discovered 2 other cases where random results are returned (missing offset in OCL & sync error with JIT array in CPU). More after this commit, since it is a further development on this.

@umar456
Copy link
Member

umar456 commented Jan 21, 2025

The initial fix looks good.

With regard to the af_lookup changes. I don't think we can take that as the way you have it. Even though you are not changing the API, the contract we have with the user changes. With most other ArrayFire C functions we accept uninitialized af_array objects(i.e. af_array blah = 0;) and this implies that ArrayFire will allocate a new af_array.

With the new function you are accepting an already allocated array. This is fine but you are changing its meaning and it would break user code. I suggest we make a new function if you want this type of functionality. Either af_lookup_v2 or af_lookup_inplace. This function willl write to an already allocated array and will not allocate internally. This would require you to make sure that the types and size/shape are acceptable. I am okay if you want to make this change but I would prefer it in a different PR.

@willyborn
Copy link
Contributor Author

OK.

I drop the release_array inside the function and return back to the original version.

@edwinsolisf
Copy link
Contributor

To move forward with this PR we have decided to keep the solution to the bug and the tests for it, so could just leave in the fix commit [e4de910](e4de910 and the tests specific to this problem? I see 3e28fff modifies some of our previous tests/setup, could you point which specific test checks this problem?

Regarding the other changes, you may submit them in another PR so we can discuss such changes separately.

@willyborn
Copy link
Contributor Author

willyborn commented Mar 21, 2025

The existing tests are not modified. It generates different types of arrays to be overwritten by the function.

I extended the concept for input arrays, since multiple functions fail when a none common array is provided as input array. The most common errors are with sub arrays, empty arrays, JIT arrays, etc. I added some methods to TestOutputArrayInfo for reporting purposes, like current array type generated.
To remain consistent with the current way of testing I thought that it would be best to extend the same idea by creating a new class for TestInputArrayInfo although with same usage as the existing TestOutputArrayInfo class.

On my local version, I started including the tests for other functions and already found following errors as a result:

  • reduce is having problems with JIT arrays on CPU and sub arrays on OpenCL.

I did not continue, because I wanted first feedback on the lookup case (sub array problems on OpenCL).

I will submit a separate PR, after the new release is finished.
You can include only e4de910 and leave all the other changes on tests out for 3.10 .
Later for 3.11 you can decide for a more general approach (also for lookup). As mentioned for reduce, I am convinced that more errors will be discovered. The same go for the argument checking at the internal level, which also generated multiple errors in the past (most recent join wrongly called by field example: PR3605).

@edwinsolisf
Copy link
Contributor

@willyborn Yeah, we prefer that being its own PR. Sounds good, I'll take that commit. I will also add some tests for lookup with different offsets.

Copy link
Contributor

@edwinsolisf edwinsolisf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests pass on all backends, checked machine learning examples using lookup as well

@edwinsolisf
Copy link
Contributor

edwinsolisf commented Mar 31, 2025

Hey @willyborn, I mistakenly closed this PR, could you reopen it again? I apologize for the incovenience. Thanks!

@willyborn
Copy link
Contributor Author

@edwinsolisf . I like to reopen, although I do not see any option allowing me. I hope this comment will reopen as a consequence.
Perhaps lack of permissions?

@edwinsolisf
Copy link
Contributor

I see the problem, it seems I overwrote your commits on your branch, so no changes are present. Could you try pulling my lookup branch and pushing it to your remote loookup branch? I think that should allow you to reopen the PR.

@edwinsolisf
Copy link
Contributor

I have recreated this PR in #3650 that includes and references your fix . I apologize for the incovenience this caused.

@willyborn
Copy link
Contributor Author

willyborn commented Apr 2, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Results for bagging different for opencl

5 participants