-
Notifications
You must be signed in to change notification settings - Fork 552
Improves sort_by_key instantiations #1397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
There HAS to be a better way to do this. |
|
I'm open to ANY other way you may have for this. I don't like it either. But right now, this is the only way its going to compile well on all architectures. |
|
You can make multiple .o targets and pass a different -D flag to each one using the same source file. It will require some changes to the CMake files but it will be hell of a lot better then adding 20 files. |
|
Ok, that might work. I'll look into it. |
|
build arrayfire tegrak1 ci |
|
|
||
| #include <kernel/sort_by_key_impl.hpp> | ||
|
|
||
| // SBK_TYPES:float double int uint intl uintl short ushort char uchar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice
|
I don't know if there's anything that can be done, but it became extremely slow to compile arrayfire after this merge. |
|
Well, actually it became slow after #1373 because its like adding a bunch of new kernels as sort_by_key requires All-to-All mapping for types. With #1373 the number of instantiations per file were so many that NVCC was crashing even on the best machines. This fixed the crashing part. I have been thinking about this and it might make a difference if we remove the dim template from the batched function. The dim template is not actually used in any of the actual sort calls. I'm looking into the time saved - its about a minute and a half reduction (mostly on the CUDA backend). |
|
@FilipeMaia Can you verify if it is after this PR ? We had other PRs with similar names. |
|
@shehzan10 is probably correct. I haven't tried just after #1373 but it's the new sort kernels that cause the slowness, so the reason must be #1373. |
|
@FilipeMaia that is a problem we have to deal with because of using a template based library like thrust.. :-/ Too many instantiations. Need to figure out if there is any easier way to deal with this. |
CPU and OpenCL use single source file to compile to separate object files.
CUDA requires generating files using CMake based on a template file which is then added as source.