
Conversation

@shehzan10 (Member) commented Apr 26, 2016

The CPU and OpenCL backends use a single source file that compiles into separate object files. CUDA requires generating files with CMake from a template file, which are then added as sources.

@shehzan10 shehzan10 added the CUDA label Apr 26, 2016
@shehzan10 shehzan10 added this to the 3.4.0 milestone Apr 26, 2016
@umar456 (Member) commented Apr 26, 2016

There HAS to be a better way to do this.

@shehzan10 (Member, Author)

I'm open to ANY other way you may have for this. I don't like it either, but right now this is the only way it's going to compile well on all architectures.

@umar456 (Member) commented Apr 26, 2016

You can make multiple .o targets from the same source file and pass a different -D flag to each one. It will require some changes to the CMake files, but it will be a hell of a lot better than adding 20 files.
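The suggestion above could be sketched in CMake roughly as follows. This is a hypothetical sketch, not ArrayFire's actual build scripts; the file path, target names, and the SBK_KEY_TYPE macro are assumptions. The idea is one object target per key type, all compiled from the same source, each with its own preprocessor definition:

```cmake
# Hypothetical sketch of the per-type object-target idea, not ArrayFire's
# actual CMake. Each target compiles the same source file with a different
# -D definition selecting the key type to instantiate.
set(SBK_TYPES float double int uint intl uintl short ushort char uchar)

foreach(SBK_TYPE ${SBK_TYPES})
    add_library(sort_by_key_${SBK_TYPE} OBJECT kernel/sort_by_key_impl.cu)
    target_compile_definitions(sort_by_key_${SBK_TYPE}
                               PRIVATE SBK_KEY_TYPE=${SBK_TYPE})
endforeach()
```

The resulting objects (via `$<TARGET_OBJECTS:...>` generator expressions) would then be added to the backend library's sources, so NVCC only ever sees one key type's worth of instantiations per translation unit.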

@shehzan10 (Member, Author)

Ok, that might work. I'll look into it.

@shehzan10 shehzan10 changed the title Split CUDA sort_by_key instantiations into 2 files for each type Improves sort_by_key instantiations Apr 26, 2016
@shehzan10 (Member, Author)

build arrayfire tegrak1 ci
build arrayfire tegrax1 ci

@shehzan10 shehzan10 added build and removed CUDA labels Apr 27, 2016

#include <kernel/sort_by_key_impl.hpp>

// SBK_TYPES:float double int uint intl uintl short ushort char uchar
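The excerpt above is from one of the generated per-type files. A minimal C++ sketch of the pattern (all names here are hypothetical stand-ins, not ArrayFire's actual headers or kernels): the implementation header defines a single template, and each generated translation unit pins one key type via a macro, so the compiler only processes a slice of the full key-type/value-type matrix at a time.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// --- stand-in for kernel/sort_by_key_impl.hpp (hypothetical names) ---
// One template definition shared by every generated translation unit.
template<typename Tk, typename Tv>
void sortByKey(std::vector<Tk>& keys, std::vector<Tv>& vals) {
    // Zip keys and values, sort by key, then unzip -- a CPU sketch of
    // what the real backend does with thrust::sort_by_key.
    std::vector<std::pair<Tk, Tv>> zipped(keys.size());
    for (size_t i = 0; i < keys.size(); ++i) zipped[i] = {keys[i], vals[i]};
    std::sort(zipped.begin(), zipped.end(),
              [](const std::pair<Tk, Tv>& a, const std::pair<Tk, Tv>& b) {
                  return a.first < b.first;
              });
    for (size_t i = 0; i < zipped.size(); ++i) {
        keys[i] = zipped[i].first;
        vals[i] = zipped[i].second;
    }
}

// --- stand-in for one generated file: the key type arrives via -D ---
#ifndef SBK_KEY_TYPE
#define SBK_KEY_TYPE float  // default so this sketch compiles standalone
#endif

// Explicitly instantiate this key type against each value type
// (abbreviated to two value types here; the real matrix is all-to-all).
template void sortByKey<SBK_KEY_TYPE, int>(std::vector<SBK_KEY_TYPE>&,
                                           std::vector<int>&);
template void sortByKey<SBK_KEY_TYPE, float>(std::vector<SBK_KEY_TYPE>&,
                                             std::vector<float>&);
```

Because each object file covers one key type, no single NVCC invocation has to instantiate the entire all-to-all matrix at once.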
A Member commented on this diff:

Very nice

@pavanky pavanky merged commit 4041410 into arrayfire:devel Apr 27, 2016
@FilipeMaia (Contributor)

I don't know if there's anything that can be done, but it became extremely slow to compile arrayfire after this merge.

@shehzan10 (Member, Author) commented Apr 30, 2016

Well, actually it became slow after #1373, because it's like adding a bunch of new kernels: sort_by_key requires all-to-all mapping between types. With #1373, the number of instantiations per file was so large that NVCC was crashing even on the best machines. This PR fixed the crashing part.

I have been thinking about this, and it might make a difference if we remove the dim template from the batched function. The dim template is not actually used in any of the actual sort calls. I'm looking into the time saved; it's about a minute and a half of reduction (mostly on the CUDA backend).
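The dim-template point can be illustrated with a toy sketch (hypothetical names and signatures, not ArrayFire's real API). When dim is a template parameter, every supported dim multiplies the instantiation count for each key type; when dim is a runtime argument, one instantiation per key type suffices, since dim only selects how batches are traversed, not what code is generated:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical: dim as a template parameter. Supporting dims 0..3 means
// four instantiations of this function per key type.
template<typename Tk, int Dim>
void sort0Templated(std::vector<Tk>& keys) {
    // The sort itself never uses Dim, so these instantiations are
    // near-duplicates of each other.
    std::sort(keys.begin(), keys.end());
}

// Hypothetical: dim as a runtime argument. One instantiation per key
// type covers every dim.
template<typename Tk>
void sort0(std::vector<Tk>& keys, int /*dim*/) {
    std::sort(keys.begin(), keys.end());
}
```

If the real batched wrapper likewise never uses dim inside the sort, dropping the template parameter would cut the instantiation count by the number of supported dims, which is consistent with the compile-time reduction described above.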

@pavanky (Member) commented Apr 30, 2016

@FilipeMaia Can you verify if it is after this PR ? We had other PRs with similar names.

@FilipeMaia (Contributor)

@shehzan10 is probably correct. I haven't tried just after #1373 but it's the new sort kernels that cause the slowness, so the reason must be #1373.
I have no solution, just pointing out that the slowness is at a level where it becomes a practical concern.

@pavanky (Member) commented Apr 30, 2016

@FilipeMaia that is a problem we have to deal with because of using a template-based library like Thrust... :-/ Too many instantiations.

Need to figure out if there is any easier way to deal with this.


4 participants