@Sopel97
Member

Sopel97 commented Jun 8, 2021

Since the computations of the two slices of the feature transformer output are independent, we can try to run them on separate streams. On my GTX 750 I can notice a slight performance increase with very small FT sizes, and the CUDA profiler shows some overlap between kernels.
[image]

This may increase performance on beefier GPUs like V100, but that remains to be tested.

Note that we can in fact run two separate streams for the backward pass too, even though both operate on the same output buffer, because all writes are atomic.
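The atomic-write argument can be illustrated with a small CPU analogue in standard C++. This is only a sketch, not the CUDA code from this PR: two threads stand in for the two streams, `std::atomic<float>` with a compare-and-swap loop stands in for the device-side atomic add, and the names `backward_slice` and `backward_both_concurrently`, as well as the buffer layout (one gradient entry per feature), are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Atomically add `value` to `target` using a compare-and-swap loop
// (std::atomic<float>::fetch_add is only available from C++20 on).
inline void atomic_add(std::atomic<float>& target, float value) {
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
        // On failure `expected` now holds the current value; retry.
    }
}

// Hypothetical stand-in for the backward pass of one slice: each sample
// scatters its gradient into the shared buffer entries selected by that
// sample's active feature indices.
void backward_slice(const std::vector<std::vector<int>>& active_features,
                    const std::vector<float>& output_grad,
                    std::vector<std::atomic<float>>& weight_grad) {
    assert(active_features.size() == output_grad.size());
    for (std::size_t i = 0; i < active_features.size(); ++i) {
        for (int f : active_features[i]) {
            atomic_add(weight_grad[static_cast<std::size_t>(f)], output_grad[i]);
        }
    }
}

// Run both slices concurrently against the SAME gradient buffer.
// Because every write is an atomic add, no update is lost.
std::vector<float> backward_both_concurrently(
    const std::vector<std::vector<int>>& features_a,
    const std::vector<float>& grad_a,
    const std::vector<std::vector<int>>& features_b,
    const std::vector<float>& grad_b,
    std::size_t num_features) {
    std::vector<std::atomic<float>> weight_grad(num_features);
    for (auto& g : weight_grad) g.store(0.0f);  // explicit zero-initialisation

    std::thread worker_a([&] { backward_slice(features_a, grad_a, weight_grad); });
    std::thread worker_b([&] { backward_slice(features_b, grad_b, weight_grad); });
    worker_a.join();
    worker_b.join();

    // Copy the accumulated values out into a plain vector.
    std::vector<float> result(num_features);
    for (std::size_t i = 0; i < num_features; ++i) result[i] = weight_grad[i].load();
    return result;
}
```

Calling `backward_both_concurrently` with two small slices that touch overlapping feature indices should give the same totals as running the slices one after another. With floating-point values that are not exactly representable, the order of the atomic adds can change the rounding of the sum, so results may differ in the last bits between runs.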

@vondele
Member

vondele commented Jun 8, 2021

That looks good to me; exposing the parallelism can only help.

@Sopel97
Member Author

Sopel97 commented Jun 8, 2021

On V100:

1 thread, 1 worker:
before: 48.29 at 1000, 47.94 at 2000, 47.84 at 3000
after: 48.29 at 1000, 47.93 at 2000, 47.82 at 3000

8 threads, 4 workers:
before: 57.04 at 1000, 57.04 at 2000, 57.16 at 3000
after: 56.37 at 1000, 56.41 at 2000, 56.52 at 3000

So it doesn't help, at least for now, but it also doesn't do harm.
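To put the before/after readings above side by side, the relative change of their means can be computed as in the sketch below. The helper names `mean` and `relative_change_percent` are made up for illustration, and the thread does not state the unit of the readings, so the sign of the result alone does not say which run was faster.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of readings.
double mean(const std::vector<double>& readings) {
    assert(!readings.empty());
    return std::accumulate(readings.begin(), readings.end(), 0.0) /
           static_cast<double>(readings.size());
}

// Relative change of the mean of `after` with respect to the mean of
// `before`, expressed in percent.
double relative_change_percent(const std::vector<double>& before,
                               const std::vector<double>& after) {
    const double base = mean(before);
    return (mean(after) - base) / base * 100.0;
}
```

Fed the three readings per configuration quoted above, the 1 thread, 1 worker runs differ by well under 0.1%, and the 8 threads, 4 workers runs differ by a little over 1%.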

@vondele
Member

vondele commented Jun 8, 2021

We kind of know that on V100 and above it is limited by the CPU.

@Sopel97
Member Author

Sopel97 commented Jun 26, 2021

I'd like to see some benchmarks from other people before pushing this.

Sopel97 added the "help wanted" label Jun 26, 2021