Description
In the current version, Bernoulli random numbers are generated by a single, naive, serial stream. Ops such as Dropout that draw from this stream perform poorly on CPU. In some cases the CPU has been observed to be about 250x slower than the GPU, a gap far larger than the difference in theoretical peak performance would explain.
The code makes it clear that Bernoulli number generation on the GPU is not restricted to a single thread. Furthermore, the GPU path calls the CUDA library (curand_uniform_double) to generate the random numbers.
Intel-Caffe already uses the VSL math library and OpenMP to do on the CPU the same work that PyTorch does on the GPU. We could borrow that code to improve PyTorch's CPU performance. However, I also noticed that the seed of the random number stream in the current version can be set manually, so the Intel-Caffe code may need slight changes to respect a user-provided seed.
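To make the idea concrete, here is a minimal sketch of parallel Bernoulli generation that still honors a manually set seed. It is not the Intel-Caffe code and not a PyTorch API: `bernoulli_chunked` and its `chunk` parameter are hypothetical names, and the standard `<random>` engine stands in for a VSL stream. The assumption is that the buffer is cut into fixed-size chunks and each chunk gets its own generator seeded from the user seed plus the chunk index, so the result does not depend on how many OpenMP threads run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper for illustration only (not a PyTorch or Intel-Caffe API).
// Returns n samples that are 1 with probability p and 0 otherwise.
// Chunk c always uses a generator seeded from (seed, c), so the output is the
// same for a given (n, p, seed, chunk) regardless of the thread count.
std::vector<std::uint8_t> bernoulli_chunked(std::size_t n, double p,
                                            std::uint64_t seed,
                                            std::size_t chunk = 4096) {
    assert(p >= 0.0 && p <= 1.0);
    assert(chunk > 0);
    std::vector<std::uint8_t> out(n);
    const std::int64_t num_chunks =
        static_cast<std::int64_t>((n + chunk - 1) / chunk);

    // Without -fopenmp the pragma is ignored and the loop runs serially,
    // producing exactly the same values.
    #pragma omp parallel for schedule(static)
    for (std::int64_t c = 0; c < num_chunks; ++c) {
        // Mix the user seed and the chunk index into one seed sequence.
        std::seed_seq seq{static_cast<std::uint32_t>(seed),
                          static_cast<std::uint32_t>(seed >> 32),
                          static_cast<std::uint32_t>(c),
                          static_cast<std::uint32_t>(c >> 32)};
        std::mt19937_64 gen(seq);
        std::bernoulli_distribution dist(p);

        const std::size_t begin = static_cast<std::size_t>(c) * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        for (std::size_t i = begin; i < end; ++i) {
            out[i] = dist(gen) ? 1 : 0;
        }
    }
    return out;
}
```

Calling `bernoulli_chunked(n, p, seed)` twice with the same arguments yields identical masks, which is the property a manual seed needs. Re-seeding a generator per chunk is a simplification; a VSL-based version would more likely advance one stream to each chunk's offset, and the values would differ from the current serial stream either way.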
@cpuhrsch Your advice is important to us given your work on the CPU side. Could you spare some time to look into this part of the code? Looking forward to your thoughts.