Enabling Infiniband support for Gloo data channel with auto IB detection #4795
04a3ab9 to
8647463
Compare
30788a8 to
97b49a6
Compare
@apaszke We now auto-detect IB and enable the IB build when a device is detected. The user can also force the IB build by using
97b49a6 to
e2c4fd2
Compare
3c6939b to
a67ea5a
Compare
a82e11e to
fb4eb7f
Compare
The CI failure is unrelated; it was fixed in #4826.
fb4eb7f to
9a143f7
Compare
9a143f7 to
b5a4201
Compare
@pytorchbot retest this please
tools/setup_helpers/ib_detect.py
Outdated
    IB_DEVINFO_CMD = "ibv_devinfo"


    def get_command_path(command):
tools/setup_helpers/ib_detect.py
Outdated
    if len(res) != 1:
        raise Exception("-- IB_detect: unexpected parsing error while "
                        "trying to find the number of available devices.")
    return int(res[0])
setup.py
Outdated
    my_env["CUDNN_LIBRARY"] = CUDNN_LIBRARY
    my_env["CUDNN_INCLUDE_DIR"] = CUDNN_INCLUDE_DIR

    if WITH_DISTRIBUTED and (WITH_IB_DEVICES or
tools/setup_helpers/ib_detect.py
Outdated
        WITH_IB_DEVICES = True

    else:
        print("-- IB_detect: no IB device detected, compiling with no IB support "
454ae08 to
1ef007c
Compare
1ef007c to
b45de3a
Compare
This PR enables proper and easy use of Infiniband support in the Gloo backend for distributed training.
Building PyTorch with
python ./setup.py install
now takes care of everything by automatically detecting IB devices on the system.
The Gloo helper function ::gloo::transport::ibverbs::getDeviceNames was added earlier by me to automatically find all IB interfaces in the system.
For the Gloo data channel and cache, we now use a vector to store all the devices (not used currently, but easy to extend in the future to support multiple IB devices).
Also fixed the format error in the bcast GPU direct check.
Tested with both TCP and IB; both work fine.
Snippet of build logs:
PLUS
Added a helper script to automatically detect IB devices in the system and enable the IB build by default. The user also has the option to force the IB build using
USE_GLOO_IBVERBS python ./setup.py install
IB detected
No IB detected
No IB tool found
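For readers who want a concrete picture of the three cases above (IB detected, no IB detected, no IB tool found), here is a minimal sketch of how such a detection helper could work. It is not the code from this PR. The tool name `ibv_devinfo` and the `len(res) != 1` check come from the reviewed snippets; the function names `parse_device_count`, `detect_ib_device_count` and `should_build_with_ib`, the `--list` output format ("N HCAs found:"), and the way the `USE_GLOO_IBVERBS` value is interpreted are all assumptions made for illustration.

```python
import os
import re
import shutil
import subprocess

IB_DEVINFO_CMD = "ibv_devinfo"  # tool name taken from the reviewed ib_detect.py


def parse_device_count(output):
    """Extract the device count from `ibv_devinfo --list` style output.

    Assumes a summary line such as "2 HCAs found:"; this format is an
    assumption of the sketch, not something quoted from the PR.
    """
    res = re.findall(r"(\d+)\s+HCAs?\s+found", output)
    if len(res) != 1:
        raise ValueError("unexpected output while counting IB devices")
    return int(res[0])


def detect_ib_device_count(command=IB_DEVINFO_CMD):
    """Return the number of IB devices, or 0 if the tool is missing or fails."""
    path = shutil.which(command)
    if path is None:
        # "No IB tool found" case: treat it as no devices.
        return 0
    try:
        out = subprocess.check_output([path, "--list"], universal_newlines=True)
    except (subprocess.CalledProcessError, OSError):
        return 0
    return parse_device_count(out)


def should_build_with_ib(device_count, environ=None):
    """Enable the IB build if a device was found or the user forces it.

    How the USE_GLOO_IBVERBS value is interpreted here is hypothetical.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("USE_GLOO_IBVERBS", "0").lower()
    forced = value not in ("", "0", "false", "off", "no")
    return device_count > 0 or forced


if __name__ == "__main__":
    count = detect_ib_device_count()
    print("IB devices:", count, "| build with IB:", should_build_with_ib(count))
```

Keeping the parsing separate from the subprocess call means the count logic can be exercised on machines without any IB tooling installed.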