
Conversation

@ngimel (Collaborator) commented Sep 13, 2020

Per the title, this adds torch.cuda.memory.list_gpu_processes, to make it easier to track the creation of stray CUDA contexts:

```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```
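As a usage illustration, here is a minimal sketch of calling the new helper from Python when handling an out-of-memory error. Only torch.cuda.memory.list_gpu_processes comes from this PR; the allocate_or_report wrapper, the RuntimeError message check, and the tensor shape are assumptions made for the example:

```
# Illustrative sketch (assumed usage): print per-GPU process info when an
# allocation fails. Only torch.cuda.memory.list_gpu_processes comes from this
# PR; the wrapper and sizes below are made up for the example.
import torch

def allocate_or_report(shape, device="cuda:0"):
    try:
        return torch.empty(shape, device=device)
    except RuntimeError as e:
        # At the time of this PR, a CUDA OOM surfaces as a RuntimeError
        # whose message contains "out of memory".
        if "out of memory" in str(e):
            # One printout per visible device helps spot stray contexts.
            for i in range(torch.cuda.device_count()):
                print(torch.cuda.memory.list_gpu_processes(i))
        raise

if __name__ == "__main__":
    t = allocate_or_report((1024, 1024))
    print(t.shape)
```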

@ngimel requested review from mcarilli and mruberry, September 13, 2020 17:49
dr-ci bot commented Sep 13, 2020

💊 CI failures summary and remediations

As of commit 9c069ee (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

```
Sep 14 01:20:12 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 01:20:12 At:
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize
Sep 14 01:20:12
Sep 14 01:20:12 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 01:20:12
Sep 14 01:20:12 At:
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize
Sep 14 01:20:12
Sep 14 01:20:12 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 01:20:12
Sep 14 01:20:12 At:
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize
Sep 14 01:20:12
Sep 14 01:20:12 ok (1.638s)
Sep 14 01:20:14   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.638s)
Sep 14 01:20:16   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.645s)
Sep 14 01:20:17   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ERROR:root:Caught exception:
Sep 14 01:20:17 Traceback (most recent call last):
```


Collaborator commented on this excerpt of the docstring:

```
    handling out-of-memory exceptions.

    Arguments:
        device (torch.device or int, optional): selected device. Returns
```

nit: I think this could accept a string, too.

```
def _get_device_index(device: Union[Device, str, int], optional: bool = False,
```
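For illustration, a small sketch of how a string device could be normalized to an index by leaning on torch.device; the _device_to_index helper below is hypothetical and not part of this PR:

```
# Hypothetical illustration of accepting a string device in addition to
# torch.device and int; torch.device handles the parsing and exposes .index.
import torch
from typing import Union

def _device_to_index(device: Union[torch.device, str, int]) -> int:
    if isinstance(device, int):
        return device
    if isinstance(device, str):
        # e.g. "cuda:1" -> device(type='cuda', index=1)
        device = torch.device(device)
    return device.index

print(_device_to_index(1))                       # 1
print(_device_to_index("cuda:1"))                # 1
print(_device_to_index(torch.device("cuda:0")))  # 0
```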

"""

try:
import pynvml # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lint failure is real
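For readers unfamiliar with pynvml, a rough sketch of how a per-device process listing can be assembled from its API; this approximates the idea and is not necessarily the exact code merged here (gpu_process_report is a made-up name):

```
# Rough approximation of a pynvml-backed process listing; pid and
# usedGpuMemory are fields of pynvml's compute-process entries.
import pynvml

def gpu_process_report(device_index: int = 0) -> str:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        lines = [f"GPU:{device_index}"]
        if not procs:
            lines.append("no processes are running")
        for p in procs:
            mem_mb = (p.usedGpuMemory or 0) / (1024 * 1024)
            lines.append(f"process {p.pid:>10d} uses {mem_mb:>12.3f} MB GPU memory")
        return "\n".join(lines)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(gpu_process_report(0))
```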

@mruberry (Collaborator) left a comment:

Cool!

@facebook-github-bot (Contributor) left a comment:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 95a69a7.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
Summary:
per title, to make it easier to track the creation of stray contexts:
```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```

Pull Request resolved: #44616

Reviewed By: mruberry

Differential Revision: D23675739

Pulled By: ngimel

fbshipit-source-id: ffa14cad9d7144e883de13b1c2c6817bd432f53a
@ngimel deleted the list_gpu_processes branch September 30, 2020 04:32