
Conversation

@ngimel (Collaborator) commented Sep 13, 2020

Per the title, this adds torch.cuda.memory.list_gpu_processes, to make it easier to track the creation of stray CUDA contexts:

```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```
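As a usage illustration, here is a minimal sketch of calling the new helper from Python when handling an out-of-memory error. Only torch.cuda.memory.list_gpu_processes comes from this PR; the allocate_or_report wrapper, the RuntimeError message check, and the tensor shape are assumptions made for the example:

```
# Illustrative sketch (assumed usage): print per-GPU process info when an
# allocation fails. Only torch.cuda.memory.list_gpu_processes comes from this
# PR; the wrapper and sizes below are made up for the example.
import torch

def allocate_or_report(shape, device="cuda:0"):
    try:
        return torch.empty(shape, device=device)
    except RuntimeError as e:
        # At the time of this PR, a CUDA OOM surfaces as a RuntimeError
        # whose message contains "out of memory".
        if "out of memory" in str(e):
            # One printout per visible device helps spot stray contexts.
            for i in range(torch.cuda.device_count()):
                print(torch.cuda.memory.list_gpu_processes(i))
        raise

if __name__ == "__main__":
    t = allocate_or_report((1024, 1024))
    print(t.shape)
```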

@ngimel requested review from mcarilli and mruberry, September 13, 2020 17:49
dr-ci bot commented Sep 13, 2020

💊 CI failures summary and remediations

As of commit 9c069ee (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

```
Sep 14 01:20:12 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 01:20:12 At:
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize
Sep 14 01:20:12
Sep 14 01:20:12 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 01:20:12
Sep 14 01:20:12 At:
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize
Sep 14 01:20:12
Sep 14 01:20:12 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 01:20:12
Sep 14 01:20:12 At:
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize
Sep 14 01:20:12   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize
Sep 14 01:20:12
Sep 14 01:20:12 ok (1.638s)
Sep 14 01:20:14   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.638s)
Sep 14 01:20:16   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.645s)
Sep 14 01:20:17   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ERROR:root:Caught exception:
Sep 14 01:20:17 Traceback (most recent call last):
```


Collaborator commented on this excerpt of the docstring:

```
    handling out-of-memory exceptions.

    Arguments:
        device (torch.device or int, optional): selected device. Returns
```

nit: I think this could accept a string, too.

```
def _get_device_index(device: Union[Device, str, int], optional: bool = False,
```
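For illustration, a small sketch of how a string device could be normalized to an index by leaning on torch.device; the _device_to_index helper below is hypothetical and not part of this PR:

```
# Hypothetical illustration of accepting a string device in addition to
# torch.device and int; torch.device handles the parsing and exposes .index.
import torch
from typing import Union

def _device_to_index(device: Union[torch.device, str, int]) -> int:
    if isinstance(device, int):
        return device
    if isinstance(device, str):
        # e.g. "cuda:1" -> device(type='cuda', index=1)
        device = torch.device(device)
    return device.index

print(_device_to_index(1))                       # 1
print(_device_to_index("cuda:1"))                # 1
print(_device_to_index(torch.device("cuda:0")))  # 0
```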

"""

try:
import pynvml # type: ignore
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lint failure is real
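For readers unfamiliar with pynvml, a rough sketch of how a per-device process listing can be assembled from its API; this approximates the idea and is not necessarily the exact code merged here (gpu_process_report is a made-up name):

```
# Rough approximation of a pynvml-backed process listing; pid and
# usedGpuMemory are fields of pynvml's compute-process entries.
import pynvml

def gpu_process_report(device_index: int = 0) -> str:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        lines = [f"GPU:{device_index}"]
        if not procs:
            lines.append("no processes are running")
        for p in procs:
            mem_mb = (p.usedGpuMemory or 0) / (1024 * 1024)
            lines.append(f"process {p.pid:>10d} uses {mem_mb:>12.3f} MB GPU memory")
        return "\n".join(lines)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(gpu_process_report(0))
```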

@mruberry (Collaborator) left a comment:

Cool!

@facebook-github-bot (Contributor) left a comment:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 95a69a7.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
Summary:
per title, to make it easier to track the creation of stray contexts:
```
python -c "import torch; a=torch.randn(1, device='cuda'); print(torch.cuda.memory.list_gpu_processes(0)); print(torch.cuda.memory.list_gpu_processes(1))"
GPU:0
process      79749 uses      601.000 MB GPU memory
GPU:1
no processes are running
```

Pull Request resolved: #44616

Reviewed By: mruberry

Differential Revision: D23675739

Pulled By: ngimel

fbshipit-source-id: ffa14cad9d7144e883de13b1c2c6817bd432f53a
@ngimel deleted the list_gpu_processes branch September 30, 2020 04:32