Skip to content

Conversation

@qiuminxu
Copy link
Contributor

@qiuminxu qiuminxu commented Aug 15, 2018

#This allows tensorboard to read profiling data from parent directory to
see multiple models.
#This PR fixes issue #1290

parent_logs

This allows tensorboard to read profiling data from parent directory to
see multiple models.
Example:
log/
run1/
model1/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not familiar with the details of how profile plugin data is generated, but it seems surprising if #1290 is asking about this type of directory hierarchy, with a new level of subdirectories under plugins/profile.

What I would expect is that the desire is to be able to use a new level (or multiple levels) of subdirectories above the current single log directory expected by the plugin. I.e. where instead of root/run#/plugins/profile, the user would have root/model#/run#/plugins/profile. Basically, the request is just that any subdirectory of the logdir root can be considered a "run" by the profile plugin, rather than only first-level subdirectories. This would be consistent with the way the subdirectory traversal works for the standard summary-based plugins. It would also match the TODO(ioeric): comment below.

Also, I could be mistaken but it looks like the code below actually requires there to be subdirectories under plugins/profile now, which would break compatibility with older log directories?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nfelt The code actually searches for any subdirectories under root that called profile.
A run subdirectory was always created under the plugins/profile/ directory, for each profile runs. Note here the run means profile runs rather than model runs: when 1 model is running the user can capture a TPU profile for multiple times. Therefore, it makes sense to have the run subdirectories under plugins/profile. We expect the users to have root/model#/plugins/profile/run#, where root/model# is the model checkpoints. I think it is a expected user behavior when running cloud TPU models in GCE, and that's what #1290 is asking for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record - I misread the code here due to confusion about the terminology of the run subdirectories, this is fine and we're on the same page.

{"model1,run1": ["trace_viewer"], "model1,run2": ["trace_viewer"]}
for the example.
"""
# TODO(ioeric): use the following structure and use EventMultiplexer so that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the code below seems to be doing breadth-first expansion over the entire logdir directory tree in search of profile directories, which could be extremely slow/expensive when the logdir is large and on a remote filesystem. This has come up before in real world cases, e.g. see the description of PR #1087 (the code here is basically is doing the equivalent of tf.gfile.Walk() except by hand).

This is a concern because the code is part of index_impl() which is invoked by is_active(), and is_active() runs for every plugin - even those that aren't active - every time a user reloads their TensorBoard, so we've been pushing for them to be lightweight (see #625).

I think the best way to avoid the issue would be what this TODO suggests - using EventMultiplexer, which would mean having the profile data generation side emit an events file with a summary that serves to identify that subdirectory as a "run" with profile data (the summary doesn't need to actually contain the profile data, though doing so would be a bit more future-proof). That way the shared EventMultiplexer logic can centralize the job of searching the directory tree for data, and the profile plugin's is_active() can just do a quick and efficient call to the EventMultiplexer to see if any profile data has been discovered yet.

Another approach that would at least alleviate the time penalty is to just push the logic here into a separate thread that runs asynchronously from is_active() as discussed on #625, but that isn't ideal because then users are still paying the cost of doing this potentially expensive traversal of the full directory hierarchy on every reload (redundantly from the main EventMultiplexer traversal) - all it fixes is the latency issue but not other possible issues like sending a lot of extra GCS API calls if the logdir is on GCS, etc.

I'm happy to discuss these options in more detail if that's helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code simply tries to find a sub directory under root, that matches with "profile". EventMultiplexer is a good suggestion, the only thing is that we are saving traces rather than event files. Let's discuss offline to see how we should make it work for profile plugins. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's see if we can use EventMultiplexer logic to detect the subdirectories to load data from.

@nfelt
Copy link
Contributor

nfelt commented Feb 28, 2019

@qiuminxu @qqfish I think I may have a cleaner way to do this that I can try out. Do you know if the demo data generated by this script is still valid? https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/profile/profile_demo.py

If not, is there something I could use as test data to check if the approach I'm considering will work?

@wchargin
Copy link
Contributor

The data generated by that script is what I have been using for testing.
It appears to work; I am assuming that it is still valid. (If it’s not,
please advise!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants