
Conversation

@trisolaran
Contributor

The trace viewer normally involves a huge amount of data that the browser cannot
handle. The streaming trace viewer uses gRPC to connect to a TPU profiler
analysis service, fetching a small part of the whole trace and displaying it to
the user. The amount of data depends on the current viewport (defined by a
start and an end timestamp) and the resolution (defined by the zoom level).
Currently the server resides on a Google-owned virtual machine; there are
plans to open source those parts.

To avoid confusing users, the patch is done in a backward-compatible
fashion. Users are expected to use the old trace viewer (which limits
events to 1 million). If the user sets the MASTER_TPU environment variable to
the IP address of the green VM, we try to establish a gRPC stub to
port 8466 at that IP. Only if the stub connects successfully do we use the
new tool 'trace_viewer@' to OVERRIDE the old trace_viewer. The override
hides the old trace viewer tool to avoid confusing users.

Upon receiving the tool/data probe in Python, we determine which
tools to use. Using 'trace_viewer@' signals the downstream
component that the streaming trace viewer is in use.

Most of the trace-viewer-related code already exists within google3. I
removed host filters etc. because we don't support multiple
hosts yet.
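The probing described above can be sketched as follows. This is a minimal Python illustration, not the actual plugin code; `choose_trace_viewer_tool` and `stub_factory` are hypothetical names, and the real PR wires this into the profile plugin's tool-probing logic.

```python
import os

STREAMING_TOOL = 'trace_viewer@'  # '@' suffix signals streaming to downstream
LEGACY_TOOL = 'trace_viewer'
GRPC_PORT = 8466


def choose_trace_viewer_tool(logdir, stub_factory, env=os.environ):
    """Return the tool name to advertise, preferring streaming when available.

    stub_factory(address) is assumed to return a connected gRPC stub, or
    None when the connection cannot be established (hypothetical signature).
    """
    master_tpu = env.get('MASTER_TPU')
    if master_tpu and logdir.startswith('gs://'):
        stub = stub_factory('%s:%d' % (master_tpu, GRPC_PORT))
        if stub is not None:
            # Streaming overrides (hides) the old trace viewer tool.
            return STREAMING_TOOL
    return LEGACY_TOOL
```

Only when both conditions hold (MASTER_TPU set, logdir on GCS) and the stub actually connects does the streaming tool replace the legacy one, so existing users see no change.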

@chihuahua chihuahua self-requested a review April 2, 2018 21:14
_loadedRange: Object,
_loadedTraceEvents: Object,
_fullBounds: Object,
_isLoading: { type:Boolean, value: false },
Member

nit: Space before types.

Contributor Author

done

this.traceDataUrl = decodeURIComponent(components[1]);
break;
} else if (components[0] == "is_streaming") {
this._isStreaming = components[1] == 'true' ? true : false;
Member

Why do we have to do a string comparison here? Why is that list item not parsed as a boolean value?

Contributor Author

It is the result of the string split on line 97; no parsing is done.
I changed this following
https://stackoverflow.com/questions/263965/how-can-i-convert-a-string-to-boolean-in-javascript

req.responseType = 'arraybuffer';
}

req.onreadystatechange = function(event) {
Member

Tip: If you use arrow functions, i.e.

event => {
  ...
}

no binding is necessary. This is optional though.

"use strict";

/* streaming mode helper functions */
var ZOOM_RATIO = 8; // Amount of zooming allowed before re-fetching.
Member

One issue with declaring variables and functions here is that they are placed in the global scope of TensorBoard. Could you please place these functions and constants within a TypeScript file within this directory? Say tf-trace-viewer-helper.ts. And then surround these properties with a namespace. It should look something like this:

/**
 * @fileoverview Helper utilities for the trace viewer within TensorBoard's profile plugin.
 */

module tf.profile.traceViewer {
  /** Amount of zooming allowed before re-fetching. */
  export const ZOOM_RATIO = 8;

  /** Minimum safety buffer relative to viewport size. */
  export const PRESERVE_RATIO = 2;

  /** Amount to fetch relative to viewport size. */
  export const FETCH_RATIO = 3;

  export interface Range {
    min: number;
    max: number;
  }

  /**
   * A brief docstring ...
   */
  export function expand(range: Range, scale: number): Range {
    var width = range.max - range.min;
    var mid = range.min + width / 2;
    return {
      min: mid - scale * width / 2,
      max: mid + scale * width / 2,
    };
  }
  /**
   * A brief docstring ...
   */
  export function within(range: Range, bounds: Range): boolean {
    return bounds.min <= range.min && range.max <= bounds.max;
  }
  /**
   * A brief docstring ...
   */
  export function length(range: Range): number {
    return range.max - range.min;
  }
  /**
   * A brief docstring ...
   */
  function intersect(range: Range, bounds: Range): Range {
    return {
      min: Math.max(range.min, bounds.min),
      max: Math.min(range.max, bounds.max),
    };
  }
  /**
   *  Compute the {min, max} range of a trackView.
   *  TODO: Spec out an interface for the trackView.
   */
  function trackViewRange(trackView): Range {
    var xfm = trackView.viewport.currentDisplayTransform;
    const pixelRatio = window.devicePixelRatio || 1;
    const devicePixelWidth = pixelRatio * trackView.viewWidth_;
    return {
      min: xfm.xViewToWorld(0),
      max: xfm.xViewToWorld(devicePixelWidth),
    };
  }
  
} // close module tf.profile.traceViewer

And then, you'd include the script within the HTML file. See how utils.js is used here:
https://github.com/tensorflow/tensorboard/tree/master/tensorboard/plugins/profile/tf_op_profile

Within the HTML, you'd access properties via the namespace, ie tf.profile.traceViewer.trackViewRange
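For readers following along, the reviewer's `Range` helpers translate directly. The sketch below is a Python transcription, plus a hypothetical `needs_fetch` showing one plausible way the three constants combine; the actual decision logic lives in the trace viewer's JavaScript.

```python
# Transcription of the reviewer's Range helpers; ranges are {min, max} dicts.
ZOOM_RATIO = 8      # amount of zooming allowed before re-fetching
PRESERVE_RATIO = 2  # minimum safety buffer relative to viewport size
FETCH_RATIO = 3     # amount to fetch relative to viewport size


def expand(range_, scale):
    """Scale a range about its midpoint by the given factor."""
    width = range_['max'] - range_['min']
    mid = range_['min'] + width / 2
    return {'min': mid - scale * width / 2, 'max': mid + scale * width / 2}


def within(range_, bounds):
    """True when range_ lies entirely inside bounds."""
    return bounds['min'] <= range_['min'] and range_['max'] <= bounds['max']


def needs_fetch(viewport, loaded):
    """Hypothetical heuristic: re-fetch once the safety-buffered viewport
    (PRESERVE_RATIO times its size) escapes the already-loaded range."""
    return not within(expand(viewport, PRESERVE_RATIO), loaded)
```

Under this reading, each fetch covers FETCH_RATIO times the viewport, so the user can pan or zoom within the buffered region before another round trip is needed.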

Contributor Author

PTAL

caisq and others added 4 commits April 3, 2018 14:36
* [Debugger Plugin] Fix wrapper name for keras snippet
Without this flag, Bazel basically builds the entire codebase twice. Once for
AMD k8 CPUs from 2003, and a second time for -march=native (host).
var queryString = window.location.href.split("?")[1];
if (queryString) {
var parts = queryString.split('&')
console.log(queryString);
Contributor

Did you intend to leave this in?

Contributor Author

I will remove it.

# 2. the logdir is on google cloud storage.
self.stub = None
if 'MASTER_TPU' in os.environ and self.logdir.startswith('gs://'):
master_tpu = os.environ['MASTER_TPU']
Contributor

@jart jart Apr 3, 2018

Rather than an environment variable, I would prefer this be configured using a --profile_tpu_master=URI flag. I would recommend doing this by defining a group of grpc URI schemes that encompasses the following connectivity options

  1. Port defaults to 8466 if not specified.
  2. If scheme is empty, None, or grpc it prints a list asking the user to choose one of the following schemes.
  3. The grpc+insecure scheme must be present in order for grpc.insecure_channel to be called. For example: --profile_tpu_master=grpc+insecure:localhost. Please note this will canonicalize to either grpc+insecure:[::1]:8466 or grpc+insecure:127.0.0.1:8466 depending on the system.
  4. The grpc+ssl scheme allows one to use ssl_channel_credentials(**{k: open(v).read() for k, v in urlparse.parse_qs(uri.params).items()}). For example grpc+ssl:myservice.example.com:443?root_certificates=roots.pem
  5. The grpc+google scheme will use GoogleCredentials.getApplicationDefault(). Query parameters to specialize this can be defined in the future.

Those are the only ones worth supporting at the moment it seems.

This is how you parse URIs in Python 2+3.

>>> try:
...   from urllib import parse as urlparse
... except ImportError:
...   import urlparse
>>> urlparse.urlparse('grpc:localhost?insecure')
ParseResult(scheme='grpc', netloc='', path='localhost', params='', query='insecure', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse('grpc:localhost?insecure').query, keep_blank_values=True)
{'insecure': ['']}

See also: https://tools.ietf.org/html/rfc3986#section-3.1
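A minimal sketch of the scheme handling described above, assuming the URI forms jart lists. `parse_profiler_uri` is a hypothetical helper; actual channel construction (`grpc.insecure_channel`, `ssl_channel_credentials`, application-default credentials) would dispatch on the returned scheme.

```python
# Python 2+3 import, as in the doctest above.
try:
    from urllib import parse as urlparse  # Python 3
except ImportError:
    import urlparse  # Python 2

DEFAULT_PORT = 8466  # per point 1: port defaults to 8466 if not specified


def parse_profiler_uri(uri):
    """Split a --profile_tpu_master URI into (scheme, target, query dict)."""
    parsed = urlparse.urlparse(uri)
    scheme = parsed.scheme or None
    target = parsed.netloc or parsed.path
    # Append the default port unless one is already present (the rsplit
    # keeps a bracketed IPv6 host like [::1]:8466 intact).
    if ':' not in target.rsplit(']')[-1]:
        target = '%s:%d' % (target, DEFAULT_PORT)
    query = urlparse.parse_qs(parsed.query, keep_blank_values=True)
    return scheme, target, query
```

A caller would then raise the "choose one of the following schemes" message for an empty/`grpc` scheme and hand `target` and `query` to the appropriate credentials constructor.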

Contributor Author

Can we punt this until later? It is not clear to me right now how we can receive the master TPU; it might be handy to use os.environ.

Contributor

We can punt. I'd like to learn more about why os.environ is more helpful than a flag?

I'll approve if the following things happen:

  1. PR uses a flag with "insecure" in the name.
  2. PR does not create gRPC channel in constructor, nor is_active, but lazily once the Profile dashboard is activated by the user, and needs this functionality.
  3. I have your support for deleting that flag, as soon as I implement a common library that opens gRPC channels using the URI scheme I defined above.

Contributor Author

I fixed 1 and 2. I can refactor or test when 3 is ready. PTAL

Member

@trisolaran, I think @jart has a point about how TensorBoard lacks a precedent for reading environment variables.

What do you think about directly passing the master_tpu as a flag to TensorBoard? Then, the user could obtain the master TPU via bash and pass it to the flag?

Contributor Author

I already did it.

chihuahua and others added 4 commits April 3, 2018 20:32
We add a `rankDirection` property to the `tf-graph-loader` component.
This property is passed to the dagre layout engine as the 'rankdir'
graph option. We hence let the user change which way the graph flows.
For instance, we can let the graph flow from top to bottom.

This PR is a small part of generalizing the graph explorer.
Tensorflow's C API now will now raise this error using SWIG without having to raise it in explicitly in Python.
Previously, TensorBoard found runs within the log directory by
performing a `tf.gfile.Walk` across directories. That turned out to be
very slow for users with expansive log directories (with many
subdirectories and files) because `tf.gfile.Walk` makes many calls at
the python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain (such as remote and distributed) file systems.

We find runs (subdirectories) much more quickly by iteratively glob-ing
deeper and deeper directory levels. For a certain internal user, this
reduced the time it took to find all runs from 20 minutes to 20 seconds.

We already have tests that check whether the multiplexer correctly finds
subdirectories that are runs, so this pull request adds no further
tests. We preserve that correctness and just speed up that logic. This
change may result in us finding runs in different orders, but that is
fine - TensorBoard documents no guarantees in terms of the order in
which runs are discovered.

Thank you to @nfelt for this idea and the fruitful brainstorming
session.
@trisolaran
Contributor Author

I had done 1 and 2; I will refactor and retest when 3 is ready. PTAL

@trisolaran
Contributor Author

Hi, thanks for taking the time to review this. What further things do I need to do here?

options = [('grpc.max_message_length', gigabyte),
('grpc.max_send_message_length', gigabyte),
('grpc.max_receive_message_length', gigabyte)]
tpu_profiler_port = FLAGS.master_tpu_unsecure_channel + ':8466'
Member

Thank you for adding this flag!

Contributor Author

thank you :)
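The quoted setup can be factored so that the target and channel options are built by a pure function, keeping the gRPC call itself at the edge. This is a sketch under the assumption that the flag holds a bare host or IP; `build_channel_args` is a hypothetical name.

```python
GIGABYTE = 1024 * 1024 * 1024  # raise gRPC message limits for large traces


def build_channel_args(master_address, port=8466):
    """Build the (target, options) pair for the profiler channel."""
    target = '%s:%d' % (master_address, port)
    options = [('grpc.max_message_length', GIGABYTE),
               ('grpc.max_send_message_length', GIGABYTE),
               ('grpc.max_receive_message_length', GIGABYTE)]
    return target, options


# At the call site (requires the grpc package):
#   import grpc
#   target, options = build_channel_args(FLAGS.master_tpu_unsecure_channel)
#   channel = grpc.insecure_channel(target, options=options)
```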

@trisolaran
Contributor Author

friendly ping

requestURL.searchParams.set("end_time_ms", requestedRange.max);
}

return fetch(requestURL.toString()).then(
Contributor

Taking a closer look, could we not refactor the XMLHTTPRequest to use fetch? I like XMLHttpRequest.

Contributor Author

done.

function(response) {
if (response.status !== 200) {
var msg = requestURL + ' could not be loaded';
if (response.headers.get('Content-Type').startsWith('text/plain')) {
Contributor

Wouldn't you want to fail if it isn't application/json?

Contributor Author

It throws an Error on line 239. This again is old code; my guess is that when the HTTP return code is not 200, the text/plain body contains a verbose error message that needs to be appended.

Contributor Author

I changed the result content type to application/json.

}.bind(this), 0);
}.bind(this);
req.send(null);
console.log('loaded:');
Contributor

Please remove this line.

Contributor Author

done.

var req = new XMLHttpRequest();
var is_binary = / [.] gz$ /.test(this.traceDataUrl) ||
/ [.] zip$ /.test(this.traceDataUrl);
req.overrideMimeType('text/plain; charset=x-user-defined');
Contributor

Why is the mime type being set to text/plain; charset=x-user-defined?

Contributor Author

Ditto, I don't know. It's probably not used.

Contributor Author

done, removed.

if (!this._isStreaming) {
// Send HTTP request to get the trace data.
var req = new XMLHttpRequest();
var is_binary = / [.] gz$ /.test(this.traceDataUrl) ||
Contributor

Why are there spaces in this regex?

Contributor Author

Good question. I don't know the answer; in my opinion they are probably not needed.
They were in the original code, which I did not write, so I am not sure whether they are used.
I think the plan was to have both json and gzip files (to reduce the communication size? to alleviate the gRPC message size limit?).

Contributor Author

done, removed.

window.setTimeout(() => {
if (req.status === 200) {
this._throbber.className = "inactive";
this.set("_traceData", is_binary ? req.response : req.responseText);
Contributor

Why are binary and textual responses handled by the same business logic? Does this XHR really fetch both the HTML page in addition to streaming binary data? What sort of streaming binary data is it fetching?

Contributor Author

I guess is_binary is never set to true in the old code, and I feel uncomfortable changing it right now in this CL.
My intention is to maintain backward compatibility.

Contributor Author

done, removed.

1. remove code path related to trace.json.gz.
2. change fetch api to XHR.
@trisolaran
Contributor Author

PTAL.

jart and others added 2 commits April 9, 2018 16:37
Previously, when the reload_interval flag was set to 0, TensorBoard
would reload the multiplexer once at the beginning of TensorBoard
startup on the main thread. Now, TensorBoard performs that reload on a
separate daemon thread. As comments note, this behavior prevents joining
threads that start while TensorBoard is reloading from blocking
TensorBoard from quitting after the user hits Ctrl + C.

This PR introduces a change in behavior. When the user starts a
TensorBoard with reload_interval set to 0 (no continuous reloading),
TensorBoard server starts up before all data is read from disk.
Previously, TensorBoard would read all events files (reload
multiplexers) before starting its server. A few tests were modified
accordingly.

Test Plan:

Start TensorBoard pointed at logs from mnist_with_summaries.py with the
reload interval set to 0. Verify that TensorBoard starts up successfully
and that the multiplexer only reloads once in the beginning.

Start TensorBoard with the reload interval set to 5 seconds. Verify that
TensorBoard starts up successfully and that the multiplexer continuously
reloads with a 5-second interval in between reloads.

Start TensorBoard with the reload interval set to -1. Verify that
TensorBoard never performs a reload.
@trisolaran
Contributor Author

friendly pong

qiuminxu and others added 21 commits April 11, 2018 12:50
This change fixes a regression that might have been introduced by a TypeScript
upgrade a few months ago, where Facets wasn't able to compile, possibly due to
tsc.js no longer adding .js to ES6 imports. The solution was to set module
resolution to NODE in the Closure Compiler. The Polymer dep was upstreamed for
more completeness for Facets.

This change also ensures that the development web server, when we `bazel run` a
`tf_web_library` or `tensorboard_html_binary` rule, will display sources in the
correct order we had originally, before Bazel changed from set() to depset().
@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@trisolaran trisolaran closed this Apr 12, 2018