
Conversation

@trisolaran
Contributor

The trace viewer normally involves a huge amount of data that the browser cannot
handle. The streaming trace viewer uses gRPC to connect to a TPU profiler
analysis service, fetching a small part of the whole trace and displaying it to
the user. The amount of data depends on the current viewport (defined by a
start and an end timestamp) and the resolution (defined by the zoom level).
Currently the server resides on a Google-owned virtual machine; there are
plans to open source those parts.

To avoid confusing users, the patch is done in a backward-compatible
fashion. Users are expected to use the old trace viewer (which limits
events to 1 million). If the user sets the MASTER_TPU environment variable to
the IP address of the green VM, we try to establish a gRPC stub to
port 8466 at that IP. Only if the stub connects successfully do we use the
new tool 'trace_viewer@' to OVERRIDE the old trace_viewer. The override
hides the old trace viewer tool to avoid confusing users.

Upon receiving the tool/data probe in Python, we determine which
tools to use. Using 'trace_viewer@' signals the downstream
component that the streaming trace viewer is in use.

Most of the trace-viewer-related code already exists within google3. I
removed host filters etc. because we don't support multiple
hosts yet.
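The probing described above can be sketched as follows. This is a minimal Python illustration, not the actual plugin code; `choose_trace_viewer_tool` and `stub_factory` are hypothetical names, and the real PR wires this into the profile plugin's tool-probing logic.

```python
import os

STREAMING_TOOL = 'trace_viewer@'  # '@' suffix signals streaming to downstream
LEGACY_TOOL = 'trace_viewer'
GRPC_PORT = 8466


def choose_trace_viewer_tool(logdir, stub_factory, env=os.environ):
    """Return the tool name to advertise, preferring streaming when available.

    stub_factory(address) is assumed to return a connected gRPC stub, or
    None when the connection cannot be established (hypothetical signature).
    """
    master_tpu = env.get('MASTER_TPU')
    if master_tpu and logdir.startswith('gs://'):
        stub = stub_factory('%s:%d' % (master_tpu, GRPC_PORT))
        if stub is not None:
            # Streaming overrides (hides) the old trace viewer tool.
            return STREAMING_TOOL
    return LEGACY_TOOL
```

Only when both conditions hold (MASTER_TPU set, logdir on GCS) and the stub actually connects does the streaming tool replace the legacy one, so existing users see no change.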

@chihuahua chihuahua self-requested a review April 2, 2018 21:14
_loadedRange: Object,
_loadedTraceEvents: Object,
_fullBounds: Object,
_isLoading: { type:Boolean, value: false },
Member

nit: Space before types.

Contributor Author

done

this.traceDataUrl = decodeURIComponent(components[1]);
break;
} else if (components[0] == "is_streaming") {
this._isStreaming = components[1] == 'true' ? true : false;
Member

Why do we have to do a string comparison here? Why is that list item not parsed as a boolean value?

Contributor Author

It is the result of the string split on line 97; no parsing is done.
I changed this following
https://stackoverflow.com/questions/263965/how-can-i-convert-a-string-to-boolean-in-javascript

req.responseType = 'arraybuffer';
}

req.onreadystatechange = function(event) {
Member

Tip: If you use arrow functions, i.e.

event => {
  ...
}

no binding is necessary. This is optional though.

"use strict";

/* streaming mode helper functions */
var ZOOM_RATIO = 8; // Amount of zooming allowed before re-fetching.
Member

One issue with declaring variables and functions here is that they are placed in the global scope of TensorBoard. Could you please place these functions and constants within a TypeScript file within this directory? Say tf-trace-viewer-helper.ts. And then surround these properties with a namespace. It should look something like this:

/**
 * @fileoverview Helper utilities for the trace viewer within TensorBoard's profile plugin.
 */

module tf.profile.traceViewer {
  /** Amount of zooming allowed before re-fetching. */
  export const ZOOM_RATIO = 8;

  /** Minimum safety buffer relative to viewport size. */
  export const PRESERVE_RATIO = 2;

  /** Amount to fetch relative to viewport size. */
  export const FETCH_RATIO = 3;

  export interface Range {
    min: number;
    max: number;
  }

  /**
   * A brief docstring ...
   */
  export function expand(range: Range, scale: number): Range {
    var width = range.max - range.min;
    var mid = range.min + width / 2;
    return {
      min: mid - scale * width / 2,
      max: mid + scale * width / 2,
    };
  }
  /**
   * A brief docstring ...
   */
  export function within(range: Range, bounds: Range): boolean {
    return bounds.min <= range.min && range.max <= bounds.max;
  }
  /**
   * A brief docstring ...
   */
  export function length(range: Range): number {
    return range.max - range.min;
  }
  /**
   * A brief docstring ...
   */
  function intersect(range: Range, bounds: Range): Range {
    return {
      min: Math.max(range.min, bounds.min),
      max: Math.min(range.max, bounds.max),
    };
  }
  /**
   *  Compute the {min, max} range of a trackView.
   *  TODO: Spec out an interface for the trackView.
   */
  function trackViewRange(trackView): Range {
    var xfm = trackView.viewport.currentDisplayTransform;
    const pixelRatio = window.devicePixelRatio || 1;
    const devicePixelWidth = pixelRatio * trackView.viewWidth_;
    return {
      min: xfm.xViewToWorld(0),
      max: xfm.xViewToWorld(devicePixelWidth),
    };
  }
  
} // close module tf.profile.traceViewer

And then, you'd include the script within the HTML file. See how utils.js is used here:
https://github.com/tensorflow/tensorboard/tree/master/tensorboard/plugins/profile/tf_op_profile

Within the HTML, you'd access properties via the namespace, ie tf.profile.traceViewer.trackViewRange
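For readers following along, the reviewer's `Range` helpers translate directly. The sketch below is a Python transcription, plus a hypothetical `needs_fetch` showing one plausible way the three constants combine; the actual decision logic lives in the trace viewer's JavaScript.

```python
# Transcription of the reviewer's Range helpers; ranges are {min, max} dicts.
ZOOM_RATIO = 8      # amount of zooming allowed before re-fetching
PRESERVE_RATIO = 2  # minimum safety buffer relative to viewport size
FETCH_RATIO = 3     # amount to fetch relative to viewport size


def expand(range_, scale):
    """Scale a range about its midpoint by the given factor."""
    width = range_['max'] - range_['min']
    mid = range_['min'] + width / 2
    return {'min': mid - scale * width / 2, 'max': mid + scale * width / 2}


def within(range_, bounds):
    """True when range_ lies entirely inside bounds."""
    return bounds['min'] <= range_['min'] and range_['max'] <= bounds['max']


def needs_fetch(viewport, loaded):
    """Hypothetical heuristic: re-fetch once the safety-buffered viewport
    (PRESERVE_RATIO times its size) escapes the already-loaded range."""
    return not within(expand(viewport, PRESERVE_RATIO), loaded)
```

Under this reading, each fetch covers FETCH_RATIO times the viewport, so the user can pan or zoom within the buffered region before another round trip is needed.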

Contributor Author

PTAL

caisq and others added 4 commits April 3, 2018 14:36
* [Debugger Plugin] Fix wrapper name for keras snippet
Without this flag, Bazel basically builds the entire codebase twice. Once for
AMD k8 CPUs from 2003, and a second time for -march=native (host).
var queryString = window.location.href.split("?")[1];
if (queryString) {
var parts = queryString.split('&')
console.log(queryString);
Contributor

Did you intend to leave this in?

Contributor Author

I will remove it.

# 2. the logdir is on google cloud storage.
self.stub = None
if 'MASTER_TPU' in os.environ and self.logdir.startswith('gs://'):
master_tpu = os.environ['MASTER_TPU']
Contributor

@jart jart Apr 3, 2018

Rather than an environment variable, I would prefer this be configured using a --profile_tpu_master=URI flag. I would recommend doing this by defining a group of grpc URI schemes that encompasses the following connectivity options

  1. Port defaults to 8466 if not specified.
  2. If scheme is empty, None, or grpc it prints a list asking the user to choose one of the following schemes.
  3. The grpc+insecure scheme must be present in order for grpc.insecure_channel to be called. For example: --profile_tpu_master=grpc+insecure:localhost. Please note this will canonicalize to either grpc+insecure:[::1]:8466 or grpc+insecure:127.0.0.1:8466 depending on the system.
  4. The grpc+ssl scheme allows one to use ssl_channel_credentials(**{k: open(v).read() for k, v in urlparse.parse_qs(uri.params).items()}). For example grpc+ssl:myservice.example.com:443?root_certificates=roots.pem
  5. The grpc+google scheme will use GoogleCredentials.getApplicationDefault(). Query parameters to specialize this can be defined in the future.

Those are the only ones worth supporting at the moment it seems.

This is how you parse URIs in Python 2+3.

>>> try:
...   from urllib import parse as urlparse
... except ImportError:
...   import urlparse
>>> urlparse.urlparse('grpc:localhost?insecure')
ParseResult(scheme='grpc', netloc='', path='localhost', params='', query='insecure', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse('grpc:localhost?insecure').query, keep_blank_values=True)
{'insecure': ['']}

See also: https://tools.ietf.org/html/rfc3986#section-3.1
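A minimal sketch of the scheme handling described above, assuming the URI forms jart lists. `parse_profiler_uri` is a hypothetical helper; actual channel construction (`grpc.insecure_channel`, `ssl_channel_credentials`, application-default credentials) would dispatch on the returned scheme.

```python
# Python 2+3 import, as in the doctest above.
try:
    from urllib import parse as urlparse  # Python 3
except ImportError:
    import urlparse  # Python 2

DEFAULT_PORT = 8466  # per point 1: port defaults to 8466 if not specified


def parse_profiler_uri(uri):
    """Split a --profile_tpu_master URI into (scheme, target, query dict)."""
    parsed = urlparse.urlparse(uri)
    scheme = parsed.scheme or None
    target = parsed.netloc or parsed.path
    # Append the default port unless one is already present (the rsplit
    # keeps a bracketed IPv6 host like [::1]:8466 intact).
    if ':' not in target.rsplit(']')[-1]:
        target = '%s:%d' % (target, DEFAULT_PORT)
    query = urlparse.parse_qs(parsed.query, keep_blank_values=True)
    return scheme, target, query
```

A caller would then raise the "choose one of the following schemes" message for an empty/`grpc` scheme and hand `target` and `query` to the appropriate credentials constructor.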

Contributor Author

Can we punt this until later? It is not clear to me right now how we can receive the master TPU; it might be handy to use os.environ.

Contributor

We can punt. I'd like to learn more about why os.environ is more helpful than a flag?

I'll approve if the following things happen:

  1. PR uses a flag with "insecure" in the name.
  2. PR does not create gRPC channel in constructor, nor is_active, but lazily once the Profile dashboard is activated by the user, and needs this functionality.
  3. I have your support for deleting that flag, as soon as I implement a common library that opens gRPC channels using the URI scheme I defined above.

Contributor Author

I fixed 1 and 2. I can refactor or test when 3 is ready. PTAL

Member

@trisolaran, I think @jart has a point about how TensorBoard lacks a precedent for reading environment variables.

What do you think about directly passing the master_tpu as a flag to TensorBoard? Then, the user could obtain the master TPU via bash and pass it to the flag?

Contributor Author

I already did it.

chihuahua and others added 4 commits April 3, 2018 20:32
We add a `rankDirection` property to the `tf-graph-loader` component.
This property is passed to the dagre layout engine as the 'rankdir'
graph option. We hence let the user change which way the graph flows.
For instance, we can let the graph flow from top to bottom.

This PR is a small part of generalizing the graph explorer.
Tensorflow's C API now will now raise this error using SWIG without having to raise it in explicitly in Python.
Previously, TensorBoard found runs within the log directory by
performing a `tf.gfile.Walk` across directories. That turned out to be
very slow for users with expansive log directories (with many
subdirectories and files) because `tf.gfile.Walk` makes many calls at
the python level to C++ logic for reading data from disk (as opposed to
performing much of that logic directly at the C++ level). Disk reads can
be very costly for certain (such as remote and distributed) file systems.

We find runs (subdirectories) much more quickly by iteratively glob-ing
deeper and deeper directory levels. For a certain internal user, this
reduced the time it took to find all runs from 20 minutes to 20 seconds.

We already have tests that check whether the multiplexer correctly finds
subdirectories that are runs, so this pull request adds no further
tests. We preserve that correctness and just speed up that logic. This
change may result in us finding runs in different orders, but that is
fine - TensorBoard documents no guarantees in terms of the order in
which runs are discovered.

Thank you to @nfelt for this idea and the fruitful brainstorming
session.
@trisolaran
Contributor Author

I had done 1 and 2; I will refactor and retest when 3 is ready. PTAL

@trisolaran
Contributor Author

Hi, thanks for taking the time to review this. What further things do I need to do here?

options = [('grpc.max_message_length', gigabyte),
('grpc.max_send_message_length', gigabyte),
('grpc.max_receive_message_length', gigabyte)]
tpu_profiler_port = FLAGS.master_tpu_unsecure_channel + ':8466'
Member

Thank you for adding this flag!

Contributor Author

thank you :)
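The quoted setup can be factored so that the target and channel options are built by a pure function, keeping the gRPC call itself at the edge. This is a sketch under the assumption that the flag holds a bare host or IP; `build_channel_args` is a hypothetical name.

```python
GIGABYTE = 1024 * 1024 * 1024  # raise gRPC message limits for large traces


def build_channel_args(master_address, port=8466):
    """Build the (target, options) pair for the profiler channel."""
    target = '%s:%d' % (master_address, port)
    options = [('grpc.max_message_length', GIGABYTE),
               ('grpc.max_send_message_length', GIGABYTE),
               ('grpc.max_receive_message_length', GIGABYTE)]
    return target, options


# At the call site (requires the grpc package):
#   import grpc
#   target, options = build_channel_args(FLAGS.master_tpu_unsecure_channel)
#   channel = grpc.insecure_channel(target, options=options)
```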

@trisolaran
Contributor Author

friendly ping

requestURL.searchParams.set("end_time_ms", requestedRange.max);
}

return fetch(requestURL.toString()).then(
Contributor

Taking a closer look, could we not refactor the XMLHTTPRequest to use fetch? I like XMLHttpRequest.

Contributor Author

done.

function(response) {
if (response.status !== 200) {
var msg = requestURL + ' could not be loaded';
if (response.headers.get('Content-Type').startsWith('text/plain')) {
Contributor

Wouldn't you want to fail if it isn't application/json?

Contributor Author

It throws an Error on line 239. This again is old code; my guess is that when the HTTP return code is not 200, the text/plain body contains a verbose error message that needs to be appended.

Contributor Author

I changed the result content type to application/json.

}.bind(this), 0);
}.bind(this);
req.send(null);
console.log('loaded:');
Contributor

Please remove this line.

Contributor Author

done.

var req = new XMLHttpRequest();
var is_binary = / [.] gz$ /.test(this.traceDataUrl) ||
/ [.] zip$ /.test(this.traceDataUrl);
req.overrideMimeType('text/plain; charset=x-user-defined');
Contributor

Why is the mime type being set to text/plain; charset=x-user-defined?

Contributor Author

Ditto, I don't know. It's probably not used.

Contributor Author

done, removed.

if (!this._isStreaming) {
// Send HTTP request to get the trace data.
var req = new XMLHttpRequest();
var is_binary = / [.] gz$ /.test(this.traceDataUrl) ||
Contributor

Why are there spaces in this regex?

Contributor Author

Good question. I don't know the answer; in my opinion they are probably not needed.
They were in the original code, which I did not write, so I am not sure whether they are used.
I think the plan was to have both json and gzip files (to reduce the communication size? to alleviate the gRPC message size limit?).

Contributor Author

done, removed.

window.setTimeout(() => {
if (req.status === 200) {
this._throbber.className = "inactive";
this.set("_traceData", is_binary ? req.response : req.responseText);
Contributor

Why are binary and textual responses handled by the same business logic? Does this XHR really fetch both the HTML page in addition to streaming binary data? What sort of streaming binary data is it fetching?

Contributor Author

I guess is_binary is never set to true in the old code, and I feel uncomfortable changing it right now in this CL.
My intention is to maintain backward compatibility.

Contributor Author

done, removed.

1. remove code path related to trace.json.gz.
2. change fetch api to XHR.
@trisolaran
Contributor Author

PTAL.

jart and others added 2 commits April 9, 2018 16:37
Previously, when the reload_interval flag was set to 0, TensorBoard
would reload the multiplexer once at the beginning of TensorBoard
startup on the main thread. Now, TensorBoard performs that reload on a
separate daemon thread. As comments note, this behavior prevents joining
threads that start while TensorBoard is reloading from blocking
TensorBoard from quitting after the user hits Ctrl + C.

This PR introduces a change in behavior. When the user starts a
TensorBoard with reload_interval set to 0 (no continuous reloading),
TensorBoard server starts up before all data is read from disk.
Previously, TensorBoard would read all events files (reload
multiplexers) before starting its server. A few tests were modified
accordingly.

Test Plan:

Start TensorBoard pointed at logs from mnist_with_summaries.py with the
reload interval set to 0. Verify that TensorBoard starts up successfully
and that the multiplexer only reloads once in the beginning.

Start TensorBoard with the reload interval set to 5 seconds. Verify that
TensorBoard starts up successfully and that the multiplexer continuously
reloads with a 5-second interval in between reloads.

Start TensorBoard with the reload interval set to -1. Verify that
TensorBoard never performs a reload.
@trisolaran
Contributor Author

friendly pong

qiuminxu and others added 21 commits April 11, 2018 12:50
This change fixes a regression that might have been introduced by a TypeScript
upgrade a few months ago, where Facets wasn't able to compile, possibly due to
tsc.js no longer adding .js to ES6 imports. The solution was to set module
resolution to NODE in the Closure Compiler. The Polymer dep was upstreamed for
more completeness for Facets.

This change also ensures that the development web server, when we `bazel run` a
`tf_web_library` or `tensorboard_html_binary` rule, will display sources in the
correct order we had originally, before Bazel changed from set() to depset().
@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@trisolaran trisolaran closed this Apr 12, 2018