The output of Chimbuko’s AD module is serialized in JSON format. This includes the streaming output sent between the parameter server and the visualization module, and the contents of the provenance database. In this section we provide the schema for these data.

Provenance Database Schema

Main database

Below we describe the JSON schema for the anomalies, normalexecs and metadata collections of the main database component of the provenance database.

Function event schema

This section describes the JSON schema for the anomalies and normalexecs collections. The fields of the JSON object are given in quotes, and a brief description follows the colon (:).

Function execution “events” in Chimbuko are labeled by a string, unique within each process, of the following form: “$RANK:$IO_STEP:$IDX” (e.g. “0:12:225”), where RANK, IO_STEP and IDX are the MPI rank, the IO step and an integer index, respectively, and $VAL indicates the numerical value of the variable VAL. We refer to such a string as an “event label” below.

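As a concrete illustration, the short Python sketch below splits an event label into its components; it assumes only the label format described above.

def parse_event_label(label):
    """Split a Chimbuko event label of the form "$RANK:$IO_STEP:$IDX" into integers."""
    rank, io_step, idx = (int(v) for v in label.split(":"))
    return rank, io_step, idx

# Example from the text: MPI rank 0, IO step 12, event index 225
print(parse_event_label("0:12:225"))   # -> (0, 12, 225)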

{ start of schema
“__id”: Record index assigned by Sonata,
“version”: The schema version,
“call_stack”: Function execution call stack (starting with anomalous function execution),
[
{
“entry”: timestamp of function execution entry,
“exit”: timestamp of function execution exit (0 if has not exited at time of write),
“fid”: Global function index (can be used as a key instead of function name),
“func”: function name,
“event_id”: Event label (see above),
“is_anomaly”: True/false depending on whether event is anomalous (applies only to executions that have exited by time of write)
},
….
],
“counter_events”: An array of counter data received on the specific process/thread during the function execution (array)
[
{
“counter_idx”: An index used internally to index counters,
“counter_name”: A string describing the counter,
“counter_value”: The value of the counter (integer),
“pid”: process index,
“rid”: process rank,
“tid”: process thread,
“ts”: timestamp
},
],
“entry”: Timestamp of function execution entry,
“exit”: Timestamp of function execution exit,
“event_id”: Event label (see above),
“fid”: Global function index (can be used as a key instead of function name),
“func”: function name,
“algo_params”: The parameters used by the outlier detection algorithm to classify this event (format is algorithm dependent, see below),
“is_gpu_event”: true or false depending on whether the function executed on a GPU,
“gpu_location”: if a GPU event, a JSON description of the context (see below), otherwise null,
“gpu_parent”: if a GPU event, a JSON description of the parent CPU function (see below), otherwise null,
“pid”: process index,
“rid”: process rank,
“tid”: thread index,
“hostname”: The hostname of the node on which the application was executing,
“runtime_exclusive”: Function execution time exclusive of children,
“runtime_total”: Function total execution time,
“io_step”: IO step index,
“io_step_tstart”: Time of start of IO step,
“io_step_tend”: Time of end of IO step,
“outlier_score”: The anomaly score of the execution reflecting how unlikely the event is (algorithm dependent, larger is more anomalous),
“outlier_severity”: The severity of the anomaly, reflecting how important the anomaly was,
“event_window”: Capture of the function executions and MPI communication events occurring in a window around the anomaly on the same thread (object)
{
“exec_window”: The function executions in the window arranged in order of their entry time (array)
[
{
“entry”: timestamp of function execution entry,
“exit”: timestamp of function execution exit (0 if has not exited at time of write),
“fid”: Global function index (can be used as a key instead of function name),
“func”: function name,
“event_id”: Event label (see above),
“parent_event_id”: Event label of parent function execution,
“is_anomaly”: True/false depending on whether event is anomalous (applies only to executions that have exited by time of write)
},
],
“comm_window”: The MPI communications in the window (array)
[
{
“type”: SEND or RECV,
“pid”: process index,
“rid”: rank of current process,
“tid”: thread index,
“src”: message origin rank,
“tar”: message target rank,
“bytes”: message size,
“tag”: an integer tag associated with the message,
“timestamp”: time at which the MPI function executed,
“execdata_key”: the event label (see above) of the parent function execution
},
]
} end of event_window object
“node_state”: The state of the node provided by TAU’s monitoring plugin. This is null if no state information is available. (object)
{
“data”: A list of fields and values (list)
[
{
“field”: The field name, e.g. “Memory Available (MB)”
“value”: The value
},
],
“timestamp”: The timestamp of the most recent state update
}
} end of schema
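To make the schema concrete, the hedged Python sketch below shows how a single record conforming to it might be inspected once loaded as a dictionary (for example from a JSON dump of the anomalies collection; the file name is an assumption). All field names are taken from the schema above.

import json

# Assumption: "anomalies_dump.json" holds a JSON array of records with the schema above
with open("anomalies_dump.json") as f:
    record = json.load(f)[0]

print(record["func"], record["event_id"],
      "score:", record["outlier_score"], "severity:", record["outlier_severity"])

# Walk the call stack, starting with the anomalous execution itself
for frame in record["call_stack"]:
    flag = "anomalous" if frame["is_anomaly"] else "normal"
    print("  ", frame["func"], flag, "entry:", frame["entry"], "exit:", frame["exit"])

# GPU context, if the execution ran on a GPU
if record["is_gpu_event"]:
    print("GPU location:", record["gpu_location"])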

For the SSTD (original) algorithm, the algo_params field has the following format:

{
“accumulate”: not used at present,
“count”: number of times function encountered (global),
“kurtosis”: kurtosis of distribution,
“maximum”: maximum of distribution,
“mean”: mean of distribution,
“minimum”: minimum of distribution,
“skewness”: skewness of distribution,
“stddev”: standard deviation of distribution
}

For the HBOS and COPOD algorithms, the algo_params field has the following format:

{
“histogram”: the histogram,
{
“Histogram Bin Counts”: the heights of the histogram bins (array),
“Histogram Bin Edges”: the edges of the bins, starting with the lower edge of bin 0 followed by the upper edges of bins 0..N (array)
},
“internal_global_threshold”: a score threshold for anomaly detection used internally
}
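For illustration, the sketch below looks up which histogram bin contains a given runtime, using the two fields named above. The precise bin-edge convention (which side of each bin is inclusive) is an assumption here, as is the variable holding the algo_params object.

import bisect

def histogram_bin(algo_params, value):
    """Return the index of the histogram bin containing 'value', or None if outside the range."""
    hist = algo_params["histogram"]
    edges = hist["Histogram Bin Edges"]       # lower edge of bin 0, then upper edges of bins 0..N
    counts = hist["Histogram Bin Counts"]
    if value < edges[0] or value > edges[-1]:
        return None
    # Assumption: bins are half-open [lower, upper), with the final bin closed above
    b = bisect.bisect_right(edges, value) - 1
    return min(b, len(counts) - 1)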

The schema for the gpu_location field is as follows:


{
“context”: GPU device context (NVIDIA terminology),
“device”: GPU device index,
“stream”: GPU device stream (NVIDIA terminology),
“thread”: virtual thread index assigned to this context/device/stream by TAU
}

and for the gpu_parent field:


{
“event_id”: The event label (see above) of the parent function execution,
“tid”: Thread index for CPU parent function,
“call_stack”: Parent function call stack (starting with parent function execution),
[
{
“entry”: timestamp of function execution entry,
“exit”: timestamp of function execution exit (0 if has not exited at time of write),
“fid”: Global function index (can be used as a key instead of function name),
“func”: function name,
“event_id”: The event label
},
….
]
}

Note that TAU treats a GPU device/context/stream in much the same way as a CPU thread, and assigns it a unique index. This index is the “thread index” for GPU events.

Metadata schema

Metadata are stored in the metadata collection in the following JSON schema:


{
“descr”: String description (key) of the metadata entry,
“pid”: Program index from which the metadata originated,
“rid”: Process rank from which the metadata originated,
“tid”: Process thread associated with the metadata,
“value”: Value of the metadata entry,
“__id”: Record index assigned by Sonata
}

Note that the tid (thread index) for metadata is usually 0, except for metadata associated with a GPU context/device/stream, for which the index is the virtual thread index assigned by TAU to that context/device/stream.
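As a small example, metadata records loaded from this collection (as a list of dictionaries) could be grouped by their description key, as in the sketch below; the function and variable names are illustrative only.

from collections import defaultdict

def metadata_by_descr(metadata_records):
    """Group metadata entries by their "descr" key, keeping (rid, tid, value) for each."""
    grouped = defaultdict(list)
    for m in metadata_records:
        grouped[m["descr"]].append((m["rid"], m["tid"], m["value"]))
    return grouped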

Global database

Below we describe the JSON schema for the func_stats, counter_stats and ad_model collections of the global database component of the provenance database.

A common data structure RunStats is used extensively to represent statistics (mean, min/max, std. dev., etc) of some quantity. It has the following schema:

{
‘accumulate’: The sum of all values (same as mean * count). In some cases this entry is not populated,
‘count’: The number of values,
‘kurtosis’: kurtosis of the distribution of values,
‘maximum’: maximum value,
‘mean’: average value,
‘minimum’: minimum value,
‘skewness’: skewness of distribution of values,
‘stddev’: standard deviation of distribution of values
}
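To make the field definitions concrete, the sketch below computes the same quantities from a plain list of values. It is only illustrative: Chimbuko accumulates these statistics in a streaming fashion, and the exact skewness/kurtosis conventions (e.g. whether kurtosis is reported as excess kurtosis) are assumptions here.

import math

def run_stats(values):
    """Compute RunStats-style summary statistics for a list of numbers (illustrative only)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n     # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "accumulate": sum(values),                    # equals mean * count
        "count": n,
        "mean": mean,
        "minimum": min(values),
        "maximum": max(values),
        "stddev": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,   # excess kurtosis (assumption)
    }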

Function profile statistics schema

func_stats contains aggregated profile information and anomaly information for all functions. The JSON schema is as follows:

{
“__id”: record index,
“app”: application/program index,
“fid”: function index,
“fname”: function name,
“anomaly_metrics”: statistics on anomalies for this function (object). Note this entry is null if no anomalies were detected
{
“anomaly_count”: statistics on the anomaly count for time steps in which anomalies were detected, as well as the total number of anomalies (RunStats)
“first_io_step”: the first IO step in which an anomaly was detected,
“last_io_step”: the last IO step in which an anomaly was detected,
“max_timestamp”: the last anomaly’s timestamp,
“min_timestamp”: the first anomaly’s timestamp,
“score”: statistics on the scores for the anomalies (RunStats),
“severity”: statistics on the severity of the anomalies (RunStats),
},
“runtime_profile”: statistics on function runtime (i.e. the function profile) (object)
{
“exclusive_runtime”: statistics on the runtime excluding child function calls (RunStats),
“inclusive_runtime”: statistics on the runtime including child function calls (RunStats)
}
}
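As an example of consuming this collection, the sketch below ranks functions by their total anomaly count. It assumes the records are available as a list of dictionaries and that the ‘accumulate’ field of anomaly_count holds the total number of anomalies (it equals mean * count); both the variable names and that interpretation are assumptions.

def total_anomalies(func_stats_record):
    """Total anomaly count for one func_stats record (0 if no anomalies were detected)."""
    metrics = func_stats_record["anomaly_metrics"]
    if metrics is None:
        return 0
    return metrics["anomaly_count"]["accumulate"]

# Hypothetical usage: 'records' is the func_stats collection loaded as a list of dicts
# worst = sorted(records, key=total_anomalies, reverse=True)[:10]
# for r in worst:
#     print(r["fname"], total_anomalies(r), r["runtime_profile"]["exclusive_runtime"]["mean"])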

Counter statistics schema

The counter_stats collection has the following schema:

{
‘app’: Program index,
‘counter’: Counter description,
‘stats’: Global aggregated statistics on counter values since start of run (RunStats)
}

AD model schema

The ad_model collection contains the final AD model for each function. It has the following schema:

{
“__id”: A unique record index,
“pid”: The program index,
“fid”: The function index,
“func_name”: The function name,
“model” : The model for this function, form dependent on algorithm used (object)
}

The “model” entry has the same form as the “algo_params” entry of the main database, and is documented above.

Parameter Server Streaming Output

Every IO frame, the AD instances send three pieces of information to the parameter server (pserver):

  1. For every function executed in the IO frame, the inclusive and exclusive runtimes and the number of anomalies for that function. These are aggregated over all IO frames and ranks on the parameter server and represent the function profile.

  2. The total number of anomalies detected in the IO frame.

  3. Statistics on the values of each counter over the IO step (e.g. for a memory usage counter this would be the mean, std. dev., etc. of the memory usage over the IO frame). These are aggregated over all IO frames and ranks on the parameter server.

The parameter server optionally sends data to an external webserver as JSON-formatted packets via HTTP (using libcurl) at a fixed frequency, independent of the frequency of IO steps in the trace data collection. This communication is handled by the PSstatSender class. The data packet is a JSON object comprising three payloads: anomaly_stats, anomaly_metrics and counter_stats. Note that counters are integer-valued quantities; they are typically hardware counters but also include information such as MPI communication packet sizes.
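For development and testing, a minimal HTTP receiver can be stood up to inspect these packets. The sketch below uses Flask and is purely illustrative: the endpoint path and port are assumptions chosen for the example, not the interface of Chimbuko's actual visualization server.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/post", methods=["POST"])          # endpoint path is an assumption
def receive_packet():
    packet = request.get_json(force=True)
    # The optional payloads may be absent (see the schema notes below)
    ranks_reporting = len(packet.get("anomaly_stats", {}).get("anomaly", []))
    n_counters = len(packet.get("counter_stats", []))
    print("created_at:", packet.get("created_at"),
          "ranks reporting:", ranks_reporting, "counters:", n_counters)
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=5000)                          # port is an assumption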

As in the provenance database section above, the RunStats data structure is used extensively in these packets to represent statistics (mean, min/max, std. dev., etc.) of some quantity; its schema is given under “Global database” above.

The full parameter server data packet JSON object has the following schema:


{
‘version’: The schema version,
‘created_at’: UNIX timestamp given in milliseconds relative to epoch,
‘anomaly_stats’: Statistics of anomalies (object with schema given below). This field will not appear if no data has been received from the AD instances since the last send,
‘anomaly_metrics’: Statistics of anomaly metrics by pid/rid/fid (array of objects with schema below),
‘counter_stats’: Statistics of counter values aggregated over all ranks (array). This field will not appear if no counters were ever collected
[
{
‘app’: Program index,
‘counter’: Counter description,
‘stats’: Global aggregated statistics on counter values since start of run (RunStats)
},
]
}

Note that the anomaly_stats entry will only be present if data has been received from the AD instances since the last send, and the counter_stats array will only appear if counters have ever been collected.
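Continuing the illustrative receiver sketched earlier, the optional payloads can be handled defensively when a packet is processed. The helper below is hypothetical and uses only field names from the schemas in this section (including the fact, noted below, that the per-rank ‘stats’ RunStats has the total anomaly count in its ‘accumulate’ field).

def summarize_packet(packet):
    """Print a brief summary of one pserver packet, tolerating absent optional payloads."""
    anomaly_stats = packet.get("anomaly_stats")        # absent if no AD data since last send
    if anomaly_stats is not None:
        total = sum(e["stats"]["accumulate"] for e in anomaly_stats["anomaly"])
        print("total anomalies to date (all ranks):", total)
    for entry in packet.get("counter_stats", []):      # absent if no counters were collected
        s = entry["stats"]
        print(entry["counter"], "mean:", s["mean"], "max:", s["maximum"], "count:", s["count"])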

The schema for the ‘anomaly_stats’ object is as follows:


{
‘created_at’: UNIX timestamp given in milliseconds relative to epoch,
‘anomaly’: Statistics on anomalies by process/rank (array)
[
{
‘data’: Number of anomalies and anomaly time window for process/rank broken down by io step (array)
[
{
‘app’: Program index,
‘max_timestamp’: Latest time of anomaly in io step,
‘min_timestamp’: Earliest time of anomaly in io step,
‘n_anomalies’: Number of anomalies in io step,
‘rank’: Process rank,
‘step’: io step index,
‘outlier_scores’: Statistics on the outlier scores for the outliers collected in this step (RunStats),
},
],
‘key’: A string label of the form “$PROGRAM ID:$RANK” (e.g. “0:12”),
‘stats’: Statistics on anomalies on this process/rank over all steps to date (RunStats). Note the ‘accumulate’ field represents the total number of anomalies, and ‘count’ the number of IO steps to-date,
},
], end of anomaly array
‘func’: Statistics on anomalies broken down by function, collected over entire run to-date (array)
[
{
‘app’: program index,
‘fid’: global function index,
‘name’: function name,
‘exclusive’: Statistics of runtime exclusive of children (RunStats),
‘inclusive’: Statistics of runtime inclusive of children (RunStats),
‘stats’: Statistics on the number of anomalies for this function per timestep observed in run to-date (RunStats)
},
], end of func array
}
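As an example of consuming the ‘func’ array above, the sketch below ranks functions by their mean number of anomalies per timestep; the function name and usage are illustrative only.

def functions_by_anomaly_rate(anomaly_stats):
    """Rank functions by mean anomalies per timestep, using the 'func' array above."""
    return sorted(anomaly_stats["func"],
                  key=lambda f: f["stats"]["mean"],
                  reverse=True)

# Hypothetical usage:
# for f in functions_by_anomaly_rate(anomaly_stats)[:5]:
#     print(f["name"], f["stats"]["mean"], f["exclusive"]["mean"])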

The ‘anomaly_metrics’ structure contains statistics on anomalies (count, score, severity) broken down over rank, function and program. The schema is as follows:


{
‘app’: Application/program index,
‘rank’: Process rank,
‘fid’: function ID,
‘fname’: function name,
‘_id’: a global index to track each (app, rank, func), for internal use,
‘new_data’: Statistics of anomaly metrics aggregated over multiple IO steps since the last pserver->viz send
{
‘first_io_step’: first IO step included in the aggregation,
‘last_io_step’: last IO step included in the aggregation,
‘max_timestamp’: max timestamp of the last IO step of this period,
‘min_timestamp’: min timestamp of the first IO step of this period,
‘severity’: Statistics on the anomaly severity (RunStats),
‘score’: Statistics on the anomaly score (RunStats),
‘count’: Statistics on the anomaly count per IO step (RunStats)
}
‘all_data’: Statistics of anomaly metrics aggregated since the beginning of the run
{
‘first_io_step’: first IO step included in the aggregation,
‘last_io_step’: last IO step included in the aggregation,
‘max_timestamp’: max timestamp of the last IO step since the start of the run,
‘min_timestamp’: min timestamp of the first IO step since the start of the run,
‘severity’: Statistics on the anomaly severity (RunStats),
‘score’: Statistics on the anomaly score (RunStats),
‘count’: Statistics on the anomaly count per IO step (RunStats)
}
}
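As a final example, an approximate total anomaly count per (app, rank, function) can be derived from the ‘all_data’ entry above. The sketch below assumes the per-IO-step ‘count’ RunStats yields the total as mean * count; the variable names are illustrative only.

def total_anomalies_to_date(metric):
    """Approximate total anomalies since the start of the run for one anomaly_metrics entry."""
    c = metric["all_data"]["count"]
    return c["mean"] * c["count"]       # assumption: equivalent to the 'accumulate' field

# Hypothetical usage: 'metrics' is the anomaly_metrics array from a packet
# for m in sorted(metrics, key=total_anomalies_to_date, reverse=True)[:5]:
#     print(m["app"], m["rank"], m["fname"], total_anomalies_to_date(m))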