The output of Chimbuko’s AD module is serialized in JSON format. This includes the streaming output sent between the parameter server and the visualization module, and the contents of the provenance database. In this section we provide the schema for these data.
Provenance Database Schema
Main database
Below we describe the JSON schema for the anomalies, normalexecs and metadata collections of the main database component of the provenance database.
Function event schema
This section describes the JSON schema for the anomalies and normalexecs collections. The fields of the JSON object are bolded, and a brief description follows the colon (:).
Function execution “events” in Chimbuko are labeled by a unique (for each process) string of the following form: “$RANK:$IO_STEP:$IDX” (e.g. “0:12:225”), where RANK, IO_STEP and IDX are the MPI rank, the IO step and an integer index, respectively, and $VAL indicates the numerical value of the variable VAL. We will refer to such a string as an “event label” below.
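As an illustration of the event label format, the following minimal Python sketch splits a label into its three integer components and rebuilds it. The names EventLabel, parse_event_label and format_event_label are hypothetical helpers introduced for this example and are not part of Chimbuko.

```python
from typing import NamedTuple


class EventLabel(NamedTuple):
    """Hypothetical container for the three parts of an event label."""
    rank: int      # MPI rank
    io_step: int   # IO step
    idx: int       # integer index


def parse_event_label(label: str) -> EventLabel:
    """Split a "$RANK:$IO_STEP:$IDX" string into its integer components."""
    parts = label.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected RANK:IO_STEP:IDX, got {label!r}")
    rank, io_step, idx = (int(p) for p in parts)
    return EventLabel(rank, io_step, idx)


def format_event_label(event: EventLabel) -> str:
    """Rebuild the label string from its components."""
    return f"{event.rank}:{event.io_step}:{event.idx}"


label = parse_event_label("0:12:225")
print(label.rank, label.io_step, label.idx)
print(format_event_label(label))
```

A tuple-like type is used here so that labels can be compared and sorted by rank, then IO step, then index.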
For the SSTD (original) algorithm, the algo_params field has the following format:
For the HBOS and COPOD algorithms, the algo_params field has the following format:
The schema for the gpu_location field is as follows:
and for the gpu_parent field:
Note that Tau considers a GPU device/context/stream much in the same way as a CPU thread, and assigns it a unique index. This index is the “thread index” for GPU events.
Metadata schema
Metadata are stored in the metadata collection in the following JSON schema:
Note that the tid (thread index) for metadata is usually 0, except for metadata associated with a GPU context/device/stream, for which the index is the virtual thread index assigned by Tau to the context/device/stream.
Global database
Below we describe the JSON schema for the func_stats, counter_stats and ad_model collections of the global database component of the provenance database.
A common data structure RunStats is used extensively to represent statistics (mean, min/max, std. dev., etc) of some quantity. It has the following schema:
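To make the kind of quantities involved concrete, the sketch below accumulates a count, mean, minimum, maximum and standard deviation over a stream of values. It is an illustration only: the class name RunningStats, its field names and the choice of Welford's one-pass update are assumptions made for this example, and they should not be read as the RunStats schema or as Chimbuko's implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical one-pass accumulator (field names are assumptions)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0                 # sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def push(self, x: float) -> None:
        """Fold one new value into the statistics (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def stddev(self) -> float:
        """Sample standard deviation; 0.0 if fewer than two values."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


stats = RunningStats()
for value in (1.0, 2.0, 3.0, 4.0):
    stats.push(value)
print(stats.count, stats.mean, stats.minimum, stats.maximum, stats.stddev())
```

A one-pass update is chosen here because it does not require storing the individual values.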
Function profile statistics schema
func_stats contains aggregated profile information and anomaly information for all functions. The JSON schema is as follows:
Counter statistics schema
The counter_stats collection has the following schema:
AD model schema
The ad_model collection contains the final AD model for each function. It has the following schema:
The “model” entry has the same form as the “algo_params” entry of the main database, and is documented above.
Parameter Server Streaming Output
Every IO frame, the AD instances send three pieces of information to the pserver:
For every function execution in the IO frame, the inclusive and exclusive runtime and the number of anomalies for this function. These are aggregated over all IO frames and ranks on the parameter server and represent the function profile.
The total number of anomalies detected in the IO frame.
Statistics on the values of each counter over the IO step (e.g. for a memory usage counter this would be the mean, std. dev., etc. of the memory usage over the IO frame). These are aggregated over all IO frames and ranks on the parameter server.
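One way to picture the aggregation over IO frames and ranks is as a merge of partial summaries. The sketch below combines two (count, mean, sum of squared deviations) summaries using the standard pairwise formula for combining variances. The Summary type and merge function are hypothetical, and whether the parameter server aggregates in exactly this way is an assumption of the example.

```python
from functools import reduce
from typing import Tuple

# Hypothetical partial summary: (count, mean, sum of squared deviations)
Summary = Tuple[int, float, float]


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries without revisiting the raw values."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


# Summaries as they might arrive from two ranks for the same counter:
# rank 0 observed the values [1, 2], rank 1 observed [3, 4].
per_rank = [(2, 1.5, 0.5), (2, 3.5, 0.5)]
total = reduce(merge, per_rank, (0, 0.0, 0.0))
print(total)
```

Because the merge is associative, summaries can be folded in whatever order they arrive.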
The parameter server optionally sends data to an external webserver as JSON-formatted packets via HTTP using libcurl at some fixed frequency (independent of the frequency of IO steps in the trace data collection). This communication is handled by the PSstatSender class. The data packet is a JSON object comprising three payloads: anomaly_stats, anomaly_metrics and counter_stats. Note that counters are integer-valued quantities that are typically hardware counters but also include information on MPI communications packet sizes, etc.
A common data structure RunStats is used extensively to represent statistics (mean, min/max, std. dev., etc) of some quantity. It has the following schema:
The full parameter server data packet JSON object has the following schema:
Note that the anomaly_stats entry will only be present if data has been received from the AD instances since the last send, and the counter_stats array will only appear if counters have ever been collected.
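Because some payloads may be absent, code that consumes the packet should not index the top-level keys unconditionally. The following minimal Python sketch shows one defensive way to inspect a packet. Only the three top-level key names are taken from the text above. The function summarize_packet is hypothetical, the inner structure of each payload is deliberately not assumed, and anomaly_metrics is treated as optional purely as a precaution.

```python
import json
from typing import Any, Dict


def summarize_packet(raw: str) -> Dict[str, Any]:
    """Report which top-level payloads a pserver packet carries."""
    packet = json.loads(raw)
    if not isinstance(packet, dict):
        raise ValueError("expected a JSON object at the top level")
    return {
        # Present only if AD data arrived since the last send.
        "has_anomaly_stats": "anomaly_stats" in packet,
        # Treated as optional here as a precaution (an assumption).
        "has_anomaly_metrics": "anomaly_metrics" in packet,
        # The array appears only if counters have ever been collected.
        "n_counter_stats": len(packet.get("counter_stats", [])),
    }


# Placeholder packet: the empty objects stand in for real payload entries.
example = '{"counter_stats": [{}, {}]}'
print(summarize_packet(example))
```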
The schema for the ‘anomaly_stats’ object is as follows:
The ‘anomaly_metrics’ structure contains statistics on anomalies (count, score, severity) broken down over rank, function and program. The schema is as follows: