
Parse summary of evaluator logs into data model #1405

Merged
angelapwen merged 23 commits into github:main from angelapwen:eval-log-visualizer-parsing on Jul 12, 2022

Conversation

@angelapwen (Contributor) commented Jun 23, 2022

This PR is the first step towards adding a visualizer for evaluator log data. It includes the following logic:

  • running the command to generate log summaries in JSON format (for ease of parsing) after a query completes
  • parsing each interesting JSON object from the log summary into a data model (where "interesting" means the object contains all fields that our data model requires)
  • adding a data model encapsulating the fields that we expect to surface in our visualizer

Once this PR is merged, we will merge a follow-up PR displaying the data from the data model in a VS Code TreeView.

The PR is marked as draft due to the number of comments labeled REVIEW. I am assigning reviewers who I believe may have opinions about some of the items marked for review, especially which items folks looking at the visualizer may eventually want to see in the data model, and which items in the summary log we are or are not interested in (and therefore skip parsing).

Checklist

  • CHANGELOG.md has been updated to incorporate all user visible changes made by this pull request.
  • Issues have been created for any UI or other user-facing changes made by this pull request.
  • [Maintainers only] If this pull request makes user-facing changes that require documentation changes, open a corresponding docs pull request in the github/codeql repo and add the ready-for-doc-review label there.

const jsonObj = JSON.parse(obj);

// Only convert log items that have an RA and millis field
// REVIEW: Not sure this is exactly what we want. Are we now excluding too many items?
Contributor:

I think that's reasonable. The only predicates missing a millis field are ones where we don't have a meaningful evaluation time for them (e.g. because we skipped evaluating due to a sentinel being empty or a cache hit), which aren't likely to be ones a user debugging performance cares about. The only ones missing ra are extensionals (that is: predicates that are part of the actual database). Those do have a meaningful evaluation time (because we have to load them from disk and into the cache), but it's usually pretty small so again not likely to be something a user debugging performance cares about.

angelapwen (author):

Thanks for sharing your thoughts on this! Glad to hear this is on the right track. I have a feeling there may be other edge cases that I'm not seeing (and therefore might fail while parsing) that I'll catch when I make the tests more robust.
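For reference, a minimal sketch of the guard being discussed, with the rationale from the review above folded into comments; the variable names around the guard (visualizerItems in particular) are assumptions, not the PR's actual code:

const jsonObj = JSON.parse(obj);

// Keep only log items with both `ra` and `millis`:
// - items missing `millis` were never meaningfully evaluated
//   (empty sentinel, cache hit), and
// - items missing `ra` are extensionals (part of the database itself),
// so neither is interesting when debugging query performance.
if (jsonObj.ra !== undefined && jsonObj.millis !== undefined) {
  visualizerItems.push(jsonObj); // hypothetical accumulator
}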


// REVIEW: it seems clunky to have to run this command again just to get the same summary file
// in JSON format, but I'm not sure it's worth changing the way it works in the CLI for just
// this use case?
Contributor:

I think this is fine for now. I do think it's looking like we want a way to specify which types of output to send where in codeql generate log-summary. Then we could write a single invocation like codeql generate log-summary --output "end-table:some-file" --output "summary:some-other-file" --output "human-readable:yet-another-file". But having something like that would be a pretty big rewrite of the already quite complex codeql generate log-summary.

Contributor:

Agreed. Another option I just suggested verbally was to have an --output ide or similar, which produces all the different output formats the extension expects at this point, saved in separately named output files. But this can all wait for the future.

angelapwen (author):

Sounds good. Just filed https://github.com/github/codeql-core/issues/2480 to track for later and move discussion there!

// into another class later on?

// REVIEW: We may want to stream this later on; we are generating the human-readable
// file in memory above but the JSON formatted one should be larger?
Contributor:

I don't think the human-readable summary ever ends up in memory. Both can be pretty big, so I think we definitely want to stream this here.

Contributor:

We are using streaming json elsewhere, so you can do something similar here. See the sarif-parser module.

angelapwen (author):

After discussion with @adityasharad, we decided to implement streaming in a follow-up PR, given that this change will go in under a canary flag, and to mark this PR ready for review. I have documented this as a TODO in the code comments.
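For reference, a minimal sketch of what that follow-up streaming could look like, using only Node core APIs and assuming (per the delimiter discussion later in this thread) that objects in the summary are multi-line and separated by a blank line; streamJsonObjects and onObject are hypothetical names:

import * as fs from 'fs';
import * as readline from 'readline';

// Stream the JSON summary line by line instead of reading it all into memory.
async function streamJsonObjects(
  summaryPath: string,
  onObject: (obj: unknown) => void
): Promise<void> {
  const rl = readline.createInterface({ input: fs.createReadStream(summaryPath) });
  let buffer: string[] = [];
  for await (const line of rl) {
    if (line.trim() === '') {
      // Blank line: end of one JSON object.
      if (buffer.length > 0) {
        onObject(JSON.parse(buffer.join('\n')));
        buffer = [];
      }
    } else {
      buffer.push(line);
    }
  }
  // Flush the final object if the file does not end with a blank line.
  if (buffer.length > 0) {
    onObject(JSON.parse(buffer.join('\n')));
  }
}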

predicateName: jsonObj.predicateName,
timeInMillis: jsonObj.millis,
resultSize: jsonObj.resultSize
// TODO: need to also parse RA, pipeline arrays into the object.
Contributor:

Similar to the comment below, I think if we parse everything here and put it in memory we'll create very significant memory pressure. A better solution, I think, would be to record some barebones information that we need later in the tree view when it's first opened (presumably things like name and evaluation time), along with a position in the file (in bytes) where we can find the event. Then, when the user expands a given predicate, we can jump right back to the correct position and read the event to load that information.

angelapwen (author):

Thanks for this comment; I hadn't considered saving something like the position in the file in bytes. During 🍐 with @aeisenberg we discussed that this would be a good idea, but we will follow on from this PR (and the front-end visualizer PR) with that change, so that we can view simple/small logs in the visualizer first. The change is still under a canary flag 🐦 so we don't expect the memory pressure to be too pressing at the moment.
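A rough sketch of that byte-position idea, assuming start/end offsets are recorded while scanning the summary; all names here are hypothetical:

import * as fs from 'fs';

// Barebones info recorded up front, plus where the full event lives on disk.
interface SummaryIndexEntry {
  predicateName: string;
  millis: number;
  startOffset: number; // byte position of the event in the summary file
  endOffset: number;
}

// When the user expands a tree item, read just that event back lazily.
function readEvent(summaryPath: string, entry: SummaryIndexEntry): unknown {
  const buf = Buffer.alloc(entry.endOffset - entry.startOffset);
  const fd = fs.openSync(summaryPath, 'r');
  try {
    fs.readSync(fd, buf, 0, buf.length, entry.startOffset);
  } finally {
    fs.closeSync(fd);
  }
  return JSON.parse(buf.toString('utf8'));
}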


// REVIEW: Suggestions on other fields that are useful for a performance
// debugger would be welcome!
export interface EvaluatorLogData {
Contributor:

One suggestion here: it might make your life easier to have the field names exactly match the fields from the evaluator log. Then it's easy to update this file (we could even derive it in future from the JSON schema, but that's a future idea).

angelapwen (author):

That makes sense — I think I'd named them whatever was intuitive to me, but it'd be better to keep it consistent with the actual log for future maintenance. Deriving it from a JSON schema certainly seems like it'd make the parsing logic less brittle. I'll update it now to match the evaluator log fields.
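For illustration, a sketch of the model with field names mirroring the evaluator log keys that appear elsewhere in this PR; this is not the final shape:

// Field names deliberately match the evaluator log's own JSON keys,
// so the model stays easy to update against the log format.
export interface EvaluatorLogData {
  predicateName: string;
  millis: number;
  resultSize: number;
  // Keyed by pipeline name (e.g. "base", "standard"); see the `ra`
  // discussion further down this thread.
  ra: { [pipelineName: string]: string[] };
}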



import { parseVisualizerData } from '../../src/pure/log-summary-parser';

describe('Parsing Valid Summary Text', () => {
const validSummaryText = `
Contributor:

At the risk of complicating your test code by introducing file reading, I think it would be useful to keep the JSON test data in standalone files checked into the repo. Then each test case only needs to know the input file name and assertions on the expected output.

angelapwen (author):

I've made this change. I think it would still be good to make the tests more robust: right now I am testing the edge cases I have noticed and caught, but it may be good to test an entire summary file to make sure nothing in the JSON crashes the parser. That could be a pretty large file to check in, though, and I'm not entirely sure what I'd test for other than that it doesn't crash, given that it would be laborious to manually check how many objects I'd expect to be parsed out of it.

Maybe it would be good to just continuously add individual test cases that test different configurations of the JSON objects? 🤔
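A minimal sketch of the file-backed test shape suggested above; the directory layout is an assumption (though valid-summary.jsonl is the file name mentioned later in this thread), and the chai/mocha style matches the existing tests:

import { expect } from 'chai';
import * as fs from 'fs';
import * as path from 'path';
import { parseVisualizerData } from '../../src/pure/log-summary-parser';

describe('Parsing Valid Summary Text', () => {
  it('parses every object in a checked-in summary file', () => {
    // Assumption: test data lives beside the tests in a `data` directory.
    const validSummaryText = fs.readFileSync(
      path.join(__dirname, 'data', 'valid-summary.jsonl'),
      'utf8'
    );
    const items = parseVisualizerData(validSummaryText);
    // The exact count depends on the checked-in file.
    expect(items).to.not.be.empty;
  });
});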

Comment on lines 11 to 14
interface Pipeline {
// Key: pipeline identifier; Value: array of pipeline steps
pipeline: Map<string, string[]>;
}
angelapwen (author):

I have simplified the model behind Pipeline to more closely match what the log actually emits (a map where the key is the name of the pipeline and the value is an array of pipeline steps).

Previously I had PipelineStep as a separate interface that only held a body of strings, but that additional layer of indirection seemed unnecessary. Open to other thoughts here though!

Contributor:

This does not appear to match your test data, where ra is an object with a single field called pipeline, whose value is an array of strings.

angelapwen (author):

Ah, there are actually cases where ra contains many pipelines with different names, which is why I used this schema. I've added an example of this case to the valid-summary.jsonl file, but for convenience here is the relevant part:

"ra" : {
    "base" : [
      "    {2} r1 = namespaces AND NOT namespacembrs_1#antijoin_rhs(Lhs.0)",
      "    return r1"
    ],
    "standard" : [
      "    {2} r1 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs#1 ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
      "",
      "    {2} r2 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs ON FIRST 1 OUTPUT Rhs.1, (Lhs.1 ++ \"::\" ++ Rhs.2)",
      "",
      "    {2} r3 = r1 UNION r2",
      "    {2} r4 = r3 AND NOT QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev(Lhs.0, Lhs.1)",
      "    return r4"
    ],
    "order_500000" : [
      "    {2} r1 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs#1 ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
      "",
      "    {2} r2 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH namespacembrs ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
      "    {2} r3 = JOIN r2 WITH namespaces ON FIRST 1 OUTPUT Lhs.0, (Lhs.1 ++ \"::\" ++ Rhs.1)",
      "",
      "    {2} r4 = r1 UNION r3",
      "    {2} r5 = r4 AND NOT QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev(Lhs.0, Lhs.1)",
      "    return r5"
    ]
  },

From this I thought it would be most accurate for ra to be an array of pipeline objects, each with a key (the name of the pipeline, such as base) and a value (the array of strings).

Contributor:

I think a closer representation would be for ra to be a map/object itself, rather than an array. The keys of this map would be the (string) pipeline names, and the values of this map would be arrays of RA log strings for each pipeline.

This may also explain what you are seeing when trying to parse the JSON logs into Pipeline[].

angelapwen (author):

I see; for some reason I added an extra layer of arrays in my mental model. I have renamed the interface to Pipelines to indicate that the object contains all pipelines rather than being an array of Pipeline objects. This did indeed fix the Pipeline[] error. Thank you!
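The resulting shape, as a sketch grounded in the log excerpt above:

// All pipelines for one predicate: pipeline name -> array of RA step strings.
// Matches the "ra" object above ("base", "standard", "order_500000").
type Pipelines = { [pipelineName: string]: string[] };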

@angelapwen angelapwen marked this pull request as ready for review July 1, 2022 15:27
@angelapwen angelapwen requested a review from a team as a code owner July 1, 2022 15:27
@adityasharad (Contributor) left a comment:

I think this is shaping up well and you've correctly limited the scope for this PR, but I've confused myself about the type/structure of the expected data. Several questions and naming suggestions.

expect(item.ra).to.not.be.empty;
if (item.ra !== undefined) {
for (const pipeline of Object.entries(item.ra)) {
expect (pipeline).to.not.be.empty;
Contributor:

Object.entries returns 2-element arrays of the form [key, value], so each array here will always be non-empty and this test will always pass. Should this be something like expect item.ra.pipeline.to.not.be.empty?

angelapwen (author):

Ah, you're right. I think I actually meant something like

for (const pipeline of item.ra) {
  expect (pipeline).to.not.be.empty;
}

because I wanted to check that each pipeline in the array (item.ra) was non-empty. But when I use the above code I get TypeError: item.ra is not iterable 🤔 which I find odd, because it is of type Pipeline[].

angelapwen (author):

After some insight from @henrymercer on Slack, it seems that a likely culprit of the error is that the JSON parsing at runtime may not actually be parsing into the type Pipeline[]. I will do some 🔬
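Once ra parses as a plain object keyed by pipeline name, the check might look like this sketch (a fragment of the test above; chai's optional assertion message is used for clearer failures):

if (item.ra !== undefined) {
  for (const [name, steps] of Object.entries(item.ra)) {
    // Each pipeline's array of RA steps should be non-empty.
    expect(steps, `pipeline ${name}`).to.not.be.empty;
  }
}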


*/
export function parseVisualizerData(logSummary: string): EvaluatorLogData[] {
// Remove newline delimiters because summary is in .jsonl format.
const jsonSummaryObjects: string[] = logSummary.split(os.EOL + os.EOL);
Contributor:

Can you explain why we split on two EOLs? I see your test data is formatted to make each item multi-line, with two EOLs between items, but is that identical to what the CLI produces? The JSONL format requires only one EOL between lines.

angelapwen (author):

Yes, I have taken the test data from summary files that the CLI produces. A single EOL delimits each line within a multi-line object, so splitting on one EOL would split every single line in the file. Splitting on two EOLs splits on every object boundary instead. Let me know if that makes sense!
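A tiny worked example of that delimiter logic, using hypothetical minimal objects:

import * as os from 'os';

// Two multi-line objects separated by a blank line (two EOLs back to back).
const summary =
  '{' + os.EOL + '  "predicateName": "a"' + os.EOL + '}' + os.EOL +
  os.EOL +
  '{' + os.EOL + '  "predicateName": "b"' + os.EOL + '}';

const chunks = summary.split(os.EOL + os.EOL);
console.log(chunks.length); // 2 - one chunk per JSON object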

@aeisenberg (Contributor) left a comment:

This is looking good to me. I think you've already addressed Aditya's comments.

angelapwen and others added 2 commits July 12, 2022 13:45
Co-authored-by: Andrew Eisenberg <aeisenberg@github.com>
@angelapwen (author):

> This is looking good to me. I think you've already addressed Aditya's comments.

Thank you all for the reviews! Merging for now ✨

@angelapwen angelapwen merged commit c9d895e into github:main Jul 12, 2022
@angelapwen angelapwen deleted the eval-log-visualizer-parsing branch July 12, 2022 12:05
4 participants