Parse summary of evaluator logs into data model #1405
angelapwen merged 23 commits into github:main from
Conversation
const jsonObj = JSON.parse(obj);

// Only convert log items that have an RA and millis field
// REVIEW: Not sure this is exactly what we want. Are we now excluding too many items?
I think that's reasonable. The only predicates missing a millis field are ones where we don't have a meaningful evaluation time for them (e.g. because we skipped evaluating due to a sentinel being empty or a cache hit), which aren't likely to be ones a user debugging performance cares about. The only ones missing ra are extensionals (that is: predicates that are part of the actual database). Those do have a meaningful evaluation time (because we have to load them from disk and into the cache), but it's usually pretty small so again not likely to be something a user debugging performance cares about.
Thanks for sharing your thoughts on this! Glad to hear this is on the right track. I have a feeling there may be other edge cases that I'm not seeing (and therefore might fail while parsing) that I'll catch when I make the tests more robust.
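The filtering rule discussed above can be sketched as a small predicate. This is a hypothetical helper, not the PR's actual code; the field names (predicateName, millis, ra) are taken from the log events quoted later in this thread:

```typescript
// Hypothetical sketch of the filter discussed above; the function and
// interface names are assumptions, field names match the quoted log events.
interface SummaryEvent {
  predicateName?: string;
  millis?: number;
  ra?: { [pipelineName: string]: string[] };
}

// Skip events without `millis` (e.g. sentinel-empty or cache-hit predicates)
// and events without `ra` (extensionals loaded from the database), since
// neither is usually interesting when debugging performance.
function isInterestingEvent(event: SummaryEvent): boolean {
  return event.millis !== undefined && event.ra !== undefined;
}
```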
// REVIEW: it seems clunky to have to run this command again just to get the same summary file
// in JSON format, but I'm not sure it's worth changing the way it works in the CLI for just
// this use case?
I think this is fine for now. I do think it's looking like we want a way to specify which types of output to send where in codeql generate log-summary. Then we could write a single invocation like codeql generate log-summary --output "end-table:some-file" --output "summary:some-other-file" --output "human-readable:yet-another-file". But having something like that would be a pretty big rewrite of the already quite complex codeql generate log-summary.
Agreed. Another option I just suggested verbally was to have an --output ide or similar, which produces all the different output formats the extension expects at this point, saved in separately named output files. But this can all wait for the future.
Sounds good. Just filed https://github.com/github/codeql-core/issues/2480 to track for later and move discussion there!
// into another class later on?

// REVIEW: We may want to stream this later on; we are generating the human-readable
// file in memory above but the JSON formatted one should be larger?
I don't think the human-readable summary ever ends up in memory. Both can be pretty big, so I think we definitely want to stream this here.
We are using streaming json elsewhere, so you can do something similar here. See the sarif-parser module.
After discussion with @adityasharad, decided to stream in a follow-up PR given this will go in under canary flag and make this PR ready for review. I have documented this as a TODO in the code comments.
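For the streaming follow-up, one minimal way to sketch the idea (this is an assumption about the eventual approach, not the sarif-parser implementation): feed the .jsonl summary in chunks and emit a parsed object whenever a blank-line delimiter completes a record, so the whole file never needs to be held in memory at once.

```typescript
// Sketch of incremental record parsing (class name is hypothetical).
class JsonlRecordParser {
  private buffer = '';

  // Feed one chunk of text; returns the objects completed by this chunk.
  push(chunk: string, eol = '\n'): any[] {
    this.buffer += chunk;
    const records = this.buffer.split(eol + eol);
    // The last piece may be an incomplete record; keep it for the next chunk.
    this.buffer = records.pop() ?? '';
    return records.filter(r => r.trim() !== '').map(r => JSON.parse(r));
  }

  // Flush any trailing record once the input ends.
  end(): any[] {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest === '' ? [] : [JSON.parse(rest)];
  }
}
```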
predicateName: jsonObj.predicateName,
timeInMillis: jsonObj.millis,
resultSize: jsonObj.resultSize
// TODO: need to also parse RA, pipeline arrays into the object.
Similarly to the below comment, I think if we parse everything here and put it in memory we'll be creating very significant memory pressure. A better solution, I think, would be to record some barebones information that we need later on in the tree view when it's first opened (presumably things like name and evaluation time), along with a position in the file (in bytes) where we can find the event. Then, when the user expands a given predicate, we can jump right back to the correct position and read the event to load that information.
Thanks for this comment, I hadn't considered saving something like the position in the file in bytes. During 🍐 with @aeisenberg we discussed that this would be a good idea but will follow up on this PR (and the front-end visualizer PR) with this change, so that we can view simple/small logs in the visualizer first. The change is still under a canary flag 🐦 so we don't expect the memory pressure to be too pressing at the moment.
// REVIEW: Suggestions on other fields that are useful for a performance
// debugger would be welcome!
export interface EvaluatorLogData {
One suggestion here: it might make your life easier to have the field names exactly match the fields from the evaluator log. Then it's easy to update this file (we could even derive it in future from the JSON schema, but that's a future idea).
That makes sense — I think I'd named them whatever was intuitive to me, but it'd be better to keep it consistent with the actual log for future maintenance. Deriving it from a JSON schema certainly seems like it'd make the parsing logic less brittle. I'll update it now to match the evaluator log fields.
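With names matched to the log, the interface and its conversion become almost mechanical. A sketch under the assumption that the relevant event fields are predicateName, millis, resultSize, and ra (as in the excerpts quoted in this PR):

```typescript
// Field names copied verbatim from the evaluator log events, so the
// interface stays easy to keep in sync with the log schema.
interface EvaluatorLogData {
  predicateName: string;
  millis: number;
  resultSize: number;
  ra: { [pipelineName: string]: string[] };
}

// With matching names, conversion is a one-to-one destructuring.
function fromLogEvent(jsonObj: any): EvaluatorLogData {
  const { predicateName, millis, resultSize, ra } = jsonObj;
  return { predicateName, millis, resultSize, ra };
}
```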
import { parseVisualizerData } from '../../src/pure/log-summary-parser';

describe('Parsing Valid Summary Text', () => {
  const validSummaryText = `
At the risk of complicating your test code by introducing file reading, I think it would be useful to keep the JSON test data in standalone files checked into the repo. Then each test case only needs to know the input file name and assertions on the expected output.
I've made this change. I think it would still be good to make the tests more robust (right now I am testing for the edge cases that I have paid attention to and caught, but it may be good to test an entire full summary file to make sure nothing in the JSON file crashes). This may be a pretty large file to check in though, and I'm not entirely sure what I'd test for other than that it doesn't crash — given that it would be kind of laborious to manually check how many objects I'd expect to be parsed out of it.
Maybe it would be good to just continuously add individual test cases that test different configurations of the JSON objects? 🤔
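A file-based fixture helper for this could be as small as the sketch below (the function name and the blank-line record delimiter are assumptions based on the summary format discussed in this PR); each test case then only needs a fixture path and its assertions.

```typescript
import { readFileSync } from 'fs';

// Load a checked-in .jsonl fixture and parse its blank-line-delimited records.
function loadSummaryFixture(fixturePath: string): any[] {
  const text = readFileSync(fixturePath, 'utf8');
  return text
    .split(/\r?\n\r?\n/)                    // records are separated by a blank line
    .filter(record => record.trim() !== '')
    .map(record => JSON.parse(record));
}
```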
Co-authored-by: Aditya Sharad <6874315+adityasharad@users.noreply.github.com>
interface Pipeline {
  // Key: pipeline identifier; Value: array of pipeline steps
  pipeline: Map<string, string[]>;
}
I have simplified the model behind Pipeline to match what the log actually emits more closely (a map where the key is the name of the pipeline and the value is an array of pipeline steps).
Previously I had PipelineStep as a separate interface that only held a body of strings, but that additional layer of indirection seemed unnecessary. Open to other thoughts here though!
This does not appear to match your test data, where ra is an object with a single field called pipeline, whose value is an array of strings.
Ah, there are actually cases where ra contains many pipelines with different names, which is why I used this schema. I've added an example of this case to the valid-summary.jsonl file but for convenience here is the related part:
"ra" : {
"base" : [
" {2} r1 = namespaces AND NOT namespacembrs_1#antijoin_rhs(Lhs.0)",
" return r1"
],
"standard" : [
" {2} r1 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs#1 ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
"",
" {2} r2 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs ON FIRST 1 OUTPUT Rhs.1, (Lhs.1 ++ \"::\" ++ Rhs.2)",
"",
" {2} r3 = r1 UNION r2",
" {2} r4 = r3 AND NOT QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev(Lhs.0, Lhs.1)",
" return r4"
],
"order_500000" : [
" {2} r1 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs#1 ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
"",
" {2} r2 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH namespacembrs ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
" {2} r3 = JOIN r2 WITH namespaces ON FIRST 1 OUTPUT Lhs.0, (Lhs.1 ++ \"::\" ++ Rhs.1)",
"",
" {2} r4 = r1 UNION r3",
" {2} r5 = r4 AND NOT QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev(Lhs.0, Lhs.1)",
" return r5"
]
},
From this I thought it would be most accurate for ra to be an array of pipeline objects, each of which has a key (the name of the pipeline, such as base) and a value of the array of strings.
I think a closer representation would be for ra to be a map/object itself, rather than an array. The keys of this map would be the (string) pipeline names, and the values of this map would be arrays of RA log strings for each pipeline.
This may also explain what you are seeing when trying to parse the JSON logs into Pipeline[].
I see; for some reason I added an extra layer of arrays in my mental model. I have renamed the interface to Pipelines to indicate that the object contains all pipelines rather than an array of Pipeline objects. This did indeed fix the Pipeline[] error. Thank you!
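The representation this thread converges on can be written as an index signature, directly mirroring the "base" / "standard" / "order_500000" example above (the helper function is hypothetical):

```typescript
// `ra` is itself a map from pipeline name to an array of RA steps,
// not an array of pipeline objects.
interface Pipelines {
  [pipelineName: string]: string[];
}

// Example helper: list the pipeline variants present for one predicate.
function pipelineNames(ra: Pipelines): string[] {
  return Object.keys(ra);
}
```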
adityasharad left a comment
I think this is shaping up well and you've correctly limited the scope for this PR, but I've confused myself about the type/structure of the expected data. Several questions and naming suggestions.
expect(item.ra).to.not.be.empty;
if (item.ra !== undefined) {
  for (const pipeline of Object.entries(item.ra)) {
    expect (pipeline).to.not.be.empty;
Object.entries returns 2-element arrays of the form [key, value], so each array here will always be non-empty and this test will always pass. Should this be something like expect item.ra.pipeline.to.not.be.empty?
Ah, you're right. I think I actually meant something like
for (const pipeline of item.ra) {
  expect (pipeline).to.not.be.empty;
}
because I wanted to check that each pipeline in the array (item.ra) was non-empty. But when I use the above code I get TypeError: item.ra is not iterable 🤔 which I find odd, because it is of type Pipeline[].
After some insight from @henrymercer on Slack, it seems that a likely culprit of the error is that the JSON parsing at runtime may not actually be parsing into the type Pipeline[]. I will do some 🔬
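The TypeError itself has a simple explanation: a plain object parsed from JSON is not iterable, so `for...of` over it throws regardless of the declared TypeScript type. Iterating the map's values gives the intended check (sketch only; the field shape is assumed from the test data in this thread):

```typescript
// Check that every pipeline in the `ra` map has at least one RA step.
// Object.values works where `for (const p of ra)` throws, because the
// parsed `ra` is a plain object, not an array.
function allPipelinesNonEmpty(ra: { [name: string]: string[] }): boolean {
  return Object.values(ra).every(steps => steps.length > 0);
}
```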
*/
export function parseVisualizerData(logSummary: string): EvaluatorLogData[] {
  // Remove newline delimiters because summary is in .jsonl format.
  const jsonSummaryObjects: string[] = logSummary.split(os.EOL + os.EOL);
Can you explain why we split on two EOLs? I see your test data is formatted to make each item multiline and has 2 EOLs between items, but is that identical to what the CLI produces? The JSONL format requires only one EOL between lines.
Yes, I have taken the test data from summary files that the CLI produces. A single EOL delimits each line in the file, so splitting on one splits every single line in the file. Splitting on 2 EOLs splits by every object in the file. Let me know if that makes sense!
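Concretely, the split discussed here could look like the following sketch (hypothetical function name): each record in the CLI-produced summary is pretty-printed across several lines, and records are separated by a blank line, so splitting on two EOLs yields one string per object.

```typescript
import * as os from 'os';

// Split a pretty-printed .jsonl summary into one string per object.
// A single EOL separates lines *within* an object; two EOLs (a blank
// line) separate the objects themselves.
function splitSummaryObjects(logSummary: string): string[] {
  return logSummary
    .split(os.EOL + os.EOL)
    .filter(obj => obj.trim() !== '');
}
```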
aeisenberg left a comment
This is looking good to me. I think you've already addressed Aditya's comments.
Co-authored-by: Andrew Eisenberg <aeisenberg@github.com>
Thank you all for the reviews! Merging for now ✨
This PR is the first step towards adding a visualizer for evaluator log data. It includes the following logic:
After this PR is merged, we will merge a PR displaying the data in the data model in a VSCode TreeView.
The PR is marked as draft due to the number of comments labeled REVIEW. I am assigning reviewers who I believe may have opinions about some of the items marked for review, especially the data model of what items folks looking at the visualizer may eventually want to see, and which items in the summary log we are/are not interested in (and therefore skip parsing).

Checklist

ready-for-doc-review label there.