
Parse summary of evaluator logs into data model #1405

Merged
angelapwen merged 23 commits into github:main from angelapwen:eval-log-visualizer-parsing on Jul 12, 2022

Conversation

@angelapwen (Contributor) commented Jun 23, 2022

This PR is the first step towards adding a visualizer for evaluator log data. It includes the following logic:

  • running the command to generate log summaries in JSON format (for ease of parsing) after a query completes
  • parsing each interesting JSON object from the log summary into a data model (where "interesting" means the object contains all fields that our data model requires)
  • adding a data model encapsulating the fields that we expect to surface in our visualizer

Once this PR is merged, we will merge a follow-up PR displaying the data from the data model in a VS Code TreeView.

The PR is marked as draft due to the number of comments labeled REVIEW. I am assigning reviewers who I believe may have opinions about some of the items marked for review, especially which items folks looking at the visualizer may eventually want to see in the data model, and which items in the summary log we are or are not interested in (and therefore skip parsing).

Checklist

  • CHANGELOG.md has been updated to incorporate all user visible changes made by this pull request.
  • Issues have been created for any UI or other user-facing changes made by this pull request.
  • [Maintainers only] If this pull request makes user-facing changes that require documentation changes, open a corresponding docs pull request in the github/codeql repo and add the ready-for-doc-review label there.

const jsonObj = JSON.parse(obj);

// Only convert log items that have an RA and millis field
// REVIEW: Not sure this is exactly what we want. Are we now excluding too many items?
Contributor:

I think that's reasonable. The only predicates missing a millis field are ones where we don't have a meaningful evaluation time for them (e.g. because we skipped evaluating due to a sentinel being empty or a cache hit), which aren't likely to be ones a user debugging performance cares about. The only ones missing ra are extensionals (that is: predicates that are part of the actual database). Those do have a meaningful evaluation time (because we have to load them from disk and into the cache), but it's usually pretty small so again not likely to be something a user debugging performance cares about.

angelapwen (author):

Thanks for sharing your thoughts on this! Glad to hear this is on the right track. I have a feeling there may be other edge cases that I'm not seeing (and therefore might fail while parsing) that I'll catch when I make the tests more robust.
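For reference, a minimal sketch of the guard being discussed, with the rationale from the review above folded into comments; the variable names around the guard (visualizerItems in particular) are assumptions, not the PR's actual code:

const jsonObj = JSON.parse(obj);

// Keep only log items with both `ra` and `millis`:
// - items missing `millis` were never meaningfully evaluated
//   (empty sentinel, cache hit), and
// - items missing `ra` are extensionals (part of the database itself),
// so neither is interesting when debugging query performance.
if (jsonObj.ra !== undefined && jsonObj.millis !== undefined) {
  visualizerItems.push(jsonObj); // hypothetical accumulator
}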


// REVIEW: it seems clunky to have to run this command again just to get the same summary file
// in JSON format, but I'm not sure it's worth changing the way it works in the CLI for just
// this use case?
Contributor:

I think this is fine for now. I do think it's looking like we want a way to specify which types of output to send where in codeql generate log-summary. Then we could write a single invocation like codeql generate log-summary --output "end-table:some-file" --output "summary:some-other-file" --output "human-readable:yet-another-file". But having something like that would be a pretty big rewrite of the already quite complex codeql generate log-summary.

Contributor:

Agreed. Another option I just suggested verbally was to have an --output ide or similar, which produces all the different output formats the extension expects at this point, saved in separately named output files. But this can all wait for the future.

angelapwen (author):

Sounds good. Just filed https://github.com/github/codeql-core/issues/2480 to track for later and move discussion there!

// into another class later on?

// REVIEW: We may want to stream this later on; we are generating the human-readable
// file in memory above but the JSON formatted one should be larger?
Contributor:

I don't think the human-readable summary ever ends up in memory. Both can be pretty big, so I think we definitely want to stream this here.

Contributor:

We are using streaming json elsewhere, so you can do something similar here. See the sarif-parser module.

angelapwen (author):

After discussion with @adityasharad, we decided to implement streaming in a follow-up PR, given that this change will go in under a canary flag, and to mark this PR ready for review. I have documented this as a TODO in the code comments.
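For reference, a minimal sketch of what that follow-up streaming could look like, using only Node core APIs and assuming (per the delimiter discussion later in this thread) that objects in the summary are multi-line and separated by a blank line; streamJsonObjects and onObject are hypothetical names:

import * as fs from 'fs';
import * as readline from 'readline';

// Stream the JSON summary line by line instead of reading it all into memory.
async function streamJsonObjects(
  summaryPath: string,
  onObject: (obj: unknown) => void
): Promise<void> {
  const rl = readline.createInterface({ input: fs.createReadStream(summaryPath) });
  let buffer: string[] = [];
  for await (const line of rl) {
    if (line.trim() === '') {
      // Blank line: end of one JSON object.
      if (buffer.length > 0) {
        onObject(JSON.parse(buffer.join('\n')));
        buffer = [];
      }
    } else {
      buffer.push(line);
    }
  }
  // Flush the final object if the file does not end with a blank line.
  if (buffer.length > 0) {
    onObject(JSON.parse(buffer.join('\n')));
  }
}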

predicateName: jsonObj.predicateName,
timeInMillis: jsonObj.millis,
resultSize: jsonObj.resultSize
// TODO: need to also parse RA, pipeline arrays into the object.
Contributor:

Similar to the comment below, I think if we parse everything here and put it in memory we'll create very significant memory pressure. A better solution, I think, would be to record some barebones information that we need later in the tree view when it's first opened (presumably things like name and evaluation time), along with a position in the file (in bytes) where we can find the event. Then, when the user expands a given predicate, we can jump right back to the correct position and read the event to load that information.

angelapwen (author):

Thanks for this comment; I hadn't considered saving something like the position in the file in bytes. During 🍐 with @aeisenberg we discussed that this would be a good idea, but we will follow on from this PR (and the front-end visualizer PR) with that change, so that we can view simple/small logs in the visualizer first. The change is still under a canary flag 🐦 so we don't expect the memory pressure to be too pressing at the moment.
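A rough sketch of that byte-position idea, assuming start/end offsets are recorded while scanning the summary; all names here are hypothetical:

import * as fs from 'fs';

// Barebones info recorded up front, plus where the full event lives on disk.
interface SummaryIndexEntry {
  predicateName: string;
  millis: number;
  startOffset: number; // byte position of the event in the summary file
  endOffset: number;
}

// When the user expands a tree item, read just that event back lazily.
function readEvent(summaryPath: string, entry: SummaryIndexEntry): unknown {
  const buf = Buffer.alloc(entry.endOffset - entry.startOffset);
  const fd = fs.openSync(summaryPath, 'r');
  try {
    fs.readSync(fd, buf, 0, buf.length, entry.startOffset);
  } finally {
    fs.closeSync(fd);
  }
  return JSON.parse(buf.toString('utf8'));
}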


// REVIEW: Suggestions on other fields that are useful for a performance
// debugger would be welcome!
export interface EvaluatorLogData {
Contributor:

One suggestion here: it might make your life easier to have the field names exactly match the fields from the evaluator log. Then it's easy to update this file (we could even derive it in future from the JSON schema, but that's a future idea).

angelapwen (author):

That makes sense — I think I'd named them whatever was intuitive to me, but it'd be better to keep it consistent with the actual log for future maintenance. Deriving it from a JSON schema certainly seems like it'd make the parsing logic less brittle. I'll update it now to match the evaluator log fields.
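For illustration, a sketch of the model with field names mirroring the evaluator log keys that appear elsewhere in this PR; this is not the final shape:

// Field names deliberately match the evaluator log's own JSON keys,
// so the model stays easy to update against the log format.
export interface EvaluatorLogData {
  predicateName: string;
  millis: number;
  resultSize: number;
  // Keyed by pipeline name (e.g. "base", "standard"); see the `ra`
  // discussion further down this thread.
  ra: { [pipelineName: string]: string[] };
}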



import { parseVisualizerData } from '../../src/pure/log-summary-parser';

describe('Parsing Valid Summary Text', () => {
const validSummaryText = `
Contributor:

At the risk of complicating your test code by introducing file reading, I think it would be useful to keep the JSON test data in standalone files checked into the repo. Then each test case only needs to know the input file name and assertions on the expected output.

angelapwen (author):

I've made this change. I think it would still be good to make the tests more robust: right now I am testing the edge cases I have noticed and caught, but it may be good to test an entire summary file to make sure nothing in the JSON crashes the parser. That could be a pretty large file to check in, though, and I'm not entirely sure what I'd test for other than that it doesn't crash, given that it would be laborious to manually check how many objects I'd expect to be parsed out of it.

Maybe it would be good to just continuously add individual test cases that test different configurations of the JSON objects? 🤔
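A minimal sketch of the file-backed test shape suggested above; the directory layout is an assumption (though valid-summary.jsonl is the file name mentioned later in this thread), and the chai/mocha style matches the existing tests:

import { expect } from 'chai';
import * as fs from 'fs';
import * as path from 'path';
import { parseVisualizerData } from '../../src/pure/log-summary-parser';

describe('Parsing Valid Summary Text', () => {
  it('parses every object in a checked-in summary file', () => {
    // Assumption: test data lives beside the tests in a `data` directory.
    const validSummaryText = fs.readFileSync(
      path.join(__dirname, 'data', 'valid-summary.jsonl'),
      'utf8'
    );
    const items = parseVisualizerData(validSummaryText);
    // The exact count depends on the checked-in file.
    expect(items).to.not.be.empty;
  });
});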

Comment on lines 11 to 14
interface Pipeline {
// Key: pipeline identifier; Value: array of pipeline steps
pipeline: Map<string, string[]>;
}
angelapwen (author):

I have simplified the model behind Pipeline to more closely match what the log actually emits (a map where the key is the name of the pipeline and the value is an array of pipeline steps).

Previously I had PipelineStep as a separate interface that only held a body of strings, but that additional layer of indirection seemed unnecessary. Open to other thoughts here though!

Contributor:

This does not appear to match your test data, where ra is an object with a single field called pipeline, whose value is an array of strings.

angelapwen (author):

Ah, there are actually cases where ra contains many pipelines with different names, which is why I used this schema. I've added an example of this case to the valid-summary.jsonl file, but for convenience here is the relevant part:

"ra" : {
    "base" : [
      "    {2} r1 = namespaces AND NOT namespacembrs_1#antijoin_rhs(Lhs.0)",
      "    return r1"
    ],
    "standard" : [
      "    {2} r1 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs#1 ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
      "",
      "    {2} r2 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs ON FIRST 1 OUTPUT Rhs.1, (Lhs.1 ++ \"::\" ++ Rhs.2)",
      "",
      "    {2} r3 = r1 UNION r2",
      "    {2} r4 = r3 AND NOT QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev(Lhs.0, Lhs.1)",
      "    return r4"
    ],
    "order_500000" : [
      "    {2} r1 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#join_rhs#1 ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
      "",
      "    {2} r2 = JOIN QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev_delta WITH namespacembrs ON FIRST 1 OUTPUT Rhs.1, Lhs.1",
      "    {2} r3 = JOIN r2 WITH namespaces ON FIRST 1 OUTPUT Lhs.0, (Lhs.1 ++ \"::\" ++ Rhs.1)",
      "",
      "    {2} r4 = r1 UNION r3",
      "    {2} r5 = r4 AND NOT QualifiedName::Namespace::getAQualifierForMembers#f0820431#ff#prev(Lhs.0, Lhs.1)",
      "    return r5"
    ]
  },

From this I thought it would be most accurate for ra to be an array of pipeline objects, each with a key (the name of the pipeline, such as base) and a value (the array of strings).

Contributor:

I think a closer representation would be for ra to be a map/object itself, rather than an array. The keys of this map would be the (string) pipeline names, and the values of this map would be arrays of RA log strings for each pipeline.

This may also explain what you are seeing when trying to parse the JSON logs into Pipeline[].

angelapwen (author):

I see; for some reason I added an extra layer of arrays in my mental model. I have renamed the interface to Pipelines to indicate that the object contains all pipelines rather than being an array of Pipeline objects. This did indeed fix the Pipeline[] error. Thank you!
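The resulting shape, as a sketch grounded in the log excerpt above:

// All pipelines for one predicate: pipeline name -> array of RA step strings.
// Matches the "ra" object above ("base", "standard", "order_500000").
type Pipelines = { [pipelineName: string]: string[] };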

@angelapwen angelapwen marked this pull request as ready for review July 1, 2022 15:27
@angelapwen angelapwen requested a review from a team as a code owner July 1, 2022 15:27
@adityasharad (Contributor) left a comment:

I think this is shaping up well and you've correctly limited the scope for this PR, but I've confused myself about the type/structure of the expected data. Several questions and naming suggestions.

expect(item.ra).to.not.be.empty;
if (item.ra !== undefined) {
for (const pipeline of Object.entries(item.ra)) {
expect (pipeline).to.not.be.empty;
Contributor:

Object.entries returns 2-element arrays of the form [key, value], so each array here will always be non-empty and this test will always pass. Should this be something like expect item.ra.pipeline.to.not.be.empty?

angelapwen (author):

Ah, you're right. I think I actually meant something like

for (const pipeline of item.ra) {
  expect (pipeline).to.not.be.empty;
}

because I wanted to check that each pipeline in the array (item.ra) was non-empty. But when I use the above code I get TypeError: item.ra is not iterable 🤔 which I find odd, because it is of type Pipeline[].

angelapwen (author):

After some insight from @henrymercer on Slack, it seems that a likely culprit of the error is that the JSON parsing at runtime may not actually be parsing into the type Pipeline[]. I will do some 🔬
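Once ra parses as a plain object keyed by pipeline name, the check might look like this sketch (a fragment of the test above; chai's optional assertion message is used for clearer failures):

if (item.ra !== undefined) {
  for (const [name, steps] of Object.entries(item.ra)) {
    // Each pipeline's array of RA steps should be non-empty.
    expect(steps, `pipeline ${name}`).to.not.be.empty;
  }
}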


*/
export function parseVisualizerData(logSummary: string): EvaluatorLogData[] {
// Remove newline delimiters because summary is in .jsonl format.
const jsonSummaryObjects: string[] = logSummary.split(os.EOL + os.EOL);
Contributor:

Can you explain why we split on two EOLs? I see your test data is formatted to make each item multi-line, with two EOLs between items, but is that identical to what the CLI produces? The JSONL format requires only one EOL between lines.

angelapwen (author):

Yes, I have taken the test data from summary files that the CLI produces. A single EOL delimits each line within a multi-line object, so splitting on one EOL would split every single line in the file. Splitting on two EOLs splits on every object boundary instead. Let me know if that makes sense!
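A tiny worked example of that delimiter logic, using hypothetical minimal objects:

import * as os from 'os';

// Two multi-line objects separated by a blank line (two EOLs back to back).
const summary =
  '{' + os.EOL + '  "predicateName": "a"' + os.EOL + '}' + os.EOL +
  os.EOL +
  '{' + os.EOL + '  "predicateName": "b"' + os.EOL + '}';

const chunks = summary.split(os.EOL + os.EOL);
console.log(chunks.length); // 2 - one chunk per JSON object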

@aeisenberg (Contributor) left a comment:

This is looking good to me. I think you've already addressed Aditya's comments.

angelapwen and others added 2 commits July 12, 2022 13:45
Co-authored-by: Andrew Eisenberg <aeisenberg@github.com>
@angelapwen (author):

> This is looking good to me. I think you've already addressed Aditya's comments.

Thank you all for the reviews! Merging for now ✨

@angelapwen angelapwen merged commit c9d895e into github:main Jul 12, 2022
@angelapwen angelapwen deleted the eval-log-visualizer-parsing branch July 12, 2022 12:05
4 participants