@ddelnano ddelnano commented Jan 21, 2025

Summary: Make metadata pod lookups more resilient to short-lived processes

This is a continuation of the work started in #1989. Since the local_addr column is populated for client-side traces, it can be used as a fallback lookup for these traces. This doesn't solve every permutation of missing short-lived processes (#1638), but it provides more coverage than before.
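The fallback order described above can be sketched as follows. This is a minimal illustration, not Pixie's implementation: the two dictionaries, the sample values, and the `lookup_pod_name` helper are hypothetical stand-ins for Pixie's metadata state, assumed here only to show the "UPID first, pod IP second" ordering.

```python
from typing import Optional

# Hypothetical in-memory metadata snapshots (assumption for illustration only;
# Pixie's real metadata state is not modeled here).
UPID_TO_POD = {"upid-1": "default/frontend"}
POD_IP_TO_POD = {"10.0.0.7": "default/curl-client"}


def lookup_pod_name(upid: str, local_addr: Optional[str]) -> str:
    """Resolve a pod name by UPID, falling back to the pod IP (local_addr)."""
    pod = UPID_TO_POD.get(upid, "")
    if pod:
        return pod
    # A short-lived process may have exited before its UPID was recorded,
    # so try the trace's local address as a pod IP instead.
    if local_addr:
        return POD_IP_TO_POD.get(local_addr, "")
    # Neither lookup succeeded; an empty name signals "unknown pod".
    return ""
```

In this sketch, a trace whose UPID is unknown still resolves to a pod as long as its `local_addr` matches a known pod IP; a trace with neither stays unresolved, which mirrors the note that the change does not cover every case.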

Relevant Issues: #1638

Type of change: /kind bugfix

Test Plan: Verified the following

  • Compared performance with and without this change using src/e2e_test/vizier/exectime:exectime. This change has a minor performance impact, but it closes the gap in certain situations that previously caused users to distrust Pixie's instrumentation.
# Performance baseline
$ ./exectime benchmark -a testing.getcosmic.ai:443 -c <cluster_id> 2>&1 | tee baseline_for_simple_udf_swap_e20880ffd.txt
# Performance of this change
$ ./exectime benchmark -a testing.getcosmic.ai:443 -c <cluster_id> 2>&1 | tee simple_udf_swap_cd217c05c.txt

simple_udf_swap_cd217c05c.txt
baseline_for_simple_udf_swap_e20880ffd.txt

  • Ran for i in $(seq 0 1000); do curl http://google.com/$i; sleep 2; done within a pod and verified that with this change all traces are shown; without this change, a significant number of traces are missed. See before and after screenshots below:

[Screenshot (before): vizier-0 14 14-curl-with-missing-data]
[Screenshot (after): traces-with-new-fallback]

Changelog Message: Fix a certain class of cases where Pixie previously missed protocol traces from short-lived connections

This makes the df.ctx['pod'] syntactic sugar try a second pod name
lookup if the default upid -> pod name lookup fails. That failure
is common for pods with short-lived processes, so a pod IP based
lookup (local_addr) is attempted when the first lookup fails.

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
@ddelnano ddelnano requested a review from a team as a code owner January 21, 2025 23:52
@ddelnano ddelnano merged commit 623e988 into pixie-io:main Jan 24, 2025
31 checks passed
@ddelnano ddelnano deleted the ddelnano/use-fallback-udf-to-fix-short-lived-process-pod-name-lookups branch January 24, 2025 21:19
ddelnano added a commit to ddelnano/pixie that referenced this pull request Aug 6, 2025
…xie-io#2094)

Summary: Make metadata pod lookups more resilient to short-lived
processes

This is a continuation of the work started in pixie-io#1989. Since the
`local_addr` column is populated for client-side traces, it can be used
as a fallback lookup for these traces. This doesn't solve every
permutation of missing short-lived processes (pixie-io#1638), but it
provides more coverage than before.

Relevant Issues: pixie-io#1638

Type of change: /kind bugfix

Test Plan: Verified the following
- [x] Compared performance with and without this change using
`src/e2e_test/vizier/exectime:exectime`. This change has a minor
performance impact, but it closes the gap in certain situations that
previously caused users to distrust Pixie's instrumentation.
```
# Performance baseline
$ ./exectime benchmark -a testing.getcosmic.ai:443 -c <cluster_id> 2>&1 | tee baseline_for_simple_udf_swap_e20880ffd.txt
# Performance of this change
$ ./exectime benchmark -a testing.getcosmic.ai:443 -c <cluster_id> 2>&1 | tee simple_udf_swap_cd217c05c.txt
```

[simple_udf_swap_cd217c05c.txt](https://github.com/user-attachments/files/18497709/simple_udf_swap_cd217c05c.txt)

[baseline_for_simple_udf_swap_e20880ffd.txt](https://github.com/user-attachments/files/18497710/baseline_for_simple_udf_swap_e20880ffd.txt)
- [x] Ran `for i in $(seq 0 1000); do curl http://google.com/$i; sleep 2; done`
within a pod and verified that with this change all traces are shown;
without this change, a significant number of traces are missed.
See before and after screenshots below:

![vizier-0 14 14-curl-with-missing-data](https://github.com/user-attachments/assets/035b5dcf-d87a-4134-84c1-9e478594927b)

![traces-with-new-fallback](https://github.com/user-attachments/assets/2a84ecbb-83cb-45ae-af85-77b1773efb59)

Changelog Message: Fix a certain class of cases where Pixie previously
missed protocol traces from short-lived connections

---------

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
GitOrigin-RevId: 623e988