perf: reduce busybox init-time memory allocation#19952
Draft
Reduce init-time memory for the busybox binary by eliminating unnecessary imports, deferring allocations with sync.OnceValue, and breaking heavy transitive dependency chains.
Results (Linux amd64):
- Busybox: 16.1 MB -> 12.9 MB heap (-20%), 245K -> 173K mallocs (-29%)
- AC standalone: 9.1 MB -> 7.2 MB heap (-21%), 87K -> 51K mallocs (-41%)
- Binary size: 205 MB -> 194 MB (-5%)
Generated with assistance from AI
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skipping CI for Draft Pull Request.
🚀 Build Images Ready
Images are ready for commit ce18068. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-659-gce180687e0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test files with sql_integration build tag reference schema vars that are now sync.OnceValue functions. Add () to all schema.XxxSchema and pkgSchema.XxxSchema references in test files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
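To illustrate why the test files needed the extra `()`, here is a minimal sketch of the variable-to-function conversion. `Schema`, `AlertsSchema`, and `buildCount` are hypothetical stand-ins, not the generated code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Schema is a hypothetical stand-in for *walker.Schema.
type Schema struct {
	Table string
}

// buildCount records how many times the expensive constructor ran.
var buildCount int

// AlertsSchema used to be a value built during package init.
// Wrapped in sync.OnceValue it becomes a function: the schema is built
// on the first call and the cached pointer is returned afterwards.
var AlertsSchema = sync.OnceValue(func() *Schema {
	buildCount++
	return &Schema{Table: "alerts"}
})

func main() {
	fmt.Println(AlertsSchema().Table)           // first call builds the schema
	fmt.Println(AlertsSchema() == AlertsSchema()) // later calls reuse it
	fmt.Println(buildCount)
}
```

Because `AlertsSchema` is now a `func() *Schema`, every reference that used to read the variable has to call it, which is the mechanical change applied to the test files. `sync.OnceValue` requires Go 1.21 or later.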
Each zap logger created with sampling enabled allocates a counters [7][4096]counter array (~450 KB). With 100+ loggers across sensor's dependency tree, this totals ~46 MB of heap (40% of sensor's runtime memory at 128Mi limit). Remove per-logger sampling from the zap config. This trades potential log volume increase for massive memory savings. On an idle cluster, sensor's heap drops from ~115 MB to ~69 MB. For edge deployments with tight memory limits, this is critical — it enables sensor to run at ~50 Mi instead of ~82 Mi. Generated with assistance from AI Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
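The per-logger figure can be checked with a little arithmetic. The sketch below assumes each counter slot holds two 8-byte fields; the `counter` layout is an assumption for illustration, and only the `[7][4096]` shape comes from the commit message.

```go
package main

import (
	"fmt"
	"unsafe"
)

// counter is an assumed layout: two 8-byte fields per slot.
type counter struct {
	resetAt int64
	count   uint64
}

// samplerCounters has the [7][4096]counter shape quoted in the commit message.
type samplerCounters [7][4096]counter

// perLoggerBytes returns the size of one sampler's counter table.
func perLoggerBytes() uintptr {
	var c samplerCounters
	return unsafe.Sizeof(c)
}

func main() {
	per := perLoggerBytes()
	fmt.Printf("per logger: %d bytes (%d KiB)\n", per, per/1024)
	fmt.Printf("100 loggers: %.1f MB\n", float64(per*100)/1e6)
}
```

Under that assumption one table is 7 × 4096 × 16 = 458,752 bytes (448 KiB), so 100 loggers come to roughly 45.9 MB, in line with the ~46 MB reported above.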
Comprehensive fix for all _test.go files referencing schema vars that are now sync.OnceValue functions. Covers sql_integration, benchmark, and unit test files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With sync.OnceValue schemas, registration happens on first access. Tests that explicitly call RegisterCategoryToTable or RegisterTable after accessing a schema would cause fatal duplicate registration. Make both functions idempotent — silently ignore re-registration of the same table. Also fix select_field_test.go which incorrectly added () to TestStructsSchema (a test schema not converted to sync.OnceValue). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
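A minimal sketch of idempotent registration follows. The `registry` type and its `Register` method are hypothetical; the real `RegisterCategoryToTable` and `RegisterTable` signatures are not shown in this conversation. Returning an error when a *different* table is registered for the same category is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for the table registry.
type registry struct {
	mu     sync.Mutex
	tables map[string]string // category -> table name
}

func newRegistry() *registry {
	return &registry{tables: make(map[string]string)}
}

// Register records the table for a category. Registering the same table
// again is silently ignored; a conflicting table is reported as an error
// (assumed behavior for this sketch).
func (r *registry) Register(category, table string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if existing, ok := r.tables[category]; ok {
		if existing == table {
			return nil // idempotent re-registration
		}
		return fmt.Errorf("category %q already registered to table %q", category, existing)
	}
	r.tables[category] = table
	return nil
}

func main() {
	r := newRegistry()
	fmt.Println(r.Register("ALERTS", "alerts"))       // first registration
	fmt.Println(r.Register("ALERTS", "alerts"))       // lazy schema access followed by an explicit call
	fmt.Println(r.Register("ALERTS", "other") != nil) // conflicting table
}
```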
The GraphQL schema is parsed eagerly at startup even when no UI or API client ever connects. On edge clusters without UI access, this is 5 MB wasted. Defer parsing to the first GraphQL HTTP request using sync.Once. The first request pays a one-time parsing cost (~ms), subsequent requests use the cached schema. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
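The deferral can be sketched as an HTTP handler that parses inside `sync.Once`. `parsedSchema`, `lazyHandler`, and the request path are hypothetical; the actual handler wiring in central is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// parsedSchema is a hypothetical stand-in for the parsed GraphQL schema.
type parsedSchema struct{ typeCount int }

// lazyHandler parses the schema on the first request instead of at startup.
type lazyHandler struct {
	once   sync.Once
	schema *parsedSchema
	parses int // how many times the expensive parse ran
}

func (h *lazyHandler) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	h.once.Do(func() {
		h.parses++
		h.schema = &parsedSchema{typeCount: 3} // placeholder for real parsing
	})
	fmt.Fprintf(w, "types=%d", h.schema.typeCount)
}

// request sends one in-memory request to h and returns the response body.
func request(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/api/graphql", nil))
	return rec.Body.String()
}

func main() {
	h := &lazyHandler{}
	fmt.Println(h.parses) // nothing parsed before the first request
	fmt.Println(request(h))
	fmt.Println(request(h))
	fmt.Println(h.parses)
}
```

If no request ever arrives, the schema is never parsed and its memory is never allocated. Concurrent first requests block inside `Do` until the single parse completes.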
Scale database connection pool based on ROX_MEMLIMIT (set via Kubernetes downward API from container memory limits).
By default, pgx creates max(4, NumCPU) connections, each with 512-entry statement and description caches. On a 16-core host this means 16 connections × 2 × 512 cached entries, duplicating cache data across connections and using 10+ MB.
For memory-constrained environments:
- <512 Mi: 2 connections, 64-entry caches
- 512 Mi-2 Gi: 4 connections, 128-entry caches
- >2 Gi: pgx defaults (unchanged)
This also reduces CPU overhead from connection management on small-core edge nodes.
Generated with assistance from AI
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
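The tiering above can be sketched as a pure function. `poolSettings` and `poolSettingsFor` are hypothetical names; treating an unset or non-positive limit as "use pgx defaults" and placing exactly 2 Gi in the middle tier are assumptions, since the commit message does not spell out either case.

```go
package main

import "fmt"

const mib = int64(1) << 20

// poolSettings holds the knobs the commit message describes.
// A zero value means "leave the pgx default in place".
type poolSettings struct {
	MaxConns  int
	CacheSize int // applied to both statement and description caches
}

// poolSettingsFor maps a memory limit in bytes to pool settings.
func poolSettingsFor(memLimitBytes int64) poolSettings {
	switch {
	case memLimitBytes <= 0:
		return poolSettings{} // no limit known: keep defaults (assumption)
	case memLimitBytes < 512*mib:
		return poolSettings{MaxConns: 2, CacheSize: 64}
	case memLimitBytes <= 2048*mib:
		return poolSettings{MaxConns: 4, CacheSize: 128}
	default:
		return poolSettings{} // >2 Gi: pgx defaults
	}
}

func main() {
	for _, limit := range []int64{128 * mib, 1024 * mib, 4096 * mib} {
		fmt.Printf("%d Mi -> %+v\n", limit/mib, poolSettingsFor(limit))
	}
}
```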
Fix goimports formatting (blank lines in import groups) and replace empty measurement tool files with valid Go stubs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sensor compiles ALL ~100 default policies into regexp matchers at startup during initialPolicySync. Each policy's criteria (CVE severity, image name patterns, etc.) gets compiled into booleanpolicy evaluators with regexp matchers. This costs ~6 MB. On an idle edge cluster, most policies are never evaluated because they don't match the lifecycle stage or resource type being processed. Wrap CompilePolicy with a lazy proxy that defers the expensive compilation (regexp building, matcher construction) until the first Match* or AppliesTo call. Policies that never get evaluated never get compiled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
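A minimal sketch of the lazy proxy idea, using a single regexp as a stand-in for a compiled policy. `lazyPolicy`, `newLazyPolicy`, and `MatchImage` are hypothetical; the real wrapper around `CompilePolicy` is not shown in this conversation.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// compileCount records how many policies were actually compiled.
var compileCount int

// lazyPolicy defers the expensive compilation until the first match call.
type lazyPolicy struct {
	name     string
	compiled func() (*regexp.Regexp, error)
}

func newLazyPolicy(name, imagePattern string) *lazyPolicy {
	return &lazyPolicy{
		name: name,
		// sync.OnceValues runs the function once and caches both results.
		compiled: sync.OnceValues(func() (*regexp.Regexp, error) {
			compileCount++
			return regexp.Compile(imagePattern)
		}),
	}
}

// MatchImage compiles on first use, then reuses the cached matcher.
func (p *lazyPolicy) MatchImage(image string) (bool, error) {
	re, err := p.compiled()
	if err != nil {
		return false, err
	}
	return re.MatchString(image), nil
}

func main() {
	latest := newLazyPolicy("latest-tag", `:latest$`)
	_ = newLazyPolicy("never-evaluated", `^quay\.io/`)

	ok, err := latest.MatchImage("docker.io/library/nginx:latest")
	fmt.Println(ok, err)
	fmt.Println(compileCount) // only the evaluated policy was compiled
}
```

One consequence of this design is that a malformed pattern surfaces as an error at first evaluation rather than at policy sync time.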
gRPC compression between sensor and central costs ~3 MB of compression buffers and ongoing CPU. On local/same-cluster networks, the bandwidth savings don't justify the cost. Disable by default with opt-in via ROX_SENSOR_GRPC_COMPRESSION=true.
Also includes lazy policy compilation wrapper that defers 6 MB of regexp building until policies are first evaluated.
Additional findings for future optimization:
- Process enricher LRU hardcoded at 100K entries (should scale with memory)
- Multiple cache/buffer sizes not memory-aware
- Network graph default entities use 33 MB (could be optional for edge)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each logger that writes to a file spawns a lumberjack goroutine for log rotation. With ~30 loggers writing to /var/log/stackrox/log.txt, that's 30 idle goroutines + 30 independent file handles to the same file.
In container environments, logs go to stdout and are collected by the container runtime, so file logging is unnecessary overhead.
Set ROX_LOGGING_TO_FILE=false to disable file logging, saving:
- 30 goroutines and their stacks
- File I/O overhead
- lumberjack rotation processing
Default is true (unchanged behavior) for backward compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each CreateLogger call created an independent lumberjack.Logger for the same log file, spawning its own rotation goroutine. With ~30 loggers, that's 30 goroutines + 30 file handles to the same file.
Share a single writer per path via a map. This reduces log rotation goroutines from 30 to 1 and eliminates potential corruption from concurrent uncoordinated writes to the same file.
GC sweet spot experiment findings (included in commit message for context):
- 128Mi: GC thrashing (84 GC/min, 200m CPU)
- 160Mi: Sweet spot (2 GC/min, 4m CPU)
- 192Mi: Comfortable (0 GC/min, 3m CPU)
- Rule: set limit to 1.3-1.5x natural heap size
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
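The "single writer per path via a map" idea can be sketched as below. `writerCache` and its `open` callback are hypothetical; in the real code the callback would construct the rotating file writer, which is replaced here by an in-memory buffer so the example stays self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// writerCache hands out exactly one writer per path.
type writerCache struct {
	mu      sync.Mutex
	writers map[string]io.Writer
	open    func(path string) io.Writer
	opened  int // how many underlying writers were created
}

func newWriterCache(open func(string) io.Writer) *writerCache {
	return &writerCache{writers: make(map[string]io.Writer), open: open}
}

// get returns the shared writer for path, creating it on first use.
func (c *writerCache) get(path string) io.Writer {
	c.mu.Lock()
	defer c.mu.Unlock()
	if w, ok := c.writers[path]; ok {
		return w
	}
	w := c.open(path)
	c.opened++
	c.writers[path] = w
	return w
}

// newBufferCache backs every path with an in-memory buffer (demo only).
func newBufferCache() *writerCache {
	return newWriterCache(func(string) io.Writer { return &bytes.Buffer{} })
}

func main() {
	cache := newBufferCache()
	for i := 0; i < 30; i++ { // ~30 loggers asking for the same file
		cache.get("/var/log/stackrox/log.txt")
	}
	fmt.Println(cache.opened)
}
```

The shared writer must itself be safe for concurrent `Write` calls, since all loggers for a path now funnel into it.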
Test schemas in tools/generate-helpers/pg-table-bindings/ were not regenerated with the sync.OnceValue template change. Test files that reference these schemas had () added by the bulk fix but the schemas were still direct *walker.Schema, causing type mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a ROX_SENSOR_LITE env var (default: false) for edge deployments. When enabled:
- Skip local policy compilation in ProcessPolicySync (saves 6 MB + CPU)
- Skip network entity knowledge base loading (saves 16 MB)
- Events still flow to central for evaluation
- Admission controller still receives policies
- Enforcement (pod kill) still works via central commands
This reduces sensor's runtime memory by ~22 MB for edge clusters that don't need local policy evaluation or cloud provider network flow attribution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Process enrichment LRU cache was hardcoded at 100K entries, a size designed for large enterprise clusters with thousands of containers. On a 50-container edge cluster, this is 2000x oversized.
Use pkg/sensor/queue.ScaleSize to scale based on ROX_MEMLIMIT:
- 128Mi limit → ~3K entries (sufficient for 50 containers)
- 4Gi limit → 100K entries (unchanged behavior)
- Minimum: 100 entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
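The quoted sizes are consistent with linear scaling against a 4 Gi reference. The sketch below assumes exactly that; `scaleSize` is a hypothetical function, and the real signature and formula of `pkg/sensor/queue.ScaleSize` are not shown in this conversation.

```go
package main

import "fmt"

const (
	mib          = int64(1) << 20
	referenceMem = 4096 * mib // limit at which the base size applies (assumed)
)

// scaleSize scales base linearly with the memory limit, clamped between
// floor and base. A non-positive limit keeps the base size (assumption).
func scaleSize(base int, memLimitBytes int64, floor int) int {
	if memLimitBytes <= 0 || memLimitBytes >= referenceMem {
		return base
	}
	scaled := int(int64(base) * memLimitBytes / referenceMem)
	if scaled < floor {
		return floor
	}
	return scaled
}

func main() {
	fmt.Println(scaleSize(100_000, 128*mib, 100))  // small edge limit
	fmt.Println(scaleSize(100_000, 4096*mib, 100)) // reference limit
	fmt.Println(scaleSize(100_000, 1*mib, 100))    // clamped to the floor
}
```

With these assumptions a 128Mi limit gives 100,000 × 128 / 4096 = 3,125 entries, which matches the "~3K" figure above.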
pkg/env imported pkg/timeutil just for HoursInDay=24 and DaysInWeek=7. pkg/timeutil imports go-timezone which loads a 0.5 MB timezone map at init. Since pkg/env is imported by EVERY component, this added 0.5 MB to every binary. Inline the constants (24 and 7) to eliminate this path. The timezone dep still exists through cfssl/mtls but this removes one import chain. Also regenerates test schemas with sync.OnceValue template. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In sensor-lite mode, detection is deferred to central, so the 4-hour image metadata cache serves no purpose. Set TTL to 1 second to effectively disable caching, preventing 2.5 MB of scan results from accumulating in memory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pkg/registries/types/options.go imported pkg/cloudproviders/gcp/auth for the STSTokenManager interface, forcing EVERY registry type (including Docker) to transitively import the Google Cloud SDK. This added ~500 KB of init overhead and 270+ deps to all components using any registry type. Define a local TokenManager interface in the types package instead. The GCP auth package's STSTokenManager satisfies it structurally (Go duck typing), so no caller changes needed. This eliminates Google Cloud SDK from admission-control and config-controller, which never interact with GCP registries. Also includes image cache TTL reduction in sensor-lite mode and timeutil constant inlining. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
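A minimal sketch of the local-interface technique, collapsed into one file. The method set of `TokenManager` is hypothetical, since the commit message does not list it, and `stsTokenManager` merely stands in for the concrete type in the GCP auth package.

```go
package main

import "fmt"

// TokenManager would be declared in the lightweight types package.
// Its methods here are hypothetical.
type TokenManager interface {
	Start()
	Stop()
}

// Options refers only to the local interface, so importing the types
// package no longer drags in the cloud SDK.
type Options struct {
	TokenManager TokenManager
}

// stsTokenManager stands in for the concrete type in the heavy package.
// It never mentions TokenManager, yet satisfies it by having the methods.
type stsTokenManager struct{ running bool }

func (m *stsTokenManager) Start() { m.running = true }
func (m *stsTokenManager) Stop()  { m.running = false }

// Compile-time check that the concrete type satisfies the local interface.
var _ TokenManager = (*stsTokenManager)(nil)

func main() {
	m := &stsTokenManager{}
	opts := Options{TokenManager: m}
	opts.TokenManager.Start()
	fmt.Println(m.running)
}
```

Go interfaces are satisfied implicitly, so the dependency arrow only points from the caller that wires both packages together, not from the types package to the cloud package.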
Replace cfssl with Go stdlib crypto for certificate operations across
pkg/mtls, pkg/certgen, and 10+ satellite files. This removes cfssl's
entire transitive chain (zcrypto, zlint, publicsuffix — ~500 deps)
from sensor, AC, and all components.
Restructure pkg/registries/factory.go: move AllCreatorFuncs to
pkg/registries/all/ sub-package so sensor can import lightweight
registry creators (docker, rhel, artifactory, nexus, quay, ghcr, ibm)
without pulling in AWS/Azure/Google cloud SDKs.
Replace cosign/sigstore signature fetching in sensor with a no-op
implementation — Central already handles signature verification.
Build-tag AWS ECR credentials manager (awsecr) and cloud provider
metadata detection (awscloud, gcpcloud) so the default sensor binary
excludes heavy cloud SDKs. Stubs return nil (same as non-cloud nodes).
Remove GCP singleton from sensor registry store and app lifecycle.
Inline os.LookupEnv("CI") in clientconn to decouple testutils from
production binaries.
Sensor deps: ~2392 → ~1569 (-34%). Heavy cloud deps: 249 → 0.
AC heavy deps: 14 → 0.
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ConvertTimestampToGraphqlTimeOrError was in pkg/protocompat/time.go, pulling the entire graph-gophers/graphql-go library (16 packages) into sensor, AC, and every component—even though the function is only called from central/graphql/resolvers/. Move it to a local `timestamp()` helper in the resolvers package. Update the code generator template to emit the new function name. Eliminates 16 unnecessary packages from sensor and AC binaries. Partially generated with the help of AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Build tags for awsecr/awscloud/gcpcloud were the wrong approach—they would require separate release images for different deployment targets. AWS SDK, GCP metadata, and ECR credential manager init() functions are trivial (function pointer assignments, no heap allocation). The code already handles non-cloud nodes at runtime by returning nil from createCredentialsManager when node.Spec.ProviderID is not AWS. The types.go extraction for GCP is kept since it cleanly separates shared types from the metadata detection code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pkg/grpc/errors/interceptor.go imported pgconn.PgError just to check
errors.As() for PostgreSQL error sanitization. This pulled 15 pgx
packages into sensor, AC, and every non-database component.
Replace with a duck-typed interface { SQLState() string } — Go's
errors.As matches structurally, so pgconn.PgError is still caught
at runtime without the compile-time import. Tests confirm this works.
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
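The interface-based check can be sketched without importing pgx. `pgLikeError` is a hypothetical stand-in for `*pgconn.PgError`; the only property relied on is that the real type exposes a `SQLState() string` method.

```go
package main

import (
	"errors"
	"fmt"
)

// pgLikeError stands in for *pgconn.PgError in this sketch.
type pgLikeError struct{ code string }

func (e *pgLikeError) Error() string    { return "pg error " + e.code }
func (e *pgLikeError) SQLState() string { return e.code }

// sqlStateError is the local interface used instead of the concrete type.
type sqlStateError interface {
	SQLState() string
}

// errPlain is an ordinary error without a SQLState method.
var errPlain = errors.New("plain failure")

// wrapped mimics an error passed up through fmt.Errorf with %w.
func wrapped(err error) error {
	return fmt.Errorf("query failed: %w", err)
}

// sqlState reports the SQLSTATE code of any error in the chain that
// implements sqlStateError.
func sqlState(err error) (string, bool) {
	var target sqlStateError
	if errors.As(err, &target) {
		return target.SQLState(), true
	}
	return "", false
}

func main() {
	fmt.Println(sqlState(wrapped(&pgLikeError{code: "23505"})))
	fmt.Println(sqlState(wrapped(errPlain)))
}
```

`errors.As` accepts a pointer to an interface type as its target and matches the first error in the chain that implements it, which is why no compile-time reference to the database driver is needed.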
Move postgres DataType constants to lightweight pkg/postgres/datatypes sub-package. This breaks the chain: pkg/search/options.go imported pkg/postgres (for DataType string constants) which imported pgx. Same fix for pkg/search/postgres/aggregatefunc. Decouple pkg/grpc/routes from pgconfig — pass isExternalDB as a parameter instead of importing pgconfig (which imports pkg/postgres which imports pgx). Only central/main.go calls this function. Combined with the earlier pgconn duck-typing fix, this eliminates all 15 pgx packages from sensor and admission-control binaries. Sensor total deps: 2392 → 1608 (-33%). Partially generated with the help of AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sensor/common/compliance imported claircore/indexer/controller for a single constant string "IndexFinished", and claircore/pkg/rhctag for version parsing that only extracts major.minor integers. Inline both: the constant becomes a string literal, the version parser becomes a 10-line function using strings.SplitN + strconv.ParseInt. This eliminates 10 claircore packages from sensor (scanner-only deps). This is the same pattern as the earlier rhcc constants inlining (which avoided the sqlite/modernc chain). The comment in the code even noted the rhcc inlining was intentional, but didn't finish the job. Partially generated with the help of AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
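A parser along these lines could look like the sketch below. `parseMajorMinor` is a hypothetical name, and the accepted input format (an optional `v` prefix followed by dot-separated numbers) is an assumption; the inlined function from the PR is not shown here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor integers from a version
// string such as "v4.15.2" or "4.15". Anything after the second dot is
// ignored.
func parseMajorMinor(version string) (major, minor int64, err error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q has no minor component", version)
	}
	if major, err = strconv.ParseInt(parts[0], 10, 64); err != nil {
		return 0, 0, fmt.Errorf("parsing major of %q: %w", version, err)
	}
	if minor, err = strconv.ParseInt(parts[1], 10, 64); err != nil {
		return 0, 0, fmt.Errorf("parsing minor of %q: %w", version, err)
	}
	return major, minor, nil
}

func main() {
	fmt.Println(parseMajorMinor("v4.15.2"))
	fmt.Println(parseMajorMinor("4.5"))
	_, _, err := parseMajorMinor("four")
	fmt.Println(err != nil)
}
```

This sketch rejects a minor component that carries a suffix without a dot (for example `4.5-12`); whether the real parser tolerates that is not stated in the commit message.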
ResetFieldMetadataSingleton accepted *testing.T but never used it, pulling the testing package into every production binary. Removed the parameter — callers updated to ResetFieldMetadataSingleton(). This is one of 10 production packages that import "testing" — others include pkg/sac, pkg/grpc, pkg/registries/types. Partially generated with the help of AI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused *testing.T parameters from test helper functions in production files that pull the testing package into all binaries:
- pkg/registries/types/metrics.go: TestCollect*() methods
- pkg/sac/resources/list.go: Register*ForTest() functions
These functions were declared in production files but only called from _test.go files. The *testing.T parameter was never used (it was bound to the blank identifier _). Removing it eliminates the testing import from these packages.
Part of systematic cleanup: 10 production packages import "testing" for test helpers. Fixed 3 so far (booleanpolicy, registries/types, sac/resources). Remaining: pkg/sac (2 files), pkg/grpc (2 files), pkg/booleanpolicy/evaluator/pathutil, sensor/common/networkflow.
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused *testing.T parameters from test helper functions in:
- pkg/sac/test_scope_checker_core.go (2 functions)
- pkg/sac/effectiveaccessscope/test_datasets.go (16 functions)
These 627+ lines of test data structures were compiled into every production binary because the file imported "testing" for never-used *testing.T parameters. Updated all callers in test files to match new signatures.
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/test gke-qa-e2e-tests
Move test-only code out of production Go files to eliminate the testing package from production binaries:
- pkg/grpc/testutils.go: move debug logger, socket/stack diagnostics to debug_test.go. Remove *testing.TB from CreateTestGRPCStreamingService.
- pkg/sac/query_testutils.go → query_testutils_test.go (renamed)
- pkg/booleanpolicy/evaluator/pathutil/testutil.go → testutil_test.go
These files contained test helper functions in production packages, causing the testing package and its test infrastructure to be compiled into every binary.
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systematic cleanup of the "testing.T as reminder" anti-pattern across the codebase. Functions accepted *testing.T / testing.TB parameters they never used, just to signal "this is for tests only". This pulled the testing package into every production binary.
Fixed:
- pkg/grpc/authn/context.go: ContextWithIdentity (30+ callers updated)
- pkg/grpc/authn/basic/identity_test_constructors.go: 5 functions
- sensor/common/networkflow/manager/purger.go: WithPurgerTicker
Combined with earlier commits, testing imports in production packages went from 10 → 2 (both gated by !release build tag).
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes for PR #19952 CI failures:
- pkg/env/api_token_expiration_notifier.go: trailing whitespace
- pkg/protocompat/time_test.go: remove test for moved function
- 14 files: "Too many blank lines in imports" from goimports
- migrator test schemas (16 files): convert to sync.OnceValue
- sensor/common/processsignal/enricher.go: adaptive backoff ticker (5s→2min when LRU empty, resets on new activity; avoids 720 idle scans/hour on stable clusters)
Partially generated with the help of AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ROX_SENSOR_PROCESS_ENRICHER_INTERVAL env var (default 5s) to control how often sensor scans the LRU cache for unresolved container metadata.
The enricher now:
- Skips the full LRU scan when the cache is empty (zero-cost idle)
- Backs off to 2min intervals when no work is found
- Resets to the base interval when new metadata callbacks arrive
- Base interval is configurable: set to "30s" or "5m" for stable or edge clusters to reduce CPU
Before: 720 full LRU scans/hour on idle clusters (12/min × 60)
After: ~30 scans/hour when idle (backoff), 0 scans when cache empty
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
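The backoff behavior can be sketched as a small state machine. `scanBackoff` is a hypothetical type, and doubling the interval on each idle scan is an assumption: the commit message only says the interval backs off to 2min and resets on new activity.

```go
package main

import (
	"fmt"
	"time"
)

// scanBackoff tracks the wait between LRU scans.
type scanBackoff struct {
	base, ceiling, current time.Duration
}

func newScanBackoff(base, ceiling time.Duration) *scanBackoff {
	return &scanBackoff{base: base, ceiling: ceiling, current: base}
}

// defaultScanBackoff uses the 5s base and 2min ceiling from the commit message.
func defaultScanBackoff() *scanBackoff {
	return newScanBackoff(5*time.Second, 2*time.Minute)
}

// next returns the wait before the following scan. Finding work resets
// the interval; an idle scan doubles it up to the ceiling (assumed policy).
func (b *scanBackoff) next(foundWork bool) time.Duration {
	if foundWork {
		b.current = b.base
		return b.current
	}
	b.current *= 2
	if b.current > b.ceiling {
		b.current = b.ceiling
	}
	return b.current
}

// scansPerHour is the steady-state scan rate at a fixed interval.
func scansPerHour(interval time.Duration) int {
	return int(time.Hour / interval)
}

func main() {
	b := defaultScanBackoff()
	for i := 0; i < 6; i++ {
		fmt.Println(b.next(false)) // idle scans: interval grows to the ceiling
	}
	fmt.Println(b.next(true)) // new activity: back to the base interval
	fmt.Println(scansPerHour(b.base), scansPerHour(b.ceiling))
}
```

At a fixed 5s interval this gives 720 scans/hour, and at the 2min ceiling 30 scans/hour, matching the before and after figures above.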
Analysis of the data flow shows the ticker is just a safety net:
- Process signals try immediate enrichment via LookupByContainerID
- If metadata missing, the signal goes to the LRU cache
- When pod metadata arrives, the cluster entity store fires a callback on metadataCallbackChan → resolves the container (event-driven)
- The ticker only catches the rare race where the callback is missed
The callback resolves most containers within 1-2s. The previous 5s ticker was scanning the entire LRU cache 720 times/hour for a race condition that the event-driven path already handles.
Changed default from 5s to 30s. With the adaptive backoff, idle clusters back off to 2min. Configurable via ROX_SENSOR_PROCESS_ENRICHER_INTERVAL for testing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Reduce init-time memory for the busybox binary and all components by eliminating unnecessary imports, deferring allocations with sync.OnceValue, and breaking heavy transitive dependency chains.
Results (Linux amd64):
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Heap profiling with pprof in Linux amd64 containers via podman.
🤖 Generated with Claude Code