Improve experience for users of the log aggregation feature #574

@lfrancke

Description

In the past, customers ran into the size limit of our log volumes, which caused Pods to be killed with the error `Usage of EmptyDir volume "log" exceeds the limit`.
This issue is about fixing the underlying problem, documenting the behavior, and explaining how users can modify the settings themselves if needed.

Value

We want this feature so that Pods do not crash arbitrarily under normal circumstances, especially when load is high and many logs are produced. This will lead to fewer support cases for us.

Dependencies

  • I (@lfrancke) am unsure about the exact dependencies, but I assume this will require changes to the documentation as well as a change in operator-rs.
  • Depending on the exact implementation, it might also require changes to each operator.

Tasks

- [x] Agree on a new default size for log volumes
- [x] Investigate whether we want to lower the [`checkIncrement`](https://logback.qos.ch/manual/appenders.html#checkIncrement) setting (and equivalent for other logging implementations)
- [x] Potentially implement the changes from the investigation
- [x] Document what customers need to change if they still run into problems, and make sure the documentation contains the exact error message
- [ ] https://github.com/stackabletech/operator-rs/pull/853
- [ ] https://github.com/stackabletech/nifi-operator/pull/671

Acceptance Criteria

  • Users can search our documentation for the above-mentioned error and will find a page/section telling them where it comes from and how to solve it.
  • NiFi (and other tools) will fail less often under load due to log volume size restrictions.

(Information Security) Risk Assessment

I cannot identify any additional significant risks this would introduce.
If we increase the default volume size for logs, customers will need to provide more resources, which risks running into resource constraints.

Quality

  • This should be tested by getting one of our tools to emit a large number of log statements in a short time (e.g. by raising the log level of a component to TRACE) and verifying that it doesn't fail under these "normal" high-load conditions.
    • As this was originally reported for NiFi, that would be a good candidate.
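
While running such a test, it can help to watch how close the log directory gets to the volume limit. The following is a minimal sketch of a helper for that, not part of any operator: the function names are hypothetical, the directory path and the limit are whatever you pass in, and it sums apparent file sizes, which may differ slightly from how the kubelet accounts for EmptyDir usage.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may be rolled over or deleted while we walk the tree.
                pass
    return total


def usage_fraction(path: str, limit_mib: float) -> float:
    """Fraction of the (assumed) volume limit currently used by `path`."""
    return dir_size_bytes(path) / (limit_mib * 1024 * 1024)
```

Calling `usage_fraction` periodically against the log directory during the TRACE run shows whether usage stays comfortably below 1.0 or spikes between log rotations.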

Release Notes

Apache NiFi: The default size of the ephemeral EmptyDir Volumes used to store log files before they are aggregated has been increased from 33 MiB to 500 MiB. Additionally, the interval at which Logback checks whether the maximum log file size has been reached was lowered from 60 seconds to 5 seconds.
Previously, NiFi log files could grow larger than the log Volume between file size checks, resulting in a `Usage of EmptyDir volume "log" exceeds the limit` error and the NiFi Pod being evicted.
This change will not prevent this from ever happening again, but it will decrease the likelihood.
We have also documented this behavior and how to further adjust the size of the volumes if needed.
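
To illustrate why both changes reduce the likelihood of eviction, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model in which the active log file can overshoot its maximum size by whatever is written during one check interval; already rolled-over files are ignored. The maximum file size (10 MiB) and the write rate (0.5 MiB/s) are hypothetical values chosen for illustration; only the 33 MiB / 500 MiB limits and the 60 s / 5 s intervals come from the text above.

```python
def worst_case_file_size_mib(max_file_size_mib: float,
                             write_rate_mib_per_s: float,
                             check_interval_s: float) -> float:
    """Upper bound on the active log file size in the simplified model:
    the file may keep growing for one full check interval after it has
    reached its configured maximum size."""
    return max_file_size_mib + write_rate_mib_per_s * check_interval_s


def exceeds_volume(volume_limit_mib: float,
                   max_file_size_mib: float,
                   write_rate_mib_per_s: float,
                   check_interval_s: float) -> bool:
    """True if the worst-case file size is larger than the volume limit."""
    worst = worst_case_file_size_mib(
        max_file_size_mib, write_rate_mib_per_s, check_interval_s)
    return worst > volume_limit_mib


# Hypothetical workload: 10 MiB maximum file size, 0.5 MiB/s of log output.
old_settings = exceeds_volume(33, 10, 0.5, 60)   # old volume size, old interval
new_settings = exceeds_volume(500, 10, 0.5, 5)   # new volume size, new interval
```

Under these assumed numbers the old settings allow the file to reach 40 MiB, which is above the 33 MiB limit, while the new settings cap it at 12.5 MiB, far below 500 MiB. A sufficiently high write rate can still exceed the limit, which is why the change only lowers the likelihood.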

Remarks

See our internal Slack discussion for more details: https://stackable-workspace.slack.com/archives/C031NP72H7T/p1710514724269639

Metadata

Labels

release-note, release/24.11.0

Projects

Status

Done
