Monitor the kernel audit backlog via NETLINK_AUDIT to detect audit backlog overflow conditions before they cause kernel panics. When audit failure mode is set to 2 (panic), a full backlog triggers an immediate kernel panic with no warning.

This collector exposes:
- audit.backlog: current queue depth
- audit.backlog_utilization: percentage of backlog_limit used
- audit.backlog_limit: configured maximum
- audit.lost: events dropped due to overflow (rate)
- audit.status: enabled, failure mode, rate_limit, daemon pid

Motivated by a real incident where an SMPP traffic surge overwhelmed auditd, the backlog overflowed, and failure=2 caused a kernel panic on a CentOS 7 production system with zero pre-crash warning.
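As a rough illustration of the audit.backlog_utilization metric, the sketch below derives the percentage from the two values the kernel reports. The function name is hypothetical and not taken from the PR. It assumes the value is sent scaled by 100, matching the divisor of 100 on the utilization dimension in the chart definition quoted later in this review.

```c
#include <stdint.h>

/* Hypothetical helper (not from the PR): backlog utilization in hundredths
 * of a percent, so a dimension with divisor 100 shows 1234 as 12.34%. */
uint64_t audit_backlog_utilization_x100(uint32_t backlog, uint32_t backlog_limit) {
    if (backlog_limit == 0)
        return 0; /* nothing to scale against; also avoids division by zero */

    /* widen before multiplying so large backlogs cannot overflow 32 bits */
    return ((uint64_t)backlog * 10000ULL) / backlog_limit;
}
```

With a backlog of 32 and a limit of 64 this yields 5000, which the divisor turns into 50%.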
1 issue found across 4 files
Confidence score: 5/5
- This looks low risk to merge: the only finding is low severity (3/10) and does not indicate a functional regression.
- In `src/collectors/debugfs.plugin/module-audit.c`, `audit_send_obsolete()` appears to be unused static dead code, which is mainly a maintainability/cleanup concern rather than a runtime correctness issue.
- Pay close attention to `src/collectors/debugfs.plugin/module-audit.c`: remove or wire up `audit_send_obsolete()` to avoid lingering dead code.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/collectors/debugfs.plugin/module-audit.c">
<violation number="1" location="src/collectors/debugfs.plugin/module-audit.c:208">
P3: `audit_send_obsolete()` is dead code: it is a static function introduced in this file but never called.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Plugin as debugfs.plugin
participant AuditMod as NEW: module-audit.c
participant Kernel as Linux Kernel (Netlink)
participant Netdata as Netdata Parent Process
Note over Plugin,Kernel: Collection Cycle (Default: 1s)
Plugin->>AuditMod: do_module_audit()
opt First Run
AuditMod->>Kernel: Open NETLINK_AUDIT socket
AuditMod->>Kernel: Send AUDIT_GET request
alt Success
Kernel-->>AuditMod: Return audit_status
AuditMod->>AuditMod: Mark module available
else Failure / No Permissions
Kernel-->>AuditMod: Error (EACCES/EPROTONOSUPPORT)
AuditMod-->>Plugin: Return 1 (Disable Module)
end
end
AuditMod->>Kernel: NEW: socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT)
AuditMod->>Kernel: NEW: send(AUDIT_GET)
alt Response Received
Note right of Kernel: Includes backlog, limit, <br/>lost events, and failure mode
Kernel-->>AuditMod: struct audit_status
else Timeout (2s) or Error
Kernel-->>AuditMod: recvfrom() error
AuditMod-->>Plugin: Return 0 (Retry next cycle)
end
AuditMod->>AuditMod: NEW: Calculate backlog utilization %
Note over AuditMod: Monitor failure=2 (Panic on overflow)
AuditMod->>Plugin: netdata_mutex_lock(&stdout_mutex)
opt Charts Not Created
AuditMod->>Netdata: NEW: CHART audit.backlog, audit.status, etc.
end
AuditMod->>Netdata: NEW: BEGIN/SET/END (backlog, lost, status values)
AuditMod->>Plugin: netdata_mutex_unlock(&stdout_mutex)
AuditMod->>Kernel: close(fd)
AuditMod-->>Plugin: Return 0 (Success)
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
- audit.conf: three alerts for audit health
  - audit_backlog_utilization: warn >50%, crit >80% of backlog limit
  - audit_failure_mode_panic: warn when failure=2 (config risk)
  - audit_lost_events: warn when events are being dropped
- metadata.yaml: full module documentation for the audit collector including metrics, alerts, setup, and troubleshooting sections
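The real thresholds live in the health configuration, not in C. Purely to make the warn/crit boundaries concrete, here is a small sketch of the same classification. The enum and function names are hypothetical, and it assumes both comparisons are strict (">"), as the commit message states.

```c
/* Hypothetical mirror of the audit_backlog_utilization thresholds */
enum audit_alert_level {
    AUDIT_ALERT_CLEAR = 0,
    AUDIT_ALERT_WARNING = 1,
    AUDIT_ALERT_CRITICAL = 2,
};

enum audit_alert_level audit_backlog_alert_level(double utilization_percent) {
    if (utilization_percent > 80.0)
        return AUDIT_ALERT_CRITICAL; /* crit >80% of backlog limit */
    if (utilization_percent > 50.0)
        return AUDIT_ALERT_WARNING;  /* warn >50% of backlog limit */
    return AUDIT_ALERT_CLEAR;
}
```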
Addresses review feedback: the function was defined but never called. The audit module retries on transient failures rather than permanently disabling itself, so chart obsoleting is not needed at this stage.
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/collectors/debugfs.plugin/metadata.yaml">
<violation number="1" location="src/collectors/debugfs.plugin/metadata.yaml:430">
P2: The permissions note is misleading: `debugfs.plugin` is documented with `CAP_DAC_READ_SEARCH`, not `CAP_AUDIT_CONTROL`, so saying it "already runs with elevated privileges" can give a false impression that no additional capability/root context is needed for this module.</violation>
</file>
This integration requires root privileges or CAP_AUDIT_CONTROL capability
to query the kernel audit subsystem via netlink. The debugfs.plugin
already runs with elevated privileges.
P2: The permissions note is misleading: debugfs.plugin is documented with CAP_DAC_READ_SEARCH, not CAP_AUDIT_CONTROL, so saying it "already runs with elevated privileges" can give a false impression that no additional capability/root context is needed for this module.
Suggested change:
- This integration requires root privileges or CAP_AUDIT_CONTROL capability
- to query the kernel audit subsystem via netlink. The debugfs.plugin
- already runs with elevated privileges.
+ This integration requires root privileges or CAP_AUDIT_CONTROL capability
+ to query the kernel audit subsystem via netlink. On systems where
+ debugfs.plugin runs only with CAP_DAC_READ_SEARCH, this module may not
+ be able to collect audit metrics unless additional privileges are granted.
Pull request overview
Adds Linux audit subsystem monitoring to debugfs.plugin via NETLINK_AUDIT, plus metadata and default health alerts to warn on audit backlog pressure and dangerous failure=2 (panic) configurations.
Changes:
- Introduces a new `debugfs.plugin` module (`module-audit.c`) that queries `AUDIT_GET` and publishes backlog/status/lost metrics.
- Adds health alarms (`audit.conf`) for backlog utilization, panic failure mode, and lost events.
- Registers the new module in `debugfs.plugin` build/module lists and documents it in `metadata.yaml`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/collectors/debugfs.plugin/module-audit.c | New collector module implementing audit netlink query + charts. |
| src/collectors/debugfs.plugin/debugfs_plugin.c | Registers the new audit module in the module list. |
| src/collectors/debugfs.plugin/debugfs_plugin.h | Exposes the do_module_audit() module entry point. |
| src/collectors/debugfs.plugin/metadata.yaml | Adds integration metadata, metrics, and alert references for audit monitoring. |
| src/health/health.d/audit.conf | Adds default health templates for backlog utilization, failure mode panic, and lost events. |
| CMakeLists.txt | Includes module-audit.c in the debugfs.plugin build sources. |
    // chart: audit backlog depth
    printf(PLUGINSD_KEYWORD_CHART
           " audit.backlog '' 'Audit Backlog' 'events' 'audit' 'audit.backlog' line 1340 %d '' 'debugfs.plugin' '%s'\n",
           update_every, name);
    printf(PLUGINSD_KEYWORD_DIMENSION " 'backlog' 'backlog' %s 1 1 ''\n",
           RRD_ALGORITHM_ABSOLUTE_NAME);

    // chart: audit backlog utilization (percentage)
    printf(PLUGINSD_KEYWORD_CHART
           " audit.backlog_utilization '' 'Audit Backlog Utilization' '%%' 'audit' 'audit.backlog_utilization' area 1341 %d '' 'debugfs.plugin' '%s'\n",
           update_every, name);
    printf(PLUGINSD_KEYWORD_DIMENSION " 'utilization' 'utilization' %s 1 100 ''\n",
           RRD_ALGORITHM_ABSOLUTE_NAME);

    // chart: audit backlog limit
    printf(PLUGINSD_KEYWORD_CHART
           " audit.backlog_limit '' 'Audit Backlog Limit' 'events' 'audit' 'audit.backlog_limit' line 1342 %d '' 'debugfs.plugin' '%s'\n",
           update_every, name);
    printf(PLUGINSD_KEYWORD_DIMENSION " 'backlog_limit' 'backlog_limit' %s 1 1 ''\n",
           RRD_ALGORITHM_ABSOLUTE_NAME);

    // chart: audit lost events
    printf(PLUGINSD_KEYWORD_CHART
           " audit.lost '' 'Audit Lost Events' 'events/s' 'audit' 'audit.lost' area 1343 %d '' 'debugfs.plugin' '%s'\n",
           update_every, name);
    printf(PLUGINSD_KEYWORD_DIMENSION " 'lost' 'lost' %s 1 1 ''\n",
           RRD_ALGORITHM_INCREMENTAL_NAME);

    // chart: audit status
    printf(PLUGINSD_KEYWORD_CHART
           " audit.status '' 'Audit Status' 'state' 'audit' 'audit.status' line 1344 %d '' 'debugfs.plugin' '%s'\n",
           update_every, name);
Chart priorities are hard-coded (1340–1344). Other debugfs.plugin modules use NETDATA_CHART_PRIO_* constants for priorities, which reduces the chance of collisions and keeps ordering consistent. Consider introducing appropriate NETDATA_CHART_PRIO_AUDIT_* constants and using them here.
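One possible shape for such constants is sketched below. The names are hypothetical (they only follow the NETDATA_CHART_PRIO_* pattern the comment refers to), the values simply mirror the literals in the quoted code, and the project may well prefer `#define`s in its shared priorities header over an enum.

```c
/* Hypothetical constants; values copied from the hard-coded literals above */
enum netdata_chart_prio_audit {
    NETDATA_CHART_PRIO_AUDIT_BACKLOG = 1340,
    NETDATA_CHART_PRIO_AUDIT_BACKLOG_UTILIZATION, /* 1341 */
    NETDATA_CHART_PRIO_AUDIT_BACKLOG_LIMIT,       /* 1342 */
    NETDATA_CHART_PRIO_AUDIT_LOST,                /* 1343 */
    NETDATA_CHART_PRIO_AUDIT_STATUS,              /* 1344 */
};
```

Each `printf` would then pass the constant through a `%d` placeholder instead of embedding the number in the format string.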
This integration requires root privileges or CAP_AUDIT_CONTROL capability
to query the kernel audit subsystem via netlink. The debugfs.plugin
already runs with elevated privileges.
The integration notes say debugfs.plugin “already runs with elevated privileges”, but in capability-based installs it is typically granted only cap_dac_read_search (not CAP_AUDIT_CONTROL). In that common case the audit module will self-disable unless the binary is SUID root or explicitly granted CAP_AUDIT_CONTROL. Please clarify this here and/or ensure installation docs/scripts set the needed capability when the audit module is enabled.
Suggested change:
- This integration requires root privileges or CAP_AUDIT_CONTROL capability
- to query the kernel audit subsystem via netlink. The debugfs.plugin
- already runs with elevated privileges.
+ This integration requires either root privileges or the CAP_AUDIT_CONTROL
+ capability to query the kernel audit subsystem via netlink. In
+ capability-based installations debugfs.plugin is typically granted only
+ cap_dac_read_search, which is not sufficient for audit control. In that
+ case the audit module will detect the missing privilege and disable
+ itself unless the binary is SUID root or explicitly granted
+ CAP_AUDIT_CONTROL (for example: setcap cap_dac_read_search,cap_audit_control+ep debugfs.plugin).
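To make the capability distinction checkable at runtime, a module could inspect its effective capability mask before trying the netlink query. The sketch below is not from the PR: the function names are hypothetical, it reads the `CapEff` line of `/proc/self/status` rather than linking libcap, and the capability numbers are local copies of the values in `linux/capability.h`.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Local copies of the kernel capability numbers (linux/capability.h) */
#define EXAMPLE_CAP_DAC_READ_SEARCH 2
#define EXAMPLE_CAP_AUDIT_CONTROL 30

/* Parse a "CapEff:\t<hex mask>" line. Returns 0 and stores the mask on
 * success, -1 if the line is not a CapEff line or holds no hex digits. */
int parse_capeff_line(const char *line, uint64_t *mask) {
    if (strncmp(line, "CapEff:", 7) != 0)
        return -1;

    char *end = NULL;
    unsigned long long value = strtoull(line + 7, &end, 16);
    if (end == line + 7)
        return -1;

    *mask = (uint64_t)value;
    return 0;
}

/* 1 if the capability bit is set in the mask, 0 otherwise */
int cap_mask_has(uint64_t mask, int cap) {
    return (int)((mask >> cap) & 1u);
}

/* 1 if this process holds the capability, 0 if not, -1 if unknown */
int process_has_cap(int cap) {
    FILE *fp = fopen("/proc/self/status", "r");
    if (!fp)
        return -1;

    char line[256];
    uint64_t mask = 0;
    int result = -1;
    while (fgets(line, sizeof(line), fp)) {
        if (parse_capeff_line(line, &mask) == 0) {
            result = cap_mask_has(mask, cap);
            break;
        }
    }
    fclose(fp);
    return result;
}
```

A process granted only cap_dac_read_search would report 1 for `EXAMPLE_CAP_DAC_READ_SEARCH` and 0 for `EXAMPLE_CAP_AUDIT_CONTROL`, which is the case the comment above describes.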
        .s = { 0 },
    };

    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
send() is used on an unconnected netlink socket. For datagram sockets (including AF_NETLINK), send() requires the socket to be connected; otherwise it fails with EDESTADDRREQ. Use sendto() (or connect() then send()) to a sockaddr_nl destination with nl_pid = 0 (kernel).
Suggested change:
-     if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
+     struct sockaddr_nl kernel_addr = {
+         .nl_family = AF_NETLINK,
+         .nl_pid = 0, // kernel
+         .nl_groups = 0,
+     };
+     if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
+                (struct sockaddr *)&kernel_addr, sizeof(kernel_addr)) < 0) {
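For context, a self-contained version of the request path might look like the sketch below. It is an illustration under assumptions, not the PR's code: the struct and function names are hypothetical, the request carries a zeroed `struct audit_status` payload because the quoted snippet initializes `.s = { 0 }`, and `NLM_F_ACK` is assumed because the review discusses ACK handling.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <linux/audit.h>
#include <linux/netlink.h>

/* Hypothetical request layout: netlink header plus a zeroed status payload */
struct audit_get_request {
    struct nlmsghdr nlh;
    struct audit_status s;
};

/* Fill in an AUDIT_GET request; the flags are an assumption (see lead-in) */
void audit_build_get_request(struct audit_get_request *req, uint32_t seq) {
    memset(req, 0, sizeof(*req));
    req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->s));
    req->nlh.nlmsg_type = AUDIT_GET;
    req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    req->nlh.nlmsg_seq = seq;
}

/* Send to the kernel explicitly (nl_pid = 0) instead of relying on a
 * connected socket; returns what sendto() returns. */
ssize_t audit_send_to_kernel(int fd, const struct audit_get_request *req) {
    struct sockaddr_nl kernel_addr;
    memset(&kernel_addr, 0, sizeof(kernel_addr));
    kernel_addr.nl_family = AF_NETLINK; /* nl_pid and nl_groups stay 0 */

    return sendto(fd, req, req->nlh.nlmsg_len, 0,
                  (const struct sockaddr *)&kernel_addr, sizeof(kernel_addr));
}
```

Keeping the builder separate from the send makes the message layout testable without any privileges.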
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
        if (!NLMSG_OK(nlh, (size_t)len))
            continue;

        if (nlh->nlmsg_type == AUDIT_GET) {
            struct audit_status *s = NLMSG_DATA(nlh);
            reply->valid = 1;
            reply->enabled = s->enabled;
            reply->failure = s->failure;
            reply->pid = s->pid;
            reply->rate_limit = s->rate_limit;
            reply->backlog_limit = s->backlog_limit;
            reply->lost = s->lost;
            reply->backlog = s->backlog;
            close(fd);
            return 0;
        }

        // skip NLMSG_ERROR (ACK) and other message types
        if (nlh->nlmsg_type == NLMSG_ERROR)
            continue;
Netlink datagrams can contain multiple nlmsghdr messages; this code only inspects the first header in buf. If the kernel returns an ACK (NLMSG_ERROR with error=0) and the AUDIT_GET reply in the same datagram, the reply will be skipped and the query will fail. Iterate over all messages in the received buffer using NLMSG_NEXT() / NLMSG_OK() and handle NLMSG_ERROR by checking its embedded error code.
Suggested change:
-         struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
-         if (!NLMSG_OK(nlh, (size_t)len))
-             continue;
-         if (nlh->nlmsg_type == AUDIT_GET) {
-             struct audit_status *s = NLMSG_DATA(nlh);
-             reply->valid = 1;
-             reply->enabled = s->enabled;
-             reply->failure = s->failure;
-             reply->pid = s->pid;
-             reply->rate_limit = s->rate_limit;
-             reply->backlog_limit = s->backlog_limit;
-             reply->lost = s->lost;
-             reply->backlog = s->backlog;
-             close(fd);
-             return 0;
-         }
-         // skip NLMSG_ERROR (ACK) and other message types
-         if (nlh->nlmsg_type == NLMSG_ERROR)
-             continue;
+         int msg_len = (int)len;
+         struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
+         for (; NLMSG_OK(nlh, msg_len); nlh = NLMSG_NEXT(nlh, msg_len)) {
+             if (nlh->nlmsg_type == AUDIT_GET) {
+                 struct audit_status *s = NLMSG_DATA(nlh);
+                 reply->valid = 1;
+                 reply->enabled = s->enabled;
+                 reply->failure = s->failure;
+                 reply->pid = s->pid;
+                 reply->rate_limit = s->rate_limit;
+                 reply->backlog_limit = s->backlog_limit;
+                 reply->lost = s->lost;
+                 reply->backlog = s->backlog;
+                 close(fd);
+                 return 0;
+             }
+             if (nlh->nlmsg_type == NLMSG_ERROR) {
+                 struct nlmsgerr *err = NLMSG_DATA(nlh);
+                 /* ACK: NLMSG_ERROR with error == 0 */
+                 if (err->error == 0)
+                     continue;
+                 /* real error from kernel */
+                 close(fd);
+                 return -1;
+             }
+             /* ignore other message types */
+         }
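The same loop can be pulled out into a function that only sees a buffer, which makes the multi-message case easy to exercise with a synthetic datagram. This is a sketch with a hypothetical function name. Beyond the suggestion above, it copies at most the payload the kernel actually sent, on the assumption that older kernels may return a shorter `struct audit_status` than current headers declare.

```c
#include <string.h>
#include <linux/audit.h>
#include <linux/netlink.h>

/* Walk every netlink message in one received datagram.
 * Returns 1 when an AUDIT_GET reply was copied into *out,
 *         0 when only ACKs or unrelated messages were seen,
 *        -1 when the kernel reported an error or a message was truncated. */
int audit_parse_datagram(void *buf, int len, struct audit_status *out) {
    struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

    for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type == AUDIT_GET) {
            /* NLMSG_OK guarantees nlmsg_len >= header length */
            size_t payload = nlh->nlmsg_len - NLMSG_HDRLEN;
            memset(out, 0, sizeof(*out));
            memcpy(out, NLMSG_DATA(nlh),
                   payload < sizeof(*out) ? payload : sizeof(*out));
            return 1;
        }

        if (nlh->nlmsg_type == NLMSG_ERROR) {
            if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(struct nlmsgerr)))
                return -1; /* too short to hold an error code */

            struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(nlh);
            if (err->error != 0)
                return -1; /* real error from the kernel */
            /* error == 0 is an ACK; keep scanning for the reply */
        }
        /* other message types are ignored */
    }

    return 0;
}
```

A caller would keep reading datagrams while this returns 0, and stop on 1 or -1.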
    // set a 2-second timeout to avoid blocking forever
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    for (int attempts = 0; attempts < 5; attempts++) {
        ssize_t len = recvfrom(fd, buf, sizeof(buf), 0, (struct sockaddr *)&from, &fromlen);
        if (len < 0) {
            close(fd);
            return -1;
        }
The return value of setsockopt(SO_RCVTIMEO) is ignored. If setting the timeout fails, recvfrom() may block indefinitely and stall debugfs.plugin’s main loop. Check the setsockopt() result and fail fast (or fall back to non-blocking + poll() with a bounded total deadline).
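A minimal sketch of the fail-fast variant follows. The helper name is hypothetical, and the caller is assumed to close the socket and return an error when it fails.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Set a receive timeout on the socket.
 * Returns 0 on success, -1 if the timeout could not be applied, in which
 * case the caller should not go on to a potentially unbounded recvfrom(). */
int audit_set_recv_timeout(int fd, long seconds) {
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;

    return 0;
}
```

In the quoted code this would replace the bare `setsockopt()` call, with a `close(fd); return -1;` branch when the helper fails.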
// -----------------------------------------------------------------------
// module entry point

int do_module_audit(int update_every, const char *name) {
    static int check_if_available = 1;

    if (check_if_available) {
        struct audit_reply r;
        if (audit_netlink_query(&r) < 0 || !r.valid) {
            netdata_log_info("audit: netlink AUDIT_GET query failed, audit module disabled");
            return 1; // disable this module
        }
        check_if_available = 0;
    }

    struct audit_reply reply;
    if (audit_netlink_query(&reply) < 0 || !reply.valid)
        return 0; // transient failure, keep module enabled

    audit_send_charts(update_every, name);
    audit_send_data(&reply);
audit_send_obsolete() is defined but never called. This adds dead code and makes it easy for charts to linger if the module ever gets disabled after charts are created. Either remove the function or call it when returning 1 to disable the module (similar to module-zswap.c’s obsolete chart handling).
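If the obsolete path were kept rather than removed, one way to express it is to re-send the chart definition with the `obsolete` option set in the options field (the `''` after the update interval in the quoted `printf` calls). The sketch below is an assumption-laden illustration: the function name is hypothetical, the literal `CHART` stands in for `PLUGINSD_KEYWORD_CHART`, and it formats into a buffer instead of printing so it can be checked in isolation.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a CHART line that marks an audit chart as obsolete.
 * Returns the snprintf() result: the length the line needs, or a negative
 * value on error. The chart id doubles as the context, as in the PR. */
int audit_format_obsolete_chart(char *dst, size_t size,
                                const char *id, const char *title,
                                const char *units, const char *chart_type,
                                int priority, int update_every,
                                const char *module) {
    return snprintf(dst, size,
                    "CHART %s '' '%s' '%s' 'audit' '%s' %s %d %d 'obsolete' 'debugfs.plugin' '%s'\n",
                    id, title, units, id, chart_type,
                    priority, update_every, module);
}
```

The module would emit one such line per chart, under the same stdout lock used for regular output, just before returning 1.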
@ktsaou, I saw Copilot suggested some changes, but you still did not check all the boxes in the OP.
Summary
- […] `module-audit.c` to `debugfs.plugin` that monitors the Linux kernel audit subsystem via `NETLINK_AUDIT`
- […] `failure=2` (panic) configuration that causes kernel panics when audit backlog overflows
- […] `auditctl` binary required

Motivation
A real customer incident: an SMPP traffic surge overwhelmed `auditd` on a CentOS 7 system configured with `failure=2`. The audit backlog overflowed after 8 minutes of stress, the kernel panicked, and the system rebooted with no warning and no pre-crash logs. Had the audit backlog been monitored, an alert could have fired minutes before the panic.

Charts
- `audit.backlog`
- `audit.backlog_utilization`
- `audit.backlog_limit`
- `audit.lost`
- `audit.status`

Implementation
- […] `NETLINK_AUDIT` socket with `AUDIT_GET` message to query `struct audit_status`
- […] `CAP_AUDIT_CONTROL` (debugfs.plugin already runs privileged)

Test plan
- […] `auditctl -s` output exactly
- […] `failure=2`

Summary by cubic
Adds Linux audit subsystem monitoring to `debugfs.plugin` via `NETLINK_AUDIT`, exposing backlog and status metrics with alerts to warn before audit-induced kernel panics when `failure=2`.

New Features
- […] `module-audit.c` collector using `NETLINK_AUDIT` (`AUDIT_GET`).
- […] `audit.backlog`, `audit.backlog_utilization`, `audit.backlog_limit`, `audit.lost` (rate), `audit.status` (enabled, failure, rate_limit, pid).
- […] `failure=2` panic mode, and lost events.
- […] `auditctl` required. Requires root or `CAP_AUDIT_CONTROL`.

Refactors
- […] `audit_send_obsolete()`; module retries on transient failures and keeps charts active.

Written for commit b3f2d36. Summary will update on new commits.