
Add Linux audit subsystem monitoring #22077

Open
ktsaou wants to merge 3 commits into master from feat/audit-status-monitoring

@ktsaou
Member

@ktsaou ktsaou commented Mar 28, 2026

Summary

  • Adds a new module-audit.c to debugfs.plugin that monitors the Linux kernel audit subsystem via NETLINK_AUDIT
  • Exposes backlog depth, backlog utilization (%), backlog limit, lost events rate, and configuration status (enabled, failure mode, rate limit, daemon PID)
  • Detects dangerous failure=2 (panic) configuration that causes kernel panics when audit backlog overflows
  • Zero external dependencies — uses kernel netlink directly, no auditctl binary required

Motivation

A real customer incident: an SMPP traffic surge overwhelmed auditd on a CentOS 7 system configured with failure=2. The audit backlog overflowed after 8 minutes of stress, the kernel panicked, and the system rebooted with no warning and no pre-crash logs. Had the audit backlog been monitored, an alert could have fired minutes before the panic.

Charts

| Chart | Description | Algorithm |
|---|---|---|
| audit.backlog | Current audit event queue depth | absolute |
| audit.backlog_utilization | Percentage of backlog_limit in use | absolute (x100) |
| audit.backlog_limit | Configured maximum backlog size | absolute |
| audit.lost | Events dropped due to overflow | incremental (rate) |
| audit.status | enabled, failure mode, rate_limit, pid | absolute |
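
As a rough illustration of the "absolute (x100)" algorithm for audit.backlog_utilization, the percentage can be computed as an integer scaled by 100, so that a dimension with divisor 100 renders two decimal places. The helper name below is hypothetical and not taken from the PR, and returning 0 when backlog_limit is 0 is an assumption made here to avoid a division by zero.

```c
#include <stdint.h>

// Hypothetical helper (not from the PR): backlog utilization as a percentage
// multiplied by 100, e.g. 5000 means 50.00%.
static int64_t audit_backlog_utilization_x100(uint32_t backlog, uint32_t backlog_limit) {
    // Assumption: with a limit of 0 there is nothing to divide by, so report 0.
    if (backlog_limit == 0)
        return 0;

    // Widen before multiplying so large backlog values cannot overflow.
    return ((int64_t)backlog * 10000) / (int64_t)backlog_limit;
}
```

A collector would send this value as-is and let the chart dimension divide it by 100.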

Implementation

  • Uses NETLINK_AUDIT socket with AUDIT_GET message to query struct audit_status
  • Single netlink query per collection cycle — open socket, send, receive, close
  • 2-second receive timeout to prevent blocking
  • Gracefully disables itself if audit is not available (no netlink support, no permissions)
  • Requires root or CAP_AUDIT_CONTROL (debugfs.plugin already runs privileged)
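
The query cycle described in the list above can be sketched as follows. This is a minimal illustration, not the code in module-audit.c: the function names are hypothetical, the request omits NLM_F_ACK so a single recv() is assumed to be enough, and the reply is copied with a size clamp because the length of struct audit_status varies across kernel versions.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/netlink.h>

// Build an AUDIT_GET request. It carries no payload, so the header is the whole message.
static void audit_build_get_request(struct nlmsghdr *nlh, unsigned int seq) {
    memset(nlh, 0, sizeof(*nlh));
    nlh->nlmsg_len = NLMSG_LENGTH(0);
    nlh->nlmsg_type = AUDIT_GET;
    nlh->nlmsg_flags = NLM_F_REQUEST;
    nlh->nlmsg_seq = seq;
}

// One query per call: open, send, receive, close.
// Returns 0 and fills *out on success, -1 on any failure (including timeout).
static int audit_query_once(struct audit_status *out) {
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
    if (fd < 0)
        return -1;

    int rc = -1;
    int len;
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK }; // nl_pid 0 addresses the kernel
    struct nlmsghdr req;
    union { struct nlmsghdr hdr; char bytes[8192]; } buf;    // union keeps the buffer aligned

    audit_build_get_request(&req, 1);

    // Without a receive timeout, recv() could block indefinitely.
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        goto done;
    if (sendto(fd, &req, req.nlmsg_len, 0, (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
        goto done;

    len = (int)recv(fd, buf.bytes, sizeof(buf.bytes), 0);
    for (struct nlmsghdr *nlh = &buf.hdr; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type != AUDIT_GET)
            continue; // anything else (e.g. an NLMSG_ERROR) leaves rc at -1

        size_t n = NLMSG_PAYLOAD(nlh, 0);
        if (n > sizeof(*out))
            n = sizeof(*out);
        memset(out, 0, sizeof(*out));
        memcpy(out, NLMSG_DATA(nlh), n);
        rc = 0;
        break;
    }

done:
    close(fd);
    return rc;
}
```

On success the caller can read fields such as out->backlog, out->backlog_limit, out->lost and out->failure; a return of -1 covers missing netlink support, missing privileges, and timeouts alike.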

Test plan

  • Builds cleanly with zero warnings
  • Tested on local system — values match auditctl -s output exactly
  • Module self-disables gracefully when audit is unavailable
  • Test on system with audit enabled and failure=2
  • Test on system under audit stress (high backlog)
  • Verify charts appear in Netdata dashboard

Summary by cubic

Adds Linux audit subsystem monitoring to debugfs.plugin via NETLINK_AUDIT, exposing backlog and status metrics with alerts to warn before audit-induced kernel panics when failure=2.

  • New Features

    • New module-audit.c collector using NETLINK_AUDIT (AUDIT_GET).
    • Charts: audit.backlog, audit.backlog_utilization, audit.backlog_limit, audit.lost (rate), audit.status (enabled, failure, rate_limit, pid).
    • Built-in alerts and module metadata: backlog utilization (>50% warn, >80% crit), failure=2 panic mode, and lost events.
    • Single netlink query per cycle with a 2s timeout; self-disables if audit is unavailable.
    • No external dependencies; no auditctl required. Requires root or CAP_AUDIT_CONTROL.
  • Refactors

    • Removed unused audit_send_obsolete(); module retries on transient failures and keeps charts active.

Written for commit b3f2d36. Summary will update on new commits.

Monitor the kernel audit backlog via NETLINK_AUDIT to detect
audit backlog overflow conditions before they cause kernel panics.

When audit failure mode is set to 2 (panic), a full backlog triggers
an immediate kernel panic with no warning. This collector exposes:
- audit.backlog: current queue depth
- audit.backlog_utilization: percentage of backlog_limit used
- audit.backlog_limit: configured maximum
- audit.lost: events dropped due to overflow (rate)
- audit.status: enabled, failure mode, rate_limit, daemon pid

Motivated by a real incident where an SMPP traffic surge overwhelmed
auditd, the backlog overflowed, and failure=2 caused a kernel panic
on a CentOS 7 production system with zero pre-crash warning.
@ktsaou ktsaou requested review from a team, thiagoftsm and vkalintiris as code owners March 28, 2026 01:12
@github-actions github-actions bot added the area/collectors, area/build, and collectors/debugfs labels Mar 28, 2026
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 4 files

Confidence score: 5/5

  • This looks low risk to merge: the only finding is low severity (3/10) and does not indicate a functional regression.
  • In src/collectors/debugfs.plugin/module-audit.c, audit_send_obsolete() appears to be unused static dead code, which is mainly a maintainability/cleanup concern rather than a runtime correctness issue.
  • Pay close attention to src/collectors/debugfs.plugin/module-audit.c - remove or wire audit_send_obsolete() to avoid lingering dead code.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/collectors/debugfs.plugin/module-audit.c">

<violation number="1" location="src/collectors/debugfs.plugin/module-audit.c:208">
P3: `audit_send_obsolete()` is dead code: it is a static function introduced in this file but never called.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Plugin as debugfs.plugin
    participant AuditMod as NEW: module-audit.c
    participant Kernel as Linux Kernel (Netlink)
    participant Netdata as Netdata Parent Process

    Note over Plugin,Kernel: Collection Cycle (Default: 1s)

    Plugin->>AuditMod: do_module_audit()
    
    opt First Run
        AuditMod->>Kernel: Open NETLINK_AUDIT socket
        AuditMod->>Kernel: Send AUDIT_GET request
        alt Success
            Kernel-->>AuditMod: Return audit_status
            AuditMod->>AuditMod: Mark module available
        else Failure / No Permissions
            Kernel-->>AuditMod: Error (EACCES/EPROTONOSUPPORT)
            AuditMod-->>Plugin: Return 1 (Disable Module)
        end
    end

    AuditMod->>Kernel: NEW: socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT)
    AuditMod->>Kernel: NEW: send(AUDIT_GET)
    
    alt Response Received
        Note right of Kernel: Includes backlog, limit, <br/>lost events, and failure mode
        Kernel-->>AuditMod: struct audit_status
    else Timeout (2s) or Error
        Kernel-->>AuditMod: recvfrom() error
        AuditMod-->>Plugin: Return 0 (Retry next cycle)
    end

    AuditMod->>AuditMod: NEW: Calculate backlog utilization %
    Note over AuditMod: Monitor failure=2 (Panic on overflow)

    AuditMod->>Plugin: netdata_mutex_lock(&stdout_mutex)
    
    opt Charts Not Created
        AuditMod->>Netdata: NEW: CHART audit.backlog, audit.status, etc.
    end

    AuditMod->>Netdata: NEW: BEGIN/SET/END (backlog, lost, status values)
    
    AuditMod->>Plugin: netdata_mutex_unlock(&stdout_mutex)
    AuditMod->>Kernel: close(fd)
    AuditMod-->>Plugin: Return 0 (Success)

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

- audit.conf: three alerts for audit health
  - audit_backlog_utilization: warn >50%, crit >80% of backlog limit
  - audit_failure_mode_panic: warn when failure=2 (config risk)
  - audit_lost_events: warn when events are being dropped
- metadata.yaml: full module documentation for the audit collector
  including metrics, alerts, setup, and troubleshooting sections
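
To make the backlog utilization thresholds in this commit message concrete, here is a small C illustration. The real alerts live in Netdata's health configuration (audit.conf), not in C; the enum and function names below are hypothetical and only mirror the "warn >50%, crit >80%" rule.

```c
// Hypothetical severity levels, used only for this illustration.
typedef enum {
    AUDIT_BACKLOG_CLEAR = 0,
    AUDIT_BACKLOG_WARNING = 1,
    AUDIT_BACKLOG_CRITICAL = 2,
} audit_backlog_severity_t;

// Map a backlog utilization percentage to a severity using the thresholds
// stated in the commit message: warning above 50%, critical above 80%.
static audit_backlog_severity_t audit_backlog_severity(double utilization_percent) {
    if (utilization_percent > 80.0)
        return AUDIT_BACKLOG_CRITICAL;
    if (utilization_percent > 50.0)
        return AUDIT_BACKLOG_WARNING;
    return AUDIT_BACKLOG_CLEAR;
}
```

Both comparisons are strict, so exactly 50% stays clear and exactly 80% stays at warning.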
Addresses review feedback: the function was defined but never called.
The audit module retries on transient failures rather than permanently
disabling itself, so chart obsoleting is not needed at this stage.
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/collectors/debugfs.plugin/metadata.yaml">

<violation number="1" location="src/collectors/debugfs.plugin/metadata.yaml:430">
P2: The permissions note is misleading: `debugfs.plugin` is documented with `CAP_DAC_READ_SEARCH`, not `CAP_AUDIT_CONTROL`, so saying it "already runs with elevated privileges" can give a false impression that no additional capability/root context is needed for this module.</violation>
</file>


Comment on lines +430 to +432
This integration requires root privileges or CAP_AUDIT_CONTROL capability
to query the kernel audit subsystem via netlink. The debugfs.plugin
already runs with elevated privileges.
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 28, 2026

P2: The permissions note is misleading: debugfs.plugin is documented with CAP_DAC_READ_SEARCH, not CAP_AUDIT_CONTROL, so saying it "already runs with elevated privileges" can give a false impression that no additional capability/root context is needed for this module.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/collectors/debugfs.plugin/metadata.yaml, line 430:

<comment>The permissions note is misleading: `debugfs.plugin` is documented with `CAP_DAC_READ_SEARCH`, not `CAP_AUDIT_CONTROL`, so saying it "already runs with elevated privileges" can give a false impression that no additional capability/root context is needed for this module.</comment>

<file context>
@@ -390,3 +390,147 @@ modules:
+      multi_instance: false
+      additional_permissions:
+        description: >
+          This integration requires root privileges or CAP_AUDIT_CONTROL capability
+          to query the kernel audit subsystem via netlink. The debugfs.plugin
+          already runs with elevated privileges.
</file context>
Suggested change
- This integration requires root privileges or CAP_AUDIT_CONTROL capability
- to query the kernel audit subsystem via netlink. The debugfs.plugin
- already runs with elevated privileges.
+ This integration requires root privileges or CAP_AUDIT_CONTROL capability
+ to query the kernel audit subsystem via netlink. On systems where
+ debugfs.plugin runs only with CAP_DAC_READ_SEARCH, this module may not
+ be able to collect audit metrics unless additional privileges are granted.

Contributor

Copilot AI left a comment

Pull request overview

Adds Linux audit subsystem monitoring to debugfs.plugin via NETLINK_AUDIT, plus metadata and default health alerts to warn on audit backlog pressure and dangerous failure=2 (panic) configurations.

Changes:

  • Introduces a new debugfs.plugin module (module-audit.c) that queries AUDIT_GET and publishes backlog/status/lost metrics.
  • Adds health alarms (audit.conf) for backlog utilization, panic failure mode, and lost events.
  • Registers the new module in debugfs.plugin build/module lists and documents it in metadata.yaml.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
|---|---|
| src/collectors/debugfs.plugin/module-audit.c | New collector module implementing audit netlink query + charts. |
| src/collectors/debugfs.plugin/debugfs_plugin.c | Registers the new audit module in the module list. |
| src/collectors/debugfs.plugin/debugfs_plugin.h | Exposes the do_module_audit() module entry point. |
| src/collectors/debugfs.plugin/metadata.yaml | Adds integration metadata, metrics, and alert references for audit monitoring. |
| src/health/health.d/audit.conf | Adds default health templates for backlog utilization, failure mode panic, and lost events. |
| CMakeLists.txt | Includes module-audit.c in the debugfs.plugin build sources. |


Comment on lines +125 to +156
// chart: audit backlog depth
printf(PLUGINSD_KEYWORD_CHART
       " audit.backlog '' 'Audit Backlog' 'events' 'audit' 'audit.backlog' line 1340 %d '' 'debugfs.plugin' '%s'\n",
       update_every, name);
printf(PLUGINSD_KEYWORD_DIMENSION " 'backlog' 'backlog' %s 1 1 ''\n",
       RRD_ALGORITHM_ABSOLUTE_NAME);

// chart: audit backlog utilization (percentage)
printf(PLUGINSD_KEYWORD_CHART
       " audit.backlog_utilization '' 'Audit Backlog Utilization' '%%' 'audit' 'audit.backlog_utilization' area 1341 %d '' 'debugfs.plugin' '%s'\n",
       update_every, name);
printf(PLUGINSD_KEYWORD_DIMENSION " 'utilization' 'utilization' %s 1 100 ''\n",
       RRD_ALGORITHM_ABSOLUTE_NAME);

// chart: audit backlog limit
printf(PLUGINSD_KEYWORD_CHART
       " audit.backlog_limit '' 'Audit Backlog Limit' 'events' 'audit' 'audit.backlog_limit' line 1342 %d '' 'debugfs.plugin' '%s'\n",
       update_every, name);
printf(PLUGINSD_KEYWORD_DIMENSION " 'backlog_limit' 'backlog_limit' %s 1 1 ''\n",
       RRD_ALGORITHM_ABSOLUTE_NAME);

// chart: audit lost events
printf(PLUGINSD_KEYWORD_CHART
       " audit.lost '' 'Audit Lost Events' 'events/s' 'audit' 'audit.lost' area 1343 %d '' 'debugfs.plugin' '%s'\n",
       update_every, name);
printf(PLUGINSD_KEYWORD_DIMENSION " 'lost' 'lost' %s 1 1 ''\n",
       RRD_ALGORITHM_INCREMENTAL_NAME);

// chart: audit status
printf(PLUGINSD_KEYWORD_CHART
       " audit.status '' 'Audit Status' 'state' 'audit' 'audit.status' line 1344 %d '' 'debugfs.plugin' '%s'\n",
       update_every, name);

Copilot AI Mar 28, 2026

Chart priorities are hard-coded (1340–1344). Other debugfs.plugin modules use NETDATA_CHART_PRIO_* constants for priorities, which reduces the chance of collisions and keeps ordering consistent. Consider introducing appropriate NETDATA_CHART_PRIO_AUDIT_* constants and using them here.
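
One possible shape for this suggestion is sketched below. The macro names are hypothetical and simply follow the NETDATA_CHART_PRIO_AUDIT_* pattern proposed in the comment; the values are the ones currently hard-coded in the diff. Which header such constants would belong in is not stated in this PR.

```c
// Hypothetical priority constants; values copied from the hard-coded
// priorities in the reviewed diff (1340 to 1344).
#define NETDATA_CHART_PRIO_AUDIT_BACKLOG             1340
#define NETDATA_CHART_PRIO_AUDIT_BACKLOG_UTILIZATION 1341
#define NETDATA_CHART_PRIO_AUDIT_BACKLOG_LIMIT       1342
#define NETDATA_CHART_PRIO_AUDIT_LOST                1343
#define NETDATA_CHART_PRIO_AUDIT_STATUS              1344
```

Each CHART line would then pass the constant through a %d format argument instead of embedding the literal number in the format string.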

Copilot uses AI. Check for mistakes.
Comment on lines +430 to +432
This integration requires root privileges or CAP_AUDIT_CONTROL capability
to query the kernel audit subsystem via netlink. The debugfs.plugin
already runs with elevated privileges.

Copilot AI Mar 28, 2026

The integration notes say debugfs.plugin “already runs with elevated privileges”, but in capability-based installs it is typically granted only cap_dac_read_search (not CAP_AUDIT_CONTROL). In that common case the audit module will self-disable unless the binary is SUID root or explicitly granted CAP_AUDIT_CONTROL. Please clarify this here and/or ensure installation docs/scripts set the needed capability when the audit module is enabled.

Suggested change
- This integration requires root privileges or CAP_AUDIT_CONTROL capability
- to query the kernel audit subsystem via netlink. The debugfs.plugin
- already runs with elevated privileges.
+ This integration requires either root privileges or the CAP_AUDIT_CONTROL
+ capability to query the kernel audit subsystem via netlink. In
+ capability-based installations debugfs.plugin is typically granted only
+ cap_dac_read_search, which is not sufficient for audit control. In that
+ case the audit module will detect the missing privilege and disable
+ itself unless the binary is SUID root or explicitly granted
+ CAP_AUDIT_CONTROL (for example: setcap cap_dac_read_search,cap_audit_control+ep debugfs.plugin).
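
For diagnosing the situation described in this comment, a process can check whether it actually holds CAP_AUDIT_CONTROL before attempting the netlink query. The sketch below is not part of the PR: the function name is hypothetical, and it uses the raw capget system call rather than libcap so that it needs no extra library.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

// Returns 1 if the calling process has CAP_AUDIT_CONTROL in its effective set,
// 0 if it does not, and -1 if the capability query itself failed.
static int audit_has_cap_audit_control(void) {
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0, // 0 means the calling process
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;

    // Capabilities are split across 32-bit words; pick the word and bit for this one.
    unsigned int word = CAP_TO_INDEX(CAP_AUDIT_CONTROL);
    unsigned int mask = CAP_TO_MASK(CAP_AUDIT_CONTROL);
    return (data[word].effective & mask) ? 1 : 0;
}
```

A result of 0 on a capability-based install would be consistent with the module disabling itself as the comment describes.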

.s = { 0 },
};

if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {

Copilot AI Mar 28, 2026

send() is used on an unconnected netlink socket. For datagram sockets (including AF_NETLINK), send() requires the socket to be connected; otherwise it fails with EDESTADDRREQ. Use sendto() (or connect() then send()) to a sockaddr_nl destination with nl_pid = 0 (kernel).

Suggested change
- if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
+ struct sockaddr_nl kernel_addr = {
+     .nl_family = AF_NETLINK,
+     .nl_pid = 0, // kernel
+     .nl_groups = 0,
+ };
+ if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
+            (struct sockaddr *)&kernel_addr, sizeof(kernel_addr)) < 0) {

Comment on lines +85 to +105
struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
if (!NLMSG_OK(nlh, (size_t)len))
    continue;

if (nlh->nlmsg_type == AUDIT_GET) {
    struct audit_status *s = NLMSG_DATA(nlh);
    reply->valid = 1;
    reply->enabled = s->enabled;
    reply->failure = s->failure;
    reply->pid = s->pid;
    reply->rate_limit = s->rate_limit;
    reply->backlog_limit = s->backlog_limit;
    reply->lost = s->lost;
    reply->backlog = s->backlog;
    close(fd);
    return 0;
}

// skip NLMSG_ERROR (ACK) and other message types
if (nlh->nlmsg_type == NLMSG_ERROR)
    continue;

Copilot AI Mar 28, 2026

Netlink datagrams can contain multiple nlmsghdr messages; this code only inspects the first header in buf. If the kernel returns an ACK (NLMSG_ERROR with error=0) and the AUDIT_GET reply in the same datagram, the reply will be skipped and the query will fail. Iterate over all messages in the received buffer using NLMSG_NEXT() / NLMSG_OK() and handle NLMSG_ERROR by checking its embedded error code.

Suggested change
- struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
- if (!NLMSG_OK(nlh, (size_t)len))
-     continue;
- if (nlh->nlmsg_type == AUDIT_GET) {
-     struct audit_status *s = NLMSG_DATA(nlh);
-     reply->valid = 1;
-     reply->enabled = s->enabled;
-     reply->failure = s->failure;
-     reply->pid = s->pid;
-     reply->rate_limit = s->rate_limit;
-     reply->backlog_limit = s->backlog_limit;
-     reply->lost = s->lost;
-     reply->backlog = s->backlog;
-     close(fd);
-     return 0;
- }
- // skip NLMSG_ERROR (ACK) and other message types
- if (nlh->nlmsg_type == NLMSG_ERROR)
-     continue;
+ int msg_len = (int)len;
+ struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
+ for (; NLMSG_OK(nlh, msg_len); nlh = NLMSG_NEXT(nlh, msg_len)) {
+     if (nlh->nlmsg_type == AUDIT_GET) {
+         struct audit_status *s = NLMSG_DATA(nlh);
+         reply->valid = 1;
+         reply->enabled = s->enabled;
+         reply->failure = s->failure;
+         reply->pid = s->pid;
+         reply->rate_limit = s->rate_limit;
+         reply->backlog_limit = s->backlog_limit;
+         reply->lost = s->lost;
+         reply->backlog = s->backlog;
+         close(fd);
+         return 0;
+     }
+     if (nlh->nlmsg_type == NLMSG_ERROR) {
+         struct nlmsgerr *err = NLMSG_DATA(nlh);
+         /* ACK: NLMSG_ERROR with error == 0 */
+         if (err->error == 0)
+             continue;
+         /* real error from kernel */
+         close(fd);
+         return -1;
+     }
+     /* ignore other message types */
+ }
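
The multi-message case raised in this comment can be exercised without a live audit socket by feeding a parser a hand-built buffer. The sketch below is a standalone illustration, not the PR's code: the function name and return convention are hypothetical, and the payload is copied with a size clamp as an added precaution against kernels whose struct audit_status is shorter or longer than the one in the build headers.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/audit.h>
#include <linux/netlink.h>

// Walk every netlink message in one received datagram.
// Returns 1 if an AUDIT_GET reply was found and copied to *out,
// 0 if the datagram held only ACKs or unrelated messages,
// -1 if the kernel reported an error or an error message was truncated.
static int audit_parse_datagram(void *buf, int len, struct audit_status *out) {
    struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

    for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type == NLMSG_ERROR) {
            if (NLMSG_PAYLOAD(nlh, 0) < sizeof(struct nlmsgerr))
                return -1;
            struct nlmsgerr *err = NLMSG_DATA(nlh);
            if (err->error != 0)
                return -1; // real error from the kernel
            continue;      // error == 0 is an ACK, keep looking
        }

        if (nlh->nlmsg_type == AUDIT_GET) {
            size_t n = NLMSG_PAYLOAD(nlh, 0);
            if (n > sizeof(*out))
                n = sizeof(*out);
            memset(out, 0, sizeof(*out));
            memcpy(out, NLMSG_DATA(nlh), n);
            return 1;
        }
        // other message types are ignored
    }
    return 0;
}
```

A caller that gets 0 would read the next datagram, bounded by its retry limit, while 1 and -1 end the query.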

Comment on lines +74 to +83
// set a 2-second timeout to avoid blocking forever
struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

for (int attempts = 0; attempts < 5; attempts++) {
    ssize_t len = recvfrom(fd, buf, sizeof(buf), 0, (struct sockaddr *)&from, &fromlen);
    if (len < 0) {
        close(fd);
        return -1;
    }

Copilot AI Mar 28, 2026

The return value of setsockopt(SO_RCVTIMEO) is ignored. If setting the timeout fails, recvfrom() may block indefinitely and stall debugfs.plugin’s main loop. Check the setsockopt() result and fail fast (or fall back to non-blocking + poll() with a bounded total deadline).
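
The "fail fast" option from this comment might look like the following. The helper name is hypothetical and not from the PR; it only wraps the existing setsockopt() call so the caller can close the socket and return an error instead of risking an unbounded recvfrom().

```c
#include <sys/socket.h>
#include <sys/time.h>

// Apply a receive timeout to a socket.
// Returns 0 when the timeout was set, -1 when setsockopt() failed.
static int audit_set_recv_timeout(int fd, long seconds) {
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;

    return 0;
}
```

In the query function this would replace the unchecked call: on -1 the socket is closed and the query returns an error before any receive is attempted.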

Comment on lines +208 to +229
// -----------------------------------------------------------------------
// module entry point

int do_module_audit(int update_every, const char *name) {
    static int check_if_available = 1;

    if (check_if_available) {
        struct audit_reply r;
        if (audit_netlink_query(&r) < 0 || !r.valid) {
            netdata_log_info("audit: netlink AUDIT_GET query failed, audit module disabled");
            return 1; // disable this module
        }
        check_if_available = 0;
    }

    struct audit_reply reply;
    if (audit_netlink_query(&reply) < 0 || !reply.valid)
        return 0; // transient failure, keep module enabled

    audit_send_charts(update_every, name);
    audit_send_data(&reply);


Copilot AI Mar 28, 2026

audit_send_obsolete() is defined but never called. This adds dead code and makes it easy for charts to linger if the module ever gets disabled after charts are created. Either remove the function or call it when returning 1 to disable the module (similar to module-zswap.c’s obsolete chart handling).

@thiagoftsm
Contributor

@ktsaou, I saw Copilot suggested some changes, but you still did not check all the boxes in the OP.
Please take a look at the suggestions, and I will begin testing.


Labels

area/build, area/collectors, area/health, area/metadata, collectors/debugfs


3 participants