Distinguish corrupt vs transient errors in LoadAuth.get_tok (#69073)#69074
Conversation
When the eauth token backend raised any exception, the previous code
either propagated it or treated it as "remove the token". For Redis
or other network-backed token stores, a transient backend error
(connection drop, NFS hang, full disk) would therefore log every
authenticated user out -- ``rm_token`` deletes valid tokens that
just briefly could not be read.
Split the ``except`` into two cases:
* ``SaltDeserializationError`` -- the on-disk blob is corrupt and
cannot be parsed. The token is permanently unusable, so removing
it is correct: leaving it around would make every subsequent
``get_tok`` for the same id keep failing.
* ``OSError`` (covers ``IOError``) -- transient backend error. The
token itself is fine; do NOT delete it. Return ``{}`` so the
caller treats this request as not-authenticated, and the next
request retries against the backend.
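The two cases above can be sketched as follows. This is a simplified stand-in, not Salt's actual `LoadAuth`: the `read_fn` hook, the `removed` list, and the class wiring are invented here so the branch behaviour can be exercised in isolation.

```python
import logging

log = logging.getLogger(__name__)


class SaltDeserializationError(Exception):
    """Stand-in for salt.exceptions.SaltDeserializationError."""


class LoadAuthSketch:
    """Simplified stand-in for salt.auth.LoadAuth.

    read_fn simulates the eauth token backend read; it may raise.
    """

    def __init__(self, read_fn):
        self._read = read_fn
        self.removed = []  # tokens deleted via rm_token, for inspection

    def rm_token(self, tok):
        self.removed.append(tok)

    def get_tok(self, tok):
        try:
            return self._read(tok)
        except SaltDeserializationError:
            # Corrupt on-disk blob: permanently unusable, so removing
            # it is correct.
            log.warning("Token %r could not be deserialized; removing it.", tok)
            self.rm_token(tok)
            return {}
        except OSError:
            # Transient backend error: the token itself is fine; keep it
            # and treat only this request as not-authenticated.
            log.warning("Token store transient error reading %r; keeping it.", tok)
            return {}
```

A corrupt blob removes the token; a transient `OSError` returns `{}` and leaves the store untouched, so the next request can retry.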
Three behavioural tests guard each branch and the existing expired
path. The IOError test fails on the previous implementation, which
is the regression this change exists to prevent.
Refs: saltstack#69073
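The regression test for the transient branch can be sketched like this; `get_tok` here is a deliberately inlined stand-in for the method with the split handlers (the real tests patch Salt's token backend instead), and the mock wiring is illustrative.

```python
from unittest.mock import Mock


def get_tok(store_read, rm_token, tok):
    """Stand-in for LoadAuth.get_tok with the PR's split except branches."""
    try:
        return store_read(tok)
    except OSError:
        # Transient backend error: keep the token, fail only this request.
        return {}


def test_get_tok_returns_empty_and_keeps_token_on_io_error():
    store_read = Mock(side_effect=OSError("connection reset by peer"))
    rm_token = Mock()

    result = get_tok(store_read, rm_token, "deadbeef")

    assert result == {}           # request treated as not-authenticated
    rm_token.assert_not_called()  # the valid token stays in the store
```

On the pre-PR implementation the equivalent test fails, because any read failure reached the removal path.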
twangboy left a comment
Please address the Lint failures
```python
except OSError:
    # Transient backend error (Redis connection blip, NFS hang,
    # hung disk). The token itself is fine; do NOT delete it --
    # that would log every authenticated user out on every
    # backend hiccup. Return an empty dict so the caller treats
    # this request as not-authenticated; the next request will
    # retry against the backend and succeed once it recovers.
    log.warning(
        "Token store transient error reading %r; treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
    )
```
bdrx312 left a comment
This is not capturing or logging the actual exception information, which will make troubleshooting more difficult.
@bdrx312 — good point, the warning currently swallows the exception.
I'm thinking about two options:
```python
except salt.exceptions.SaltDeserializationError:
    log.warning(
        "Token %r could not be deserialized; removing it from the store.",
        tok,
        exc_info=True,
    )
    self.rm_token(tok)
    return {}
except OSError:
    log.warning(
        "Token store transient error reading %r; treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc_info=True,
    )
    return {}
```

In this case the log swells up a lot under frequent requests or during a prolonged failure, for example when Redis is down.
```python
except salt.exceptions.SaltDeserializationError as exc:
    log.warning(
        "Token %r could not be deserialized (%r); removing it from the store.",
        tok,
        exc,
    )
    self.rm_token(tok)
    return {}
except OSError as exc:
    log.warning(
        "Token store transient error reading %r (%r); treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc,
    )
    return {}
```

This option carries little detailed information.
What do you think?
Another option would be to log the exception with a traceback at the debug level. You probably don't want to fill up your logs with stack traces, like you mentioned, but you want to at least know what kind of `OSError` is occurring, so your second option is probably okay.
Applied your suggestion — thanks, that pattern is much cleaner than either of the two I proposed.
5120434514: the exception class and message are now included inline via `%r` at WARNING, plus a companion `log.debug(..., exc_info=True)` for the full traceback. Operators get one greppable line per occurrence by default and can flip the level to DEBUG for the full stack when they need it. This costs nothing at WARNING because the `log.debug` call is a no-op when that level is disabled.
Same shape applied to both handlers (SaltDeserializationError and OSError).
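The applied pattern (exception inline at WARNING, traceback only at DEBUG) can be demonstrated end to end like this; the warning text follows the PR, while the logger name, handler wiring, and the simulated `ConnectionResetError` are invented for the demo.

```python
import io
import logging

log = logging.getLogger("demo.tokenstore")
log.propagate = False
handler = logging.StreamHandler(io.StringIO())
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(handler)
log.setLevel(logging.WARNING)  # default: one greppable line, no traceback

tok = "deadbeef"
try:
    raise ConnectionResetError("redis gone away")
except OSError as exc:
    # One line naming the concrete OSError subclass and its message.
    log.warning(
        "Token store transient error reading %r (%r); treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc,
    )
    # No-op unless the logger level is DEBUG, so the traceback
    # formatting costs nothing at the default WARNING level.
    log.debug("Token store read traceback for %r", tok, exc_info=True)

output = handler.stream.getvalue()
```

At WARNING, `output` names `ConnectionResetError` but contains no stack frames; raising the level to DEBUG would add the full traceback from the `log.debug` call.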
CI saltpylint flagged `salt/auth/__init__.py:253` with `W1404(implicit-str-concat)`. The two adjacent string literals on that line are the result of black collapsing what was originally a two-line string onto a single line, which saltpylint reads as a missing comma. Merge them into a single string literal; the message remains under 88 columns so black leaves it alone. Refs: saltstack#69073
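For illustration, this is the shape of source that `W1404` flags and the merge that fixes it; the string text here is hypothetical, not the actual message at `salt/auth/__init__.py:253`.

```python
# What W1404 (implicit-str-concat) flags: two adjacent string literals
# on the SAME line, which often indicates a missing comma in a list, or
# (as here) a multi-line string that black collapsed onto one line.
flagged = "This token is invalid " "and will be removed."

# The fix applied in the PR: merge them into a single literal.
fixed = "This token is invalid and will be removed."

# At runtime the two are identical; only the source form differs.
assert flagged == fixed
```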
Address review feedback from @bdrx312 on PR saltstack#69074: the original `log.warning` calls in the SaltDeserializationError and OSError handlers in `LoadAuth.get_tok` did not capture the exception at all, making troubleshooting harder. Two combined changes:

1. Add the exception class+message inline via `%r`, so each occurrence is one greppable line that names which subclass (e.g. `ConnectionResetError`) and which message produced the warning -- enough to triage without a full traceback.
2. Add a companion `log.debug(..., exc_info=True)` per @bdrx312's suggestion. Operators who need the full traceback for a specific intermittent failure can raise the level to DEBUG and see it. This costs nothing at the default WARNING level because logging skips exc_info formatting when the level is disabled.

This avoids GBs/hour of stack frames during a flapping Redis or NFS outage (the original `exc_info=True` approach considered earlier) while keeping full diagnostic depth one log-level change away.

Refs: saltstack#69073
What does this PR do?
Splits the single `except` in `salt.auth.LoadAuth.get_tok` into two cases so that a transient backend error no longer looks the same as a corrupt token blob.

Before, when the eauth token backend raised, the exception either propagated up the stack or was treated as "this token is bad". For sites with a network-backed token store (Redis, NFS-shared `localfs`), a single backend hiccup -- Redis container restart, NFS export blip, full disk -- would therefore log every authenticated user out, because callers that recover by calling `rm_token` (or that wrap the call in `except Exception` and delete on any failure) treated a single failed read as "drop this token from the store".

After this PR:

* `SaltDeserializationError` -- the stored blob is corrupt and cannot be parsed. The token is permanently unusable, so removing it is correct: leaving it around would make every subsequent `get_tok` for the same id keep failing. (Existing behaviour, kept.)
* `OSError` (covers `IOError`) -- transient backend error. The token itself is fine; do not delete it. Return `{}` so the caller treats this request as not-authenticated, and the next request retries against the backend.
* Expired token (past `expire`) -- remove. (Existing behaviour, kept.)

What issues does this PR fix or reference?
Fixes #69073
Previous Behavior
`get_tok` either propagated `OSError` to the caller or, where callers recovered by calling `rm_token`, deleted a perfectly valid token because the backend briefly couldn't be read. A 5-second Redis reconnect logged the entire fleet out.
A backend read failure is classified by cause: a corrupt blob (`SaltDeserializationError`) removes the token, while a transient backend error (`OSError`) returns `{}` and leaves the token in the store for the next request to retry. Expired tokens are still removed.
Merge requirements satisfied?
* Changelog added (`changelog/69073.fixed.md`)

Tests written?
Three behavioural tests in `tests/pytests/unit/auth/test_auth.py`:

* `test_get_tok_returns_empty_and_keeps_token_on_io_error` -- backend raises `OSError`; `get_tok` returns `{}` and `rm_token` is not called. This test fails on the previous implementation and is the regression this PR exists to prevent.
* `test_get_tok_removes_token_on_deserialization_error` -- corrupt blob → `rm_token` is called.
* `test_get_tok_removes_expired_token` -- expired token → `rm_token` is called.

Anything else reviewers should know?