Skip to content

Distinguish corrupt vs transient errors in LoadAuth.get_tok (#69073)#69074

Open
co-cy wants to merge 3 commits into
saltstack:3006.xfrom
co-cy:security-get-tok-distinguish-exceptions
Open

Distinguish corrupt vs transient errors in LoadAuth.get_tok (#69073)#69074
co-cy wants to merge 3 commits into
saltstack:3006.xfrom
co-cy:security-get-tok-distinguish-exceptions

Conversation

@co-cy
Copy link
Copy Markdown

@co-cy co-cy commented May 6, 2026

What does this PR do?

Splits the single except in salt.auth.LoadAuth.get_tok into two cases so that a transient backend error no longer looks the same as a corrupt token blob.

Before, when the eauth token backend raised, the exception either propagated up the stack or was treated as "this token is bad". For sites with a network-backed token store (Redis, NFS-shared localfs), a single backend hiccup — Redis container restart, NFS export blip, full disk — would therefore log every authenticated user out, because callers that recover by calling rm_token (or wrap the call in except Exception and delete on any failure) treated a single failed read as "drop this token from the store".

After this PR:

  • SaltDeserializationError — the stored blob is corrupt and cannot be parsed. The token is permanently unusable, so removing it is correct: leaving it around would make every subsequent get_tok for the same id keep failing. (Existing behaviour, kept.)
  • OSError (covers IOError) — transient backend error. The token itself is fine; do not delete it. Return {} so the caller treats this request as not-authenticated, and the next request retries against the backend.
  • Expired (deserialized successfully but past expire) — remove. (Existing behaviour, kept.)

What issues does this PR fix or reference?

Fixes #69073

Previous Behavior

get_tok either propagated OSError to the caller or, where callers recovered by calling rm_token, deleted a perfectly valid token because the backend briefly couldn't be read. A 5-second Redis reconnect logged the entire fleet out.

New Behavior

A backend read failure is classified by cause:

  • corrupt blob → remove the token (it can never authenticate again)
  • transient I/O error → keep the token, return empty for this request, let the next request retry

Merge requirements satisfied?

Tests written?

Three behavioural tests in tests/pytests/unit/auth/test_auth.py:

  • test_get_tok_returns_empty_and_keeps_token_on_io_error — backend raises OSError. get_tok returns {} and rm_token is not called. This test fails on the previous implementation and is the regression this PR exists to prevent.
  • test_get_tok_removes_token_on_deserialization_error — corrupt blob → rm_token is called.
  • test_get_tok_removes_expired_token — expired token → rm_token is called.

Anything else reviewers should know?

  • Behaviour for already-correct paths (corrupt blob, expired token) is preserved unchanged.
  • No public API change. Same signature, same return type, same caller contract for "this token is not currently usable" ({}).

@co-cy co-cy requested a review from a team as a code owner May 6, 2026 19:02
When the eauth token backend raised any exception, the previous code
either propagated it or treated it as "remove the token". For Redis
or other network-backed token stores, a transient backend error
(connection drop, NFS hang, full disk) would therefore log every
authenticated user out -- ``rm_token`` deletes valid tokens that
just briefly could not be read.

Split the ``except`` into two cases:

* ``SaltDeserializationError`` -- the on-disk blob is corrupt and
  cannot be parsed. The token is permanently unusable, so removing
  it is correct: leaving it around would make every subsequent
  ``get_tok`` for the same id keep failing.
* ``OSError`` (covers ``IOError``) -- transient backend error. The
  token itself is fine; do NOT delete it. Return ``{}`` so the
  caller treats this request as not-authenticated, and the next
  request retries against the backend.

Three behavioural tests guard each branch and the existing expired
path. The IOError test fails on the previous implementation, which
is the regression this change exists to prevent.

Refs: saltstack#69073
Copy link
Copy Markdown
Contributor

@twangboy twangboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the Lint failures

Comment thread salt/auth/__init__.py Outdated
Comment on lines +258 to +270
except OSError:
# Transient backend error (Redis connection blip, NFS hang,
# hung disk). The token itself is fine; do NOT delete it --
# that would log every authenticated user out on every
# backend hiccup. Return an empty dict so the caller treats
# this request as not-authenticated; the next request will
# retry against the backend and succeed once it recovers.
log.warning(
"Token store transient error reading %r; treating as "
"not-authenticated for this request without removing the "
"token from the store.",
tok,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This not capturing/logging the actual exception information which will make troubleshooting more difficult.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdrx312 — good point, the warning currently swallows the exception.

I'm thinking about two options:

except salt.exceptions.SaltDeserializationError:
    log.warning(
        "Token %r could not be deserialized; removing it from the store.",
        tok,
        exc_info=True,
    )
    self.rm_token(tok)
    return {}
except OSError:
    log.warning(
        "Token store transient error reading %r; treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc_info=True,
    )
    return {}

In this case, the log swells up a lot, with frequent requests or in case of failure, for example, Redis is down.

except salt.exceptions.SaltDeserializationError as exc:
    log.warning(
        "Token %r could not be deserialized (%r); removing it from the store.",
        tok, exc,
    )
    self.rm_token(tok)
    return {}
except OSError as exc:
    log.warning(
        "Token store transient error reading %r (%r); treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok, exc,
    )
    return {}

There is little detailed information in this option.

What do you think ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to log the exception with trace at the debug level. You probably don't want to fill up your logs with the stack traces like you mentioned but want to at least know what kind of oserror is occurring so your second option is probably okay.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied your suggestion — thanks, that pattern is much cleaner than either of the two I proposed.

5120434514: exception class+message inline via %r at WARNING, plus a companion log.debug(..., exc_info=True) for the full traceback. Operators get one greppable line per occurrence by default and can flip the level to DEBUG for the full stack when they need it. Costs nothing at WARNING because the log.debug call is a no-op when the level is disabled.

Same shape applied to both handlers (SaltDeserializationError and OSError).

CI saltpylint flagged `salt/auth/__init__.py:253` with
`W1404(implicit-str-concat)`. The two adjacent string literals on
that line are the result of black collapsing what was originally a
two-line string onto a single line, which saltpylint reads as a
missing comma. Merge them into a single string literal; the message
remains under 88 columns so black leaves it alone.

Refs: saltstack#69073
Address review feedback from @bdrx312 on PR saltstack#69074: the original
`log.warning` calls in the SaltDeserializationError and OSError
handlers in `LoadAuth.get_tok` did not capture the exception at all,
making troubleshooting harder. Two combined changes:

1. Add the exception class+message inline via `%r` so each
   occurrence is one greppable line that names which subclass
   (e.g. `ConnectionResetError`) and its message produced the
   warning -- enough to triage without a full traceback.
2. Add a companion `log.debug(..., exc_info=True)` per @bdrx312's
   suggestion. Operators who need the full traceback for a specific
   intermittent failure can raise the level to DEBUG and see it.
   Costs nothing at the default WARNING level because logging skips
   exc_info formatting when the level is disabled.

This avoids GBs/hour of stack frames during a flapping Redis or NFS
outage (the original `exc_info=True` approach considered earlier)
while keeping full diagnostic depth one log-level change away.

Refs: saltstack#69073
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: LoadAuth.get_tok deletes valid tokens on transient backend errors, logging users out on every Redis hiccup

3 participants