Distinguish corrupt vs transient errors in LoadAuth.get_tok (#69073)#69074
Conversation
When the eauth token backend raised any exception, the previous code
either propagated it or treated it as "remove the token". For Redis
or other network-backed token stores, a transient backend error
(connection drop, NFS hang, full disk) would therefore log every
authenticated user out -- ``rm_token`` deletes valid tokens that
just briefly could not be read.
Split the ``except`` into two cases:
* ``SaltDeserializationError`` -- the on-disk blob is corrupt and
cannot be parsed. The token is permanently unusable, so removing
it is correct: leaving it around would make every subsequent
``get_tok`` for the same id keep failing.
* ``OSError`` (covers ``IOError``) -- transient backend error. The
token itself is fine; do NOT delete it. Return ``{}`` so the
caller treats this request as not-authenticated, and the next
request retries against the backend.
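The two cases above can be sketched as follows. This is a simplified stand-in, not Salt's actual `LoadAuth`: the `read_fn` hook, the `removed` list, and the class wiring are invented here so the branch behaviour can be exercised in isolation.

```python
import logging

log = logging.getLogger(__name__)


class SaltDeserializationError(Exception):
    """Stand-in for salt.exceptions.SaltDeserializationError."""


class LoadAuthSketch:
    """Simplified stand-in for salt.auth.LoadAuth.

    read_fn simulates the eauth token backend read; it may raise.
    """

    def __init__(self, read_fn):
        self._read = read_fn
        self.removed = []  # tokens deleted via rm_token, for inspection

    def rm_token(self, tok):
        self.removed.append(tok)

    def get_tok(self, tok):
        try:
            return self._read(tok)
        except SaltDeserializationError:
            # Corrupt on-disk blob: permanently unusable, so removing
            # it is correct.
            log.warning("Token %r could not be deserialized; removing it.", tok)
            self.rm_token(tok)
            return {}
        except OSError:
            # Transient backend error: the token itself is fine; keep it
            # and treat only this request as not-authenticated.
            log.warning("Token store transient error reading %r; keeping it.", tok)
            return {}
```

A corrupt blob removes the token; a transient `OSError` returns `{}` and leaves the store untouched, so the next request can retry.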
Three behavioural tests guard each branch and the existing expired
path. The IOError test fails on the previous implementation, which
is the regression this change exists to prevent.
Refs: saltstack#69073
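The regression test for the transient branch can be sketched like this; `get_tok` here is a deliberately inlined stand-in for the method with the split handlers (the real tests patch Salt's token backend instead), and the mock wiring is illustrative.

```python
from unittest.mock import Mock


def get_tok(store_read, rm_token, tok):
    """Stand-in for LoadAuth.get_tok with the PR's split except branches."""
    try:
        return store_read(tok)
    except OSError:
        # Transient backend error: keep the token, fail only this request.
        return {}


def test_get_tok_returns_empty_and_keeps_token_on_io_error():
    store_read = Mock(side_effect=OSError("connection reset by peer"))
    rm_token = Mock()

    result = get_tok(store_read, rm_token, "deadbeef")

    assert result == {}           # request treated as not-authenticated
    rm_token.assert_not_called()  # the valid token stays in the store
```

On the pre-PR implementation the equivalent test fails, because any read failure reached the removal path.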
twangboy left a comment
Please address the Lint failures
```python
except OSError:
    # Transient backend error (Redis connection blip, NFS hang,
    # hung disk). The token itself is fine; do NOT delete it --
    # that would log every authenticated user out on every
    # backend hiccup. Return an empty dict so the caller treats
    # this request as not-authenticated; the next request will
    # retry against the backend and succeed once it recovers.
    log.warning(
        "Token store transient error reading %r; treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
    )
```
bdrx312 left a comment
This is not capturing or logging the actual exception information, which will make troubleshooting more difficult.
@bdrx312 — good point, the warning currently swallows the exception.
I'm thinking about two options:
```python
except salt.exceptions.SaltDeserializationError:
    log.warning(
        "Token %r could not be deserialized; removing it from the store.",
        tok,
        exc_info=True,
    )
    self.rm_token(tok)
    return {}
except OSError:
    log.warning(
        "Token store transient error reading %r; treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc_info=True,
    )
    return {}
```

In this case the log swells up a lot under frequent requests or during a prolonged failure, for example when Redis is down.
```python
except salt.exceptions.SaltDeserializationError as exc:
    log.warning(
        "Token %r could not be deserialized (%r); removing it from the store.",
        tok,
        exc,
    )
    self.rm_token(tok)
    return {}
except OSError as exc:
    log.warning(
        "Token store transient error reading %r (%r); treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc,
    )
    return {}
```

This option carries little detailed information.
What do you think?
Another option would be to log the exception with a traceback at the debug level. You probably don't want to fill up your logs with stack traces, like you mentioned, but you want to at least know what kind of `OSError` is occurring, so your second option is probably okay.
Applied your suggestion — thanks, that pattern is much cleaner than either of the two I proposed.
5120434514: the exception class and message are now included inline via `%r` at WARNING, plus a companion `log.debug(..., exc_info=True)` for the full traceback. Operators get one greppable line per occurrence by default and can flip the level to DEBUG for the full stack when they need it. This costs nothing at WARNING because the `log.debug` call is a no-op when that level is disabled.
Same shape applied to both handlers (SaltDeserializationError and OSError).
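The applied pattern (exception inline at WARNING, traceback only at DEBUG) can be demonstrated end to end like this; the warning text follows the PR, while the logger name, handler wiring, and the simulated `ConnectionResetError` are invented for the demo.

```python
import io
import logging

log = logging.getLogger("demo.tokenstore")
log.propagate = False
handler = logging.StreamHandler(io.StringIO())
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(handler)
log.setLevel(logging.WARNING)  # default: one greppable line, no traceback

tok = "deadbeef"
try:
    raise ConnectionResetError("redis gone away")
except OSError as exc:
    # One line naming the concrete OSError subclass and its message.
    log.warning(
        "Token store transient error reading %r (%r); treating as "
        "not-authenticated for this request without removing the "
        "token from the store.",
        tok,
        exc,
    )
    # No-op unless the logger level is DEBUG, so the traceback
    # formatting costs nothing at the default WARNING level.
    log.debug("Token store read traceback for %r", tok, exc_info=True)

output = handler.stream.getvalue()
```

At WARNING, `output` names `ConnectionResetError` but contains no stack frames; raising the level to DEBUG would add the full traceback from the `log.debug` call.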
CI saltpylint flagged `salt/auth/__init__.py:253` with `W1404(implicit-str-concat)`. The two adjacent string literals on that line are the result of black collapsing what was originally a two-line string onto a single line, which saltpylint reads as a missing comma. Merge them into a single string literal; the message remains under 88 columns so black leaves it alone. Refs: saltstack#69073
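For illustration, this is the shape of source that `W1404` flags and the merge that fixes it; the string text here is hypothetical, not the actual message at `salt/auth/__init__.py:253`.

```python
# What W1404 (implicit-str-concat) flags: two adjacent string literals
# on the SAME line, which often indicates a missing comma in a list, or
# (as here) a multi-line string that black collapsed onto one line.
flagged = "This token is invalid " "and will be removed."

# The fix applied in the PR: merge them into a single literal.
fixed = "This token is invalid and will be removed."

# At runtime the two are identical; only the source form differs.
assert flagged == fixed
```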
Address review feedback from @bdrx312 on PR saltstack#69074: the original `log.warning` calls in the SaltDeserializationError and OSError handlers in `LoadAuth.get_tok` did not capture the exception at all, making troubleshooting harder. Two combined changes:

1. Add the exception class+message inline via `%r`, so each occurrence is one greppable line that names which subclass (e.g. `ConnectionResetError`) and which message produced the warning -- enough to triage without a full traceback.
2. Add a companion `log.debug(..., exc_info=True)` per @bdrx312's suggestion. Operators who need the full traceback for a specific intermittent failure can raise the level to DEBUG and see it. This costs nothing at the default WARNING level because logging skips exc_info formatting when the level is disabled.

This avoids GBs/hour of stack frames during a flapping Redis or NFS outage (the original `exc_info=True` approach considered earlier) while keeping full diagnostic depth one log-level change away.

Refs: saltstack#69073
What does this PR do?
Splits the single `except` in `salt.auth.LoadAuth.get_tok` into two cases so that a transient backend error no longer looks the same as a corrupt token blob.

Before, when the eauth token backend raised, the exception either propagated up the stack or was treated as "this token is bad". For sites with a network-backed token store (Redis, NFS-shared `localfs`), a single backend hiccup -- Redis container restart, NFS export blip, full disk -- would therefore log every authenticated user out, because callers that recover by calling `rm_token` (or that wrap the call in `except Exception` and delete on any failure) treated a single failed read as "drop this token from the store".

After this PR:

* `SaltDeserializationError` -- the stored blob is corrupt and cannot be parsed. The token is permanently unusable, so removing it is correct: leaving it around would make every subsequent `get_tok` for the same id keep failing. (Existing behaviour, kept.)
* `OSError` (covers `IOError`) -- transient backend error. The token itself is fine; do not delete it. Return `{}` so the caller treats this request as not-authenticated, and the next request retries against the backend.
* Expired token (past `expire`) -- remove. (Existing behaviour, kept.)

What issues does this PR fix or reference?
Fixes #69073
Previous Behavior
`get_tok` either propagated `OSError` to the caller or, where callers recovered by calling `rm_token`, deleted a perfectly valid token because the backend briefly couldn't be read. A 5-second Redis reconnect logged the entire fleet out.
A backend read failure is classified by cause: a corrupt blob (`SaltDeserializationError`) removes the token, while a transient backend error (`OSError`) returns `{}` and leaves the token in the store for the next request to retry. Expired tokens are still removed.
Merge requirements satisfied?
* Changelog added (`changelog/69073.fixed.md`)

Tests written?
Three behavioural tests in `tests/pytests/unit/auth/test_auth.py`:

* `test_get_tok_returns_empty_and_keeps_token_on_io_error` -- backend raises `OSError`; `get_tok` returns `{}` and `rm_token` is not called. This test fails on the previous implementation and is the regression this PR exists to prevent.
* `test_get_tok_removes_token_on_deserialization_error` -- corrupt blob → `rm_token` is called.
* `test_get_tok_removes_expired_token` -- expired token → `rm_token` is called.

Anything else reviewers should know?