Skip to content

Avoid loading certificates from system store for every connection#6418

Open
SReicheltPTV wants to merge 5 commits intonpgsql:mainfrom
SReicheltPTV:main
Open

Avoid loading certificates from system store for every connection#6418
SReicheltPTV wants to merge 5 commits intonpgsql:mainfrom
SReicheltPTV:main

Conversation

@SReicheltPTV
Copy link

@SReicheltPTV SReicheltPTV commented Jan 21, 2026

Cache loaded custom root certificates, in particular if the PGSSLROOTCERT environment variable is set. This fixes #6417.

Original text:

This PR contains two changes:

I hope the changes are OK; unfortunately I can only test them in limited ways.

if (verifyFull && sslPolicyErrors.HasFlag(SslPolicyErrors.RemoteCertificateNameMismatch))
return false;
// Also check against the default systme certificates location on Debian/Ubuntu.
if (certRootPath == "/etc/ssl/certs/ca-certificates.crt")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW we avoid hard-coding filesystem paths in the source code like this, as different Linux distributions may have their certificate store in completely different places etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. Unfortunately, the SSL_CERT_FILE env var doesn't usually seem to be set. As I wrote in #6417, I'm a bit out of ideas here.
In the meantime, it occurred to me that a completely different alternative could be to just cache the custom certificates similarly to how the system certificates are cached. Then there would no longer be a need to figure out what the path points to. I'll try to come up with something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @vonzshik on all this, who actually knows stuff about all the certificate stuff

@vonzshik
Copy link
Contributor

So, from my understanding, in your particular case you had certificate in store and specified in PGSSLROOTCERT, which after 7320260 resulted in it being verified twice.
If we consider loading certificates from disk per each physical open as the main issue, then we can just cache them in memory (looking at OpenSSL's code, it seems that's exactly what they're doing) after the first load. Building the chain twice should be quite fast, as long as certificate revocation isn't involved (and even then I'm pretty sure the results are cached for all certificates).
Personally, I'm just very hesitant to touch CertificateChainPolicy as it seems to be very easy to introduce potential security issues through it (for example, there is a comment on SslClientAuthenticationOptions.CertificateChainPolicy which says If not null, CertificateRevocationCheckMode and SslCertificateTrust are ignored. And while CertificateRevocationCheckMode seems to be easy enough, though it doesn't say what instead should be set, where is 0 information about SslCertificateTrust).

@roji
Copy link
Member

roji commented Jan 22, 2026

If we consider loading certificates from disk per each physical open as the main issue, then we can just cache them in memory (looking at OpenSSL's code, it seems that's exactly what they're doing) after the first load.

Just noting that this seems somewhat unlikely - a physical connection open is quite a heavy thing (with potentially multiple network roundrips etc.), and I'd expect just loading a certificate from disk to be negligible. It may be a good idea to first prove that there's a problem via a benchmark.

@vonzshik
Copy link
Contributor

One more thing. libpq also supports the special value system for sslrootcert (Root Certificate in npgsql). The way it works is similar to npgsql as if Root Certificate isn't specified, at which point it'll load certificate from the store (and also changes the ssl mode to VerifyFull). We can also add this to allow ignoring environment variables and force using the store.

@SReicheltPTV
Copy link
Author

So, from my understanding, in your particular case you had certificate in store and specified in PGSSLROOTCERT, which after 7320260 resulted in it being verified twice.

That's correct, though I suspect the problem may not be the double verification of this specific certificate, but the size of the system store if PGSSLROOTCERT is set to it.

If we consider loading certificates from disk per each physical open as the main issue, then we can just cache them in memory (looking at OpenSSL's code, it seems that's exactly what they're doing) after the first load. Building the chain twice should be quite fast, as long as certificate revocation isn't involved (and even then I'm pretty sure the results are cached for all certificates).

I'll check whether caching the loaded certificates is sufficient, and will report back on this tomorrow.

Personally, I'm just very hesitant to touch CertificateChainPolicy as it seems to be very easy to introduce potential security issues through it (for example, there is a comment on SslClientAuthenticationOptions.CertificateChainPolicy which says If not null, CertificateRevocationCheckMode and SslCertificateTrust are ignored. And while CertificateRevocationCheckMode seems to be easy enough, though it doesn't say what instead should be set, where is 0 information about SslCertificateTrust).

While I don't have a good grasp on these details, note that the current Npgsql code ignores the result of the verification against the system certificates if PGSSLROOTCERT is set, and then performs its own verification using the custom chain policy. So my (uncertain) impression is that whatever could be ignored is already being ignored.

One more thing. libpq also supports the special value system for sslrootcert (Root Certificate in npgsql). The way it works is similar to npgsql as if Root Certificate isn't specified, at which point it'll load certificate from the store (and also changes the ssl mode to VerifyFull). We can also add this to allow ignoring environment variables and force using the store.

That sounds useful, but just to clarify: In my case, simply unsetting PGSSLROOTCERT did the job. But since the problem was somewhat difficult to pin down, and the situation with PGSSLROOTCERT set to the system cert store seems somewhat standard, I'm trying to find a way to prevent others from running into the same issue.

@vonzshik
Copy link
Contributor

vonzshik commented Jan 22, 2026

While I don't have a good grasp on these details, note that the current Npgsql code ignores the result of the verification against the system certificates if PGSSLROOTCERT is set, and then performs its own verification using the custom chain policy. So my (uncertain) impression is that whatever could be ignored is already being ignored.

Yes but not exactly, as that while we do throw away the previous verification result, the only thing we do is we add specified certificate to custom store and ask the chain to trust whatever there is, after which we try to build the chain again. That is to say, if there were some other issues unrelated to root certificate not being trusted (for example, getting a completely different root certificate), it'll still return an error.

@SReicheltPTV
Copy link
Author

SReicheltPTV commented Jan 23, 2026

If we consider loading certificates from disk per each physical open as the main issue, then we can just cache them in memory (looking at OpenSSL's code, it seems that's exactly what they're doing) after the first load. Building the chain twice should be quite fast, as long as certificate revocation isn't involved (and even then I'm pretty sure the results are cached for all certificates).

I can confirm that caching the certificates fixes the problem we had, and I have updated the PR accordingly.

Just noting that this seems somewhat unlikely - a physical connection open is quite a heavy thing (with potentially multiple network roundrips etc.), and I'd expect just loading a certificate from disk to be negligible. It may be a good idea to first prove that there's a problem via a benchmark.

I agree a benchmark would be useful, but I don't know exactly how to do it. In the web service where we experienced the problem, two factors appear to influence the result:

  • As PGSSLROOTCERT is set to the system store, quite a few certificates are loaded.
  • The CPU resources in the development stage are intentionally kept very low.

In that setting, the CPU usage had a rather serious side effect: If many requests reached the service at the same time, all database connection attempts ran into a timeout while validating the certificate chain. As a result, these connections also never made it into the connection pool, so every incoming request caused another certificate validation, making the problem even worse.

Yes but not exactly, as that while we do throw away the previous verification result, the only thing we do is we add specified certificate to custom store and ask the chain to trust whatever there is, after which we try to build the chain again. That is to say, if there were some other issues unrelated to root certificate not being trusted (for example, getting a completely different root certificate), it'll still return an error.

I see, thanks. I've removed this change from the PR.

@roji
Copy link
Member

roji commented Jan 23, 2026

BTW just to state the obvious - IIUC caching certificates would mean that users can't deploy a new certificate without restarting the application. That might not be a problem, and if OpenSSL does this as @vonzshik indicated, then we definitely can do the same, I think.

@SReicheltPTV
Copy link
Author

BTW just to state the obvious - IIUC caching certificates would mean that users can't deploy a new certificate without restarting the application. That might not be a problem, and if OpenSSL does this as @vonzshik indicated, then we definitely can do the same, I think.

Oh, right. We could check the file timestamp if necessary, of course.

@roji
Copy link
Member

roji commented Jan 23, 2026

BTW just to state the obvious - IIUC caching certificates would mean that users can't deploy a new certificate without restarting the application. That might not be a problem, and if OpenSSL does this as @vonzshik indicated, then we definitely can do the same, I think.

Oh, right. We could check the file timestamp if necessary, of course.

If you're doing I/O to check the file timestamp, and certificate files are generally quite small, then it really would seem weird that just checking the file timestamp is so much faster than just loading the file... Seeing an actual benchmark around all this would be very helpful.

@vonzshik
Copy link
Contributor

I decided to dive in deeper in regards to OpenSSL's certificate cache, and things are not as straightforward as I thought at first.

libpq uses SSL_CTX_load_verify_locations to load root cert in memory, which in the end calls X509_load_cert_crl_file_ex, with no caching in sight. Now, there seems to be some caching involved whenever SSL_CTX_load_verify_locations is called for a folder instead of a specific file, but that's not our case and not that interesting anyway.

@vonzshik
Copy link
Contributor

@SReicheltPTV I see in your pr you added a reference to runtime issue dotnet/runtime#123058. Reading through it, I see that the dev there recommends to instead use CertificateChainPolicy with CustomTrustStore. If that's indeed enough to implement a one-time certificate verification with no other side effects (which seems to be the case, judging by their response), I think we can go back to the previous implementation and go from there.

@SReicheltPTV
Copy link
Author

If you're doing I/O to check the file timestamp, and certificate files are generally quite small, then it really would seem weird that just checking the file timestamp is so much faster than just loading the file... Seeing an actual benchmark around all this would be very helpful.

Checking the timestamp proved unproblematic in terms of performance. I've added it as a separate commit now so you can decide whether to include it.

I haven't done any systematic benchmarking, but for the instance in our low-resource dev environment, the service becomes completely overloaded when a certain number of database connections are opened simultaneously (triggering the certificate validation), and requests time out after 30-45s. (Not sure why it varies; I'm guessing that incoming requests are being blocked at some point.) Normal response times are around 50-300ms. I don't think the numbers can be compared meaningfully, though, because under normal circumstances, when the database connections succeed, they remain in the connection pool for a while.
The certificate file is 220KB in this case; I think it contains the standard root certificates of a Linux distribution plus some custom ones. (As I wrote above, although it may not be a good idea to set PGSSLROOTCERT to this file, the actual issue was that we had no idea that this was causing the problems.)

@SReicheltPTV I see in your pr you added a reference to runtime issue dotnet/runtime#123058. Reading through it, I see that the dev there recommends to instead use CertificateChainPolicy with CustomTrustStore. If that's indeed enough to implement a one-time certificate verification with no other side effects (which seems to be the case, judging by their response), I think we can go back to the previous implementation and go from there.

Oh, I hadn't seen that actually. Yes, they point to https://learn.microsoft.com/en-us/dotnet/core/extensions/sslstream-best-practices#custom-certificate-trust, but that page doesn't seem to address your specific objection. I'm not an expert on this at all unfortunately. Anyway, I have rebased onto that commit again as you suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Severe performance issue in Npgsql 10 when PGSSLROOTCERT is set to system store

3 participants