I'm having a difficult time tracking down an issue with my AWS Aurora Serverless v2 cluster. My ACU range is set to 4 - 16. ACUs in use usually sit at around 4 or 5, and my average ACU utilization is around 40%. Once a week, for a period of 1 - 3 days, the cluster enters a spiking pattern where the ACU utilization for one of my instances spikes to 100%, causing Aurora to autoscale up to 16 ACUs for a minute or two, then drop back down to 4 or 5 when utilization recovers to the usual ~40%. These spikes happen roughly every 20 - 30 minutes (the timing is never precise, but it's usually in that window) and they continue for, again, 1 - 3 days straight, after which I typically get 4 or more days without spiking.

The pattern starts at different times of the week and, strangely, often occurs when no users are even logged in. Our application is a dental office application and our users typically don't work on weekends, yet we sometimes see spiking start on a Friday or Saturday and run through the weekend even though nobody is in the office using the system.
I've looked at the monitoring for our database and found a few queries that need to be optimized, but the timing of the spikes doesn't correspond to when those queries run. We also have a few scheduled tasks, but I've ruled those out too: they run on a consistent schedule, utilization can be high or stay low while they're running, they don't run every 20 - 30 minutes, and their workload doesn't vary from run to run.
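For what it's worth, this is roughly the kind of query I used alongside the monitoring to find the heavy statements and compare them against the spike windows. It's a minimal sketch: it assumes the pg_stat_statements extension is installed and Postgres 13+ column names (total_exec_time / mean_exec_time), and the connection details are placeholders.

```python
import psycopg2

# Placeholders: swap in your own endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="app_db",
    user="readonly_user",
    password="REDACTED",
)

with conn, conn.cursor() as cur:
    # Top statements by cumulative execution time.
    cur.execute("""
        SELECT queryid,
               calls,
               round(total_exec_time::numeric, 1) AS total_ms,
               round(mean_exec_time::numeric, 1)  AS mean_ms,
               left(query, 80)                    AS query_preview
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)
```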
I've overlaid every relevant metric on top of the chart below to look for a correlation, and in every case I came up empty. I've also exported my X-Ray traces from periods with spiking and periods without, loaded them into a database, and compared the two sets; I couldn't find anything of interest in those comparisons either.
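If it helps, the metrics I overlaid can be pulled programmatically with something like this (a sketch; the region and instance identifier are placeholders, and I looked at the same set for both instances). ACUUtilization and ServerlessDatabaseCapacity are the Aurora Serverless v2 metrics in the AWS/RDS namespace.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

def fetch_metric(metric_name, instance_id, hours=72):
    """Pull 5-minute datapoints for one RDS instance-level metric."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

for name in ("ACUUtilization", "ServerlessDatabaseCapacity",
             "CPUUtilization", "DatabaseConnections"):
    points = fetch_metric(name, "my-writer-instance")  # placeholder identifier
    print(name, [(p["Timestamp"].isoformat(), round(p["Maximum"], 1))
                 for p in points[:5]])
```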
I looked into whether autovacuum might be the culprit, but I didn't see anything obviously wrong there. I'm not a Postgres expert, but I've ruled it out as best I can with what I know. I could see the last time autovacuum ran on each table, but I didn't have any way to see a historical picture of the times it ran so I could overlay it on a timeline. I also wouldn't expect autovacuum to cause a series of spikes that goes on for days, or for a lot of autovacuum activity to be happening while almost nothing is changing in the database.
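For reference, this is roughly the check I was able to do: pg_stat_user_tables only exposes the last (auto)vacuum time and a cumulative count per table, which is why I couldn't reconstruct a timeline from it. A minimal sketch, with placeholder connection details:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="app_db",
    user="readonly_user",
    password="REDACTED",
)

with conn, conn.cursor() as cur:
    # Tables with the most dead tuples, plus last/total autovacuum activity.
    cur.execute("""
        SELECT relname,
               n_live_tup,
               n_dead_tup,
               last_autovacuum,
               autovacuum_count,
               last_autoanalyze
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)
```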
I started looking at the Aurora Postgres parameter group and option group, but there are hundreds of settings and it's a little daunting even to scan the list. Plus I keep telling myself that the default Aurora Postgres settings probably wouldn't cause this. I did look at the vacuum-related settings and they didn't seem overly aggressive at all.
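To make the list less daunting, I ended up filtering the cluster parameter group down to just the vacuum-related settings, roughly like this (a sketch; the region and parameter group name are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # placeholder region

marker = None
while True:
    kwargs = {"DBClusterParameterGroupName": "my-cluster-parameter-group"}  # placeholder
    if marker:
        kwargs["Marker"] = marker
    page = rds.describe_db_cluster_parameters(**kwargs)
    for p in page["Parameters"]:
        name = p.get("ParameterName", "")
        if "vacuum" in name:
            print(f'{name} = {p.get("ParameterValue", "<engine default>")}'
                  f'  (source: {p.get("Source")})')
    marker = page.get("Marker")
    if not marker:
        break
```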
We have two instances in our Aurora cluster, one read/write instance and one read-only instance. Both experience this same issue, although the graphs for the two don't tend to line up; each instance goes through the spiking pattern separately and at different times.
Here is the chart I'm looking at that shows the spiking pattern. Each prolonged spike pattern represents between 1 and 3 days (2.25 days on average). Does anyone have any thoughts or suggestions on how I can figure out the cause? We're planning to add a lot of new tenants to our application soon, and I want to make sure we won't run into a scaling issue when we do.
I guess my questions are: any ideas on what I can check to figure out what is causing these spikes? Is this normal for Aurora? Is it something running on the Aurora server itself?
