I'm having a difficult time tracking down an issue with my AWS Aurora Serverless v2 cluster. My ACU range is set to 4 - 16. ACUs in use usually sit at around 4 or 5, and my average ACU utilization is around 40%. Once a week, for a period of 1 - 3 days, the cluster enters a spiking pattern where the ACU utilization for one of my instances spikes to 100%, causing Aurora to autoscale up to 16 ACUs for a minute or two, then drop back down to 4 or 5 when utilization recovers to the usual ~40%. These spikes happen roughly every 20 - 30 minutes (the timing is never precise, but it's usually in that window) and they continue for, again, 1 - 3 days straight, after which I typically get 4 or more days without spiking.

The pattern starts at different times of the week and, strangely, often occurs when no users are even logged in. Our application is a dental office application and our users typically don't work on weekends, yet we sometimes see spiking start on a Friday or Saturday and run through the weekend even though nobody is in the office using the system.
I've looked at the monitoring for our database and found a few queries that need to be optimized, but the timing of the spikes doesn't correspond to when those queries run. We also have a few scheduled tasks, but I've ruled those out too: they run on a consistent schedule, utilization can be high or stay low while they're running, they don't run every 20 - 30 minutes, and their workload doesn't vary from run to run.
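For what it's worth, this is roughly the kind of query I used alongside the monitoring to find the heavy statements and compare them against the spike windows. It's a minimal sketch: it assumes the pg_stat_statements extension is installed and Postgres 13+ column names (total_exec_time / mean_exec_time), and the connection details are placeholders.

```python
import psycopg2

# Placeholders: swap in your own endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="app_db",
    user="readonly_user",
    password="REDACTED",
)

with conn, conn.cursor() as cur:
    # Top statements by cumulative execution time.
    cur.execute("""
        SELECT queryid,
               calls,
               round(total_exec_time::numeric, 1) AS total_ms,
               round(mean_exec_time::numeric, 1)  AS mean_ms,
               left(query, 80)                    AS query_preview
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)
```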
I've overlaid every relevant metric on top of the chart below to look for a correlation, and in every case I came up empty. I've also exported my X-Ray traces from periods with spiking and periods without, loaded them into a database, and compared the two sets; I couldn't find anything of interest in those comparisons either.
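If it helps, the metrics I overlaid can be pulled programmatically with something like this (a sketch; the region and instance identifier are placeholders, and I looked at the same set for both instances). ACUUtilization and ServerlessDatabaseCapacity are the Aurora Serverless v2 metrics in the AWS/RDS namespace.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

def fetch_metric(metric_name, instance_id, hours=72):
    """Pull 5-minute datapoints for one RDS instance-level metric."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

for name in ("ACUUtilization", "ServerlessDatabaseCapacity",
             "CPUUtilization", "DatabaseConnections"):
    points = fetch_metric(name, "my-writer-instance")  # placeholder identifier
    print(name, [(p["Timestamp"].isoformat(), round(p["Maximum"], 1))
                 for p in points[:5]])
```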
I looked into whether autovacuum might be the culprit, but I didn't see anything obviously wrong there. I'm not a Postgres expert, but I've ruled it out as best I can with what I know. I could see the last time autovacuum ran on each table, but I didn't have any way to see a historical picture of the times it ran so I could overlay it on a timeline. I also wouldn't expect autovacuum to cause a series of spikes that goes on for days, or for a lot of autovacuum activity to be happening while almost nothing is changing in the database.
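For reference, this is roughly the check I was able to do: pg_stat_user_tables only exposes the last (auto)vacuum time and a cumulative count per table, which is why I couldn't reconstruct a timeline from it. A minimal sketch, with placeholder connection details:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="app_db",
    user="readonly_user",
    password="REDACTED",
)

with conn, conn.cursor() as cur:
    # Tables with the most dead tuples, plus last/total autovacuum activity.
    cur.execute("""
        SELECT relname,
               n_live_tup,
               n_dead_tup,
               last_autovacuum,
               autovacuum_count,
               last_autoanalyze
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)
```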
I started looking at the Aurora Postgres parameter group and option group, but there are hundreds of settings and it's a little daunting even to scan the list. Plus I keep telling myself that the default Aurora Postgres settings probably wouldn't cause this. I did look at the vacuum-related settings and they didn't seem overly aggressive at all.
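To make the list less daunting, I ended up filtering the cluster parameter group down to just the vacuum-related settings, roughly like this (a sketch; the region and parameter group name are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # placeholder region

marker = None
while True:
    kwargs = {"DBClusterParameterGroupName": "my-cluster-parameter-group"}  # placeholder
    if marker:
        kwargs["Marker"] = marker
    page = rds.describe_db_cluster_parameters(**kwargs)
    for p in page["Parameters"]:
        name = p.get("ParameterName", "")
        if "vacuum" in name:
            print(f'{name} = {p.get("ParameterValue", "<engine default>")}'
                  f'  (source: {p.get("Source")})')
    marker = page.get("Marker")
    if not marker:
        break
```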
We have two instances in our Aurora cluster, one read/write instance and one read-only instance. Both experience this same issue, although the graphs for the two don't tend to line up; each instance goes through the spiking pattern separately and at different times.
Here is the chart I'm looking at that shows the spiking pattern. Each prolonged spike pattern represents between 1 and 3 days (2.25 days on average). Does anyone have any thoughts or suggestions on how I can figure out the cause? We're planning to add a lot of new tenants to our application soon, and I want to make sure we won't run into a scaling issue when we do.
I guess my questions are: any ideas on what I can check to figure out what is causing these spikes? Is this normal for Aurora? Is it something running on the Aurora server itself?
