I use AWS RDS PostgreSQL as my database. I have one table storing childbirth data, currently 5M records, and another storing message-to-child stats (not the actual messages) with 30M records. Each month around 0.3M person records and 3M message records are added to these tables.
Query:
SELECT message_date, message_from, sms_type, message_type, direction, status,
       cd.state, cd.date_uploaded
FROM "Suvita".messages m, "Suvita".childbirth_data cd
WHERE m.contact_record_id = cd.id
Explain analyze results:
Hash Join (cost=568680.28..6272284.94 rows=29688640 width=319) (actual time=5473.787..96501.807 rows=30893261 loops=1)
Hash Cond: (m.contact_record_id = cd.id)
-> Seq Scan on messages m (cost=0.00..2739237.50 rows=29719350 width=174) (actual time=2.364..41071.274 rows=30936614 loops=1)
-> Hash (cost=402612.68..402612.68 rows=5347568 width=121) (actual time=5448.157..5448.158 rows=5349228 loops=1)
Buckets: 32768 Batches: 256 Memory Usage: 3518kB
-> Seq Scan on childbirth_data cd (cost=0.00..402612.68 rows=5347568 width=121) (actual time=0.011..3013.382 rows=5349228 loops=1)
Planning Time: 18.349 ms
Execution Time: 97849.004 ms
I have built aggregated tables for the dashboards where aggregate stats are required, plus materialized views for the different states and message types, depending on reporting needs. Writes to these tables happen via a scheduled job which runs every night, so even if writes become somewhat slower because of additional indexes, that is not a concern.
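For context, the nightly refresh works along these lines (a sketch only; the view name `mv_message_stats` and its columns are made up for this question, not my real schema):

```sql
-- Illustrative only: names and grouping columns are hypothetical.
CREATE MATERIALIZED VIEW "Suvita".mv_message_stats AS
SELECT cd.state,
       m.message_type,
       date_trunc('month', m.message_date) AS month,
       count(*) AS message_count
FROM "Suvita".messages m
JOIN "Suvita".childbirth_data cd ON m.contact_record_id = cd.id
GROUP BY cd.state, m.message_type, date_trunc('month', m.message_date);

-- Run by the nightly job. CONCURRENTLY avoids blocking readers but
-- requires a unique index on the materialized view.
REFRESH MATERIALIZED VIEW CONCURRENTLY "Suvita".mv_message_stats;
```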
I have already put in the following indexes:
person - id, state, mother name, mother phone, rchid, telerivet_contact_id
messages - id, contact_record_id, message_date, state
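In case it matters, one variant I considered was a covering index, so the probe side of the join could be answered from the index alone (a sketch; the index name and INCLUDE list are mine, and an index-only scan also depends on the visibility map being reasonably fresh after the nightly load):

```sql
-- Hypothetical covering index on the join column, carrying the
-- selected message columns in the leaf pages via INCLUDE.
CREATE INDEX messages_contact_record_id_covering
    ON "Suvita".messages (contact_record_id)
    INCLUDE (message_date, message_from, sms_type, message_type,
             direction, status);
```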
My query is still very slow, and the indexes do not appear to be used. How do I make PostgreSQL use the indexes?
Would partitioning help? I was thinking of partitioning by state, but one state holds 90% of the data, so I suspect it would not help much. Also, I do not want to touch the code that does the data input.
I am not a database expert, so please advise if there is anything else I can do to make reads on these tables faster.
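To be concrete, by "partitioning by state" I mean declarative list partitioning along these lines (sketch only; table and partition names are invented, and 'StateA' stands in for the state that holds ~90% of the rows):

```sql
-- Sketch of the skewed layout I was worried about: one partition
-- would end up holding almost all of the data.
CREATE TABLE "Suvita".childbirth_data_part (
    id bigint NOT NULL,
    state text NOT NULL,
    date_uploaded date
    -- ... remaining columns ...
) PARTITION BY LIST (state);

CREATE TABLE "Suvita".childbirth_data_statea
    PARTITION OF "Suvita".childbirth_data_part
    FOR VALUES IN ('StateA');  -- ~90% of rows would land here

CREATE TABLE "Suvita".childbirth_data_other
    PARTITION OF "Suvita".childbirth_data_part
    DEFAULT;
```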
Edit - 24 April 2024: I already have indexes on both join columns, m.contact_record_id and cd.id.
If I set enable_hashjoin = false (disabling hash joins), the query plan looks like this:
Workers Planned: 2
Workers Launched: 2
->  Merge Join  (cost=9142830.70..9948915.12 rows=12996738 width=317) (actual time=109544.770..123398.539 rows=10500973 loops=3)
      Merge Cond: (m.contact_record_id = cd.id)
      ->  Sort  (cost=7447866.74..7480412.83 rows=13018435 width=172) (actual time=85772.254..90202.085 rows=10500974 loops=3)
            Sort Key: m.contact_record_id
            Sort Method: external merge  Disk: 1497424kB
            Worker 0:  Sort Method: external merge  Disk: 1485808kB
            Worker 1:  Sort Method: external merge  Disk: 1481472kB
            ->  Parallel Seq Scan on messages m  (cost=0.00..2572228.35 rows=13018435 width=172) (actual time=0.367..40690.536 rows=10515424 loops=3)
      ->  Sort  (cost=1694830.27..1708199.61 rows=5347737 width=121) (actual time=23772.489..25077.382 rows=5349239 loops=3)
            Sort Key: cd.id
            Sort Method: external merge  Disk: 722360kB
            Worker 0:  Sort Method: external merge  Disk: 722384kB
            Worker 1:  Sort Method: external merge  Disk: 722368kB
            ->  Seq Scan on childbirth_data cd  (cost=0.00..402625.37 rows=5347737 width=121) (actual time=0.035..16575.124 rows=5349239 loops=3)
Planning Time: 0.886 ms
Execution Time: 136276.437 ms
There is not much difference in the execution time. What other optimizations can I do on the database or table level to make the reads faster?
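One thing I did notice in both plans: the hash join ran in 256 batches and the merge-join sorts spilled to disk ("external merge"), which I understand points at a small work_mem. An experiment I could try per session before re-running the query (the value is illustrative, not a recommendation):

```sql
-- Illustrative: raise work_mem for this session only, re-check the plan,
-- then reset. Applies per sort/hash node, so parallel workers multiply it.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;  -- the query above
RESET work_mem;
```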