
I have AWS RDS PostgreSQL as a database. I have a table for storing childbirth data with currently 5M records, and another for storing message-to-child stats (not the actual messages) with 30M records. Around 0.3M person records and 3M message records are added to these tables each month.

Query:

SELECT message_date, message_from, sms_type, message_type, direction, status,
       cd.state, cd.date_uploaded
FROM "Suvita".messages m, "Suvita".childbirth_data cd
WHERE m.contact_record_id = cd.id

Explain analyze results:

Hash Join  (cost=568680.28..6272284.94 rows=29688640 width=319) (actual time=5473.787..96501.807 rows=30893261 loops=1)
  Hash Cond: (m.contact_record_id = cd.id)
  ->  Seq Scan on messages m  (cost=0.00..2739237.50 rows=29719350 width=174) (actual time=2.364..41071.274 rows=30936614 loops=1)
  ->  Hash  (cost=402612.68..402612.68 rows=5347568 width=121) (actual time=5448.157..5448.158 rows=5349228 loops=1)
        Buckets: 32768  Batches: 256  Memory Usage: 3518kB
        ->  Seq Scan on childbirth_data cd  (cost=0.00..402612.68 rows=5347568 width=121) (actual time=0.011..3013.382 rows=5349228 loops=1)
Planning Time: 18.349 ms
Execution Time: 97849.004 ms

I have made aggregated tables for the dashboard where aggregate stats are required, and materialized views for different states or types of messages depending on the reporting needs. Writes to these tables happen via a scheduled job which runs every night, so even if writes become comparatively slow because of more indexes, that is not a concern.
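For reference, the nightly aggregation follows roughly this pattern (a simplified sketch only; the view name and grouping columns here are illustrative, not my actual definitions):

-- Sketch only: view name and grouping columns are illustrative.
CREATE MATERIALIZED VIEW "Suvita".daily_message_stats AS
SELECT cd.state,
       m.message_type,
       date_trunc('day', m.message_date) AS message_day,
       count(*) AS message_count
FROM "Suvita".messages m
JOIN "Suvita".childbirth_data cd ON cd.id = m.contact_record_id
GROUP BY cd.state, m.message_type, date_trunc('day', m.message_date);

-- Run by the scheduled nightly job; CONCURRENTLY requires a unique index on the view.
REFRESH MATERIALIZED VIEW CONCURRENTLY "Suvita".daily_message_stats;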

I have already put in the following indexes:

person - id, state, mother name, mother phone, rchid, telerivet_contact_id
messages - id, contact_record_id, message_date and state

My query is still very slow, and it seems the indexes are not being used. How do I make PostgreSQL use the indexes?

Would partitioning help? I was thinking of partitioning by state, but one of the states has 90% of the data, so I think it will not be helpful. Also, I do not want to touch the code which does the data input.
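(For reference, the partitioning by state I was thinking of would be declarative list partitioning along these lines; the partition names, column types and state values below are purely illustrative.)

-- Illustrative only: partition names and state values are placeholders.
CREATE TABLE "Suvita".messages_partitioned (
    id                bigint,
    contact_record_id bigint,
    state             text,
    message_date      timestamptz
    -- remaining message columns omitted in this sketch
) PARTITION BY LIST (state);

CREATE TABLE "Suvita".messages_state_largest
    PARTITION OF "Suvita".messages_partitioned FOR VALUES IN ('STATE_WITH_90_PERCENT');

CREATE TABLE "Suvita".messages_state_other
    PARTITION OF "Suvita".messages_partitioned DEFAULT;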

I am not a database expert, so please advise: is there anything else I can do to make reads on these tables faster?

Edit (24 April 2024): I already have indexes on both columns of the join condition (m.contact_record_id and cd.id).

If I disable hash joins (SET enable_hashjoin = off), the query plan looks like the one below:

Workers Planned: 2
Workers Launched: 2
->  Merge Join  (cost=9142830.70..9948915.12 rows=12996738 width=317) (actual time=109544.770..123398.539 rows=10500973 loops=3)
      Merge Cond: (m.contact_record_id = cd.id)
      ->  Sort  (cost=7447866.74..7480412.83 rows=13018435 width=172) (actual time=85772.254..90202.085 rows=10500974 loops=3)
            Sort Key: m.contact_record_id
            Sort Method: external merge  Disk: 1497424kB
            Worker 0:  Sort Method: external merge  Disk: 1485808kB
            Worker 1:  Sort Method: external merge  Disk: 1481472kB
            ->  Parallel Seq Scan on messages m  (cost=0.00..2572228.35 rows=13018435 width=172) (actual time=0.367..40690.536 rows=10515424 loops=3)
      ->  Sort  (cost=1694830.27..1708199.61 rows=5347737 width=121) (actual time=23772.489..25077.382 rows=5349239 loops=3)
            Sort Key: cd.id
            Sort Method: external merge  Disk: 722360kB
            Worker 0:  Sort Method: external merge  Disk: 722384kB
            Worker 1:  Sort Method: external merge  Disk: 722368kB
            ->  Seq Scan on childbirth_data cd  (cost=0.00..402625.37 rows=5347737 width=121) (actual time=0.035..16575.124 rows=5349239 loops=3)
Planning Time: 0.886 ms
Execution Time: 136276.437 ms

There is not much difference in the execution time. What other optimizations can I do on the database or table level to make the reads faster?

Full table scans are much faster than index scans when you want to read entire tables. That seems to be the case here, since your query does not have a WHERE clause.

2 Answers


There is no index that can help with a hash join.

One problem is that work_mem is set too small for efficient processing. You can use EXPLAIN (ANALYZE, BUFFERS) to see the temporary file usage. If you can afford to increase work_mem for this query, it should help.
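For example, something along these lines raises the limit for your session only (the 256MB value is just an illustration; size it to the RAM you can spare):

-- Applies to the current session only; the server-wide default is untouched.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT message_date, message_from, sms_type, message_type, direction, status,
       cd.state, cd.date_uploaded
FROM "Suvita".messages m
    JOIN "Suvita".childbirth_data cd ON m.contact_record_id = cd.id;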

I am surprised that PostgreSQL does not consider a merge join if you give it that little memory. What does the plan look like if you create indexes on childbirth_data.id and messages.contact_record_id? If you temporarily set enable_hashjoin and enable_nestloop to off in your session, how does the execution plan look?
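If those indexes are missing, they can be created, and the planner switches toggled for the session, roughly like this (a sketch, not a prescription; the index names are auto-generated):

CREATE INDEX ON "Suvita".childbirth_data (id);
CREATE INDEX ON "Suvita".messages (contact_record_id);

-- Session-local switches, only to see which plan the planner falls back to.
SET enable_hashjoin = off;
SET enable_nestloop = off;
-- rerun EXPLAIN (ANALYZE, BUFFERS) on the original query here
RESET enable_hashjoin;
RESET enable_nestloop;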


1 Comment

That's useless in comments. Please edit the question and add the information there.

You are correct that adding indexes to the table is the best way to make reads faster. The tradeoff is that writes will be slower, so go easy on how many you add to the table (even if right now you think it won't matter much); it's important to choose them carefully.

Your problem

When picking an index, you want to look at the queries you're making. You've indicated that you want to run the following query:

SELECT
    message_date,
    message_from,
    sms_type,
    message_type,
    direction,
    status,
    cd.state,
    cd.date_uploaded
FROM
    "Suvita".messages AS m
CROSS JOIN
    "Suvita".childbirth_data AS cd
WHERE
    m.contact_record_id = cd.id

Here I've reformatted things for clarity, but it is equivalent.

This is quite an odd thing to be doing. You are:

  • cross joining every row in m to every row in cd
  • filtering on a certain condition that throws away all cross joined rows where the IDs don't match

I am fairly sure this is the cause of your problems. First, this operation massively increases the amount of data being wrangled (30M x 5M rows = 150T rows before filtering). Secondly, there is no index on the cross joined data so no index can be used.

The good news

The query you're making seems to me to be precisely equivalent to

SELECT
    message_date,
    message_from,
    sms_type,
    message_type,
    direction,
    status,
    cd.state,
    cd.date_uploaded
FROM
    "Suvita".childbirth_data AS cd
INNER JOIN
    "Suvita".messages AS m
ON
    cd.id = m.contact_record_id

This is a very typical query and should run much faster. I have put the smaller table first in the join.

If you need additional performance improvements, the way forward would be to add an index on contact_record_id to messages and an index on id to childbirth_data.

However, the first of those indexes is likely already defined: if you made id a primary key, it will certainly have a unique index on it. Note that making contact_record_id a foreign key back to childbirth_data does not, by itself, create an index on that column in PostgreSQL, so that index has to be added explicitly (your edit suggests it already exists). I would still recommend declaring the primary and foreign keys, and only adding whatever extra indexes are genuinely missing.
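For reference, a sketch of the corresponding DDL, assuming none of these constraints exist yet (the constraint name is illustrative):

ALTER TABLE "Suvita".childbirth_data
    ADD PRIMARY KEY (id);

ALTER TABLE "Suvita".messages
    ADD CONSTRAINT messages_contact_record_id_fkey  -- illustrative name
    FOREIGN KEY (contact_record_id)
    REFERENCES "Suvita".childbirth_data (id);

-- PostgreSQL does not index the referencing column of a foreign key
-- automatically, so create that index explicitly if it is not there yet.
CREATE INDEX ON "Suvita".messages (contact_record_id);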

With these indexes in place I am very confident that the query I have written will run in an optimised way.

12 Comments

This seems to be a long explanation of Bad Habits to Kick: Old-style JOINs, ending with the claim that a query using INNER JOIN would perform better than one without it. I did not test, but the performance of both queries should be the same (because of the nice job that the optimizer does!).
Oh interesting to hear that the optimiser can handle this. Still, I am surprised that it would take >90s to return results...
Using an index for this query with this data, is most likely slower, not faster. It's not filtering out any data, it needs all data.
Surely that depends on the number of matches between the two tables? I will defer to those with more expertise, but I always believed that foreign keys come with an index because this index improves the performance of a join between two tables (since for each row in one table you need to locate matching rows in the other). Most of the reference materials I've found support this view (though that's no match for practical experience!).
@PaddyAlton: No, it does not depend on that. Please create two tables and test the output of the two SQL statements using EXPLAIN. (You should see that the output is the same.)
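A minimal way to run that test (the table and column names below are made up for the demonstration):

CREATE TABLE parent (id bigint PRIMARY KEY);
CREATE TABLE child  (id bigint PRIMARY KEY, parent_id bigint REFERENCES parent (id));

-- Old-style comma join with a WHERE clause ...
EXPLAIN SELECT * FROM child c, parent p WHERE c.parent_id = p.id;

-- ... and the explicit INNER JOIN: both should produce the same plan.
EXPLAIN SELECT * FROM child c INNER JOIN parent p ON c.parent_id = p.id;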
