I'm looking for assistance with writing a PostgreSQL query in Apache Superset.
I have metered consumption data, where the meters have unique IDs. Normally, a meter is read at the start of every hour, and the difference is calculated in the consumption column. There are times, however, when an error occurs and the meter is not read, leading to missing hours in the data. I would like to add new rows and create a new column that would say "Missing Data" in those cases.
Here is an example of what the data looks like. The read_time_locals are in a timestamp without timezone format
| meter_id | start_read_time_local | end_read_time_local | start_read | end_read | consumption |
|---|---|---|---|---|---|
| 1111 | 9/10/2024 0:00 | 9/10/2024 1:00 | 79118.91 | 79118.93 | 0.02 |
| 1111 | 9/10/2024 3:00 | 9/10/2024 4:00 | 79122.47 | 79123.99 | 1.52 |
| 2222 | 9/10/2024 0:00 | 9/10/2024 1:00 | 1937135 | 1937174 | 39 |
| 2222 | 9/10/2024 1:00 | 9/10/2024 2:00 | 1937174 | 1937191 | 17 |
| 2222 | 9/10/2024 2:00 | 9/10/2024 3:00 | 1937191 | 1937197 | 6 |
| 2222 | 9/10/2024 3:00 | 9/10/2024 4:00 | 1937197 | 1937202 | 5 |
What I would like is to create a query that would do this:
| meter_id | start_read_time_local | end_read_time_local | start_read | end_read | consumption | Notes |
|---|---|---|---|---|---|---|
| 1111 | 9/10/2024 0:00 | 9/10/2024 1:00 | 79118.91 | 79118.93 | 0.02 | |
| 1111 | 9/10/2024 1:00 | 9/10/2024 2:00 | null | Missing Data | ||
| 1111 | 9/10/2024 2:00 | 9/10/2024 3:00 | null | Missing Data | ||
| 1111 | 9/10/2024 3:00 | 9/10/2024 4:00 | 79122.47 | 79123.99 | 1.52 | |
| 2222 | 9/10/2024 0:00 | 9/10/2024 1:00 | 1937135 | 1937174 | 39 | |
| 2222 | 9/10/2024 1:00 | 9/10/2024 2:00 | 1937174 | 1937191 | 17 | |
| 2222 | 9/10/2024 2:00 | 9/10/2024 3:00 | 1937191 | 1937197 | 6 | |
| 2222 | 9/10/2024 3:00 | 9/10/2024 4:00 | 1937197 | 1937202 | 5 |
I have looked and tried various solutions, but have not been able to successfully do this. My initial attempts were to use a combination of min/max of the end_read_time_local and generate_series() to create a sequence of timestamps that I would then right join to the water_consumption table, the result being the missing hours would have no consumption/reading values.
I was using this command to create the series and just make a list of 10 hours:
select (date_trunc('hour',current_timestamp) + ((a-1)||' hour')::interval)::timestamptz
from generate_series(1, 10, 1) as a
I'm struggling to join it, and I have yet to figure out how I will join these for each meter ID. In other words, I need the join to check to see if meter 1111 has a complete record as well as meter 2222.
I have a table account with distinct devices, where device_id matches water_consumption.meter_id.
Following d r's advice, I tried the following:
WITH
grid AS
( Select a.device_id as meter_id,
DATEADD('hrs', gs, DATE_TRUNC('day', wc.start_read_time_local)) as read_start,
DATEADD('hrs', gs, DATE_TRUNC('day', wc.end_read_time_local)) as read_end,
Min(wc.start_read_time_local) as min_start_time,
Max(wc.start_read_time_local) as max_end_time
From account a
Inner Join water_consumption wc ON(wc.meter_id = a.device_id)
Cross Join generate_series(0, 23) gs
Group By a.device_id, gs, DATE_TRUNC( 'day', wc.start_read_time_local)
)
Select g.meter_id,
Coalesce(wc.start_read_time_local, g.read_start) as start_read_time_local,
Coalesce(wc.end_read_time_local, g.read_end) as end_read_time_local,
wc.start_read, wc.end_read, wc.consumption,
Case When wc.start_read_time_local Is Null Then 'Missing Data' End as notes
From grid g
Left Join water_consumption wc ON(wc.meter_id = g.meter_id And
wc.start_read_time_local = g.read_start )
Where read_start >= min_start_time And read_start <= max_end_time
Order By g.meter_id, g.read_start
But I got this Redshift error: column "wc.end_read_time_local" must appear in the GROUP BY clause or be used in an aggregate function
I tired to add it to the group by function, but didn't work.