Insert missing rows in time series consumption history

Question

I'm looking for assistance with writing a PostgreSQL query in Apache Superset.

I have metered consumption data, where the meters have unique IDs. Normally, a meter is read at the start of every hour, and the difference is calculated in the consumption column. There are times, however, when an error occurs and the meter is not read, leading to missing hours in the data. I would like to add new rows and create a new column that would say "Missing Data" in those cases.

Here is an example of what the data looks like. The read_time_locals are in a timestamp without timezone format

meter_id	start_read_time_local	end_read_time_local	start_read	end_read	consumption
1111	9/10/2024 0:00	9/10/2024 1:00	79118.91	79118.93	0.02
1111	9/10/2024 3:00	9/10/2024 4:00	79122.47	79123.99	1.52
2222	9/10/2024 0:00	9/10/2024 1:00	1937135	1937174	39
2222	9/10/2024 1:00	9/10/2024 2:00	1937174	1937191	17
2222	9/10/2024 2:00	9/10/2024 3:00	1937191	1937197	6
2222	9/10/2024 3:00	9/10/2024 4:00	1937197	1937202	5

What I would like is to create a query that would do this:

meter_id	start_read_time_local	end_read_time_local	start_read	end_read	consumption	Notes
1111	9/10/2024 0:00	9/10/2024 1:00	79118.91	79118.93	0.02
1111	9/10/2024 1:00	9/10/2024 2:00			null	Missing Data
1111	9/10/2024 2:00	9/10/2024 3:00			null	Missing Data
1111	9/10/2024 3:00	9/10/2024 4:00	79122.47	79123.99	1.52
2222	9/10/2024 0:00	9/10/2024 1:00	1937135	1937174	39
2222	9/10/2024 1:00	9/10/2024 2:00	1937174	1937191	17
2222	9/10/2024 2:00	9/10/2024 3:00	1937191	1937197	6
2222	9/10/2024 3:00	9/10/2024 4:00	1937197	1937202	5

I have looked and tried various solutions, but have not been able to successfully do this. My initial attempts were to use a combination of min/max of the end_read_time_local and generate_series() to create a sequence of timestamps that I would then right join to the water_consumption table, the result being the missing hours would have no consumption/reading values.

I was using this command to create the series and just make a list of 10 hours:

select (date_trunc('hour',current_timestamp) + ((a-1)||' hour')::interval)::timestamptz
from generate_series(1, 10, 1) as a

I'm struggling to join it, and I have yet to figure out how I will join these for each meter ID. In other words, I need the join to check to see if meter 1111 has a complete record as well as meter 2222.

I have a table account with distinct devices, where device_id matches water_consumption.meter_id.

Following d r's advice, I tried the following:

WITH 
   grid AS
    ( Select     a.device_id as meter_id, 
                 DATEADD('hrs', gs, DATE_TRUNC('day', wc.start_read_time_local)) as read_start, 
                 DATEADD('hrs', gs, DATE_TRUNC('day', wc.end_read_time_local)) as read_end,
                 Min(wc.start_read_time_local) as min_start_time, 
                 Max(wc.start_read_time_local) as max_end_time
      From       account a
      Inner Join water_consumption wc ON(wc.meter_id = a.device_id)
      Cross Join generate_series(0, 23) gs
      Group By   a.device_id, gs, DATE_TRUNC( 'day', wc.start_read_time_local)
    ) 
Select     g.meter_id, 
           Coalesce(wc.start_read_time_local, g.read_start) as start_read_time_local, 
           Coalesce(wc.end_read_time_local, g.read_end) as end_read_time_local, 
           wc.start_read, wc.end_read, wc.consumption, 
           Case When wc.start_read_time_local Is Null Then 'Missing Data' End as notes
From        grid g
Left Join   water_consumption wc ON(wc.meter_id = g.meter_id And 
                                    wc.start_read_time_local = g.read_start )
Where      read_start >= min_start_time And read_start <= max_end_time
Order By    g.meter_id, g.read_start

But I got this Redshift error: column "wc.end_read_time_local" must appear in the GROUP BY clause or be used in an aggregate function

I tired to add it to the group by function, but didn't work.

Redshift is very much different from Postgres. Remember to declare your RDBMS. — Erwin Brandstetter
– Erwin Brandstetter, Commented Sep 15, 2024 at 4:42
Sorry, I"m a little out of my depth and was told I was in Postgres. I really appreciate your assistance...... I tried what you suggested and was met with this error: "redshift error: column notation .start_read_time_local applied to type integer, which is not a composite type". I assume that this has something to do with the fact that we are truncating the original timestamp data. When I look at it in the water_consumption data it looks like this, for example: '2024-07-12T18:00:00'. Is the issue that even though it looks like it is rounded to the hour it isn't? — Crazymonkey44
– Crazymonkey44, Commented Sep 15, 2024 at 6:52
Sorry, that invalid reference was my fault. (In Postgres and Redshift alike.) See fixed answer. — Erwin Brandstetter
– Erwin Brandstetter, Commented Sep 15, 2024 at 8:31

Erwin Brandstetter · Accepted Answer · 2024-09-15 08:30:08Z

2

Build a Cartesian Product (CROSS JOIN) of all relevant meters and hours, then LEFT JOIN to existing data:

SELECT a.device_id
     , h.start_read_time_local
     , h.start_read_time_local + interval '1 hour' AS end_read_time_local 
     , w.start_read
     , w.end_read
     , w.consumption
     , CASE WHEN w.meter_id IS NULL THEN 'Missing Data' END AS notes
FROM   account a
CROSS  JOIN (
   SELECT date_trunc('hour', LOCALTIMESTAMP) + h * interval '1 hour' AS start_read_time_local
   FROM   generate_series(0, 9) h
   ) h
LEFT   JOIN water_consumption w ON w.meter_id = a.device_id
                               AND w.start_read_time_local = h.start_read_time_local
;

Using your table account that contains a distinct set of relevant devices, where account.device_id equals water_consumption.meter_id.

Related, with more explanation:

Generate a series of dates for each group in a table

The USING clause I had in my initial answer builds on equal column names. Use ON instead, if that shortcut is not possible.

Join syntax basics in the manual

edited Sep 15, 2024 at 8:30

answered Sep 14, 2024 at 9:50

Erwin Brandstetter

669k160 gold badges1.2k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

d r Over a year ago

This doesn't result as OP posted in the question - an 8 rows table under "What I would like is to create a query that would do this:". Or am I missing something!?

d r Over a year ago

Got it. Your code works but just with today's data - it doesn't work with other data. You should change the LOCALTIMESTAMP with an actual date from the table that is beeing processed. With OP's sample data it could be To_Timestamp('09/10/2024 00:00', 'mm/dd/yyyy hh24:mi')...

d r Over a year ago

You can see the difference in the fiddle at dbfiddle.uk/lZfhWsdq .

marc_s · Accepted Answer · 2024-11-20 17:25:17Z

1

To fill in the gaps in hours of a dates create a grid of all hours per date and meter_id - left join your readings data and use Coalesce() function to show every hour then define notes column as 'Missing Data' for hours without readings.

--    S a m p l e    D a t a : 
CREATE TABLE account AS
  Select 1111 as device_id Union All
  Select 2222;

CREATE TABLE water_consumption AS
  Select 1111 as meter_id, To_Timestamp('09/10/2024 00:00', 'mm/dd/yyyy hh24:mi') as start_read_time_local,  To_Timestamp('09/10/2024 01:00', 'mm/dd/yyyy hh24:mi') as end_read_time_local, 79118.91 as start_read  , 79118.93 as end_read, 0.02 as consumption Union All
  Select 1111,  To_Timestamp('09/10/2024 03:00', 'mm/dd/yyyy hh24:mi'),     To_Timestamp('09/10/2024 04:00', 'mm/dd/yyyy hh24:mi'), 79122.47,     79123.99, 1.52 Union All
  Select 2222,  To_Timestamp('09/10/2024 00:00', 'mm/dd/yyyy hh24:mi'),     To_Timestamp('09/10/2024 01:00', 'mm/dd/yyyy hh24:mi'),  1937135,      1937174, 39 Union All
  Select 2222,  To_Timestamp('09/10/2024 01:00', 'mm/dd/yyyy hh24:mi'),     To_Timestamp('09/10/2024 02:00', 'mm/dd/yyyy hh24:mi'),  1937174,      1937191, 17 Union All
  Select 2222,  To_Timestamp('09/10/2024 02:00', 'mm/dd/yyyy hh24:mi'),     To_Timestamp('09/10/2024 03:00', 'mm/dd/yyyy hh24:mi'),  1937191,      1937197, 6 Union All
  Select 2222,  To_Timestamp('09/10/2024 03:00', 'mm/dd/yyyy hh24:mi'),     To_Timestamp('09/10/2024 04:00', 'mm/dd/yyyy hh24:mi'),  1937197,      1937202, 5;

... create a cte grid that will have a row for every hour for every date of reading and for every meter_id ...

WITH 
   grid AS
    ( Select     a.device_id as meter_id, 
                 DATE_TRUNC('day', wc.start_read_time_local) + interval '1 minute' * (gs * 60) as read_start,
                 DATE_TRUNC('day', wc.start_read_time_local) + interval '1 minute' * ((gs + 1) * 60) as read_end,
                 Min(wc.start_read_time_local) as min_start_time, 
                 Max(wc.start_read_time_local) as max_end_time
      From       account a
      Inner Join water_consumption wc ON(wc.meter_id = a.device_id)
      Cross Join generate_series(0, 23) gs
      Group By   a.device_id, gs, DATE_TRUNC( 'day', wc.start_read_time_local)
    )

... if you want the result as presented in your question - filter the data with WHERE clause as below. Without WHERE clause you will get all hours of a date per meter_id (see it in the fiddle - link is below)

--      M a i n    S Q L : 
Select     g.meter_id, 
           Coalesce(wc.start_read_time_local, g.read_start) as start_read_time_local, 
           Coalesce(wc.end_read_time_local, g.read_end) as end_read_time_local, 
           wc.start_read, wc.end_read, wc.consumption, 
           Case When wc.start_read_time_local Is Null Then 'Missing Data' End as notes
From        grid g
Left Join   water_consumption wc ON(wc.meter_id = g.meter_id And 
                                    wc.start_read_time_local = g.read_start )
Where      read_start >= min_start_time And read_start <= max_end_time
Order By    g.meter_id, g.read_start

/*            R e s u l t : 
  meter_id  start_read_time_local   end_read_time_local     start_read     end_read consumption  notes
----------  ----------------------  ----------------------  ----------  ----------- -----------  ---------------
      1111  2024-09-10 00:00:00+00  2024-09-10 01:00:00+00    79118.91     79118.93        0.02  null
      1111  2024-09-10 01:00:00+00  2024-09-10 02:00:00+00        null         null        null  Missing Data
      1111  2024-09-10 02:00:00+00  2024-09-10 03:00:00+00        null         null        null  Missing Data
      1111  2024-09-10 03:00:00+00  2024-09-10 04:00:00+00    79122.47     79123.99        1.52  null
      2222  2024-09-10 00:00:00+00  2024-09-10 01:00:00+00     1937135      1937174          39  null
      2222  2024-09-10 01:00:00+00  2024-09-10 02:00:00+00     1937174      1937191          17  null
      2222  2024-09-10 02:00:00+00  2024-09-10 03:00:00+00     1937191      1937197           6  null
      2222  2024-09-10 03:00:00+00  2024-09-10 04:00:00+00     1937197      1937202           5  null          */

See the fiddle here.

NOTE
This is tested with postgres but it should work on redshift with one adjustment in grid cte - adding hours to a date uses different syntax.

-- read the redshift docs to deal with it - but it should be like this
DATEADD('hrs', gs, DATE_TRUNC('day', wc.start_read_time_local)) as read_start, 
DATEADD('hrs', gs, DATE_TRUNC('day', wc.end_read_time_local)) as read_end
--  instead of Postgres syntax used in cte grid definition code for columns read_start and read_end

edited Nov 20, 2024 at 17:25

marc_s

760k186 gold badges1.4k silver badges1.5k bronze badges

answered Sep 15, 2024 at 12:48

d r

8,6722 gold badges9 silver badges19 bronze badges

6 Comments

Crazymonkey44 Over a year ago

Thank you for your reply! Can you indicate where the last part you mentioned ('DATEADD('hrs'.....') would go in the context of the larger context of the large code?

d r Over a year ago

@Crazymonkey44 It goes in grid cte code, in Select clause instead of current 2nd and 3rd line - the lines starting with DATE_TRUNC defining columns named as read_start and as read_end. The goal is to add gs number of hours to truncated dates of both time_local.

Crazymonkey44 Over a year ago

Thank you d r, tired to slot it in like you said, but got an error.

d r Over a year ago

@Crazymonkey44 It's imposible to tell anything without the code that you used and without the info about the error !? Could you provide it? I don't use redshift, but the code above works, and the only sintax difference is in adding hours to a daate. DATEADD() function is from redshift docs. It could be somethingg with your "larger context" (that weknow nothing about), too.

Crazymonkey44 Over a year ago

Sorry, I should have mentioned that I edited my original question to display the code that I used. See the code block after the text 'Following d r's advice, I tried the following:'

|

Collectives™ on Stack Overflow

Insert missing rows in time series consumption history

2 Answers 2

3 Comments

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related