Row-level calculations within a window function?

Question

I'm working on this problem from LeetCode.com:

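+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| machine_id     | int     |
| process_id     | int     |
| activity_type  | enum    |
| timestamp      | float   |
+----------------+---------+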

The table shows the user activities for a factory website. (machine_id, process_id, activity_type) is the primary key (combination of columns with unique values) of this table. machine_id is the ID of a machine. process_id is the ID of a process running on the machine with ID machine_id. activity_type is an ENUM (category) of type ('start', 'end'). timestamp is a float representing the current time in seconds. 'start' means the machine starts the process at the given timestamp and 'end' means the machine ends the process at the given timestamp. The 'start' timestamp will always be before the 'end' timestamp for every (machine_id, process_id) pair.

There is a factory website that has several machines each running the same number of processes. Write a solution to find the average time each machine takes to complete a process.

The time to complete a process is the 'end' timestamp minus the 'start' timestamp. The average time is calculated by the total time to complete every process on the machine divided by the number of processes that were run.

The resulting table should have the machine_id along with the average time as processing_time, which should be rounded to 3 decimal places.

Return the result table in any order.

I was able to solve this problem using a self join, but I've read that best practice is to avoid self joins and use window functions instead. However, I can't figure out how to do that here, or whether it is possible at all.

I'm able to solve the problem with a self join as follows:

select a1.machine_id, round(avg(a2.timestamp-a1.timestamp), 3) as processing_time 
from Activity a1
join Activity a2 
on a1.machine_id=a2.machine_id and a1.process_id=a2.process_id
and a1.activity_type='start' and a2.activity_type='end'
group by a1.machine_id

I also am able to start a window function like this:

SELECT a1.machine_id, AVG(a1.timestamp) OVER (PARTITION BY machine_id) AS processing_time
FROM Activity AS a1

But is it possible to do a row-level calculation to subtract the times while aggregating these averages? Also, my above code (2nd block) doesn't work when I add group by a1.machine_id to the bottom. Can anyone explain why? Here's the specific error:

[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Column 'Activity.timestamp' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (8120) (SQLExecDirectW)

Something like this, perhaps: select machine_id, round(avg(endtime - timestamp), 3) as processing_time from (select *, lead(timestamp) over (partition by machine_id, process_id order by timestamp) as endtime from Activity a1) x where x.activity_type = 'start' group by machine_id
– siggemannen, Commented Aug 20, 2023 at 21:26
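
The one-line query in the comment above is easier to follow when laid out and run against a small table. Below is a minimal sketch that does so from Python with the standard-library sqlite3 module. SQLite (3.25 or later, for window functions) stands in for SQL Server purely as an assumption for portability, and the sample rows are made up for illustration. The subtraction is written as endtime - timestamp so that the durations come out positive.

```python
import sqlite3

# Assumption: SQLite >= 3.25 (window-function support) stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activity (
    machine_id    INTEGER,
    process_id    INTEGER,
    activity_type TEXT CHECK (activity_type IN ('start', 'end')),
    timestamp     REAL,
    PRIMARY KEY (machine_id, process_id, activity_type)
);
-- Hypothetical sample rows, not taken from the original problem.
INSERT INTO Activity VALUES
    (1, 1, 'start', 1.0), (1, 1, 'end', 2.0),
    (1, 2, 'start', 3.0), (1, 2, 'end', 5.0),
    (2, 1, 'start', 0.5), (2, 1, 'end', 1.0);
""")

lead_query = """
SELECT machine_id,
       ROUND(AVG(endtime - timestamp), 3) AS processing_time
FROM (
    -- LEAD pulls the next timestamp of the same process onto the current row,
    -- so each 'start' row carries its matching 'end' time.
    SELECT *,
           LEAD(timestamp) OVER (PARTITION BY machine_id, process_id
                                 ORDER BY timestamp) AS endtime
    FROM Activity
) AS x
WHERE x.activity_type = 'start'   -- 'end' rows have endtime = NULL; drop them
GROUP BY machine_id
ORDER BY machine_id
"""
lead_rows = conn.execute(lead_query).fetchall()
print(lead_rows)  # [(1, 1.5), (2, 0.5)]
```

The row-level subtraction happens in the outer SELECT, after the window function has attached the end time to each start row, so a plain GROUP BY aggregate can then average the differences.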

Dale K · Accepted Answer · 2023-08-20 21:50:03Z

Window functions are a bit tricky. A windowed AVG on its own cannot do what you want, because you first need the difference between the start and end timestamps of each process.

You can use LAG or LEAD to read the previous or next row and do the subtraction:

CREATE TABLE Activity (
    machine_id    int,
    process_id    int,
    activity_type VARCHAR(10) NOT NULL CHECK (activity_type IN ('begin', 'end')),
    timestamp     float
);

INSERT INTO Activity VALUES
    (1, 1, 'begin', 1.2), (1, 1, 'end', 2.2),
    (1, 2, 'begin', 3.2), (1, 2, 'end', 5.2);

-- For each 'end' row, subtract the previous timestamp of the same
-- (machine_id, process_id) pair; every other row yields NULL.
-- Ordering by timestamp (rather than activity_type) also works when the
-- first activity is labelled 'start', which sorts after 'end'.
WITH CTE AS (
    SELECT a1.machine_id,
           a1.process_id,
           CASE WHEN a1.activity_type = 'end'
                THEN a1.timestamp
                     - LAG(a1.timestamp) OVER (PARTITION BY a1.machine_id, a1.process_id
                                               ORDER BY a1.timestamp)
           END AS processing_time
    FROM Activity AS a1
)
SELECT machine_id,
       ROUND(AVG(processing_time), 3) AS processing_time
FROM CTE
WHERE processing_time IS NOT NULL
GROUP BY machine_id;

machine_id	processing_time
1	1.5

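For comparison, your self-join run against the same sample data returns no rows, because it filters on 'start' while the sample rows use 'begin':

select a1.machine_id, round(avg(a2.timestamp - a1.timestamp), 3) as processing_time
from Activity a1
join Activity a2
on a1.machine_id = a2.machine_id and a1.process_id = a2.process_id
and a1.activity_type = 'start' and a2.activity_type = 'end'
group by a1.machine_id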

machine_id	processing_time

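
One detail worth checking in the LAG approach is the ORDER BY inside the window. The sketch below is an illustration under stated assumptions: it uses Python's standard-library sqlite3 module with SQLite 3.25 or later in place of SQL Server, made-up sample rows, and a hypothetical helper named average_by_machine. It suggests that ordering by activity_type pairs the rows correctly only when the first label sorts before 'end' (as 'begin' does). With the question's 'start' label, ordering by timestamp is the safer choice, since the start timestamp is always before the end timestamp.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for SQL Server; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activity (machine_id INTEGER, process_id INTEGER,
                       activity_type TEXT, timestamp REAL);
INSERT INTO Activity VALUES
    (1, 1, 'start', 1.0), (1, 1, 'end', 2.0),
    (1, 2, 'start', 3.0), (1, 2, 'end', 5.0);
""")

def average_by_machine(order_column):
    """Run the LAG query, ordering each window by the given column."""
    # order_column is only ever one of the two fixed names passed below.
    query = f"""
    WITH cte AS (
        SELECT machine_id,
               CASE WHEN activity_type = 'end'
                    THEN timestamp - LAG(timestamp) OVER (
                             PARTITION BY machine_id, process_id
                             ORDER BY {order_column})
               END AS processing_time
        FROM Activity
    )
    -- AVG skips NULLs, so no WHERE filter is needed; leaving it out also
    -- keeps the machine visible when every difference is NULL.
    SELECT machine_id, ROUND(AVG(processing_time), 3)
    FROM cte
    GROUP BY machine_id
    """
    return conn.execute(query).fetchall()

by_timestamp = average_by_machine("timestamp")
by_type = average_by_machine("activity_type")
print(by_timestamp)  # [(1, 1.5)]
print(by_type)       # [(1, None)]: 'end' sorts before 'start', so LAG finds no prior row
```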

Bohemian · Accepted Answer · 2023-08-20 22:00:34Z

If you're asking about performance, self-joins are not necessarily evil. You can improve your query performance slightly as follows:

select
    a1.machine_id,
    -- a1 is now the 'end' row and a2 the 'start' row, so subtract a2 from a1
    round(avg(a1.timestamp - a2.timestamp), 3) as processing_time
from Activity a1
join Activity a2 on a1.machine_id = a2.machine_id
  and a1.process_id = a2.process_id
  and a2.activity_type = 'start'
where a1.activity_type = 'end'
group by a1.machine_id

I've reversed the start and end conditions: by selecting only end activities (via the where clause) from the first accessed table, you avoid processing start rows that have no matching end rows.

Although the optimizer should apply the condition a2.activity_type = 'start' during the join, you can make it practically a certainty by moving that condition into the join, which may reduce the number of intermediate rows processed.
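
As a sanity check that reversing the roles does not change the answer, here is a minimal sketch comparing both self-join formulations. It runs on SQLite through Python's standard-library sqlite3 module rather than SQL Server, and the sample rows (including one unfinished process) are hypothetical. It only checks that the results agree; it says nothing about performance, which depends on the real database's optimizer and indexes. Note that with a1 as the end row, the subtraction has to be a1.timestamp - a2.timestamp to stay positive.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Activity (machine_id INTEGER, process_id INTEGER,
                       activity_type TEXT, timestamp REAL);
-- Process (2, 2) has a 'start' row but no 'end' row.
INSERT INTO Activity VALUES
    (1, 1, 'start', 1.0), (1, 1, 'end', 2.0),
    (1, 2, 'start', 3.0), (1, 2, 'end', 5.0),
    (2, 1, 'start', 0.5), (2, 1, 'end', 1.0),
    (2, 2, 'start', 4.0);
""")

# Original form: a1 = start row, a2 = end row, all conditions in the join.
start_first = conn.execute("""
SELECT a1.machine_id, ROUND(AVG(a2.timestamp - a1.timestamp), 3)
FROM Activity a1
JOIN Activity a2
  ON a1.machine_id = a2.machine_id AND a1.process_id = a2.process_id
 AND a1.activity_type = 'start' AND a2.activity_type = 'end'
GROUP BY a1.machine_id
ORDER BY a1.machine_id
""").fetchall()

# Reversed form: a1 = end row (filtered in WHERE), a2 = start row.
end_first = conn.execute("""
SELECT a1.machine_id, ROUND(AVG(a1.timestamp - a2.timestamp), 3)
FROM Activity a1
JOIN Activity a2
  ON a1.machine_id = a2.machine_id AND a1.process_id = a2.process_id
 AND a2.activity_type = 'start'
WHERE a1.activity_type = 'end'
GROUP BY a1.machine_id
ORDER BY a1.machine_id
""").fetchall()

print(start_first)  # [(1, 1.5), (2, 0.5)]
print(end_first)    # [(1, 1.5), (2, 0.5)]
```

In both forms the inner join drops the unfinished process (2, 2), so it does not affect machine 2's average.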
