1

I have a table data as show below.

cust_id city_type city_name start_date
1 physical Las Vegas 5/17/2024
1 office Seattle 5/17/2024
1 office Dallas 9/20/2024
1 physical Dallas 10/30/2024
1 office Austin 2/10/2025

and I am expecting the below result. My data has around 1M customer_ids. Trying to implement SCD type 2 style.

cust_id physical_city office_city start_date end_date
1 Las Vegas Seattle 5/17/2024 9/20/2024
1 Las Vegas Dallas 9/20/2024 10/30/2024
1 Dallas Dallas 10/30/2024 2/10/2025
1 Dallas Austin 2/10/2025 12/31/9999

I have tried below query but I am missing something that I am not getting the required answer.

select 
  cust_id,
  max(case when city_type='Physical' then city_name end) as physical_city,
  max(case when city_type='Office' then city_name end) as office_city,
  start_date,
  coalesce(lead(start_date) over(partition by party_id order by start_date), '9999-12-31') as end_date
from table
group by cust_id, start_date

Thank you!

2
  • please provide more details: a) what is the rule to choose the end_date? For the 1st record in your output, why did you choose "9/20/2024" instead of "5/17/2024" which is the date associated with "Seatle" (2nd row from input)? b) why does your output return the 3rd record, with "Dalas" and "Dalas"? considering that for "Las Vegas"(as physical) you didn't create a similar record (LA and LA). I believe that the 3rd record should not exist. Commented Feb 19 at 3:20
  • Thank you for comment ! The rule for end_date is When there is new record for either physical or office city. The city does not change until its new record comes for physical or office city. In the Expected output, 1st row - Both physical and office city started records on same day. 2nd row - Office city moved from Seattle to Dallas on 9/20/2024. 3rd row - Physical city moved from Las Vegas to Dallas on 10/30/2024 but office city is same as Dallas from previous record. Commented Feb 19 at 4:07

2 Answers 2

0

It is necessary to consider cases when the data one type ('office' or 'physical') change 2 or more times in a row sequence ordered by start_date. For example:

cust_id city_type city_name start_date
2 office Seattle 2024-05-17
2 physical Las Vegas 2024-05-17
2 office Dallas 2024-09-20
2 office Boston 2024-09-21
2 office Seatle 2024-09-22
2 physical Dallas 2024-10-30
2 office Austin 2025-02-10

After grouping (pivot) we have

cust_id start_date phisycal_city office_city
2 2024-05-17 Las Vegas Seattle
2 2024-09-20 null Dallas
2 2024-09-21 null Boston
2 2024-09-22 null Seatle
2 2024-10-30 Dallas null
2 2025-02-10 null Austin

So, calculation's like lead(start_date...) not enough.

  1. Take ranges for every city_type start_date - to_date. We use there partition by cust_id,city_type
select *
  ,lead(start_date,1,cast('9999-12-31' as date))
    over(partition by cust_id,city_type order by start_date) to_date
From test
  1. Group data by start_date (pivot). We take city_name or null for this date.
  2. JOIN ranged_data to cover null's

Test data (expanded)
cust_id=1 - your example
cust_id=2 - 'office' city change twice
cust_id=3 - 'physical' city change twice
cust_id=4,5 - first rows for 'physical' and 'office' do nit have same date

See example

cust_id city_type city_name start_date
1 physical Las Vegas 2024-05-17
1 office Seattle 2024-05-17
1 office Dallas 2024-09-20
1 physical Dallas 2024-10-30
1 office Austin 2025-02-10
2 office Seattle 2024-05-17
2 physical Las Vegas 2024-05-17
2 office Dallas 2024-09-20
2 office Boston 2024-09-21
2 physical Dallas 2024-10-30
2 office Austin 2025-02-10
3 office Seattle 2024-05-17
3 physical Las Vegas 2024-05-17
3 physical Dallas 2024-09-20
3 office Boston 2024-09-21
3 physical Seatle 2024-10-30
3 physical Austin 2025-02-10
4 office Dallas 2024-10-30
4 physical Austin 2025-02-10
5 physical Dallas 2024-10-30
5 office Austin 2025-02-10
with ranged_data as(
select *
  ,lead(start_date,1,cast('9999-12-31' as date))
    over(partition by cust_id,city_type order by start_date) to_date
From test
)
,agg_data as(
select cust_id,start_date
  ,max(case when city_type='physical' then city_name end) physical_city
  ,max(case when city_type='office' then city_name end) office_city
from test
group by cust_id,start_date
)
select ad.cust_id,ad.start_date
  ,lead(ad.start_date,1,cast('9999-12-31' as date))
    over(partition by ad.cust_id order by ad.start_date) end_date
  ,coalesce(ad.physical_city,rd2.city_name) physical_city
  ,coalesce(ad.office_city,rd1.city_name) office_city
from agg_data ad
left join ranged_data rd1 on rd1.cust_id=ad.cust_id and rd1.city_type='office'
      and  rd1.start_date<ad.start_date and rd1.to_date>=ad.start_date
left join ranged_data rd2 on rd2.cust_id=ad.cust_id and rd2.city_type='physical'
      and  rd2.start_date<ad.start_date and rd2.to_date>=ad.start_date
order by ad.cust_id,ad.start_date
cust_id start_date end_date physical_city office_city
1 2024-05-17 2024-09-20 Las Vegas Seattle
1 2024-09-20 2024-10-30 Las Vegas Dallas
1 2024-10-30 2025-02-10 Dallas Dallas
1 2025-02-10 9999-12-31 Dallas Austin
2 2024-05-17 2024-09-20 Las Vegas Seattle
2 2024-09-20 2024-09-21 Las Vegas Dallas
2 2024-09-21 2024-10-30 Las Vegas Boston
2 2024-10-30 2025-02-10 Dallas Boston
2 2025-02-10 9999-12-31 Dallas Austin
3 2024-05-17 2024-09-20 Las Vegas Seattle
3 2024-09-20 2024-09-21 Dallas Seattle
3 2024-09-21 2024-10-30 Dallas Boston
3 2024-10-30 2025-02-10 Seatle Boston
3 2025-02-10 9999-12-31 Austin Boston
4 2024-10-30 2025-02-10 null Dallas
4 2025-02-10 9999-12-31 Austin Dallas
5 2024-10-30 2025-02-10 Dallas null
5 2025-02-10 9999-12-31 Dallas Austin

fiddle

Sign up to request clarification or add additional context in comments.

Comments

0

First of all, we are going to transform the original table by separating the city_name into two columns namely physical_city & office_city and also finding the next_start_date by using city_transitions.

Then obtaining previous column data for physical_city & office_city by using Coalesce & Lag function. Assuming that office city will be empty for the first record.

Here is the solution:

WITH city_transitions AS (
SELECT 
    cust_id,
    (CASE WHEN city_type = 'physical' THEN city_name END) as physical_city,
    (CASE WHEN city_type = 'office' THEN city_name END) as office_city,
    start_date,
    LEAD(start_date) OVER (PARTITION BY cust_id ORDER BY start_date) AS next_start_date
FROM your_table
),

filled_cities AS (
SELECT 
    cust_id,
    Coalesce(physical_city, Lag(physical_city) Over (Partition by cust_id ORDER BY start_date)) as physical_city,
    Coalesce(office_city, Lag(office_city) Over (Partition by cust_id ORDER BY start_date)) as office_city,
    start_date,
    next_start_date
FROM city_transitions
)

SELECT 
cust_id,
Coalesce(physical_city, Lag(physical_city) Over (Partition by cust_id ORDER BY start_date)) as physical_city,
Coalesce(office_city, Lag(office_city) Over (Partition by cust_id ORDER BY start_date)) as office_city,
start_date,
Coalesce(next_start_date, '9999-12-31') as end_date
FROM filled_cities
WHERE office_city IS NOT NULL
ORDER BY cust_id, start_date;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.