0

I have a dataframe as the following, showing the relationship of different entities in each row.

Child Parent Ult_Parent Full_Family
A032 A001 A039 A001, A032, A039, A040, A041, A043, A043, A045, A046

In the "Full_Family" column, it's showing the correct hierarchy for the whole family tree from small to big, not just entities in that specific row. But to be aware of, not every entity in the "Full_Family" will appear in the "Child/Parent/Ult_Parent", as the "Full_Family" data is coming from the other source.

Here I have 2 problems need to solve:

  1. The Child/Parent/Ult_Parent order is not correct, how to based on the "Full_Family" column to identify the correct order?
  2. If possible, Can I also identify the correct Ult_Parent for a family in the whole table instead of each row. This correct Ult_Parent needs to be the entity that has appeared in Child/Parent/Ult_Parent columns.

Below is the example and ideal outcome:

Child Parent Ult_Parent Full_Family Correct_Order Correct_Ult_Parent_per_Family
A032 A001 A039 A001, A032, A039, A040, A041, A043, A043, A045, A046 A001, A032, A039 A043
A001 A043 A039 A001, A032, A039, A040, A041, A043, A043, A045, A046 A001, A039, A043 A043

Although A046 is the ultimate parent in this family, but it does not appear in this dataframe's Child/Parent/Ult_Parent, so the correct ultimate parent per family is A043 in this case.

Appreciate the help.

1 Answer 1

1

IIUC, assuming this dataframe as input:

import pandas as pd

data = {
    "Child": ["A032", "A001"],
    "Parent": ["A001", "A043"],
    "Ult_Parent": ["A039", "A039"],
    "Full_Family": [
        "A001, A032, A039, A040, A041, A043, A043, A045, A046",
        "A001, A032, A039, A040, A041, A043, A043, A045, A046",
    ],
}

df = pd.DataFrame(data)
  Child Parent Ult_Parent                                        Full_Family
0  A032   A001       A039  A001, A032, A039, A040, A041, A043, A043, A045...
1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...

You can use this approach:

df["Correct_Order"] = df.apply(
    lambda row: ", ".join(sorted([row["Parent"], row["Child"], row["Ult_Parent"]])),
    axis=1,
)

df["Correct_Ult_Parent_per_Family"] = (
    df[["Parent", "Child", "Ult_Parent"]].max(axis=1).max()
)
  Child Parent Ult_Parent                                        Full_Family     Correct_Order Correct_Ult_Parent_per_Family
0  A032   A001       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A032, A039                          A043
1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A039, A043                          A043

If 'Full_Family' is not necessarily in ascending order, and you want to respect its order, you can define a custom key to sorted.

For example, if A039 comes before A032 in 'Full_Family' in the first row:

data = {
    "Child": ["A032", "A001"],
    "Parent": ["A001", "A043"],
    "Ult_Parent": ["A039", "A039"],
    "Full_Family": [
        "A001, A039, A032, A040, A041, A043, A043, A045, A046",
        "A001, A032, A039, A040, A041, A043, A043, A045, A046",
    ],
}

df = pd.DataFrame(data)

Using a custom key:

df["Correct_Order"] = df.apply(
    lambda row: ", ".join(
        sorted(
            [row["Parent"], row["Child"], row["Ult_Parent"]],
            key=lambda x: {
                val: idx for idx, val in enumerate(row["Full_Family"].split(", "))
            }[x],
        )
    ),
    axis=1,
)

df["Correct_Ult_Parent_per_Family"] = df["Correct_Order"].str.split().str[-1].max()
  Child Parent Ult_Parent                                        Full_Family     Correct_Order Correct_Ult_Parent_per_Family
0  A032   A001       A039  A001, A039, A032, A040, A041, A043, A043, A045...  A001, A039, A032                          A043
1  A001   A043       A039  A001, A032, A039, A040, A041, A043, A043, A045...  A001, A039, A043                          A043
Sign up to request clarification or add additional context in comments.

6 Comments

I think it's a bit different from what I want. I think my example is not good. The order is not wrong because the Parent/Child position is wrong. It can be randomly wrong. Like the Child should be Ult_Parent and Ult_Parent should be Child. The point is we need to check the order in the Full_Family, so we can conclude the correct order. So I'm guessing using sorted and fix the order by Parent > Child > Ult_Parent is not what I want in this case. Thanks for the help!
@LH IIUC, the second part of my answer addresses that, using a custom key for sorted.
Just checked! Sorry I don't know why I didn't see the other parts. It works just as what I expect. Thank you :)
@LH No worries, I had edited the answer to add it, so maybe that's why you missed it. Also, seeing as 'Full_Family' is not necessarily ascending, I just realized we need a different way of getting 'Correct_Ult_Parent_per_Family' in the second part, so I'm editing it now to add it.
so I have a row contains ID like this '008ZY8-E', and I run this function and keep getting this KeyError: '008ZY8-E' do you know the reason why?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.