2

Is it possible to copy a dataframe in the middle of a method chain to a new variable? Something like:

import pandas as pd

df = (pd.DataFrame([[2, 4, 6],
                    [8, 10, 12],
                    [14, 16, 18],
                    ])
      .assign(something_else=100)
      .div(2)
      .copy_to_new_variable(df_imag)  # Imaginary method that copies df to df_imag.
      .div(10)
      )

print(df_imag) would then print:

     0    1    2  something_else
0  1.0  2.0  3.0            50.0
1  4.0  5.0  6.0            50.0
2  7.0  8.0  9.0            50.0

.copy_to_new_variable(df_imag) could be replaced by df_imag = df.copy(), but that would break the method chain.

5 Comments
  • Ok. Method chaining in pandas does not improve the speed of evaluation. Chained methods don't get optimised. It would make absolutely no difference if you just stopped the chain, took a manual copy, and then continued processing the df. Commented Sep 15, 2023 at 20:53
  • Agree with roganjosh's recommendation: code with explicit side effects will be confusing, and it is clearer to keep the assignment separate from the expression chain. Why are you doing this? Just to make a copy for debugging, or for production code? Commented Sep 15, 2023 at 21:05
  • It should be for production. It would increase readability if I could use something like .copy_to_new_variable(df_imag) instead of the := operator. Thank you for your thoughts. Commented Sep 15, 2023 at 21:18
  • mouwsy: .copy_to_new_variable(df_imag) would be syntactic sugar for df_imag :=. But pandas [df.copy()](pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html) intentionally doesn't allow you to use an assignment target on the RHS; they really don't want you putting assignments with side effects in a pipeline. Why do you want to do this in production? That sort of code will break lots of things, like optimization (e.g. numba). By the way, do you want the copy to be a deep copy or a shallow copy? Is your dataframe ints, floats, strings, arbitrary objects...? Commented Sep 15, 2023 at 21:33
  • Very, very important to understand that df_imag = df does not copy the data frame. Commented Sep 16, 2023 at 8:44
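
The last comment's point can be checked directly. The following is a minimal sketch, using a small made-up frame and illustrative variable names (alias, independent) that are not part of the question: plain assignment only adds a second name for the same object, while .copy() produces separate data.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

alias = df               # plain assignment: a second name for the same object
independent = df.copy()  # .copy() (deep by default): separate data

df.loc[0, "a"] = 99      # modify the original frame in place

print(alias is df)              # True: both names refer to one object
print(alias.loc[0, "a"])        # 99: the change is visible through the alias
print(independent.loc[0, "a"])  # 1: the copy is unaffected
```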

3 Answers

3

Use the := operator:

df = (df_imag := df.assign(new_var=100).div(2)).div(10)
print(df)
print(df_imag)

Prints:

     0    1    2  new_var
0  0.1  0.2  0.3      5.0
1  0.4  0.5  0.6      5.0
2  0.7  0.8  0.9      5.0

     0    1    2  new_var
0  1.0  2.0  3.0     50.0
1  4.0  5.0  6.0     50.0
2  7.0  8.0  9.0     50.0
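
Applied to the frame from the question, the same idea might look like the sketch below. It assumes Python 3.8 or later; the parentheses around the assignment expression are required so that df_imag is bound to the intermediate result before the final .div(10) runs.

```python
import pandas as pd

df = (
    # The inner parentheses bind df_imag to everything up to .div(2).
    (df_imag := pd.DataFrame([[2, 4, 6],
                              [8, 10, 12],
                              [14, 16, 18]])
                  .assign(something_else=100)
                  .div(2))
    .div(10)  # returns a new frame, so df_imag is left untouched
)

print(df_imag)
print(df)
```

No explicit .copy() is needed here because .div(10) returns a new DataFrame rather than modifying df_imag.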

1 Comment

You should probably note that := is a Python 3.8+ operator.
3

Creating variables dynamically is not a good idea, but you can easily take advantage of mutable objects like dictionaries.

Adding a new DataFrame method to do this seamlessly:

import pandas as pd
from pandas.core.base import PandasObject

### this only needs to be done once per session
def to_name(df, dic, name, copy=False):
    dic[name] = df.copy() if copy else df
    return df
    
PandasObject.to_name = to_name
###

tmp = {}

df = (pd.DataFrame([[2, 4, 6],
                    [8, 10, 12],
                    [14, 16, 18],
                    ])
      .assign(something_else=100)
      .div(2)
      .to_name(tmp, 'after_div2', copy=True)
      .div(10)
      )

print(tmp['after_div2'])

print(df)

Output:

# tmp['after_div2']
     0    1    2  something_else
0  1.0  2.0  3.0            50.0
1  4.0  5.0  6.0            50.0
2  7.0  8.0  9.0            50.0

# df
     0    1    2  something_else
0  0.1  0.2  0.3             5.0
1  0.4  0.5  0.6             5.0
2  0.7  0.8  0.9             5.0

If you don't want to monkey patch the DataFrame objects, use pipe:

def to_name(df, dic, name, copy=False):
    dic[name] = df.copy() if copy else df
    return df

tmp = {}

df = (pd.DataFrame([[2, 4, 6],
                    [8, 10, 12],
                    [14, 16, 18],
                    ])
      .assign(something_else=100)
      .div(2)
      .pipe(to_name, tmp, 'after_div2')
      .div(10)
      .pipe(lambda df: print('\nQuick alternative:', df, sep='\n') or df)
      )

print(tmp['after_div2'])
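
Note that the pipe call above does not pass copy=True, so the dictionary stores a reference to the intermediate frame rather than a copy. A small sketch of the difference, using a made-up one-column frame and illustrative dictionary keys (by_reference, as_copy):

```python
import pandas as pd

def to_name(df, dic, name, copy=False):
    # Store the frame (or a copy of it) under `name`, then pass it along.
    dic[name] = df.copy() if copy else df
    return df

tmp = {}
base = pd.DataFrame({"x": [2, 4, 6]})

result = (base
          .pipe(to_name, tmp, "by_reference")        # stores the same object
          .pipe(to_name, tmp, "as_copy", copy=True)  # stores an independent copy
          .div(2))

print(tmp["by_reference"] is base)  # True
print(tmp["as_copy"] is base)       # False

base.loc[0, "x"] = 100              # in-place change to the original frame
print(tmp["by_reference"].loc[0, "x"])  # 100
print(tmp["as_copy"].loc[0, "x"])       # 2
```

Storing a reference is usually harmless in a chain whose later steps return new frames, but copy=True is the safer choice if the stored frame might later be modified in place.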

Printing

Along the same lines, you can also add a chainable print method, or again use a lambda in pipe:

from pandas.core.base import PandasObject

### this only needs to be done once per session
def df_print(df, *args):
    if args:
        print(*args)
    print(df)
    return df
    
PandasObject.print = df_print
###

df = (pd.DataFrame([[2, 4, 6],
                    [8, 10, 12],
                    [14, 16, 18],
                    ])
      .print()
      .assign(something_else=100)
      .div(2)
      .print('\nAfter 2:')
      .div(10)
      .pipe(lambda df: print('\nQuick alternative:', df, sep='\n') or df)
      )

Output:

    0   1   2
0   2   4   6
1   8  10  12
2  14  16  18

After 2:
     0    1    2  something_else
0  1.0  2.0  3.0            50.0
1  4.0  5.0  6.0            50.0
2  7.0  8.0  9.0            50.0

Quick alternative:
     0    1    2  something_else
0  0.1  0.2  0.3             5.0
1  0.4  0.5  0.6             5.0
2  0.7  0.8  0.9             5.0

As a module

You could also create a module:

pandas_debug.py

from pandas.core.base import PandasObject

def df_print(df, *args):
    if args:
        print(*args)
    print(df)
    return df
    
PandasObject.print = df_print

def to_name(df, dic, name, copy=False):
    dic[name] = df.copy() if copy else df
    return df

PandasObject.to_name = to_name

Then in your code:

import pandas as pd
import pandas_debug

tmp = {}
df = (pd.DataFrame([[2, 4, 6],
                    [8, 10, 12],
                    [14, 16, 18],
                    ])
      .assign(something_else=100)
      .div(2)
      .to_name(tmp, 'after_div2')
      .div(10)
      .print()
      )

2 Comments

Thank you very much, this is what I was looking for. Just for better understanding: What does from pandas.core.base import PandasObject and PandasObject.to_name = to_name do? Can I drop these? Because the code also works without these.
This is required to add a new DataFrame method (to_name), which otherwise wouldn't exist, but you only have to run it once per session and it will work for all DataFrames. You do not need it if you use the pipe approach.
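
To illustrate the reply above, here is a rough sketch of what the assignment does. It relies on DataFrame inheriting from PandasObject, so a function attached to that base class becomes available as a method on every DataFrame; the store dictionary and the frame are made up for the demonstration.

```python
import pandas as pd
from pandas.core.base import PandasObject

def to_name(df, dic, name, copy=False):
    dic[name] = df.copy() if copy else df
    return df

# DataFrame inherits from PandasObject, so attributes set on the base
# class are visible on every DataFrame instance.
print(issubclass(pd.DataFrame, PandasObject))  # True

PandasObject.to_name = to_name  # the monkey patch

store = {}
frame = pd.DataFrame({"a": [1, 2]})
same = frame.to_name(store, "snapshot")  # now callable as a method

print(same is frame)               # True: the method returns the frame itself
print(store["snapshot"] is frame)  # True: stored by reference (copy=False)
```

If the code seems to work without these two lines, a likely explanation is that the patch was already applied earlier in the same session.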
0

Actually, this video (and also that video; same approach for polars) describes what I was looking for. You can check the links; the idea comes from Matt Harrison (who wrote multiple books about pandas) and is meant for debugging method chains. This approach is also recommended in the great article "4 Pandas Anti-Patterns to Avoid and How to Fix Them" by Aidan Cooper.

import pandas as pd

def to_df(df, name):
    globals()[name] = df.copy()
    return df

df = (pd.DataFrame([[1, 2, 3],
                    [10, 10, 10],
                    ], columns=["A", "B", "C"]
                   )
      .set_index("C")
      .pipe(to_df, "df_imag")
      .sum()
      )

df_imag is then the intermediate dataframe as described in the question.


Another approach, which only works in Jupyter notebooks, is to use .pipe(lambda df_: display(df_) or df_) to view the dataframe midway through the chain without interrupting it. This is also explained in the aforementioned article:

import pandas as pd

df = (
    pd.DataFrame(
        [
            [2, 4, 6],
            [8, 10, 12],
            [14, 16, 18],
        ]
    )
    .assign(something_else=100)
    .div(2)
    .pipe(lambda df_: display(df_) or df_)
    .div(10)
)
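
For code that also has to run outside a notebook, one possible variation is sketched below. The helper name peek and its label parameter are made up for this example; the helper falls back to print when IPython is not installed and returns the frame explicitly instead of relying on display returning None.

```python
import pandas as pd

try:
    from IPython.display import display  # rich rendering inside notebooks
except ImportError:
    display = print  # plain-text fallback when IPython is unavailable

def peek(df_, label=None):
    """Show the frame (with an optional label) and return it unchanged."""
    if label is not None:
        print(label)
    display(df_)
    return df_

df = (
    pd.DataFrame([[2, 4, 6],
                  [8, 10, 12],
                  [14, 16, 18]])
    .assign(something_else=100)
    .div(2)
    .pipe(peek, "After div(2):")
    .div(10)
)
```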
