Merged
cleanups + formatting
lithomas1 committed Mar 12, 2023
commit 03ace50ba070fe9e57ac0d43e2a34f13e43f2093
40 changes: 17 additions & 23 deletions web/pandas/pdeps/0008-inplace-methods-in-pandas.md
@@ -60,9 +60,8 @@ ways" to achieve the same result:
subtle bugs and is harder to debug.

Finally, there are also methods that have a `copy` keyword instead of an `inplace` keyword (which also avoids copying
the data in the case of `copy=False`, but still returns a new object referencing the same data instead of updating the
calling object), adding to the inconsistencies. This `copy=False` option also has become redundant with the introduction
of Copy-on-Write.
the data when `copy=False`, but returns a new object referencing the same data instead of updating the calling object),
adding to the inconsistencies. This keyword is also redundant now with the introduction of Copy-on-Write.
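
The difference between `copy=False` and updating the calling object can be sketched as follows. This is only an illustration: it assumes an installed pandas version whose `rename` still accepts the `copy` keyword, and the frame and column names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Assumption: this pandas version still accepts ``copy`` on ``rename``.
renamed = df.rename(columns={"a": "A"}, copy=False)

# A new object comes back; the calling object keeps its old labels.
print(renamed is df)
print(list(df.columns), list(renamed.columns))

# The two objects may still reference the same underlying buffer.
shares_buffer = np.shares_memory(df["a"].to_numpy(), renamed["A"].to_numpy())
print(shares_buffer)
```

In other words, `copy=False` only avoids duplicating the data; it never rebinds or mutates `df` itself.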

Given the above reasons, we are convinced that there is no need for either the `inplace` or the `copy` keyword (except
for a small subset of methods that can actually update data inplace). Removing those keywords will give a more
@@ -94,7 +93,7 @@ the ``copy`` and ``inplace`` keywords, with the value of ``inplace`` overwriting
To summarize the status quo of inplace behavior of methods, we have divided methods that can operate inplace or have
an ``inplace``/``copy`` keyword into 4 groups:

**Group 1: Methods that always operate inplace**
**Group 1: Methods that always operate inplace (no user control via the ``inplace``/``copy`` keyword)**

| Method Name |
|:--------------|
@@ -103,8 +102,6 @@ an ``inplace``/``copy`` keyword into 4 groups:
| ``update`` |
Member

is update always inplace? e.g. in

In [43]: df1 = pd.DataFrame({'numbers': [1, np.nan]})
In [44]: df2 = pd.DataFrame({'numbers': ['foo', 'bar']})
In [45]: df1.update(df2)

Member

@phofl phofl Feb 28, 2023

It always operates inplace on the pandas object, but not necessarily on the underlying data/array. Your example would update df1, but the array would be copied before the update actually happens. Same as with replace and friends when the value to set is not compatible with the array dtype
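
A minimal sketch of the object-level behavior described here, using compatible dtypes so that no cast is involved. The values are made up for illustration, and the sketch makes no claim about whether the underlying array is reused.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"numbers": [1.0, np.nan]})
df2 = pd.DataFrame({"numbers": [10.0, 20.0]})  # same dtype as df1

alias = df1       # second name bound to the same pandas object
df1.update(df2)   # no ``inplace`` keyword: the calling object itself changes

print(alias is df1)
print(alias["numbers"].tolist())
```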

Member

sure but then should it belong in group 2?

Member

There is no inplace keyword, so we don't have to change anything, same with insert and pop

Member

@MarcoGorelli MarcoGorelli Feb 28, 2023

I don't understand the title then

methods that always operate inplace

As in, methods that always operate inplace on the pandas object?

Because in another part of the tutorial the wording

Some of the methods with an inplace keyword can actually work inplace

is used - I find it confusing when you're talking about "actually working inplace" and "modifying the pandas object"

Member

Yes, that would make it clearer, thanks

Member

If we change the keywords do we just go all in and make those copy_pandas_object and copy_underlying_data? Or something similar.

Obviously more verbose but yea I agree with @MarcoGorelli sentiment that inplace is vague to the point of misinterpretation

Member

copy_pandas_object won't be necessary with CoW. We will only need a keyword where we actually can modify the underlying data inplace.

Member

@WillAyd to be clear, my comment above was about terminology to use in the text explaining this, not for actual keyword names (from your comment I get the impression you are talking about actual keywords)

Member

I'm probably way too abstracted from all of this, but find the terminology confusing. I didn't realize that CoW only ever referred to the pandas "container" object and didn't take a point of view on the underlying data. So if that is the long-term commitment then yea copy_pandas_object is a bit redundant, although I wonder if there is other terminology we need to cover if/when CoW extends to the underlying data

Generally it feels like there is a lot to be gained from a more explicit keyword than inplace when we want to modify the underlying data; I think this is in line with @Dr-Irv's points

| ``isetitem``* |

These methods always operate inplace and don't have the ``inplace`` or ``copy`` keyword.

\* Although ``isetitem`` operates on the original pandas object inplace, it will not change any existing values
inplace (it will remove the values of the column being set, and insert new values).
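
The footnote can be illustrated with a short sketch. It assumes a pandas version that provides `DataFrame.isetitem`; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
old_a = df["a"]               # Series taken before the column is replaced

df.isetitem(0, [10, 20, 30])  # replaces column 0 on ``df`` itself

print(df["a"].tolist())       # the calling object now holds the new column
print(old_a.tolist())         # the previously extracted values were not overwritten
```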

@@ -121,7 +118,8 @@ inplace (it will remove the values of the column being set, and insert new value
| ``bfill`` |
| ``clip`` |

These methods don't operate inplace by default, but can be done inplace with `inplace=True` if the dtypes are compatible (e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave
These methods don't operate inplace by default, but can be done inplace with `inplace=True` if the dtypes are compatible
(e.g. the values replacing the old values can be stored in the original array without an astype). All those methods leave
the structure of the DataFrame or Series intact (shape, row/column labels), but can mutate some elements of the data of
the DataFrame or Series.
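
As a rough sketch of a group 2 method with a compatible dtype, consider `fillna` on a float Series. The fill value `0.0` fits the existing `float64` dtype, so no `astype` is needed; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# 0.0 can be stored in the existing float64 array without a cast.
s.fillna(0.0, inplace=True)

# Shape, labels, and dtype are unchanged; only one element was mutated.
print(s.tolist())
print(s.shape, list(s.index), s.dtype)
```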

@@ -162,8 +160,8 @@ Two methods also have both keywords: `rename`, `rename_axis`, with the `inplace`

**Group 4: Methods that can never operate inplace**

| Method Name | Keyword |
|:-------------------------|-------------|
| Method Name | Keyword |
|:-----------------------|-----------|
| `drop` (dropping rows) | `inplace` |
| `dropna` | `inplace` |
| `drop_duplicates` | `inplace` |
@@ -178,9 +176,9 @@ Two methods also have both keywords: `rename`, `rename_axis`, with the `inplace`
| `reindex_like` | `copy` |
| `truncate` | `copy` |

Although all of these methods have either `inplace` or `copy`, they can never operate inplace because the nature of the
Although these methods have the `inplace`/`copy` keywords, they can never operate inplace because the nature of the
operation requires copying (such as reordering or dropping rows). For those methods, `inplace=True` is essentially just
syntactic sugar for reassigning the new result to `self` (the calling DataFrame).
syntactic sugar for reassigning the new result to the calling DataFrame/Series.
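
The "syntactic sugar" reading can be sketched with `dropna`, one of the methods listed in the table. The example data is made up, and the comparison only shows that both spellings end with the same contents.

```python
import numpy as np
import pandas as pd

df_inplace = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df_reassign = df_inplace.copy()

# Spelling 1: the ``inplace`` keyword.
df_inplace.dropna(inplace=True)

# Spelling 2: reassign the result to the same name.
df_reassign = df_reassign.dropna()

print(df_inplace.equals(df_reassign))
```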

Note: in the case of a "no-op" (for example when sorting an already sorted DataFrame), some of those methods might not
need to perform a copy. This currently happens with Copy-on-Write (regardless of `inplace`), but this is considered an
@@ -193,14 +191,11 @@ The methods from group 1 won't change behavior, and will remain always inplace.
Methods in groups 3 and 4 will lose their `copy` and `inplace` keywords. Under Copy-on-Write, every operation will
potentially return a shallow copy of the input object, if the performed operation does not require a copy. This is
equivalent to behavior with `copy=False` and/or `inplace=True` for those methods. If users want to make a hard
copy (`copy=True`), they can do:

:::python
df = df.func().copy()
copy (`copy=True`), they can call the `copy()` method on the result of the operation.
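
A minimal sketch of that pattern, using `reset_index` as a stand-in for any such operation (the variable name `independent` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3]})

# Chain ``copy()`` onto the result to get data that is independent of ``df``.
independent = df.reset_index(drop=True).copy()
independent.iloc[0, 0] = 100

print(df["foo"].tolist())           # the original is untouched
print(independent["foo"].tolist())
```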

Therefore, there is no benefit in keeping the keywords around for these methods.

Users can emulate the behavior of the `inplace` keyword by assigning the result of an operation to the same variable:
To emulate the behavior of the `inplace` keyword, we can reassign the result of an operation to the same variable:

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
@@ -210,8 +205,7 @@ User can emulate behavior of the `inplace` keyword by assigning the result of an
All references to the original object will go out of scope when the result of the `reset_index` operation is assigned
to `df`. As a consequence, `iloc` will continue to operate inplace, and the underlying data will not be copied.

The methods in group 2 behave differently compared to the first three groups. These methods are actually able to operate
inplace because they only modify the underlying data.
Group 2 methods differ, though, since they only modify the underlying data, and therefore can be inplace.

:::python
df = pd.DataFrame({"foo": [1, 2, 3]})
@@ -336,19 +330,19 @@ There are some behaviour changes (for example the current `copy=False` returning
actual" shallow copy, but protected under Copy-on-Write), but those behaviour changes are covered by the Copy-on-Write
proposal[^1].

## Alternatives
## Rejected ideas

### Remove the `inplace` keyword altogether

In the past, it was considered to remove the `inplace` keyword entirely. This was because many operations that had
In the past, it was considered to remove the `inplace` keyword entirely. This was because many methods with
the `inplace` keyword did not actually operate inplace, but made a copy and re-assigned the underlying values under
the hood, causing confusion and providing no real benefit to users.

Because a majority of the methods supporting `inplace` did not operate inplace, it was considered at the time to
deprecate and remove inplace from all methods, and add back the keyword as necessary.[^3]

For the subset of methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace`
keyword for those as well could give a significant performance regression when currently using this keyword with large
For methods where the operation actually _can_ be done inplace (group 2), however, removing the `inplace`
keyword could give a significant performance regression when currently using this keyword with large
DataFrames. Therefore, we decided to keep the `inplace` keyword for this small subset of methods.

### Standardize on the `copy` keyword instead of `inplace`
@@ -382,7 +376,7 @@ It may be helpful to review those discussions (see links) [^2] [^3] [^4] to bett
Copy-on-Write is a relatively new feature (added in version 1.5) and some methods are missing the "lazy copy"
optimization (equivalent to `copy=False`).

Therefore, we will start showing deprecation warnings for the `copy` and `inplace` parameters in pandas 2.1, to
Therefore, we propose deprecating the `copy` and `inplace` parameters in pandas 2.1, to
allow for bugs with Copy-on-Write to be addressed and for more optimizations to be added.

Hopefully, users will be able to switch to Copy-on-Write to keep the no-copy behavior and to silence the warnings.
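
Opting in to Copy-on-Write might look like the sketch below. It assumes a pandas version that exposes the `mode.copy_on_write` option; on versions where Copy-on-Write is always active, the option may be deprecated or unnecessary. The variable names are made up.

```python
import pandas as pd

# Assumption: the ``mode.copy_on_write`` option exists in the installed pandas.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"foo": [1, 2, 3]})
    result = df.reset_index(drop=True)  # no ``copy``/``inplace`` keyword needed
    result.iloc[0, 0] = 100             # writing to the result leaves ``df`` alone

    parent_values = df["foo"].tolist()
    result_values = result["foo"].tolist()

print(parent_values, result_values)
```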