Skip to content

Conversation

@dannyi96
Copy link
Contributor

@dannyi96 dannyi96 commented Oct 8, 2022

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this fix the defaultdict mutation issue? #47527

@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Series Series data structure labels Oct 10, 2022
@dannyi96
Copy link
Contributor Author

dannyi96 commented Oct 10, 2022

Does this fix the defaultdict mutation issue? #47527

Just tested this ( & followed the older threads on this issue ).
Yes, these changes does fix the defaultdict mutation issue.

Eg -
Before changes -

s = pd.Series([1, 2, np.nan])
default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", np.nan: "c"})

s.map(default_map) 
# 0              a
# 1              b
# 2    missing
# dtype: object

# defaultdict is mutated
default_map 
# defaultdict(<function <lambda> at 0x7f3e792833a0>, {1: 'a', 2: 'b', nan: 'c', nan: 'missing'})

After changes -

s = pd.Series([1, 2, np.nan])
default_map = defaultdict(lambda: "missing", {1: "a", 2: "b", np.nan: "c"})

s.map(default_map) 
# 0    a
# 1    b
# 2    c
# dtype: object

# defaultdict is not mutated
default_map 
# defaultdict(<function <lambda> at 0x7fe002b773a0>, {1: 'a', 2: 'b', nan: 'c'})

ser = Series([1, np.nan, 2])
result = ser.map(mapping)
expected = Series([10, 0, 0])
expected = Series([10, 42, 0])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this older testcase seemed incorrect & needed to be corrected with this change.
Am however unsure if the older behavior is as per intention.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the discussion we had in #47585 the old behavior is correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh I see.
Do we keep it as it is now ? or is this enhancement fine - since there are some discrepancies observed (based on dtype) in map behavior as highlighted in bug description #48813
TIA

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @rhshadrach can you weigh in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on changing the behavior in this test. The 0 in question here is because np.nan does not equal itself, and NumPy often returns views so that ids are not equal either; e.g.

mapping = defaultdict(int, {1: 10, np.nan: 42})
arr = np.array([1, np.nan, 2])
print(mapping[arr[1]])

# 0

I think introducing a better lookup for NaN values makes sense, and brings this in line with the Series case:

mapping = pd.Series({1: 10, np.nan: 42})
ser = Series([1, np.nan, 2])
print(ser.map(mapping))

# 0    10.0
# 1    42.0
# 2     NaN
# dtype: float64

@mroeschke
Copy link
Member

Yes, these changes does fix the defaultdict mutation issue.

Could you add a test to check that the default dict isn't mutated?

@mroeschke
Copy link
Member

cc @phofl since you made a fix here recently

@dannyi96
Copy link
Contributor Author

Yes, these changes does fix the defaultdict mutation issue.

Could you add a test to check that the default dict isn't mutated?

Updated the PR with a testcase to check defaultdict mutation.

@mroeschke mroeschke added this to the 2.0 milestone Oct 14, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @phofl or @rhshadrach merge when ready

@phofl phofl merged commit 42a43e3 into pandas-dev:main Oct 14, 2022
@phofl
Copy link
Member

phofl commented Oct 14, 2022

Thx @dannyi96

@dannyi96 dannyi96 deleted the daniel_bug_fix branch October 14, 2022 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Series Series data structure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: differences in Series.map with defaultdict with different dtypes

4 participants