1

I'm trying to find an efficient way to remove the rows of a numpy array that contain duplicated elements. For example, the array below:

[[1,2,3], [1,2,2], [2,2,2]]

should keep [[1,2,3]] only.

I know pandas apply can work row-wise, but that's too slow. What is a quicker alternative?

Thanks!

1
  • How many rows/cols do you have in real-life? Commented Jul 11, 2023 at 13:16

3 Answers

3

Using pandas nunique (not fast!):

out = a[pd.DataFrame(a).nunique(axis=1).eq(a.shape[1])]

Or with numpy's sort and diff to ensure all values are different in a row (quite efficient if the number of columns is reasonable):

out = a[(np.diff(np.sort(a, axis=1))!=0).all(axis=1)]
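To spell out why the sort/diff version works: once a row is sorted, equal values end up next to each other, so a row has no repeats exactly when every adjacent difference is non-zero. Here is a minimal sketch of that idea as a function (the name `rows_without_repeats` is only for illustration, and it assumes a 2-D numeric array):

```python
import numpy as np

def rows_without_repeats(a):
    """Keep only the rows of a 2-D array whose elements are all distinct."""
    a = np.asarray(a)
    # Sort each row so that equal values become neighbours.
    s = np.sort(a, axis=1)
    # A zero difference between neighbours means a repeated value in that row.
    mask = (np.diff(s, axis=1) != 0).all(axis=1)
    return a[mask]

a = np.array([[1, 2, 3], [1, 2, 2], [2, 2, 2]])
out = rows_without_repeats(a)
print(out)  # [[1 2 3]]
```

The sort costs O(n log n) per row, which is why this stays cheap as long as the number of columns is moderate.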

Or with broadcasting (memory expensive if lots of columns):

out = a[(a[:,:,None] == a[:,None]).sum(axis=(1,2))==a.shape[1]]

Output: array([[1, 2, 3]])
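A note on the broadcasting version: it builds a boolean array of shape (rows, cols, cols), where each row is compared element by element with itself. The diagonal always matches, so a row with no repeats has exactly `a.shape[1]` matches. If the intermediate array is too large for memory, one option is to process the rows in blocks. This is only a sketch; the function name and the `chunk_size` parameter are my own assumptions, not part of numpy:

```python
import numpy as np

def distinct_rows_mask(a, chunk_size=100_000):
    """Boolean mask of rows with no repeated element, computed block by block."""
    a = np.asarray(a)
    n_cols = a.shape[1]
    mask = np.empty(a.shape[0], dtype=bool)
    for start in range(0, a.shape[0], chunk_size):
        block = a[start:start + chunk_size]
        # Pairwise comparison within each row: shape (block_rows, n_cols, n_cols).
        eq = block[:, :, None] == block[:, None, :]
        # Only the diagonal matches when all elements of the row are distinct.
        mask[start:start + chunk_size] = eq.sum(axis=(1, 2)) == n_cols
    return mask

a = np.array([[1, 2, 3], [1, 2, 2], [2, 2, 2], [4, 5, 6]])
mask = distinct_rows_mask(a, chunk_size=2)
out = a[mask]
```

With integer data this gives the same mask as the one-liner; the block size just caps the temporary array at roughly `chunk_size * cols * cols` booleans.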

Comparison of approaches:

[Image: runtime comparison of the approaches]
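If you want to reproduce a comparison on your own machine, something along these lines should do. It is a rough sketch on random integer data; the array size, the value range and the helper names are arbitrary choices of mine, so the numbers will differ from the original plot:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 20, size=(20_000, 5))  # hypothetical test data

def sort_diff(a):
    return a[(np.diff(np.sort(a, axis=1), axis=1) != 0).all(axis=1)]

def broadcast(a):
    return a[(a[:, :, None] == a[:, None]).sum(axis=(1, 2)) == a.shape[1]]

def set_mask(a):
    return a[np.array([len(set(row)) == len(row) for row in a])]

results = {}
timings = {}
for func in (sort_diff, broadcast, set_mask):
    results[func.__name__] = func(a)
    # Best of three single runs, in seconds.
    timings[func.__name__] = min(timeit.repeat(lambda: func(a), number=1, repeat=3))

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

All three functions should return the same rows, so it is worth checking that with `np.array_equal` before trusting any timing.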


2 Comments

Thanks for the runtime comparison, mozway. The CSV data is slightly less than 10 GB in size, which fits into memory. The list comprehension mask worked, but it looks like numpy broadcasting will be MUCH faster.
@Xe- if you give it a try on your data, please provide the timings ;)

0

One way to do so without pandas would be as follows:

import numpy as np

a = np.array([[1,2,3], [1,2,2], [2,2,2]])

# a row is kept only if converting it to a set does not shrink it
mask = np.array([len(set(row)) == len(row) for row in a])
result = a[mask]

which outputs:

print(result)

[[1 2 3]]


0

To exclude rows that contain duplicated elements from a numpy array:

import numpy as np

# Input array (sample data from the question)
arr = np.array([[1, 2, 3], [1, 2, 2], [2, 2, 2]])

# keep only the rows whose elements are all distinct
unique_rows = np.array([row for row in arr if len(set(row)) == len(row)])
print(unique_rows)

Output:

[[1 2 3]]
