Remove rows that contain duplicated values in NumPy

Question

I'm trying to find an efficient way to remove the rows of a numpy array that contain duplicated elements. For example, for the array below:

[[1,2,3], [1,2,2], [2,2,2]]

only [[1,2,3]] should be kept.

I know pandas apply can work row-wise, but that's too slow. What is a quicker alternative?

Thanks!

How many rows/cols do you have in real-life?

– mozway, Commented Jul 11, 2023 at 13:16

mozway · Accepted Answer · 2023-07-11 13:16:09Z

3

Using pandas nunique (not fast!):

out = a[pd.DataFrame(a).nunique(axis=1).eq(a.shape[1])]

Or with numpy's sort and diff to ensure all values are different in a row (quite efficient if the number of columns is reasonable):

out = a[(np.diff(np.sort(a, axis=1))!=0).all(axis=1)]
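To see why the sort and diff version works, here is a minimal step-by-step sketch on the sample array from the question (the intermediate names `s`, `d` and `mask` are only for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 2], [2, 2, 2]])

# Sort each row so that equal values end up next to each other
s = np.sort(a, axis=1)

# Differences between neighbours in each row; a 0 marks a repeated value
d = np.diff(s, axis=1)

# Keep only the rows in which no neighbouring pair is equal
mask = (d != 0).all(axis=1)
out = a[mask]
print(out)   # [[1 2 3]]
```

The sort is what makes the comparison of direct neighbours sufficient: once a row is sorted, any duplicated value must sit next to its copy.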

Or with broadcasting (memory expensive if lots of columns):

out = a[(a[:,:,None] == a[:,None]).sum(axis=(1,2))==a.shape[1]]

Output: array([[1, 2, 3]])
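For the broadcasting variant, a sketch of what the intermediate array looks like and why it gets memory hungry (the names `eq` and `counts` are only for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 2], [2, 2, 2]])
n_rows, n_cols = a.shape

# Compare every value of a row with every value of the same row:
# (n_rows, n_cols, 1) == (n_rows, 1, n_cols) -> (n_rows, n_cols, n_cols)
eq = a[:, :, None] == a[:, None]
print(eq.shape)   # (3, 3, 3)

# Each value always equals itself, so a row without duplicates has
# exactly n_cols matches (the diagonal); duplicates add extra matches
counts = eq.sum(axis=(1, 2))
print(counts)     # [3 5 9]

out = a[counts == n_cols]
```

The boolean intermediate holds n_rows * n_cols**2 elements at one byte each, which is why this approach becomes expensive as the number of columns grows.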

Comparison of approaches (timing plot not reproduced here).
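A rough way to time the numpy-based approaches yourself is sketched below. The array size, value range and function names are arbitrary assumptions, not the setup used for the original comparison, and the pandas variant is left out to keep the sketch dependent on numpy only:

```python
import timeit

import numpy as np

# Hypothetical test data: size and value range are arbitrary assumptions
rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(10_000, 5))

def sort_diff(a):
    return a[(np.diff(np.sort(a, axis=1)) != 0).all(axis=1)]

def broadcasting(a):
    return a[(a[:, :, None] == a[:, None]).sum(axis=(1, 2)) == a.shape[1]]

def set_mask(a):
    return a[np.array([len(set(row)) == len(row) for row in a])]

# All approaches should select the same rows
expected = sort_diff(a)
for func in (broadcasting, set_mask):
    assert np.array_equal(func(a), expected)

# Average wall-clock time over a few calls
for func in (sort_diff, broadcasting, set_mask):
    seconds = timeit.timeit(lambda: func(a), number=5) / 5
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

Timings depend heavily on the number of rows and columns, so it is worth running this with shapes close to your real data.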

edited Jul 11, 2023 at 13:16

answered Jul 11, 2023 at 12:34


2 Comments

Xe- Over a year ago

Thanks for the runtime comparison, mozway. The CSV data is slightly less than 10 GB in size, which fits into memory. The list comprehension mask worked, but it looks like numpy broadcasting will be MUCH faster.

mozway Over a year ago

@Xe- if you give it a try on your data, please provide the timings ;)

Marc · Accepted Answer · 2023-07-11 12:29:14Z

0

One way to do so without pandas is as follows:

import numpy as np

a = np.array([[1,2,3], [1,2,2], [2,2,2]])

mask = np.array([len(set(row)) == len(row) for row in a])
result = a[mask]

which outputs:

print(result)

[[1 2 3]]

answered Jul 11, 2023 at 12:29


Dejene T. · Accepted Answer · 2023-07-11 12:31:03Z

0

To exclude rows that contain duplicated values from a numpy array:

import numpy as np

# Sample input array
arr = np.array([[1, 2, 3], [1, 2, 2], [2, 2, 2]])

# Keep only the rows whose values are all distinct
unique_rows = np.array([row for row in arr if len(set(row)) == len(row)])
print(unique_rows)

Output:

[[1 2 3]]
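One caveat with rebuilding the array from a list, sketched below on a made-up input in which every row contains a duplicate: the result loses its second dimension, whereas boolean-mask indexing keeps it.

```python
import numpy as np

# Hypothetical input: every row contains a repeated value
arr = np.array([[1, 1, 2], [3, 3, 3]])

# Rebuilding from a list comprehension yields an empty 1-D array
rebuilt = np.array([row for row in arr if len(set(row)) == len(row)])
print(rebuilt.shape)   # (0,)

# Indexing with a boolean mask keeps the column dimension
mask = np.array([len(set(row)) == len(row) for row in arr])
masked = arr[mask]
print(masked.shape)    # (0, 3)
```

If downstream code relies on the result always being two-dimensional, the mask form is the safer of the two.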

answered Jul 11, 2023 at 12:31
