
I have an MxN array of values taken from an experiment. Some of these values are invalid and have been set to 0 to indicate as much. I can construct a mask of valid/invalid values using

mask = (mat1 == 0) & (mat2 == 0)

which produces an MxN array of bool. It should be noted that the masked locations do not neatly follow columns or rows of the matrix - so simply cropping the matrix is not an option.

Now, I want to take the mean along one axis of my array (e.g. end up with a 1xN array) while excluding the invalid values from the mean calculation. Intuitively I thought

 np.mean(mat1[mask], axis=1)

should do it, but mat1[mask] produces a 1D array containing just the elements where the mask is True - which doesn't help when I only want a mean across one dimension of the array.

Is there a 'python-esque' or numpy way to do this? I suppose I could use the mask to set masked elements to NaN and use np.nanmean - but that still feels kind of clunky. Is there a way to do this 'cleanly'?
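For example, boolean indexing flattens even a tiny array:

import numpy as np

a = np.arange(6).reshape(2, 3)
print(a[a > 2])  # [3 4 5] -- 1D, the row structure is gone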

2 Answers


I think the best way to do this would be something along the lines of:

masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), array_to_mask)

Then take the mean with

masked.mean(axis=1)
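For reference, a minimal runnable version of this approach (the small arrays below are made up for illustration; mask the positions where both inputs read 0, then average mat1):

import numpy as np

# made-up 3x4 measurement arrays; 0 in both marks an invalid reading
mat1 = np.array([[1., 0., 3., 4.],
                 [5., 6., 0., 8.],
                 [9., 0., 11., 12.]])
mat2 = np.array([[2., 0., 2., 2.],
                 [2., 2., 0., 2.],
                 [2., 0., 2., 2.]])

# mask the positions where both arrays read 0
masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), mat1)

# row means, ignoring the masked entries
print(masked.mean(axis=1))  # approximately [2.667 6.333 10.667]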

1 Comment

Worked perfectly! I didn't know about masked arrays - thank you!

One similarly clunky but efficient way is to multiply your array by the mask, setting the masked values to zero. Then, of course, you have to divide by the number of non-masked values manually - hence the clunkiness. But this works with integer-valued arrays, something that can't be said for the NaN approach. It also seems to be the fastest for both small and larger arrays, including when compared to the masked-array solution in the other answer:

import numpy as np

def nanny(mat, mask):
    mat = mat.astype(float)        # astype returns a copy, so the original is untouched
    mat[~mask] = np.nan            # overwrite invalid entries with NaN
    return np.nanmean(mat, axis=0) # mean ignoring NaNs

def manual(mat, mask):
    # zero masked values, divide by number of nonzeros
    return (mat*mask).sum(axis=0)/mask.sum(axis=0)

# set up dummy data for testing
N, M = 400, 400
mat1 = np.random.randint(0, N, (N, M))
mask = np.random.randint(0, 2, (N, M)).astype(bool)  # True = valid here (opposite of the question's mask convention)

print(np.array_equal(nanny(mat1, mask), manual(mat1, mask))) # True
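To check the speed claim on your own machine, here is a quick sketch using timeit that reuses the definitions above (exact numbers will vary with your setup):

import timeit

# time each approach on the dummy data above
for fn in (nanny, manual):
    per_call = timeit.timeit(lambda: fn(mat1, mask), number=100) / 100
    print(f"{fn.__name__}: {per_call * 1e3:.3f} ms per call")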

4 Comments

What would be the manual approach when dealing with floats?
@m_power I think the manual version should also work for floats. My motivation was more that in the float case you can just use NaNs for invalid values and call np.nanmean, which is likely to be faster because it's a single numpy function call. But OP already knew this (see the last part of their question), which is why I focussed on the manual version that might be necessary for integral arrays. That said, the accepted answer's approach with masked arrays might be better overall if you need the masked data in multiple places - it depends on your use case.
Thanks! I'm using np.nanmean (for an array of float with some NaNs), but I was looking to see if there was a faster approach.
@m_power If you already have an array of floats, I'd expect np.nanmean to be fastest, but admittedly I haven't played with such problems. The function seems to be implemented in Python, so you can try doing what it does with fewer checks if this is really your bottleneck: github.com/numpy/numpy/blob/main/numpy/lib/nanfunctions.py#L863
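For what it's worth, "doing what it does with fewer checks" might look something like this hand-rolled sketch for a float array containing NaNs (illustrative only, not benchmarked here):

import numpy as np

def lean_nanmean(a, axis=0):
    # mean along `axis`, ignoring NaNs: zero them out for the sum,
    # then divide by the count of non-NaN entries.
    # An all-NaN slice gives 0/0 -> nan plus a RuntimeWarning,
    # matching np.nanmean's behavior for that case.
    valid = ~np.isnan(a)
    return np.where(valid, a, 0.0).sum(axis=axis) / valid.sum(axis=axis)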
