I have two large 2-d arrays and I'd like to find their set difference taking their rows as elements. In Matlab, the code for this would be setdiff(A,B,'rows'). The arrays are large enough that the obvious looping methods I could think of take too long.
-
What do you mean by "set difference"? – reptilicus Aug 10 '12 at 13:52
-
@user1443118 I'm guessing that he means "values in A that are not in B." as per http://www.mathworks.com/help/techdoc/ref/setdiff.html. – Hooked Aug 10 '12 at 13:55
-
"set difference" as in "set difference" the set theory operation? – Pablo Santa Cruz Aug 10 '12 at 13:56
-
What does your 2-d array look like? A list of lists? – Pablo Santa Cruz Aug 10 '12 at 13:57
-
Are the arrays the same dimensions? – reptilicus Aug 10 '12 at 14:00
-
Not quite what you are looking for, but there is a 1D version (setdiff1d) [http://docs.scipy.org/doc/numpy/reference/generated/numpy.setdiff1d.html] – Hooked Aug 10 '12 at 14:02
-
The 2-d arrays are numpy array objects with the same number of columns but different numbers of rows. – zss Aug 10 '12 at 14:03
3 Answers
This should work, but it is currently broken in NumPy 1.6.1 because mergesort is unavailable for the view being created. It works in the pre-release 1.7.0 version. This should be the fastest approach, since the views don't copy any memory:
>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
>>> a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])
>>> np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
array([[1, 2, 3]])
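If the manual structured view is a problem on your NumPy version, a related route is `np.unique` with `axis=0`. This is only a sketch, and it assumes a NumPy recent enough to support the `axis` argument (1.13 or later, as far as I know). The helper name `setdiff_rows_unique` is made up for illustration, and this is a different technique from the view trick above, not a drop-in for it. Like MATLAB's `setdiff(A,B,'rows')`, it returns unique, sorted rows:

```python
import numpy as np

def setdiff_rows_unique(a, b):
    """Rows of `a` absent from `b`, unique and sorted (hypothetical helper)."""
    a_unique = np.unique(a, axis=0)
    b_unique = np.unique(b, axis=0)
    # Stack b's rows twice: every row that occurs in b then has a count >= 2,
    # so only rows exclusive to a end up with a count of exactly 1.
    stacked = np.vstack([a_unique, b_unique, b_unique])
    rows, counts = np.unique(stacked, axis=0, return_counts=True)
    return rows[counts == 1]

a1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a2 = np.array([[4, 5, 6], [7, 8, 9], [1, 1, 1]])
result = setdiff_rows_unique(a1, a2)
print(result)
```

Unlike the view approach this copies and sorts the data, so it trades some speed and memory for not depending on structured-dtype behaviour.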
You can also do this in pure Python with sets, but it might be slow:
>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = set(map(tuple, a1))
>>> a2_rows = set(map(tuple, a2))
>>> a1_rows.difference(a2_rows)
set([(1, 2, 3)])
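The set version returns a Python set of tuples rather than an array. A minimal sketch of one way to get an array back, keeping the rows of the first array in their original order, is below. The helper name `setdiff_rows_set` is hypothetical, and note that, unlike MATLAB's `setdiff`, this variant does not remove duplicate rows or sort the result:

```python
import numpy as np

def setdiff_rows_set(a, b):
    """Rows of `a` that do not occur in `b`, in original order (hypothetical helper)."""
    # Hash every row of b once; tuples of Python scalars are hashable.
    b_rows = set(map(tuple, b.tolist()))
    # Boolean mask over the rows of a: True where the row is not found in b.
    keep = np.array([tuple(row) not in b_rows for row in a.tolist()], dtype=bool)
    return a[keep]

a1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a2 = np.array([[4, 5, 6], [7, 8, 9], [1, 1, 1]])
result = setdiff_rows_set(a1, a2)
print(result)
```

Only the rows of `b` are stored in the set, so the extra memory scales with `b` rather than with both arrays.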
-
Thanks. The bottom method eventually crashed, but once I figure out how to install the new version of numpy I'll try the top method. – zss Aug 11 '12 at 13:56
Here is an alternative pure NumPy solution that works for 1.6.1. It creates an intermediate array of shape (rows of A, rows of B, columns), which may or may not be a problem for you. It also does not rely on any speedup from sorting the arrays (as setdiff1d probably does).
from numpy import *
# Create some sample arrays
A = random.randint(0, 5, (10, 3))
B = random.randint(0, 5, (10, 3))
As an example, this is what I got - note that there is one common element:
>>> A
array([[1, 0, 3],
       [0, 4, 2],
       [0, 3, 4],
       [4, 4, 2],
       [2, 0, 2],
       [4, 0, 0],
       [3, 2, 2],
       [4, 2, 3],
       [0, 2, 1],
       [2, 0, 2]])
>>> B
array([[4, 1, 3],
       [4, 3, 0],
       [0, 3, 3],
       [3, 0, 3],
       [3, 4, 0],
       [3, 2, 3],
       [3, 1, 2],
       [4, 1, 2],
       [0, 4, 2],
       [0, 0, 3]])
We look for pairs of rows whose L1 distance is zero. The broadcasted subtraction gives the distance between every row of A and every row of B as a matrix; the positions where that matrix is zero mark the rows common to both arrays:
idx = where(abs((A[:,newaxis,:] - B)).sum(axis=2)==0)
As a check:
>>> A[idx[0]]
array([[0, 4, 2]])
>>> B[idx[1]]
array([[0, 4, 2]])
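The index pairs above identify the common rows; the set difference is whatever is left of A. Here is a rough sketch of that last step, processing A in chunks so the intermediate array stays small (the full broadcast needs rows(A) x rows(B) x columns elements, which can be too large). The function name `setdiff_rows_chunked` and the `chunk_size` parameter are my own illustrative choices, and it tests exact equality instead of summing absolute differences:

```python
import numpy as np

def setdiff_rows_chunked(A, B, chunk_size=1024):
    """Rows of A with no exact match in B (hypothetical helper; keeps order and duplicates)."""
    keep = np.ones(len(A), dtype=bool)
    for start in range(0, len(A), chunk_size):
        block = A[start:start + chunk_size]
        # matches[i, j] is True when block row i equals B row j in every column.
        matches = (block[:, np.newaxis, :] == B[np.newaxis, :, :]).all(axis=2)
        # A row is kept only if it matches no row of B.
        keep[start:start + chunk_size] = ~matches.any(axis=1)
    return A[keep]

A = np.array([[1, 0, 3], [0, 4, 2], [0, 3, 4], [4, 4, 2], [2, 0, 2],
              [4, 0, 0], [3, 2, 2], [4, 2, 3], [0, 2, 1], [2, 0, 2]])
B = np.array([[4, 1, 3], [4, 3, 0], [0, 3, 3], [3, 0, 3], [3, 4, 0],
              [3, 2, 3], [3, 1, 2], [4, 1, 2], [0, 4, 2], [0, 0, 3]])
diff = setdiff_rows_chunked(A, B, chunk_size=4)
print(diff)
```

With the sample arrays from this answer, only the row `[0, 4, 2]` is dropped. Each chunk still allocates `chunk_size` x rows(B) x columns booleans, so pick `chunk_size` to suit the available memory.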
-
Can the downvoter explain? I welcome any criticism or comments on how to improve. – Hooked Aug 10 '12 at 15:43
-
Thanks for the clever code (I'll remember the newaxis formulation). Unfortunately, when I tried it I got the error: "ValueError: array is too big." – zss Aug 11 '12 at 14:08
-
@user1590405 When you check `A.size` and `B.size`, how big are the arrays? – Hooked Aug 11 '12 at 18:51
I'm not sure what you are going for, but this will get you a boolean array of where 2 arrays are not equal, and will be numpy fast:
import numpy as np
a = np.random.randn(5, 5)
b = np.random.randn(5, 5)
a[0,0] = 10.0
b[0,0] = 10.0
a[1,1] = 5.0
b[1,1] = 5.0
c = ~(a-b==0)  # True where the elements differ
print(c)
[[False True True True True]
[ True False True True True]
[ True True True True True]
[ True True True True True]
[ True True True True True]]
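To go from this element-wise result to a per-row answer, the boolean array can be reduced along the columns. A small sketch with made-up sample values follows; note that it only compares row i of `a` with row i of `b` (same position, same shape), so it is not a set difference over rows:

```python
import numpy as np

# Hypothetical sample data: only the middle row differs.
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 6.0]])

elementwise_differs = a != b                   # same information as ~(a - b == 0)
row_differs = elementwise_differs.any(axis=1)  # True where row i of a differs from row i of b
print(row_differs)
```

Rows flagged `False` are identical in both arrays at that index.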
-
This is not correct; it compares the elements. OP is looking for the set diff of the **rows**. – Hooked Aug 10 '12 at 14:29
-
It's true that "`a[0, c[0]]` gives the elements in the 0 row of a not in b", but the way I read the question was not to find the elements of A and B per row that were identical, but to find the _rows_ of A and the _rows_ of B that matched. – Hooked Aug 10 '12 at 15:38
-
From the match matrix, however, you can easily get an array giving the row matches using `np.all(match_matrix, axis=0)`. – Okarin Jul 16 '15 at 17:47