scipy rankdata with masked array

I noticed the following strange behavior of rankdata with a masked_array. Here is the code:
import numpy as np
import scipy.stats as stats
m = [True, False]
print(stats.mstats.rankdata(np.ma.masked_array([1.0, 100], mask=m)))
# result [0. 1.]
print(stats.mstats.rankdata(np.ma.masked_array([1.0, np.nan], mask=m)))
# result [1. 0.]
print(stats.mstats.rankdata([1.0, np.nan]))
# result [1. 2.]
According to the scipy doc, masked values are assigned 0 (use_missing=False). So why does it output [1. 0.] in the second case? Is this a bug?

After tracing, I found that it is related to the argsort method of masked_array. When mstats.rankdata calls argsort, it does not specify the fill_value and endwith parameters, which default to None and True respectively. Based on the following code from numpy, the fill_value then becomes np.nan for floating-point data.
if fill_value is None:
    if endwith:
        # nan > inf
        if np.issubdtype(self.dtype, np.floating):
            fill_value = np.nan
So in the case of the masked_array of [1, 100], it is argsorting [nan, 100], which gives [1, 0]. In the case of the masked_array of [1, np.nan], it is argsorting [nan, nan], which can give [0, 1]. The rankdata function then assumes that the first n (n=1) indices from argsort are the valid ones, which is not correct here.
n = data.count()
rk = np.empty(data.size, dtype=float)
idx = data.argsort()
rk[idx[:n]] = np.arange(1,n+1)
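One way to sidestep the argsort issue is to rank only the unmasked values and write 0 everywhere else. The sketch below is a possible workaround, not the library's own fix: the helper name rank_unmasked is made up for illustration, it assumes 1-D input, and how an unmasked NaN is ranked is left to stats.rankdata (which depends on the SciPy version).

```python
import numpy as np
from scipy import stats

def rank_unmasked(values, mask):
    """Rank the unmasked entries of a 1-D sequence; masked entries get rank 0."""
    data = np.ma.masked_array(values, mask=mask)
    valid = ~np.ma.getmaskarray(data)
    ranks = np.zeros(data.size, dtype=float)
    # Rank only the values that are actually present, so the sort never
    # sees the fill value that stands in for masked entries.
    ranks[valid] = stats.rankdata(data.compressed())
    return ranks

print(rank_unmasked([1.0, 100.0], [True, False]))
# [0. 1.]
print(rank_unmasked([3.0, 1.0, 2.0, 2.0], [False, True, False, False]))
# [3.  0.  1.5 1.5]
```

Ties among the unmasked values receive the average rank, which is the default method of stats.rankdata.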

Related

Numpy equivalent to MATLAB's hist

For some reason Numpy's hist always returns one less bin than MATLAB's hist:
for example in MATLAB:
x = [1,2,2,2,1,4,4,2,3,3,3,3];
[Rep,Val] = hist(x,unique(x));
gives:
Rep = [2 4 4 2]
Val = [1 2 3 4]
but in Numpy:
import numpy as np
x = np.array([1,2,2,2,1,4,4,2,3,3,3,3])
Rep, Val = np.histogram(x,np.unique(x))
gives:
>>> Rep
array([2, 4, 6])
>>> Val
array([1, 2, 3, 4])
How can I get results identical to MATLAB's?

Based on dilayapici's answer on this post, a general solution (applied to your example) to run Python's np.histogram in the same way as Matlab's hist, is the following:
x = np.array([1,2,2,2,1,4,4,2,3,3,3,3])
# Convert the bin centers given in Matlab to bin edges needed in Python.
numBins = len(np.unique(x))
bins = np.linspace(np.amin(x), np.amax(x), numBins)
# Edit the 'bins' argument of `np.histogram` by just putting '+inf' as the last element.
bins = np.concatenate((bins, [np.inf]))
Rep, Val = np.histogram(x, bins)
Output:
Rep
array([2, 4, 4, 2], dtype=int64)
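When the bin centers are not the data's own evenly spaced values, the same idea can be sketched with explicit edges. This assumes, as I understand MATLAB's hist documentation, that bin boundaries fall midway between adjacent centers and that the two outer bins are open-ended; hist_by_centers is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def hist_by_centers(x, centers):
    """Count values of x in bins described by sorted bin centers."""
    centers = np.asarray(centers, dtype=float)
    # Assumed rule: edges lie halfway between neighbouring centers.
    midpoints = (centers[:-1] + centers[1:]) / 2
    # Open-ended outer bins catch everything below/above the outer centers.
    edges = np.concatenate(([-np.inf], midpoints, [np.inf]))
    counts, _ = np.histogram(x, bins=edges)
    return counts, centers

x = np.array([1, 2, 2, 2, 1, 4, 4, 2, 3, 3, 3, 3])
Rep, Val = hist_by_centers(x, np.unique(x))
print(Rep)  # [2 4 4 2]
```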

First, I want to explain the problem.
In Python it works like this:
np.unique(x) = [1, 2, 3, 4], so
The first bin is equal to [1, 2) (including 1, but excluding 2) and therefore ==> Rep[0]=2
The second bin is equal to [2, 3) (including 2, but excluding 3) and therefore ==> Rep[1]=4
The last bin is equal to [3, 4], which includes 4. Therefore ==> Rep[2] = 6
In MATLAB the hist() function works like this:
The first bin is equal to [1, 2) (including 1, but excluding 2) and therefore ==> Rep[0]=2
The second bin is equal to [2, 3) (including 2, but excluding 3) and therefore ==> Rep[1]=4
The third bin is equal to [3, 4) (including 3, but excluding 4) and therefore ==> Rep[2]=4
The last bin is equal to [4, ∞) and therefore ==> Rep[3]=2
Now, if you want the same result as in Python, you have to use a different function in MATLAB: the histogram() function, which lets you choose the number of bins.
x = [1,2,2,2,1,4,4,2,3,3,3,3];
nbins=3 ;
h= histogram(x,nbins);
h.Values
You can see that h.Values equals [2,4,6].
I hope this helps :)
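For discrete data like this example, where each bin is meant to hold exactly one distinct value, the counts can also be read off without building bins at all. This is a small sketch using np.unique with return_counts=True; it is only equivalent to a histogram when the bins coincide with the unique values.

```python
import numpy as np

x = np.array([1, 2, 2, 2, 1, 4, 4, 2, 3, 3, 3, 3])
# Val holds the sorted distinct values, Rep how often each one occurs.
Val, Rep = np.unique(x, return_counts=True)
print(Rep)  # [2 4 4 2]
print(Val)  # [1 2 3 4]
```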

How can I efficiently convert a scipy sparse matrix into a sympy sparse matrix?

I have a matrix A with the following properties.
<1047x1047 sparse matrix of type '<class 'numpy.float64'>'
with 888344 stored elements in Compressed Sparse Column format>
A has this content.
array([[ 1.00000000e+00, -5.85786642e-17, -3.97082034e-17, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[ 6.82195979e-17, 1.00000000e+00, -4.11166786e-17, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[-4.98202332e-17, 1.13957868e-17, 1.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
...,
[ 4.56847824e-15, 1.32261454e-14, -7.22890998e-15, ...,
1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[-9.11597396e-15, -2.28796167e-14, 1.26624823e-14, ...,
0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
[ 1.80765584e-14, 1.93779820e-14, -1.36520100e-14, ...,
0.00000000e+00, 0.00000000e+00, 1.00000000e+00]])
Now I'm trying to create a sympy sparse matrix from this scipy sparse matrix.
from sympy.matrices import SparseMatrix
A = SparseMatrix(A)
But I get this error message.
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I am confused because this matrix has no logical entries.
Thanks for any help!

The Error
When you get an error that you don't understand, take a bit of time to look at the traceback. Or at least show it to us!
In [288]: M = sparse.random(5,5,.2, 'csr')
In [289]: M
Out[289]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [290]: print(M)
(1, 1) 0.17737340878962138
(2, 2) 0.12362174819457106
(2, 3) 0.24324155883057885
(3, 0) 0.7666429046432961
(3, 4) 0.21848551209470246
In [291]: SparseMatrix(M)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-291-cca56ea35868> in <module>
----> 1 SparseMatrix(M)
/usr/local/lib/python3.6/dist-packages/sympy/matrices/sparse.py in __new__(cls, *args, **kwargs)
206 else:
207 # handle full matrix forms with _handle_creation_inputs
--> 208 r, c, _list = Matrix._handle_creation_inputs(*args)
209 self.rows = r
210 self.cols = c
/usr/local/lib/python3.6/dist-packages/sympy/matrices/matrices.py in _handle_creation_inputs(cls, *args, **kwargs)
1070 if 0 in row.shape:
1071 continue
-> 1072 elif not row:
1073 continue
1074
/usr/local/lib/python3.6/dist-packages/scipy/sparse/base.py in __bool__(self)
281 return self.nnz != 0
282 else:
--> 283 raise ValueError("The truth value of an array with more than one "
284 "element is ambiguous. Use a.any() or a.all().")
285 __nonzero__ = __bool__
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
A full understanding requires reading the sympy code, but a cursory look indicates that it's trying to handle your input as "full matrix", and looks at rows. The error isn't the result of you doing logical operations on the entries, but that sympy is doing a logical test on your sparse matrix. It's trying to check if the row is empty (so it can skip it).
SparseMatrix docs may not be the clearest, but most examples either show a dict of points, or a flat array of ALL values plus shape, or a ragged list of lists. I suspect it's trying to treat your matrix that way, looking at it row by row.
But the row of M is itself a sparse matrix:
In [295]: [row for row in M]
Out[295]:
[<1x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Compressed Sparse Row format>,
<1x5 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>,
...]
And trying to check whether that row is empty, with the test not row, produces this error:
In [296]: not [row for row in M][0]
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
So clearly SparseMatrix cannot handle a scipy.sparse matrix as is (at least not in the csr or csc format, and probably not the others). Plus scipy.sparse is not mentioned anywhere in the SparseMatrix docs!
From dense array
Converting the sparse matrix to its dense equivalent does work:
In [297]: M.A
Out[297]:
array([[0. , 0. , 0. , 0. , 0. ],
[0. , 0.17737341, 0. , 0. , 0. ],
[0. , 0. , 0.12362175, 0.24324156, 0. ],
[0.7666429 , 0. , 0. , 0. , 0.21848551],
[0. , 0. , 0. , 0. , 0. ]])
In [298]: SparseMatrix(M.A)
Out[298]:
⎡ 0 0 0 0 0 ⎤
...⎦
Or a list of lists:
SparseMatrix(M.A.tolist())
From dict
The dok format stores a sparse matrix as a dict, which can then be converted to a plain dict:
In [305]: dict(M.todok())
Out[305]:
{(3, 0): 0.7666429046432961,
(1, 1): 0.17737340878962138,
(2, 2): 0.12362174819457106,
(2, 3): 0.24324155883057885,
(3, 4): 0.21848551209470246}
Which works fine as an input:
SparseMatrix(5,5,dict(M.todok()))
I don't know what's most efficient. Generally when working with sympy we (or at least I) don't worry about efficiency. Just getting it to work is enough. Efficiency is more relevant in numpy/scipy, where arrays can be large and using the fast compiled numpy methods makes a big difference in speed.
Finally, numpy and sympy are not integrated. That applies also to the sparse versions. sympy is built on Python, not numpy, so inputs in the form of lists and dicts make the most sense.

from sympy.matrices import SparseMatrix
import scipy.sparse as sps
A = sps.random(100, 10, format="dok")
B = SparseMatrix(100, 10, dict(A.items()))
From the perspective of someone who likes efficient memory structures this is like staring into the abyss. But it will work.
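The dict route can also be written so that it accepts any scipy.sparse format, by going through the COO row/col/data triples instead of dok. The following is a minimal sketch under that assumption; scipy_to_sympy is a hypothetical helper name, and the values are converted to plain Python floats, so they become sympy Floats rather than exact rationals.

```python
import scipy.sparse as sps
from sympy import SparseMatrix

def scipy_to_sympy(mat):
    """Build a sympy SparseMatrix from the stored entries of a scipy.sparse matrix."""
    coo = sps.coo_matrix(mat)
    coo.sum_duplicates()  # make sure each (row, col) appears only once
    # Plain Python ints/floats, since sympy does not know numpy scalar types.
    entries = {
        (int(i), int(j)): float(v)
        for i, j, v in zip(coo.row, coo.col, coo.data)
    }
    return SparseMatrix(coo.shape[0], coo.shape[1], entries)

M = sps.csr_matrix([[0.0, 1.5], [2.0, 0.0]])
S = scipy_to_sympy(M)
print(S.shape)  # (2, 2)
```

Only the stored entries are visited, so the dense array is never materialized on the scipy side.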

This is a simplified version of your error.
import numpy as np
from scipy import sparse
from sympy.matrices import SparseMatrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = sparse.csc_matrix((data, (row, col)), shape=(3, 3))
So A is a sparse matrix with 6 elements:
<3x3 sparse matrix of type '<class 'numpy.intc'>'
with 6 stored elements in Compressed Sparse Column format>
Calling SparseMatrix() on it raises the same kind of error that you have. You might like to convert A to a dense numpy matrix first:
>>> SparseMatrix(A.todense())
Matrix([
[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])

K-means sort labels

Assume I have a matrix A and I perform K-means clustering on it in MATLAB. I get the following:
A=
1 20 5
1 30 10
2 60 20
5 100 45
kmeans(A,4) results in the following labels:
2
4
3
1
Now I permute rows of A and I get matrix B:
B =
2 60 20
1 30 10
5 100 45
1 20 5
and after applying kmeans the labels are B1 = [3 1 2 4], which seems to be a random assignment. For example, the second row of matrix A is in cluster 4, but the second row of matrix B, which is the same as the second row of A, is in cluster 1.
How can I get labels from kmeans such that the rows with the highest values always get the same label, for example 3, and the rows with the lowest values always get 1?
For example, the last row of A gets label 3, and thus the third row of B also gets label 3.

Each label is tied to the mean of a cluster. To sort the labels, you sort the means in e.g. order of appearance along a given axis (x-axis in this example). Here's an implementation in Python:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
np.random.seed(1)
def rearrange_labels(X, cluster_labels, sort_on_column=0):
    # Mean of each cluster; row i of ctrs belongs to original label i.
    n_clusters = len(set(cluster_labels))
    ctrs = np.vstack([X[cluster_labels == i].mean(axis=0) for i in range(n_clusters)])
    # new_order[k] is the original label of the cluster with the k-th smallest mean.
    new_order = ctrs[:, sort_on_column].argsort()
    # Invert the permutation to map each original label to its sorted position.
    relabel = np.empty(n_clusters, dtype=int)
    relabel[new_order] = np.arange(n_clusters)
    return relabel[cluster_labels], ctrs[new_order]
X, _ = make_blobs(n_samples=500, centers=10, n_features=2)
clf = KMeans(n_clusters=10)
cluster_labels = clf.fit_predict(X)
cluster_labels, ctrs = rearrange_labels(X=X, cluster_labels=cluster_labels)
fig, ax = plt.subplots()
# Annotate each center with its new label, then plot the points colored by label.
for i, m in enumerate(ctrs):
    ax.annotate(
        str(i),
        xy=m[[0, 1]],
        bbox=dict(boxstyle="square", fc="w", ec="grey", alpha=0.9),
    )
ax.scatter(X[:, 0], X[:, 1], c=cluster_labels)
plt.show()

The cluster numbers assigned by k-means do not have an order - don't treat them as such. They are numbers just for convenience, they might as well be A B C D.
If you want to impose an order on them, you can relabel them as you want. You can sort the centers by X coordinate, and relabel them. It's not the job of k-means to do so, you need to do this yourself.
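Applied to the matrices from the question, the relabeling could look like the sketch below. It uses 0-based labels (MATLAB's labels minus 1), sorts on the second column because the first column contains a tie, and relies on the fact that with 4 rows and 4 clusters every row is its own center. The helper name relabel_by_center is made up for illustration.

```python
import numpy as np

def relabel_by_center(labels, centers, column=0):
    """Map 0-based cluster labels to the rank of their center along one column."""
    order = np.argsort(centers[:, column], kind="stable")
    mapping = np.empty(len(order), dtype=int)
    mapping[order] = np.arange(len(order))
    return mapping[np.asarray(labels)]

A = np.array([[1, 20, 5], [1, 30, 10], [2, 60, 20], [5, 100, 45]])
B = A[[2, 1, 3, 0]]                  # the permuted rows from the question
labels_A = np.array([1, 3, 2, 0])    # the question's [2 4 3 1], shifted to 0-based
labels_B = np.array([2, 0, 1, 3])    # the question's [3 1 2 4], shifted to 0-based

# centers[k] must be the center of cluster k, so place each row at its label.
centers_A = np.empty_like(A)
centers_A[labels_A] = A
centers_B = np.empty_like(B)
centers_B[labels_B] = B

new_A = relabel_by_center(labels_A, centers_A, column=1)
new_B = relabel_by_center(labels_B, centers_B, column=1)
print(new_A)  # [0 1 2 3]
print(new_B)  # [2 1 3 0]
```

After relabeling, each row carries the same label in A and in B, regardless of the labels k-means happened to assign.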

reshape scipy csr matrix

How can I efficiently reshape a scipy.sparse csr_matrix?
I need to add zero rows at the end.
Using:
from scipy.sparse import csr_matrix
data = [1,2,3,4,5,6]
col = [0,0,0,1,1,1]
row = [0,1,2,0,1,2]
a = csr_matrix((data, (row, col)))
a.reshape(3,5)
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 129, in reshape
self.__class__.__name__)
NotImplementedError: Reshaping not implemented for csr_matrix.

If you can catch the problem early enough, just include a shape parameter:
In [48]: a = csr_matrix((data, (row, col)))
In [49]: a
Out[49]:
<3x2 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [50]: a = csr_matrix((data, (row, col)),shape=(3,5))
In [51]: a
Out[51]:
<3x5 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [52]: a.A
Out[52]:
array([[1, 4, 0, 0, 0],
[2, 5, 0, 0, 0],
[3, 6, 0, 0, 0]], dtype=int64)
You could also hstack on a pad. Make sure it's the sparse version:
In [59]: z = sparse.coo_matrix(np.zeros((3,3)))
In [60]: z
Out[60]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in COOrdinate format>
In [61]: sparse.hstack((a,z))
Out[61]:
<3x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in COOrdinate format>
In [62]: _.A
Out[62]:
array([[1., 4., 0., 0., 0.],
[2., 5., 0., 0., 0.],
[3., 6., 0., 0., 0.]])
hstack uses sparse.bmat. That combines the coo attributes of the 2 arrays, and makes a new coo matrix.

The reshape() method will work with csr_matrix objects in scipy 1.1, which is close to being released. In the meantime, you can try the code at Reshape sparse matrix efficiently, Python, SciPy 0.12 for reshaping a sparse matrix.
Your example won't work, however, because you are trying to reshape an array with shape (3, 2) into an array with shape (3, 5). The code linked to above and the sparse reshape() method follow the same rules as the reshape() method of numpy arrays: you can't change the total size of the array.
If you want to change the total size, you will eventually be able to use the resize() method (which operates in-place), but that is also a new feature of scipy 1.1, so it is not yet released.
Instead, you can construct a new sparse matrix as follows:
In [57]: b = csr_matrix((a.data, a.indices, a.indptr), shape=(3, 5))
In [58]: b.shape
Out[58]: (3, 5)
In [59]: b.A
Out[59]:
array([[1, 4, 0, 0, 0],
[2, 5, 0, 0, 0],
[3, 6, 0, 0, 0]], dtype=int64)
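For readers on SciPy 1.1 or later, the in-place resize() mentioned above can be sketched as follows. This assumes a SciPy version where csr_matrix.resize is available; new rows or columns are filled with implicit zeros.

```python
from scipy.sparse import csr_matrix

data = [1, 2, 3, 4, 5, 6]
col = [0, 0, 0, 1, 1, 1]
row = [0, 1, 2, 0, 1, 2]

a = csr_matrix((data, (row, col)))   # shape inferred as (3, 2)
a.resize((3, 5))                     # in place: pad with zero columns
print(a.toarray())
# [[1 4 0 0 0]
#  [2 5 0 0 0]
#  [3 6 0 0 0]]

b = csr_matrix((data, (row, col)))
b.resize((5, 2))                     # in place: append two zero rows
print(b.shape)  # (5, 2)
```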

Accessing sparse matrix elements

I have a very large sparse matrix of the type 'scipy.sparse.coo.coo_matrix'. I can convert to csr with .tocsr(), however .todense() will not work since the array is too large. I want to be able to extract elements from the matrix as I would do with a regular array, so that I may pass row elements to a function.
For reference, when printed, the matrix looks as follows:
(7, 0) 0.531519363001
(48, 24) 0.400946334437
(70, 6) 0.684460955022
...

Make a matrix with 3 elements:
In [550]: M = sparse.coo_matrix(([.5,.4,.6],([0,1,2],[0,5,3])), shape=(5,7))
Its default display (repr(M)):
In [551]: M
Out[551]:
<5x7 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
and print display (str(M)) - looks like the input:
In [552]: print(M)
(0, 0) 0.5
(1, 5) 0.4
(2, 3) 0.6
convert to csr format:
In [553]: Mc=M.tocsr()
In [554]: Mc[1,:] # row 1 is another matrix (1 row):
Out[554]:
<1x7 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
In [555]: Mc[1,:].A # that row as 2d array
Out[555]: array([[ 0. , 0. , 0. , 0. , 0. , 0.4, 0. ]])
In [556]: print(Mc[1,:]) # like 2nd element of M except for row number
(0, 5) 0.4
Individual element:
In [560]: Mc[1,5]
Out[560]: 0.40000000000000002
The data attributes of these formats (if you want to dig further):
In [562]: Mc.data
Out[562]: array([ 0.5, 0.4, 0.6])
In [563]: Mc.indices
Out[563]: array([0, 5, 3], dtype=int32)
In [564]: Mc.indptr
Out[564]: array([0, 1, 2, 3, 3, 3], dtype=int32)
In [565]: M.data
Out[565]: array([ 0.5, 0.4, 0.6])
In [566]: M.col
Out[566]: array([0, 5, 3], dtype=int32)
In [567]: M.row
Out[567]: array([0, 1, 2], dtype=int32)
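To pass the elements of one row to a function without densifying anything, the indptr, indices and data attributes shown above can be sliced directly. A minimal sketch, where row_entries is a hypothetical helper name and the matrix is the same 3-element example:

```python
import numpy as np
from scipy import sparse

M = sparse.coo_matrix(([.5, .4, .6], ([0, 1, 2], [0, 5, 3])), shape=(5, 7))
Mc = M.tocsr()

def row_entries(csr, i):
    """Return (column indices, values) of the stored entries in row i."""
    # indptr[i]:indptr[i + 1] delimits row i inside indices and data.
    start, end = csr.indptr[i], csr.indptr[i + 1]
    return csr.indices[start:end], csr.data[start:end]

cols, vals = row_entries(Mc, 1)
print(cols, vals)  # [5] [0.4]

# A single row as a dense 1-D array, if a function needs the zeros as well.
dense_row = Mc[1, :].toarray().ravel()
```

Only one row is ever expanded to dense form, so this stays cheap even when the full matrix is too large for .todense().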

We Keep Coding