How to get element with python? - imputation

I have no idea how to implement this, or which library and function to use. Can anybody give me some ideas? Just a function name, an idea, or a helpful website URL would be fine. Thanks!
I think it's different.

How does linear regression map to your problem?
Given a matrix X, the rows may represent the samples, and the columns the variables.
The column containing the numpy.nan value represents the target variable ("y"). The remaining columns represent the input variables (x1, x2, ...).
The rows with observed values represent the training set, the rest represent the test set.
The code
Below is a code snippet that implements these points using your example matrix X.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2, 3], [2, 4, np.nan], [3, 6, 9]])
# Unknown rows (test examples), i.e. rows with a nan
impute_rows = np.any(np.isnan(X), axis=1)
# Known rows (training examples), i.e. rows without a nan
full_rows = np.logical_not(impute_rows)
# Column acting as variable to predict
output_var = np.any(np.isnan(X), axis=0)
input_var = np.logical_not(output_var)
# Check only one variable to predict
assert(np.sum(output_var)==1)
# Construct training/test input/output
train_input = X[np.ix_(full_rows, input_var)]
train_output = X[np.ix_(full_rows, output_var)]
test_input = X[np.ix_(impute_rows, input_var)]
# Perform regression
lr = LinearRegression()
lr.fit(train_input, train_output)
lr.predict(test_input)
Note that using the specific X you provided represents an oversimplified case, where only two points are fitted, but these ideas should be applicable to larger matrices.
Also note that there are other, more specialized methods for imputing missing values in matrices (I understood from your question that this was an exercise). This specific method may be valid in cases where there is a linear relationship between the elements of the matrix (as is the case in your simplified example).
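As a rough sketch of how the predicted value could be written back into the matrix, the same idea can be expressed with plain NumPy. Note that this swaps in np.linalg.lstsq (with an added column of ones for the intercept) in place of sklearn's LinearRegression, and the function name impute_single_column is only illustrative.

```python
import numpy as np


def impute_single_column(X):
    """Fill the NaNs of the single incomplete column by least-squares regression."""
    X = np.asarray(X, dtype=float).copy()
    nan_mask = np.isnan(X)
    impute_rows = nan_mask.any(axis=1)   # rows to predict (contain a NaN)
    full_rows = ~impute_rows             # rows used for fitting
    output_var = nan_mask.any(axis=0)    # column(s) containing a NaN
    if output_var.sum() != 1:
        raise ValueError("expected exactly one column with missing values")
    target = int(np.flatnonzero(output_var)[0])
    input_var = ~output_var

    def design(rows):
        # Input columns plus a column of ones that plays the role of the intercept.
        block = X[np.ix_(rows, input_var)]
        return np.column_stack([block, np.ones(block.shape[0])])

    # Assumption: ordinary least squares via lstsq stands in for LinearRegression.
    coef, *_ = np.linalg.lstsq(design(full_rows), X[full_rows, target], rcond=None)
    X[impute_rows, target] = design(impute_rows) @ coef
    return X


X = np.array([[1, 2, 3], [2, 4, np.nan], [3, 6, 9]])
X_filled = impute_single_column(X)
print(X_filled)
```

The function works on a copy, so the original X keeps its nan and only the returned matrix is filled.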

Related

In MATLAB how to get the result of a classification tree in a matrix?

I made a Classification Tree, code:
mytree=ClassificationTree.fit(MyData,MyLables);
mytree.view('mode','graph');
My data has two classes, and I want to get the prediction result as a matrix that shows which class each data row belongs to. For example:
data row predicted class
1 2
2 1
. .
. .
. .
How can I make this matrix?
---------------------Edited----------------------
I found that with this function I can predict my data:
label = predict(Mdl,MyData([1:50],:));
But which rows do these labels belong to?
The first column, i.e. 'data row', is simply a vector starting from 1 to number of rows of X (which is obviously also the same as number of values in Y). The second column, i.e. 'predicted class', is the same as the variable MyLables. Hence:
ReqResult = [(1:numel(Y)).' Y];
%Assuming Y is a column vector (order = nx1).
%If Y is a row vector then take the transpose of Y as well.
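For comparison, here is a small NumPy sketch of the same pairing of row numbers with labels. It assumes the labels come back in the same order as the rows that were passed to the predictor; the helper name rows_with_predictions and the first_row parameter are illustrative, not part of any library.

```python
import numpy as np


def rows_with_predictions(predicted, first_row=1):
    """Pair each predicted label with the (1-based) number of the row it came from."""
    predicted = np.asarray(predicted).ravel()
    row_ids = np.arange(first_row, first_row + predicted.size)
    return np.column_stack([row_ids, predicted])


# Hypothetical labels for three rows, in the order the rows were given.
result = rows_with_predictions([2, 1, 1])
print(result)
```

If only a slice of the data was predicted (for example rows 1 to 50), first_row can be set to the number of the first row in that slice.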
Warning:
If you're using ≥ R2014a, you should use fitctree instead of ClassificationTree.fit because as mentioned in the documentation:
ClassificationTree.fit will be removed in a future release. Use fitctree instead.

Spark out of memory when reducing by key

I'm working on an algorithm that requires math operations on large matrices. Basically, the algorithm involves the following steps:
Inputs: two vectors u and v of size n
1. For each vector, compute the pairwise Euclidean distance between the elements of the vector. Return two matrices E_u and E_v.
2. For each entry in the two matrices, apply a function f. Return two matrices M_u and M_v.
3. Find the eigenvalues and eigenvectors of M_u. Return e_i, ev_i for i = 0,...,n-1.
4. Compute the outer product of each eigenvector with itself. Return a matrix O_i = ev_i*transpose(ev_i), i = 0,...,n-1.
5. Adjust each eigenvalue with e_i = e_i + delta_i, where delta_i = sum of all elements(elementwise product of O_i and M_v)/(2*mu), and mu is a parameter.
6. Finally, return a matrix A = elementwise sum (e_i * O_i) over i = 0,...,n-1.
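For small n, the steps above can be sketched on a single machine with NumPy, which may be useful as a reference when checking a distributed implementation. This sketch makes several assumptions: the elements of u and v are scalars, f is applied elementwise (so M_u stays symmetric and np.linalg.eigh applies), and O_i is taken to be the outer product of the i-th eigenvector with itself. The names adjusted_matrix and gaussian are hypothetical, and gaussian is only a stand-in for the unspecified f.

```python
import numpy as np


def adjusted_matrix(u, v, f, mu):
    """Single-machine reference for the listed steps (small n only)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Pairwise distances; for scalar elements this is |x_j - x_k|.
    E_u = np.abs(u[:, None] - u[None, :])
    E_v = np.abs(v[:, None] - v[None, :])
    # Apply f to every entry.
    M_u, M_v = f(E_u), f(E_v)
    # Eigen decomposition of the symmetric M_u; columns of ev are eigenvectors.
    e, ev = np.linalg.eigh(M_u)
    A = np.zeros_like(M_u)
    for i in range(len(e)):
        O_i = np.outer(ev[:, i], ev[:, i])        # outer product of eigenvector i
        delta_i = np.sum(O_i * M_v) / (2.0 * mu)  # eigenvalue adjustment
        A += (e[i] + delta_i) * O_i               # accumulate the final sum
    return A


def gaussian(d):
    # Hypothetical choice of f, used here only for illustration.
    return np.exp(-d ** 2)


rng = np.random.default_rng(0)
u, v = rng.normal(size=6), rng.normal(size=6)
A = adjusted_matrix(u, v, gaussian, mu=0.5)
print(A.shape)
```

Since the sum of all elements of the elementwise product of O_i and M_v equals transpose(ev_i) * M_v * ev_i, each delta_i can be computed from the eigenvector alone, without materializing the n-by-n matrix O_i. That may help with memory in a distributed setting.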
The issue I'm facing is mainly memory when n is large (15000 or more), since all the matrices here are dense. My current implementation may not be the best, and it only partially works.
I used a RowMatrix for M_u and get eigen decomposition using SVD.
The resulting U factor of SVD is a row matrix whose columns are ev_i's, so I have to manually transpose it so that its rows become ev_i. The resulting e vector is the eigen values e_i.
Since a previous attempt of directly mapping each row ev_i to O_i failed due to out of memory, I'm currently doing
R = U.map{
  case(i,ev_i) => {
    (i, ev_i.toArray.zipWithIndex)
  }
}//add index for each element in a vector
.flatMapValues(x=>x)
.join(U)//eigen vectors column is appended
.map{case(eigenVecId, ((vecElement,elementId), eigenVec))=>(elementId, (eigenVecId, vecElement*eigenVec))}
To compute adjusted e_i's in step 5 above, M_v is stored as rdd of tuples (i, denseVector). Then
deltaRdd = R.join(M_v)
.map{
case(j,((i,row_j_of_O_i),row_j_of_M_v))=>
(i,row_j_of_O_i.t*DenseVector(row_j_of_M_v.toArray)/(2*mu))
}.reduceByKey(_+_)
Finally, to compute A, again because of memory issues, I first have to join rows from different RDDs and then reduce by key. Specifically,
R_rearranged = R.map{case(j, (i, row_j_of_O_i))=>(i,(j,row_j_of_O_i))}
termsForA = R_rearranged.join(deltaRdd)
A = termsForA.map{
  case(i,((j,row_j_of_O_i), delta_i)) => (j, (delta_i + e(i))*row_j_of_O_i)
}
.reduceByKey(_+_)
The above implementation works up to the termsForA step: if I execute an action on termsForA, such as termsForA.take(1).foreach(println), it succeeds. But if I execute an action on A, such as A.count(), an OOM error occurs on the driver.
I tried tuning Spark's configuration to increase driver memory as well as the parallelism level, but every attempt failed.
Use IndexedRowMatrix instead of RowMatrix; it helps with conversions and transposition.
Suppose your IndexedRowMatrix is Irm
svd = Irm.computeSVD(k, True)
U = svd.U
U = U.toCoordinateMatrix().transpose().toIndexedRowMatrix()
You can convert Irm to BlockMatrix for multiplication with another distributed BlockMatrix.
I guess at some point Spark decided there was no need to carry out the operations on the executors and did all the work on the driver. Actually, termsForA would fail as well on an action like count. Somehow I made it work by broadcasting deltaRdd and e.

What's the difference between sparse_softmax_cross_entropy_with_logits and softmax_cross_entropy_with_logits?

I recently came across tf.nn.sparse_softmax_cross_entropy_with_logits and I can not figure out what the difference is compared to tf.nn.softmax_cross_entropy_with_logits.
Is the only difference that training vectors y have to be one-hot encoded when using sparse_softmax_cross_entropy_with_logits?
Reading the API, I was unable to find any other difference compared to softmax_cross_entropy_with_logits. But why do we need the extra function then?
Shouldn't softmax_cross_entropy_with_logits produce the same results as sparse_softmax_cross_entropy_with_logits, if it is supplied with one-hot encoded training data/vectors?
Having two different functions is a convenience, as they produce the same result.
The difference is simple:
For sparse_softmax_cross_entropy_with_logits, labels must have the shape [batch_size] and the dtype int32 or int64. Each label is an int in range [0, num_classes-1].
For softmax_cross_entropy_with_logits, labels must have the shape [batch_size, num_classes] and dtype float32 or float64.
Labels used in softmax_cross_entropy_with_logits are the one hot version of labels used in sparse_softmax_cross_entropy_with_logits.
Another tiny difference is that with sparse_softmax_cross_entropy_with_logits, you can give -1 as a label to have loss 0 on this label.
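To make the difference in label shapes concrete, here is a NumPy sketch of the underlying math for both variants. These are illustrative re-implementations (the names log_softmax, dense_xent and sparse_xent are hypothetical), not the TensorFlow functions themselves.

```python
import numpy as np


def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def dense_xent(logits, labels):
    """labels: float array of shape [batch_size, num_classes], e.g. one-hot rows."""
    return -(labels * log_softmax(logits)).sum(axis=-1)


def sparse_xent(logits, class_ids):
    """class_ids: int array of shape [batch_size], values in [0, num_classes - 1]."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(class_ids)), class_ids]


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
class_ids = rng.integers(0, 5, size=4)
one_hot = np.eye(5)[class_ids]  # one-hot version of the integer labels

print(dense_xent(logits, one_hot))
print(sparse_xent(logits, class_ids))
```

With one-hot labels, the dense version multiplies every log-probability by 0 or 1, while the sparse version simply picks out the log-probability of the true class, so both give the same per-example loss.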
I would just like to add two things to the accepted answer, which you can also find in the TF documentation.
First:
tf.nn.softmax_cross_entropy_with_logits
NOTE: While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of labels is a
valid probability distribution. If they are not, the computation of
the gradient will be incorrect.
Second:
tf.nn.sparse_softmax_cross_entropy_with_logits
NOTE: For this operation, the probability of a given label is
considered exclusive. That is, soft classes are not allowed, and the
labels vector must provide a single specific index for the true class
for each row of logits (each minibatch entry).
Both functions compute the same results; sparse_softmax_cross_entropy_with_logits computes the cross entropy directly on the sparse labels instead of converting them with one-hot encoding.
You can verify this by running the following program:
# TensorFlow 1.x API (tf.Session, tf.random_uniform)
import tensorflow as tf
from random import randint
dims = 8
pos = randint(0, dims - 1)
logits = tf.random_uniform([dims], maxval=3, dtype=tf.float32)
labels = tf.one_hot(pos, dims)
res1 = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
res2 = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=tf.constant(pos))
with tf.Session() as sess:
    a, b = sess.run([res1, res2])
    print(a, b)
    print(a == b)
Here I create a random logits vector of length dims and generate one-hot encoded labels (where the element at pos is 1 and the others are 0).
After that I calculate the softmax and sparse softmax cross entropy and compare their output. Try rerunning it a few times to make sure that it always produces the same output.

How to build multiple regression for structures within array of structures in Matlab

I am trying to use the regress function:
b = regress(y,X);
However, I am having trouble getting it to work with structures. I think I need to fit two structures (independent variables) into X for it to work. Is there a way to do it? Perhaps I'm on the wrong track?
Here is what my structs look like:
s(1).s1 = -0.169
s(2).s1 = 0.125
s(3).s1 = -0.188
s(4).s1 = 0.188
s(5).s1 = 0.012
s(1).s2 = 0.572
s(2).s2 = 0.300
s(3).s2 = 0.018
s(4).s2 = 0.147
s(5).s2 = 1.080
s(1).s3 = 0.076
s(2).s3 = -0.490
s(3).s3 = -0.144
s(4).s3 = -0.134
s(5).s3 = -0.183
s1 and s2 are my independent variables and s3 is the dependent variable.
The reason why you have your values as fields in a structure array is beyond my understanding.... but working with this, extract out the fields and place them into a matrix (for the independent variables) and a vector (for the dependent variable).
Extract out each field for each structure into a comma-separated list, then use regress:
X = [[s.s1].' [s.s2].'];
y = [s.s3].';
b = regress(y, X);
This is assuming that the first column consists of s1 and the second column consists of s2 for the "independent" matrix. Also, s3 is the dependent variable. Simply put, the X matrix will consist of two columns. The first column is all of the s1 values extracted from the array of structures and the second column is all of the s2 values extracted. The dependent vector is made up of all of the s3 values. This syntax [s.s1] (or [s.s2] and [s.s3]) may seem a bit peculiar but it is common-place in MATLAB. Doing s.s1 for example produces a comma-separated list which takes each field from the array of structures and represents them like so:
s(1).s1, s(2).s1, s(3).s1, s(4).s1, s(5).s1
Wrapping this with [] essentially creates an array, but this creates a row vector. We need to make this a column vector, which is why the transpose (.') operator is required. For regress each column is a variable while each row is a sample for the X matrix. We repeat this for the s2 field, and the dependent vector for s3.
After running this code, I get:
>> format long g;
>> b
b =
-0.687194475280996
-0.21086419010155
format long g; is used to show more digits of precision for the answer.
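For readers working in Python, the same extraction and fit can be sketched with NumPy. The list of dicts below is an assumed stand-in for the MATLAB array of structures, and np.linalg.lstsq is used in place of regress. As in the MATLAB call above, no column of ones is added, so no intercept is fitted.

```python
import numpy as np

# Assumed Python analogue of the struct array: one dict per record.
s = [
    {"s1": -0.169, "s2": 0.572, "s3": 0.076},
    {"s1": 0.125, "s2": 0.300, "s3": -0.490},
    {"s1": -0.188, "s2": 0.018, "s3": -0.144},
    {"s1": 0.188, "s2": 0.147, "s3": -0.134},
    {"s1": 0.012, "s2": 1.080, "s3": -0.183},
]

# Each row is a sample, each column a variable (s1, then s2).
X = np.array([[rec["s1"], rec["s2"]] for rec in s])
y = np.array([rec["s3"] for rec in s])

# Least-squares coefficients for y ~ X (no intercept column).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

The coefficients should agree with the MATLAB output shown above, up to floating-point rounding.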

Converting a scipy.sparse.csr.csr_matrix into a form readable by MATLAB

I have a scipy.sparse.csr.csr_matrix which is the output from TfidfVectorizer() class. I know I can access the individual components of this matrix in this manner:
So if I have this matrix here:
tf_idf_matrix = vectorizer.fit_transform(lines)
I can access the individual components here:
tf_idf_matrix.data
tf_idf_matrix.indices
tf_idf_matrix.indptr
How do I save this from Python so that I can load it into a MATLAB sparse matrix? Or how do I change it into a dense array and save it as a numpy.ndarray text file, so that I can simply load it into MATLAB as a matrix? The matrix is not very large: its shape is (5000, 68k).
Please help. Thanks
The MATLAB sparse constructor:
S = sparse(i,j,s,m,n,nzmax) uses vectors i, j, and s to generate an m-by-n sparse matrix such that S(i(k),j(k)) = s(k), with space allocated for nzmax nonzeros
is the same as the scipy sparse constructor (including the step of summing values whose ij indices are the same).
csr_matrix((data, ij), [shape=(M, N)])
where data and ij satisfy the relationship a[ij[0, k], ij[1, k]] = data[k]
data and ij are the attributes of the coo_matrix format. So for a start I'd suggest converting with tocoo and writing the three arrays to a .mat file (scipy.io).
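A minimal sketch of that suggestion might look like the following. The function name save_coo_for_matlab and the variable names i, j, s, m, n written into the file are assumptions chosen to mirror the MATLAB sparse(i,j,s,m,n) signature. Indices are shifted by one because MATLAB is 1-based, and they are stored as doubles.

```python
import numpy as np
from scipy import sparse
from scipy.io import savemat


def save_coo_for_matlab(matrix, path):
    """Write the COO triplets of a sparse matrix to a .mat file."""
    coo = sparse.coo_matrix(matrix)
    savemat(path, {
        "i": coo.row.astype(np.float64) + 1,   # 1-based row indices
        "j": coo.col.astype(np.float64) + 1,   # 1-based column indices
        "s": coo.data.astype(np.float64),      # stored values
        "m": float(coo.shape[0]),              # number of rows
        "n": float(coo.shape[1]),              # number of columns
    })
```

After something like save_coo_for_matlab(tf_idf_matrix, 'tfidf.mat') (the file name is hypothetical), the matrix could then be rebuilt on the MATLAB side with load('tfidf.mat') followed by S = sparse(i, j, s, m, n).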
Assuming you have those components in matlab
then
x = accumarray(indptr+1, ones(size(indptr)),[1,N]);
% N being the number of rows >= max indptr+1
colind = cumsum(x);
res = sparse(colind,indices,data);
should do it.
The first part just converts the indptr vector into a vector to match each index with the right column number.
(note that indptr may have repetitions, which is why accumarray is necessary)
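The expansion of indptr can also be done on the Python side before exporting. The sketch below assumes the standard CSR layout, in which indptr[r] to indptr[r+1] delimits the stored entries of row r, so the expanded vector holds the row number of each entry. The function name csr_to_matlab_triplets is illustrative.

```python
import numpy as np


def csr_to_matlab_triplets(data, indices, indptr):
    """Expand CSR arrays into 1-based (row, column, value) triplets."""
    indptr = np.asarray(indptr)
    n_rows = len(indptr) - 1
    counts = np.diff(indptr)                     # stored entries per row (0 for empty rows)
    rows = np.repeat(np.arange(n_rows), counts)  # 0-based row of each stored entry
    return rows + 1, np.asarray(indices) + 1, np.asarray(data)


# CSR form of [[1, 0, 2], [0, 0, 0], [0, 3, 0]]; the empty middle row
# shows up as a repeated value in indptr.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 2, 3])

i, j, s = csr_to_matlab_triplets(data, indices, indptr)
print(i, j, s)
```

The three returned vectors are in the form expected by sparse(i,j,s) in MATLAB. Passing the matrix dimensions as well keeps trailing empty rows or columns from being dropped.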