Sparse matrix causes segmentation fault (exit code 139) - scipy

When working with a sparse matrix, an operation abruptly kills the kernel with exit code 139 (a segmentation fault).
This happened while working with Gensim, which uses the sparse matrix format.
The failure happens when multiplying the matrix with another matrix, or even when calling matrix.sum().
The matrix was created using SciPy:
matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(num_terms, num_docs), dtype=dtype)

It turns out the shape of the matrix (num_terms) didn't cover max(indices), which lets SciPy/NumPy index past the end of the underlying arrays.
This can easily be avoided if, after creating the matrix, we call:
matrix.check_format()
which runs sanity checks on the matrix's internal arrays and raises a ValueError if they are inconsistent, instead of crashing later.
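For example, here is a minimal sketch (the array values are made up) of how check_format catches a shape/indices mismatch before it can segfault:

import numpy as np
import scipy.sparse

# CSC arrays for a matrix declared as 3x2, but containing a row index (5)
# that is out of bounds for shape[0] == 3 -- the situation described above
data = np.array([1.0, 2.0])
indices = np.array([0, 5])
indptr = np.array([0, 1, 2])
matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(3, 2))

matrix.check_format()  # raises ValueError here, instead of a segfault later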
If you're using Gensim, just use a high num_features. It doesn't have to be your actual number of features, as long as it's not lower than the actual number.
Edit, with more details:
With Gensim you might be working on document similarity using:
sim_method = gensim.similarities.SparseMatrixSimilarity(documents, num_features=max_index)
If "documents" contains a feature id equal to or higher than max_index, it will trigger the bug (ids are 0-based, so num_features must be at least the maximum id + 1).
Gensim simply wraps the SciPy sparse matrix object; to call check_format on it, use:
sim_method.index.check_format()
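A safe construction, sketched below (the variable values are illustrative), derives num_features from the corpus itself and then sanity-checks the wrapped matrix:

import gensim

# A Gensim-style corpus: a list of (feature_id, weight) lists
documents = [[(0, 1.0), (3, 2.0)], [(7, 0.5)]]

max_index = max(fid for doc in documents for fid, _ in doc)
sim_method = gensim.similarities.SparseMatrixSimilarity(documents, num_features=max_index + 1)

sim_method.index.check_format()  # sanity-check the wrapped scipy matrix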
A more likely place to hit the bug, though, is when you use this similarity method on another corpus of documents to get their similarity scores:
sim_method[query_documents]
Again, if query_documents contains an id equal to or higher than the max_index given when the sim method was created, it will trigger the bug.
Here Gensim hides the SciPy matrix completely, so you can't call check_format directly; you'll just have to check your own input and make sure there's no bad id in there (a sketch of such a guard follows).
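Something along these lines (the helper is hypothetical) can stand in for check_format on the query path:

def max_feature_id(corpus):
    # largest feature id appearing in a Gensim-style corpus
    return max((fid for doc in corpus for fid, _ in doc), default=-1)

num_features = max_index + 1  # whatever you passed at construction time
assert max_feature_id(query_documents) < num_features, \
    "query contains a feature id outside the similarity index"
sims = sim_method[query_documents]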


No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with PySpark and SparkML at the moment. To do so I use the Titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following the Spark documentation closely. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The training code was as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20,
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question: I ran into this problem as well, and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood when estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is multicollinearity among your variables. This means that one of the variables can be predicted as a linear combination of the others. Consequently, the effect of each variable cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients, not when the model is used for prediction.
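For example, a quick way to spot this (a rough sketch, not from the original answer; it assumes the assembled "features" vector column from the question and pulls only a sample to the driver):

import numpy as np

# Collect a manageable sample of the assembled feature vectors
rows = training_df.select("features").limit(5000).collect()
X = np.array([r["features"].toArray() for r in rows])

# A rank lower than the number of columns means some columns are
# linear combinations of the others, i.e. multicollinearity
print(np.linalg.matrix_rank(X), X.shape[1])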
The GeneralizedLinearRegressionModel docs note that a training summary may simply not be available for a model.
However, you can do an initial check to avoid the error with glm_fit.hasSummary, a public boolean flag (a property in the PySpark API, a method in Scala). Using it as:
if glm_fit.hasSummary:
    print(glm_fit.summary)
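Alternatively, since the failure surfaces as a RuntimeError (as in the traceback above), you can guard the access directly; a small sketch:

try:
    print(glm_fit.summary)
except RuntimeError:
    print("No training summary available for this model")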
Here is a direct link to the PySpark source code,
and to the GeneralizedLinearRegressionTrainingSummary class source code where the error is thrown.
Make sure your input variables for the one-hot encoder start from 0.
One mistake I made that caused the summary not to be created: I fed quarter values (1, 2, 3, 4) directly into the one-hot encoder and got a vector of length 4 with one column always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
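For instance (a sketch against the Spark 3 OneHotEncoder API; the column names are made up):

from pyspark.sql import functions as F
from pyspark.ml.feature import OneHotEncoder

# Shift the 1-based quarter to a 0-based category index first
df = df.withColumn("quarter_idx", (F.col("quarter") - 1).cast("double"))

encoder = OneHotEncoder(inputCols=["quarter_idx"], outputCols=["quarter_vec"])
df = encoder.fit(df).transform(df)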

MATLAB: How to apply a vectorized function using sparsity structure?

I need to (repeatedly) build a vector of length 200 from a vector of length 2500. I can describe this operation using multiplication by a matrix which is extremely sparse: it is 200x2500 and has only one entry in each row. But I have very little control over where this entry is. My actual problem is that I need to apply this matrix not to the vector that I currently have, but rather to some componentwise function of this vector. Since I have all this sparsity, it is wasteful to apply this componentwise function to all 2500 components of my vector. Instead I would rather apply it only to the 200 components that actually contribute.
A program (with randomly chosen numbers in place of my actual ones) that has a similar problem would be something like this:
ind=randi(2500,200,1);
coefficients=randn(200,1);
A=sparse(1:200,ind,coefficients,200,2500);
x=randn(2500,1);
y=A*subplus(x);
What I don't like here is applying subplus to all of x; I would rather only have to apply it to x(ind), since only that contributes to the matrix product.
Right now the only way I can see to work around this is to replace my sparse matrix with a 200-component vector of coefficients and a 200-component vector of indices. Working this way, the code above would become:
ind=randi(2500,200,1);
coefficients=randn(200,1);
x=randn(2500,1);
y=coefficients.*subplus(x(ind));
Is there a better way to do this, preferably one that would work when A contains a few elements per row instead of just one?
The code in your question throws an exception; I think it should be:
n=2500;
m=200;
ind=randi(n,m,1);
coefficients=randn(m,1);
A=sparse(1:m,ind,coefficients,m,n);
x=randn(n,1);
Your idea using x(ind) was basically right, but ind would reorder x which is not intended. Instead you could use sort(unique(ind)). I opted to use the sparse logical index any(A~=0) because I expect it to be faster, but you could compare both versions.
%original code
y=A*subplus(x);

%multiplication using sparse logical indexing:
relevant=any(A~=0);
y=A(:,relevant)*subplus(x(relevant));

%fixed version of your code
relevant=sort(unique(ind));
y=A(:,relevant)*subplus(x(relevant));
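For anyone doing the same thing in Python, here is the identical "only touch the relevant columns" trick sketched with SciPy (the random data mirrors the example above; subplus is just max(., 0)):

import numpy as np
from scipy import sparse

m, n = 200, 2500
rng = np.random.default_rng(0)
ind = rng.integers(0, n, size=m)
coefficients = rng.standard_normal(m)
A = sparse.csc_matrix((coefficients, (np.arange(m), ind)), shape=(m, n))
x = rng.standard_normal(n)

relevant = np.unique(A.nonzero()[1])             # columns that actually contribute
y = A[:, relevant] @ np.maximum(x[relevant], 0)  # subplus == max(., 0)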

How to get the goodness of fit of data to a distribution by using probplot() in MATLAB?

I have 2 sets of float-valued data, set A and set B. Both are matrices of size 40*40. I would like to find out which set is closer to the normal distribution. I know how to use probplot() in MATLAB to plot the probabilities for one set. However, I do not know how to quantify how good the fit to the distribution is.
In Python, when people use probplot, a parameter, R^2, shows how well the data fit the normal distribution. The closer the R^2 value is to 1, the better the fit. Thus, I can simply compare two sets of data by their R^2 values. However, because of a machine problem, I cannot use Python on my current machine. Is there a parameter or function in MATLAB similar to the R^2 value?
Thank you very much,
Fitting a curve or surface to data and obtaining the goodness of fit (sse, rsquare, dfe, adjrsquare, rmse) can be done using the function fit. More info here...
The approach of @nate (+1) is definitely one possible way of going about this problem. However, the statistician in me is compelled to suggest the following alternative (that does, alas, require the Statistics Toolbox - but you have this if you have the student version):
Given that you are testing your data against a univariate Normal (not a multivariate Normal), consider using the Jarque-Bera test.
Jarque-Bera tests the null hypothesis that a given dataset is generated by a Normal distribution, versus the alternative that it is generated by some other distribution. If the Jarque-Bera test statistic is less than some critical value, then we fail to reject the null hypothesis.
So how does this help with the goodness-of-fit problem? Well, the larger the test statistic, the more "non-Normal" the data is. The smaller the test statistic, the more "Normal" the data is.
So, assuming you have converted your matrices into two vectors, A and B (each should be 1600 by 1 based on the dimensions you provide in the question), you could do the following:
%# Build sample data
A = randn(1600, 1);
B = rand(1600, 1);

%# Perform JB test (note: jbtest returns 1 when it REJECTS normality)
[ANormal, ~, AStat] = jbtest(A);
[BNormal, ~, BStat] = jbtest(B);

%# Display result
if AStat < BStat
    disp('A is closer to normal');
else
    disp('B is closer to normal');
end
As a little bonus of doing things this way, ANormal and BNormal tell you the outcome of the hypothesis test itself. Since jbtest returns 1 when it rejects the null, if ANormal is 0 you fail to reject the null hypothesis (i.e. the test statistic is consistent with A being drawn from a Normal), while if ANormal is 1 the data in A is probably not generated from a Normal distribution.
CAUTION: The approach I've advocated here is only valid if A and B are the same size, but you've indicated in the question that they are :-)
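For completeness, the same test is available in Python via SciPy, if anyone wants to cross-check (a quick sketch):

import numpy as np
from scipy import stats

A = np.random.randn(1600)   # sample data: normal
B = np.random.rand(1600)    # sample data: uniform

stat_A, p_A = stats.jarque_bera(A)
stat_B, p_B = stats.jarque_bera(B)
print('A is closer to normal' if stat_A < stat_B else 'B is closer to normal')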

Kevin Murphy's HMM MATLAB toolbox assertion error

I am working on a project that needs to use hidden Markov models. I downloaded Kevin Murphy's toolbox. I have some problems with the usage. On the toolbox webpage, he says that the first input of dhmm_em and dhmm_logprob is the symbol sequence data. In their examples, they pass row vectors as data. So when I give my symbol sequence as a row vector, I get this error:
??? Error using ==> assert at 9
assertion violated:

Error in ==> fwdback at 105
assert(approxeq(sum(alpha(:,t)),1))

Error in ==> dhmm_logprob at 17
[alpha, beta, gamma, ll] = fwdback(prior, transmat, obslik, 'fwd_only', 1);

Error in ==> mainCourseProject at 110
loglik(train_act) = dhmm_logprob(orderedSymbols, hmm{train_act}.prior, hmm{train_act}.trans, hmm{act}.emiss);
However, before giving this error, the code works for some symbol vectors. When I give my data as a column vector, the functions work fine, with no errors. So why exactly am I getting this error?
You might say that I should be giving not single vectors but sets of vectors; I also tried collecting my feature vectors in a struct and passing row vectors that way, but nothing changed, I still get the assertion error.
By the way, my symbol sequence does not contain any zeros, and I am doing everything almost exactly as shown in their examples, so I would be grateful if anyone could help.
I'm not sure, but from the function call stack shown above, shouldn't the last argument be hmm{train_act}.emiss instead of hmm{act}.emiss?
In other words, when you compute the log-probability of a sequence, you should pass components that belong to the same HMM model (transition matrix, emission matrix, and prior probabilities).
By the way, the ASSERT in the code is a sanity check that a vector of probabilities should sum to 1. Oftentimes, when working with very small values (log-probabilities), numerical stability issues can creep in... You could edit the APPROXEQ function to relax the comparison a bit, by giving it a bigger margin of error
This error message and the code it refers to are human-readable. An assertion is a guard put in by the programmer to ensure that certain conditions are met. In this case, what is the condition? approxeq(sum(alpha(:,t)),1). approxeq checks that its two arguments are approximately equal, so the assertion fails exactly when sum(alpha(:,t)) is not approximately 1.
Without knowing anything about the code, I'd also guess that these refer to probabilities. The probabilities of a node's edges must sum to one. Hopefully this starts you down a productive debugging path. If you can't figure out what's wrong with your input that produces this condition, start wading into the code a bit to see where this alpha vector comes from, and how it ended up invalid.

Eigenvalue decomposition using MATLAB

I'm conducting dimensionality reduction of a square matrix A. My issue is that I have a problem computing the eigenvalue decomposition of a 13000 x 13000 matrix A, i.e. [v d]=eigs(A). Because it's a sparse matrix, I get an 'out of memory' error on a machine with 4 GB of RAM. I'm convinced it's not my PC's problem, since the memory is not used up when the eigs command runs. The help I saw online had to do with ARPACK. I checked the recommended site, but there were a lot of files there and I don't know which to download, nor did I understand how to use it with MATLAB. Another suggestion was to use numerical methods, but I don't know which specific one to use. Any solution is welcome.
Error in ==> eigs>ishermitian at 1535
tf = isequal(A,A');

Error in ==> eigs>checkInputs at 479
issymA = ishermitian(A);

Error in ==> eigs at 96
[A,Amatrix,isrealprob,issymA,n,B,classAB,k,eigs_sigma,whch, ...

Error in ==> labcomp at 20
[vector lambda] = eigs(A)
Please, can I get a translation of these errors and how to correct them?
The reason you don't see the memory used up is that it isn't used up: MATLAB fails to allocate the needed amount of memory.
Although an array of 13000 x 13000 doubles (the default data type in MATLAB) is only about 1.25 GB (13000 * 13000 * 8 bytes), that doesn't mean 4 GB of RAM is enough: MATLAB needs those 1.25 GB as one contiguous block, otherwise it won't succeed in allocating your matrix. You can read more about memory problems in MATLAB here: http://www.mathworks.com/support/tech-notes/1100/1106.html
You can as a first step try using single precision:
[v d]=eigs(single(A));
You say that "another help says use numerical methods". If you are doing it on a computer, it's numerical by definition.
If you don't want to (or can't, due to memory constraints) do it in MATLAB, you can look for a linear algebra library (ARPACK is just one of them) and have the calculation done outside of MATLAB.
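For example, SciPy's sparse eigensolver wraps ARPACK already, so one option is to export the matrix and compute the few eigenpairs you need there (a sketch; the random matrix is only a placeholder for your real A):

import scipy.sparse as sp
from scipy.sparse.linalg import eigs  # ARPACK under the hood

A = sp.random(13000, 13000, density=1e-4, format='csr', random_state=0)

# k largest-magnitude eigenpairs; ncv plays the role of MATLAB's opts.p
vals, vecs = eigs(A, k=2, which='LM', ncv=10)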
First, if A is sparse, single(A) won't work. Single-precision sparse matrices are not implemented in MATLAB; see the comments on:
how to create a single float sparse matrix in mex files
The call to ishermitian may fail because you can't store two copies of your matrix (A and A'). Bypass this problem by commenting that line out and setting issymA to true or false yourself, depending on whether your matrix is Hermitian.
If you run into further memory problems inside eigs, try to reduce its memory footprint by asking for fewer eigenpairs, eigs(A,1), or by reducing the maximum size of the basis (option p), which by default is twice the number of requested eigenpairs:
opts.p = 3
[x,d] = eigs(A,2,'LM',opts)