Kevin Murphy's HMM MATLAB toolbox assertion error

I am working on a project that needs to use hidden Markov models. I downloaded Kevin Murphy's toolbox, and I have some problems with its usage. On the toolbox webpage, he says that the first input of dhmm_em and dhmm_logprob is the symbol sequence data. In the examples, row vectors are given as data. So when I give my symbol sequence as a row vector, I get this error:
??? Error using ==> assert at 9
assertion violated:
Error in ==> fwdback at 105
assert(approxeq(sum(alpha(:,t)),1))
Error in ==> dhmm_logprob at 17
[alpha, beta, gamma, ll] = fwdback(prior, transmat, obslik, 'fwd_only', 1);
Error in ==> mainCourseProject at 110
loglik(train_act) = dhmm_logprob(orderedSymbols, hmm{train_act}.prior, hmm{train_act}.trans, hmm{act}.emiss);
However, before giving this error, the code works for some symbol vectors. When I give my data as a column vector, the functions work fine, with no errors. So why exactly am I getting this error?
You might say that I should be giving not single vectors but sets of vectors; I also tried collecting my feature vectors in a struct and passing row vectors that way, but nothing changed, I still get the assertion error.
By the way, my symbol sequence does not contain any zeros, and I am doing everything almost exactly as shown in the examples, so I would be grateful if anyone could help me.

I'm not sure, but from the function call stack shown above, shouldn't the last argument be hmm{train_act}.emiss instead of hmm{act}.emiss?
In other words, when you compute the log-probability of a sequence, you should pass components that belong to the same HMM model (transition matrix, emission matrix, and prior probabilities).
By the way, the ASSERT in the code is a sanity check that a vector of probabilities sums to 1. Oftentimes, when working with very small values (log-probabilities), numerical stability issues can creep in. You could edit the APPROXEQ function to relax the comparison a bit, by giving it a bigger margin of error.

This error message and the code it refers to are human-readable. An assertion is a guard put in by the programmer to ensure that certain conditions are met. In this case, what is the condition? approxeq(sum(alpha(:,t)),1). I'd venture to say that approxeq wants the values to be approximately equal, so this boils down to requiring: sum(alpha(:,t)) ≈ 1.
Without knowing anything about the code, I'd also guess that these refer to probabilities. The probabilities of a node's edges must sum to one. Hopefully this starts you down a productive debugging path. If you can't figure out what's wrong with your input that produces this condition, start wading into the code a bit to see where this alpha vector comes from, and how it ended up invalid.

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with PySpark and SparkML at the moment. To do so, I use the Titanic dataset to train a GLM for predicting 'Fare' in that dataset.
I'm following the Spark documentation closely. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as follows:
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question: I ran into this problem as well, and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is multicollinearity among your variables. This means that one of the variables can be predicted as a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note, however, that multicollinearity is only a problem if you want to interpret the coefficients, not when the model is used for prediction.
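To make this concrete, here is a minimal NumPy sketch (not PySpark; the feature columns x1 and x2 are made up) showing how to spot linearly dependent columns by comparing the rank of the design matrix with its number of columns:
import numpy as np

# hypothetical feature columns; the third is an exact linear combination of the first two
rng = np.random.default_rng(0)
x1 = rng.random(100)
x2 = rng.random(100)
X = np.column_stack([x1, x2, x1 + x2])

# rank < number of columns means the columns are linearly dependent,
# so the Hessian used in the likelihood maximization is singular
print(np.linalg.matrix_rank(X), X.shape[1])  # prints 2 3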
It is documented in the GeneralizedLinearRegressionModel docs that a summary may not be available for a model.
However, you can do an initial check to avoid the error: glm_fit.hasSummary, a public boolean (note that in PySpark it is a property, not a method, so it is read without parentheses).
Using it as:
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the PySpark source code, and to the GeneralizedLinearRegressionTrainingSummary class source code where the error is thrown.
Make sure your input variables for the one-hot encoder start from 0.
One mistake I made that caused the summary not to be created: I fed quarter values (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
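In code, the fix amounts to shifting the category values before encoding; a minimal sketch, assuming a DataFrame df with an integer column quarter holding values 1 through 4:
from pyspark.sql import functions as F

# shift quarter from 1..4 down to 0..3 so the one-hot encoder's first category is used
df = df.withColumn("quarter_idx", F.col("quarter") - 1)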

sparse matrix causes Segmentation fault exit code 139

When working with a sparse matrix, it abruptly kills the kernel with exit code 139.
This happened when working with Gensim, which uses the sparse matrix format.
The failure happens when multiplying the matrix with another matrix, or even when calling matrix.sum().
The matrix was created using SciPy:
matrix = scipy.sparse.csc_matrix((data, indices, indptr), shape=(num_terms, num_docs), dtype=dtype)
It turns out the shape of the matrix (num_terms) didn't match max(indices), which causes NumPy to make erroneous assumptions about memory addresses.
This can easily be avoided if, after creating the matrix, we call:
matrix.check_format()
which performs some sanity checks on the matrix.
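Here is a minimal sketch (with made-up data) of the same failure mode: the constructor accepts out-of-range indices, but check_format catches them before they can corrupt a later operation:
import numpy as np
from scipy import sparse

# declared shape has 2 rows, but one row index is 2, i.e. out of range
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])   # row indices; 2 >= num_rows
indptr = np.array([0, 2, 3])    # two columns
bad = sparse.csc_matrix((data, indices, indptr), shape=(2, 2))  # constructs fine

bad.check_format(full_check=True)  # raises ValueError instead of segfaulting later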
If using Gensim, just use a high num_features. It doesn't have to be your actual number of features, as long as it's not lower than the actual number.
Edit, for more details:
With Gensim you might be working on document similarity using:
sim_method = gensim.similarities.SparseMatrixSimilarity(documents, num_features=max_index)
If documents contains an id higher than max_index, it will cause the bug.
Gensim simply wraps the SciPy sparse matrix object. To call check_format on it, use:
sim_method.index.check_format()
A more likely place to hit the bug, though, is when you use this similarity method on another corpus of documents to get their similarity scores:
sim_method[query_documents]
Again, if query_documents contains an id higher than the max_index given at the time the sim method was created, it will cause the bug.
Here Gensim hides the SciPy matrix completely, so you can't call check_format directly; you'll just have to check your own input and make sure there's no bug there.
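One way to do that, assuming the standard Gensim bag-of-words corpus format (each document is a list of (token_id, count) pairs), is to compute the largest token id in the query yourself before indexing:
# each document is a list of (token_id, count) pairs
max_id = max(token_id for doc in query_documents for token_id, _ in doc)

# token ids are 0-based, so they must all be strictly below num_features
if max_id >= max_index:
    raise ValueError("query contains a token id outside the index's feature range")
scores = sim_method[query_documents]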

Simulink with if/else and different signal dimensions

I need to implement an if/else in Simulink to find out whether an input is a scalar value or a matrix. Please see the diagram below:
Given:
Block(1) is an input that can be a scalar (1) or a matrix ([[0 15];[5 10]])
Block(2) must return the signal dimension of the input, e.g. 1 for a scalar and >1 for a matrix
The requirements are:
Everything must work interpreted or compiled (Simulink Coder)
The final outputs of blocks (4) and (5) are scalars
I have an average understanding of C MEX S-functions, so if I need to implement one to solve the problem, that is OK.
So far, I have had the following problems:
I don't know at all whether what I am planning to do is feasible
I don't know how to implement Block(2) so that it works in compiled mode
Even though there is an if/else, Simulink performs a pre-check before running to verify that all signal dimensions are OK; during this check, it gives an error saying, e.g., that Block(5) has a matrix input
Any clues?
Block(2) is the easiest part; it can be implemented using the "Probe" block from the Simulink library. Your input at port 1 must be a variable-sized signal, since you are expecting either a scalar or a matrix.
I assume you are feeding Input(1) to blocks 4 and 5. At model compile time, Simulink does not know which of these blocks is going to run based on the input size, so Simulink must assume that both blocks may get a scalar or a matrix. You need to make blocks 4 and 5 accept both scalar and matrix inputs without error, even though each will be used for only one type at run time.
If you are not able to do this, a simple workaround for the scalar case is to place a Selector before block 5 that always selects the first sample. This lets Simulink know that the input to block 5 is always a scalar.

Divide-by-zero encountered: rhok assumed large error using scipy.optimize

I used scipy.optimize.fmin_bfgs to minimize the hinge loss (SVM). However, I get the error:
Divide-by-zero encountered: rhok assumed large.
Somebody said that "it had to do with the training data set"; does anybody know how to deal with this problem?
From the SciPy source code, rhok is
rhok = 1.0 / (numpy.dot(yk, sk))
where both yk and sk depend on the input array x0.
A possible cause of this error is a bad choice of initial condition x0 that tends toward singularities in your function f. I would suggest plotting your function and making sure the initial conditions are always away from possible divergent values. If this is part of a larger training routine, you could wrap the call in try/except and, on catching a ZeroDivisionError, retry with the initial condition shifted by some amount. You may also find that a different minimization method, via scipy.optimize.minimize, is more robust.
If you add the full_output option to scipy.optimize.fmin_bfgs, it should give you more information about your particular case.
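For example, a minimal sketch with a toy objective standing in for the hinge loss (the tuple layout is the one documented for fmin_bfgs):
import numpy as np
from scipy.optimize import fmin_bfgs

def f(x):
    # toy smooth objective standing in for the hinge loss
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x0 = np.array([5.0, 5.0])
xopt, fopt, gopt, Bopt, func_calls, grad_calls, warnflag = fmin_bfgs(
    f, x0, full_output=True)
print(warnflag)  # 0: converged, 1: max iterations reached, 2: precision loss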

I am not able to figure out how to carry out this matrix multiplication in MATLAB

I have to carry out the following operation:
R=[0,0.5,-0.25;-0.25,0,0.25;0,0,0.25];
B=[0,k21,k31;k12,0,k32;0,0,k];
G=inv(R).*B;
g=det(G);
but MATLAB shows the following error:
??? Error using ==> horzcat
CAT arguments dimensions are not consistent.
Error in ==> g at 60
B=[0,k21,k31;k12,0,k32;0,0,k];
k21, k31, k12, k32, and k all have dimensions 923334-by-1. Can anyone help me carry out this operation?
Your code works well for me. Check that the k-values (k12, k31, k32, ...) are scalars (i.e. of dimension 1x1).
EDIT :
For the case you mention, where the k's are n-by-1, one simple way is to use a loop:
R = [0,0.5,-0.25; -0.25,0,0.25; 0,0,0.25];
g = zeros(size(k));   % preallocate the result
for ii = 1:length(k)
    B = [0,k21(ii),k31(ii); k12(ii),0,k32(ii); 0,0,k(ii)];
    G = inv(R).*B;    % element-wise product, as in the question
    g(ii) = det(G);
end
There is also a "vectorized" way to do this, but the loop seems to be good enough...