Training HMM - The amount of data required - matlab

I'm using HMMs for classification. I came across an example in the Wikipedia article on the Baum–Welch algorithm, and I hope someone can help me.
The example is as follows: "Suppose we have a chicken from which we collect eggs at noon every day. Now whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can however (for simplicity) assume that there are only two states that determine whether the chicken lays eggs."
Note that we have 2 different observations (N and E) and 2 states (S1 and S2) in this example.
My question here is:
How many observations/observed sequences (i.e., how much training data) do we need to best train the model? Is there any way to estimate or test the amount of training data required?

For each variable in your HMM, you need about 10 samples. Using this rule of thumb, you can easily calculate how many samples you need to construct a reliable classifier.
In your example you have two states, which results in a 2-by-2 transition matrix A = [a_00, a_01; a_10, a_11], where a_ij is the transition probability from state S_i to S_j.
Moreover, each of these states generates observations with probability p_S1 or p_S2, i.e., if we are in state S1, the chicken will lay an egg with probability p_S1 and will not with probability 1-p_S1.
In total you have 6 variables that need to be estimated. It is more or less obvious that they cannot be estimated accurately from only two observations. As I mentioned before, it is conventional to assume that at least 10 samples per variable are needed in order to estimate that variable accurately.
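To make the rule of thumb concrete, here is a minimal sketch (Python, purely illustrative) of that bookkeeping. It counts the variables the same way the answer does, i.e., every entry of A plus one emission probability per state for the binary observation case; one could instead count only the free parameters left after the sum-to-one constraints, which would give a slightly smaller number.

    # Tally the HMM variables as counted above, then apply the
    # ~10 samples-per-variable rule of thumb.
    def hmm_variable_count(n_states, n_symbols):
        transition = n_states * n_states         # all entries of A
        emission = n_states * (n_symbols - 1)    # e.g. p_S1, p_S2 for N/E
        return transition + emission

    n_vars = hmm_variable_count(n_states=2, n_symbols=2)
    print(n_vars)        # 6 variables, as in the chicken example
    print(10 * n_vars)   # ~60 observations by the rule of thumb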

Related

How to compare two hyperparameters in a hierarchical model?

In one hierarchical model, we have two hyperparameters: dnorm(A_mu, 0.25^-2) and dnorm(B_mu, 0.25^-2). In this case 0.25 is the sd, which I keep fixed. A_mu and B_mu represent the group-level means. After fitting the data with rjags, we get a posterior distribution for each parameter. So can I just directly compare the highest posterior density intervals (HDIs) of A_mu and B_mu? Do I need to calculate something using the sd (0.25)?
In another case, the sds of the two hyperparameters are not fixed, like dnorm(A_mu, A_sd) and dnorm(B_mu, B_sd). How can I compare the two hyperparameters in that case and make a decision, e.g., that this group is significantly different from the other group?
Remember that you are getting posterior distributions for A_mu and B_mu. This makes your comparison easy, as you can look at the 95% credible intervals for the parameters (or pick a level that satisfies your needs). I believe JAGS uses Gibbs sampling, so you should be able to get the raw samples from the posteriors of A_mu and B_mu. You can then ask "what is the probability that B_mu is greater than some value?" by calculating the percentage of posterior samples that are greater than that value. Alternatively, and in a similar spirit to frequentist hypothesis testing, you can ask what the probability is that the mean of B_mu is a draw from the posterior of A_mu. So the key is simply to work directly with the samples from your posterior. I would recommend taking a look at Andrew Gelman's BDA3 textbook (Chapter 4) for a really good reference on these concepts.
A few things to keep in mind before drawing conclusions from the data: (1) always check the validity of your Markov chains by evaluating things like autocorrelation; (2) do a posterior predictive check to make sure your model fits the data well. If the model fits poorly, the procedure above can give you very misleading results.
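To illustrate, here is a hypothetical sketch (numpy only) of working directly with the posterior samples; the arrays a_mu and b_mu are simulated stand-ins for the MCMC draws that rjags would actually give you:

    import numpy as np

    # Placeholder posterior draws; in practice these come from your
    # rjags output for A_mu and B_mu.
    rng = np.random.default_rng(0)
    a_mu = rng.normal(1.0, 0.3, size=10_000)
    b_mu = rng.normal(1.4, 0.3, size=10_000)

    # Probability that B_mu exceeds A_mu, straight from the samples:
    print((b_mu > a_mu).mean())

    # 95% credible interval for the difference B_mu - A_mu:
    diff = b_mu - a_mu
    print(np.percentile(diff, [2.5, 97.5]))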

Appropriate method for clustering ordinal variables

I was reading through all (or most) previously asked questions, but couldn't find an answer to my problem...
I have 13 variables measured on an ordinal scale (they represent knowledge transfer channels), which I want to cluster (HCA) for a subsequent binary logistic regression analysis (including all 13 variables is not possible due to the sample size of N=208). Factor analysis seems inappropriate given the scale level. I am using SPSS (but have tried R as well).
Questions:
1. Am I right in using the chi-squared measure for count data instead of the (squared) Euclidean distance?
2. How can I justify the choice of linkage method? I tried single, complete, Ward, and average linkage, but they all give different results and I can't find a source to base my decision on.
Thanks a lot in advance!
Answer 1: Since the variables are on an ordinal scale, the chi-squared measure is appropriate, because "A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches)." (Reference.)
Also, since ordinal-scaled data are essentially count or frequency data, you should prefer non-parametric tests, such as the Mann-Whitney U test to compare two groups or the Kruskal-Wallis H test to compare three or more groups, over regular parametric statistics (mean, standard deviation, etc.) and parametric tests such as ANOVA.
Answer 2: In a clustering problem, the choice of distance measure depends chiefly on the type of variables. I recommend reading these detailed posts: 1, 2, 3.
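As a rough illustration of the workflow (not of the chi-squared measure itself), here is a hypothetical Python/scipy sketch that clusters ordinal variables using a Spearman-rank-based dissimilarity, which also respects the ordinal scale; the data matrix X is a made-up placeholder for the N=208 x 13 dataset:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    X = rng.integers(1, 6, size=(208, 13))   # placeholder Likert-style data

    # Spearman rank correlation between the 13 variables, turned into
    # a dissimilarity; linkage() expects the condensed upper triangle.
    rho, _ = spearmanr(X)
    dist = 1 - rho
    condensed = dist[np.triu_indices(13, k=1)]

    Z = linkage(condensed, method="average")  # try 'complete', 'single', ...
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(labels)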

Shouldn't we take the average of n models in cross-validation for linear regression?

I have a question regarding cross-validation for linear regression models.
From my understanding, in cross-validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until every fold has been tested exactly once.
When we train the model on 9 folds, shouldn't we get a different model (maybe slightly different from the model created using the whole dataset)? I know that we take the average of all the "n" performances.
But what about the model? Shouldn't the resulting model also be the average of all the "n" models? I see that the resulting model is the same as the model created using the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and don't take the average of all the models), then what's the point of calculating the average performance over n different models (they are trained on different folds of data and are supposed to be different, right)?
I apologize if my question is unclear or sounds silly.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is a bad idea, because in machine learning "model selection" usually means choosing between families of predictors, which is itself something that can be done using cross-validation). Cross-validation is typically used for hyperparameter selection, or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is never larger than the average of their individual risks.
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation you made on the held-out set gives you an unbiased estimate of the risk of the very predictor you obtained, and for these estimates the only source of uncertainty is the estimation of the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when retraining, which is that the cross-validation risk is an estimate of the expected risk of a given learning algorithm (not of a given predictor), so that if you relearn from data drawn from the same distribution, you should on average reach the same level of performance. But that holds only on average: when retraining on the whole dataset, the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected for exactly the number of data points used for learning in each fold. If you relearn from the whole dataset, the optimal value of the hyperparameter is, in theory and in practice, no longer the same; so with retraining you really cross your fingers and hope that the hyperparameters you chose are still fine for your larger dataset.
If you used leave-one-out there is obviously no concern, and with a large number of data points 10-fold CV should be fine too. But if you are learning from 25 data points with 5-fold CV, the best hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training on all the data at once.
There is unfortunately very little thorough theory on this, but the following two papers (especially the second) consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
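To make the averaging concrete, here is a minimal sketch assuming scikit-learn and synthetic data: one linear model is fit per fold and the coefficients are averaged, which for linear models is the same as averaging the predictors (and squared loss is convex in the predictor):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=100, n_features=5, noise=1.0,
                           random_state=0)

    # Fit one predictor per training split, keeping its parameters.
    coefs, intercepts = [], []
    for train_idx, _ in KFold(n_splits=10, shuffle=True,
                              random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        coefs.append(model.coef_)
        intercepts.append(model.intercept_)

    # The averaged predictor: the mean of the 10 per-fold linear models.
    y_pred = X @ np.mean(coefs, axis=0) + np.mean(intercepts)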
The answer is simple: you use the process of (repeated) cross-validation (CV) to obtain a relatively stable performance estimate for a model, not to improve it.
Think of trying out different model types and parametrizations, which are suited to your problem to different degrees. Using CV, you obtain many estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all the (training) data. The reason for doing this many times (different partitions with repeats, each using different splits) is to get a stable estimate of the performance, which lets you look at, e.g., the mean/median performance and its spread (this tells you how well the model usually performs and how likely it is to get lucky/unlucky and produce better/worse results).
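A minimal sketch of that select-then-refit workflow (assuming scikit-learn; the data and the hyperparameter grid are made up):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    X, y = make_regression(n_samples=200, n_features=10, noise=5.0,
                           random_state=0)

    # CV scores each candidate alpha; the winner is then refit on all data.
    search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=10)
    search.fit(X, y)
    final_model = search.best_estimator_   # trained on the whole of X, y
    print(search.best_params_)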
Two more things:
Usually, using CV will improve your results in the end, simply because you end up with a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But even for ensembles, you want to use CV in the end to choose a reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

How do Markov Chains work and what is memorylessness?

How do Markov chains work? I have read the Wikipedia article on Markov chains, but the thing I don't get is memorylessness. Memorylessness states that:
The next state depends only on the current state and not on the sequence of events that preceded it.
If a Markov chain has this kind of property, then what is the use of the chain in a Markov model? Can someone explain this property?
You can visualize a Markov chain as a frog hopping from lily pad to lily pad on a pond. The frog does not remember which lily pad(s) it has previously visited. It also has a given probability of leaping from lily pad Ai to lily pad Aj, for all possible combinations of i and j. The Markov chain allows you to calculate the probability of the frog being on a certain lily pad at any given moment.
If the frog were a vegetarian and nibbled on the lily pad each time it landed on one, then the probability of it landing on lily pad Ai from lily pad Aj would also depend on how many times Ai had been visited previously. In that case, you would not be able to use a Markov chain to model the behavior and thus predict the location of the frog.
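For intuition, here is a tiny sketch (numpy, made-up hop probabilities) of the calculation the frog analogy describes: the distribution over pads after t hops is just the start distribution multiplied by the t-th power of the transition matrix:

    import numpy as np

    # Row i holds the hypothetical probabilities of hopping from pad A_i.
    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.3, 0.3, 0.4]])

    p0 = np.array([1.0, 0.0, 0.0])           # frog starts on pad A_1
    p5 = p0 @ np.linalg.matrix_power(P, 5)   # distribution after 5 hops
    print(p5)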
The idea of memorylessness is fundamental to the success of Markov chains. It does not mean that we don't care about the past. On the contrary, it means that we retain only the most relevant information from the past for predicting the future, and use that information to define the present.
This nice article provides good background on the subject:
http://www.americanscientist.org/issues/pub/first-links-in-the-markov-chain
There is a trade-off between the accuracy of your description of the past and the size of the associated state space. Say there are three pubs in the neighborhood and every evening you choose one. If you choose a pub at random each evening, this is not a Markov chain (or at best a trivial, zero-order one): the outcome is simply random. More precisely, the outcomes are independent random variables (modeling dependence was fundamental to the ideas underlying Markov's chains).
In your choice of pub you can factor in your last choice, i.e., which pub you went to the night before. For example, you might want to avoid going to the same pub two days in a row. While in reality this implies remembering where you were yesterday (and thus remembering the past!), at your modeling level the unit of time is one day, so your current state is the pub you went to yesterday. This is the classical (first-order) Markov model, with three states and a 3-by-3 transition matrix that provides the conditional probability for each ordered pair of pubs (if yesterday you went to pub I, what is the chance that today you "hop" to pub J?).
However, you can define a model where you "remember" the last two days. In this second-order Markov model, the "present" state includes the knowledge of the pub from last night and from the night before. Now you have 9 possible present states, and therefore a 9-by-9 transition matrix. Thankfully, this matrix is not fully populated.
To understand why, consider a slightly different setup in which you are so well organized that you make firm plans for your pub choices for both today and tomorrow based on the last two visits. Then you can select any possible combination of pubs for two consecutive days, and the result is a fully populated 9-by-9 matrix that maps your choices for the last two days onto those for the next two days. However, in our original problem we make a decision every day, so the future state is constrained by what happens today: at the next time step (tomorrow), today becomes yesterday, but it will still be part of the definition of "today" at that step and thus relevant to what happens the following day. The situation is analogous to moving averages, or to receding-horizon procedures. As a result, from a given state you can move to only three possible states (corresponding to today's choice of pub), which means each row of the transition matrix has only three non-zero entries.
Let us tally up the number of parameters that characterize each model: the zero-order Markov model with three states has two independent parameters (the probabilities of visiting the first and the second pub, as the probability of visiting the third is the complement of the first two). The first-order Markov model has a fully populated 3-by-3 matrix with each row summing to one (again indicating that one of the pubs will always be visited on any given day), so we end up with six independent parameters. The second-order Markov model has a 9-by-9 matrix with only three non-zero entries per row, each row again summing to one, so we have 18 independent parameters.
We can continue defining higher-order models, and our state space will grow accordingly.
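The tally generalizes neatly; a short sketch (Python, illustrative) reproduces the counts above for three pubs:

    # An order-k model over n states has n**k "present" states, each with
    # (n - 1) free transition probabilities (the last probability is fixed
    # by the sum-to-one constraint).
    def independent_parameters(n_states, order):
        return n_states ** order * (n_states - 1)

    for k in range(3):
        print(k, independent_parameters(3, k))   # 0 -> 2, 1 -> 6, 2 -> 18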
Importantly, we can refine the concept further by identifying important features of the past and using only those features to define the present, i.e., compressing the information about the past. This is what I referred to at the beginning. For example, instead of remembering all of history, we can keep track of only a few memorable events that impact our choice and use these "sufficient statistics" to construct the model.
It all boils down to the way you define the relevant variables (the state space); the Markov concepts then follow naturally from the underlying fundamental mathematics. First-order (linear) relationships (and the associated linear-algebra operations) are at the core of most current mathematical applications. You can look at a polynomial equation of degree n in a single variable, or you can define an equivalent first-order (linear) system of n equations by introducing auxiliary variables. Similarly, in classical mechanics you can either use the second-order Lagrange equations or choose canonical coordinates that lead to the (first-order) Hamiltonian formulation: http://en.wikipedia.org/wiki/Hamiltonian_mechanics
Finally, a note on steady-state vs. transient solutions of Markov problems. The overwhelming majority of practical applications (e.g., PageRank) rely on finding steady-state solutions. Indeed, the existence of such convergence to a steady state was A. Markov's original motivation for creating his chains, in an effort to extend the central limit theorem to dependent variables. The transient behavior (such as hitting times) of Markov processes is significantly less studied and more obscure. However, it is perfectly valid to use a Markov model to predict outcomes at a specific point in the future (and not only the converged, "equilibrium" solution).
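For completeness, a small numpy sketch (with an illustrative transition matrix) of finding the steady-state solution: for a row-stochastic P it is the left eigenvector with eigenvalue 1, normalized to sum to one:

    import numpy as np

    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.3, 0.3, 0.4]])

    # Left eigenvectors of P are eigenvectors of P.T; pick the one for
    # the largest (= 1) eigenvalue and normalize it into a distribution.
    eigvals, eigvecs = np.linalg.eig(P.T)
    stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    stationary /= stationary.sum()
    print(stationary)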

Rapidminer - neural net operator - output confidence

I have a feed-forward neural network with six inputs, one hidden layer, and two output nodes (1; 0). The NN is trained on 0/1 target values.
When applying the model, the variables confidence(0) and confidence(1) are created, and the sum of these two numbers is 1 for each row.
My question is: what exactly do these two numbers (confidence(0) and confidence(1)) mean? Are they probabilities?
Thanks for your answers!
In general
The confidence values (or scores, as they are called in other programs) represent a measure of how, well, confident the model is that the presented example belongs to a certain class. They depend strongly on the general strategy and the properties of the algorithm.
Examples
The easiest example to illustrate is the majority classifier, which just assigns the same score to all observations based on the class proportions in the original training set.
Another example is the k-nearest-neighbor classifier, where the score for a class i is calculated by averaging the distances to those examples that both belong to the k nearest neighbors and have class i. The score is then sum-normalized across all classes.
In the specific case of a NN, I do not know how they are calculated without checking the code. I guess it is just the value of each output node, sum-normalized across both classes.
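If that guess is right, the computation would look like this trivial sketch (numpy, made-up activations):

    import numpy as np

    raw = np.array([0.8, 0.3])        # hypothetical output-node activations
    confidence = raw / raw.sum()      # sum-normalized across both classes
    print(confidence)                 # [0.727..., 0.273...]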
Do the confidences represent probabilities?
In general, no. To illustrate what probabilities mean in this context: if an example has probability 0.3 for class "1", then 30% of all examples with similar feature/variable values should belong to class "1" and 70% should not.
As far as I know, this task is called "calibration". For this purpose, some general methods exist (e.g., binning the scores and mapping them to the class fraction of the corresponding bin) as well as some classifier-dependent ones (e.g., Platt scaling, which was invented for SVMs). A good place to start is:
Bianca Zadrozny, Charles Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates
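As a rough illustration of the binning approach mentioned above (numpy, with simulated scores and labels standing in for real held-out data):

    import numpy as np

    rng = np.random.default_rng(0)
    scores = rng.uniform(0, 1, size=1000)    # raw classifier scores
    labels = (rng.uniform(0, 1, size=1000) < scores).astype(int)

    # Map each score to its bin, then to the empirical class fraction
    # of that bin.
    edges = np.linspace(0, 1, 11)[1:-1]      # 10 equal-width bins
    bin_idx = np.digitize(scores, edges)
    calibrated = np.array([labels[bin_idx == b].mean() for b in range(10)])

    # Calibrated probability for a new raw score:
    print(calibrated[np.digitize(0.42, edges)])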
The confidence measures correspond to the proportion of outputs 0 and 1 that are activated in the initial training dataset.
E.g., if 30% of your training set has outputs (1; 0) and the remaining 70% has outputs (0; 1), then confidence(0) = 30% and confidence(1) = 70%.