Degrees of Freedom of Markov Chains

I have a set of 5000 strings of length 4, where each character in the string can be either A, B, C, or D.
A 0th-order Markov chain (no dependency) gives a 4x1 array with entries A, B, C, D.
A 1st-order Markov chain (position j depends on the previous position i) gives a 4x4 matrix with rows Ai, Bi, Ci, Di and columns Aj, Bj, Cj, Dj.
A 2nd-order Markov chain (position k depends on positions j and i) gives a 4x4x4 array with dimensions Ai, Bi, Ci, Di; Aj, Bj, Cj, Dj; and Ak, Bk, Ck, Dk [or, flattened, a 16x4 matrix with rows Aij, Bij, Cij, Dij and columns Ak, Bk, Ck, Dk].
A 3rd-order Markov chain (position l depends on positions k, j, and i) gives a 4x4x4x4 array with dimensions Ai, Bi, Ci, Di; Aj, Bj, Cj, Dj; Ak, Bk, Ck, Dk; and Al, Bl, Cl, Dl [or, flattened, a 64x4 matrix with rows Aijk, Bijk, Cijk, Dijk and columns Al, Bl, Cl, Dl].
What is the number of parameters for each of these 4 orders? I have a few ideas, but want to see what others think. Thank you for any advice!

As was pointed out in the comments, the answer is almost contained in the question. A general formula for the number of independent parameters that fully specify a k-th order Markov model with n possible states is n^k*(n-1), for n > 1.
The derivation of this general formula is the same as the one detailed in "How does a Markov chain work and what is memorylessness?" for n=3 and k=2.
Specifically, if we take into account the k previous steps (including the current one) to predict the next step, the transition matrix should allow for all possible combinations of those k steps, so its dimensions are n^k by n^k. However, since from each such state only n outcomes are possible, each row of this matrix has only n non-zero entries. Thus we have n*n^k non-zero entries in the transition matrix, and each row must sum to 1. So, to obtain the number of independent parameters, we subtract n^k from the number of non-zero entries: n*n^k - n^k = n^k*(n-1).
This answer does not cover initial conditions, which you would not need if you are looking for a steady-state solution. If you are interested in the transient solution, you need to specify an additional (n-1)*k parameters.
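For the concrete case in the question (n = 4 symbols, orders 0 through 3), a quick MATLAB check of the formula:

n = 4;                        % symbols A, B, C, D
for k = 0:3
    fprintf('order %d: %d free parameters\n', k, n^k * (n - 1));
end
% prints 3, 12, 48 and 192 parameters for orders 0, 1, 2 and 3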

Related

Fit f(x,y,z)=0 to a set of 3D points using MATLAB

The statement is the following:
Given a 3D set of points (x,y,z), fit a surface defined by a certain number of parameters IMPLICITLY
I'm definitely no expert in programming, but I need to get this done in one way or another. I've considered other programs, such as OriginPro, which can solve this problem pretty easily, but I want to have it done in MATLAB.
The surface is defined by:
A*x^2+B*y^2+C*z^2+D*x+E*y+F*z+G*xy+H*xz+I*yz+J=0
Considering that the Curve Fitting Toolbox can only fit explicit functions, what would you guys suggest?
IMPORTANT REMARK: I'm not asking for a solution, just advice on how to proceed
This can be chalked up to solving a linear system of equations where each point forms a constraint, or equation, in your system. You would thus find the set of coefficients in your surface equation that satisfies all points. Using the equation in your question as is, one would find the null space of the linear system that satisfies the surface equation. Given a set of m points with x, y and z coordinates, we can reformulate the above equation as a matrix-vector multiplication, with the first factor technically being a matrix of one row and the vector holding the coefficients that fit your surface. This is important before we proceed to the null space part of this problem.
In particular, you can agree with me that we can represent the above in the following matrix-vector multiplication:
[x^2 y^2 z^2 x y z xy xz yz 1] [A] = [0]
                               [B]
                               [C]
                               [D]
                               [E]
                               [F]
                               [G]
                               [H]
                               [I]
                               [J]
Our objective is to find the coefficients A, B, ..., J that would satisfy the constraint above. Now moving onto the more general case, since you have m points, we can build our linear system and thus a matrix of coefficients on the left side of this expression:
[x_1^2 y_1^2 z_1^2 x_1 y_1 z_1 x_1*y_1 x_1*z_1 y_1*z_1 1] [A]   [0]
[x_2^2 y_2^2 z_2^2 x_2 y_2 z_2 x_2*y_2 x_2*z_2 y_2*z_2 1] [B]   [0]
[x_3^2 y_3^2 z_3^2 x_3 y_3 z_3 x_3*y_3 x_3*z_3 y_3*z_3 1] [C]   [0]
[                          ...                          ] [D] = [0]
[                          ...                          ] [E]   [0]
[                          ...                          ] [F]   [0]
[                          ...                          ] [G]   [0]
[                          ...                          ] [H]   [0]
[                          ...                          ] [I]   [0]
[x_m^2 y_m^2 z_m^2 x_m y_m z_m x_m*y_m x_m*z_m y_m*z_m 1] [J]   [0]
We now build this linear system, and solve to find our coefficients. The trick is to build the matrix that you see on the left hand side of this linear system, which I will call M. Each row is such that you create [x_i^2 y_i^2 z_i^2 x_i y_i z_i x_i*y_i x_i*z_i y_i*z_i 1] with x_i, y_i and z_i being the ith (x,y,z) coordinate in your dataset.
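As a minimal sketch of this step (assuming your coordinates are stored in m-by-1 column vectors x, y and z), building M is a one-liner:

M = [x.^2, y.^2, z.^2, x, y, z, x.*y, x.*z, y.*z, ones(numel(x), 1)];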
Once you build this, you would thus find the null space of this system. There are many methods to do this in MATLAB. One way is to simply use the null function on the matrix you build above and it will return to you a matrix where each column is a potential solution to the surface you are fitting above. That is, each column directly corresponds to the coefficients A, B, ..., J that would fit your data to the surface. You can also try using the singular value decomposition or QR decomposition if you like, but the null function is a good place to start as it uses singular value decomposition already.
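For instance, with M built as in the sketch above:

Nmat = null(M);       % columns of Nmat span the null space of M
coeff = Nmat(:, 1);   % any column is a valid coefficient vector [A; ...; J]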
I would like to point out that the above will only work if you provide a matrix that is not full rank. To simplify things, this can happen if the number of points you have is less than the number of parameters you have. Therefore, this method would only work if you had up to 9 points in your data. If you have exactly 9, then this method will work very well. If you have fewer than 9, the number of degrees of freedom increases and there will be more potential solutions: specifically, the null space has dimension 10 - m, and any vector in it is a valid solution. If you have 10 or more points in general position, the matrix is full rank and the only solution in the null space is the trivial one, with all coefficients set to 0.
In order to escape the possibility of the null space being all 0, or the possibility of the null space providing more than one solution, you probably just want one solution, and you most likely have 10 or more possible points that you want to fit your data with. An alternative method that I can provide is simply an extension of the above but we don't need to find the null space. Specifically, you relax one of the coefficients, say J and you can set that to any value you wish. For example, set it to J = 1. Therefore, the system of equations now changes where J disappears from the mix and it now appears on the right side of the system:
[x_1^2 y_1^2 z_1^2 x_1 y_1 z_1 x_1*y_1 x_1*z_1 y_1*z_1] [A]   [-1]
[x_2^2 y_2^2 z_2^2 x_2 y_2 z_2 x_2*y_2 x_2*z_2 y_2*z_2] [B]   [-1]
[x_3^2 y_3^2 z_3^2 x_3 y_3 z_3 x_3*y_3 x_3*z_3 y_3*z_3] [C]   [-1]
[                        ...                           ] [D] = [-1]
[                        ...                           ] [E]   [-1]
[                        ...                           ] [F]   [-1]
[                        ...                           ] [G]   [-1]
[                        ...                           ] [H]   [-1]
[x_m^2 y_m^2 z_m^2 x_m y_m z_m x_m*y_m x_m*z_m y_m*z_m] [I]   [-1]
You can thus find the parameters A, B, ..., I using linear least squares where the solution can be solved using the pseudoinverse. The benefit with this approach is that because the matrix is full rank, there is one and only one solution, thus being unique. Additionally, this formulation is nice because if there is an exact solution to the linear system, solving with the pseudoinverse will provide the exact solution. If there is no exact solution to the system, meaning that not all constraints are satisfied, the solution provided is one that minimizes the least squared error between the data and the parameters that were fit with that data.
MATLAB already has an awesome utility for solving a system through linear least squares; in fact, the core functionality of MATLAB is solving linear algebra problems (if you didn't know that already). You can use matrix left division to solve the problem. Simply put, calling the matrix of coefficients you build after the relaxation of J again M, the solution to the problem is simply coeff = M\(-ones(m,1));, with m being the number of points and coeff being the coefficients of the surface equation that fits your points. The -ones(m,1) statement creates a column vector of m entries, each equal to -1.
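Putting the relaxed formulation together, again assuming x, y and z are m-by-1 column vectors of your data:

M = [x.^2, y.^2, z.^2, x, y, z, x.*y, x.*z, y.*z];  % J has been dropped
coeff = M \ (-ones(numel(x), 1));                   % least-squares estimate of A, ..., I
coeff = [coeff; 1];                                 % append the fixed J = 1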
The least squares approach gives a more stable and unique solution, since you specifically constrain one of the coefficients, J, to be 1. The null space approach only works if you have fewer points than parameters, and it can give you more than one solution: specifically, you will get a family of 10 - m independent solutions, all of which are equally good at fitting your data.
I hope this is enough to get you started and good luck!

PCA (Principal Component Analysis) on multiple datasets

I have a set of climate data (temperature, pressure and moisture, for example): X, Y, Z, which are matrices with dimensions (n x p), where n is the number of observations and p is the number of spatial points.
Previously, to investigate modes of variability in dataset X, I simply performed an empirical orthogonal function (EOF) analysis, or Principal Component Analysis (PCA), on X. This involved decomposing the matrix X via SVD.
To investigate the coupling of the modes of variability of X and Y, I used maximum covariance analysis (MCA), which involved decomposing a covariance matrix proportional to XY^T (T is transpose).
However, if I wish to look at all three datasets, how do I go about doing this? One idea I had was to form a fourth matrix, L, as the 'feature' concatenation of the three datasets:
L = [X, Y, Z]
so that my matrix L will have dimensions (n x 3p).
I would then use standard PCA/EOF analysis, decomposing this matrix L via SVD; I would obtain modes of variability of size (3p x 1), where the mode associated with X is the first p values, the mode associated with Y is the second set of p values, and the mode associated with Z is the last p values.
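A minimal sketch of what I mean (assuming the column means of X, Y and Z have already been removed, and p is the number of spatial points):

L = [X, Y, Z];                  % n-by-3p feature concatenation
[U, S, V] = svd(L, 'econ');
modeX = V(1:p, 1);              % X part of the leading mode
modeY = V(p+1:2*p, 1);          % Y part
modeZ = V(2*p+1:3*p, 1);        % Z part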
Is this correct? Or can anyone suggest a better way of looking at the coupling of all three (or more) datasets?
Thank you so much!
I'd recommend treating the features as an extra dimension, i.e. forming an f x n x p array, where f is the number of features. You would then use a multilinear extension of PCA that can work on tensor data.
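A minimal sketch of the stacking step (the multilinear PCA itself needs an external toolbox, which I leave out; X, Y and Z are n-by-p as in the question):

T = cat(3, X, Y, Z);          % n-by-p-by-3 array
T = permute(T, [3, 1, 2]);    % reorder to f-by-n-by-p, with f = 3 features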

Computing the SVD of a rectangular matrix

I have a matrix M of size K x N, where K = 49152 is the dimension of the problem and N = 52 is the number of observations.
I have tried to use [U,S,V] = svd(M), but doing this I run out of memory.
I found another code which uses [U,S,V] = svd(cov(M)) and it works well. My questions are: what is the meaning of using cov(M) inside the svd call, and what is the meaning of the resulting U, S and V?
Finding the SVD of the covariance matrix is a method to perform Principal Component Analysis, or PCA for short. I won't get into the mathematical details here, but PCA performs what is known as dimensionality reduction. If you'd like a more formal treatment of the subject, you can read up on my post about it here: What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?. Simply put, dimensionality reduction projects your data, stored in the matrix M, onto a lower dimensional surface with the least amount of projection error. In this matrix, we are assuming that each column is a feature or a dimension and each row is a data point. I suspect the reason you run out of memory when applying the SVD to the actual data matrix M, rather than to the covariance matrix, is that you have a significant number of data points and a small number of features. The covariance matrix finds the covariance between pairs of features. If M is an m x n matrix, where m is the total number of data points and n is the total number of features, doing cov(M) would give you an n x n matrix, so you are applying SVD to a much smaller amount of memory than M.
As for the meaning of U, S and V: for dimensionality reduction specifically, the columns of V are what are known as the principal components. The ordering of V is such that the first column is the axis of your data that describes the greatest amount of variability. As you move from the second column up to the nth column, you introduce more axes and the variability described by each decreases. At the nth column, you are essentially describing your data in its entirety without reducing any dimensions. The diagonal values of S denote what is called the variance explained, and they follow the same ordering as V. As you progress through the singular values, they tell you how much of the variability in your data is described by each corresponding principal component.
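For example, assuming S comes from the SVD of the mean-subtracted data, the fraction of variance explained by each component is:

s = diag(S);
explained = s.^2 / sum(s.^2);   % fraction of total variance per principal component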
To perform the dimensionality reduction, you can either take U and multiply by S, or take your data, mean-subtracted, and multiply by V. In other words, supposing X is the matrix M where each column has its mean computed and then subtracted from it, the following relationship holds:
US = XV
To actually perform the final dimensionality reduction, you take either US or XV and retain the first k columns, where k is the total number of dimensions you want to retain. The value of k depends on your application, but many people choose k to be the total number of principal components that explain a certain percentage of the variability in your data.
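As a minimal end-to-end sketch, assuming each row of M is a data point and each column a feature:

X = bsxfun(@minus, M, mean(M, 1));   % subtract each column's mean
[U, S, V] = svd(X, 'econ');
k = 2;                               % number of dimensions to keep (example value)
Z = X * V(:, 1:k);                   % reduced data; equals U(:, 1:k) * S(1:k, 1:k)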
For more information about the link between SVD and PCA, please see this post on Cross Validated: https://stats.stackexchange.com/q/134282/86678
Instead of [U, S, V] = svd(M), which tries to build a matrix U that is 49152 by 49152 (= 18 GB 😱!), do svd(M, 'econ'). That returns the “economy-class” SVD, where U will be 52 by 52, S is 52 by 52, and V is also 52 by 52.
cov(M) will remove each dimension’s mean and evaluate the inner product, giving you a 52 by 52 covariance matrix. You can implement your own version of cov, called mycov, as
function C = mycov(M)
    M = bsxfun(@minus, M, mean(M, 1)); % subtract each dimension's mean over all observations
    C = M' * M / (size(M, 1) - 1);     % normalise by N-1 to match cov's default
end
(You can verify this works by looking at mycov(randn(49152, 52)), which should be close to eye(52), since each element of that array is IID-Gaussian.)
There’s a lot of magical linear algebraic properties and relationships between the SVD and EVD (i.e., singular value vs eigenvalue decompositions): because the covariance matrix cov(M) is a Hermitian matrix, it’s left- and right-singular vectors are the same, and in fact also cov(M)’s eigenvectors. Furthermore, cov(M)’s singular values are also its eigenvalues: so svd(cov(M)) is just an expensive way to get eig(cov(M)) 😂, up to ±1 and reordering.
As @rayryeng explains at length, usually people use svd(M, 'econ') because they want eig(cov(M)) without evaluating cov(M): you never want to compute cov(M) explicitly, because it's numerically unstable. I recently wrote an answer that showed, in Python, how to compute eig(cov(M)) using svd(M2, 'econ'), where M2 is the 0-mean version of M, used in the practical application of color-to-grayscale mapping, which might help you get more context.

Matlab: How to convert a matrix into a Toeplitz matrix

Consider a discrete dynamical system where x[0] = rand() denotes the initial condition of the system.
I have generated an m by n matrix by the following step: generate m vectors, with m different initial conditions, each of dimension N (N indicates the number of samples or elements). This matrix is called R. Using R, how do I create a Toeplitz matrix T?
Mathematically,
R = [ x_0[0], ..., x_0[n-1];
      ...;
      x_m[0], ..., x_m[n-1] ]
The Toeplitz matrix T =
[ x[n-1], x[n-2], ..., x[0];
  x[0],   x[n-1], ..., x[1];
  ...
  x[m-2], x[m-3], ..., x[m-1] ]
I tried working with toeplitz(R), but the dimension changes. The dimension should not change, as seen mathematically.
According to the paper provided (Toeplitz structured chaotic sensing matrix for compressive sensing by Yu et al.) there are two Chaotic Sensing Matrices involved. Let's explore them separately.
The Chaotic Sensing Matrix (Section A)
It is clearly stated that to create such a matrix you have to build m independent signals (sequences) with m different initial conditions (in the range ]0;1[) and then stack the signals as rows (that is, one signal = one row). Each of these signals must have length N. This is exactly your matrix R, which is correctly evaluated as it is. I'd like to suggest a code improvement, though: instead of building a column and then transposing the matrix, you can build the matrix row-wise directly:
R=zeros(m,N);
R(:,1)=rand(m,1); %build the first column with m initial conditions
Please note: by running randn() you select values with a Gaussian (Normal) distribution, and such values might not be in the range ]0;1[ as stated in the paper (right below equation 9). By using rand() instead, you take uniformly distributed values in that range.
After that, you can build every row separately with a for-loop. Note that the logistic map must be iterated on the unshifted values (iterating on the shifted ones would quickly diverge), so keep the running state in an auxiliary variable:

for i = 1:m
    z = R(i,1);              % unshifted logistic-map state
    for j = 2:N              % skip first column
        z = 4*z*(1-z);       % logistic map update
        R(i,j) = z - 0.5;    % store the shifted value
    end
end
The Toeplitz Chaotic Sensing Matrix (Section B)
It is clearly stated at the beginning of Section B that to build the Toeplitz matrix you should consider a single sequence x with a given, single initial condition. So let's build such a sequence (again iterating the logistic map on the unshifted state):

x = zeros(1, N);           % preallocate the sequence
x(1) = rand();             % single initial condition
z = x(1);                  % unshifted logistic-map state
for j = 2:N                % skip first element
    z = 4*z*(1-z);         % logistic map update
    x(j) = z - 0.5;        % store the shifted value
end
Now, to build the matrix you can consider:
what does the first row look like? Well, it looks like the sequence itself, but flipped (i.e. instead of going from 0 to n-1, it goes from n-1 to 0)
what does the first column look like? It is the last item of x concatenated with the elements in the range 0 to m-2
Let's then build the first row (r) and the first column (c):
r=fliplr(x);
c=[x(end) x(1:m-1)];
Please note: in Matlab, indices start from 1, not from 0 (so instead of going from 0 to m-2, we go from 1 to m-1). Also, end refers to the last element of a given array.
Now, looking at the help for the toeplitz() function, it is clearly stated that you can build a non-square Toeplitz matrix by specifying the first row and the first column. Therefore, finally, you can build such a matrix as:
T=toeplitz(c,r);
Such a matrix will indeed have dimensions m-by-N, as reported in the paper.
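Putting the pieces together, a minimal end-to-end sketch (m and N are placeholder sizes here):

m = 8; N = 32;             % placeholder dimensions
x = zeros(1, N);
x(1) = rand();             % single initial condition in ]0;1[
z = x(1);                  % unshifted logistic-map state
for j = 2:N
    z = 4*z*(1-z);         % logistic map update
    x(j) = z - 0.5;        % shifted value
end
r = fliplr(x);             % first row
c = [x(end) x(1:m-1)];     % first column
T = toeplitz(c, r);        % m-by-N Toeplitz-structured sensing matrix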
Even though the authors call both of them \Phi, they actually are two separate matrices.
They do not take the Toeplitz of the Beta-Like Matrix (a Toeplitz matrix is not a function or operator of some kind), nor do they transform the Beta-Like Matrix into a Toeplitz matrix.
You first have the Beta-Like Matrix (i.e. the Chaotic Sensing Matrix), and then the Toeplitz-structured Chaotic Sensing Matrix: the structure typical of Toeplitz matrices is a diagonal-constant structure, i.e. all elements along each diagonal have the same value.

Detecting cycles in an adjacency matrix

Let A be the adjacency matrix for the graph G = (V,E). A(i,j) = 1 if the nodes i and j are connected with an edge, A(i,j) = 0 otherwise.
My objective is to determine whether G is acyclic or not. A cycle is defined in the following way:
i and j are connected: A(i,j) = 1
j and k are connected: A(j,k) = 1
k and i are connected: A(k,i) = 1
I have implemented a solution which navigates the matrix as follows:
Start from an edge (i,j)
Select the set O of edges which are outgoing from j, i.e., all the 1s in the j-th row of A
Navigate O in a DFS fashion
If one of the paths generated from this navigation leads to the node i, then a cycle is detected
Obviously this solution is very slow, since I have to evaluate all the paths in the matrix. If A is very big, the overhead is huge. I was wondering whether there is a way of navigating the adjacency matrix so as to find cycles without using an expensive algorithm such as DFS.
I would like to implement my solution in MATLAB.
Thanks in advance,
Eleanore.
I came across this question when answering this math.stackexchange question. For future readers, I feel like I need to point out (as others have already) that Danil Asotsky's answer is incorrect, and provide an alternative approach. The theorem Danil is referring to is that the (i,j) entry of A^k counts the number of walks of length k from i to j in G. The key thing here is that a walk is allowed to repeat vertices. So even if a diagonal entry of A^k is positive, each walk that entry counts may contain repeated vertices, and so wouldn't count as a cycle.
Counterexample: A path of length 4 would contain a 4-cycle according to Danil's answer (not to mention that the answer would imply P=NP because it would solve the Hamilton cycle problem).
Anyways, here is another approach. A graph is acyclic if and only if it is a forest, i.e., it has c components and exactly n-c edges, where n is the number of vertices. Fortunately, there is a way to calculate the number of components using the Laplacian matrix L, which is obtained by replacing the (i,i) entry of -A with the sum of entries in row i of A (i.e., the degree of vertex labeled i). Then it is known that the number of components of G is n-rank(L) (i.e., the multiplicity of 0 as an eigenvalue of L).
So G has a cycle if and only if the number of edges is at least n-(n-rank(L))+1. On the other hand, by the handshaking lemma, the number of edges is exactly half of trace(L). So:
G is acyclic if and only if 0.5*trace(L)=rank(L). Equivalently, G has a cycle if and only if 0.5*trace(L) >= rank(L)+1.
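A direct MATLAB transcription of this test (assuming A is the symmetric 0/1 adjacency matrix of an undirected graph with no self-loops):

Lap = diag(sum(A, 2)) - A;                    % graph Laplacian L = D - A
isAcyclic = (0.5 * trace(Lap) == rank(Lap));  % true iff G has no cycle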
Based on Danil's observation, you need to compute A^n; a slightly more efficient way of doing so is

n = size(A, 1);
An = A;
for ii = 2:n
    An = An * A;                  % reuse A^(ii-1) rather than recomputing from scratch
    if trace(An) ~= 0
        fprintf(1, 'got cycles\n');
        break;                    % a cycle has been found, no need to continue
    end
end
If A is the adjacency matrix of the directed or undirected graph G, then the matrix A^n (i.e., the matrix product of n copies of A) has the following property: the entry in row i and column j gives the number of (directed or undirected) walks of length n from vertex i to vertex j.
E.g., if for some integer n the matrix A^n contains at least one non-zero diagonal entry, then the graph has a cycle of size n.
The easiest way to check for non-zero diagonal elements of a matrix is to compute its trace: trace(A) = sum(diag(A)) (in our case the elements of the matrix power are always non-negative).
A Matlab solution can be the following:

for n = 2:size(A,1)
    if trace(A^n) ~= 0
        fprintf('Graph contains cycle of size %d\n', n);
        break;
    end
end
This approach uses DFS, but is very efficient, because we don't repeat nodes in subsequent DFS's.
High-level approach:
Initialize the values of all the nodes to -1.
Do a DFS from each unexplored node, setting that node's value to that of an auto-incremented value starting from 0.
For these DFS's, update each node's value with the previous node's value + i/n^k, where that node is the ith child of the previous node and k is the depth explored, skipping already-explored nodes (except for checking for a bigger value).
So, an example for n = 10:
i (0)
+-- j (0.1)
|   +-- k (0.11)
|   |   +-- p (0.111)
|   +-- l (0.12)
+-- m (0.2)

q (1)
+-- o (1.1)
...
You can also use i/(branching factor + 1) for each node to reduce the number of significant digits, but that requires additional calculation to determine.
So above we did a DFS from i, which had 2 children j and m. m had no children, j had 2 children, .... Then we finished with i and started another DFS from the next unexplored node q.
Whenever you encounter a bigger value, you know that a cycle occurred.
Complexity:
You check every node at most once, and at every node you do n checks, so complexity is O(n^2), which is the same as looking at every entry in the matrix once (which you can't do much better than).
Note:
I'll also just note that an adjacency list will probably be faster than an adjacency matrix unless it's a very dense graph.
That is the problem I also found. The explanation, I think, is the following:
when we talk about a cycle, we implicitly mean directed cycles. The adjacency matrix that you have has a different meaning when you consider a directed graph; it is indeed a directed cycle of length 2. So the A^n solution is actually for directed graphs. For undirected graphs, I guess a fix would be to consider only the upper triangular part of the matrix (the rest filled with zeros) and repeat the procedure. Let me know if this is the right answer.
If the digraph G is represented by its adjacency matrix M, then M' = (I - M) will be singular if there is a cycle in it, where I is the identity matrix of the same order as M.
Some more thoughts on the matrix approach... The example cited is the adjacency matrix for a disconnected graph (nodes 1&2 are connected, and nodes 3&4 are connected, but neither pair is connected to the other pair). When you calculate A^2, the answer (as stated) is the identity matrix. However, since Trace(A^2) = 4, this indicates that there are 2 loops each of length 2 (which is correct). Calculating A^3 is not permitted until these loops are properly identified and removed from the matrix. This is an involved procedure requiring several steps and is detailed nicely by R.L. Norman, "A Matrix Method for Location of Cycles of a Directed Graph," AIChE J, 11-3 (1965) pp. 450-452. Please note: it is unclear from the author whether this approach is guaranteed to find ALL cycles, UNIQUE cycles, and/or ELEMENTARY cycles. My experience suggests that it definitely does not identify ONLY unique cycles.
I cannot add a comment directly, but this comment by Casteels (@casteels) is incorrect:

"@Pushpendre My point is that if Danil's answer was correct for directed graphs, then it would be correct for undirected graphs as well, which it is not. The counterexample in my previous comment does not have the adjacency matrix you wrote down; I said to replace each edge with a directed edge in each direction. This gives the same adjacency matrix as the undirected case. Are you sure you are not confusing cycle with closed walk? – Casteels, Apr 24 '15 at 9:20"
As soon as a directed graph has two vertices with arcs in both directions, then it has a cycle of length 2, and the square of its adjacency matrix (which, in the 'construction' proposed above, would indeed be equal to that of the underlying undirected graph), will have a non-zero diagonal coefficient (as does the square of every adjacency matrix of a non-empty undirected graph, since an edge immediately gives a (non-elementary) walk of length 2 from a vertex to itself). So in that case, Danil's answer essentially correctly detects a cycle. The reasoning above is not correct.
Danil's answer is indeed correct for directed graphs. In a digraph, a single arc cannot be traversed both ways, so every closed directed walk must contain a directed cycle, which will create a non-zero coefficient on the diagonal of some power of the original adjacency matrix of the directed graph. So one can keep computing the powers of the matrix increasingly from 1 to the number of vertices, stopping as soon as a diagonal coefficient is non-zero.