Extension of a Markov chain from first order to second order? - matlab

Below is my MATLAB code snippet for generating a first-order Markov chain. I am having trouble extending it to a second-order Markov chain. Can someone help me do so? Any suggestions, hints, links, pseudo-code, algorithms, Python or MATLAB snippets would be helpful.
cdist is the cumulative distribution vector (size 28*1) for my 27 symbols. I am writing the output to a file called chain. p and q are uniform random numbers. CTRANS is the cumulative matrix corresponding to my first-order transition matrix TRANS (not shown here; its size is 729*27). Besides being the cumulative version, CTRANS also has a row vector of zeros appended on top for programming ease. cols is the number of columns of CTRANS.
%generate sequence according to distribution and transition matrix
fileID = fopen('chain','w');
for k=1:10000
    p=rand;
    for l=2:numel(cdist) % 2 to 28
        if ((p >= cdist(l-1)) && (p <= cdist(l)))
            fprintf(fileID,'%s\n',num2str(l-1));
            q=rand;
            for m=2:cols % 2 to 28
                if ((q >= CTRANS(l,m-1)) && (q <= CTRANS(l,m)))
                    fprintf(fileID,'%s\n',num2str(m-1));
                end
            end
        end
    end
end
fclose(fileID);
I am struggling with the second-order case. I can provide more details if required. The input data from which I extract the statistics is English text of around 4000 characters. I have removed punctuation etc. and converted capital letters to lower case, so there are now 27 symbols, where 1 represents 'a' through 26 for 'z', and 27 represents the space. I have also created the bigram distribution vector for the second-order case.
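One possible way to extend the sampler to second order is sketched below, under assumptions that are not stated in the question. The second-order transition matrix (called TRANS2 here, a hypothetical name) is assumed to have one row per ordered pair of previous symbols, with the pair (a, b) stored in row (a-1)*K + b, and the bigram vector (pairDist, also hypothetical) is assumed to use the same ordering. Toy random data with K = 3 symbols stands in for the real 27-symbol statistics.

```matlab
% Sketch: sample a second-order Markov chain over K symbols.
% Assumption: row (a-1)*K + b of TRANS2 holds P(next | previous two = a, b).
K = 3;                                        % toy alphabet (27 in the question)
TRANS2 = rand(K^2, K);                        % toy transition weights
TRANS2 = bsxfun(@rdivide, TRANS2, sum(TRANS2, 2));  % make each row sum to 1
pairDist = ones(K^2, 1) / K^2;                % toy bigram distribution

nSteps = 20;
seq = zeros(1, nSteps);

% Draw the first two symbols jointly from the bigram distribution.
cdfPair = cumsum(pairDist);
cdfPair(end) = 1;                             % guard against round-off
startIdx = find(rand <= cdfPair, 1, 'first');
seq(1) = floor((startIdx - 1) / K) + 1;       % first symbol of the pair
seq(2) = mod(startIdx - 1, K) + 1;            % second symbol of the pair

% Each new symbol depends on the two symbols before it.
for t = 3:nSteps
    row = (seq(t-2) - 1) * K + seq(t-1);
    cdfRow = cumsum(TRANS2(row, :));
    cdfRow(end) = 1;                          % guard against round-off
    seq(t) = find(rand <= cdfRow, 1, 'first');
end
disp(seq);
```

Unlike the first-order snippet, this keeps one running sequence, so every draw is conditioned on the two symbols actually generated before it.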

Related

'Find' function working incorrectly, have tried floating point accuracy resolution

I have vertically concatenated files from my directory into a matrix that is about 60000 x 15 in size (verified).
d=dir('*.log');
n=length(d);
data=[];
for k=1:n
    data{k}=importdata(d(k).name);
end
total=[];
for k=1:n
    total=[total;data{k}]; % append file k (data{n} would append the last file n times)
end
I am using the following 32-iteration loop and the 'Find' function to locate row numbers where the final column is an integer corresponding to the integer iteration of the loop:
for i=1:32
    v=[];
    vn=[];
    [v,vn]=find(abs(fix(i)-fix(total))<eps);
    g=length(v)
end
I have tried to account for the floating point accuracy by using 'fix' on values of 'i' and values from matrix 'total', in addition to taking their absolute difference and checking it to be less than a tolerance of 'eps' (floating-point relative accuracy function), up to a tolerance of .99.
The 'Find' function is not working correctly. It is only working for certain integers (although it should be locating all of them (1-32)), and for the integers it does find the values are incomplete.
What is the problem here? If 'Find' is inadequate for this purpose, what is a suitable alternative?
You are getting a lot of zeros because you are looking at the entire data matrix, not just its 15th column, so many of the values being compared are non-integers.
Also, you're using fix on both numbers. Floating-point errors can leave a value slightly above or slightly below the desired integer, and fix truncates the ones slightly below to the next lower integer rather than the one you expect. You should use round to round to the nearest integer instead.
Rather than using find to do this, I would use simple boolean logic to determine the value of the last column:
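To make the difference between fix and round concrete, here is a minimal sketch with a made-up value (x is an assumption, chosen to sit just below an integer as floating-point error might leave it):

```matlab
% A value that was meant to be 2 but is stored slightly below it.
x = 2 - 1e-12;
truncated = fix(x);     % fix drops the fractional part (rounds toward zero)
nearest = round(x);     % round goes to the nearest integer
fprintf('fix: %d, round: %d\n', truncated, nearest);
```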
for k = 1:32
    % Compare column 15 to the current index
    matches = abs(total(:,end) - k) < eps;
    % Do stuff with these matches
    g = sum(matches); % Count the matches
end
Depending on what you want to actually do with the data, you may be able to use the last column as an input to accumarray to perform an operation on each group.
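As a rough illustration of the accumarray idea, the sketch below counts how many rows fall in each of the 32 groups. The vector lastCol is made-up stand-in data for total(:,end), and rounding first is an assumption about how near-integer values should be treated. Note that eps is the spacing of doubles near 1, so a test such as abs(x - k) < eps can be stricter than intended for values near 32.

```matlab
% Toy stand-in for total(:,end): near-integer group labels.
lastCol = [1.0000000001; 2; 2; 31.9999999999; 32; 7];

% Round to the nearest integer so tiny floating-point errors are ignored.
labels = round(lastCol);

% Count rows per label; the size argument forces one bin for each of 1..32.
counts = accumarray(labels, 1, [32 1]);

fprintf('rows with label 2: %d\n', counts(2));
```

Replacing the constant 1 with a data column and passing a function handle such as @mean as the fourth argument would apply that function to each group instead of counting.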
As a side note, you can replace the first chunk of code with
d = dir('*.log');
data = cellfun(@importdata, {d.name}, 'UniformOutput', false);
total = cat(1, data{:});

Are the conditional probabilities of MATLAB's mnrval correct?

An MWE (stats toolbox required, tested on MATLAB R2014b):
x = (1:3)';
b = mnrfit(x,x,'model','hierarchical');
pihat = mnrval(b,x,'model','hierarchical','type','conditional')
Output:
pihat =
1 1
2.2204e-16 1
2.2204e-16 2.2204e-16
(Ignore the issued warning, it's because of the trivial example, which is linearly separable (I'm predicting x using itself). It doesn't matter: I've tried this as well with a non-trivial (and not-so minimal) example without the warning and the results are similar.)
My problem is the result. I've specified I want the conditional probabilities. According to MATLAB's documentation on mnrval:
Specify ['conditional'] to return predictions [...] in terms of the first k – 1 conditional category probabilities [...], i.e., the probability [...] for category j, given an outcome in category j or higher.
In my example this means rows of pihat contain the probability of
x=1 given x>=1
x=2 given x>=2
(A third column for x=3 is not necessary, because if the first two probabilities are known, the third is too. It follows logically from P(x=1) + P(x=2) + P(x=3) = 1.)
Am I interpreting this correctly? Thus, if x=1 is predicted, then the first column value should be large (close to one), because P(x=1) given x>=1 is large. The second column should be close to zero, because P(x=2) given x>=2 can't be large if x=1.
However, as you can see in the first row, the second column value is large as well as the first! I believe this is incorrect according to what the documentation specifies, am I right? The current (incorrect?) result implies the predicted probabilities in the rows are not of x=j given x>=j, but what are they then? Or how should I be interpreting them?
They are not equal to the cumulative probabilities, i.e. the probability of x<=j, which increases with j. I've checked this by calculating pihat2 = mnrval(b,x,'model','hierarchical','type','cumulative'); pihat2-pihat.
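For reference, the quoted definition can be worked through by hand without the toolbox. The sketch below uses made-up category probabilities (the vector p is an assumption, not output of mnrfit or mnrval). It suggests that P(x=2 given x>=2) is normalised only by the probability of x>=2, so it need not be small just because category 1 is the most likely outcome.

```matlab
% Made-up category probabilities for one observation: P(x=1), P(x=2), P(x=3).
p = [0.90 0.08 0.02];

% tailMass(j) = P(x >= j), computed as a reversed cumulative sum.
tailMass = fliplr(cumsum(fliplr(p)));

% Conditional probabilities P(x=j | x>=j) for the first k-1 categories.
condProb = p(1:end-1) ./ tailMass(1:end-1);

disp(condProb);
```

When P(x>=2) is itself numerically zero, as in a perfectly separable example, this ratio is not well defined, which may be part of why the second column looks surprising there.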

Retrieve a specific permutation without storing all possible permutations in Matlab

I am working on 2D rectangular packing. I want to minimize the length of the infinite sheet (the width is constant) by changing the order in which parts are placed. For example, we could place 11 parts in 11! ways.
I could label those parts, save all possible permutations using the perms function and run them one by one, but that needs a large amount of memory even for 11 parts. I'd like to be able to do it for around 1000 parts.
Luckily, I don't need every possible sequence. I would like to map each permutation to an index number, test some random sequences, and then use a GA to converge on the optimal sequence.
Therefore, I need a function which returns the same specific permutation for a given index every time it is run, unlike the randperm function.
For example, function(5,6) should always return, say, [1 4 3 2 5 6] for 6 parts. I don't need the sequences in a specific order, but the function should give the same sequence for the same index, and a different sequence for any other index.
So far, I have used the randperm function to generate random sequences for around 2000 iterations and picked the best sequence by comparing lengths, but this works only for a small number of parts. Also, randperm may return repeated sequences instead of unique ones.
Here's a picture of what I have done.
I can't save the outputs of randperm because I won't have a searchable function space. I don't want to find the length of the sheet for all sequences. I only need to do it for certain sequences, identified by indices determined by the genetic algorithm. If I use randperm, I won't have the sequence for all indexes (even though I only need some of them).
For example, take some function, 'y = f(x)', on the range [0,10], say. For each value of x, I get a y. Here y is my sheet length and x is the index of the permutation. For any x, I find its sequence (the specific permutation) and then its corresponding sheet length. Based on the results for some random values of x, the GA will generate a new list of x values to find a more optimal y.
I need a function that duplicates perms without actually storing the values. (I guess perms follows the same order of permutations each time it is run, because perms(1:4) yields the same results however many times it is run.)
Is there a way to write the function? If not, then how do I solve my problem?
Edit (how I approached the problem):
In a genetic algorithm, you need to cross over parents (permutations). But if you cross over permutations directly, you can get repeated numbers. For example, crossing over 1 2 3 4 with 3 2 1 4 may result in something like 3 2 3 4. To avoid repetition, I thought of indexing each parent by a number, converting the number to binary form, crossing over the binary indices to get a new binary number, then converting it back to decimal and finding its specific permutation. Later on, I discovered I could just use ordered crossover on the permutations themselves instead of crossing over their indices.
More details on Ordered Crossover could be found here
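For readers unfamiliar with the idea, here is a minimal sketch of a simplified order-preserving crossover. It is not necessarily the exact variant used above: the function name ordered_crossover and the cut points i and j are assumptions for illustration, and the classic operator fills the remaining slots starting after the second cut point, whereas this version fills them left to right.

```matlab
function child = ordered_crossover(p1, p2, i, j)
% Simplified order crossover: copy the slice p1(i:j) into the child, then
% fill the remaining positions with the other elements in the order in
% which they appear in p2. Assumes p1 and p2 are permutations of one set.
m = numel(p1);
child = zeros(1, m);
child(i:j) = p1(i:j);                    % inherited slice from parent 1
rest = p2(~ismember(p2, p1(i:j)));       % parent 2 without the slice values
child([1:i-1, j+1:m]) = rest;            % fill the gaps left to right
end
```

For example, ordered_crossover([1 2 3 4 5], [5 4 3 2 1], 2, 3) keeps the slice [2 3] in place and fills the other slots with 5, 4, 1 in that order, so no value is repeated.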
Below are two functions that together generate permutations in lexicographical order and return the nth permutation.
For example, I can call
nth_permutation(5, [1 2 3 4])
And the output will be [1 4 2 3]
Intuitively, the running time of this method is linear in n; the size of the set matters much less. I benchmarked nth_permutation(n, 1:1000) averaged over 100 iterations and got the following graph
So timewise it seems okay.
function [permutation] = nth_permutation(n, set)
%%NTH_PERMUTATION Generates n permutations of set in lexicographical order and
%%outputs the last one
%% set is a 1 by m matrix
    set = sort(set);
    permutation = set; %First permutation
    for ii=2:n
        permutation = next_permute(permutation);
    end
end
function [p] = next_permute(p)
    %Following algorithm from https://en.wikipedia.org/wiki/Permutation#Generation_in_lexicographic_order
    %Find the largest index k such that p[k] < p[k+1]
    larger = p(1:end-1) < p(2:end);
    k = max(find(larger));
    %If no such index exists, the permutation is the last permutation.
    if isempty(k)
        display('Last permutation reached');
        return
    end
    %Find the largest index l greater than k such that p[k] < p[l].
    larger = [false(1, k) p(k+1:end) > p(k)];
    l = max(find(larger));
    %Swap the value of p[k] with that of p[l].
    p([k, l]) = p([l, k]);
    %Reverse the sequence from p[k + 1] up to and including the final element p[n].
    p(k+1:end) = p(end:-1:k+1);
end
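If stepping through n permutations one at a time becomes too slow for large indices, an alternative is to decode the index directly using the factorial number system. The sketch below is not part of the answer above: nth_permutation_direct is a hypothetical name, and it assumes 1 <= n <= factorial(m) with m small enough that the factorials are exact in double precision (roughly m <= 18), so it would not reach 1000 parts without big-integer arithmetic.

```matlab
function perm = nth_permutation_direct(n, set)
% Return the n-th permutation (1-based, lexicographic order) of set
% without generating the earlier ones.
set = sort(set);
m = numel(set);
idx = n - 1;                  % zero-based rank of the wanted permutation
perm = zeros(1, m);
for pos = 1:m
    f = factorial(m - pos);   % permutations sharing a fixed element here
    d = floor(idx / f);       % which remaining element goes at this position
    idx = idx - d * f;        % rank within the remaining elements
    perm(pos) = set(d + 1);
    set(d + 1) = [];          % that element is now used up
end
end
```

With this ordering, nth_permutation_direct(5, [1 2 3 4]) gives the same [1 4 2 3] as the stepping version.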

How to generate random matlab vector with these constraints

I'm having trouble creating a random vector V in Matlab subject to the following set of constraints: (given parameters N,D, L, and theta)
1. The vector V must be N units long
2. The elements must have an average of theta
3. No 2 successive elements may differ by more than +/-10
4. D == sum(L*cosd(V-theta))
I'm having the most problems with the last one. Any ideas?
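Whatever generation method is chosen, it can help to have a checker for the four constraints. The following is a minimal sketch: the function name check_constraints and the tolerance argument tol are assumptions, added because exact floating-point equality for the mean and for D is unlikely to hold.

```matlab
function ok = check_constraints(V, N, D, L, theta, tol)
% Return true if V satisfies the four constraints within tolerance tol.
lengthOk = numel(V) == N;                                  % constraint 1
meanOk   = abs(mean(V) - theta) <= tol;                    % constraint 2
stepOk   = all(abs(diff(V)) <= 10);                        % constraint 3
distOk   = abs(D - sum(L * cosd(V - theta))) <= tol;       % constraint 4
ok = lengthOk && meanOk && stepOk && distOk;
end
```

For instance, a constant vector V = theta*ones(1, N) passes only when D equals N*L, which shows that the fourth constraint forces some spread around theta whenever D is smaller than that.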
Edit
Solutions in other languages or equation form are equally acceptable. Matlab is just a convenient prototyping tool for me, but the final algorithm will be in java.
Edit
From the comments and initial answers I want to add some clarifications and initial thoughts.
I am not seeking a 'truly random' solution from any standard distribution. I want a pseudo randomly generated sequence of values that satisfy the constraints given a parameter set.
The system I'm trying to approximate is a chain of N links of link length L where the end of the chain is D away from the other end in the direction of theta.
My initial insight here is that theta can be removed from consideration until the end, since (2) in essence adds theta to every element of a 0 mean vector V (shifting the mean to theta) and (4) simply removes that mean again. So, if you can find a solution for theta=0, the problem is solved for all theta.
As requested, here is a reasonable range of parameters (not hard constraints, but typical values):
5<N<200
3<D<150
L==1
0 < theta < 360
I would start by creating a "valid" vector. That should be possible, for example by calculating a single value that every entry takes.
Once you have that vector, I would apply some transformations to "shuffle" it. "Rejection sampling" is the keyword: if a shuffle would violate one of your rules, you just don't do it.
The transformations I can come up with are:
switch two entries
modify the value of one entry and modify a second one to keep the 4th condition (theoretically you could just shuffle two until the condition is fulfilled, but the chance of that happening is quite low)
But maybe you can find some more.
Do this reasonably often and you get a "valid" random vector. Theoretically you should be able to reach all valid vectors; in practice you could try to construct several "start" vectors so it won't take that long.
Here's a way of doing it. It is clear that not all combinations of theta, N, L and D are valid. It is also clear that you're trying to simulate random objects that are quite complex. You will probably have a hard time showing anything useful with respect to these vectors.
The series you're trying to simulate seems similar to the Wiener process. So I started with that, you can start with anything that is random yet reasonable. I then use that as a starting point for an optimization that tries to satisfy 2,3 and 4. The closer your initial value to a valid vector (satisfying all your conditions) the better the convergence.
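function series = generate_series(D, L, N, theta)
    % Random-walk initial guess, starting at theta
    s = zeros(1, N);
    s(1) = theta;
    for i = 2:N
        s(i) = s(i-1) + randn(1,1);
    end
    f = @(x) objective(x, D, L, N, theta);
    q = optimset('Display','iter','TolFun',1e-10,'MaxFunEvals',Inf,'MaxIter',Inf);
    [sf, val] = fminunc(f, s, q);
    val
    series = sf;
end
function value = objective(s, D, L, N, theta)
    a = abs(mean(s) - theta);               % constraint 2: mean equals theta
    b = abs(D - sum(L*cosd(s - theta)));    % constraint 4 (cosd, since the question works in degrees)
    c = 0;                                  % constraint 3: penalise steps larger than 10
    for i = 2:N
        u = abs(s(i) - s(i-1));
        if u > 10
            c = c + u;
        end
    end
    value = a^2 + b^2 + c^2;
end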
It seems like you're trying to simulate something very complex/strange (a path of a given curvature?), see questions by other commenters. Still you will have to use your domain knowledge to connect D and L with a reasonable mu and sigma for the Wiener to act as initialization.
So based on your new requirements, it seems like what you're actually looking for is an ordered list of random angles, with a maximum change in angle of 10 degrees (which I first convert to radians), such that the distance and direction from start to end and link length and number of links are specified?
Simulate an initial guess. It will not hold with the D and theta constraints (i.e. specified D and specified theta)
angles = zeros(N, 1);
for link = 2:N
    % each angle is the previous angle plus a small random change
    angles(link) = angles(link - 1) + (rand() - 0.5)*(10*pi/180);
end
Use genetic algorithm (or another optimization) to adjust the angles based on the following cost function:
dx = sum(L*cos(angles));
dy = sum(L*sin(angles));
D = sqrt(dx^2 + dy^2);
theta = atan2(dy, dx);
The cost is now just the difference between the vector given by my D and theta above and the vector given by the specified D and theta (i.e. the inputs).
You will still have to enforce the max change of 10 degrees rule, perhaps that should just make the cost function enormous if it is violated? Perhaps there is a cleaner way to specify sequence constraints in optimization algorithms (I don't know how).
I feel like if you can find the right optimization with the right parameters this should be able to simulate your problem.
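The cost described above, including a large penalty for violating the maximum change between links, could be sketched as follows. The function name chain_cost, the penalty weight of 1e3 and the parameter names are assumptions for illustration; angles and maxStep are in radians.

```matlab
function c = chain_cost(angles, L, Dtarget, thetaTarget, maxStep)
% Cost of a candidate chain: distance between the achieved end point and
% the requested one, plus a penalty for steps larger than maxStep.
dx = sum(L * cos(angles));
dy = sum(L * sin(angles));
target = Dtarget * [cos(thetaTarget), sin(thetaTarget)];
c = norm([dx, dy] - target);                    % end-point mismatch
excess = max(abs(diff(angles)) - maxStep, 0);   % amount over the step limit
c = c + 1e3 * sum(excess);                      % assumed penalty weight
end
```

A handle such as @(a) chain_cost(a, L, D, theta, 10*pi/180) could then be passed to an optimizer. Comparing end points as vectors avoids the wrap-around problem that comes with comparing two angles directly.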
You don't give us a lot of detail to work with, so I'll assume the following:
random numbers are to be drawn from [-127+theta +127-theta]
all random numbers will be drawn from a uniform distribution
all random numbers will be of type int8
Then, for the first 3 requirements, you can use this:
N = 1e4;
theta = 40;
diffVal = 10;
g = @() randi([intmin('int8')+theta intmax('int8')-theta], 'int8') + theta;
V = [g(); zeros(N-1,1, 'int8')];
for ii = 2:N
    V(ii) = g();
    while abs(V(ii)-V(ii-1)) >= diffVal
        V(ii) = g();
    end
end
Inline the anonymous function for more speed.
Now, the last requirement,
D == sum(L*cos(V-theta))
is a bit of a strange one... cos(V-theta) is a specific way to re-scale the data to the [-1 +1] interval, which the multiplication with L will then scale to [-L +L]. At first sight, you'd expect the sum to average out to 0.
However, the expected value of cos(x) is 2/pi only when x is a random variable from a uniform distribution over a half period such as [-pi/2 +pi/2] (see here for example); over a full period [0 2*pi] it is 0. Ignoring for the moment the fact that our limits are different, the expected value of sum(L*cos(V-theta)) would in either case reduce to a constant: 2*N*L/pi for the half period, 0 for the full period.
How you can force this to equal some other constant D is beyond me...can you perhaps elaborate on that a bit more?
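The averages involved can be checked numerically. This is only a sketch on an evenly spaced grid (nPts is an arbitrary choice): it suggests that the mean of cos over a full period is close to 0, while the value 2/pi appears when the range is the half period [-pi/2, pi/2].

```matlab
nPts = 1e6;   % number of grid points (arbitrary)

% Mean of cos over a full period.
fullPeriodMean = mean(cos(linspace(0, 2*pi, nPts)));

% Mean of cos over the half period where it is non-negative.
halfPeriodMean = mean(cos(linspace(-pi/2, pi/2, nPts)));

fprintf('full period: %.4f, half period: %.4f, 2/pi: %.4f\n', ...
    fullPeriodMean, halfPeriodMean, 2/pi);
```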

MATLAB Tree Construction

Now, I have to separate out any pair that is common to the two input files, and find the mean for that pair like this: (correlation in first text file) x (correlation in second text file) / ((correlation in first text file) + (correlation in second text file)). Again, store these in a separate matrix.
Building a tree:
Now, out of all the elements in both the input files, select the 10 most frequent ones. Each of these forms the root of a separate K-tree. The algorithm goes like this: for the word at the root level, check all its harmonic mean values with the other tags in the matrix developed in the previous step. Select the two highest harmonic means, and put the other word of each tag pair as a child node of the root.
Can someone please guide me through the MATLAB steps of going through this? Thank you for your time.
Okay, so start by putting the data in a useful format; maybe count the number of distinct words, and make an N-by-M matrix of binary values (I'll call this data1). Each of the N rows will describe the words associated with a single image. Each of the M columns will describe the images for which a single word is tagged. Therefore, the value at (n, m) is 0 if tag m is not in image n, and 1 if it is.
From this matrix, to find correlation between all pairs of words, you could do:
correlations1 = zeros(M, M);
for i=1:M
    for j=1:M
        correlations1(i, j) = corr(data1(:, i), data1(:, j));
    end
end
Now the matrix correlations1 tells you the correlation between tags. Do the same for the other text file (giving correlations2). You can make a matrix of harmonic means with:
h_means = correlations1.*correlations2./(correlations1+correlations2);
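As an aside, the pairwise loop above can probably be replaced by a single call to corrcoef from base MATLAB, which treats each column as a variable and returns the full M-by-M correlation matrix. The sketch uses a made-up 4-image, 3-tag matrix (toyData is an assumption); a column that is constant would produce NaN entries.

```matlab
% Toy binary tag matrix: rows are images, columns are tags.
toyData = [1 0 1;
           0 1 1;
           1 0 0;
           0 1 0];

% Correlation between every pair of columns in one call.
toyCorr = corrcoef(toyData);

disp(toyCorr);
```

Here tags 1 and 2 never occur together, so their correlation is -1, while tag 3 is uncorrelated with both.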
You can find the 30 most frequent tags by counting the number of 1s in each column of the data matrix. Since we want to find the most common tags in both files, we'll add the data matrices first:
[~, tag_ranks] = sort(sum(data1 + data2, 1), 'descend'); %get the indices in sorted order
top_tags = tag_ranks(1:30);
For the tree building at the end, you will either want to create a tree class (see classdef), or store the tree in an array. To find the top two highest harmonic means, you will want to look in the h_means matrix; for a tag m1, we can do:
[~, tag_ranks] = sort(h_means(m1, :), 'descend');
top_tag = tag_ranks(1);
second_tag = tag_ranks(2);
You will then need to insert these tags into the tree and repeat.
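One detail worth handling when picking the two children is that a tag's own entry in its row may rank highest. The sketch below uses a made-up 4-tag matrix (toyMeans and rootTag are assumptions) and masks out the root before sorting.

```matlab
% Toy symmetric matrix of pairwise harmonic-mean scores for 4 tags.
toyMeans = [0.50 0.20 0.40 0.10;
            0.20 0.50 0.30 0.05;
            0.40 0.30 0.50 0.25;
            0.10 0.05 0.25 0.50];

rootTag = 1;                          % tag at the root of this tree
scores = toyMeans(rootTag, :);
scores(rootTag) = -Inf;               % never pick the root as its own child

[~, order] = sort(scores, 'descend');
children = order(1:2);                % the two best-scoring other tags

disp(children);
```

Storing the result as a parent array (parent(children) = rootTag) is one simple way to hold the tree without writing a class.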