Fitting code-chunks to a probability distribution function on two variables - matlab

I am sorry for this complicated problem, but I will try my best to explain it.
This is basically a Hidden Markov Model question. I have two columns of data. The data in the two columns are independent of each other; together, however, they represent a specific movement that can be coded as a character. I assign a character in a third column by putting conditions on the column1 and column2 entries. Note: the set of characters is finite (~10-15).
For example:
if (column1(i) > 0.5) && (column2(i) < 15)
    column3(i) = 'D';
end
I end up with a string something like this
AAAAADDDDDCCCCCFFFFAAAACCCCCFFFFFFDDD
So each character gets repeated, but the runs are not of constant length (e.g. the first time, A appears 5 times, while the second time it appears only 4 times).
Now take the first chunk of A's (AAAAA), where each A corresponds to an ordered pair of column1 and column2 values. The column1 and column2 values in the second chunk of A's (AAAA) should be similar to those in the first chunk. Usually, the values in each column are increasing, decreasing, or constant throughout a chunk. For example, column1 goes from -1 to -5 in 5 unequal steps in the first chunk, but from -1.2 to -5.1 in 4 unequal steps in the second.
What I want is a fitting of a probability distribution over column1 and column2 values (independently) for each set of repeated characters (e.g. for A's and then D's then C's then F's and then again A's).
The final goal is the following:
given n elements in column1, column2 and column3, I want to predict what the (n+1)th element in column 3 is going to be and how many times it is going to repeat itself (with probabilities, e.g. a 70% chance that it repeats 4 times and a 20% chance that it repeats 5 times). I also want the probability distribution of column1 and column2 for the predicted character.
Please feel free to ask questions if I fail to explain it well.
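A possible first step toward the prediction described above, sketched in Python rather than MATLAB: collapse the code string into runs, then count which character follows which and how long each character's runs last. This is plain frequency counting under the assumption that the next character depends only on the current one; it is not a full HMM fit, and the function names are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import groupby


def run_lengths(codes):
    """Collapse a code string into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(codes)]


def fit_counts(runs):
    """Count which character follows which, and how long each character's runs last."""
    next_char = defaultdict(Counter)
    durations = defaultdict(Counter)
    for ch, length in runs:
        durations[ch][length] += 1
    for (current, _), (following, _) in zip(runs, runs[1:]):
        next_char[current][following] += 1
    return next_char, durations


def normalise(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


codes = "AAAAADDDDDCCCCCFFFFAAAACCCCCFFFFFFDDD"
runs = run_lengths(codes)
next_char, durations = fit_counts(runs)

print(runs)
print(normalise(next_char["A"]))   # estimated P(next character | current run is A)
print(normalise(durations["A"]))   # estimated P(run length | character is A)
```

The last run in the string may be cut off by the end of the data, so its length is probably best excluded from the duration counts. The same run boundaries could be used to group the column1 and column2 values per character before fitting a distribution to each group.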

D trigger scheme test

I want to test my D trigger scheme and I don't know how to do that. I don't even know how a "truth table" works. Can someone explain how to write a test for my D trigger scheme?
A logical table works as follows:
Each row represents a case and each column represents a statement. A cell holds the logical value of a statement in a given case. It is wise to mentally distinguish atomic columns from molecular columns. In your case, you have three atomic columns, which define the cases. These are the first three columns in your logical table. The other columns are molecular columns, that is, their values are composed from other columns.
D is said to be the value of XOR-ing the third and the fifth column. If you look at the values of the third and the fifth column in each row of the table, you will see that D4 is true (1) if they differ and false (0) if they are the same.
I am not sure what you mean by testing the D trigger scheme, but if you mean that we should test whether the formula matches the values, then yes, it matches, since the general definition of XOR holds in every row of your logical table.

How to implement data I have to svmtrain() function in MATLAB?

I have to write a script using MATLAB which will classify my data.
My data consists of 1051 web pages (rows) and 11000+ words (columns). I am holding the word occurrences for each page in a matrix. The first 230 rows are about a computer science course (to be labeled with +1) and the remaining 821 are not (to be labeled with -1). I am going to label a small part of these rows (say 30 rows) myself. Then the SVM will label the remaining unlabeled rows.
I have found that I could solve my problem using MATLAB's svmtrain() and svmclassify() methods. First I need to create SVMStruct.
SVMStruct = svmtrain(Training,Group)
Then I need to use
Group = svmclassify(SVMStruct,Sample)
But I do not know what Training and Group are. For Group, MathWorks says:
Grouping variable, which can be a categorical, numeric, or logical
vector, a cell vector of strings, or a character matrix with each row
representing a class label. Each element of Group specifies the group
of the corresponding row of Training. Group should divide Training
into two groups. Group has the same number of elements as there are
rows in Training. svmtrain treats each NaN, empty string, or
'undefined' in Group as a missing value, and ignores the corresponding
row of Training.
And for Training it is said that:
Matrix of training data, where each row corresponds to an observation
or replicate, and each column corresponds to a feature or variable.
svmtrain treats NaNs or empty strings in Training as missing values
and ignores the corresponding rows of Group.
I want to know how I can adapt my data to Training and Group. I need (at least) a small code sample.
EDIT
What I did not understand is that in order to have SVMStruct I have to run
SVMStruct = svmtrain(Training, Group);
and in order to have Group I have to run
Group = svmclassify(SVMStruct,Sample);
Also I still did not get what Sample should be like?
I am confused.
Training would be a matrix with 1051 rows (the webpages/training instances) and 11000 columns (the features/words). I'm assuming you want to test for the existence of each word on a webpage? In this case you could make the entry of the matrix a 1 if the word exists for a given webpage and a 0 if not.
You could initialize the matrix with Training = zeros(1051,11000); but filling the entries would be up to you, presumably done with some other code you've written.
Group is a 1-D column vector with one entry for every training instance (webpage) that tells you which of the two classes the webpage belongs to. In your case you would make the first 230 entries a "+1" for computer science and the remaining 821 entries a "-1" for not.
Group = zeros(1051,1); % gives you a matrix of zeros with 1051 rows and 1 column
Group(1:230) = 1; % set first 230 entries to +1
Group(231:end) = -1; % set the rest to -1
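To address the confusion about Sample in the question's EDIT: Training and Group cover only the rows that were labeled by hand, while Sample holds the remaining rows (with the same columns) whose labels are to be predicted. The bookkeeping can be sketched in Python; no SVM is trained here, and the choice of which 30 pages are hand-labeled is a made-up assumption.

```python
n_pages = 1051
labels = [1] * 230 + [-1] * 821          # +1 = course page, -1 = other

# Hypothetical choice: hand-label 15 pages from each class (30 in total).
labelled = list(range(0, 15)) + list(range(230, 245))
labelled_set = set(labelled)
unlabelled = [i for i in range(n_pages) if i not in labelled_set]

training_rows = labelled                 # rows of the word matrix passed as Training
group = [labels[i] for i in labelled]    # one label per Training row, passed as Group
sample_rows = unlabelled                 # rows passed as Sample; the classifier returns their labels

print(len(training_rows), len(group), len(sample_rows))
```

Under this split, the Group returned by the classification step is a new vector of predicted labels for the Sample rows, distinct from the hand-made Group used for training.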

How to make volume range calculation more RAM friendly?

I am trying to find the mean, median and percentile ranges of price movements for a given volume to be filled, using trade data. The code is attached below. The problem is that it gives me a wsfull error when I run it on ~80k records. I am using a Linux box with 4 GB of RAM. At the moment I can only run it for ~30k records, and even then q uses >70% of my RAM.
Is there any way to make it more memory friendly?
rangeForVol : {[symIn; vol; dt]
data: select from table where sym=symIn, date=dt;
data: update cumVol: sums quantity, cVol: sums quantity from data;
data: update cumVolTgt: cumVol + vol from data;
data: update pxLst: price[where each ((cumVol>=/:cVol) and (cumVol<=/:cumVolTgt))=1] from data;
.Q.gc[];
data: update minPx: min each pxLst, maxPx: max each pxLst from data;
data: update range: maxPx - minPx from data;
data
};
select count i by floor range%0.5 from rangeForVol[`ABC; 2500; 2012.06.04]
The code you quote above almost certainly does not do what you were trying to achieve.
The columns cumVol and cVol are identical (both contain a running total of that day's volume). Later you calculate cumVol>=/:cVol. /: means that each element of cVol is compared to the entire vector cumVol. As they are identical, you will get the identity matrix (plus some extra 1b for any non-distinct values).
q)(til 4)=\:til 4
1000b
0100b
0010b
0001b
It seems you wanted to perform an element-wise comparison between the two vectors (though comparing a vector to itself also doesn't make sense). If you want to do this explicitly, each-both would be the correct adverb (='). However, in q, the = operator implicitly applies item-wise to two vectors of the same length (or to a vector and a scalar, as is happening in your each-right example), making any adverb unnecessary.
The fact you are creating two n x n matrices when you probably intended a length n vector is probably the reason you're running out of memory.
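The difference between the two shapes can be sketched in Python (not q): comparing each element against a whole vector yields an n x n table, while an item-wise comparison yields n values. The memory figure assumes one byte per boolean, which is an assumption about storage rather than a measurement.

```python
def each_left_equal(x, y):
    """Compare each element of x with the whole of y: one row per element of x."""
    return [[a == b for b in y] for a in x]


def itemwise_equal(x, y):
    """Compare x and y position by position: one result per position."""
    return [a == b for a, b in zip(x, y)]


v = list(range(4))
matrix = each_left_equal(v, v)   # 4 rows of 4 booleans, true only on the diagonal
vector = itemwise_equal(v, v)    # 4 booleans, all true

n = 80_000
bytes_per_matrix = n * n         # assuming one byte per boolean

print(len(matrix), len(matrix[0]), len(vector))
print(bytes_per_matrix / 1e9, "GB per n x n boolean matrix")
```

At 80k records a single such matrix would already exceed the 4 GB available, before the second matrix and the combined result are built.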

Preserving matrix columns using Matlab brush/select data tool

I'm working with matrices in Matlab which have five columns and several million rows. I'm interested in picking particular groups of this data. Currently I'm doing this using plot3() and the brush/select data tool.
I plot the first three columns of the matrix as X,Y, Z and highlight the matrix region I'm interested in. I then use the brush/select tool's "Create variable" tool to export that region as a new matrix.
The problem is that when I do that, the remaining two columns of the original, bigger matrix are dropped. I understand why- they weren't plotted and hence the figure tool doesn't know about them. I need all five columns of that subregion though in order to continue the processing pipeline.
I'm adding the appropriate 4th and 5th column values to the exported matrix using a horrible nested if loop approach- if columns 1, 2 and 3 match in both the original and exported matrix, attach columns 4/5 of the original matrix to the exported one. It's bad design and agonizingly slow. I know there has to be a Matlab function/trick for this- can anyone help?
Thanks!
This might help:
1. I start with matrix 1 with columns X,Y,Z,A,B
2. Using the brush/select tool, I create a new (subregion) matrix 2 with columns X,Y,Z
3. I then loop through all members of matrix 2 against all members of matrix 1. If X,Y,Z match for a pair of rows, I append A and B from that row in matrix 1 to the appropriate row in matrix 2.
4. I become very sad as this takes forever and shows my ignorance of Matlab.
If I understand your situation correctly, here is a simple way to do it:
Assume you have a matrix like so: M = [A B C D E], where each letter is an N-by-1 vector.
You select a range, this part is not really clear to me, but suppose you can create the following:
idxA,idxB and idxC, that are 1 if they are in the region and 0 otherwise.
Then you can simply use:
M(idxA&idxB&idxC,:)
and you will get the additional two columns as well.
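The row-mask idea can be sketched in Python with a made-up box-shaped region standing in for the brushed selection. The second half shows a different technique, a lookup keyed on (X, Y, Z), which replaces the nested loop when only the exported three-column matrix is available; it assumes the exported values are exact copies of the originals and that no two rows share the same X, Y, Z.

```python
# Each row is (X, Y, Z, A, B); the values are invented for illustration.
rows = [
    (0.0, 0.0, 0.0, 10, 20),
    (1.0, 2.0, 3.0, 11, 21),
    (1.5, 2.5, 3.5, 12, 22),
    (9.0, 9.0, 9.0, 13, 23),
]

# Hypothetical region: an axis-aligned box given by its lower and upper corners.
lo, hi = (1.0, 2.0, 3.0), (2.0, 3.0, 4.0)

# One boolean per row, true when X, Y and Z all fall inside the box.
mask = [all(l <= v <= h for v, l, h in zip(row[:3], lo, hi)) for row in rows]
selected = [row for row, keep in zip(rows, mask) if keep]   # all five columns kept

# Alternative: map each exported (X, Y, Z) triple back to its full row.
exported = [(1.0, 2.0, 3.0), (1.5, 2.5, 3.5)]
lookup = {row[:3]: row for row in rows}
restored = [lookup[point] for point in exported]

print(selected)
print(restored)
```

The lookup is built once and each exported point is then resolved in constant time, instead of being compared against every original row.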

Extract a specific row from a combination matrix

Suppose I have 121 elements and want to get all combinations of 4 elements taken at a time, i.e. 121c4.
Since combnk(1:121, 4) takes a lot of time, I want to take only 2% of those combinations by providing:
z = 1:50:length(121c4(:, 1))
For example: the 1st row, 5th row, 100th row and so on, up to the end of 121c4, picking only those rows from the 121c4 matrix without generating the complete set of combinations (which consumes too many resources for large cases like 625c4).
If you haven't defined an ordering on the combinations, why not just use
randi(121,p,4)
where p is the number of combinations you want in your set? With this approach you may, or may not, want to replace duplicates.
If you have defined an ordering on the combinations, tell us what it is.
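If the ordering is lexicographic, a single row can be computed directly from its index without generating the others. The Python sketch below assumes that ordering (which may not match the row order combnk actually returns) and uses a made-up function name; it requires Python 3.8+ for math.comb.

```python
from itertools import islice
from math import comb


def nth_combination(n, k, index):
    """Return the index-th (0-based) k-element subset of 1..n in lexicographic order."""
    if not 0 <= index < comb(n, k):
        raise IndexError("combination index out of range")
    result = []
    candidate = 1
    while k > 0:
        # Number of remaining combinations whose next element is `candidate`.
        count = comb(n - candidate, k - 1)
        if index < count:
            result.append(candidate)
            k -= 1
        else:
            index -= count
        candidate += 1
    return result


# Every 50th combination of 4 elements out of 121, produced lazily.
every_50th = (nth_combination(121, 4, i) for i in range(0, comb(121, 4), 50))
for row in islice(every_50th, 3):
    print(row)
```

Each row costs a handful of binomial evaluations, so the work grows with the number of rows requested rather than with the total number of combinations.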