Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
UR F 12 U FT TEA MOTHER 1 11
GB M 22 R FT SER FATHER 5 15
GB M 12 R FT OTH FATHER 3 12
GB M 11 R PT POL FATHER 2 10
Where some variables are binary, some are categorical, some numerical. Would it be possible to extract from it a correlation matrix, with the correlation coefficients between the variables? I tried using both corrcoef and corrplot from the econometrics toolbox, but I come across errors such as 'observed data must be convertible to type double'.
Does anyone have a take on how this can be done? Thank you.

As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the Matlab help here, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double though, as it transforms categorical variables to arbitrary numbers, see the matlab forum.
Also note that you are introducing an order into your categorical variables if you simply transform them to numbers. For example, if you transform JOB 'TEA', 'SER', 'OTH' to 1, 2, 3, etc., you are making the variable ordinal: 'TEA' is then < 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
This returns a matrix of size nrows x numel(unique(T.SCHOOL)), i.e. one 0/1 column per category.
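The steps above can be put together roughly as follows. This is only a sketch: it rebuilds a few columns of the question's table by hand, assumes the Statistics and Machine Learning Toolbox is available for dummyvar, and the variable names D, X and R are made up for illustration.

```matlab
% Rebuild part of the question's table (4 rows, 5 of the 9 columns).
T = table({'UR';'GB';'GB';'GB'}, {'F';'M';'M';'M'}, ...
          [12;22;12;11], [1;5;3;2], [11;15;12;10], ...
          'VariableNames', {'SCHOOL','SEX','AGE','HEALTH','GRADE'});

% Text columns become categorical first.
T.SCHOOL = categorical(T.SCHOOL);
T.SEX    = categorical(T.SEX);

% Dummy coding: one 0/1 column per category (categories are sorted, so GB, UR).
D = dummyvar(T.SCHOOL);

% For a binary variable one dummy column is enough; SEX is coded via double().
X = [D(:,1), double(T.SEX), T.AGE, T.HEALTH, T.GRADE];

% Pairwise correlation coefficients between the columns of X.
R = corrcoef(X);
disp(R)
```

With only four rows the coefficients are of course not meaningful; the point is just that corrcoef accepts the data once every column is numeric.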
And then there is the whole discussion, whether it is useful to calculate correlations of categorical variables. Like here.
I hope this helps :)

I think you need to make all the data numeric, i.e. recode the non-numerical columns, for example like this:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
1 1 12 1 1 1 1 1 11
2 2 22 2 1 2 2 5 15
2 2 12 2 1 3 2 3 12
2 2 11 2 2 4 2 2 10
and then do the correlation.

Related

Matlab Dimensions Swapped in Meshgrid

After something that cost me several hours, I have found an inconsistency in how Matlab deals with dimensions. If somebody can explain it to me OR tell me how to report this to Matlab, please enlighten me.
For size, ones, zeros, mean, std, and almost every other old and common command in Matlab, the dimension arrangement is the classical one and follows the intended standard (as given by the size of each dimension): the first dimension runs along the column vector, the second dimension runs along the row vector, and the following ones are the non-graphical indexes.
>x(:,:,1)=[1 2 3 4;5 6 7 8];
>x(:,:,2)=[9 10 11 12;13 14 15 16];
>m=mean(x,1)
m(:,:,1) = 3 4 5 6
m(:,:,2) = 11 12 13 14
>m=mean(x,2)
m(:,:,1) =
2.5000
6.5000
m(:,:,2) =
10.5000
14.5000
>m=mean(x,3)
m = 5 6 7 8
9 10 11 12
>d=size(x)
d = 2 4 2
However, for graphical commands like stream3, streamline and others relying on the meshgrid output format, dimensions 1 and 2 are swapped: the first dimension is the row vector and the second dimension is the column vector, and the following (third) is the non-graphical index.
> [x,y]=meshgrid(1:2,1:3)
x = 1 2
1 2
1 2
y = 1 1
2 2
3 3
Then, for stream3 to operate with classically arranged matrices, we should use permute(XXX,[2 1 3]) at every 3D argument:
xyz=stream3(permute(x,[2 1 3]),permute(y,[2 1 3]),permute(z,[2 1 3])...
,permute(u,[2 1 3]),permute(v,[2 1 3]),permute(w,[2 1 3])...
,xs,ys,zs);
If anybody can explain why this happens, or can tell me why this is not a bug, I would welcome it.
This behavior is not a bug because it is clearly documented as the intended behavior: https://www.mathworks.com/help/matlab/ref/meshgrid.html. Specifically:
[X,Y,Z]= meshgrid(x,y,z) returns 3-D grid coordinates defined by the vectors x, y, and z. The grid represented by X, Y, and Z has size length(y)-by-length(x)-by-length(z).
Without speaking to the original authors, the exact motivation may be a bit obscure, but I suspect it has to do with the fact that the y-axis is generally associated with the rows of an image, while x is the columns.
Columns are either "j" or "x" in the documentation, rows are either "i" or "y".
Some functions deal with spatial coordinates. The documentation will refer to "x, y, z". These functions will thus take column values before row values as input arguments.
Some functions deal with array indices. The documentation will refer to "i, j" (or sometimes "i1, i2, i3, ..., in", or using specific names instead of "i" before the dimension number). These functions will thus take row values before column values as input arguments.
Yes, this can be confusing. But if you pay attention to the names of the variables in the documentation, you will quickly figure out the right order.
With meshgrid in particular, if the "x, y, ..." order of arguments is confusing, use ndgrid instead, which takes arguments in array indexing order.
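A small comparison may make the difference concrete. This is just a sketch with made-up variable names (xv, yv, Xm, Xn, ...); it only uses meshgrid and ndgrid on the same two vectors.

```matlab
xv = 1:2;                      % 2 values along x
yv = 1:3;                      % 3 values along y

[Xm, Ym] = meshgrid(xv, yv);   % size is length(yv)-by-length(xv), i.e. 3-by-2
[Xn, Yn] = ndgrid(xv, yv);     % size is length(xv)-by-length(yv), i.e. 2-by-3

disp(size(Xm))
disp(size(Xn))

% In 2-D the two layouts are simply transposes of each other.
disp(isequal(Xm, Xn.') && isequal(Ym, Yn.'))
```

In 3-D the same relation holds with permute(..., [2 1 3]) in place of the transpose, which is exactly the permutation used in the question.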

Clustering matrix distance between 3 time series

I have a question about applying clustering techniques, more concretely K-means.
I have a data frame with 3 sensors (A,B,C):
time A B C
8:00:00 6 10 11
8:30:00 11 17 20
9:00:00 22 22 15
9:30:00 20 22 21
10:00:00 17 26 26
10:30:00 16 45 29
11:00:00 19 43 22
11:30:00 20 32 22
... ... ... ...
And I want to group sensors that have the same behavior.
My question is: looking at the data frame above, must I calculate the correlation between the objects of the data frame and then apply the Euclidean distance to this correlation matrix, thus obtaining a 3 * 3 matrix of distances?
Or do I transpose my data frame and then compute the dist() matrix with the Euclidean metric only, which also gives me a 3 * 3 matrix of distances?
You have just three sensors. That means you'll need three values: d(A,B), d(B,C) and d(A,C). Any "clustering" here does not seem to make sense to me, certainly not k-means. K-means is for points (!) in R^d for small d.
Choose any form of time series similarity that you like. Could be simply correlation, but also DTW and the like.
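As an illustration of the correlation option, here is a minimal sketch. It is written in Matlab to match the rest of this page rather than in R, it uses only the eight rows shown in the question, and taking 1 - correlation as the dissimilarity is one common choice, not something the question prescribes. The names S, R and D are made up.

```matlab
% Rows are time steps, columns are the sensors A, B, C.
S = [ 6 10 11;
     11 17 20;
     22 22 15;
     20 22 21;
     17 26 26;
     16 45 29;
     19 43 22;
     20 32 22];

R = corrcoef(S);   % 3-by-3 correlation between the sensor columns
D = 1 - R;         % dissimilarity: 0 for perfectly correlated sensors

% The three values of interest are d(A,B), d(A,C) and d(B,C).
fprintf('d(A,B) = %.3f, d(A,C) = %.3f, d(B,C) = %.3f\n', D(1,2), D(1,3), D(2,3));
```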
Q1: No. Why: The correlation is not needed here.
Q2: No. Why: I'd calculate the distances differently
For the first row, R's built-in dist() function (which uses Euclidean distance by default)
dist(c(6, 10, 11))
gives you the intervals between each value
1 2
------
2| 4
3| 5 1
item 2 and 3 are closest to each other. That's simple.
But there is no single way to calculate the distance between a point and a group of points. There you need a linkage function (min/max/average/...)
What I would do using R's built-in kmeans() function:
Ignore the date column,
(assuming there are no NA values in any A,B,C columns)
scale the data if necessary (here they all seem to have same order of magnitude)
perform KMeans analysis on the A,B,C columns, with k = 1...n ; evaluate results
perform a final KMeans with your suitable choice of k
get the cluster assignments for each row
put them in a new column to the right of C

Calculating group means with own group excluded in MATLAB

To be generic the issue is: I need to create group means that exclude own group observations before calculating the mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m multiplied by the width column of x (second line of code) gives the sum of other groups. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Then, dividing by the respective count of other groups (third line) gives the desired result.
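If you need the results repeated as per the original firm column, use result(x(:,1)). This indexing works as long as the firm IDs are the consecutive integers 1, 2, ..., F, as in the example.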
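For comparison, the same leave-own-group-out mean can also be sketched with accumarray, which avoids building the groups-by-rows matrix. This is an alternative to the approach above, not the same code; the variable names (g, grpSum, grpCnt, otherMean, mean_desired) are made up, and the data is the example from the question.

```matlab
firm  = [1 1 1 2 2 2 3 3].';
width = [30 10 20 25 15 40 10 35].';

% g maps every row to a group index 1..F, whatever the firm IDs look like.
[~, ~, g] = unique(firm);

grpSum = accumarray(g, width);   % sum of width per firm
grpCnt = accumarray(g, 1);       % number of products per firm

% Mean over all rows that do NOT belong to the firm.
otherMean = (sum(width) - grpSum) ./ (numel(width) - grpCnt);

% One value per original row.
mean_desired = otherMean(g);
disp([firm width mean_desired])
```

For firm 1 this gives (185 - 60)/5 = 25, matching the value worked out in the question.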

Why should I transpose in a neural network in Matlab?

I would like to ask a question about the Matlab transpose symbol. For example, in this case:
input=input';
It transposes input, but I want to learn why we should use the transpose when using an Artificial Neural Network in Matlab.
Second question:
I am trying to build a classifier using an ANN in Matlab. I got results like this:
a=sim(neuralnetworkname,test)
test represents my test data in the neural network,
and the result looks like this:
a =
Columns 1 through 12
2.0374 3.9589 3.2162 2.0771 2.0931 3.9947 3.1718 3.9813 2.1528 3.9995 3.8968 3.9808
Columns 13 through 20
3.9996 3.7478 2.1088 3.9932 2.0966 2.0644 2.0377 2.0653
If the value of a is about 2, it is benign; if the value of a is about 4, it is malignant.
So I want to calculate, for example, that there are 100 benign cases in 500 data points (100/500). How can I print this 100/500 on the screen?
I tried to be clear, but if I wasn't clear enough, I can try to explain more. Thanks.
First Question
You don't need to transpose input values every time. Matlab nntool normally reads input values column by column by default. So you have two choices: 1. Change the dataset order 2. Transpose the input
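To illustrate the second choice, here is a minimal sketch. The sizes (500 samples, 9 features) and the names X and Xnet are hypothetical; the only point is the orientation, with one column per sample.

```matlab
% Hypothetical dataset stored the usual spreadsheet way:
% one row per sample, one column per feature.
X = rand(500, 9);

% Transposed so that every column is one sample.
Xnet = X.';

disp(size(Xnet))   % features-by-samples
```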
Second Question
Suppose you have matrix like this:
a=[1 2 3 4 5 6 7 8 9 0 0 0];
To count how many elements are below 8, write this:
sum(a<8) %[1 2 3 4 5 6 7 0 0 0]
Output will be:
10
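Applied to the network output from the question, the same idea could look like the sketch below. The cut-off of 3 (halfway between the two target values 2 and 4) is an assumption, as are the names threshold and nBenign.

```matlab
% Network outputs copied from the question.
a = [2.0374 3.9589 3.2162 2.0771 2.0931 3.9947 3.1718 3.9813 2.1528 3.9995 ...
     3.8968 3.9808 3.9996 3.7478 2.1088 3.9932 2.0966 2.0644 2.0377 2.0653];

threshold = 3;                 % assumed: below 3 counts as benign
nBenign = sum(a < threshold);  % number of outputs closer to 2 than to 4

% Print the count in the "benign/total" form asked for.
fprintf('%d/%d\n', nBenign, numel(a));
```

For these 20 values the count is 9, so the printed text is 9/20.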

Scramble an nx1 matrix in matlab efficiently?

I need to randomly scramble the values of an nx1 matrix in matlab. I'm not sure how to do this efficiently, I need to do it many times for n > 40,000.
Example
Matrix before:
1 2 2 2 3 4 5 5 4 3 2 1
Scrambled:
3 5 2 1 2 2 3 4 1 4 5 2
thank you
If your data is stored in matrix data, then you can generate "scrambled" data using randperm like so:
scrambled = data(randperm(numel(data)));
This is sampling without replacement, so every element of data will appear exactly once in scrambled.
For sampling with replacement (values in data may appear in scrambled multiple times and some may not appear at all), you could use randi like this:
scrambled = data(randi(numel(data),1,numel(data)));
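A quick way to convince yourself of the difference between the two variants, sketched on the example vector from the question (the names scrambled and resampled are made up):

```matlab
data = [1 2 2 2 3 4 5 5 4 3 2 1].';

% Without replacement: a pure reordering of the same elements.
scrambled = data(randperm(numel(data)));

% With replacement: same length, but individual elements may repeat or be missing.
resampled = data(randi(numel(data), numel(data), 1));

% Sorting undoes a permutation, so this is always true for the randperm variant.
disp(isequal(sort(scrambled), sort(data)))
```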