Finding correlation in an enum type data - matlab

I have the following dataset containing information about countries:
5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,0,0,0,0,1,0,0,1,0,0,
3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,
4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,
...
The sixth column in each row indicates the main religion of the country: 0 is catholic, 1 is other christian, 2 is muslim, etc. Some of the other columns record whether different colors are present in the country's flag, which symbols the flag contains, and so on.
The description of the data can be found here. I have removed the string data columns though so it doesn't fit exactly like the information shown.
My problem is that I want to use covariance matrices and Pearson correlation to see whether, for example, the fact that a flag contains the color red says anything about which religion the country is more likely to have. But since the religion is enumerated, I am a bit lost on how to proceed.

Your problem is that, although your data is encoded as numbers, the ordering of those numbers is arbitrary. The "distance" between "muslim" (enum val=2) and "hindu" (enum val=4) is not 2.
The most straight-forward way of tackling this issue is to convert enum values to binary indicator vectors:
Suppose you have
enum {
Catholic = 0,
Protestant,
Muslim,
Jewish,
Hindu,
...
NumOfRel };
You replace the single entry of enum val with a binary vector of length NumOfRel with zeros everywhere except for a single 1 at the appropriate place:
For a Protestant entry, you'll have the following binary vector:
[ 0 1 0 0 ... ]
For a Jewish:
[ 0 0 0 1 0 ... ]
And so on...
This way, the "distance" between any two different religions is always the same.
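As a minimal sketch of the indicator-vector idea in plain Python: the religion codes, the "has red" column, and the number of classes below are invented toy values, not taken from the dataset, and the helper names one_hot and pearson are hypothetical.

```python
from math import sqrt

def one_hot(codes, num_classes):
    """Turn each integer code into a 0/1 indicator vector of length num_classes."""
    return [[1 if code == k else 0 for k in range(num_classes)] for code in codes]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")
    return cov / (spread_x * spread_y)

# Toy data (assumed, for illustration only)
religion = [2, 6, 2, 1, 0, 2]   # enumerated religion codes
has_red = [1, 0, 1, 1, 0, 1]    # 1 if the flag contains red
num_religions = 8               # assumed number of enum values

indicators = one_hot(religion, num_religions)
# One correlation per religion: "red in flag" versus "country has religion k"
correlations = [
    pearson(has_red, [row[k] for row in indicators])
    for k in range(num_religions)
]
```

Each entry of correlations then relates the color column to a single religion indicator, so the arbitrary numeric ordering of the enum no longer enters the calculation.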

pure data [hist] implementation

No idea how to use [hist] in Pure Data.
The three arguments of [hist] are:
the value of first class,
the value of last class,
the number of classes.
I cannot figure out what the first and second arguments mean. And how do I pass the output of [hist] to [tabwrite] and generate an array diagram in Pure Data?
It seems you are using the [hist] object from smlib.
The histogram will contain <number of classes> bins of equal size, with the first bin being equivalent to the <value of first class> and the last bin being equivalent to <value of last class>-1 (the offset is arguably a bug).
So, the value of first class is the minimum expected input value (x>=min), and the value of last class is the maximum expected input value (x<max).
Any input value exceeding those boundaries will be clipped.
Examples:
[3, absolute(
|
[hist 2 5 3]
|
[print]
This will create a 3-bin histogram, with the bins 2±0.5 (with clipping this means x<2.5), 3±0.5 and 4±0.5 (with clipping that is 3.5<x).
The input 3 will be filed into the second bin, so the absolute histogram is 0 1 0.
Similarly:
[3, absolute(
|
[hist 3 6 3]
|
[print]
This will create a 3-bin histogram, with the bins 3±0.5, 4±0.5 and 5±0.5.
The input 3 will now be filed into the first bin, so the absolute histogram is 1 0 0.
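The binning described in the two examples can be modelled outside Pd. The following Python sketch only mirrors the behaviour described above (equal-width bins centred on first, first+width, ..., with out-of-range input clipped to the outer bins); it is not smlib's actual code, and hist_counts is a hypothetical name.

```python
import math

def hist_counts(inputs, first, last, classes):
    """Count inputs into `classes` bins centred at first, first + width, ..."""
    width = (last - first) / classes
    counts = [0] * classes
    for x in inputs:
        # Nearest bin centre; floor(v + 0.5) avoids Python's round-half-to-even
        index = math.floor((x - first) / width + 0.5)
        # Clip values that fall outside the expected range into the outer bins
        index = max(0, min(classes - 1, index))
        counts[index] += 1
    return counts

second_bin = hist_counts([3], 2, 5, 3)  # models [hist 2 5 3] fed with 3
first_bin = hist_counts([3], 3, 6, 3)   # models [hist 3 6 3] fed with 3
```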
Displaying the histogram:
You can set the table values by sending a list of numbers to the table, prefixed with the starting index:
[relative(
|
[hist 0 100 100]
|
[list prepend 0]
|
[s $0-histo]
[table $0-histo 100]
Alternatively check the [array] object (which also can be accessed via [tabread] and the like)

Bound constraints ignored Matlab

I have the following code working out the efficient frontier for a portfolio of assets:
lb=Bounds(:,1);
ub=Bounds(:,2);
P = Portfolio('AssetList', AssetList,'LowerBound', lb, 'UpperBound', ub, 'Budget', 1);
P = P.estimateAssetMoments(AssetReturns)
[Passetmean, Passetcovar] = P.getAssetMoments
pwgt = P.estimateFrontier(20);
[prsk, pret] = P.estimatePortMoments(pwgt);
It works fine apart from the fact that it ignores the constraints to some extent (results below). How do I make the constraints hard constraints, i.e. prevent it from ignoring an upper bound of zero? For example, when I set both the upper and lower bound to zero (i.e. I do not want a particular asset to be included in a portfolio), I still get values in the calculated portfolio weights for that asset, albeit very small ones, coming out as part of the optimised portfolio.
Lower bounds (lb), upper bounds (ub), and one of the portfolio weights (pwgt) are set out below:
lb ub pwgt(:,1)
0 0 1.06685493772574e-16
0 0 4.17200995972422e-16
0 0 0
0 0 2.76688394418301e-16
0 0 3.39138439553466e-16
0.192222222222222 0.252222222222222 0.192222222222222
0.0811111111111111 0.141111111111111 0.105624956477606
0.0912121212121212 0.151212121212121 0.0912121212121212
0.0912121212121212 0.151212121212121 0.0912121212121212
0.0306060606060606 0.0906060606060606 0.0306060606060606
0.0306060606060606 0.0906060606060606 0.0306060606060606
0.121515151515152 0.181515151515152 0.181515151515152
0.0508080808080808 0.110808080808081 0.110808080808081
0.00367003367003366 0.0636700336700337 0.0388531580005063
0.00367003367003366 0.0636700336700337 0.0636700336700338
0.00367003367003366 0.0636700336700337 0.0636700336700337
0 0 0
0 0 0
0 0 1.29236898960272e-16
I could use something like: pwgt=floor(pwgt*1000)/1000;, but is there not a more elegant solution than this?
The point is that your bound has not been ignored.
You are calculating with floating-point numbers, so a result such as 4.17200995972422e-16 is zero to within rounding error, and the program accepts it as satisfying the bound.
My recommendation would indeed be to round your results (or simply display fewer decimals with format short); however, I would do the rounding like this:
pwgt=round(pwgt*100000)/100000;
Note that other results may also lie marginally above their upper bound (for example 0.0636700336700338 against a bound of 0.0636700336700337), but the excess is too small to be significant.
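An alternative to rounding every weight is to snap only those weights that sit within a small tolerance of a bound. The sketch below is in Python rather than Matlab; the function name snap_to_bounds and the tolerance of 1e-12 are assumptions chosen for illustration, not part of the Portfolio API.

```python
def snap_to_bounds(weights, lower, upper, tol=1e-12):
    """Replace weights lying within `tol` of a bound by the bound itself."""
    cleaned = []
    for w, lo, hi in zip(weights, lower, upper):
        if abs(w - lo) <= tol:
            w = lo          # e.g. 1.07e-16 becomes exactly 0
        elif abs(w - hi) <= tol:
            w = hi          # e.g. a weight one ulp above the upper bound
        cleaned.append(w)
    return cleaned

# Three rows taken from the table in the question
weights = [1.06685493772574e-16, 0.105624956477606, 0.0636700336700338]
lower = [0, 0.0811111111111111, 0.00367003367003366]
upper = [0, 0.141111111111111, 0.0636700336700337]
cleaned = snap_to_bounds(weights, lower, upper)
```

Weights that are genuinely interior, such as the second one, are left untouched, which rounding to a fixed number of decimals would not guarantee.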
I had issues like this with a laminate/engineering properties code, which was propagating errors all over everything. I fixed it by taking all of the values I had and systematically converting them from double to sym, and suddenly my 1e-16 values became real zeros, which I could also eval(val) and still see as zeros! This may help, but you may have to go inside the .m files you're running and have the numbers converted to sym with val = sym(val).
I can't remember for certain, but I think Matlab functions MIGHT change sym to double once they receive the data for their own internal processing.

How to take one particular number or a range of particular number from a set of number?

I am looking for a way to take one particular number, or a range of numbers, from a set of numbers.
Example
A = [-10,-2,-3,-8, 0 ,1, 2, 3, 4 ,5,7, 8, 9, 10, -100];
How can I take just the number 5 from the set of numbers above, and
how can I take a range of numbers, for example from -3 to 4, from A?
Please help.
Thanks
I don't know what you are trying to accomplish by this, but you could check each entry of the set and test whether it is in the specified range of numbers. The test for a single number could be accomplished by testing each number explicitly, or as a special case of the range check where the lower and the upper bound are the same number.
Looping and testing works in any programming language, although most languages have built-in methods for this kind of task (so you may want to say which language you are supposed to use for your homework). In Python-like code:
def get_element(items, target=5):
    for index, element in enumerate(items):
        if element == target:
            # your "5" is in element and at items[index]
            return element, index
    return None
Getting a range:
def get_range(items, start=-3, stop=4):
    for index, element in enumerate(items):
        if element == start:
            subset = []
            for value in items[index:]:
                subset.append(value)
                if value == stop:
                    return subset
            # we met "-3" but never met "4", so there is no such range
            return None
        # otherwise keep searching for a "-3"
    return None
If run against A, get_range returns [-3, -8, 0, 1, 2, 3, 4]; this is a "first matched, first grabbed" poor man's algorithm. On sorted sets the algorithms can get smarter and faster.
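If "from -3 to 4" is meant as a test on the values rather than on positions in the list, the range check mentioned at the start of this answer could be sketched as follows. This reading of the question is an assumption, and values_in_range is a hypothetical helper name.

```python
def values_in_range(items, low, high):
    """Keep every element whose value lies between low and high, inclusive."""
    return [x for x in items if low <= x <= high]

A = [-10, -2, -3, -8, 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, -100]

in_range = values_in_range(A, -3, 4)   # [-2, -3, 0, 1, 2, 3, 4]
only_five = values_in_range(A, 5, 5)   # [5], the single-number special case
```

Note that this keeps -2 and drops -8, unlike the positional version, because it looks at values only.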

Change the class of columns in a data frame

First of all, excuse me if I make any mistakes; English is not a language I use very often.
I have a data frame with numbers. A small part of the data frame is this:
nominal ordinal
2       2
2       1
2       1
2       2
So, I want to use the gower distance function on these numbers.
Here ( http://rgm2.lab.nig.ac.jp/RGM2/R_man-2.9.0/library/StatMatch/man/gower.dist.html ) says that in order to use gower.dist, all nominal variables must be of class "factor" and all ordinal variables of class "ordered".
By default, all the columns are of class "integer" and mode "numeric". In order to change the class of the columns, I use these commands:
DF=read.table("clipboard",header=TRUE,sep="\t")
# I select all the cells and I copy them to the clipboard.
#Then R, with this command, reads the data from there.
MyHeader=names(DF) # I save the headers of the data frame to a temporary vector
for (i in 1:length(DF)) {
if (MyHeader[[i]]=="nominal") DF[[i]]=as.factor(DF[[i]])
}
for (i in 1:length(DF)) {
if (MyHeader[[i]]=="ordinal") DF[[i]]=as.ordered(DF[[i]])
}
The first for/if loop changes the class from integer to factor, which is what I want, but the second changes the class of ordinal variables to: "ordered" "factor".
I need to change all the columns with the header "ordinal" to "ordered", as the gower.dist function says.
Thanks in advance,
B.T.
What you are doing is fine, if perhaps a little inelegant.
With your ordered factor, you have something like:
> foo <- as.ordered(1:10)
> foo
[1] 1 2 3 4 5 6 7 8 9 10
Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10
> class(foo)
[1] "ordered" "factor"
Notice that it has two classes, indicating that it is an ordered factor and that it is a factor:
> is.ordered(as.ordered(1:10))
[1] TRUE
> is.factor(as.ordered(1:10))
[1] TRUE
In a sense, foo is an ordered factor that also inherits from the factor class. In practice this means that if there is no method that handles ordered factors specifically, but there is a method for factors, R will use the factor method. As far as R is concerned, an ordered factor is an object with classes "ordered" and "factor". This is what your function for Gower's distance will require.
You could easily do this with:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
which gives you a data frame with the correct structure. If you work with data frames, please stay away from [[]] unless you know very well what you're doing. Take Dirk's advice, and check Owen's R Guide as well. You definitely need it.
If I do the conversion as I showed above, gower.dist() works perfectly fine. On a side note, Gower's distance can easily be calculated using the daisy() function as well:
DF <- data.frame(
ordinal= c(1,2,3,1,2,1),
nominal= c(2,2,2,2,2,2)
)
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
library(cluster)
daisy(DF,metric="gower")
library(StatMatch)
gower.dist(DF)
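To see why the distinction between nominal and ordinal columns matters for the distance, here is a simplified Python sketch of a Gower-style dissimilarity. It assumes nominal columns contribute a plain 0/1 mismatch and ordinal columns contribute the absolute difference of their integer codes divided by the column range; the R functions above may treat ordered factors differently in detail, so this is not a reimplementation of gower.dist() or daisy(). The name gower_distance is hypothetical.

```python
def gower_distance(row_a, row_b, kinds, ranges):
    """Average per-column dissimilarity between two rows.

    kinds[i] is "nominal" or "ordinal"; ranges[i] is max - min of column i
    (only used for ordinal columns).
    """
    total = 0.0
    for a, b, kind, value_range in zip(row_a, row_b, kinds, ranges):
        if kind == "nominal":
            total += 0.0 if a == b else 1.0           # match / mismatch only
        else:
            # ordinal: scaled difference of the level codes
            total += abs(a - b) / value_range if value_range else 0.0
    return total / len(kinds)

# Rows are (ordinal, nominal), using the values of the data frame above
rows = [(1, 2), (2, 2), (3, 2), (1, 2), (2, 2), (1, 2)]
kinds = ("ordinal", "nominal")
ranges = (3 - 1, None)   # ordinal codes run from 1 to 3

d_first_third = gower_distance(rows[0], rows[2], kinds, ranges)
d_first_second = gower_distance(rows[0], rows[1], kinds, ranges)
```

Under these assumptions an ordinal column lets levels 1 and 2 count as closer than levels 1 and 3, whereas a nominal column would treat both pairs as equally different.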

Problem looking at data between 0 and -1

I'm trying to write a program that cleans data, using Matlab. The program takes in the max and min that the data can be, and throws out data that is less than the min or greater than the max. There seems to be a small issue with the cleaning part. The problem ONLY occurs when the minimum of the range for the variable being checked is 0. In that case, for one reason or another, the program won't throw away data points that are between 0 and -1. I've been trying to fix this for some time now, and noticed that this is the only case where this happens. If you run a SQL query selecting data that is < 0, it will also leave out data between 0 and -1, so effectively the same error as the one I am seeing. I am wondering if anyone might recognize this and know what it could be.
I would write such a function as:
function data = cleanseData(data, limits)
limits = sort(limits);
data = data( limits(1) <= data & data <= limits(2) );
end
an example usage:
a = rand(100,1)*10;
b = cleanseData(a, [-2 5]);
c = cleanseData(a, [0 -1]);
-1 is less than 0, so 0 should be the max value. And if this is the case it will keep points between -1 and 0 by your definition of the cleaning operation:
and throws out data that is less than the min or greater than the max.
If you want to throw away (using the above definition)
data points that are between 0 and -1
then you need to set 0 as the min value and -1 as the max value --- which does not make sense.
Also, I think you mean
and throws out data that is less than the min AND greater than the max.
It may be that the floats are getting cast to ints before the comparison. I don't know Matlab, but in Python int(-0.5)==0, which could explain the extra data points getting in. You can test this by setting the min to -1; if you then also get values from -1 to -2, you'll need to make sure no casting is being done.
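A small Python sketch of this hypothesis follows. Whether the original Matlab program actually casts anywhere is unknown; the function names and the sample values are made up for illustration.

```python
def clean_truncating(values, minimum, maximum):
    """Buggy variant: int() truncates toward zero, so -0.5 is compared as 0."""
    return [v for v in values if minimum <= int(v) <= maximum]

def clean(values, minimum, maximum):
    """Intended behaviour: compare the floating-point values themselves."""
    return [v for v in values if minimum <= v <= maximum]

data = [1.0, 0.0, -0.2, -0.8, -1.0, -1.2, -2.0]

kept_buggy = clean_truncating(data, 0, 5)   # -0.2 and -0.8 slip through
kept_correct = clean(data, 0, 5)            # only 1.0 and 0.0 remain
```

With a minimum of -1 the truncating variant would likewise let -1.2 through, which is the test suggested above.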
If I try to mimic your situation with SQL, and run the following query against a table that has 1.00, 0.00, -0.20, -0.80, -1.00, -1.20 and -2.00 in the column SomeVal, it correctly returns -0.20 and -0.80, as expected.
SELECT SomeVal
FROM SomeTable
WHERE (SomeVal < 0) AND (SomeVal > - 1)
The same is true for Matlab. Perhaps there's an error in your code. Check the above statement against your own SELECT statement to see if something's amiss.
I can imagine such a bug if you do something like
minimum = 0
if minimum and value < minimum