Calculating group means with own group excluded in MATLAB - matlab

To be generic the issue is: I need to create group means that exclude own group observations before calculating the mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.

Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(#ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m multiplied by the width column of x (second line of code) gives the sum of other groups. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Then, dividing by the respective count of other groups (third line) gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))

Related

Running total using two columns

Given a table with data like:
A
B
Qty.
Running Total
5
5
5
10
5
15
I can create the running total using the formula =SUM($A$2:A2) and then drag down to get the running total after each quantity (here Qty.)
What may I do for calculating running total using two columns which may or may not be consecutive as shown below:
A
B
C
D
Qty. 1
Other
Qty. 2
RT
2
blah
2
4
2
phew
2
8
3
xyz
2
13
Place in cell D2 the formula =SUM(A2,C2,D1). Do not pay attention to the fact that the function will refer to a non-numeric cell D1 - the SUM() function will not break, unlike ordinary addition =A2+C2+D1. Now, just stretch the formula down.

Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
UR F 12 U FT TEA MOTHER 1 11
GB M 22 R FT SER FATHER 5 15
GB M 12 R FT OTH FATHER 3 12
GB M 11 R PT POL FATHER 2 10
Where some variables are binary, some are categorical, some numerical. Would it be possible to extract from it a correlation matrix, with the correlation coefficients between the variables? I tried using both corrcoef and corrplot from the econometrics toolbox, but I come across errors such as 'observed data must be convertible to type double'.
Anyone would have a take on how this can be done? Thank you.
As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the Matlab help here, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double though, as it transforms categorical variables to arbitrary numbers, see the matlab forum.
Also note, that you are introducing order into your categorical variables, if you simply transform them to numbers. So if you for example transform JOB 'TEA', 'SER', 'OTH' to 1, 2, 3, etc. you are making the variable ordinal. 'TEA' is then < 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
Which returns a matrix of size nrows x unique(T.SCHOOL).
And then there is the whole discussion, whether it is useful to calculate correlations of categorical variables. Like here.
I hope this helps :)
I think you need to make all the data numeric, i.e change/code the non-numerical columns to for example:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
1 1 12 1 1 1 1 1 11
2 2 22 2 1 2 2 5 15
2 2 12 2 1 3 2 3 12
2 2 11 2 2 4 2 2 10
and then do the correlation.

How to group a column in tableau based on value of another column

I am new to tableau and need help in figuring this out.I have a dataset in below format:
hid:id for the house the customer belong
cid:customer id
hID CustomerID
1 A
1 B
1 C
2 D
2 E
3 F
3 G
3 H
3 I
4 J
5 K
5 L
5 M
5 N
5 O
So A,B belong to house 1 so count of hid '1' is 3 so:
hid count of members
1 3
2 2
3 3
4 1
5 5
I want to show a graph in tableau as size of house that is X-axis :Size of house and Y-axis :Count no of house with same size so for above data the values as below:
Size of house no of house
1 1
2 1
3 2
4 0
5 1
The final graph should be:
In Tableau jargon, you're looking to bin based upon an aggregate value. Take a look at the following blog post for a more detailed description/walk-through.
One way to accomplish this is by leveraging talbeau's level of detail calculations. Creating a calculated field along the lines of:
{FIXED [hID] : COUNTD([CustomerID])}
You can then create a bin field by right clicking on the new field and binning based on a parameter, or a a static size (1?) of your choosing.
To create the visual, place this second bin field on the row shelf and on the column shelf drag the hID dimension and right click to convert to a measure by selecting Count Distinct.
As a side note, depending on whether you set your bin field as continuous or discrete, the 4 bin in your sample data will or will not appear.

MATLAB/OCTAVE - Branching loops? or parallel looping?

Still new to the programing game but I need a little help! I'm not exactly sure how to describe what I want to do but I'll give it my best shot. I have a set of numbers produced by an algorithm I've put together. e.g. :
....
10 10 10
11 11 11
12 1 2
13 3 4
14 12 13
15 6 7
16 5 15
17 8 9
....
Essentially what I want to do is assign these index numbers to groups. Lets say I start with the number 14 in the first column. It is going to belong to group 1, so I label it in a new column in row 14 "1" for group one. The second and the third column show other index numbers that are grouped with the index 14. So I use a code like:
FindLHS = find(matrix(:,1)==matrix(14,2));
and
FindRHS = find(matrix(:,1)==matrix(14,3));
so clearly this will produce the results of
FindLHS = 12
FindRHS = 13
I will then proceed to label both 12 and 13 as belonging to group "1" as I did for 14
now my problem is I want to do this same procedure for both 12 and 13 of finding and labelling the indexs for 12 and 13 being (1,2) and (3,4). Is there a way to repeat that code for both idx of 1,2,3 and 4? because the real dataset has over 5000 data points in it...
Do you understand what I mean?
Thanks
James
All you really want to do is find wherever matrix(:,1) contains one of the numbers you've already found, include the numbers in the second and third columns into your group list (presuming they aren't already there), and stop when that list stops growing, right? This may not be the most efficient way of doing it but it gives you the basic idea:
while ~(numel(oldnum)==numel(num))
oldnum = num;
idx = ismember(matrix(:,1),oldnum)
num = unique(matrix(idx,:))
end
Output:
num =
1
2
3
4
12
13
14
Now if your first column is literally just your numbers 1 through 5000 in order, you don't need to even find the index, you can just use your number list directly.
To do this for multiple groups you would just need an outer loop that stores the information for each group, then picks out the next unused number. I'm presuming that your individual groups are consistent so that no matter which of those numbers you pick you end up with the same result - e.g. starting at 2 or 14 gives you the same result (if not, it becomes more complex).

Matlab general matrix indexing for accesing several rows

Edit for clarity:
I have two matrices, p.valor 2x1000 and p.clase 1x1000. p.valor consists of random numbers spanning from -6 to 6. p.clase contains, in order, 200 1:s, 200 2:s and 600 3:s. What I wan´t to do is
Print p.valor using a diferent color/prompt for each clase determined in p.clase, as in following figure.
I first wrote this, in order to find out which locations in p.valor represented where the 1,2 respective 3 where in p.clase
%identify the locations of all 1,2 respective 3 in p.clase
f1=find(p.clase==1);
f2=find(p.clase==2);
f3=find(p.clase==3);
%define vectors in p.valor representing the locations of 1,2,3 in p.clase
x1=p.valor(f1);
x2=p.valor(f2);
x3=p.valor(f3);
There is 200 ones (1) in p.valor, thus, is x1=(1:200). The problem is that each number one(1) (and, respectively 2 and 3) represents TWO elements in p.valor, since p.valor has 2 rows. So even though p.clase and thus x1 now only have one row, I need to include the elements in the same colums as all locations in f1.
So the different alternatives I have tried have not yet been succesfull. Examples:
plot(x1(:,1), x1(:,2),'ro')
hold on
plot(x2(:,1),x2(:,2),'k.')
hold on
plot(x3(:,1),x3(:,2),'b+')
and
y1=p.valor(201:400);
y2=p.valor(601:800);
y3=p.valor(1401:2000);
scatter(x1,y1,'k+')
hold on
scatter(x2,y1,'b.')
hold on
scatter(x3,y1,'ro')
and
y1=p.valor(201:400);
y2=p.valor(601:800);
y3=p.valor(1401:2000);
plot(x1,y1,'k+')
hold on
plot(x2,y2,'b.')
hold on
plot(x3,y3,'ro')
My figures have the axisies right, but the plotted values does not match the correct figure provided (see top of the question).
Ergo, my question is: how do I include tha values on the second row in p.valor in my plotted figure?
I hope this is clearer!
Values from both rows simultaneously can be accessed using this syntax:
X=p.value(:,findX)
In this case, resulting X matrix will be a matrix having 2 rows and length(findX) columns.
M = magic(5)
M =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
11 18 25 2 9
M2 = M(1:2, :)
M2 =
17 24 1 8 15
23 5 7 14 16
Matlab uses column major indexing. So to get to the next row, you actually just have to add 1. Adding 2 to an index on M2 here gets you to the next column, or adding 5 to an index on M
e.g. M2(3) is 24. To get to the next row you just add one i.e. M2(4) returns 5.To get to the next column add the number of rows so M2(2 + 2) gets you 1. If you add the number of columns like you suggested you just get gibberish.
So your method is very wrong. Freude's method is 100% correct, it's much easier to use subscript indexing than linear indexing for this. But I just wanted to explain why what you were trying doesn't work in Matlab. (aside from the fact that X=p.value(findX findX+1000) gives you a syntax error, I assume you meant X=p.value([findX findX+1000]))