How to average and equally compare categories with different numbers of data elements?

I am extracting a list of categories, each of which contains a number of listed values. I am then averaging the values and comparing the categories. Here is the general explanation.
Example:
Category 1 has 2 elements
Category 2 has 5 elements
Category 3 has 9 elements
Category 4 has 10 elements
Category 5 has 17 elements
Category 6 has 26 elements
Category 7 has 55 elements
Within each category, each individual element has a score. I am attempting to compare the average score of one category against that of another on an equal footing.
The problem is that because each category contains a different number of elements, the averages are not directly comparable. For example, comparing Category 1 with 2 elements to Category 7 with 55 elements.
If Category 1 had 55 elements, then I could say that I am equally comparing its overall value to Category 7, which also has 55 elements.
My first thought was to say that each category must have 10 scores to be compared equally.
For Category 1, I thought about taking the 2 scores and adding 8 zeros to show that the category is weaker for lacking the other 8, while comparing against Category 7 with its strongest 10 scores out of the 55, but I don't believe that will provide any useful result.
The same would apply to Category 2 with 5 elements: 5 zeros are factored in to make 10.
The same would apply to Category 3 with 9 elements: 1 zero is factored in to make 10.
What I am trying to do is find a way to compare apples to apples, so that each category is measured against a set limit of 10 scores to gauge which is stronger relative to the other categories.
Is there a process or method with which I can address this? Is there a better way to approach this?
Thank you!

We can't decide for you which of the aggregate functions is the most appropriate for your case. Usually, people use the average or the maximum, like this:
select category, count(1), avg(score), max(score) from scores group by category
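The same aggregation outside the database might look like the following. This is only a minimal Python sketch: the `summarize` helper and the sample scores are made up for illustration and do not come from the question.

```python
from collections import defaultdict

def summarize(rows):
    """Return {category: (count, average, maximum)} for (category, score) pairs."""
    by_category = defaultdict(list)
    for category, score in rows:
        by_category[category].append(score)
    return {
        category: (len(scores), sum(scores) / len(scores), max(scores))
        for category, scores in by_category.items()
    }

# Hypothetical scores, not taken from the question.
rows = [("Category 1", 4.0), ("Category 1", 2.0),
        ("Category 2", 3.0), ("Category 2", 5.0), ("Category 2", 1.0)]
summary = summarize(rows)
for category, (count, average, maximum) in summary.items():
    print(category, count, average, maximum)
```

Reporting the count next to the average makes it visible how many elements each average rests on.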

Related

Calculating group means with own group excluded in MATLAB

To be generic the issue is: I need to create group means that exclude own group observations before calculating the mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m multiplied by the width column of x (second line of code) gives the sum of other groups. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Then, dividing by the respective count of other groups (third line) gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))
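For comparison, the same numbers can be obtained without building the zero-one matrix, by subtracting each group's own sum and count from the overall totals. This is a different technique from the matrix product above, shown as a minimal Python sketch; the function name `mean_of_other_groups` is made up for illustration, and the data is the example table from the question.

```python
from collections import defaultdict

def mean_of_other_groups(groups, values):
    """For each group, average the values of all observations outside it."""
    group_sum = defaultdict(float)
    group_count = defaultdict(int)
    for g, v in zip(groups, values):
        group_sum[g] += v
        group_count[g] += 1
    total_sum = sum(values)
    total_count = len(values)
    # Assumes at least two groups; with a single group the divisor would be zero.
    return {
        g: (total_sum - group_sum[g]) / (total_count - group_count[g])
        for g in group_sum
    }

firm = [1, 1, 1, 2, 2, 2, 3, 3]
width = [30, 10, 20, 25, 15, 40, 10, 35]
other_mean = mean_of_other_groups(firm, width)
per_row = [other_mean[f] for f in firm]  # repeated as per the firm column
print(per_row)
```

For firm 1 this gives 25, matching the value worked out by hand in the question.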

MATLAB/OCTAVE - Branching loops? or parallel looping?

Still new to the programming game but I need a little help! I'm not exactly sure how to describe what I want to do but I'll give it my best shot. I have a set of numbers produced by an algorithm I've put together, e.g.:
....
10 10 10
11 11 11
12 1 2
13 3 4
14 12 13
15 6 7
16 5 15
17 8 9
....
Essentially what I want to do is assign these index numbers to groups. Let's say I start with the number 14 in the first column. It is going to belong to group 1, so in a new column in row 14 I label it "1" for group one. The second and third columns show the other index numbers that are grouped with index 14. So I use code like:
FindLHS = find(matrix(:,1)==matrix(14,2));
and
FindRHS = find(matrix(:,1)==matrix(14,3));
so clearly this will produce the results of
FindLHS = 12
FindRHS = 13
I will then proceed to label both 12 and 13 as belonging to group "1", as I did for 14.
Now my problem is that I want to repeat the same procedure for both 12 and 13, finding and labelling their linked indices, which are (1,2) and (3,4). Is there a way to repeat that code for indices 1, 2, 3 and 4 as well? The real dataset has over 5000 data points in it...
Do you understand what I mean?
Thanks
James
All you really want to do is find wherever matrix(:,1) contains one of the numbers you've already found, include the numbers in the second and third columns into your group list (presuming they aren't already there), and stop when that list stops growing, right? This may not be the most efficient way of doing it but it gives you the basic idea:
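num = 14;      % starting index (the example from the question)
oldnum = [];   % empty, so the loop runs at least once
while numel(oldnum) ~= numel(num)
    oldnum = num;
    idx = ismember(matrix(:,1), oldnum);  % rows whose index is already in the group
    num = unique(matrix(idx,:));          % those indices plus the ones they link to
end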
Output:
num =
1
2
3
4
12
13
14
Now if your first column is literally just your numbers 1 through 5000 in order, you don't need to even find the index, you can just use your number list directly.
To do this for multiple groups you would just need an outer loop that stores the information for each group, then picks out the next unused number. I'm presuming that your individual groups are consistent so that no matter which of those numbers you pick you end up with the same result - e.g. starting at 2 or 14 gives you the same result (if not, it becomes more complex).
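The grow-until-stable idea can also be sketched in Python. The `grow_group` function and the `links` dictionary are names made up for this illustration. Rows 10 to 17 are taken from the question; rows 1 to 9 are elided there, so the sketch assumes they point to themselves in the same way rows 10 and 11 do.

```python
def grow_group(links, start):
    """Collect every index reachable from `start` by following the links."""
    group = {start}
    while True:
        grown = set(group)
        for index in group:
            grown.update(links[index])
        if len(grown) == len(group):  # the list stopped growing
            return sorted(group)
        group = grown

# Rows 1-9 are an ASSUMPTION (self-links); rows 10 and 11 appear in the question.
links = {i: (i, i) for i in range(1, 12)}
# Rows 12-17 as listed in the question: index -> (second column, third column).
links.update({12: (1, 2), 13: (3, 4), 14: (12, 13),
              15: (6, 7), 16: (5, 15), 17: (8, 9)})

group_1 = grow_group(links, 14)
print(group_1)
```

Starting from 14, the result contains the same seven numbers as the MATLAB output above.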

Nearest Neighbour Classifier for multiple features

I have a data set that looks like this:
     Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Class
Obj  2          2          2          8          5          1
Obj  2          8          3          3          4          2
Obj  1          7          4          4          8          1
Obj  4          3          5          9          7          2
The rows contain objects, which have a number of features. I have put 5 features for demonstration purposes but there are approximately 50 features per object, with the final column being the class label for each object.
I want to create and run the nearest neighbour classifier algorithm on this data set and retrieve the error rate. I have managed to get the NN algorithm working for each feature separately; a short pseudocode example is below. For each feature I loop through each object, assigning object j a class according to its nearest neighbours.
for i = 1:Number of features
    for j = 1:Number of objects
        distance between data(j,i) and values of feature i
        order by shortest distance
        sum the class labels of the k shortest distances
        assign class with largest number of labels
    end
    error = mean(labels~=assigned)
end
The issue I have is how would I work out the 1-NN algorithm for multiple features. I will have a selection of the features from my dataset say features 1,2 and 3. I want to calculate the error rate if I add feature 5 into my set of selected features. I want to work out the error using 1NN. Would I find the nearest value out of all my features 1-3 in my selected feature?
For example, for my data set above:
Adding feature 5 - For object 1 of feature 5 the closest number to that is object 4 of feature 3. As this has a class label of 2 I will assign object 1 of feature 5 the class 2. This is obviously a misclassification but I would continue to classify all other objects in Feature 5 and compare the assigned and actual values.
Is this the correct way to perform the 1NN against multiple features?

postgresql compute min value of columns conditioning on a value of other columns

Can I do this with standard SQL, or do I need to create a function for the following problem?
I have 14 columns, which represent 2 properties of 7 consecutive objects (the order from 1 to 7 is important), so
table.object1prop1, ...,table.object1prop7,table.objects2prop2, ..., table.objects2prop7.
I need to compute the minimum value of property 2 over those of the 7 objects whose property 1 is smaller than a specific threshold.
The values of property 1 for the 7 objects lie on an ascending arithmetic scale. So property 1 of the object 1 will always be smaller than property 2 of the objects 1.
Thanks in advance for any clue!
This would be easier if the data were normalized. (Hint, any time you find a column name with a number in it, you are looking at a big red flag that the schema is not in 3rd normal form.) With the table as you describe, it will take a fair amount of code, but the greatest() and least() functions might be your best friends.
http://www.postgresql.org/docs/current/interactive/functions-conditional.html#FUNCTIONS-GREATEST-LEAST
If I had to write code for this, I would probably feed the values into a CTE and work from there.
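To make the per-row logic concrete, here is a minimal Python sketch of the computation being asked for: the smallest property 2 among the objects whose property 1 is below the threshold. The function name `min_prop2_below` and the sample values are hypothetical, and the sketch says nothing about how the SQL itself should be written.

```python
def min_prop2_below(prop1, prop2, threshold):
    """Smallest prop2 among objects whose prop1 is below threshold, else None."""
    return min(
        (p2 for p1, p2 in zip(prop1, prop2) if p1 < threshold),
        default=None,
    )

# Hypothetical row: seven objects, property 1 ascending as the question describes.
prop1 = [1, 2, 3, 4, 5, 6, 7]
prop2 = [9, 4, 6, 2, 8, 1, 5]
result = min_prop2_below(prop1, prop2, threshold=4)
print(result)
```

Returning `None` when no object qualifies mirrors how a SQL aggregate over no qualifying values yields NULL.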

Matlab: Sum elements in array into another array

Suppose I have an array age=[16 17 25 18 32 89 43 55] which holds the ages of a certain list of people. I also have a second array called groups=[1 1 2 1 3 2 1 4], which denotes the group each person belongs to, i.e. the person whose age is 55 is the only person in group 4, there are four people in group 1, etc.
I want to calculate the combined sum of ages in each group. That is, the result I want to get in this case is an array of 4 elements, its first entry containing the sum of ages of people belonging to group #1 (16+17+18+43), its second entry containing the sum of ages of people belonging to group #2 (25+89), etc.
I know of course how to do this with a for loop, but is it possible to do this using some variation of sum or something similar, so as to tap into matlab's vector optimization?
The code in @Ismail's answer is fine, but you could also try this:
>> accumarray(groups', age')
ans =
94
114
32
55
I find it hard to get an appreciation from the documentation exactly what accumarray can do in its full generality, but this is a great example of a simple usage. It's worth learning how to use it effectively, as once you've worked it out it's very powerful - and it will be a lot faster (when used on a larger example) than arrayfun.
You can use arrayfun and unique as follows:
arrayfun(@(x) sum(age(groups==x)), unique(groups))
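For readers outside MATLAB, the same grouped sum can be sketched in plain Python by accumulating into a dictionary keyed by group. The function name `sum_by_group` is made up for illustration, and the data is the example from the question.

```python
from collections import defaultdict

def sum_by_group(groups, values):
    """Sum the values per group label, returned in ascending label order."""
    totals = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
    return [totals[g] for g in sorted(totals)]

age = [16, 17, 25, 18, 32, 89, 43, 55]
groups = [1, 1, 2, 1, 3, 2, 1, 4]
sums = sum_by_group(groups, age)
print(sums)
```

The list `sums` holds one total per group, in the same order as the accumarray output shown above.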