What is the difference between attributes and points in clustering? - cluster-analysis

I was going through K-means clustering and I noticed that its complexity is O(n * K * I * d),
where n = number of points
K = number of clusters
I = number of iterations, and
d = number of attributes.
Could anyone please explain to me the difference between points and attributes?

It is common to describe each sample by a set of features. For example, suppose we have a dataset of students in which each student has these attributes (features): first_name, last_name, grade, degree.
In this example, if the dataset holds information on 20 students, we have a data matrix of size (20, 4), where 20 is the number of samples (points) and 4 is the number of attributes (features).
I hope this description helps.
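To make the link to the complexity formula concrete, here is a minimal MATLAB sketch (the random data and cluster count are made up for illustration; kmeans assumes the Statistics Toolbox):
X = rand(20, 4);     % n = 20 points (rows), d = 4 attributes (columns)
K = 3;               % number of clusters
idx = kmeans(X, K);  % idx(i) is the cluster assigned to point i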

Related

Sort Matlab data into groups

I have a column of numerical data (imported from Excel) and I would like to sort each of the column entries into 4 different groups based on custom size ranges, then calculate how many column entries are in each group as a fraction of the total number of entries in the column.
For example, if my column were 1,3,13,11,5,9, I would want to calculate how many entries fit into the group 1-3, how many fit into 4-7, and so on, then express the count in each group as a fraction of the total number of column entries (i.e., 6 in this example).
Does anyone know how to do this best?
Thanks
Hannah :)
Sorry, I misread your question. Here is the updated code:
% one row per group: [lower upper] bounds, both inclusive
ranges = [1 3
          4 7
          8 11
          12 13];
groups = size(ranges,1);
a = [1,3,13,11,5,9];
counter = zeros(groups,1);
for i = 1:groups
    counter(i) = sum(a >= ranges(i,1) & a <= ranges(i,2));  % entries inside group i
end
relative_counter = counter / numel(a);  % fraction of all entries per group
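As a side note, for contiguous ranges like these the same counts can be obtained without a loop; a minimal sketch, assuming a MATLAB release that has histcounts (R2014b or later):
edges = [1 4 8 12 14];                  % covers the ranges 1-3, 4-7, 8-11, 12-13
a = [1,3,13,11,5,9];
counter = histcounts(a, edges);         % counts per range: [2 1 2 1]
relative_counter = counter / numel(a);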
Old answer:
I do not understand how you get your group bounds (in your question the first group spans 3 values but the second spans 4?).
Have a look at the following code (be careful and test how it should behave at group borders):
groups = 4;
a = [1,3,13,11,5,9];
range = max(a) - min(a);
rangePerGroup = range/groups;  % equal-width groups across the data range
a_noOffset = a - min(a);       % shift values so they start at 0
counter = zeros(groups,1);
for i = 1:groups
    % note: a value landing exactly on a shared border satisfies both
    % conditions and is counted in two adjacent groups
    counter(i) = sum(a_noOffset >= rangePerGroup*(i-1) & a_noOffset <= rangePerGroup*i);
end
relative_counter = counter / numel(a);
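For the example vector this gives counter = [2;1;1;2] (the groups are [0,3], [3,6], [6,9] and [9,12] after the offset) and relative_counter = counter/6; no value happens to fall exactly on an interior border here, so nothing is double-counted.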

How do I derive weights given a weighted average?

Ok I feel like this has to have a simple solution but I can't for the life of me figure it out. I have a given weighted average, let's say total returns of a portfolio. And I want to break that out into returns from equity and returns from bonds. I know the returns of each and my total return, but I don't know how to calculate what weights I had in each.
I know I can use goal seek in Excel to get the answer, but there has to be some calculation I can use.
Ex: Total Return (weighted average of stocks and bonds) = 3.48%, Stock Returns = 5.21%, Bond Returns = 0.59%
If I understand your question correctly, you want to show that the weight of the stock portfolio is 1.67 times the weight of the bond portfolio.
Write the growth factors as 1.0059 * bond + 1.0521 * stock = 1.0348 * (bond + stock), with stock = w * bond.
After replacing stock by w * bond in the first equation and isolating w, you get w = (1.0348 - 1.0059) / (1.0521 - 1.0348) ≈ 1.67, i.e., the initial portfolio contains 1.67 units of stock for every 1 unit of bonds. Hope it helps.
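For completeness, here is a minimal MATLAB sketch of the closed-form solution, assuming the two weights sum to 1 (variable names are my own):
rTotal = 0.0348; rStock = 0.0521; rBond = 0.0059;
% solve wS*rStock + (1-wS)*rBond = rTotal for the stock weight wS
wS = (rTotal - rBond) / (rStock - rBond);  % ~0.626
wB = 1 - wS;                               % ~0.374
ratio = wS / wB                            % ~1.67, the stock-to-bond ratio above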

Iterating in a matrix avoiding loop in MATLAB

I am posing an interesting and useful question that needs to be carried out in MATLAB. It is about programming efficiently by avoiding loops.
Assume a matrix URm whose columns are products and rows are people. The matrix entries are people's ratings of these products, and the matrix is sparse, as each person normally rates only a few products.
URm [n_u, n_i]
Another matrix of interest is F, which contains a fixed-length attribute vector for each product:
F [n_f, n_i]
We divide URm into two sub-matrices at random: URmTrain and URmTest, where the former is used for training the system and the latter for testing. These two matrices have the same rows (users) but may have different numbers of columns (products).
We can find the similarity between items very quickly using pdist() or a matrix product (F is [n_f, n_i], so the product below is [n_i, n_i]):
S = F' * F ;
For each row (user) in URmTest:
URmTestp = zeros(size(URmTest));
u = 1; % example: user 1
for i = 1 : size(URmTest,2)
    % for each user, find the items in URmTrain that the user has rated
    % (i.e. those with a rating greater than zero)
    indTrain = find(URmTrain(u,:));
    for j = 1 : length(indTrain)
        URmTestp(u,i) = URmTestp(u,i) + S(i,indTrain(j)) * URmTrain(u,indTrain(j));
    end
end
where URmTestp is the predicted version of URmTest, and we can compute an error measuring how good our prediction has been.
Example
Let's make a simple example. Assume user 1 has rated items 3, 5 and 17:
indTrain = [3 5 17]
For each item j in URmTest, I want to predict the rating using the following formula:
URmTestp(u,j) = S(j,3)*URmTrain(u,3) + S(j,5)*URmTrain(u,5) + S(j,17)*URmTrain(u,17)
Once completed, this process needs to be repeated for all users.
As URm is typically very big, I prefer options that use the smallest number of loops. We may be able to take advantage of bsxfun, but I am not sure if we can.
Please suggest ideas that can help accelerate this process as much as possible. Thank you.
I'm still not sure I completely understand your problem. But it seems to me that if you pre-compute s_ij as
s_ij = F.' * F; % [n_i x n_i] matrix
then what you're after is simply
URmTestp(u,indTest) = URmTrain(u,indTrain) * s_ij(indTrain,indTest);
% or
%URmTestp(u,:) = URmTrain(u,indTrain) * s_ij(indTrain,:);
or if you compute a smaller s_ij block just for the necessary indices,
s_ij = F(:,indTrain).' * F(:,indTest);
then
URmTestp(u,indTest) = URmTrain(u,indTrain) * s_ij;
Alternatively, you can always compute the necessary subblock of s_ij on the fly:
URmTestp(u,indTest) = URmTrain(u,indTrain) * F(:,indTrain).'*F(:,indTest);
If I understand correctly that indTest and indTrain are functions of u, such as
URmTestp = zeros(n_u,n_i); % pre-allocate here!
for u = 1:n_u
    indTest = testCell{u};
    indTrain = trainCell{u};
    URmTestp(u,indTest) = URmTrain(u,indTrain) * F(:,indTrain).'*F(:,indTest);
    ...
end
then probably not much can be vectorized on this loop, unless there's a very tricky indexing scheme that allows you to use linear indices. I'd stick with this setup.
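One further observation, offered as a sketch: if indTrain is simply the nonzero support of URmTrain(u,:), as in the question, then the restriction to indTrain is implicit, because zero ratings contribute nothing to the sum. The whole per-user loop then collapses into a single matrix product:
S = F.' * F;              % [n_i x n_i] item-item similarity
URmTestp = URmTrain * S;  % predictions for every user and every item at once
You would then keep only the columns you actually want to test for each user, trading memory for speed.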

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is essentially this: compute the standard deviation with one of the values removed, find the value whose removal gives the largest reduction in the standard deviation, remove it, and repeat for the number of elements in the array. This is terribly inefficient.
% repeatedly remove the value whose removal most reduces the standard
% deviation, then read off what is left (terribly inefficient)
for j = 1:numel(H)-4                  % leave a handful of values behind
    G = double(H);
    T = zeros(1, numel(G));
    for i = 1:numel(G)
        G(i) = NaN;                   % std with element i removed
        T(i) = nanstd(G);
        G(i) = H(i);                  % restore element i before the next test
    end
    [~, best] = min(T);               % the removal that helps the most
    H(best) = NaN;
end
x = find(H == max(H));
Any thoughts?
This approach bins your data and looks for the bin with the most elements. If your distribution consists of well separated clusters this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb = find(abs(H-bm) < bl/2) % elements within the most populated bin
median(H(ifb)) % median of those elements
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance, this is the output of [H' kmeans(H',4)]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H', nbin);               % one cluster label per element
[mv, iv] = max(histc(km, 1:nbin));   % iv = label of the most populated cluster
H(km == iv)                          % elements of the largest cluster
median(H(km == iv))
Notice, however, that kmeans does not necessarily return the same clustering every time it is run, so you might need to average over a few runs.
I timed the two methods and found that kmeans takes roughly 10x longer. However, it is more robust, since the cluster sizes adapt to your problem and do not need to be set beforehand (only the number of clusters does).
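One way to reduce that run-to-run variability is the 'Replicates' option, which runs kmeans several times and keeps the best result; a sketch, assuming a Statistics Toolbox version that supports it:
km = kmeans(H', nbin, 'Replicates', 10);  % keep the best of 10 random starts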

How do I create ranking (descending) table in matlab based on inputs from two separate data tables? [closed]

I have four data sets (please bear with me here):
1st table: a list of 10 tickers (stock symbols) in one column, in text format, in MATLAB.
2nd table: dates in numerical format in one column (10 days, double).
3rd table: a 10x10 data set of random numbers (assume 0-1 for simplicity), e.g. earnings-per-share (EPS) growth; I want high EPS growth in my ranking for portfolio construction.
4th table: another 10x10 data set of random numbers (assume 0-1 for simplicity), e.g. daily price-to-earnings (P/E) ratios; I want a low P/E ratio in my ranking for portfolio construction.
NOW: each day I want a portfolio made up of the 3 stocks with the largest values in table 3 for that day and the 3 stocks with the smallest values in table 4. The output must be a list of tickers for each day (3 in this case) based on the combined ranking of the two factors (tables 3 & 4 as described).
Any ideas? In short I need to end up with a top bucket with three tickers...
It is not entirely clear from the post what you are trying to achieve. Here is a take based on guessing, with various options.
Your first two "tables" store symbols for stocks and days (irrelevant for ranking). Your third and fourth are scores arranged in a stock x day manner. Let's assume stocks vertical, days horizontal and stocks symbolized with a value in [1:10].
N = 10; % num of stocks
M = 10; % num of days
T3 = rand(N,M); % table 3 stocks x days
T4 = rand(N,M); % table 4 stocks x days
Sort the score tables in descending and ascending order respectively (to get upper and lower scores per day, i.e. per column):
[Sl,L] = sort(T3, 'descend');
[Ss,S] = sort(T4, 'ascend');
Keep three largest and smallest:
largest = L(1:3,:); % bucket of 3 largest per day
smallest = S(1:3,:); % bucket of 3 smallest per day
IF you need the ones in both buckets (0 stands in for NaN):
% Intersection of both buckets
indexI = zeros(3,M);
for i = 1:M
    z = largest(ismember(largest(:,i), smallest(:,i)), i);
    if ~isempty(z)
        indexI(1:length(z),i) = z;
    end
end
IF you need the ones in either bucket (0 stands in for NaN):
% Union of both buckets
indexU = zeros(6,M);
for i=1:M
z = unique([largest(:,i),smallest(:,i)]);
indexU(1:length(z),i) = z;
end
IF you need a ranking of scores/stocks from the set of the 3 largest (table 3) and 3 smallest (table 4):
scoreAll = [Sl(1:3,:); Ss(1:3,:)];  % stacked scores, 6 x M
indexAll = [largest; smallest];     % corresponding stock indices
[~, indexSort] = sort(scoreAll, 'descend');
indexBest = zeros(3,M);             % pre-allocate
for i = 1:M
    indexBest(:,i) = indexAll(indexSort(1:3,i), i);
end
UPDATE
To get a weighted ranking of the final scores, define a weight vector (one weight per score row, 6 x 1 here) and use one of the two options below, then sort scoreAllW instead of scoreAll:
w = [0.3; 0.3; 0.3; 0.7; 0.7; 0.7];
scoreAllW = scoreAll .* repmat(w,1,M);     % Option 1
scoreAllW = bsxfun(@times, scoreAll, w);   % Option 2
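For instance, to finish the weighted variant following the same pattern as above (indexAll and M as defined earlier):
[~, indexSortW] = sort(scoreAllW, 'descend');
indexBestW = zeros(3, M);
for i = 1:M
    indexBestW(:,i) = indexAll(indexSortW(1:3,i), i);
end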