Sort Matlab data into groups - matlab

I have a column of numerical data (imported from excel) and I would like to sort each of the column entries into 4 different groups based on custom size ranges, then calculate how many column entries are in each group, as a fraction of the total number of entries in the column.
For example, if my column was 1,3,13,11,5,9, I want to calculate how many entries fit into group 1-3, how many fit into group 4-7, and so on, and then express the number of entries in each group as a fraction of the total number of column entries (i.e. 6 in this example).
Does anyone know how to do this best?
Thanks
Hannah :)

Sorry, I misread your question.
Here is the updated code:
ranges = [1 3
4 7
8 11
12 13];
groups = size(ranges,1);
a = [ 1,3,13,11,5,9];
counter = zeros(groups,1);
for i=1:groups
counter(i) = sum(a>=ranges(i,1) & a<=ranges(i,2));
end
relative_counter = counter / numel(a);
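If the values are integers, so that the ranges above cover the data without gaps, the same counts can be obtained without a loop. The following is only a minimal sketch, assuming histcounts is available (MATLAB R2014b or later); the names edges, counts and fractions are illustrative and not part of the code above.

```matlab
a = [1, 3, 13, 11, 5, 9];
ranges = [1 3; 4 7; 8 11; 12 13];
% Assumption: integer data, so the lower bound of each range plus the last
% upper bound can serve as bin edges. histcounts bins are [lo, hi), except
% the last bin, which also includes its right edge.
edges = [ranges(:,1); ranges(end,2)].';
counts = histcounts(a, edges);      % number of entries per group
fractions = counts / numel(a);      % share of all entries per group
```

For the example data this gives counts of 2, 1, 2 and 1, the same as the loop.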
Old answer:
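I do not understand how you get your group bounds (in your question the first group has 3 elements and the 2nd group has 4?)
Have a look at the following code. Be careful and test how it should behave at the group borders: both comparisons are inclusive, so a value lying exactly on a border is counted in both neighbouring groups.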
groups =4;
a = [ 1,3,13,11,5,9];
range = max(a)-min(a);
rangePerGroup = range/groups;
a_noOffset = a-min(a);
counter = zeros(groups,1);
for i=1:groups
counter(i) = sum(a_noOffset>=rangePerGroup*(i-1) & a_noOffset<=rangePerGroup*i);
end
relative_counter = counter / numel(a);
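One way to avoid counting a border value twice is to assign every value to exactly one equal-width group. This is a sketch rather than a definitive fix: it assumes max(a) > min(a) and treats the groups as half-open intervals, with the maximum placed in the last group; idx is an illustrative name.

```matlab
groups = 4;
a = [1, 3, 13, 11, 5, 9];
rangePerGroup = (max(a) - min(a)) / groups;
% Half-open bins [lo, hi): floor assigns a border value to the upper group.
idx = floor((a - min(a)) / rangePerGroup) + 1;
idx = min(idx, groups);             % the maximum belongs to the last group
counter = accumarray(idx(:), 1, [groups 1]);
relative_counter = counter / numel(a);
```

Because each value gets a single group index, the counts always sum to numel(a).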

Related

Frequency of maximum values occurring in columns

I have a 35 x 24 matrix of numbers. Each of the 35 rows obviously has a maximum value. I'm trying to write a short piece of code which determines which of the 24 columns contains the most of these max values. The catch is that no loops are allowed.
For example if the maximum values for 30 different rows all happened to lie in column 7 then I would want MATLAB to return the answer 7 since this is the column with the most max row values.
If the values in each row are unique, we can simply use the second output of max combined with the mode to figure out which column contains the maximum of each row most often.
% Find the column which contains the maximum value
[~, column] = max(data, [], 2);
result = mode(column);
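As a small worked example of the max/mode approach, here is a made-up 3 x 3 matrix with unique values in each row (the data are an assumption for illustration only):

```matlab
data = [1 9 3;
        4 8 2;
        7 5 6];
[~, column] = max(data, [], 2);   % column index of each row's maximum
result = mode(column);            % most frequent column index
```

Here the row maxima sit in columns 2, 2 and 1, so result is 2.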
However, below is a more general solution that allows a given maximum value to occur multiple times per row.
maximaPerColumn = sum(bsxfun(@eq, data, max(data, [], 2)), 1);
result = find(maximaPerColumn == max(maximaPerColumn));
Explanation
First we want to compute the maximum value for each row (maximum across columns, dimension 2).
rowMaxima = max(data, [], 2);
Then we want to replace each row with a 1 if the value is equal to the max of that row and 0 otherwise. We can easily do this using bsxfun.
isMaxOfRow = bsxfun(@eq, data, rowMaxima);
Then we want to figure out how many times a given column contains a max of a row. We can simply sum down the columns to get this.
maximaPerColumn = sum(isMaxOfRow, 1);
Now we want to find the column which contained the maximum number of maxima. We use find due to the fact that more than one column could contain the same number of maxima.
result = find(maximaPerColumn == max(maximaPerColumn));
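To see the steps on concrete numbers, here is a small made-up matrix in which a row maximum is repeated within a row and two columns tie for the most maxima (the data are purely illustrative):

```matlab
data = [5 5 1;
        2 7 7;
        1 3 9];
rowMaxima = max(data, [], 2);                 % [5; 7; 9]
isMaxOfRow = bsxfun(@eq, data, rowMaxima);    % logical mask of row maxima
maximaPerColumn = sum(isMaxOfRow, 1);         % [1 2 2]
result = find(maximaPerColumn == max(maximaPerColumn));  % [2 3]
```

Columns 2 and 3 each hold two row maxima, so find returns both of them.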
I think you are looking for this:
sum(A==max(A,[],2))
Example:
A = [1 1 1;
2 2 2;
3 2 1]
M = sum(A==max(A,[],2))
Returns:
[3 2 2]
The first column has the most row wise max values. You could use find to identify this column.
find(M==max(M))

Delete rows and columns of a matrix

I have the matrix below:
a = [1 2 1 4;
3 4 9 16;
0 0 -2 -4;
0 0 -6 -8]
How can I remove arbitrary rows or columns, for example the second row and third column of the above matrix?
Just assign the empty matrix to the row or column:
a(2,:) = [];
a(:,3) = [];
Note: I compared the other solution to mine, following the link given there. On a big array (created with rand(1e4)) and over 10 runs in which I delete 2 columns and 2 rows, the average times are 0.932ms for the empty-matrix assignment and 0.905ms for the kept-row (or kept-column) assignment. So the gap is not as big as the 1.5x mentioned in the link. Always run a little benchmark first :) !
Edit: The fastest solution is to create index masks for the rows and columns, and reassign your array using these masks. Example:
a = rand(10000);
kr = true(size(a,1),1);
kr([72,6144]) = false; % some rows to delete
kc = true(1,size(a,2));
kc([1894,4512]) = false; % some columns to delete
a = a(kr,kc);
On this test, it is clearly about twice as fast as deleting the rows and the columns separately.
A slightly more efficient way (although possibly more complicated to set up) is to reassign all the rows you want to keep (when compared with setting the rows you want to delete to the empty matrix). So for example if you want to delete rows 5 and 7 from a matrix you can either do
A = A([1:4, 6, 8:end],:)
or
A = A(setdiff(1:size(A,1), [5,7] ),:)
but the best method is likely to use logical indexing (which is often a natural step in Matlab workflows anyway):
idx = true(size(A,1),1);
idx([5,7]) = false;
A = A(idx,:)
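Applied to the matrix from the question, removing the second row and the third column with logical masks could look like the sketch below (keepRows, keepCols and b are illustrative names):

```matlab
a = [1 2  1  4;
     3 4  9 16;
     0 0 -2 -4;
     0 0 -6 -8];
keepRows = true(size(a,1), 1);
keepRows(2) = false;              % drop the second row
keepCols = true(1, size(a,2));
keepCols(3) = false;              % drop the third column
b = a(keepRows, keepCols);        % b = [1 2 4; 0 0 -4; 0 0 -8]
```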

find unique times among years in time series

Suppose I have a date vector, shown here as tt, and a corresponding data series aa. For example:
dd = datestr(datenum('2007-01-01 00:00','yyyy-mm-dd HH:MM'):1/24:...
datenum('2011-12-31 23:00','yyyy-mm-dd HH:MM'),...
'yyyy-mm-dd HH:MM');
tt = datevec(datenum(dd,'yyyy-mm-dd HH:MM'));
tt(1002,:) = [];
aa = rand(length(tt),1)
How is it possible to ensure that the hours and days are consistent among the years?
For example, I only want to keep times that are the same among years e.g.
2009-01-01 01:00
would be the same as
2010-01-01 01:00
and so on.
If one year has a measurements at
2009-01-01 02:00
but yyyy-01-01 02:00
is not present in the other years, this time should be removed.
I would like to return tt and aa where only those times that are consistent among the years are kept. How can this be done?
I was considering finding the indices for the unique years first as:
[~,~,iyears] = unique(tt(:,1),'rows');
and then find the indices for the unique month, day, and hour as:
[~,~,iid] = unique(tt(:,2:4),'rows');
but I am not sure how to combine these to give the desired output?
The solution below uses a loop to store data in an uninitialized array, which can be inefficient, but unless your dataset is huge (with many, many years) it should do the job. The general idea is to break the dataset up into years. I am storing the resulting time vectors in a cell array, because they probably won't have the same length. I then do a set intersection of all the time vectors to get a vector of common times. From there it's straightforward.
years = unique(tt(:,1), 'rows');
% Put the "sub-times" of each year into cell array
for ii = 1:length(years)
times_each_year{ii} = tt(tt(:,1)==years(ii),2:end);
end
% Do intersection of all "sub-times" sets
common_times = times_each_year{1};
for ii = 2:length(years)
common_times = intersect(common_times, times_each_year{ii},'rows');
end
% Find and delete the points that are not member of the "sub-times":
idx = ~ismember(tt(:,2:end),common_times,'rows');
deleted_points = datestr(tt(idx,:)); % for later review
tt(idx,:) = [];
However, note that the deleted_points vector contains more points than one might expect. That's because 2008 was a leap year, and all the points corresponding to February 29th were deleted.
Another such oddity might await you if your data is "contaminated" by daylight savings time.
Code
a1 = str2num(datestr(tt,'mmddHHMM')); %// If in your data minutes are always 00, you can use 'mmddHH' instead and save some runtime
k1 = unique(a1);
gt1 = histc(a1,k1);
valid_rows = ismember(a1,k1(gt1==max(gt1)));
new_tt = tt(valid_rows,:); %// Desired tt output
new_aa = aa(valid_rows,:); %// Desired aa output
Explanation
To understand how it works, let's test out the code at a micro-level. Let's assume some small data that corresponds to tt -
data1 = [4 5 1 4 5 1 4 5 6]
data1 is data collected over a few sets. It resembles tt, which holds data over a few years, once the four parameters month, day, hour and minute are combined into a single parameter.
One can see that it represents data from three sets/years: {4,5}, {1,4,5} and {1,4,5,6}. Our job is to find all those values in data1 that are repeated across all three years/sets of data. Thus, the final output must be {4,5}.
Let's see how this can be coded up.
Step 1: Get the unique values
unique_val = unique(data1)
We would have - [1 4 5 6]
Step 2: Get the count of unique values in the data
count_unique_val = histc(data1,unique_val)
Output is - [2 3 3 1]
Step 3: Get the indices from the unique values array where their counts are equal to the maximum of counts, indicating those are the unique values that are repeated across all the sets.
index1 = count_unique_val==max(count_unique_val)
Output comes out as - [0 1 1 0]
Step 4: Get those "consistent" unique values
consistent_val = unique_val(index1)
Gives us - [4 5], which is what we were looking for.
Step 5: Finally get the indices where the consistent data is present,
which can be used later on to select the rows with "consistent" data.
index_consistent_val = ismember(data1,consistent_val)
Output is - [1 1 0 1 1 0 1 1 0], which makes sense too.
Please note that in the original code a1 = str2num(datestr(tt,'mmddHHMM')); gets us the single parameter from the four parameters of month, date, hour and minutes as discussed in the comments earlier too.
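The max(gt1) test treats a time as consistent when it occurs as often as the most frequent time, which equals "present in every year" only if at least one time really is present in every year. A stricter variant compares each count with the number of years directly. This is a minimal sketch, assuming each timestamp occurs at most once per year and that tt is a datevec-style matrix; the names key, nYears and counts and the toy data are illustrative.

```matlab
tt = [2009 1 1 1 0 0;
      2009 1 1 2 0 0;
      2010 1 1 1 0 0;
      2011 1 1 1 0 0;
      2011 1 1 2 0 0];
aa = (1:size(tt,1)).';
nYears = numel(unique(tt(:,1)));
% Numeric key built from month, day, hour and minute (no string conversion)
key = tt(:,2)*1e6 + tt(:,3)*1e4 + tt(:,4)*1e2 + tt(:,5);
[~, ~, ic] = unique(key);
ic = ic(:);
counts = accumarray(ic, 1);          % occurrences of each key
valid_rows = counts(ic) == nYears;   % key present in every year
new_tt = tt(valid_rows, :);
new_aa = aa(valid_rows);
```

In this toy data only 01-01 01:00 occurs in all three years, so rows 1, 3 and 4 are kept.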

How do I create ranking (descending) table in matlab based on inputs from two separate data tables? [closed]

I have four data sets (please bear with me here):
1st Table: List of 10 tickers (stock symbols) in one column in txt format in matlab.
2nd table: dates in numerical format in one column (10 days in double format).
3rd table: I have 10*10 data set of random numbers (assume 0-1 for simplicity). (Earnings Per Share growth EPS for example)--so I want high EPS growth in my ranking for portfolio construction.
4th table: I have another 10*10 data set of random numbers (assume 0-1 for simplicity). (Price to earnings ratios for example daily).-so I want low P/E ratio in my ranking for portfolio construction.
NOW: I want to rank portfolio of stocks each day made up of 3 stocks (largest values) from table one for a particular day and bottom three stocks from table 2 (smallest values). The output must be list of tickers for each day (3 in this case) based on combined ranking of the two factors (table 3 & 4 as described).
Any ideas? In short I need to end up with a top bucket with three tickers...
It is not entirely clear from the post what you are trying to achieve. Here is a take based on guessing, with various options.
Your first two "tables" store symbols for stocks and days (irrelevant for ranking). Your third and fourth are scores arranged in a stock x day manner. Let's assume stocks vertical, days horizontal and stocks symbolized with a value in [1:10].
N = 10; % num of stocks
M = 10; % num of days
T3 = rand(N,M); % table 3 stocks x days
T4 = rand(N,M); % table 4 stocks x days
Sort the score tables in ascending and descending order (to get upper and lower scores per day, i.e. per column):
[Sl,L] = sort(T3, 'descend');
[Ss,S] = sort(T4, 'ascend');
Keep three largest and smallest:
largest = L(1:3,:); % bucket of 3 largest per day
smallest = S(1:3,:); % bucket of 3 smallest per day
If you need the ones in both (0 marks an empty slot):
% Inter-section of both buckets
indexI = zeros(3,M);
for i=1:M
z = largest(ismember(largest(:,i),smallest(:,i)),i); % index column i, not column 1
if ~isempty(z)
indexI(1:length(z),i) = z;
end
end
If you need the ones in either one (0 marks an empty slot):
% Union of both buckets
indexU = zeros(6,M);
for i=1:M
z = unique([largest(:,i),smallest(:,i)]);
indexU(1:length(z),i) = z;
end
IF you need a ranking of scores/stocks from the set of largest_of_3 and smallest_of_4:
scoreAll = [Sl(1:3,:); Ss(1:3,:)];
indexAll = [largest;smallest];
[~,indexSort] = sort(scoreAll,'descend');
for i=1:M
indexBest(:,i) = indexAll(indexSort(1:3,i),i);
end
UPDATE
To get a weighted ranking of the final scores, define the weight vector (one weight per row of scoreAll) and use one of the two options below, then sort scoreAllW instead of scoreAll:
w = [0.3 ;0.3; 0.3; 0.7; 0.7; 0.7];
scoreAllW = scoreAll.*repmat(w,1,10); % Option 1
scoreAllW = bsxfun(@times, scoreAll, w); % Option 2
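The question also asks for a combined ranking of the two factors. One possible reading, sketched here under the assumption that both factors carry equal weight, is to rank every stock per day on each table and add the two ranks. The ticker names and the tiny one-day data set are hypothetical.

```matlab
tickers = {'AAA'; 'BBB'; 'CCC'; 'DDD'};   % hypothetical symbols
T3 = [0.9; 0.1; 0.5; 0.7];                % higher is better (e.g. EPS growth)
T4 = [0.2; 0.8; 0.1; 0.9];                % lower is better (e.g. P/E)
[~, o3] = sort(T3, 1, 'descend');
[~, r3] = sort(o3, 1);                    % rank 1 = largest T3 value
[~, o4] = sort(T4, 1, 'ascend');
[~, r4] = sort(o4, 1);                    % rank 1 = smallest T4 value
combined = r3 + r4;                       % equal-weight rank sum (assumption)
[~, best] = sort(combined, 1);            % best combined rank first
top3 = best(1:3, :);                      % stock indices, one column per day
topTickers = tickers(top3(:, 1));         % tickers for the first day
```

Sorting the sort order a second time turns positions into ranks, so the same lines work column by column for a full stocks x days table.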

How to count matches in several matrices?

For a dichotomous study, I have to count how many times a condition occurs.
The study is based on two kinds of matrices, ones with forecasts and others with analyzed data.
In both the forecast and the analysis matrices, if a condition is satisfied we add 1 to a counter. This process is repeated for points distributed over a grid.
Are there any functions in MATLAB that help me with counting or any script that supports this procedure?
Thanks guys!
EDIT:
The case is about registered and forecast precipitation. When both exceed a threshold I consider it a hit. I have Europe divided into several grid points, and I have to count how many times the forecast is correct. I also have 50 forecasts for each year, so the result (hit/no hit) must be accumulated.
I've been trying the count and sum functions, but they reduce the spatial dimension of the matrices.
It's difficult to tell exactly what you are trying to do but the following may help.
forecasted = [ 40 10 50 0 15];
registered = [ 0 15 30 0 10];
mismatch = abs( forecasted - registered );
maxDelta = 10;
forecastCorrect = mismatch <= maxDelta
totalCorrectForecasts = sum(forecastCorrect)
Results:
forecastCorrect =
0 1 0 1 1
totalCorrectForecasts =
3
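To keep the spatial dimensions, as described in the question's edit, one option is to stack the fields along a third dimension and sum only along that dimension. This is a sketch under assumptions: the arrays are laid out as rows x columns x forecasts, and the threshold and data values are invented for illustration.

```matlab
threshold = 10;                              % hypothetical threshold
forecast = cat(3, [12 3; 8 20], [15 1; 11 2]);
observed = cat(3, [11 14; 2 25], [9 0; 13 30]);
% A hit requires both fields to exceed the threshold at the same point and time
isHit = forecast > threshold & observed > threshold;
hitsPerPoint = sum(isHit, 3);                % grid-sized count of hits
```

hitsPerPoint has the same size as one grid (here 2 x 2) and holds the cumulative number of hits at each grid point.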