Multiple COUNTIFS-style counts in a matrix - MATLAB

Say I have the following data, S =
Year Week Postcode
2009 24 2035
2009 24 4114
2009 24 4127
2009 26 4114
2009 26 4556
2009 27 7054
2009 27 6061
2009 27 4114
2009 27 2092
2009 27 2315
2009 27 7054
2009 27 4217
2009 27 4551
2009 27 2035
2010 1 4132
2010 1 2155
2010 5 4114 ... (>60000 rows)
In Matlab, I would like to create a matrix with:
column 1: year (2006-2014)
column 2: week (1-52 for each year)
then the next n columns are unique postcodes where the data in each of these columns counts the occurrences from my data, S.
For example:
year week 2035 4114 4127 4556 7054
2009 24 1 1 1 0 0
2009 25 0 0 0 0 0
2009 26 0 1 0 1 0
2009 27 1 1 0 0 2
2009 28 0 0 0 0 0
Thanks if you can help!

Here is a working script which achieves this tabulation. The output is in the table named data. You should:
Read the documentation on unique, tables, logical indexing and sortrows, as these are the key tools used below.
Adapt the script to work with your data. This may involve changing matrices to cell arrays to deal with string inputs, etc.
Possibly adapt this into a function, for cleaner use if it is run regularly or on different data.
Code, fully commented for explanation:
% Use rng for repeatability in rand, n = num data entries
rng('default')
n = 100;
% Set up test data. You would use 3 equal-length vectors of real data here
years = floor(rand(n,1)*9 + 2006);        % random integers between 2006 and 2014
weeks = floor(rand(n,1)*52 + 1);          % random integers between 1 and 52
postcodes = floor(rand(n,1)*10)*7 + 4000; % arbitrary integers over 4000
% Create year/week values like 2017.13, get unique indices
[~, idx, ~] = unique(years + weeks/100);
% Set up table with year/week data
data = table();
data.Year = years(idx);
data.Week = weeks(idx);
% Get the unique postcodes, which become the remaining columns
uniquepostcodes = unique(postcodes);
% Cycle over unique postcodes, assign data
for ii = 1:numel(uniquepostcodes)
    % Variable names cannot start with a digit, so prefix with 'p'
    postcode = ['p', num2str(uniquepostcodes(ii))];
    % Create a data column variable for each unique postcode
    data.(postcode) = zeros(size(data.Year,1),1);
    % Count occurrences of the postcode in each date row.
    % This uses logical indexing of the original data, looking for all rows
    % which satisfy the year and week of the current row, and the postcode
    % of the current column.
    for jj = 1:numel(data.Year)
        data.(postcode)(jj) = sum(years == data.Year(jj) & ...
                                  weeks == data.Week(jj) & ...
                                  postcodes == uniquepostcodes(ii));
    end
end
% Sort week/year data so all is chronological
data = sortrows(data, [1,2]);
% To check all original data was counted, you could run
% sum(sum(table2array(data(:,3:end))))
% ans = n means that all data points were counted somewhere
On my PC, this takes less than 2.4 seconds for n = 60,000. There are almost certainly optimisations which can be made, but for something which may be used infrequently, this seems acceptable.
Processing time increases linearly with the number of unique postcodes, because of the loop structure. So if you double the unique postcodes (20 rather than my example of 10), the time is nearer 4.8 seconds, twice as long.
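One such optimisation is to replace the double loop with a single grouped count. This is only a sketch, assuming the same years, weeks and postcodes vectors as in the script above; it builds all the counts with one accumarray call, so the run time no longer grows with the number of unique postcodes:

```matlab
% Group rows by (year, week) pair and by postcode
[datePairs, ~, dateIdx] = unique([years, weeks], 'rows'); % sorted year/week pairs
[uniquepostcodes, ~, pcIdx] = unique(postcodes);

% counts(i,j) = occurrences of postcode j in date-row i
counts = accumarray([dateIdx, pcIdx], 1, ...
                    [size(datePairs,1), numel(uniquepostcodes)]);

% Assemble the same table layout as before ('p' prefix for valid variable names)
names = arrayfun(@(p) sprintf('p%d', p), uniquepostcodes', 'UniformOutput', false);
data2 = array2table([datePairs, counts], 'VariableNames', [{'Year','Week'}, names]);
```

Since unique(...,'rows') sorts its output, no final sortrows call is needed.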
If this solves your problem, consider accepting this as the answer.

Related

What data should I use to predict monthly values using linear regression?

Predicting next month's values using linear regression.
I am using 6 months of historical values to predict future values.
I use the vaccinated count as the dependent variable, and the month, converted to an integer starting at 1, as the independent variable.
Example.
Historical Data:
Month dependent variable independent variable
Jun 15 1
Jul 14 2
Aug 18 3
Sep 19 4
Oct 20 5
Nov 22 6
Is that correct?
Dependent Variable = Vaccinated Count
Independent Variable = Month converted to a number starting from 1
I'm hoping for feedback on whether this data setup is correct.
Python simple linear regression (df here is a pandas DataFrame of hardcover book sales, indexed by date):
            Hardcover
Date
2000-04-01        139
2000-04-02        128
2000-04-03        172
2000-04-04        139
2000-04-05        191
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df['Time'] = np.arange(len(df.index))  # integer time step: 0, 1, 2, ...
            Hardcover  Time
Date
2000-04-01        139     0
2000-04-02        128     1
2000-04-03        172     2
2000-04-04        139     3
2000-04-05        191     4
fig, ax = plt.subplots()
ax.plot('Time', 'Hardcover', data=df, color='0.75')
ax = sns.regplot(x='Time', y='Hardcover', data=df, ci=None,
                 scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Hardcover Sales');
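To connect this back to the question's own numbers, and since the rest of this page is MATLAB-centred, here is a minimal MATLAB sketch of the same idea using the six historical values from the table above (the forecast value is approximate):

```matlab
months = (1:6)';                     % Jun..Nov encoded as 1..6
vaccinated = [15 14 18 19 20 22]';   % dependent variable from the question's table
p = polyfit(months, vaccinated, 1);  % p(1) = slope, p(2) = intercept
forecast = polyval(p, 7);            % next-month prediction, roughly 23.4
```

So yes: encoding the months as consecutive integers starting at 1 is a standard and valid setup for a trend regression.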

Calculating mean over column with condition

So my question is as follows: I have a matrix (let's take
A = [1 11 22 33; 44 13 12 33; 1 14 33 44]
as an example) for which I want to calculate the mean of each column separately. The tricky part is that I only want to include, in each column's mean, the numbers which are greater than that column's 25th percentile.
I was thinking to simply create the 25th percentile and then use this as a criterion for selecting rows. This, unfortunately, does not work.
In order to further clarify: What should happen is to go through each column and calculate the 25th percentile
prctile(A,25,1)
And then calculating the mean only over those numbers which are bigger than their column's percentile.
Any help?
Thanks!
You can create a version of A which is NaN for values at or below the 25th percentile, then use the 'omitnan' flag in mean to exclude those points:
A = [1 11 22 33; 44 13 12 33; 1 14 33 44];
B = A;                           % copy to leave A unaltered
B( B <= prctile(B,25,1) ) = NaN; % set the values we want to exclude to NaN
C = mean( B, 1, 'omitnan' );     % omit the NaN values from the calculation
% C >>
% [ 44.00 13.50 27.50 44.00 ]
Note that in columns 1 and 4 the smallest values tie with the 25th percentile, so they are excluded and only the single remaining value contributes to those means.

Finding index of vector from its original matrix

I have a 2D matrix; let's assume its values are
a =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
17 24 1 8 15
11 18 25 2 9
This matrix is going to be divided into three different matrices randomly, let's say
b =
17 24 1 8 15
23 5 7 14 16
c =
4 6 13 20 22
11 18 25 2 9
d =
10 12 19 21 3
17 24 1 8 15
How can I know the indices of the rows of matrix d, for example, in the original matrix a? Note that the values in the matrix can be duplicated.
For example, what if I want to know the index of {10 12 19 21 3} in matrix a?
Or the index of {17 24 1 8 15} in matrix a, though for this one, should it return only one index value?
I would appreciate it so much if you can help me with this. Thank you in advance.
You can use ismember with the 'rows' option. For example:
tf = ismember(a, c, 'rows')
Should produce:
tf =
0
0
1
0
0
1
To get the indices of the rows, you can apply find to the result of ismember (note that this is redundant if you're planning to use the vector for logical matrix indexing). Here find(tf) returns the vector [3; 6].
If you want to know the number of the row in matrix a that matches a single vector, you can either use the method explained above and apply find, or use the second output parameter of ismember. For example:
[tf, loc] = ismember([10 12 19 21 3], a, 'rows')
returns loc = 4 for your example. Note that here a is the second parameter, so that the output variable loc holds a meaningful result.
Handling floating-point numbers
If your data contains floating-point numbers, the ismember approach is going to fail because exact floating-point comparisons are unreliable. Here's a shorter variant of Amro's solution:
x = reshape(c', size(c, 2), 1, []);
tf = any(all(abs(bsxfun(@minus, a', x)) < eps), 3)';
Essentially this is a one-liner, but I've split it into two commands for clarity:
x holds the target rows to be searched, concatenated along the third dimension.
bsxfun subtracts each row in turn from all rows of a, and the magnitude of the result is compared to some small threshold value (e.g. eps). If all elements in a row fall below it, that row is marked "1".
It depends on how you build those divided matrices. For example:
a = magic(5);
d = a([2 1 2 3],:);
then the matching rows are obviously: 2 1 2 3
EDIT:
Let me expand on the idea of using ismember shown by @EitanT to handle floating-point comparisons:
tf = any(cell2mat(arrayfun(@(i) all(abs(bsxfun(@minus, a, d(i,:))) < 1e-9, 2), ...
                           1:size(d,1), 'UniformOutput',false)), 2)
Not pretty, but it works :) This would be necessary for comparisons such as: 0.1*3 == 0.3
(basically it compares each row of d against all rows of a using an absolute difference)
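As a quick standalone illustration of why the tolerance is needed (this snippet is not part of the answer's code):

```matlab
0.1*3 == 0.3              % false: 0.1*3 is not exactly 0.3 in double precision
abs(0.1*3 - 0.3) < 1e-9   % true: an absolute-tolerance comparison succeeds
```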

How to perform repeated regression in MATLAB?

I have an Excel file that contains 5 columns and 48 rows: monthly water demand, population and rainfall data for four years (1997-2000).
Year Month Water_Demand Population Rainfall
1997 1 355 4500 25
1997 2 375 5000 20
1997 3 320 5200 21
.............% rest of the month data of year 1997.
1997 12 380 6000 24
1998 1 390 6500 23
1998 2 370 6700 20
............. % rest of the month data of year 1998
1998 12 400 6900 19
1999 1
1999 2
.............% rest of the month data of years 1999 and 2000
2000 12 390 7000 20
I want to do multiple linear regression in MATLAB. Here the dependent variable is water demand, and the independent variables are population and rainfall. I have written the code for this for all 48 rows:
A1=data(:,3);
A2=data(:,4);
A3=data(:,5);
x=[ones(size(A1)),A2,A3];
y=A1;
b=regress(y,x);
yfit=b(1)+b(2).*A2+b(3).*A3;
Now I want to do the repetition. First, I want to exclude row number 1 (i.e. exclude the year 1997, month 1 data) and do the regression with the remaining 47 rows of data. Then I want to exclude row number 2 and do the regression with the data of row 1 and rows 3-48. Then I want to exclude row number 3 and do the regression with the data of rows 1-2 and rows 4-48. There are always 47 data points, as I exclude one row in each run. Finally, I want to get a table of the regression coefficients and yfit for each run.
A simple way I can think of is creating a for loop and a temporary "under test" matrix that is exactly your matrix without the line you want to exclude, like this:
C = zeros(3, number_of_lines);
for n = 1:number_of_lines
    under_test = data;
    % this excludes the nth line of the matrix
    under_test(n,:) = [];
    B1 = under_test(:,3);
    B2 = under_test(:,4);
    B3 = under_test(:,5);
    x1 = [ones(size(B1)), B2, B3];
    y1 = B1;
    C(:,n) = regress(y1, x1);
end
I'm sure you can optimize this with some of the MATLAB functions that operate on vectors, avoiding the for loop, but for only 48 lines this should be fast enough.
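As a side note, regress requires the Statistics Toolbox. If that is unavailable, the backslash operator gives the same least-squares coefficients; this is a sketch assuming the same data matrix and number_of_lines as above:

```matlab
C = zeros(3, number_of_lines);
for n = 1:number_of_lines
    under_test = data;
    under_test(n,:) = [];                          % drop the nth observation
    x1 = [ones(size(under_test,1),1), under_test(:,4:5)];
    C(:,n) = x1 \ under_test(:,3);                 % least-squares fit via mldivide
end
```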

How to extract new matrix from existing one

I have a large number of entries arranged in three columns. Sample of the data is:
A=[1 3 2 3 5 4 1 5 ;
22 25 27 20 22 21 23 27;
17 15 15 17 12 19 11 18]'
I want the first column (hours) to control the entire matrix, creating a new matrix as follows:
Anew=[1 2 3 4 5 ; 22.5 27 22.5 21 24.5; 14 15 16 19 15]'
where the 2nd column of Anew is the average value for each corresponding hour. For example, from matrix A:
at hour 1, there are 2 values in the 2nd column corresponding to hour 1,
namely 22 and 23, so the average is 22.5.
Likewise for the 3rd column: at hour 1 the values are 17 and 11, and the
average is 14. This continues through hour 5. I am using MATLAB.
You can use ACCUMARRAY for this:
Anew = [unique(A(:,1)), ...
        cell2mat(accumarray(A(:,1), (1:size(A,1))', [], @(x){mean(A(x,2:3),1)}))]
This uses the first column A(:,1) as grouping subscripts for the row indices (x), which pick out the values in columns 2 and 3 for averaging (mean(A(x,2:3),1)). The curly brackets and the call to cell2mat allow you to work on both columns at once. Otherwise, you could do each column individually, like this:
Anew = [unique(A(:,1)), ...
        accumarray(A(:,1), A(:,2), [], @mean), ...
        accumarray(A(:,1), A(:,3), [], @mean)]
which may actually be a bit more readable.
EDIT
The above assumes that there's no missing entry for any of the hours. It will result in an error otherwise. Thus, a more robust way to calculate Anew is to allow for missing values. For easy identification of the missing values, we use the fillval input argument to accumarray and set it to NaN:
Anew = [(1:max(A(:,1)))', ...
        accumarray(A(:,1), A(:,2), [], @mean, NaN), ...
        accumarray(A(:,1), A(:,3), [], @mean, NaN)]
You can use consolidator (from the File Exchange) to do the work for you:
[Afinal(:,1), Afinal(:,2:3)] = consolidator(A(:,1), A(:,2:3), @mean);
Afinal
Afinal =
1 22.5 14
2 27 15
3 22.5 16
4 21 19
5 24.5 15