Predicting next month's values using linear regression.
I am using six months of historical values to predict future values.
I use the vaccinated count as the dependent variable and the month as the independent variable, converted to an integer starting at 1.
Example:
Historical Data:
Month   Dependent variable   Independent variable
Jun     15                   1
Jul     14                   2
Aug     18                   3
Sep     19                   4
Oct     20                   5
Nov     22                   6
Is that correct?
Dependent Variable = Vaccinated Count
Independent Variable = Month converted to a number starting from 1
I would appreciate some feedback on whether my data is set up correctly.
See picture below.
Python simple linear regression:
Hardcover
Date
2000-04-01 139
2000-04-02 128
2000-04-03 172
2000-04-04 139
2000-04-05 191
import numpy as np
df['Time'] = np.arange(len(df.index))  # time dummy: 0, 1, 2, ...
Hardcover Time
Date
2000-04-01 139 0
2000-04-02 128 1
2000-04-03 172 2
2000-04-04 139 3
2000-04-05 191 4
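import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots()
# grey line through the raw series
ax.plot('Time', 'Hardcover', data=df, color='0.75')
# scatter points plus the fitted regression line
ax = sns.regplot(x='Time', y='Hardcover', data=df, ci=None, scatter_kws=dict(color='0.25'))
ax.set_title('Time Plot of Hardcover Sales');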
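For the six-month vaccination data in the question, the same idea (an integer time index as the only feature) can be sketched with NumPy alone. This is a minimal sketch, assuming an ordinary least-squares line is an acceptable model; the variable names are illustrative and not from the original post.

```python
import numpy as np

# Month index (independent variable) and vaccinated count (dependent variable),
# copied from the table in the question.
months = np.array([1, 2, 3, 4, 5, 6], dtype=float)
vaccinated = np.array([15, 14, 18, 19, 20, 22], dtype=float)

# Fit count = slope * month + intercept by least squares.
slope, intercept = np.polyfit(months, vaccinated, deg=1)

# Extrapolate one step ahead: the month after Nov has index 7.
next_month = 7
prediction = slope * next_month + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={prediction:.1f}")
```

For this data the fitted line predicts about 23.4 for month 7. With only six points, that figure is a rough extrapolation rather than a reliable forecast.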
I've made a calendar sheet and would like to fill it using an Arrayformula or some kind of Lookup.
The problem is that the code in each cell is different. Do I need it all to be the same code, or is it possible to write an Arrayformula that applies a different formula for each line?
I spent ages getting the calendar code working, but would now like to simplify it and I'm not sure what my next step should be:
https://docs.google.com/spreadsheets/d/1u_J7bmOFyDlYXhcL5dW3CHFJ1esySAKK_yPc6nFTdLA/edit?usp=sharing
Any advice would be much appreciated.
I've added a new sheet in your file called 'Aresvik'.
The green cells have new formulas.
Cell B3 can be =date(B1,1,1)
Then each successive month can be =eomonth(B3,0)+1, =eomonth(J3,0)+1 etc.
The date formula in cell B5 is:
=arrayformula(iferror(vlookup(sequence(7,7,1),{array_constrain(sequence(40,1),day(eomonth(B3,0))+weekday(B3,3),1),query({flatten(split(rept(",",day(eomonth(B3,0))-1),",",0,0));sequence(day(eomonth(B3,0)),1,1)},"offset "&day(eomonth(B3,0))-weekday(B3,3)&" ",0)},2,false),))
It can be copied to each other cell below Mo, so B5 will change to J5, R5, Z5 etc.
Notes
The concept revolves around using the SEQUENCE function to generate a grid of numbers, 6 rows, 7 columns:
sequence(6,7)
which looks like this:
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31 32 33 34 35
36 37 38 39 40 41 42
Then using these numbers in a VLOOKUP to get a corresponding date for the calendar. If the first of the month falls on a Thursday (April 2021), the vlookup range needs 3 gaps at the top of the list of dates. player0 has a more elegant solution than my original query using offset, so I've incorporated it below. Cell Z3 is the date 1/4/2021:
=arrayformula(
iferror(
vlookup(sequence(6,7),
{sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0),
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1,Z3)}},
2,false)
,))
The first column in the vlookup range is:
sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0)
which is an array of numbers from 0, corresponding with the number of days in the month plus the number of gaps before the 1st day.
The second column in the vlookup range is:
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1,Z3)}
It is two arrays stacked in a single column, in this format: {x;y}, where y sits below x because of the ;.
These are the gaps: iferror(sequence(weekday(Z3,2),1)/0,), followed by the date numbers: sequence(day(eomonth(Z3,0)),1,Z3)
(Example below is May 2021, where the 1st falls on a Saturday; 44317 is the serial number of 1 May 2021):
0
1
2
3
4
5
6 44317
7 44318
8 44319
9 44320
10 44321
11 44322
12 44323
13 44324
14 44325
15 44326
16 44327
17 44328
18 44329
19 44330
20 44331
21 44332
22 44333
23 44334
24 44335
25 44336
26 44337
27 44338
28 44339
29 44340
30 44341
31 44342
32 44343
33 44344
34 44345
35 44346
36 44347
The vlookup takes each number in the initial sequence (6x7 layout), and brings back the corresponding date from col2 in the range, based on a match in col1.
When the first day of the month is a Monday, iferror(sequence(weekday(BB1,2),1)/0,) generates a gap in col2 of the vlookup range. This is why col1 in the vlookup range has to start with 0.
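The lookup idea is independent of the spreadsheet. As a rough illustration only (Python rather than a Sheets formula, with a hypothetical helper name `month_grid`), the sketch below numbers the cells of a 6x7 grid from 1 to 42 and maps each cell number to a day of the month, leaving the cells before the 1st and after the last day empty. It assumes weeks start on Monday, as in the sheet.

```python
import calendar
from datetime import date

def month_grid(year, month):
    """Return a 6x7 list of day numbers (None for empty cells), Monday first."""
    first = date(year, month, 1)
    gaps = first.weekday()  # Monday=0, so this is the number of blanks before the 1st
    days_in_month = calendar.monthrange(year, month)[1]
    grid = []
    for row in range(6):
        week = []
        for col in range(7):
            cell = row * 7 + col + 1  # same numbering as sequence(6,7)
            day = cell - gaps         # cell number shifted by the leading gaps
            week.append(day if 1 <= day <= days_in_month else None)
        grid.append(week)
    return grid

april = month_grid(2021, 4)
for week in april:
    print(" ".join(f"{d:2d}" if d else " ." for d in week))
```

April 2021 starts on a Thursday, so the first row has three empty cells before the 1st, matching the three gaps described above.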
I've updated the sheet at https://docs.google.com/spreadsheets/d/1u_J7bmOFyDlYXhcL5dW3CHFJ1esySAKK_yPc6nFTdLA/edit#gid=68642071
Values on the calendar are dates, so the cells need the number format d.
If you want numbers, then use:
=arrayformula(
iferror(
vlookup(sequence(6,7),
{sequence(day(eomonth(Z3,0))+weekday(Z3,2),1,0),
{iferror(sequence(weekday(Z3,2),1)/0,);sequence(day(eomonth(Z3,0)),1)}},
2,false)
,))
Shorter solution:
=INDEX(IFNA(VLOOKUP(SEQUENCE(6, 7), {SEQUENCE(DAY(EOMONTH(B3, ))+WEEKDAY(B3, 2), 1, ),
{IFERROR(ROW(INDIRECT("1:"&WEEKDAY(B3, 2)))/0); SEQUENCE(DAY(EOMONTH(B3, )), 1, B3)}}, 2, )))
I have hourly data from ECMWF ERA5 for each day of a specific year, and I want to convert it from hourly to daily. Copernicus provides Python code for this here: https://confluence.ecmwf.int/display/CKB/ERA5%3A+How+to+calculate+daily+total+precipitation
What is the MATLAB code to do this? I have uploaded the NetCDF file to my Google Drive here:
https://drive.google.com/open?id=1qm5AGj5zRC3ifD1_V-ne2nDT1ch_Khik
The time steps of each day are:
0:00
1:00
2:00
3:00
4:00
5:00
6:00
7:00
8:00
9:00
10:00
11:00
12:00
13:00
14:00
15:00
16:00
17:00
18:00
19:00
20:00
21:00
22:00
23:00
Note that to cover total precipitation for 1st January 2017, for example, we need two days of data:
1st January 2017 time = 01 - 23 will give you total precipitation data to cover 00 - 23 UTC for 1st January 2017
2nd January 2017 time = 00 will give you total precipitation data to cover 23 - 24 UTC for 1st January 2017
Here is the output of ncdisp():
>> ncdisp(filename)
Source:
C:\Users\Behzad\Desktop\download.nc
Format:
64bit
Global Attributes:
Conventions = 'CF-1.6'
history = '2019-11-01 07:36:15 GMT by grib_to_netcdf-2.14.0: /opt/ecmwf/eccodes/bin/grib_to_netcdf -o /cache/data6/adaptor.mars.internal-1572593007.3569295-19224-27-449cad76-bcd6-4cfa-9767-8a3c1219c0bb.nc /cache/tmp/449cad76-bcd6-4cfa-9767-8a3c1219c0bb-adaptor.mars.internal-1572593007.35751-19224-4-tmp.grib'
Dimensions:
longitude = 49
latitude = 41
time = 8760
Variables:
longitude
Size: 49x1
Dimensions: longitude
Datatype: single
Attributes:
units = 'degrees_east'
long_name = 'longitude'
latitude
Size: 41x1
Dimensions: latitude
Datatype: single
Attributes:
units = 'degrees_north'
long_name = 'latitude'
time
Size: 8760x1
Dimensions: time
Datatype: int32
Attributes:
units = 'hours since 1900-01-01 00:00:00.0'
long_name = 'time'
calendar = 'gregorian'
tp
Size: 49x41x8760
Dimensions: longitude,latitude,time
Datatype: int16
Attributes:
scale_factor = 3.0792e-07
add_offset = 0.010089
_FillValue = -32767
missing_value = -32767
units = 'm'
long_name = 'Total precipitation'
tp is my variable, which has 3 dimensions (lon*lat*time) = 49*41*8760.
I want it as 49*41*365 for a non-leap year.
The result should be the daily values for the whole year.
While vectorized versions exist that reshape your array into 4 dimensions, a simple for loop will do the job.
tp_daily=zeros(size(tp,1),size(tp,2),365);
for ii=0:364
    day_block=tp(:,:,ii*24+1:(ii+1)*24); % grab the 24 hourly slices of one day
    tp_daily(:,:,ii+1)=sum(day_block,3); % sum along the third (time) dimension
end
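A vectorized reshape of the kind mentioned above might look like the following. It is written in Python with NumPy rather than MATLAB, and is a sketch under assumptions: `hourly_to_daily` and its `offset` parameter are hypothetical names, the array is laid out as (lon, lat, time) like `tp`, and the values have already been unpacked to physical units.

```python
import numpy as np

def hourly_to_daily(tp, hours_per_day=24, offset=0):
    """Sum an hourly (lon, lat, time) array into daily totals.

    offset skips that many leading time steps before the first daily window;
    trailing steps that do not fill a whole day are dropped.
    """
    n_lon, n_lat, _ = tp.shape
    usable = tp[:, :, offset:]
    n_days = usable.shape[2] // hours_per_day
    usable = usable[:, :, : n_days * hours_per_day]
    # Split the time axis into (day, hour) and sum over the hour axis.
    return usable.reshape(n_lon, n_lat, n_days, hours_per_day).sum(axis=3)

# Tiny synthetic example: 2 x 3 grid points, 48 hourly steps (two days).
demo = np.arange(2 * 3 * 48, dtype=float).reshape(2, 3, 48)
daily = hourly_to_daily(demo)
print(daily.shape)  # (2, 3, 2)
```

Setting `offset=1` starts each daily window at the 01:00 step, which matches the accumulation convention described in the question, at the cost of dropping the final incomplete day (364 days from 8760 hourly steps).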
I'm plotting a box plot with overlaid data from the following concatenated matrix:
data = [10 16 24 31 12 26 23 33;11 15 27 27 12 24 22 36;12 15 24 25 14 25 22 37;10 16 27 24 14 27 23 41;12 15 NaN NaN 15 NaN 22 NaN;13 18 NaN NaN 16 NaN 22 NaN]
The code for this plot is:
datas=sort(data);
datainbox=datas(ceil(end/4)+1:floor(end*3/4),:);
[n1,n2]=size(datainbox);
dataoutbox=datas([1:ceil(end/4) floor(end*3/4)+1:end],:);
n3=size(dataoutbox,1);
% calculate quartiles
dataq=quantile(data,[.25 .5 .75]);
% calculate range between box and outliers = between 1.5*IQR from quartiles
dataiqr=iqr(data);
datar=[dataq(1,:)-dataiqr*1.5;dataq(3,:)+dataiqr*1.5];
dataoutbox(dataoutbox<ones(n3,1)*datar(1,:)|dataoutbox>ones(n3,1)*datar(2,:))=nan;
figure()
hold on
bp = boxplot(data);
plot(ones(n1,1)*[1 2 3 4 5 6 7 8]+.4*(rand(n1,n2)-.5),datainbox,'k.','MarkerSize',12)
plot(ones(n3,1)*[1 2 3 4 5 6 7 8]+.4*(rand(n3,n2)-.5),dataoutbox,'.','color',[1 1 1]*.5,'MarkerSize',12)
set(bp,'linewidth',1);
As indicated above, I am sorting the data into 'datainbox' and 'dataoutbox' based on the IQR. The code works as expected (credit to JJM Driesson) except for the data columns containing NaNs, where as shown in the plot the data is not sorted correctly. How should I modify the above code to exclude NaNs from calculations and prevent this from influencing the plot?
Thank you for your time,
Laura
You should process each column separately. You can select the non-NaN values of column i as follows: col = data(~isnan(data(:, i)), i);
If you want all the boxplots in the same figure, you can try to use this answer.
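The same per-column idea can be sketched in Python with NumPy for illustration. The helper `split_column` is a hypothetical name, and NumPy's default percentile interpolation may differ slightly from MATLAB's quantile. Each column is stripped of NaNs first, and only then split into points inside and outside the interquartile box.

```python
import numpy as np

nan = np.nan
data = np.array([
    [10, 16, 24, 31, 12, 26, 23, 33],
    [11, 15, 27, 27, 12, 24, 22, 36],
    [12, 15, 24, 25, 14, 25, 22, 37],
    [10, 16, 27, 24, 14, 27, 23, 41],
    [12, 15, nan, nan, 15, nan, 22, nan],
    [13, 18, nan, nan, 16, nan, 22, nan],
])

def split_column(col):
    """Drop NaNs, then split the values into inside/outside the Q1..Q3 box."""
    col = col[~np.isnan(col)]
    q1, q3 = np.percentile(col, [25, 75])
    inside = col[(col >= q1) & (col <= q3)]
    outside = col[(col < q1) | (col > q3)]
    return inside, outside

# One (inside, outside) pair per column; lengths differ between columns.
columns = [split_column(data[:, j]) for j in range(data.shape[1])]
for j, (inside, outside) in enumerate(columns, start=1):
    print(j, inside, outside)
```

Because the columns end up with different lengths once the NaNs are removed, the result is kept as a list of arrays rather than a single matrix.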
I have an Excel file that contains 5 columns and 48 rows (monthly water demand, population and rainfall data for four years, 1997-2000):
Year Month Water_Demand Population Rainfall
1997 1 355 4500 25
1997 2 375 5000 20
1997 3 320 5200 21
.............% rest of the month data of year 1997.
1997 12 380 6000 24
1998 1 390 6500 23
1998 2 370 6700 20
............. % rest of the month data of year 1998
1998 12 400 6900 19
1999 1
1999 2
.............% rest of the month data of year 1999 and 2000
2000 12 390 7000 20
I want to do a multiple linear regression in MATLAB. The dependent variable is water demand and the independent variables are population and rainfall. I have written the following code, which uses all 48 rows:
A1=data(:,3);
A2=data(:,4);
A3=data(:,5);
x=[ones(size(A1)),A2,A3];
y=A1;
b=regress(y,x);
yfit=b(1)+b(2).*A2+b(3).*A3;
Now I want to repeat this. First, I want to exclude row 1 (i.e. the data for year 1997, month 1) and run the regression on the remaining 47 rows. Then I want to exclude row 2 and run the regression on row 1 and rows 3-48. Then I want to exclude row 3 and run the regression on rows 1-2 and rows 4-48, and so on. There are always 47 data rows, since I exclude one row in each run. Finally, I want to get a table of the regression coefficients and yfit for each run.
A simple way I can think of is to create a for loop and a temporary "under test" matrix that is the matrix you have without the line you want to exclude, like this:
number_of_lines = size(data,1);   % 48 rows in the question
C = zeros(3,number_of_lines);     % one column of coefficients per run
for n = 1:number_of_lines
    under_test = data;
    % this excludes the nth line of the matrix
    under_test(n,:) = [];
    B1=under_test(:,3);
    B2=under_test(:,4);
    B3=under_test(:,5);
    x1=[ones(size(B1)),B2,B3];
    y1=B1;
    C(:,n)=regress(y1,x1);
end
I'm sure you can optimize this by using some of the MATLAB functions that operate on vectors, without using the for loop. But for only 48 lines it should be fast enough.
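For comparison, the same leave-one-out loop can be sketched in Python with NumPy. This is an illustration under assumptions, not a translation of the asker's workbook: the 48 rows below are synthetic, made-up numbers (the real table is only partly shown in the question), and the variable names are hypothetical. The sketch also stores the fitted value for the held-out row, which is one possible reading of the "yfit of each run" requested in the question.

```python
import numpy as np

# Synthetic stand-in for the 48-row table (made-up numbers, not the asker's data).
rng = np.random.default_rng(0)
population = rng.uniform(4500, 7000, size=48)
rainfall = rng.uniform(15, 30, size=48)
demand = 100 + 0.04 * population + 2.0 * rainfall  # noise-free on purpose

# Design matrix with an intercept column, as in x=[ones(...),A2,A3].
X = np.column_stack([np.ones_like(population), population, rainfall])
y = demand

n = len(y)
coefs = np.empty((n, 3))     # row i: [intercept, b_population, b_rainfall] without row i
held_out_fit = np.empty(n)   # prediction for the row that was left out

for i in range(n):
    X_train = np.delete(X, i, axis=0)  # drop row i from the predictors
    y_train = np.delete(y, i)          # and from the response
    coefs[i] = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    held_out_fit[i] = X[i] @ coefs[i]

print(coefs[:3])
```

Because the synthetic response has no noise, every run recovers nearly the same coefficients; with real data the rows of `coefs` would vary, and that spread shows how sensitive the fit is to each observation.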