How to normalize data by assigning NaN or zero to missing rows in matlab? - matlab

I use matlab 2014b.
I have a wrong program step. I don't know what is wrong with my script.
I have data per 10 minutes. time and value RR.
09/10/2014 3:00 0
09/10/2014 3:10 0
09/10/2014 3:30 0.4
09/10/2014 3:50 0.4
09/10/2014 4:00 0.4
09/10/2014 4:10 0.4
09/10/2014 4:20 0.4
09/10/2014 4:30 0.4
10/10/2014 4:40 0.4
09/10/2014 4:50 0.4
09/10/2014 5:00 0.4
09/10/2014 5:10 0.4
09/10/2014 5:20 0.4
....
Data consists of 12176x2 cell
It can be seen that after the second row there is no a time information nor data for 3:20 a.m. I want to get data per 10 minutes with empty data filled with 0 / NAN.
I use matlab 2014b and there is no retime function in that version.
I say thank you to anyone who has tried to help and advise.
times = out(:,1);
dn = datenum(times);
min_time = min(dn);
min_time_dv = datevec(min_time);
min_time_dv(5) = floor(min_time_dv(5) / 10) * 10;
first_slot_dn = datenum(min_time_dv);
max_time = max(dn);
max_time_dv = datevec(max_time);
max_time_dv(5) = floor(max_time_dv(5) / 10) * 10;
last_slot_dn = datenum(max_time_dv);
ten_mins_as_days = 1 / (24 * 60/10);
slot_dns = first_slot_dn : ten_mins_as_days : last_slot_dn;
slot_ds = datestr(slot_dns);
times_minutes = [cellstr(slot_ds(1:end,:))];
slot_ds = datestr(slot_dns);
[~, slot_idx] = histc(dn, slot_dns);
mean_RR1 = accumarray(slot_idx, RR(:), [length(slot_dns),1],#nanmean);
output1 = [cellstr(slot_ds(1:end,:)), num2cell(mean_RR1)];
I wrote a script and it has normalized well per 10 minutes. But the missing time was not filled by NAN. There was an error in normalizing the NAN values. Is there a suggestion to fix it?
I expect the results to be like this:
09/10/2014 3:00 0
09/10/2014 3:10 0
09/10/2014 3:20 NAN
09/10/2014 3:30 0.4
10/10/2014 3:40 NAN

I suggest using MATLAB's ismember function, which is very efficient in my experience.
time_input = ['09/10/2014 3:00'; '09/10/2014 3:10'; '09/10/2014 3:30'];
time_gaps = datetime(time_input,'InputFormat','dd/MM/yyyy H:mm');
L = length(time_gaps);
values_gaps = repelem(0.1,L);
% create normalized time vector
time_norm = datetime((time_gaps(1):minutes(10):time_gaps(end))');
% locate gaps by comparing the time vectors
[~, loc] = ismember(time_gaps,time_norm);
% create normalized value vector + fill at locations
values_norm = NaN(length(time_norm),1);
values_norm(loc) = values_gaps;
I did not bother using your code, as it was confusing to me. I use R2018b with datetime function, but it should be available to you in R2014b.

Related

How do i convert monthly data into quarterly data in matlab as part of a table

How do I convert monthly data into quarterly data?
For example:
my 1 table looks like this:
TB3m
0.08
0.07
0.06
0.12
0.13
0.14
my second table is table of dates:
dates
1975/01/31
1975/02/28
1975/03/31
1975/04/30
1975/05/31
1975/06/30
I want to convert table 1 such that it takes the average of 3 months to give me quarterly data.
It should ideally look like:
TB3M_quarterly
0.07
0.13
so that it can match my other quarterly dates table which looks like:
dates_quarterly
1975/03/31
1975/06/30
Over all my data is from 1975 January to 2021 june which would give me around 186 quarterly data. Please suggest what I can use. It is thefirst time I am using matlab
In case you can't understand a command, search for it in MATLAB documentation. One quick way to do so in Windows is to click on the function you don't understand and then press F1. For example, if you can't understand what calmonths is doing, then click on calminths and press F1.
TBm = [0.08; 0.07; 0.06; 0.12; 0.13; 0.14]; %values
% months
t1 = datetime(1975, 01, 31); %1st month
% datetime is a datatype to represent time
t2 = datetime(1975, 06, 30); %change this as your last month
dates = t1:calmonths(1):t2;
dates = dates'; %row vector to column vector
TBm_reshaped = reshape(TBm, 3, []);
TB3m_quarterly = mean(TBm_reshaped);
dates_quarterly = dates(3:3:end);
TB3m_quarterly = TB3m_quarterly';
T = table(dates_quarterly, TB3m_quarterly)
I suggest you organize your data in a timetable, and then use the function retime()
% Replicating your data
dates = datetime(1975,1,31) : calmonths(1) : datetime(1975,6,30);
T = timetable([8 7 6 12 13 14]'./100, 'RowTimes', dates);
% Using retime to get quarterly values:
T_quarterly = retime(T, 'quarterly', 'mean')
Here you aggregate by taking the mean of the monthly data. For other aggregation methods, look at the documentation for retime()

stan number of effective sample size

I reproduced the results of a hierarchical model using the rethinking package with just rstan() and I am just curious why n_eff is not closer.
Here is the model with random intercepts for 2 groups (intercept_x2) using the rethinking package:
Code:
response = c(rnorm(500,0,1),rnorm(500,200,10))
predicotr1_continuous = rnorm(1000)
predictor2_categorical = factor(c(rep("A",500),rep("B",500) ))
data = data.frame(y = response, x1 = predicotr1_continuous, x2 = predictor2_categorical)
head(data)
library(rethinking)
m22 <- map2stan(
alist(
y ~ dnorm( mu , sigma ) ,
mu <- intercept + intercept_x2[x2] + beta*x1 ,
intercept ~ dnorm(0,10),
intercept_x2[x2] ~ dnorm(0, sigma_2),
beta ~ dnorm(0,10),
sigma ~ dnorm(0, 10),
sigma_2 ~ dnorm(0,10)
) ,
data=data , chains=1 , iter=5000 , warmup=500 )
precis(m22, depth = 2)
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
intercept 9.96 9.59 -5.14 25.84 1368 1
intercept_x2[1] -9.94 9.59 -25.55 5.43 1371 1
intercept_x2[2] 189.68 9.59 173.28 204.26 1368 1
beta 0.06 0.22 -0.27 0.42 3458 1
sigma 6.94 0.16 6.70 7.20 2927 1
sigma_2 43.16 5.01 35.33 51.19 2757 1
Now here is the same model in rstan():
# create a numeric vector to indicate the categorical groups
data$GROUP_ID = match( data$x2, levels( data$x2 ) )
library(rstan)
standat <- list(
N = nrow(data),
y = data$y,
x1 = data$x1,
GROUP_ID = data$GROUP_ID,
nGROUPS = 2
)
stanmodelcode = '
data {
int<lower=1> N;
int nGROUPS;
real y[N];
real x1[N];
int<lower=1, upper=nGROUPS> GROUP_ID[N];
}
transformed data{
}
parameters {
real intercept;
vector[nGROUPS] intercept_x2;
real beta;
real<lower=0> sigma;
real<lower=0> sigma_2;
}
transformed parameters { // none needed
}
model {
real mu;
// priors
intercept~ normal(0,10);
intercept_x2 ~ normal(0,sigma_2);
beta ~ normal(0,10);
sigma ~ normal(0,10);
sigma_2 ~ normal(0,10);
// likelihood
for(i in 1:N){
mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x1[i];
y[i] ~ normal(mu, sigma);
}
}
'
fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22
Inference for Stan model: b212ebc67c08c77926c59693aa719288.
1 chains, each with iter=5000; warmup=500; thin=1;
post-warmup draws per chain=4500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept 10.14 0.30 9.72 -8.42 3.56 10.21 16.71 29.19 1060 1
intercept_x2[1] -10.12 0.30 9.73 -29.09 -16.70 -10.25 -3.50 8.36 1059 1
intercept_x2[2] 189.50 0.30 9.72 170.40 182.98 189.42 196.09 208.05 1063 1
beta 0.05 0.00 0.21 -0.37 -0.10 0.05 0.20 0.47 3114 1
sigma 6.94 0.00 0.15 6.65 6.84 6.94 7.05 7.25 3432 1
sigma_2 43.14 0.09 4.88 34.38 39.71 42.84 46.36 53.26 3248 1
lp__ -2459.75 0.05 1.71 -2463.99 -2460.68 -2459.45 -2458.49 -2457.40 1334 1
Samples were drawn using NUTS(diag_e) at Thu Aug 31 15:53:09 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
My Questions:
the n_eff is larger using rethinking(). There is simulation differences but do you think something else is going on here?
Besides the n_eff being different the percentiles of the posterior distributions are different. I was thinking rethinking() and rstan() should return similar results with 5000 iterations since rethinking is just calling rstan. Are differences like that normal or something different between the 2 implementations?
I created data$GROUP_ID to indicate the categorical groupings. Is this the correct way to incorporate categorical variables into a hierarchical model in rstan()? I have 2 groups and if I had 50 groups I use the same data$GROUP_ID vector but is that the standard way?
Thank you.

multiple training data for cascade-forward backpropagation network

I am training my neural network with data from 3 consecutive days and testing it with data from a 4th day. The values in this example are randomly chosen and have no relation with reality. I want the neural network to learn the current, depending on the temperature and the solar radiation.
%% initialize data for training
Temperature_Day1 = [25 26 27 26 25];
Temperature_Day2 = [25 24 24 23 24];
Temperature_Day3 = [21 20 22 21 20];
SolarRadiation_Day1 = [990 944 970 999 962];
SolarRadiation_Day2 = [993 947 973 996 967];
SolarRadiation_Day3 = [993 948 973 998 965];
Current_Day1 = [0.11 0.44 0.44 0.45 0.56];
Current_Day2 = [0.41 0.34 0.43 0.55 0.75];
Current_Day3 = [0.34 0.98 0.34 0.76 0.71];
Day1 = [Temperature_Day1; SolarRadiation_Day1]; % 2-by-5
Day2 = [Temperature_Day2; SolarRadiation_Day2]; % 2-by-5
Day3 = [Temperature_Day3; SolarRadiation_Day3]; % 2-by-5
%% training input and training target
Training_Input = [Day1; Day2; Day3]; % 6-by-5
Training_Target = [Current_Day1; Current_Day2; Current_Day3]; % 3-by-5
%% training the network
hiddenLayers= 2;
net = newcf(Training_Input, Training_Target, hiddenLayers);
y = sim(net, Training_Input);
net.trainParam.epochs = 100;
net = train(net, Training_Input, Training_Target);
%% initialize data for prediction
Temperature_Day4 = [45 23 22 11 24];
SolarRadiation_Day4 = [960 984 980 993 967];
Current_Day4 = [0.14 0.48 0.37 0.46 0.77];
Day4 = [Temperature_Day4; SolarRadiation_Day4]; % 2-by-5
Test_Input = [Day4; Day4; Day4]; % same dimension as Training_Input; subject to question
%% prediction
Predicted_Target = sim(net, Test_Input); % yields 3-by-5
My question is: How do I train it with the data of 3 days and then predict the target of the 4th day? Since training and testing inputs must have the same dimension, how do I test it for only one day? Here it is solved by just concatenating three identical data sets of the test input. However, this also yields 3 different data sets for the predicted target.
What is here the right way to do it?
BTW: I have seen this type of question many times, but the answers are never satisfying because they always suggest to change the dimensions of the test input without considering the nature of the problem (which is that only one data set is available for testing). So please don't mark this as a duplicate.
The features that you have for your network are Temperature and SolarRadiation, each taken at specific times during the day. The day on which these readings are taken are irrelevant (otherwise you wouldn't be able to predict the outputs for day 4 given data for days 1-3).
This means that we can simply pass each observation separately by concatenating the days horizontally (and similarly for the target data):
Training_Input = [Day1, Day2, Day3]; % 2-by-15
Training_Target = [Current_Day1, Current_Day2, Current_Day3]; % 1-by-15
The resulting network will give you one output (Current) per observation in the test set, so you don't need to duplicate:
Day4 = [Temperature_Day4; SolarRadiation_Day4]; % 2-by-5
Test_Input = [Day4]; % 2-by-5
PredictedTarget will now be 1-by-5 showing the predicted Current for each of the test observations.
You might consider adding a third feature as input to your net representing the time at which each observations was taken. Assuming that you have t timeslots each day at which observations are taken (thus, length(Temperature) == length(SolarRadiation) == t for all days) and observation s is taken at the same time every day, you can add a feature called TimeSlot:
TimeSlot_Day1 = 1:numel(Temperature_Day1);
TimeSlot_Day2 = 1:numel(Temperature_Day2);
TimeSlot_Day3 = 1:numel(Temperature_Day3)];
Day1 = [Temperature_Day1; SolarRadiation_Day1; TimeSlot_Day1]; % 3-by-5
Day2 = [Temperature_Day2; SolarRadiation_Day2; TimeSlot_Day2]; % 3-by-5
Day3 = [Temperature_Day3; SolarRadiation_Day3; TimeSlot_Day3]; % 3-by-5

Convert milliseconds into hours and plot

I'm trying to convert an array of milliseconds and its respective data. However I want to do so in hours and minutes.
Millis = [60000 120000 180000 240000....]
Power = [ 12 14 12 13 14 ...]
I've set it up so the data records every minute, hence the 60000 millis (= 1 minimte). I am trying to plot time on the x axis and power on the y. I would like to have the x axis displayed in hours and minutes with each respective power data corresponding to its respective time.
I've tried this
for i=2:length(Millis)
Conv2Min(i) = Millis(i) / 60000;
Time(i) = startTime + Conv2Min(i);
if (Time(i-1) > Time(i) + 60)
Time(i) + 100;
end
end
s = num2str(Time);
This in attempt to turn the milliseconds into hours starting at 08:00 and once 60 minutes have past going to 09:00, the problem is plotting this. I get a gap between 08:59 and 09:00. I also cannot maintain the 0=initial 0.
In this scenario it is preferable to work with datenum values and then use datetick to set the format of the tick labels of your plot to 'HH:MM'.
Let's suppose that you started taking measurements at t_1 = [HH_1, MM_1] and stopped taking measurements at t_2 = [HH_2, MM_2].
A cool trick to generate the array of datenum values is to use the following expression:
time_datenums = HH_1/24 + MM_1/1440 : 1/1440 : HH_2/24 + MM_2/1440;
Explanation:
We are creating a regularly-spaced vector time_datenums = A:B:C using the colon (:) operator, where A is the starting datenum value, B is the increment between datenum values and C is the ending datenum value.
Since your measurements have been taken every minute (60000 milliseconds), then the increment between datenum values should be of 1 minute too. As a day has 24 hours, that makes 1440 minutes a day, so use B = 1/1440 as the increment between vector elements, to get 1 minute increments.
For A and C we simply need to divide the hour digits by 24 and the minute digits by 1440 and sum them up like this:
A = HH_1/24 + MM_1/1440
C = HH_2/24 + MM_2/1440
So for example, if t_1 = [08, 00], then A = 08/24 + 00/1440. As simple as that.
Notice that this procedure doesn't use the datenum function at all, and still, it manages to generate a valid array of datenum values only taking into consideration the time of the datenum, without needing to bother about the date of the datenum. You can learn more about this here and here.
Going back to your original problem, let's have a look at the code:
time_millisec = 0:60000:9e6; % Time array in milliseconds.
power = 10*rand(size(time_millisec)); % Random power data.
% Elapsed time in milliseconds.
elapsed_millisec = time_millisec(end) - time_millisec(1);
% Integer part of elapsed hours.
elapsed_hours_int = fix(elapsed_millisec/(1000*60*60));
% Fractional part of elapsed hours.
elapsed_hours_frac = (elapsed_millisec/(1000*60*60)) - elapsed_hours_int;
t_1 = [08, 00]; % Start time 08:00
t_2 = [t_1(1) + elapsed_hours_int, t_1(2) + elapsed_hours_frac*60]; % Compute End time.
HH_1 = t_1(1); % Hour digits of t_1
MM_1 = t_1(2); % Minute digits of t_1
HH_2 = t_2(1); % Hour digits of t_2
MM_2 = t_2(2); % Minute digits of t_2
time_datenums = HH_1/24+MM_1/1440:1/1440:HH_2/24+MM_2/1440; % Array of datenums.
plot(time_datenums, power); % Plot data.
datetick('x', 'HH:MM'); % Set 'HH:MM' datetick format for the x axis.
This is the output:
I would use datenums:
Millis = [60000 120000 180000 240000 360000];
Power = [ 12 14 12 13 14 ];
d = [2017 05 01 08 00 00]; %starting point (y,m,d,h,m,s)
d = repmat(d,[length(Millis),1]);
d(:,6)=Millis/1000; %add them as seconds
D=datenum(d); %convert to datenums
plot(D,Power) %plot
datetick('x','HH:MM') %set the x-axis to datenums with HH:MM as format
an even shorter approach would be: (thanks to codeaviator for the idea)
Millis = [60000 120000 180000 240000 360000];
Power = [ 12 14 12 13 14 ];
D = 8/24+Millis/86400000; %24h / day, 86400000ms / day
plot(D,Power) %plot
datetick('x','HH:MM') %set the x-axis to datenums with HH:MM as format
I guess, there is an easier way using datetick and datenum, but I couldn't figure it out. This should solve your problem for now:
Millis=6e4:6e4:6e6;
power=randi([5 15],1,numel(Millis));
hours=floor(Millis/(6e4*60))+8; minutes=mod(Millis,(6e4*60))/6e4; % Calculate the hours and minutes of your Millisecond vector.
plot(Millis,power)
xlabels=arrayfun(#(x,y) sprintf('%d:%d',x,y),hours,minutes,'UniformOutput',0); % Create time-strings of the format HH:MM for your XTickLabels
tickDist=10; % define how often you want your XTicks (e.g. 1 if you want the ticks every minute)
set(gca,'XTick',Millis(tickDist:tickDist:end),'XTickLabel',xlabels(tickDist:tickDist:end))

Collapse/mean data in Matlab with respect to a different set of data

I have two sets of data, but the sets have a different sizes.
Each set contains the measurements itself (MeasA and MeasB, both double) and the time point (TimeA and TimeB, datenum or julian date) when the measuring happened.
Now I want to match the smaller data set to the bigger one, and to do this, I want to mean the data points of the bigger set around the data resp. time points of the smaller set, to finally do some correlation analysis.
Edit:
Small Example how the data would look like:
MeasA = [2.7694 -1.3499 3.0349 0.7254 -0.0631];
TimeA = [0.2 0.4 0.7 0.8 1.3];
MeasB = [0.7147 -0.2050 -0.1241 1.4897 1.4090 1.4172 0.6715 -1.2075 0.7172 1.6302];
TimeB = [0.1 0.2 0.3 0.6 0.65 0.68 0.73 0.85 1.2 1.4];
And now I want to collapse MeasB and TimeB so that I get the mean of the measurement close to the timepoints in TimeA, so for example TimeB should look like this:
TimeB = [mean([0.1 0.2]) mean([0.3 0.6]) mean([0.65 0.68 0.73]) mean([0.85]) mean([1.2 1.4])]
TimeB = [0.15 0.4 0.69 0.85 1.3]
And then collapse MeasB like this too:
MeasB = [mean([0.7147 -0.2050]) mean([-0.1241 1.4897]) mean([1.4090 1.4172 0.6715]) mean([-1.2075]) mean([0.7172 1.6302])];
MeasB = [0.2549 0.6828 1.1659 -1.2075 1.1737]
The function interp1 is your friend.
You can get a new set of measurement for your set B, at the same time than set A by using:
newMeasB = interp1( TimeB , MeasB , TimeA ) ;
The first 2 parameters are your original Time and Measurements of the set you want to re interpolate, the last parameter is the new x axis (time in your example) on which you want the interpolated values to be calculated.
This way you do not end up with different sets of time between your 2 sets of measurements, you can compare them point by point.
Check the documentation of interp1 for more explanations and for options about the interpolation or any potential extrapolation.
edit:
Matlab doc used to have a great illustration of the function but I can't find it online so here goes:
So with the linear method, if the value is interpolated exactly between 2 points, the function will return the exact mean. If the interpolation is done closer to one point than another, the value returned will be proportionally closer to the value of the closest point.
The NaN can appear on the sides (beginning or end of returned vector) if the TimeA was not completely overlapped by timeB. The function cannot "interpolate" because there is no anchor point. However, the different options of interp1 allow you to "extrapolate" outside of the input range, or to assign another default value instead of the NaNs.