The final goal I am trying to achieve is the generation of a ten-minute time series: to do this I have to perform an FFT operation, and that is the point I have been stumbling on.
In general, the target time series is defined as the sum of two terms: a steady component U(t) and a fluctuating component u'(t). That is
u(t) = U(t) + u'(t);
So generally, my code follows this procedure:
1) Given data
time = 600 [s];
Nfft = 4096;
L = 340.2 [m];
U = 10 [m/s];
df = 1/600 = 0.00167 Hz;
fn = Nfft/(2*time) = 3.4133 Hz;
This means that my frequency array should be laid out as follows:
f = (-fn+df):df:fn;
But, instead of using the whole f array, I am only making use of the positive half:
fpos = df:df:fn, i.e. from 0.00167 Hz to 3.4133 Hz in steps of df;
2) Spectrum Definition
I define a certain spectrum shape, applying the following relationship
Su = (6*L*U)./((1 + 6.*fpos.*(L/U)).^(5/3));
3) Random phase generation
I then have to generate a set of complex samples with a given distribution: in my case, the random phase follows a standard Gaussian distribution (mu = 0, sigma = 1).
In MATLAB I call
nn = complex(normrnd(0,1,1,Nfft/2),normrnd(0,1,1,Nfft/2));
4) Apply random phase
To apply the random phase, I just do this
Hu = Su.*nn;
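For reference, here are steps 1) to 4) collected into a single runnable sketch (my own consolidation of the code above, using the given data):
time = 600; % record length [s]
Nfft = 4096; % number of FFT points
L = 340.2; % length scale [m]
U = 10; % mean speed [m/s]
df = 1/time; % frequency resolution [Hz]
fn = Nfft/(2*time); % Nyquist frequency [Hz]
fpos = df:df:fn; % positive frequencies, Nfft/2 samples
Su = (6*L*U)./((1 + 6.*fpos.*(L/U)).^(5/3)); % target spectrum
nn = complex(normrnd(0,1,1,Nfft/2), normrnd(0,1,1,Nfft/2)); % Nfft/2 complex Gaussian samples
Hu = Su.*nn; % spectrum with random phase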
This is where my troubles start!
So far, I have only generated Nfft/2 = 2048 complex samples, covering the fpos content. The content for the negative half of f is therefore still missing. To overcome this, I was thinking of merging the real and imaginary parts of Hu, in order to get a signal Huu with Nfft = 4096 samples and all real values.
But with this merging process, the 0-th frequency would not be represented, since the imaginary part of Hu is only defined over fpos.
So, how can I account for the 0-th order while keeping a procedure like the one I have proposed so far?
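For what it's worth, this is the kind of "merging" I had in mind, written out with the standard conjugate-symmetric layout; the way I force the 0-th (and the Nyquist) bin is exactly the part I am unsure about:
H = zeros(1, Nfft); % full Nfft-point spectrum
H(1) = 0; % 0-th (DC) bin: no sample available, forced to zero
H(2:Nfft/2) = Hu(1:Nfft/2-1); % positive frequencies df ... fn-df
H(Nfft/2+1) = real(Hu(Nfft/2)); % Nyquist bin must be real
H(Nfft/2+2:Nfft) = conj(fliplr(Hu(1:Nfft/2-1))); % negative frequencies: conjugate mirror
u_fluct = real(ifft(H)); % real-valued fluctuating component u'(t), Nfft samples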
Continuing from this question: Is it possible to use lqmm with a mira object?
I have tried to get the random effects for the mixed models (lmm and lqmm), and it has been hard.
library(lqmm)
library(mice)
library(lme4)
library(mitml)
summary(airquality)
imputed<-mice(airquality,m=5)
summary(imputed)
fit1<-lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, data=airquality,na.action=na.omit)
fit1
summary(fit1)
fit2<-with(imputed, lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, na.action=na.omit))
#did not work because it does not recognize a data frame
fit2 <- with(imputed,
lqmm(Ozone ~ Solar.R + Wind + Temp + Day,
data = data.frame(mget(ls())),
random = ~1, tau = 0.5, group = Month, na.action = na.omit))
tidy.lqmm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
  broom:::as_tidy_tibble(data.frame(
    estimate = coef(x),
    std.error = sqrt(diag(summary(x, covariance = TRUE, R = 50)$Cov[
      names(coef(x)), names(coef(x))]))))
}
glance.lqmm <- function(x, ...) {
  broom:::as_glance_tibble(
    logLik = as.numeric(stats::logLik(x)),
    df.residual = summary(x)$rdf,
    nobs = stats::nobs(x),
    na_types = "rii")
}
pool(fit2)
summary(pool(fit2))
So far so good, but I want to build a table that resembles the sjPlot::tab_model output for lmer objects. For this, I need to extract the random effects from the pooled estimates. This is where I am lost: I have not been able to do it with the linear mixed model, much less with the linear quantile mixed model.
Extracting the random effects from an LMM and an LQMM:
##LMM
fit3 <- with(imputed,
lmer(Ozone ~ Solar.R + Wind + Temp + Day+ (1|Month)))
library(broom.mixed)
summary(pool(fit3))
library(sjPlot)
tab_model(fit3$analyses) # this gives the results for each of the 5 lmer fits, but I need the pooled ones
pool(fit3)$glanced # this also gives the random effects for each of the 5 lmer fits
stargazer(fit3$analyses, type = "text") # this gives the AIC, LL, and BIC
#something like
tab_model(pool(fit3$analyses))
"Error ....
Could not access model information."
#or
tab_model(pool(fit3)) #A data frame is not a valid object for this function.
##LQMM
tab_model(fit2$analyses)#gives me less information
pool(fit2)$glanced #gives the 5 models logLik, df.residual and nobs individually
UPDATE: I could find a way to extract the random effects of the LMM thanks to Can I pool imputed random effect model estimates using the mi package?. However, it does not work for the LQMM:
testEstimates(as.mitml.result(fit3), extra.pars = T)$extra.pars
testEstimates(as.mitml.result(fit2), extra.pars = T)$extra.pars
Error in UseMethod("vcov") : no applicable method for 'vcov'
applied to an object of class "lqmm"
The pooled random-effects information I need for both lmer and lqmm (as reported by sjPlot::tab_model and stargazer):
sigma^2: pooled residual variance.
tau_00Month: pooled variance explained by Month (between-month differences).
ICC: pooled tau_00Month / (sigma^2 + tau_00Month).
N_Month: number of months used in the regression.
Marginal R2 / Conditional R2: the marginal R-squared considers only the variance of the fixed effects, while the conditional R-squared takes both the fixed and random effects into account.
Residual scale parameter: I would also appreciate it if somebody clarified what this is calculating.
Log-likelihood.
Akaike Inf. Crit.
Background: I am working on a problem similar to the nonlinear logistic regression described in link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with an R package, I got similar results for the coefficients, but an (approximately) opposite log-likelihood.
Hypothesis: The LogLikelihood given by fitnlm in MATLAB is in fact the negative log-likelihood. (Note that this would consequently affect the BIC and AIC computed by MATLAB.)
Reasoning: in [1], the same problem is solved through two different approaches. ML approach: define the negative log-likelihood and minimize it with fminsearch. GLS approach: use fitnlm.
The negative log-likelihood from the ML approach is: 380
The negative log-likelihood from the GLS approach is: -406
I imagine the second one should at least be multiplied by (-1)?
Questions: Did I miss something? Is the (-1) factor enough, or is that simple correction insufficient?
Self-contained code:
%copy-pasting code from [1]
myf = @(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = @(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = @(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = @(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also note that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
This is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
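To spell out where that formula comes from (my own sketch of the algebra, assuming the error-variance estimate \hat\sigma^2 = \mathrm{RSS}/\mathrm{DFE} is plugged into the Gaussian log-likelihood; this is not a quote of the fitnlm source):
\ell = -\tfrac{1}{2}\bigl(n\log(2\pi) + n\log\hat\sigma^2 + \tfrac{\mathrm{RSS}}{\hat\sigma^2}\bigr) = -\tfrac{1}{2}\bigl(n\log(2\pi) + n\log\hat\sigma^2 + \mathrm{DFE}\bigr),
which has exactly the shape of the quoted -(model.DFE + model.NumObservations*log(2*pi) + (...))/2 expression; the minus sign is part of the log-likelihood itself.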
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.
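A quick numerical illustration of that last point (my own check, not from the original answer; 0.0462 is the residual standard deviation quoted above):
% A binomial pmf value is a probability (at most 1); a normal pdf with a small
% standard deviation is a density and can be far larger than 1.
max(binopdf(0:50, 50, 0.5)) % about 0.11, a discrete probability
normpdf(0, 0, 0.0462) % about 8.6, a density value, not a probability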
I have a matrix of time-series data for 8 variables with about 2500 points (~10 years of Mon-Fri) and would like to calculate the mean, variance, skewness and kurtosis on a 'moving-window' basis.
Let's say frames = [100 252 504 756]. I would like to calculate the four statistics above over each of the (time-)frames, on a daily basis, so the result for day 300 in the 100-day-frame case would be [mean variance skewness kurtosis] computed over days 201-300 (100 days in total), and so on.
I know this means I would get an array output and that the first frame-length of days would be NaNs, but I can't figure out the required indexing to get this done...
This is an interesting question because I think the optimal solution is different for the mean than it is for the other sample statistics.
I've provided a simulation example below that you can work through.
First, choose some arbitrary parameters and simulate some data:
%#Set some arbitrary parameters
T = 100; N = 5;
WindowLength = 10;
%#Simulate some data
X = randn(T, N);
For the mean, use filter to obtain a moving average:
MeanMA = filter(ones(1, WindowLength) / WindowLength, 1, X);
MeanMA(1:WindowLength-1, :) = nan;
I had originally thought to solve this problem using conv as follows:
MeanMA = nan(T, N);
for n = 1:N
MeanMA(WindowLength:T, n) = conv(X(:, n), ones(WindowLength, 1), 'valid');
end
MeanMA = (1/WindowLength) * MeanMA;
But as @PhilGoddard pointed out in the comments, the filter approach avoids the need for the loop.
Also note that I've chosen to make the dates in the output matrix correspond to the dates in X so in later work you can use the same subscripts for both. Thus, the first WindowLength-1 observations in MeanMA will be nan.
For the variance, I can't see how to use either filter or conv or even a running sum to make things more efficient, so instead I perform the calculation manually at each iteration:
VarianceMA = nan(T, N);
for t = WindowLength:T
VarianceMA(t, :) = var(X(t-WindowLength+1:t, :));
end
We could speed things up slightly by exploiting the fact that we have already calculated the mean moving average. Simply replace the within loop line in the above with:
VarianceMA(t, :) = (1/(WindowLength-1)) * sum((bsxfun(@minus, X(t-WindowLength+1:t, :), MeanMA(t, :))).^2);
However, I doubt this will make much difference.
If anyone else can see a clever way to use filter or conv to get the moving window variance I'd be very interested to see it.
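For what it's worth, one filter-based candidate is the running-sums identity var = (sum(x.^2) - sum(x).^2/W) / (W-1) (my own sketch, not part of the original answer; it can be numerically less stable when the mean is large relative to the spread):
S1 = filter(ones(1, WindowLength), 1, X); %# rolling sum of X over the window
S2 = filter(ones(1, WindowLength), 1, X.^2); %# rolling sum of X.^2 over the window
VarFilterMA = (S2 - S1.^2 / WindowLength) / (WindowLength - 1); %# sample variance from the two sums
VarFilterMA(1:WindowLength-1, :) = nan; %# first WindowLength-1 windows are incomplete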
I leave the case of skewness and kurtosis to the OP, since they are essentially just the same as the variance example, but with the appropriate function.
A final point: if you were converting the above into a general function, you could pass in an anonymous function as one of the arguments, then you would have a moving average routine that works for arbitrary choice of transformations.
Final, final point: For a sequence of window lengths, simply loop over the entire code block for each window length.
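To make those last two points concrete, here is a minimal untested sketch (the function name and layout are mine): pass the statistic as a function handle, and call the routine once per element of your frames vector.
function Out = movingstat(X, WindowLength, statfun)
%#MOVINGSTAT Trailing-window statistic, e.g. VarianceMA = movingstat(X, 100, @var)
%# or SkewnessMA = movingstat(X, 252, @skewness)
[T, N] = size(X);
Out = nan(T, N); %# first WindowLength-1 rows stay NaN
for t = WindowLength:T
Out(t, :) = statfun(X(t-WindowLength+1:t, :)); %# statistic over the trailing window
end
end
For the mean itself, the filter version above is still preferable, since it avoids the loop entirely.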
I have managed to produce a solution which only uses basic MATLAB functions and can also be expanded to include other statistics (for finance, e.g. a moving Sharpe ratio or a moving Sortino ratio). The code below shows this and hopefully contains sufficient commentary.
I am using a time series of hedge fund data, with ca. 10 years' worth of daily returns (which were checked to be stationary; not shown in the code). Unfortunately I haven't got the corresponding dates in the example, so the x-axis in the plots would be 'no. of days'.
% start by importing the data you need - here it is a selection out of an
% excel spreadsheet
returnsHF = xlsread('HFRXIndices_Final.xlsx','EquityHedgeMarketNeutral','D1:D2742');
% two years to be used for the moving average. (250 business days in one year)
window = 500;
% create zero-matrices to fill with the MA values at each point in time.
mean_avg = zeros(length(returnsHF)-window+1,1);
st_dev = zeros(length(returnsHF)-window+1,1);
skew = zeros(length(returnsHF)-window+1,1);
kurt = zeros(length(returnsHF)-window+1,1);
% Now work through the time-series with each of the functions (one can add
% any other functions required), assigning the values to the zero-matrices
for count = window:length(returnsHF)
% This is the most tricky part of the script, the indexing in this section
% The TwoYearReturn is what is shifted along one period at a time with the
% for-loop.
TwoYearReturn = returnsHF(count-window+1:count);
mean_avg(count-window+1) = mean(TwoYearReturn);
st_dev(count-window+1) = std(TwoYearReturn);
skew(count-window+1) = skewness(TwoYearReturn);
kurt(count-window+1) = kurtosis(TwoYearReturn);
end
% Plot the MAs
subplot(4,1,1), plot(mean_avg)
title('2yr mean')
subplot(4,1,2), plot(st_dev)
title('2yr stdv')
subplot(4,1,3), plot(skew)
title('2yr skewness')
subplot(4,1,4), plot(kurt)
title('2yr kurtosis')
I have two signals, let's call them 'a' and 'b'. They are nearly identical (recorded from the same input and containing the same information); however, because they were recorded separately, 'b' is time-shifted by an unknown amount. Obviously, there is random noise in each.
Currently, I am using cross-correlation to compute the time shift; however, I am still getting incorrect results.
Here is the code I am using to calculate the time shift:
function [ diff ] = FindDiff( signal1, signal2 )
%FINDDIFF Finds the difference between two signals of equal frequency
%after an appropritate time shift is applied
% Calculates the time shift between two signals of equal frequency
% using cross correlation, shifts the second signal and subtracts the
% shifted signal from the first signal. This difference is returned.
len = numel(signal1);
if (len ~= numel(signal2))
error('Vectors must be equal size');
end
t = 1:len;
tx = (-len+1):len;
x = xcorr(signal1,signal2);
[mx,ix] = max(x);
lag = abs(tx(ix));
shifted_signal2 = timeshift(signal2,lag);
diff = signal1 - shifted_signal2;
end
function [ shifted ] = timeshift( input_signal, shift_amount )
input_size = numel(input_signal);
shifted = zeros(input_size, 1);
for i = 1:input_size
if i <= shift_amount
shifted(i) = 0;
else
shifted(i) = input_signal(i-shift_amount);
end
end
end
plot(FindDiff(a,b));
However, the result from the function is a periodic wave rather than random noise, so the lag must still be off. I would post an image of the plot, but imgur is currently not cooperating.
Is there a more accurate way to calculate lag other than cross correlation, or is there a way to improve the results from cross correlation?
Cross-correlation is usually the simplest way to determine the time lag between two signals. The position of the peak value indicates the time offset at which the two signals are most similar.
%// Normalize signals to zero mean and unit variance
s1 = (signal1 - mean(signal1)) / std(signal1);
s2 = (signal2 - mean(signal2)) / std(signal2);
%// Compute time lag between signals
c = xcorr(s1, s2); %// Cross correlation
[~, imax] = max(c); %// Find the position of the peak
lag = imax - length(s2) %// Lag of s1 relative to s2 (positive: s1 is delayed), in samples
Note that the two signals have to be normalized first to the same energy level, so that the results are not biased.
By the way, don't use diff as a name for a variable. There's already a built-in function in MATLAB with the same name.
There are now two functions in MATLAB, finddelay and alignsignals, that can do what you want, I believe.
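For reference, a minimal usage sketch (my assumption, using the question's a and b; both functions ship with the Signal Processing Toolbox):
d = finddelay(a, b) % estimated delay of b relative to a, in samples
[a_aligned, b_aligned] = alignsignals(a, b); % shifts/pads the signals so that they line up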
Cross-correlation finds a dot product between the vectors (v1, v2). If it works badly with your signal, I'd try to minimize a sum of squares of differences instead (i.e. of v1 - v2).
signal = sin(1:100);
signal1 = [zeros(1, 10) signal];
signal2 = [signal zeros(1, 10)];
for i = 1:length(signal1)
signal1shifted = [signal1 zeros(1, i)];
signal2shifted = [zeros(1, i) signal2];
d2(i) = sum((signal1shifted - signal2shifted).^2);
end
[fval lag2] = min(d2);
lag2
It is computationally worse than cross-correlation, which can be sped up by using the FFT. As far as I know, you can't do that with the Euclidean distance.
UPD: deleted a wrong idea about cross-correlation with periodic signals.
You can try matched filtering in the frequency domain:
function corr_out = pc_corr_processor (target_signal, ref_signal)
N = length(target_signal);
matched_filter = flipud(ref_signal')'; % time-reversed reference (assumes a row vector)
matched_filter_Res = fft(matched_filter,N); % N-point FFT of the matched filter
corr_fft = matched_filter_Res.*fft(target_signal); % multiplication in frequency = correlation in time
corr_out = abs(ifft(corr_fft)); % back to the time domain
The index of the maximum of corr_out above (the peak of the matched-filter output) should give you the lag amount.
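For completeness, a hypothetical usage sketch (my own, assuming a is the reference and b the delayed recording, both row vectors of equal length; because the FFT makes the correlation circular, the peak index maps to a circular lag):
c = pc_corr_processor(b, a); % correlate recording b against reference a
[~, peak_idx] = max(c); % matched-filter peak
lag = mod(peak_idx, numel(a)) % circular lag of b relative to a, in samples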
I am trying to implement a Naive Bayes classifier using a dataset published by the UCI machine learning team. I am new to machine learning and trying to understand techniques to use for my work-related problems, so I thought it's better to get the theory understood first.
I am using the Pima dataset (Link to Data - UCI-ML), and my goal is to build a Naive Bayes univariate Gaussian classifier for a K-class problem (the data is only there for K = 2). I have split the data and calculated the mean, standard deviation, and prior for each class, but after this I am stuck because I am not sure what I should be doing next, or how. I have a feeling that I should be calculating the posterior probability.
Here is my code. I am using percent as a vector because I want to see the behaviour as I increase the training-data size within the 80:20 split. Basically, if you pass [10 20 30 40], it will take those percentages of the 80% training portion, e.g. use 10% of the 80% as training data.
function[classMean] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
idx = randperm(size(dm.data,1))
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
train_labels = shuffledMatrix_label(1:percent_data_80,:)
test_labels = shuffledMatrix_label(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:)
new_trin_label = shuffledMatrix_label(1:percentOfRows)
%get unique labels in training
numClasses = size(unique(new_trin_label),1)
classMean = zeros(numClasses,size(new_train,2));
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_trin_label == kclass,:))
std(new_train(new_trin_label == kclass,:))
priorClassforK = sum(new_trin_label == kclass)/length(new_trin_label)
priorClassforK_1 = 1 - priorClassforK
end
end
end
end
First, compute the probability of every class label based on frequency counts. Then, for a given sample and a given class in your data set, compute the conditional probability of every feature. After that, multiply the conditional probabilities of all features in the sample by each other and by the probability of the considered class label. Finally, compare the values across all class labels and choose the label with the maximum probability (Bayes classification rule).
For computing the conditional probabilities, you can simply use the normal distribution density function.
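As a hedged illustration of that rule in MATLAB (my own sketch; classStd, classPrior and the test row x are assumed names for the per-class standard deviations, the per-class priors and one test sample, computed alongside classMean in the question's loop):
logPost = zeros(numClasses, 1);
for kclass = 1:numClasses
likelihoods = normpdf(x, classMean(kclass,:), classStd(kclass,:)); % per-feature Gaussian density
logPost(kclass) = log(classPrior(kclass)) + sum(log(likelihoods)); % log prior + sum of log-likelihoods
end
[~, predictedClass] = max(logPost); % Bayes rule: pick the class with the largest posterior
Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.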