Level-log regression interpretation?

If I want to estimate a level-log regression by OLS, I do that because I believe that my x value (the independent variable) has diminishing marginal returns on my y value (the dependent variable).
For example hours = beta0 + beta1*log(wage)
where
hours = hour worked per week
wage = hourly wage
Then OLS fits a straight line.
To interpret my beta1 coefficient I divide it by 100 and say that a 1% increase in wage changes hours worked per week by about beta1/100.
But how can I see from my estimated beta1 coefficient the diminishing effect that the independent variable has on the dependent variable, now that the fit is a straight line?
After the estimation I cannot see how this constant can be interpreted as a diminishing effect on the dependent variable.
Kind Regards Maria

This should have been posted into the stat version of StackOverflow.
Anyways my suggestion is to try this (start with a basic linear model):
1) Check the plot of the residuals. If there is no sign of heteroscedasticity in the linear model, then stop. Otherwise, if you can see a pattern in the residuals (funnel, sinusoids or anything else), continue. A funnel-shaped pattern in particular means the residual variance is not constant, i.e. Var(e_i) != sigma^2 for some observations i.
2) Try with a squared model. In this case I would do:
Y = beta[0]+beta[1]*X+beta[2]*X^2
Then if your ideas are correct you should get a positive beta[1] and a negative beta[2], most likely with abs(beta[1])>abs(beta[2]). This means that for small values of X the effect of the (negative) squared component will be little to none, while for large values of X it will be very strong.
Now go back to 1). If you get normal residuals, you are done.
3) Try with:
Y = beta[0]+beta[1]*log(X)
and with:
Y = beta[0]+beta[1]*log(X^2)
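And see which one gives you the best residuals. Note that log(X^2) = 2*log(X) for X > 0, so the second form gives the same fitted values and residuals as the first, with beta[1] halved.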
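Steps 2) and 3) can be tried out on made-up data before touching the real data set. The following is only a sketch: the data are synthetic (a log-shaped relationship plus a small deterministic wiggle standing in for noise), and the variable names are my own. It fits the quadratic and the level-log model with polyfit and compares the residuals.

```matlab
% Synthetic data, invented for illustration only
X = (1:50)';
Y = 3 + 2*log(X) + 0.05*sin(7*X);

% Step 2: quadratic model Y = beta0 + beta1*X + beta2*X^2
pq   = polyfit(X, Y, 2);            % pq = [beta2, beta1, beta0]
resQ = Y - polyval(pq, X);

% Step 3: level-log model Y = beta0 + beta1*log(X)
pl   = polyfit(log(X), Y, 1);       % pl = [beta1, beta0]
resL = Y - polyval(pl, log(X));

% Root-mean-square residual of each model
rmsQ = sqrt(mean(resQ.^2));
rmsL = sqrt(mean(resL.^2));

fprintf('quadratic: beta1 = %.4f, beta2 = %.5f, residual RMS = %.4f\n', pq(2), pq(1), rmsQ);
fprintf('level-log: beta1 = %.4f, residual RMS = %.4f\n', pl(1), rmsL);
```

Plotting resQ and resL against X is the visual version of step 1): whichever model leaves the least structure in the residuals is the better description.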
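There is only one issue in your reasoning: in terms of X you no longer have a straight line but a curve, as given by the relationship Y = beta[0] + beta[1]*LN(X). The model is linear only in log(X). Therefore the log curve itself explains your "diminishing returns": the slope with respect to X is beta[1]/X, which shrinks as X grows.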
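A small numeric illustration may help here. The coefficients below are hypothetical (not estimated from any data); the point is only to show that a fixed one-unit raise matters less and less as the wage grows, while a 1% raise has the same effect at every wage level, roughly beta1/100.

```matlab
beta0 = 10;    % hypothetical intercept
beta1 = 5;     % hypothetical coefficient on log(wage)
hoursFit = @(w) beta0 + beta1*log(w);

wage = [5 10 20 40];

% Effect of one extra currency unit of hourly wage: shrinks as wage grows
dHoursUnit = hoursFit(wage + 1) - hoursFit(wage);

% Effect of a 1% raise: identical at every wage level, about beta1/100
dHoursPct = hoursFit(1.01*wage) - hoursFit(wage);

% Rows: wage, effect of +1 unit, effect of +1%
disp([wage; dHoursUnit; dHoursPct]);
```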

How do I align two signals in MATLAB [duplicate]

I want to get the offset in samples between two datasets in Matlab (getting them synced in time), a quite common issue. Therefore I use the cross correlation function xcorr or the cross covariance function xcov (both provide similar results in most cases for this purpose). With artificial data it works fine, but I struggle with "real" data, even though it should be pretty much the same. Matlab always says the offset would be zero. I'm using this simple piece of code:
[crossCorr] = xcov(b, c);
[~, peakIndex] = max(crossCorr)
offset = peakIndex - length(b)
I've posted a fully runnable example m-file with a downsampled data excerpt on pastebin:
Code with data on pastebin
EDIT: The downsampled excerpt seems not to be fully suitable for evaluating the effect. Here's a much larger sample with the original frequency; please use this one instead. Unfortunately it was too big for pastebin.
As the plot shows it should be no problem at all to get the offset via cross covariance. I also tried to scale the data nicer in order to avoid numerical problems, but that didn't change anything at all.
Would be great if someone could tell me my mistake.
There's nothing wrong with your method in principle; I used exactly the same approach successfully for temporally aligning different audio recordings of the same signal.
However, it appears that for your time series, correlation (or covariance) is simply not the right measure to compare shifted versions – possibly because they contain components of a time scale comparable to the total length. An alternative is to use residual variance, i.e. the variance of the difference between shifted versions. Here is a (not particularly elegant) implementation of this idea:
lags = -1000 : 1000;               % candidate offsets in samples
v = nan(size(lags));               % residual variance for each lag
for i = 1 : numel(lags)
    lag = lags(i);
    if lag >= 0
        v(i) = var(b(1 + lag : end) - c(1 : end - lag));
    else
        v(i) = var(b(1 : end + lag) - c(1 - lag : end));
    end
end
[~, ind] = min(v);                 % lag with the smallest residual variance
minlag = lags(ind);
For your (longer) data set, this results in minlag = 169. Plotting residual variance over lags gives:
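As a sanity check of the residual-variance idea, here is the same loop run on a made-up signal with a known offset. Everything in it is an assumption for the demo (the test signal, the offset d = 7, the narrower lag range); with real data only the loop itself carries over.

```matlab
d = 7;                                       % true offset, chosen for the demo
N = 500;
t = 1:(N + d);
s = sin(2*pi*t/200) + 0.3*cos(2*pi*t/37);    % made-up test signal
b = s(1:N);                                  % constructed so that b(k) = c(k - d)
c = s(1+d : N+d);

lags = -20:20;
v = nan(size(lags));
for i = 1:numel(lags)
    lag = lags(i);
    if lag >= 0
        v(i) = var(b(1+lag:end) - c(1:end-lag));
    else
        v(i) = var(b(1:end+lag) - c(1-lag:end));
    end
end
[~, ind] = min(v);
minlag = lags(ind);
fprintf('recovered lag: %d (true offset %d)\n', minlag, d);
```

With this sign convention a positive minlag means that b lags behind c by that many samples.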
Your data has a minor peak around 5 and a major peak around 101.
If I knew something about my data, I could window around an acceptable range of offsets, as shown below.
Code for initial exploration:
figure; clc;
subplot(2,1,1)
plot(1:numel(b), b);
hold on
plot(1:numel(c), c, 'r');
legend('b','c')
subplot(2,1,2)
plot(crossCorr,'.b-')
hold on
plot(peakIndex,crossCorr(peakIndex),'or')
legend('crossCorr','peak')
Initial Image:
If you zoom into the first peak you can see that it is not only high around 5, but it is polynomial "enough" to allow sub-element offsets. That is convenient.
Here is what the curve-fitting tool gives as the analytic fit for a cubic:
Linear model Poly3:
f(x) = p1*x^3 + p2*x^2 + p3*x + p4
Coefficients (with 95% confidence bounds):
p1 = 8.515e-013 (8.214e-013, 8.816e-013)
p2 = -3.319e-011 (-3.369e-011, -3.269e-011)
p3 = 2.253e-010 (2.229e-010, 2.277e-010)
p4 = -4.226e-012 (-7.47e-012, -9.82e-013)
Goodness of fit:
SSE: 2.799e-024
R-square: 1
Adjusted R-square: 1
RMSE: 6.831e-013
Note that the SSE is down at roundoff level.
To compute the peak position, i.e. the root of the derivative near n=4, you can use the following MATLAB code:
% Coefficients
p1 = 8.515e-013;
p2 = -3.319e-011;
p3 = 2.253e-010;
p4 = -4.226e-012;
% Linear model Poly3:
syms x
f = p1*x^3 + p2*x^2 + p3*x + p4;
% Root of the derivative near x = 4 (requires the Symbolic Math Toolbox)
xz1 = fzero(@(y) double(subs(diff(f), x, y)), 4)
and you get the root at 4.01420240431444.
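If the Symbolic Math Toolbox is not at hand, the same number can be cross-checked with base functions only. This is a sketch using the fitted coefficients listed above; the variable names are mine.

```matlab
% Cubic coefficients [p1 p2 p3 p4] from the fit, highest power first
p  = [8.515e-13, -3.319e-11, 2.253e-10, -4.226e-12];

dp = polyder(p);                 % coefficients of the derivative f'(x)
r  = roots(dp);                  % stationary points of the cubic
r  = r(imag(r) == 0);            % keep the real ones

[~, k]  = min(abs(r - 4));       % pick the stationary point nearest x = 4
peakPos = r(k);

% A negative second derivative confirms that this is a maximum
isMax = polyval(polyder(dp), peakPos) < 0;
fprintf('peak at x = %.6f (maximum: %d)\n', peakPos, isMax);
```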
EDIT:
Hmmm. How about fitting a Gaussian mixture model to the cross-correlation? You sweep through a good range of component counts, you do between 10 and 30 repeats, and you find which component count has the best (lowest) BIC. So you fit a gmdistribution to the lower subplot of the first figure, then test the covariance at the means of the components in decreasing order.
I would try the offset at the means, and just look at sum squared error. I would then pick the offset that has the lowest error.
Procedure:
compute cross correlation
fit cross correlation to Gaussian Mixture model
sweep a reasonable range of components (start with 1-10)
use a reasonable number of repeats (10 to 30 depending on run-to-run variation)
compute the Bayesian Information Criterion (BIC) for each level, pick the lowest because it indicates a reasonable balance of error and parameter count
each component is going to have a mean, evaluate that mean as a candidate offset and compute sum-squared error (sse) when you offset like that.
pick the offset of the component that gives best SSE
Let me know how well that works.
If the two signals are misaligned by a non-integer number of samples, e.g. 3.7 samples, then the xcorr method may find the maximum at 4 samples; it cannot resolve the exact time shift. In this case, you should try a method called "unified change detection". The web link for the paper is:
[http://www.phmsociety.org/node/1404/]
Good Luck.

Optimization algorithm in Matlab

I want to calculate maximum of the function CROSS-IN-TRAY which is
shown here:
So I have made this function in Matlab:
function f = CrossInTray2(x)
% The CrossInTray2 objective function (negated, so that it can be minimized)
f = 0.0001 * ((abs(sin(x(:,1)) .* sin(x(:,2)) .* exp(abs(100 - sqrt(x(:,1).^2 + x(:,2).^2)/pi))) + 1).^0.1);
end
I multiplied the whole formula by (-1) so that the function is inverted; when I look for the minimum of the inverted formula, it will actually be the maximum of the original one.
Then when I go to the optimization tools, select the GA algorithm and define the lower and upper bounds as -3 and 3, it shows me the result after about 60 iterations, which is about 0.13, and the final point is something like [0, 9.34].
How is it possible that the final point is not in the range defined by the bounds? And what is the actual maximum of this function?
The maximum is at (0,0) (actually, it occurs whenever either input is 0, and periodically at multiples of pi). After you negate, you're looking for a minimum of a positive quantity. Just looking at the outer absolute value, it obviously can't get lower than 0. That trivially occurs when either sin(x) is 0.
Plugging in, you have f_min = f(0,0) = 0.0001*(0 + 1)^0.1 = 1e-4
This expression is trivial to evaluate and plot over a 2d grid. Do that until you figure out what you're looking at, and what the approximate answer should be, and only then invoke an actual optimizer. GA does not sound like a good candidate for a relatively smooth expression like this. The reason you're getting strange answers is the fact that only one of the input parameters has to be 0. Once the optimizer finds one of those, the other input could be anything.
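A grid evaluation along those lines could look like the sketch below. It uses the negated expression from the question, written with pi instead of 3.14159, on a deliberately coarse grid that contains 0 exactly; the grid spacing and variable names are my own choices.

```matlab
% Negated cross-in-tray expression, as in the question
f = @(x, y) 0.0001 * (abs(sin(x).*sin(y).*exp(abs(100 - sqrt(x.^2 + y.^2)/pi))) + 1).^0.1;

g = -3:0.5:3;                    % coarse grid inside the bounds, includes 0
[X, Y] = meshgrid(g, g);
Z = f(X, Y);

[fmin, k] = min(Z(:));           % first grid point that attains the minimum
fprintf('grid minimum %.6g at (%.2f, %.2f)\n', fmin, X(k), Y(k));
fprintf('grid maximum %.6g\n', max(Z(:)));
% surf(X, Y, Z) shows the whole surface
```

The minimum of 1e-4 shows up at every grid point where x or y is zero, which is why an optimizer can report many different "final points" for the same objective value.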

Solving equations involving dozens of ceil and floor functions in MATLAB?

I am tackling a problem which uses lots of equations in the form of:
where q_i(x) is the only unknown, c_i, C_j, P_j are always positive. We have two cases, the first when c_i, C_j, P_j are integers and the case when they are real. C_j < P_j for all j
How is this type of problem efficiently solved in MATLAB, especially when the number of iterations N is between 20 and 100?
What I was doing: q_i(x) - c_i(x) must be equal to a sum of integers, so I was doing an exhaustive search for a q_i(x) which satisfies both sides of the equation. Clearly this is computationally expensive.
What if c_i(x) is a floating-point number? Will this make it even more difficult to find a real q_i(x)?
MORE INFO: These equations are from the paper "Integrating Preemption Threshold to Fixed Priority DVS Scheduling Algorithms" by Yang and Lin.
Thanks
You can use the bisection method to numerically find zeros of almost any well-behaved function.
Convert your equation problem into a zero-finding problem, by moving all things to one side of the equal sign. Then find x: f(x)=0.
Apply a bisection-method equation solver.
That's it! Or maybe....
If you have specific range(s) in which the roots should fall, then just perform the bisection method for each range. If not, you still have to give a maximum estimate (you don't want to try numbers larger than that), and use this as the range.
The problem with this method is that for each given range it can only find one root, because it always picks the left (or right) half of the range. That's OK if P_j is an integer, as you can always find a minimum step of the function. Say P_j = 1, then only a change in q_i larger than 1 leads to another segment (and thus a possible different root). Otherwise, within each range shorter than 1 there will be at most one solution.
If P_j is an arbitrary number (such as 1e-10), then unless you have a lower limit on P_j you are most likely out of luck, since you can't tell how fast the function will jump. This essentially means f(x) is not a well-behaved function, making it hard to solve.
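For what it's worth, a bare-bones bisection loop is only a few lines. The sketch below uses a made-up staircase right-hand side (the coefficients are purely illustrative, not taken from the paper) and a bracket that lies inside a single step, so that the function is continuous on it.

```matlab
% Made-up staircase; g(q) = 0 is the equation q = rhs(q) rewritten
% as a zero-finding problem
rhs = @(q) 2 + 0.5*floor(q/3) + 3*floor(q/4) + 0.3*floor(q/2);
g   = @(q) q - rhs(q);

a = 2;  b = 2.999;               % bracket inside one step, g(a) < 0 < g(b)
assert(g(a)*g(b) < 0, 'bracket must contain a sign change');

while (b - a) > 1e-12
    m = (a + b)/2;
    if g(a)*g(m) <= 0
        b = m;                   % sign change is in the left half
    else
        a = m;                   % sign change is in the right half
    end
end
root = (a + b)/2;
fprintf('root = %.6f\n', root);
```

If the bracket straddles a jump, the loop can converge to the jump instead of a root, so the result should always be checked by plugging it back into g.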
The sum is a step function. You can discretize the problem by calculating where the floor function jumps to the next value; this is periodic for every j. Then you overlay the N "rhythms" (each has its own speed specified by P_j) and get all the locations where the sum jumps. Each segment can have exactly 0 or 1 intersections with q_i(x). You should visualize the problem for intuitive understanding like this:
f = @(q) 2 + (floor(q/3)*0.5 + floor(q/4)*3 + floor(q/2)*.3);
xx = -10:0.01:10;
plot(xx,f(xx),xx,xx)
For each step, it can be checked analytically if an intersection exists or not.
jumps = unique([0:3:10,0:4:10,0:2:10]); % Vector with position of jumps
lBounds = jumps(1:end-1); % Vector with lower bounds of stairs
uBounds = jumps(2:end); % Vector with upper bounds of stairs
middle = (lBounds+uBounds)/2; % center of each stair
fStep = f(middle); % height of the stairs
intersection = fStep; % Solution of linear function q=fStep
% Check if intersection is within the bounds of the specific step
solutions = intersection(intersection>=lBounds & intersection<uBounds)
2.3000 6.9000

QRS detection(peaks) of a raw ecg signal in matlab

I want to find the peaks of the raw ECG signal so that I can calculate the beats per minute (bpm).
I have written code in MATLAB, which I have attached below. In the code below I am unable to find the threshold correctly, which would help me in finding the peaks and hence the bpm.
% input the signal into MATLAB (audioread replaces the removed wavread)
[x,fs]=audioread('heartbeat.wav');
subplot(2,1,1)
plot(x(1:10000),'r-')
grid on
% lowpass filter the input signal with cutoff at 100 Hz
h=fir1(30,0.3126); % normalized cutoff freq=0.3126
y=filter(h,1,x);
subplot(2,1,2)
plot(y(1:10000),'b-')
grid on
% peaks are seen as pulses (heart beats)
beat_count=0;
for p=2:length(y)-1
    th(p)=abs(max(y(p)));
    if(y(p) >y(p-1) && y(p) >y(p+1) && y(p)>th(p))
        beat_count=beat_count+1;
    end
end
N = length(y);
duration_seconds=N/fs;
duration_minutes=duration_seconds/60;
BPM=beat_count/duration_minutes;
bpm=ceil(BPM);
Please help me as I am new to matlab
I suggest changing this section of your code
beat_count=0;
for p=2:length(y)-1
    th(p)=abs(max(y(p)));
    if(y(p) >y(p-1) && y(p) >y(p+1) && y(p)>th(p))
        beat_count=beat_count+1;
    end
end
This is definitely flawed: th(p)=abs(max(y(p))) is just abs(y(p)), so the test y(p)>th(p) can never be true and no beat is ever counted. I'm not sure of your logic here, but what about this? We are looking for peaks, but only the high peaks, so first let's set a threshold value (you'll have to tweak this to a sensible number) and cull everything below that value to get rid of the smaller peaks:
% Here I'm considering anything less than 90% of the max as not a real peak.
% This bit really depends on your logic for finding peaks, which you haven't explained.
th = max(y) * 0.9;
Yth = zeros(length(y), 1);
Yth(y > th) = y(y > th);
OK so I suggest you now plot y and Yth to see what that code did. Now to find the peaks, my logic is that we are looking for local maxima, i.e. points at which the first derivative of the function changes from being positive to being negative. So I'm going to find a very simple numerical approximation to the first derivative by finding the difference between each consecutive point on the signal:
Ydiff = diff(Yth);
Now I want to find where the derivative goes from being positive to being negative. So I'm going to make all the positive values equal zero, and all the negative values equal one:
Ydiff_logical = Ydiff < 0;
Finally I want to find where this signal changes from a zero to a one (but not the other way around):
Ypeaks = diff(Ydiff_logical) == 1;
Now count the peaks:
sum(Ypeaks)
Note that for plotting purposes, because of the use of diff, we should pad a false to either side of Ypeaks, so:
Ypeaks = [false; Ypeaks; false];
OK so there is quite a lot of matlab there, I suggest you run each line, one by one and inspect the variable by both plotting the result of each line and also by double clicking the variable in the matlab workspace to understand what is happening at each step.
Example: (signal PeakSig taken from http://www.mathworks.com/help/signal/ref/findpeaks.html) and plotting with:
plot(x(Ypeaks),PeakSig(Ypeaks),'k^','markerfacecolor',[1 0 0]);
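Put together, the steps above can be run end to end on a toy signal to see what each variable contains. The test signal here is invented (three bumps of height 1, 0.95 and 0.4 at samples 200, 500 and 800); the remaining lines are the ones from the walkthrough.

```matlab
n = (1:1000)';
% Made-up test signal: three bumps, only the first two exceed 90% of the max
y = exp(-((n-200)/10).^2) + 0.95*exp(-((n-500)/10).^2) + 0.4*exp(-((n-800)/10).^2);

th  = 0.9 * max(y);                    % threshold at 90% of the maximum
Yth = zeros(length(y), 1);
Yth(y > th) = y(y > th);               % cull everything below the threshold

Ydiff         = diff(Yth);             % crude first derivative
Ydiff_logical = Ydiff < 0;             % 1 where the signal is falling
Ypeaks        = diff(Ydiff_logical) == 1;   % rising-to-falling transitions
Ypeaks        = [false; Ypeaks; false];     % pad back to the length of y

peakCount = sum(Ypeaks);
peakIdx   = find(Ypeaks);
fprintf('%d peaks found at samples %s\n', peakCount, mat2str(peakIdx'));
```

The third bump is ignored because it never reaches the threshold, which is exactly the behaviour the 0.9 factor controls.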
What do you think about the built-in
findpeaks(data,'Name',value)
function? You can choose among different logics for peak detection:
'MINPEAKHEIGHT'
'MINPEAKDISTANCE'
'THRESHOLD'
'NPEAKS'
'SORTSTR'
I hope this helps.
You know, the QRS complex does not always have the maximum amplitude; in a pathological ECG it can appear as several minor oscillations instead of one high-amplitude peak.
Thus, you can try one good algorithm that I have tested: the detection criterion is a high absolute rate of change in the signal, averaged within a given interval.
Algorithm:
- 50/60 Hz filter (e.g. for 50 Hz a sliding window of 20 msec will be fine)
- adaptive high-pass filter (for baseline drift)
- find the signal's first derivative x'
- find the squared derivative (x')^2
- apply a sliding average window with the width of the QRS complex, approx. 100-150 msec (you will get a signal with 'rectangles' which have the width of the QRS)
- use a simple threshold (e.g. 1/3 of the maximum of the first 3 seconds) to determine the approximate positions of R
- in the source ECG, find the local maximum within +-100 msec of that R position.
However, you still have to eliminate artifacts and outliers (e.g. surges when the electrode connection fails).
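The core of that algorithm (derivative, squaring, sliding average, threshold, local maximum) can be sketched in a few lines of base MATLAB. Note the assumptions: the "ECG" below is a synthetic train of narrow spikes at 60 bpm with a little baseline drift, the sampling rate of 250 Hz is made up, and the mains filter and adaptive high-pass steps are left out because the toy signal does not need them.

```matlab
fs  = 250;                             % sampling rate in Hz (assumed)
t   = (0:10*fs - 1)/fs;                % 10 s of data
ecg = 0.05*sin(2*pi*0.3*t);            % slow baseline drift
for bt = 0.5:1:9.5                     % one narrow spike per second (60 bpm)
    ecg = ecg + exp(-((t - bt)/0.01).^2);
end

d   = diff(ecg);                       % first derivative x'
d2  = d.^2;                            % squared derivative (x')^2
w   = round(0.12*fs);                  % sliding window of about 120 ms
env = filter(ones(1, w)/w, 1, d2);     % sliding average ('rectangles')

th     = max(env(1:3*fs))/3;           % 1/3 of the maximum of the first 3 s
above  = env > th;
onsets = find(diff([0, above]) == 1);  % start of each rectangle

half   = round(0.1*fs);                % +-100 ms search window
rPeaks = zeros(1, numel(onsets));
for k = 1:numel(onsets)
    lo = max(1, onsets(k) - half);
    hi = min(numel(ecg), onsets(k) + half);
    [~, imax] = max(ecg(lo:hi));       % local maximum in the source signal
    rPeaks(k) = lo + imax - 1;
end

bpm = 60 * numel(rPeaks) / (numel(ecg)/fs);
fprintf('%d beats detected, %.1f bpm\n', numel(rPeaks), bpm);
```

On real recordings the window width and the 1/3 factor need tuning, and the artifact rejection mentioned above still has to be added.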
Also, you can find a lot of helpful information from this book: "R.M. Rangayyan - Biomedical Signal Analysis"

The deconv() function in MATLAB does not invert the conv() function

I would like to convolve a time-series containing two spikes (call it Spike) with an exponential kernel (k) in MATLAB. Call the convolved response "calcium1". I would like to recover the original spike ("reconSpike") data using deconvolution with the kernel. I am using the following code.
k1=zeros(1,5000);
k1(1:1000)=(1.1.^((1:1000)/100)-(1.1^0.01))/((1.1^10)-1.1^0.01);
k1(1001:5000)=exp(-((1001:5000)-1001)/1000);
k1(1)=k1(2);
spike = zeros(100000,1);
spike(1000)=1;
spike(1100)=1;
calcium1=conv(k1, spike);
reconSpike1=deconv(calcium1, k1);
The problem is that at the end of reconSpike, I get a chunk of very large, high amplitude waves that was not in the original data. Anyone know why and how to fix it?
Thanks!
It works for me if you keep the spike vector the same length as the k1 vector. i.e.:
k1=zeros(1,5000);
k1(1:1000)=(1.1.^((1:1000)/100)-(1.1^0.01))/((1.1^10)-1.1^0.01);
k1(1001:5000)=exp(-((1001:5000)-1001)/1000);
k1(1)=k1(2);
spike = zeros(5000, 1);
spike(1000)=1;
spike(1100)=1;
calcium1=conv(k1, spike);
reconSpike1=deconv(calcium1, k1);
Any reason you made them different?
You are running into either a problem with MATLAB's deconvolution algorithm, or floating point precision problems (or maybe both). I suspect it's floating point precision due to all the divisions and subtractions that take place during the deconvolution, but it might be worth contacting MathWorks directly to ask what they think.
Per MATLAB documentation, if [q,r] = deconv(v,u), then v = conv(u,q)+r must also hold (i.e., the output of deconv should always satisfy this). In your case this is violently violated. Put the following at the end of your script:
[reconSpike1, r] = deconv(calcium1, k1); % r is the remainder (rem would shadow the built-in rem function)
max(conv(k1, reconSpike1) + r - calcium1)
I get 6.75e227, which is not zero ;-) Next try changing the length of spike to 6000; you will get a small number (~1e-15). Gradually increase the length of spike; the error will get larger and larger. Note that if you put only one non-zero element into your spike, this behavior doesn't happen: the error is always zero. It makes sense; all MATLAB needs to do is divide everything by the same number.
Here's a simple demonstration using random vectors:
% Uniform random vectors on [1,2], generated with base rand
v = 1 + rand(100,1);
u = 1 + rand(100,1);
[q, r] = deconv(v,u);
fprintf('maximum error for length(v) = 100 is %f\n', max(conv(u, q) + r - v))
v = 1 + rand(1000,1);
[q, r] = deconv(v,u);
fprintf('maximum error for length(v) = 1000 is %f\n', max(conv(u, q) + r - v))
The output is:
maximum error for length(v) = 100 is 0.000000
maximum error for length(v) = 1000 is 14.910770
I don't know what you are really trying to accomplish, so it's hard to give further advice. But I'll just point out that if you have a problem where pulses are piling up and you want to extract information about each pulse, this can be a tricky problem. I know some people who work on things like this, so if you want some references let me know and I will ask them.
You should never expect that a deconvolution can simply undo a convolution. This is because deconvolution is an ill-posed problem.
The problem comes from the fact that the convolution is an integral operator (in the continuous case you write down an integral int f(x) g(x-t) dx or something similar). Now, the inverse of computing an integral (the deconvolution) is to apply a differentiation. Unfortunately, differentiation amplifies noise in the input. Thus, if your integral only has slight errors in it (and floating-point inaccuracies might already be enough), you end up with a totally different outcome after differentiation.
There are some ways to mitigate this amplification, but they have to be tried on a per-application basis.
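One such mitigation, shown here only as a sketch, is to divide in the frequency domain with a small regularization term (Tikhonov-style) instead of calling deconv. The kernel below is a simplified decaying exponential rather than the kernel from the question, and the value of lambda is an assumption that would need tuning on real, noisy data.

```matlab
N = 1000;  M = 200;
spike = zeros(1, N);
spike([100 200]) = 1;                  % two spikes, as in the question
k = exp(-(0:M-1)/20);                  % simplified decaying kernel (assumed)
y = conv(k, spike);                    % convolved response, length N + M - 1

L      = numel(y);
K      = fft(k, L);                    % kernel spectrum, zero-padded to L
Y      = fft(y);
lambda = 1e-4;                         % regularization strength (assumed)

% Regularized inverse filter: conj(K)/(|K|^2 + lambda) instead of 1/K
Xhat = conj(K).*Y ./ (abs(K).^2 + lambda);
xhat = real(ifft(Xhat));
xhat = xhat(1:N);

[~, order] = sort(xhat, 'descend');
recovered  = sort(order(1:2));         % positions of the two largest values
fprintf('largest reconstructed values at samples %d and %d\n', recovered(1), recovered(2));
```

A larger lambda suppresses more noise but also smears the reconstructed spikes; lambda = 0 reduces this to a plain spectral division, with the noise amplification described above.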