In the 2D data plotted below, we are interested in finding the "lump" region. As you can see, it is not a continuous graph. We also know the approximate dimensions of the "lump" region. A set of data is given below: the first column contains the y values and the second contains the x values. Any suggestions on how to detect lump regions like this?
21048 -980
21044 -956
21040 -928
21036 -904
21028 -880
21016 -856
21016 -832
21016 -808
21004 -784
21004 -760
20996 -736
20996 -712
20992 -684
20984 -660
20980 -636
20968 -612
20968 -588
20964 -564
20956 -540
20956 -516
20952 -492
20948 -468
20940 -440
20936 -416
20932 -392
20928 -368
20924 -344
20920 -320
20912 -296
20912 -272
20908 -248
20904 -224
20900 -200
20900 -176
20896 -152
20888 -128
20888 -104
20884 -80
20872 -52
20864 -28
20856 -4
20836 16
20812 40
20780 64
20748 88
20744 112
20736 136
20736 160
20732 184
20724 208
20724 232
20724 256
20720 280
20720 304
20720 328
20724 352
20724 376
20732 400
20732 424
20736 448
20736 472
20740 496
20740 520
20748 544
20740 568
20736 592
20736 616
20736 640
20740 664
20740 688
20736 712
20736 736
20744 760
20748 788
20760 812
20796 836
20836 860
20852 888
20852 912
20844 936
20836 960
20828 984
20820 1008
20816 1032
20820 1056
20852 1080
20900 1108
20936 1132
20956 1156
20968 1184
20980 1208
20996 1232
21004 1256
21012 1280
21016 1308
21024 1332
21024 1356
21028 1380
21024 1404
21020 1428
21016 1452
21008 1476
21004 1500
20992 1524
20980 1548
20956 1572
20944 1596
20920 1616
20896 1640
20872 1664
20848 1684
20812 1708
20752 1728
20664 1744
20640 1768
20628 1792
20628 1816
20620 1836
20616 1860
20612 1884
20604 1908
20596 1932
20588 1956
20584 1980
20580 2004
20572 2024
20564 2048
20552 2072
20548 2096
20536 2120
20536 2144
20524 2164
20516 2188
20512 2212
20508 2236
20500 2260
20488 2280
20476 2304
20472 2328
20476 2352
20460 2376
20456 2396
20452 2420
20452 2444
20436 2468
20432 2492
20432 2516
20424 2536
20420 2560
20408 2584
20396 2608
20388 2628
20380 2652
20364 2676
20364 2700
20360 2724
20352 2744
20344 2768
20336 2792
20332 2812
20328 2836
20332 2860
20340 2888
20356 2912
20380 2940
20428 2968
20452 2996
20496 3024
20532 3052
20568 3080
20628 3112
20652 3140
20728 3172
20772 3200
20868 3260
20864 3284
20864 3308
20868 3332
20860 3356
20884 3384
20884 3408
20912 3436
20944 3464
20948 3488
20948 3512
20932 3536
20940 3564
It may be just a coincidence, but the lump you show looks fairly parabolic. It's not completely clear what you mean by "know the approximate dimension of the lump region" but if you mean that you know approximately how wide it is (i.e. how much of the x-axis it takes up), you could simply slide a window of that width along the x-axis and do a parabolic fit (a.k.a. polyfit with degree 2) to all data that fits into the window at each point. Then, compute r^2 goodness-of-fit values at each point and the point with the r^2 closest to 1.0 would be the best fit. You'd probably need a threshold value and to throw out those where the x^2 coefficient was positive (to find lumps rather than dips) for sanity, but this might be a workable approach.
Even if the parabolic look is a coincidence, I think this would be a reasonable approach--a downward-pointing parabola is a pretty good description of a general "lump" by any definition I can think of.
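To make the fit-quality step concrete, here is a minimal sketch of the degree-2 polyfit and r^2 computation described above, assuming xdat and ydat hold the points that fall inside one window (the 0.9 threshold is just a placeholder, not a recommendation):
% Sketch: parabolic fit and r^2 for the data inside one window.
p = polyfit(xdat, ydat, 2);               % degree-2 (parabolic) fit
yfit = polyval(p, xdat);
ssRes = sum((ydat - yfit).^2);            % residual sum of squares
ssTot = sum((ydat - mean(ydat)).^2);      % total sum of squares
r2 = 1 - ssRes / ssTot;                   % closer to 1.0 = better fit
isCandidate = (p(1) < 0) && (r2 > 0.9);   % downward parabola above some threshold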
Edit: Attempted Implementation Below
I got curious and went ahead and implemented my proposed solution (with slight modifications). First, here's the code (ugly but functional):
function [x, p] = find_lump(data, width)
    n = size(data, 1);
    f = plot(data(:,1), data(:,2), 'bx-');
    hold on;

    bestX = -inf;
    bestP = [];
    bestMSE = inf;
    bestXdat = [];
    bestYfit = [];

    spanStart = 0;
    spanStop = 1;
    spanWidth = 0;
    while (spanStop < n)
        if (spanStart > 0)
            % Drop first segment from window (since we'll advance x):
            spanWidth = spanWidth - (data(spanStart + 1, 1) - x);
        end
        spanStart = spanStart + 1;
        x = data(spanStart, 1);
        % Advance spanStop index to maintain window width:
        while ((spanStop < n) && (spanWidth <= width))
            spanStop = spanStop + 1;
            spanWidth = data(spanStop, 1) - x;
        end
        % Correct for overshoot:
        if (spanWidth > width)
            spanStop = spanStop - 1;
            spanWidth = data(spanStop, 1) - x;
        end
        % Fit parabola to data in the current window:
        xdat = data(spanStart:spanStop, 1);
        ydat = data(spanStart:spanStop, 2);
        p = polyfit(xdat, ydat, 2);
        % Compute fit quality (mean squared error):
        yfit = polyval(p, xdat);
        r = yfit - ydat;
        mse = (r' * r) / size(xdat, 1);
        if ((p(1) < -0.002) && (mse < bestMSE))
            bestMSE = mse;
            bestX = x;
            bestP = p;
            bestXdat = xdat;
            bestYfit = yfit;
        end
    end
    x = bestX;
    p = bestP;
    plot(bestXdat, bestYfit, 'r-');
...and here's a result using the given data (I swapped the columns so column 1 is x values and column 2 is y values) with a window width parameter of 750:
Comments:
I opted to use mean squared error between the fit parabola and the original data within each window as the quality metric, rather than correlation coefficient (r^2 value) due to laziness more than anything else. I don't think the results would be much different the other way.
The output is heavily dependent on the threshold value chosen for the quadratic coefficient (see the bestMSE condition at the end of the loop). Truth be told, I cheated here by outputting the fit coefficients at each point and then selecting the threshold based on the known lump shape. This is equivalent to using a lump template as suggested by @chaohuang and may not be very robust depending on the expected variance in the data.
Note that some sort of shape control parameter seems to be necessary if this approach is used. The reason is that any random (smooth) run of data can be fit nicely to some parabola, but not necessarily around the maximum value. Here's a result where I set the threshold to zero and thus only restricted the fit to parabolas pointing downwards:
An improvement would be to add a check that the fit parabola at least has a maximum within the window interval (that is, check that the first derivative goes to zero within the window so we at least find local maxima along the curve). This alone is not sufficient as you still might have a tiny little lump that fits a parabola better than an "obvious" big lump as seen in the given data set.
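As a rough sketch of that check (assuming p and xdat are the fit coefficients and window x-values from the loop above), the vertex of the fitted parabola can be tested against the window bounds:
% Sketch: keep only fits whose maximum lies inside the current window.
xVertex = -p(2) / (2 * p(1));                    % first derivative is zero here
hasMaxInWindow = (p(1) < 0) && ...
    (xVertex >= xdat(1)) && (xVertex <= xdat(end));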
Related
I'm having a surprisingly difficult time figuring out something that appears so simple. I have two known coordinates on a graph, (X1,Y1) and (X2,Y2). What I'm trying to identify are the coordinates of (X3,Y3).
I thought of using sin and cos, but once I get here my brain stops working. I know that
sin θ = y/R
cos θ = x/R
so I thought of simply putting in the length of the line (in this case it was 2) and using the angles, which are known. Seems very simple, but for the life of me, my brain won't wrap around this.
The reason I need this is that I'm trying to print a line onto an image using poly2mask in MATLAB. The code has to work in 2D space, as I will be building movies using the line.
X1 = [134 134 135 147 153 153 167]
Y1 = [183 180 178 173 164 152 143]
X2 = [133 133 133 135 138 143 147]
Y2 = [203 200 197 189 185 173 163]
YZdist = 2;
for aa = 1:length(X2)
    XYdis(aa) = sqrt((X2(aa)-X1(aa))^2 + (Y2(aa)-Y1(aa))^2);
    X3(aa) = X1(aa) * tan(XYdis(aa)/YZdist);
    Y3(aa) = Y1(aa) * tan(XYdis(aa)/YZdist);
end
polmask = poly2mask([Xdata X3],[Ydata Y3],50,50);
One approach would be to first construct a vector l connecting the points (x1,y1) and (x2,y2), rotate this vector 90 degrees clockwise, and add it to the point (x2,y2).
Thus l = (x2-x1, y2-y1), its rotated version is l' = (y2-y1, x1-x2), and therefore the point of interest is P = (x2, y2) + f*(y2-y1, x1-x2), where f is the desired scaling factor. If the lengths are supposed to be the same, then f = 1 and thus P = (x2 + y2 - y1, y2 + x1 - x2).
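A minimal MATLAB sketch of that construction, vectorized over the arrays from the question (the choice f = 1 assumes the offset should have the same length as the segment):
% Sketch: perpendicular point P = (x2, y2) + f*(y2-y1, x1-x2).
X1 = [134 134 135 147 153 153 167];
Y1 = [183 180 178 173 164 152 143];
X2 = [133 133 133 135 138 143 147];
Y2 = [203 200 197 189 185 173 163];

f = 1;                                 % offset has the same length as the segment
% For a fixed offset length (e.g. 2), scale by the segment length instead:
% f = 2 ./ hypot(X2 - X1, Y2 - Y1);
X3 = X2 + f .* (Y2 - Y1);
Y3 = Y2 + f .* (X1 - X2);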
I reproduced the results of a hierarchical model fit with the rethinking package using just rstan(), and I am curious why n_eff is not closer between the two.
Here is the model with random intercepts for 2 groups (intercept_x2) using the rethinking package:
Code:
response = c(rnorm(500,0,1), rnorm(500,200,10))
predictor1_continuous = rnorm(1000)
predictor2_categorical = factor(c(rep("A",500), rep("B",500)))
data = data.frame(y = response, x1 = predictor1_continuous, x2 = predictor2_categorical)
head(data)
library(rethinking)
m22 <- map2stan(
alist(
y ~ dnorm( mu , sigma ) ,
mu <- intercept + intercept_x2[x2] + beta*x1 ,
intercept ~ dnorm(0,10),
intercept_x2[x2] ~ dnorm(0, sigma_2),
beta ~ dnorm(0,10),
sigma ~ dnorm(0, 10),
sigma_2 ~ dnorm(0,10)
) ,
data=data , chains=1 , iter=5000 , warmup=500 )
precis(m22, depth = 2)
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
intercept 9.96 9.59 -5.14 25.84 1368 1
intercept_x2[1] -9.94 9.59 -25.55 5.43 1371 1
intercept_x2[2] 189.68 9.59 173.28 204.26 1368 1
beta 0.06 0.22 -0.27 0.42 3458 1
sigma 6.94 0.16 6.70 7.20 2927 1
sigma_2 43.16 5.01 35.33 51.19 2757 1
Now here is the same model in rstan():
# create a numeric vector to indicate the categorical groups
data$GROUP_ID = match( data$x2, levels( data$x2 ) )
library(rstan)
standat <- list(
N = nrow(data),
y = data$y,
x1 = data$x1,
GROUP_ID = data$GROUP_ID,
nGROUPS = 2
)
stanmodelcode = '
data {
  int<lower=1> N;
  int nGROUPS;
  real y[N];
  real x1[N];
  int<lower=1, upper=nGROUPS> GROUP_ID[N];
}
transformed data {
}
parameters {
  real intercept;
  vector[nGROUPS] intercept_x2;
  real beta;
  real<lower=0> sigma;
  real<lower=0> sigma_2;
}
transformed parameters { // none needed
}
model {
  real mu;

  // priors
  intercept ~ normal(0,10);
  intercept_x2 ~ normal(0, sigma_2);
  beta ~ normal(0,10);
  sigma ~ normal(0,10);
  sigma_2 ~ normal(0,10);

  // likelihood
  for(i in 1:N){
    mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x1[i];
    y[i] ~ normal(mu, sigma);
  }
}
'
fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22
Inference for Stan model: b212ebc67c08c77926c59693aa719288.
1 chains, each with iter=5000; warmup=500; thin=1;
post-warmup draws per chain=4500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept 10.14 0.30 9.72 -8.42 3.56 10.21 16.71 29.19 1060 1
intercept_x2[1] -10.12 0.30 9.73 -29.09 -16.70 -10.25 -3.50 8.36 1059 1
intercept_x2[2] 189.50 0.30 9.72 170.40 182.98 189.42 196.09 208.05 1063 1
beta 0.05 0.00 0.21 -0.37 -0.10 0.05 0.20 0.47 3114 1
sigma 6.94 0.00 0.15 6.65 6.84 6.94 7.05 7.25 3432 1
sigma_2 43.14 0.09 4.88 34.38 39.71 42.84 46.36 53.26 3248 1
lp__ -2459.75 0.05 1.71 -2463.99 -2460.68 -2459.45 -2458.49 -2457.40 1334 1
Samples were drawn using NUTS(diag_e) at Thu Aug 31 15:53:09 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
My Questions:
The n_eff is larger using rethinking(). There are simulation differences, but do you think something else is going on here?
Besides the n_eff being different, the percentiles of the posterior distributions are different. I was thinking rethinking() and rstan() should return similar results with 5000 iterations, since rethinking just calls rstan. Are differences like that normal, or is something different between the two implementations?
I created data$GROUP_ID to indicate the categorical groupings. Is this the correct way to incorporate categorical variables into a hierarchical model in rstan()? I have 2 groups; if I had 50 groups, would I use the same data$GROUP_ID vector, and is that the standard way?
Thank you.
I'm in the process of coding what I'm learning about Linear Regression from the coursera Machine Learning course (MATLAB). There was a similar post that I found here, but I don't seem to be able to understand everything. Perhaps because my fundamentals in Machine Learning are a bit weak.
The problem I'm facing is that, for some data, both gradient descent (GD) and the Closed Form Solution (CFS) give the same hypothesis line. However, on one particular dataset, the results are different. I've read something about the results being the same if the data is singular, but I have no idea how to check whether or not my data is singular.
I will try to illustrate as best I can:
1) Firstly, here is the MATLAB code adapted from here. For the given dataset, everything turned out well, with both GD and the CFS giving similar results.
The Dataset
X Y
2.06587460000000 0.779189260000000
2.36840870000000 0.915967570000000
2.53999290000000 0.905383540000000
2.54208040000000 0.905661380000000
2.54907900000000 0.938988900000000
2.78668820000000 0.966847400000000
2.91168250000000 0.964368240000000
3.03562700000000 0.914459390000000
3.11466960000000 0.939339440000000
3.15823890000000 0.960749710000000
3.32759440000000 0.898370940000000
3.37931650000000 0.912097390000000
3.41220060000000 0.942384990000000
3.42158230000000 0.966245780000000
3.53157320000000 1.05265000000000
3.63930020000000 1.01437910000000
3.67325370000000 0.959694260000000
3.92564620000000 0.968537160000000
4.04986460000000 1.07660650000000
4.24833480000000 1.14549780000000
4.34400520000000 1.03406250000000
4.38265310000000 1.00700090000000
4.42306020000000 0.966836480000000
4.61024430000000 1.08959190000000
4.68811830000000 1.06344620000000
4.97773330000000 1.12372390000000
5.03599670000000 1.03233740000000
5.06845360000000 1.08744520000000
5.41614910000000 1.07029880000000
5.43956230000000 1.16064930000000
5.45632070000000 1.07780370000000
5.56984580000000 1.10697580000000
5.60157290000000 1.09718750000000
5.68776170000000 1.16486030000000
5.72156020000000 1.14117960000000
5.85389140000000 1.08441560000000
6.19780260000000 1.12524930000000
6.35109410000000 1.11683410000000
6.47970330000000 1.19707890000000
6.73837910000000 1.20694620000000
6.86376860000000 1.12510460000000
7.02233870000000 1.12356720000000
7.07823730000000 1.21328290000000
7.15142320000000 1.25226520000000
7.46640230000000 1.24970650000000
7.59738740000000 1.17997060000000
7.74407170000000 1.18972990000000
7.77296620000000 1.30299340000000
7.82645140000000 1.26011340000000
7.93063560000000 1.25622670000000
My MATLAB code:
clear all; close all; clc;

x = load('ex2x.dat');
y = load('ex2y.dat');

m = length(y); % number of training examples

% Plot the training data
figure; % open a new figure window
plot(x, y, '*r');
ylabel('Height in meters')
xlabel('Age in years')

% Gradient descent
x = [ones(m, 1) x]; % Add a column of ones to x
theta = zeros(size(x(1,:)))'; % initialize fitting parameters
MAX_ITR = 1500;
alpha = 0.07;

for num_iterations = 1:MAX_ITR
    thetax = x * theta;
    % for theta_0 and x_0
    grad0 = (1/m) .* sum( x(:,1)' * (thetax - y));
    % for theta_1 and x_1
    grad1 = (1/m) .* sum( x(:,2)' * (thetax - y));
    % Here is the actual update
    theta(1) = theta(1) - alpha .* grad0;
    theta(2) = theta(2) - alpha .* grad1;
end

% print theta to screen
theta

% Plot the hypothesis (a.k.a. linear fit)
hold on
plot(x(:,2), x*theta, 'ob')

% Plot using the Closed Form Solution
plot(x(:,2), x*((x' * x)\x' * y), '--r')
legend('Training data', 'Linear regression', 'Closed Form')
hold off % don't overlay any more plots on this figure
[EDIT: Sorry for the wrong labeling... It's not Normal Equation, but Closed Form Solution. My mistake]
The results for this code are shown below (which is peachy :D Same results for both GD and CFS) -
Now I am testing my code with another dataset. The URL for the dataset is here - GRAY KANGAROOS. I converted it to CSV and read it into MATLAB. Note that I scaled the data (divided by the maximum); if I didn't do that, no hypothesis line appeared at all and the thetas came out as Not a Number (NaN) in MATLAB.
The Gray Kangaroo Dataset:
X Y
609 241
629 222
620 233
564 207
645 247
493 189
606 226
660 240
630 215
672 231
778 263
616 220
727 271
810 284
778 279
823 272
755 268
710 278
701 238
803 255
855 308
838 281
830 288
864 306
635 236
565 204
562 216
580 225
596 220
597 219
636 201
559 213
615 228
740 234
677 237
675 217
629 211
692 238
710 221
730 281
763 292
686 251
717 231
737 275
816 275
The changes I made to the code to read in this dataset
dataset = load('kangaroo.csv');
% scale?
x = dataset(:,1)/max(dataset(:,1));
y = dataset(:,2)/max(dataset(:,2));
The results that came out were like this: [EDIT: Sorry for the wrong labeling... It's not Normal Equation, but Closed Form Solution. My mistake]
I was wondering if there is any explanation for this discrepancy. Any help would be much appreciated. Thank you in advance!
I haven't run your code, but let me throw some theory at you:
If your code is right (it looks like it is): increase MAX_ITR and it will look better.
Gradient descent is not guaranteed to have converged by MAX_ITR, and gradient descent is actually quite a slow method (convergence-wise).
The convergence of gradient descent for a "standard" convex function (like the one you are trying to solve) looks like this (from the Internets):
Forget about the iteration number, as it depends on the problem, and focus on the shape. What may be happening is that your maximum iteration count falls somewhere around "20" in this image. Thus your result is good, but not the best!
However, solving the normal equations directly will give you the minimum squared error solution (by "normal equation" I assume you mean x = (A'*A)^(-1)*A'*b). The problem is that there are loads of cases where you cannot store A in memory, or where an ill-posed problem leads to ill-conditioned matrices in the normal equations that are numerically unstable; in those cases gradient descent is used.
more info
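As a rough illustration (a sketch, not part of the original answer): in MATLAB the closed-form solution is usually computed with the backslash operator rather than an explicit inverse, and the conditioning of the design matrix can be checked directly:
% Sketch: closed-form least squares and a conditioning check.
% Assumes x is the raw predictor column and y the response vector.
X = [ones(length(y), 1), x];       % design matrix with an intercept column
theta_cfs = (X' * X) \ (X' * y);   % backslash is preferred over inv(X'*X)*X'*y
kappa = cond(X' * X);              % a very large value means the system is ill-conditioned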
I think I figured it out.
I naively thought that a maximum of 1500 iterations was enough. I tried higher values (i.e. 5k and 10k), and both algorithms started to give similar solutions. So my main issue was the number of iterations; it needed more iterations to properly converge on that dataset :D
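One way to avoid guessing MAX_ITR (a sketch, assuming the same x with its column of ones, y, m, and alpha as in the code above, and using a vectorized update rather than per-parameter updates) is to monitor the cost and stop once it stabilizes:
% Sketch: run gradient descent until the cost stops improving.
theta = zeros(size(x, 2), 1);
tol = 1e-10;                                  % stop when the cost change falls below this
J_prev = inf;
for num_iterations = 1:100000
    grad = (1/m) .* (x' * (x * theta - y));   % gradient for all parameters at once
    theta = theta - alpha .* grad;
    J = (1/(2*m)) * sum((x * theta - y).^2);  % squared-error cost
    if abs(J_prev - J) < tol
        break;                                % converged: cost no longer changes noticeably
    end
    J_prev = J;
end
theta   % display the converged parameters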
I am working with the mtcars dataset and using linear regression.
data(mtcars)
fit <- lm(mpg ~ ., mtcars); summary(fit)
When I fit the model with lm, it shows the result like this:
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.5087 -1.3584 -0.0948 0.7745 4.6251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.87913 20.06582 1.190 0.2525
cyl6 -2.64870 3.04089 -0.871 0.3975
cyl8 -0.33616 7.15954 -0.047 0.9632
disp 0.03555 0.03190 1.114 0.2827
hp -0.07051 0.03943 -1.788 0.0939 .
drat 1.18283 2.48348 0.476 0.6407
wt -4.52978 2.53875 -1.784 0.0946 .
qsec 0.36784 0.93540 0.393 0.6997
vs1 1.93085 2.87126 0.672 0.5115
amManual 1.21212 3.21355 0.377 0.7113
gear4 1.11435 3.79952 0.293 0.7733
gear5 2.52840 3.73636 0.677 0.5089
carb2 -0.97935 2.31797 -0.423 0.6787
carb3 2.99964 4.29355 0.699 0.4955
carb4 1.09142 4.44962 0.245 0.8096
carb6 4.47757 6.38406 0.701 0.4938
carb8 7.25041 8.36057 0.867 0.3995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.833 on 15 degrees of freedom
Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
I found that none of the variables is marked as significant at the 0.05 significance level.
To find the significant variables, I want to do subset selection to find the best pair of variables as predictors of the response variable mpg.
The function regsubsets in the package leaps does best subset regression (see ?leaps). Adapting your code:
library(leaps)
regfit <- regsubsets(mpg ~., data = mtcars)
summary(regfit)
# or for a more visual display
plot(regfit,scale="Cp")
Imagine the following scenario:
I have two 4x100 matrices ang1_stab and ang2_stab. These contain four angles along the columns, like this:
195.7987 16.2722 14.4171 198.5878 199.2693...
80.2062 86.7363 89.2861 89.5454 89.3411...
68.7998 -83.8318 -80.3138 69.0636 -96.4913...
-5.3262 -23.3030 -20.6823 -18.9915 -16.7224...
95.3450 183.8212 171.0686 151.8887 177.9041...
21.4780 27.2389 23.4016 27.6631 17.2893...
-13.2767 -103.5548 -115.0615 39.6790 -112.3568...
-5.3262 -23.3030 -20.6823 -18.9915 -16.7224...
The fourth angle is always the same for both matrices, so it can be neglected.
The problem: some of the columns of ang1_stab and ang2_stab are swapped, so I need to find the columns that would fit better into the other matrix and then swap the respective columns.
The complication: the calculation of the given angles is ambiguous and multiples of 90° might have been added or subtracted, e.g. the angle 16° should be considered closer to 195° than to 95°.
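A minimal sketch of that wrap-around distance (the same expression used in the loop below): the difference is reduced modulo 90° before taking the absolute value, so 16° ends up about 1° from 195° but about 11° from 95°:
% Sketch: angular distance modulo 90 degrees.
angdist = @(a, b) abs((a - b) - round((a - b) / 90) * 90);

angdist(16, 195)   % ~1  -> considered close
angdist(16, 95)    % ~11 -> considered farther away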
What I have tried so far:
fp1 = []; % define clusters
fp2 = [];

for j = 1:size(ang1_stab,2) % loop over all columns
    tmp1 = ang1_stab(:,j); % temporary columns
    tmp2 = ang2_stab(:,j);
    if j == 1 % assign first cluster center
        fp1 = [fp1, tmp1];
        fp2 = [fp2, tmp2];
    else
        mns1 = median(fp1(1:3,:),2); % calculate cluster centers
        mns2 = median(fp2(1:3,:),2);
        % calculate distances to cluster centers (modulo 90 degrees)
        dif11 = sum(abs((mns1-tmp1(1:3))-round((mns1-tmp1(1:3))/90)*90));
        dif12 = sum(abs((mns1-tmp2(1:3))-round((mns1-tmp2(1:3))/90)*90));
        dif21 = sum(abs((mns2-tmp1(1:3))-round((mns2-tmp1(1:3))/90)*90));
        dif22 = sum(abs((mns2-tmp2(1:3))-round((mns2-tmp2(1:3))/90)*90));
        if min([dif11,dif21]) < min([dif12,dif22]) % assign to cluster
            if dif11 < dif21
                fp1 = [fp1,tmp1];
                fp2 = [fp2,tmp2];
            else
                fp1 = [fp1,tmp2];
                fp2 = [fp2,tmp1];
            end
        else
            if dif12 < dif22
                fp1 = [fp1,tmp2];
                fp2 = [fp2,tmp1];
            else
                fp1 = [fp1,tmp1];
                fp2 = [fp2,tmp2];
            end
        end
    end
end
However:
This approach seems overly complicated, and I was wondering if I can somehow replace it with an appropriate algorithm, e.g. kmeans. However, I don't know how to account for the ambiguity in the angles in that case.
The code runs, but the clustering currently still puts some points in the wrong cluster, and I cannot find out why.
I would appreciate it if someone could tell me how to adapt this to work with built-in routines like kmeans.
Edit:
A small toy example:
This could be the output that I am getting:
ang1_stab = [30 10 80 100; 28 15 90 95; 152 93 180 102];
ang2_stab = [150 90 3 100; 145 92 5 95; 32 10 82 102];
What I would like to achieve:
fp1 = [30 10 80 100; 28 15 90 95; 32 10 82 102];
fp2 = [150 90 3 100; 145 92 5 95; 152 93 180 102];
Note that the last columns have been swapped.
Also note that the third element in the last column of fp2 is approximately the mean of the other elements in that row, but 180° higher. I still need to be able to identify that this is the right cluster.