Accord.net SimpleLinearRegression regress method obsolete? - linear-regression

I just started learning accord.net, and while going through some examples I noticed that the Regress method on the SimpleLinearRegression is obsolete.
Apparently I should use the OrdinaryLeastSquares class, but I cannot find anything that will return the residual sum of squares, similar to the Regress method.
Do I need to create this method by myself?

Here is a full example on how to learn a SimpleLinearRegression and still be able to compute the residual sum of squares as you have been doing using the previous version of the framework:
// This is the same data from the example available at
// http://mathbits.com/MathBits/TISection/Statistics2/logarithmic.htm
// Declare your inputs and output data
double[] inputs = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
double[] outputs = { 6, 9.5, 13, 15, 16.5, 17.5, 18.5, 19, 19.5, 19.7, 19.8 };
// Transform inputs to logarithms
double[] logx = Matrix.Log(inputs);
// Use Ordinary Least Squares to learn the regression
OrdinaryLeastSquares ols = new OrdinaryLeastSquares();
// Use OLS to learn the simple linear regression
SimpleLinearRegression lr = ols.Learn(logx, outputs);
// Compute predicted values for inputs
double[] predicted = lr.Transform(logx);
// Get an expression representing the learned regression model
// We just have to remember that 'x' will actually mean 'log(x)'
string result = lr.ToString("N4", CultureInfo.InvariantCulture);
// Result will be "y(x) = 6.1082x + 6.0993"
// The mean squared error between the expected and the predicted is
double error = new SquareLoss(outputs).Loss(predicted); // 0.261454
The last line in this example is the one that should be the most interesting to you. As you can see, the residual sum of squares that was previously returned by the .Regress method can now be computed using the SquareLoss class. The advantage of this approach is that you can now compute whichever metric matters most to you, such as the ZeroOneLoss, the Euclidean loss or the Hamming loss.
In any case, I just wanted to reiterate that any methods marked as Obsolete in the framework are not going to stop working anytime soon. They are marked as obsolete meaning that new features will not be supported when using those methods, but your application will not stop working in case you have used any of those methods from within it.
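To make the relationship between the two error measures concrete, here is a minimal sketch in plain Python (not Accord.NET) that refits the same data by ordinary least squares and computes both the residual sum of squares and the mean squared error. The helper name fit_simple_linear_regression is hypothetical. The comment in the example above labels 0.261454 as a mean squared error, so if you need the sum rather than the mean, multiplying by the number of samples should recover it.

```python
import math

def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Same data as in the example above
inputs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
outputs = [6, 9.5, 13, 15, 16.5, 17.5, 18.5, 19, 19.5, 19.7, 19.8]
logx = [math.log(v) for v in inputs]  # natural logarithm of the inputs

slope, intercept = fit_simple_linear_regression(logx, outputs)
predicted = [slope * v + intercept for v in logx]

# Residual sum of squares, and its mean over the samples
rss = sum((y - p) ** 2 for y, p in zip(outputs, predicted))
mse = rss / len(outputs)

print(f"y(x) = {slope:.4f}x + {intercept:.4f}")
print(f"RSS = {rss:.4f}, MSE = {mse:.6f}")
```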

Related

How to randomly sample data with seeding?

I would like to randomly choose elements from a finite set that contains both numbers and NaNs while seeding the random number generation procedure.
So far I can make it work without seeding:
data = [0, 1, 2, 3, 4, 5, nan];
sample = datasample(data, 50);
but if I want to seed the number generation:
seed = rng(100);
sample = datasample(seed, data, 50);
I get the following error:
Error using datasample (line 89)
Sample size K must be a non-negative integer.
even if the syntax for datasample is (*):
[y,...] = datasample(s,data,k,...)
I have tried using randsample, too, but I get similar results.
(*) https://it.mathworks.com/help/stats/datasample.html
The documentation isn't very explicit about the first input. You need to pass a RandStream object as the first input argument rather than the struct that rng returns (as a side note, the output of rng is the previous settings, not the new ones).
Here is the equivalent of what it seems you were trying to do:
stream = RandStream('mt19937ar', 'Seed', 100);
output = datasample(stream, data, k);
If you would rather use rng to specify the seed, you can call rng, use RandStream.getGlobalStream to get the current global random number stream, and pass that to datasample. This is slightly redundant, though, since datasample uses the global random number stream if one isn't provided.
rng(100)
stream = RandStream.getGlobalStream();
output = datasample(stream, data, k);

Use MATLAB cameraParams in OpenCV program

I have a MATLAB program that loads two images and returns two camera matrices and a cameraParams object with distortion coefficients, etc. I would now like to use this exact configuration to undistort points and so on, in an OpenCV program that triangulates points given their 2D locations in two different videos.
function [cameraMatrix1, cameraMatrix2, cameraParams] = setupCameraCalibration(leftImageFile, rightImageFile, squareSize)
% Auto-generated by cameraCalibrator app on 20-Feb-2015
The thing is, the output of undistortPoints is different in MATLAB and OpenCV even though both use the same arguments.
As an example:
>> undistortPoints([485, 502], defaultCameraParams)
ans = 485 502
In Java, the following test mimics the above (it passes).
public void testUnDistortPoints() {
Mat srcMat = new Mat(2, 1, CvType.CV_32FC2);
Mat dstMat = new Mat(2, 1, CvType.CV_32FC2);
srcMat.put(0, 0, new float[] { 485, 502 } );
MatOfPoint2f src = new MatOfPoint2f(srcMat);
MatOfPoint2f dst = new MatOfPoint2f(dstMat);
Mat defaultCameraMatrix = Mat.eye(3, 3, CvType.CV_32F);
Mat defaultDistCoefficientMatrix = new Mat(1, 4, CvType.CV_32F);
Imgproc.undistortPoints(
src,
dst,
defaultCameraMatrix,
defaultDistCoefficientMatrix
);
System.out.println(dst.dump());
assertEquals(dst.get(0, 0)[0], 485d);
assertEquals(dst.get(0, 0)[1], 502d);
}
However, say I change the first distortion coefficient (k1). In MATLAB:
changedDist = cameraParameters('RadialDistortion', [2 0 0])
>> undistortPoints([485, 502], changedDist)
ans = 4.8756 5.0465
In Java:
public void testUnDistortPointsChangedDistortion() {
Mat srcMat = new Mat(2, 1, CvType.CV_32FC2);
Mat dstMat = new Mat(2, 1, CvType.CV_32FC2);
srcMat.put(0, 0, new float[] { 485, 502 } );
MatOfPoint2f src = new MatOfPoint2f(srcMat);
MatOfPoint2f dst = new MatOfPoint2f(dstMat);
Mat defaultCameraMatrix = Mat.eye(3, 3, CvType.CV_32F);
Mat distCoefficientMatrix = new Mat(1, 4, CvType.CV_32F);
distCoefficientMatrix.put(0, 0, 2f); // updated
Imgproc.undistortPoints(
src,
dst,
defaultCameraMatrix,
distCoefficientMatrix
);
System.out.println(dst.dump());
assertEquals(4.8756, dst.get(0, 0)[0]);
assertEquals(5.0465, dst.get(0, 0)[1]);
}
It fails with the following output:
[0.0004977131, 0.0005151587]
junit.framework.AssertionFailedError:
Expected :4.8756
Actual :4.977131029590964E-4
Why are the results different? I thought Java's distortion coefficient matrix includes both the radial and tangential distortion coefficients.
Also, is CV_64FC1 a good choice of type for the camera / distortion coefficient matrices?
I was trying to test the effect of changing the camera matrix itself (i.e. the value of f_x), but it's not possible to set the 'IntrinsicMatrix' parameter when using cameraparams, so I want to solve the distortion matrix problem first.
Any help would be greatly appreciated.
There are a couple of things you have to take into account when working with calibration models.
First, note that several camera calibration and distortion models exist: Tsai, ATAN, Pinhole, Ocam. I assume you want to use the Pinhole model, which is the one used by OpenCV and the most common one. It models from 2 to 6 parameters for radial distortion (denoted as k1...k6) and 2 for tangential distortion (denoted as p1, p2), as you can read in the OpenCV doc. Bouguet's calibration toolbox for Matlab uses this model too.
Second, there is no standardized way to arrange distortion parameters in a vector. OpenCV expects items in this order: [k1 k2 p1 p2 k3...k6], with k3...k6 being optional.
So, check the documentation of your Matlab calibration software and look for which model it uses and in which order the parameters are arranged. Then, make sure it matches the order in OpenCV.
As far as I recall, OpenCV accepts the calibration parameters as either CV_32F or CV_64F.
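The reordering described above can be sketched in plain Python (no OpenCV or MATLAB calls). It assumes the MATLAB side stores the radial coefficients as [k1 k2] or [k1 k2 k3] and the tangential ones as [p1 p2], as in the cameraParameters call from the question. The helper names to_opencv_dist_coeffs and distort are hypothetical. The distort function applies the radial and tangential model to an undistorted point in normalized coordinates.

```python
def to_opencv_dist_coeffs(radial, tangential=(0.0, 0.0)):
    """Arrange [k1, k2(, k3)] and [p1, p2] in the order [k1, k2, p1, p2, k3]."""
    k = list(radial) + [0.0] * (3 - len(radial))  # pad a missing k3 with zero
    p1, p2 = tangential
    return [k[0], k[1], p1, p2, k[2]]

def distort(x, y, coeffs):
    """Apply radial + tangential distortion to an undistorted normalized point."""
    k1, k2, p1, p2, k3 = coeffs
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return xd, yd

# RadialDistortion [2 0 0] from the question, no tangential distortion
coeffs = to_opencv_dist_coeffs([2, 0, 0])
print(coeffs)

# Feed MATLAB's undistorted result from the question back through the model
print(distort(4.8756, 5.0465, coeffs))
```

With an identity camera matrix, pixel and normalized coordinates coincide, so pushing MATLAB's answer (4.8756, 5.0465) through the forward model should land close to the original point (485, 502). That is one way to check which output is consistent with the model for a given coefficient order.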
Update
I don't know about Java, but in C++, when you create a Mat, its initial values are unspecified, so this code may be creating a matrix with 2f in the first item and garbage in the remaining ones:
Mat distCoefficientMatrix = new Mat(1, 4, CvType.CV_32F);
distCoefficientMatrix.put(0, 0, 2f);
Could you check if this is the problem?
A note for the future: to make things trickier, keep in mind that the intrinsic calibration matrix in OpenCV is the transpose of the one in Matlab.

Is there any function in OpenCV which is equivalent to MATLAB conv2

Is there a direct OpenCV equivalent of the MATLAB function conv2? I tried using cvFilter2D(), but it seems to give me different results than conv2().
For example:
CvMat * Aa = cvCreateMat(2, 2, CV_32FC1);
CvMat * Bb = cvCreateMat(2, 2, CV_32FC1);
CvMat * Cc = cvCreateMat(2, 2, CV_32FC1);
cvSetReal2D(Aa, 0, 0, 1);
cvSetReal2D(Aa, 0, 1, 2);
cvSetReal2D(Aa, 1, 0, 3);
cvSetReal2D(Aa, 1, 1, 4);
cvSetReal2D(Bb, 0, 0, 5);
cvSetReal2D(Bb, 0, 1, 5);
cvSetReal2D(Bb, 1, 0, 5);
cvSetReal2D(Bb, 1, 1, 5);
cvFilter2D(Aa, Cc, Bb);
This produces the matrix [20 30; 40 50]
In MATLAB:
>> A=[1 2; 3 4]
A =
1 2
3 4
>> B=[5 5; 5 5]
B =
5 5
5 5
>> conv2(A,B,'same')
ans =
50 30
35 20
Please help me. It would be very useful for me. Thank you.
Regards
Arangarajan.
The numerical computing environment Matlab (or e.g. its free alternative GNU Octave) provides a function called conv2 for the two-dimensional convolution of a given matrix with a convolution kernel. While writing some C++ code based upon the free image processing library OpenCV, I found that OpenCV currently offers no equivalent method.
Although there is a filter2D() method that implements two-dimensional correlation and that can be used to convolve an image with a given kernel (by flipping that kernel and moving the anchor point to the correct position, as explained on the corresponding OpenCV documentation page), it would be nice to have a method offering the same border handling options as Matlab (“full”, “valid” or “same” convolution), e.g. for comparing results of the same algorithm implemented in both Matlab and C++ using OpenCV.
Here is what I came up with:
enum ConvolutionType {
/* Return the full convolution, including border */
CONVOLUTION_FULL,
/* Return only the part that corresponds to the original image */
CONVOLUTION_SAME,
/* Return only the submatrix containing elements that were not influenced by the border
*/
CONVOLUTION_VALID
};
void conv2(const Mat &img, const Mat& kernel, ConvolutionType type, Mat& dest) {
Mat source = img;
if(CONVOLUTION_FULL == type) {
source = Mat();
const int additionalRows = kernel.rows-1, additionalCols = kernel.cols-1;
copyMakeBorder(img, source, (additionalRows+1)/2, additionalRows/2,
(additionalCols+1)/2, additionalCols/2, BORDER_CONSTANT, Scalar(0));
}
Point anchor(kernel.cols - kernel.cols/2 - 1, kernel.rows - kernel.rows/2 - 1);
int borderMode = BORDER_CONSTANT;
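// cv::flip needs an output matrix and a flip code; -1 flips around both axes,
// which turns the correlation computed by filter2D into a convolution
Mat flippedKernel;
flip(kernel, flippedKernel, -1);
filter2D(source, dest, img.depth(), flippedKernel, anchor, 0, borderMode);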
if(CONVOLUTION_VALID == type) {
dest = dest.colRange((kernel.cols-1)/2, dest.cols - kernel.cols/2)
.rowRange((kernel.rows-1)/2, dest.rows - kernel.rows/2);
}
}
In my unit tests, this implementation yielded results that were almost identical to the Matlab implementation. Note that both OpenCV and Matlab do the convolution in Fourier space if the kernel is large enough. The definition of ‘large’ differs between the two implementations, but results should still be very similar, even for large kernels.
Also, the performance of this method might be an issue for the ‘full’ convolution case, since the entire source matrix needs to be copied to add a border around it. Finally, if you receive an exception in the filter2D() call and you are using a kernel with only one column, this might be caused by this bug. In that case, set the borderMode variable to e.g. BORDER_REPLICATE instead, or use the latest version of the library from the OpenCV trunk.
If you are using convolution, there is a problem at the edge of the matrix. The convolution mask needs values which are outside of the matrix. The algorithms from OpenCV and Matlab use different strategies to cope with this problem. OpenCV just replicates the pixels of the border, whereas Matlab assumes that all these pixels are zero.
So if you want to emulate the behaviour of Matlab in OpenCV, you can add this zero padding manually. There is even a dedicated function for this. Let me give you an example of how your code could be modified:
CvMat * Ccb = cvCreateMat(3, 3, CV_32FC1);
CvMat * Aab = cvCreateMat(3, 3, CV_32FC1);
cvCopyMakeBorder(Aa,Aab, cvPoint(0,0),IPL_BORDER_CONSTANT, cvScalarAll(0));
cvFilter2D(Aab, Ccb, Bb);
The result this gives is:
20.000 30.000 20.000
40.000 50.000 30.000
30.000 35.000 20.000
To get your intended result you just need to delete the first column and row to get rid of the additional data introduced by the border we added.
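As a cross-check that is independent of OpenCV, the three border options can be sketched as a direct, zero-padded 2-D convolution in plain Python. This is a reference sketch for small matrices, not an optimized implementation. The function name conv2 simply mirrors the Matlab name, and the 'same' and 'valid' crops assume Matlab's convention of taking the central part of the full result.

```python
def conv2(a, b, shape="full"):
    """Direct 2-D convolution of nested lists a and b, assuming zeros outside a."""
    ma, na = len(a), len(a[0])
    mb, nb = len(b), len(b[0])

    # Full convolution: every product a[i][j] * b[p][q] lands at (i + p, j + q)
    full = [[0.0] * (na + nb - 1) for _ in range(ma + mb - 1)]
    for i in range(ma):
        for j in range(na):
            for p in range(mb):
                for q in range(nb):
                    full[i + p][j + q] += a[i][j] * b[p][q]

    if shape == "full":
        return full
    if shape == "same":
        # Central part of the full result, same size as a
        r0, c0, rows, cols = mb // 2, nb // 2, ma, na
    elif shape == "valid":
        # Only the entries computed without any zero padding
        r0, c0 = mb - 1, nb - 1
        rows, cols = max(ma - mb + 1, 0), max(na - nb + 1, 0)
    else:
        raise ValueError("shape must be 'full', 'same' or 'valid'")
    return [row[c0:c0 + cols] for row in full[r0:r0 + rows]]

A = [[1, 2], [3, 4]]
B = [[5, 5], [5, 5]]
for shape in ("full", "same", "valid"):
    print(shape, conv2(A, B, shape))
```

For the matrices from the question, the 'same' crop reproduces the Matlab output [50 30; 35 20].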

clustering and matlab

I'm trying to cluster some data I have from the KDD 1999 cup dataset
the output from the file looks like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
with 48 thousand different records in that format. I have cleaned the data up and removed the text keeping only the numbers. The output looks like this now:
I created a comma-delimited file in Excel and saved it as a CSV file, then created a data source from the CSV file in MATLAB. I've tried running it through the fcm toolbox in MATLAB (findcluster outputs 38 data types, which is expected with 38 columns).
The clusters, however, don't look like clusters, or it's not accepting the data and working the way I need it to.
Could anyone help with finding the clusters? I'm new to MATLAB, so I don't have any experience, and I'm also new to clustering.
The method:
1. Choose the number of clusters (K)
2. Initialize centroids (K patterns randomly chosen from the data set)
3. Assign each pattern to the cluster with the closest centroid
4. Calculate the mean of each cluster to be its new centroid
5. Repeat from step 3 until a stopping criterion is met (no pattern moves to another cluster)
This is what I'm trying to achieve:
This is what I'm getting:
load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000
figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
figure
line(kddcup1(index1, 1), kddcup1(index1, 2), 'linestyle',...
'none','marker', 'o','color','g');
line(kddcup1(index2,1),kddcup1(index2,2),'linestyle',...
'none','marker', 'x','color','r');
hold on
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)
Since you are new to machine-learning/data-mining, you shouldn't tackle such advanced problems. After all, the data you are working with was used in a competition (KDD Cup'99), so don't expect it to be easy!
Besides, the data was intended for a classification task (supervised learning), where the goal is to predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally more difficult.
This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data. Directly applying simple algorithms like K-means will generally yield poor results.
For starters, you need to normalize the attributes to be of the same scale: when computing the Euclidean distance as part of step 3 in your method, the features with values such as 239 and 486 will dominate over the other features with small values such as 0.05, thus distorting the result.
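One common way to put attributes on the same scale is z-score normalization, sketched below in plain Python rather than MATLAB. The three rows are hypothetical (only 239, 486 and 0.05 come from the record in the question) and the helper names are made up for illustration. Other schemes such as min-max scaling would serve the same purpose.

```python
import math

def zscore_columns(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # A constant column has zero spread; map it to 0.0 instead of dividing by zero
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in rows]

def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rows: two large-valued features and one rate-like feature
raw = [
    [239.0, 486.0, 0.05],
    [235.0, 1337.0, 0.00],
    [219.0, 1337.0, 0.90],
]
scaled = zscore_columns(raw)

# Before scaling, the third feature barely affects the distance; afterwards it does
print(euclidean(raw[1], raw[2]), euclidean(scaled[1], scaled[2]))
```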
Another point to remember is that too many attributes can be a bad thing (curse of dimensionality). Thus you should look into feature selection or dimensionality reduction techniques.
Finally, I suggest you familiarize yourself with a simpler dataset...
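For experimenting on something small, the method listed in the question can be sketched in a few lines of plain Python. This is hard-assignment K-means, which is a different algorithm from the fuzzy c-means that MATLAB's fcm runs. The toy points, the function name kmeans and the seed are all made up for illustration.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    # Initialize centroids with K patterns randomly chosen from the data set
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign each pattern to the cluster with the closest centroid
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Stop when no pattern moves to another cluster
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Two well-separated toy groups
toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(toy, 2)
print(labels)
print(centroids)
```

On data this clean the two groups should be recovered regardless of the initial centroids. Real data such as the KDD records would need the preprocessing described above first.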
