classifier.setOptions( weka.core.Utils.splitOptions(...) ) takes only the default values even if other values are provided in MATLAB

import weka.core.Instances.*
filename = 'C:\Users\Girish\Documents\MATLAB\DRESDEN_NSC.csv';
loader = weka.core.converters.CSVLoader();
loader.setFile(java.io.File(filename));
data = loader.getDataSet();
data.setClassIndex(data.numAttributes()-1);
%% classification
classifier = weka.classifiers.trees.J48();
classifier.setOptions( weka.core.Utils.splitOptions('-C 0.25 -M 2') );
classifier.buildClassifier(data);
classifier.toString()
ev = weka.classifiers.Evaluation(data);
v(1) = java.lang.String('-t');
v(2) = java.lang.String(filename);
v(3) = java.lang.String('-split-percentage');
v(4) = java.lang.String('66');
prm = cat(1,v(1:4));
ev.evaluateModel(classifier, prm)
Result:
Time taken to build model: 0.04 seconds
Time taken to test model on training split: 0.01 seconds
=== Error on training split ===
Correctly Classified Instances 767 99.2238 %
Incorrectly Classified Instances 6 0.7762 %
Kappa statistic 0.9882
Mean absolute error 0.0087
Root mean squared error 0.0658
Relative absolute error 1.9717 %
Root relative squared error 14.042 %
Total Number of Instances 773
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.994 0.009 0.987 0.994 0.990 0.984 0.999 0.999 Nikon
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 Sony
0.981 0.004 0.990 0.981 0.985 0.980 0.999 0.997 Canon
Weighted Avg. 0.992 0.004 0.992 0.992 0.992 0.988 1.000 0.999
=== Confusion Matrix ===
a b c <-- classified as
306 0 2 | a = Nikon
0 258 0 | b = Sony
4 0 203 | c = Canon
=== Error on test split ===
Correctly Classified Instances 358 89.9497 %
Incorrectly Classified Instances 40 10.0503 %
Kappa statistic 0.8482
Mean absolute error 0.0656
Root mean squared error 0.2464
Relative absolute error 14.8485 %
Root relative squared error 52.2626 %
Total Number of Instances 398
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.885 0.089 0.842 0.885 0.863 0.787 0.908 0.832 Nikon
0.993 0.000 1.000 0.993 0.997 0.995 0.997 0.996 Sony
0.796 0.060 0.841 0.796 0.818 0.749 0.897 0.744 Canon
Weighted Avg. 0.899 0.048 0.900 0.899 0.899 0.853 0.938 0.867
=== Confusion Matrix ===
a b c <-- classified as
123 0 16 | a = Nikon
0 145 1 | b = Sony
23 0 90 | c = Canon
import weka.core.Instances.*
filename = 'C:\Users\Girish\Documents\MATLAB\DRESDEN_NSC.csv';
loader = weka.core.converters.CSVLoader();
loader.setFile(java.io.File(filename));
data = loader.getDataSet();
data.setClassIndex(data.numAttributes()-1);
%% classification
classifier = weka.classifiers.trees.J48();
classifier.setOptions( weka.core.Utils.splitOptions('-C 0.1 -M 1') );
classifier.buildClassifier(data);
classifier.toString()
ev = weka.classifiers.Evaluation(data);
v(1) = java.lang.String('-t');
v(2) = java.lang.String(filename);
v(3) = java.lang.String('-split-percentage');
v(4) = java.lang.String('66');
prm = cat(1,v(1:4));
ev.evaluateModel(classifier, prm)
Result:
Time taken to build model: 0.04 seconds
Time taken to test model on training split: 0 seconds
=== Error on training split ===
Correctly Classified Instances 767 99.2238 %
Incorrectly Classified Instances 6 0.7762 %
Kappa statistic 0.9882
Mean absolute error 0.0087
Root mean squared error 0.0658
Relative absolute error 1.9717 %
Root relative squared error 14.042 %
Total Number of Instances 773
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.994 0.009 0.987 0.994 0.990 0.984 0.999 0.999 Nikon
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 Sony
0.981 0.004 0.990 0.981 0.985 0.980 0.999 0.997 Canon
Weighted Avg. 0.992 0.004 0.992 0.992 0.992 0.988 1.000 0.999
=== Confusion Matrix ===
a b c <-- classified as
306 0 2 | a = Nikon
0 258 0 | b = Sony
4 0 203 | c = Canon
=== Error on test split ===
Correctly Classified Instances 358 89.9497 %
Incorrectly Classified Instances 40 10.0503 %
Kappa statistic 0.8482
Mean absolute error 0.0656
Root mean squared error 0.2464
Relative absolute error 14.8485 %
Root relative squared error 52.2626 %
Total Number of Instances 398
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.885 0.089 0.842 0.885 0.863 0.787 0.908 0.832 Nikon
0.993 0.000 1.000 0.993 0.997 0.995 0.997 0.996 Sony
0.796 0.060 0.841 0.796 0.818 0.749 0.897 0.744 Canon
Weighted Avg. 0.899 0.048 0.900 0.899 0.899 0.853 0.938 0.867
=== Confusion Matrix ===
a b c <-- classified as
123 0 16 | a = Nikon
0 145 1 | b = Sony
23 0 90 | c = Canon
I get the same result with both option strings, and it is the result for the default options of the J48 classifier (i.e. -C 0.25 -M 2).
Please help! I have been stuck on this for a long time and have tried different approaches, but nothing has worked for me.
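One possible explanation (an assumption about Weka's behaviour, not something verified here): when Evaluation.evaluateModel(classifier, options) is called with an option array, it re-applies setOptions on the classifier using whatever scheme-specific options remain in that array, so options set earlier via splitOptions get overwritten by the defaults. If that is the case, passing the J48 options through the same array should preserve them, roughly like this:
v = javaArray('java.lang.String', 8);
v(1) = java.lang.String('-t');
v(2) = java.lang.String(filename);
v(3) = java.lang.String('-split-percentage');
v(4) = java.lang.String('66');
v(5) = java.lang.String('-C');   % scheme-specific options passed along
v(6) = java.lang.String('0.1');  % so they are not reset to the defaults
v(7) = java.lang.String('-M');
v(8) = java.lang.String('1');
ev = weka.classifiers.Evaluation(data);
ev.evaluateModel(classifier, v)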

Related

I'm using a version of pmdarima that no longer includes statsmodels ARIMA or ARMA class. How do I interpret SARIMAX without pdq?

auto_arima(df1['Births'],seasonal=False).summary()
SARIMAX Results
Dep. Variable: y No. Observations: 120
Model: SARIMAX Log Likelihood -409.745
Date: Mon, 23 Aug 2021 AIC 823.489
Time: 06:55:06 BIC 829.064
Sample: 0 - 120 HQIC 825.753
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept 39.7833 0.687 57.896 0.000 38.437 41.130
sigma2 54.1197 8.319 6.506 0.000 37.815 70.424
Ljung-Box (L1) (Q): 0.85 Jarque-Bera (JB): 2.69
Prob(Q): 0.36 Prob(JB): 0.26
Heteroskedasticity (H): 0.80 Skew: 0.26
Prob(H) (two-sided): 0.48 Kurtosis: 2.48
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
auto_arima(df1['Births'],seasonal=False)

Matlab to calculate a spectral line parameter for each layer

I need to calculate a parameter x (defined in my code below) for the given spectral lines in each layer. My atmospheric profile has 10 layers. I know how to calculate x for just one layer; that gives 5 values of x, one for each spectral line (wavelength).
Suppose I want to do this for all 10 layers. Then my output should have 10 rows and 5 columns, i.e. size (10,5), where 10 is the number of layers and 5 the number of spectral lines. Any suggestion would be greatly appreciated.
wl=[100 200 300 400 500]; %5 wavelengths, 5 spectral lines
br=[0.12 0.56 0.45 0.67 0.89]; % broadening parameter for each wavelength
p=[1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 ]; % pressure for 10 layers
T=[101 102 103 104 105 106 107 108 109 110]; % temperature for 10 layers
%suppose I want to caculate a parameter, x for all the layers
% x is defined as,( wavelength*br*T)/p
%when I do the calculation for the first layer,I have to consider all the
%wavelengths , all the broadening parameters and only the first value of
%pressure and only the first value of temperature
for i=1:5;
x(i)= (wl(i)*br(i)*T(1))/p(1);
end
% x is the x parameter for all the wavelengths in the first layer
%Now I want to calculate the x parameter for all the wavelengths in all 10
%layers
%my output should have 10 rows for 10 layers and 5 columns , size= (10,5)
You don't need a loop for this case:
>> (T./p)'*(wl.*br)
ans =
1.0e+05 *
0.0121 0.1131 0.1364 0.2707 0.4495
0.0136 0.1269 0.1530 0.3037 0.5043
0.0155 0.1442 0.1738 0.3451 0.5729
0.0178 0.1664 0.2006 0.3982 0.6611
0.0210 0.1960 0.2362 0.4690 0.7788
0.0254 0.2374 0.2862 0.5682 0.9434
0.0321 0.2996 0.3611 0.7169 1.1904
0.0432 0.4032 0.4860 0.9648 1.6020
0.0654 0.6104 0.7358 1.4606 2.4253
0.1320 1.2320 1.4850 2.9480 4.8950
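Why this works: (T./p)' is a 10-by-1 column (one value per layer) and wl.*br is a 1-by-5 row (one value per spectral line), so their product is the required 10-by-5 matrix. A quick sanity check against the explicit loop (this only re-derives the same numbers, nothing new is assumed):
x_vec = (T./p).' * (wl.*br);      % 10x5: layers in rows, spectral lines in columns
x_loop = zeros(10,5);
for k = 1:10
    for i = 1:5
        x_loop(k,i) = (wl(i)*br(i)*T(k))/p(k);
    end
end
max(abs(x_vec(:) - x_loop(:)))    % should be 0 (up to round-off)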

Numerical issue in MATLAB maximum likelihood estimation

I am using mle and mlecov to estimate the mean and variance of the scalar noise signal n which is assumed to be normally distributed with the following models for mean and standard deviation:
mean(x,y) = @(x,y) k(1)+k(2)*x+k(3)*x.^2+k(4)*y+k(5)*y.^2;
sd(x,y) = @(x,y) k(6)+k(7)*x+k(8)*x.^2+k(9)*y+k(10)*y.^2;
where x is in the [0,3] interval and y is in the [0,pi/2] interval (thus, scaling does not immediately seem to be an issue). The sample of n, x and y values used for MLE contains 10981 observations. Here are some graphs to show the sample qualitatively:
Figure 1. Histogram of the noise samples.
Figure 2. Scatter plot of the noise samples vs. the x and y samples respectively.
My goal is to compute the maximum likelihood estimates for the k(i) model parameters, i=1,...,10, as well as their standard deviation, kSE(i) (given by the square root of the diagonal elements of the asymptotic covariance matrix output by mlecov).
For the maximum likelihood estimation, I minimize the negative log likelihood, which for normally distributed noise is (dropping the additive constant):
L(k(1),...,k(10)) = sum over samples i of [ log(sd(x_i,y_i)) + (n_i - mean(x_i,y_i))^2 / (2*sd(x_i,y_i)^2) ]
I also give MATLAB the analytical gradient of the negative log likelihood L(k(1),...,k(10)), used by mle and mlecov such that numerical approximations of the gradient hopefully do not contribute to the numerical issue I am about to describe.
Numerical Issue
To demonstrate the issue, I present three scenarios.
Scenario 1. I directly run mle and mlecov on the sample data. This outputs the following Stata-like summary:
-----------------------------------------------------------------------------
Coeffs | Val. Std. Err. z P>|z| [95% Conf. Interval]
---------+-------------------------------------------------------------------
k1 | -0.0153 0.0014 -11.27 0.000 -0.0179 -0.0126
k2 | 0.0075 0.0016 4.79 0.000 0.0045 0.0106
k3 | 0.0045 0.0006 7.44 0.000 0.0033 0.0056
k4 | 0.0131 0.0023 5.57 0.000 0.0085 0.0177
k5 | -0.0101 0.0012 -8.45 0.000 -0.0125 -0.0078
k6 | 0.0114 0.0011 10.25 0.000 0.0092 0.0135
k7 | 0.0244 0.0011 21.86 0.000 0.0222 0.0266
k8 | -0.0001 0.0004 -0.34 0.732 -0.0010 0.0007
k9 | -0.0190 0.0018 -10.48 0.000 -0.0225 -0.0154
k10 | 0.0057 0.0009 6.32 0.000 0.0039 0.0074
-----------------------------------------------------------------------------
The "Val." column corresponds to the k(i) estimates and the "Std. Err." column corresponds to kSE(i). The "P>|z|" column gives the p-value for a single coefficient Wald test of the null hypothesis k(i)==0 (if this p-value is <0.05, we reject the null hypothesis and thus conclude that the coefficient k(i) may be significant at the 95% level).
Note that to compute the asymptotic covariance matrix of the k(i) estimates, mlecov computes the Hessian H of L(k(1),...,k(10)) - which I provide an analytic gradient for. The condition number of H is cond(H)=2.7437e3. The mlecov function does a Cholesky factorization of the Hessian, which gives the upper-triangular matrix R with cond(R)=52.38.
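For reference, the z, "P>|z|" and confidence-interval columns are just the standard Wald quantities computed from "Val." and "Std. Err.". A quick check in MATLAB using the k9 row of the table above (normcdf and norminv are Statistics Toolbox functions; the small differences vs. the table come from the rounding of the displayed values):
val = -0.0190;                            % "Val." column
se  =  0.0018;                            % "Std. Err." column
z   = val / se                            % z statistic (table: -10.48)
p   = 2 * (1 - normcdf(abs(z)))           % two-sided p-value, the "P>|z|" column
ci  = val + [-1 1] * norminv(0.975) * se  % 95% confidence interval (table: [-0.0225, -0.0154])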
Scenario 2. I multiply all samples by 0.1 and thus run mle and mlecov on the sample data n*0.1, x*0.1 and y*0.1. This outputs the following summary:
-----------------------------------------------------------------------------
Coeffs | Val. Std. Err. z P>|z| [95% Conf. Interval]
---------+-------------------------------------------------------------------
k1 | -0.0010 0.0001 -7.39 0.000 -0.0013 -0.0008
k2 | 0.0063 0.0016 3.97 0.000 0.0032 0.0093
k3 | 0.0494 0.0060 8.21 0.000 0.0376 0.0611
k4 | 0.0023 0.0024 0.95 0.340 -0.0024 0.0070
k5 | -0.0462 0.0123 -3.75 0.000 -0.0704 -0.0221
k6 | 0.0014 0.0001 12.30 0.000 0.0012 0.0016
k7 | 0.0220 0.0011 20.86 0.000 0.0200 0.0241
k8 | 0.0078 0.0042 1.87 0.062 -0.0004 0.0160
k9 | -0.0228 0.0020 -11.27 0.000 -0.0267 -0.0188
k10 | 0.0747 0.0097 7.70 0.000 0.0557 0.0937
-----------------------------------------------------------------------------
The p-values have changed. Also, now cond(H)=9.3831e5 (!!!) and cond(R)=968.6616. Note that when I remove the second order terms (x.^2 and y.^2) from the mean and standard deviation models, there is no longer this problem (i.e. the p-values stay the same and the k(i) values, except for the constant terms k(1) and k(6), are simply scaled by 0.1). Does this indicate a numerical issue?
Scenario 3. I decided to also try scaling n, x and y to the interval [-1,1] by dividing their samples by the largest element (i.e. n(i)=n(i)/max(abs(n)), x(i)=x(i)/max(abs(x)) and y(i)=y(i)/max(abs(y))). Running mle and mlecov on this scaled sample outputs the following summary:
-----------------------------------------------------------------------------
Coeffs | Val. Std. Err. z P>|z| [95% Conf. Interval]
---------+-------------------------------------------------------------------
k1 | -0.0347 0.0041 -8.40 0.000 -0.0428 -0.0266
k2 | 0.1193 0.0141 8.46 0.000 0.0917 0.1470
k3 | 0.0482 0.0164 2.94 0.003 0.0160 0.0803
k4 | -0.0002 0.0120 -0.02 0.987 -0.0238 0.0234
k5 | -0.0305 0.0103 -2.96 0.003 -0.0506 -0.0103
k6 | 0.0557 0.0035 16.11 0.000 0.0489 0.0624
k7 | 0.1131 0.0107 10.60 0.000 0.0922 0.1341
k8 | 0.1164 0.0128 9.13 0.000 0.0914 0.1414
k9 | -0.1132 0.0094 -11.99 0.000 -0.1317 -0.0947
k10 | 0.0583 0.0079 7.37 0.000 0.0428 0.0738
-----------------------------------------------------------------------------
The p-values have changed again! Now cond(H)=4.7550e3 (higher than Scenario 1 (unscaled) but lower than Scenario 2 (everything multiplied by 0.1)). Also, cond(R)=68.9565, which is only slightly higher than for Scenario 1.
My problem
The expected behavior across the three analyses, for me, is that k(i) and kSE(i) would change but the p-values would remain the same - in other words, scaling the data should not make any model coefficient more or less statistically significant. This is contrary to the above scenarios, where the p-values change each time!
Please help me to debug this numerical issue - or explain whether this is in fact the expected behavior and I have misunderstood something. Thank you for reading this long post and helping - I tried to encapsulate all relevant problem details here.
First, I assume you are controlling the random seed of the sampling so that's the same in all scenarios.
That taken care of, I think it may have something to do with the optimization problem you're trying to solve.
I have firsthand experience that tiny numerical changes (in my case, scaling the loglikelihood function by a factor, or equivalently: adding copies of all the datapoints) will change your result when the objective function is not convex.
I would try to derive the analytical gradient of the loglikelihood function in all of the parameters.
This should give you an idea of whether the optimization problem is convex.
If it is not convex, there are some things you can do to make sure you get the real MLE:
Optimize the function 1000 times and pick the estimate with the highest loglikelihood
Change the tolerance and number of steps of the optimizer
Try other optimizers, like trust-region searches or particle swarms
I would start by simulating a simpler version of this problem and build it up gradually to see where this behaviour starts happening. For example, start with just 1 parameter for the mean and 1 for the noise, and see what happens with the p values then.
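A minimal sketch of the first suggestion above (multi-start optimization). It assumes negLogLik is your negative log-likelihood as a function of the 10-element parameter vector k, and uses fminunc from the Optimization Toolbox; the starting-point scale is an assumption you should adapt to your problem:
nStarts = 1000;
bestNll = Inf;
bestK   = [];
opts = optimoptions('fminunc','Algorithm','quasi-newton','Display','off');
for s = 1:nStarts
    k0 = 0.1*randn(10,1);             % random initial guess (scale is an assumption)
    [k,nll] = fminunc(@negLogLik, k0, opts);
    if nll < bestNll                  % lowest negative log-likelihood = highest log-likelihood
        bestNll = nll;
        bestK   = k;
    end
end
bestK                                 % use this as the final estimate (and feed it to mlecov)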

How to plot a DET curve from results provided by Weka?

I am facing a classification problem with 4 classes. I used Weka for this classification and I get a result in this form:
Correctly Classified Instances 3860 96.5 %
Incorrectly Classified Instances 140 3.5 %
Kappa statistic 0.9533
Mean absolute error 0.0178
Root mean squared error 0.1235
Relative absolute error 4.7401 %
Root relative squared error 28.5106 %
Total Number of Instances 4000
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.98 0.022 0.936 0.98 0.957 0.998 A
0.92 0.009 0.973 0.92 0.946 0.997 B
0.991 0.006 0.982 0.991 0.987 1 C
0.969 0.01 0.971 0.969 0.97 0.998 D
Weighted Avg. 0.965 0.012 0.965 0.965 0.965 0.998
=== Confusion Matrix ===
a b c d <-- classified as
980 17 1 2 | a = A
61 920 1 18 | b = B
0 0 991 9 | c = C
6 9 16 969 | d = D
My goal now is to draw the DET (Detection Error Trade-off) curve from the results provided by Weka.
I found MATLAB code that allows me to draw the DET curve; here are some lines of code from this function:
Ntrials_True = 1000;
True_scores = randn(Ntrials_True,1);
Ntrials_False = 1000;
mean_False = -3;
stdv_False = 1.5;
False_scores = stdv_False * randn(Ntrials_False,1) + mean_False;
%-----------------------
% Compute Pmiss and Pfa from experimental detection output scores
[P_miss,P_fa] = Compute_DET(True_scores,False_scores);
The code of the function Compute_DET is:
function [Pmiss, Pfa] = Compute_DET(true_scores, false_scores)
num_true = max(size(true_scores));
num_false = max(size(false_scores));
total=num_true+num_false;
Pmiss = zeros(num_true+num_false+1, 1); %preallocate for speed
Pfa = zeros(num_true+num_false+1, 1); %preallocate for speed
scores(1:num_false,1) = false_scores;
scores(1:num_false,2) = 0;
scores(num_false+1:total,1) = true_scores;
scores(num_false+1:total,2) = 1;
scores=DETsort(scores);
sumtrue=cumsum(scores(:,2),1);
sumfalse=num_false - ([1:total]'-sumtrue);
Pmiss(1) = 0;
Pfa(1) = 1.0;
Pmiss(2:total+1) = sumtrue ./ num_true;
Pfa(2:total+1) = sumfalse ./ num_false;
return
But I have a problem understanding the meaning of the different parameters. For example, what is the significance of mean_False and stdv_False, and how do they correspond to the results produced by Weka?
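In the demo snippet above, True_scores and False_scores are just synthetic Gaussian scores (mean_False and stdv_False only parameterize the fake impostor-score distribution), so to use Compute_DET with Weka output they would have to be replaced by real per-instance scores. A rough, unverified sketch of how that could look in MATLAB, given a weka.classifiers.Evaluation object evl built as in the other examples on this page; one class is treated as the "true" class (one-vs-rest), and the exact prediction API (get() vs. elementAt()) depends on the Weka version:
targetClass = 0;                   % zero-based index of the class of interest
preds = evl.predictions();         % list of NominalPrediction objects
True_scores  = [];
False_scores = [];
for i = 0:preds.size()-1
    pr   = preds.get(i);                  % use preds.elementAt(i) in older Weka versions
    dist = double(pr.distribution());     % predicted class probabilities
    s    = dist(targetClass+1);           % score = probability of the target class
    if pr.actual() == targetClass
        True_scores(end+1,1)  = s;        %#ok<AGROW>
    else
        False_scores(end+1,1) = s;        %#ok<AGROW>
    end
end
[P_miss,P_fa] = Compute_DET(True_scores, False_scores);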

find exact numbers of true postive in weka

In Weka, I can easily find the TP rate and the total number of correctly classified instances from the confusion matrix, but is there any way to see the exact number of TP and/or TN?
And do you know of any way to find these values in MATLAB ANFIS?
Since you are mentioning MATLAB, I'm assuming you are using the Java API to the Weka library to programmatically build classifiers.
In that case, you can evaluate the model using the weka.classifiers.Evaluation class, which provides all sorts of statistics.
Assuming you already have the weka.jar file on the Java class path (see the javaaddpath function), here is an example in MATLAB:
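%# add Weka to the dynamic Java class path first, e.g. (the path below is
%# just an example; adjust it to your own install):
%#   javaaddpath('C:\Program Files\Weka-3-7\weka.jar')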
%# data
fName = 'C:\Program Files\Weka-3-7\data\iris.arff';
loader = weka.core.converters.ArffLoader();
loader.setFile( java.io.File(fName) );
data = loader.getDataSet();
data.setClassIndex( data.numAttributes()-1 );
%# classifier
classifier = weka.classifiers.trees.J48();
classifier.setOptions( weka.core.Utils.splitOptions('-C 0.25 -M 2') );
classifier.buildClassifier( data );
%# evaluation
evl = weka.classifiers.Evaluation(data);
pred = evl.evaluateModel(classifier, data, {''});
%# display
disp(classifier.toString())
disp(evl.toSummaryString())
disp(evl.toClassDetailsString())
disp(evl.toMatrixString())
%# confusion matrix and other stats
cm = evl.confusionMatrix();
%# number of TP/TN/FP/FN with respect to class=1 (Iris-versicolor)
tp = evl.numTruePositives(1);
tn = evl.numTrueNegatives(1);
fp = evl.numFalsePositives(1);
fn = evl.numFalseNegatives(1);
%# class=XX is a zero-based index which maps to the following class values
classValues = arrayfun(@(k)char(data.classAttribute.value(k-1)), ...
1:data.classAttribute.numValues, 'Uniform',false);
The output:
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Correctly Classified Instances 147 98 %
Incorrectly Classified Instances 3 2 %
Kappa statistic 0.97
Mean absolute error 0.0233
Root mean squared error 0.108
Relative absolute error 5.2482 %
Root relative squared error 22.9089 %
Coverage of cases (0.95 level) 98.6667 %
Mean rel. region size (0.95 level) 34 %
Total Number of Instances 150
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 Iris-setosa
0.980 0.020 0.961 0.961 0.961 0.955 0.990 0.969 Iris-versicolor
0.960 0.010 0.980 0.980 0.980 0.955 0.990 0.970 Iris-virginica
Weighted Avg. 0.980 0.010 0.980 0.980 0.980 0.970 0.993 0.980
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 2 48 | c = Iris-virginica
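As a follow-up on getting the exact counts: the same numbers can also be derived directly from the confusion matrix returned above (a small sketch; the double() conversion is just defensive, in case the Java 2-D array is not auto-converted by your MATLAB version):
%# per-class TP/FN/FP/TN from the confusion matrix (rows = actual, columns = predicted)
cm = double(evl.confusionMatrix());
k  = 2;                              %# 1-based index, e.g. 2 -> Iris-versicolor
tp = cm(k,k);
fn = sum(cm(k,:)) - tp;
fp = sum(cm(:,k)) - tp;
tn = sum(cm(:)) - tp - fn - fp;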