It returns 'ValueError: Found input variables with inconsistent numbers of samples: [5472, 1824]'

Hello, I have a problem with this part of my code. When I run the metrics to assess the KNN algorithm on my data, Jupyter says:
print('R^2:', metrics.r2_score(y_train, y_pred))
print('MAE:', metrics.mean_absolute_error(y_train, y_pred))
print('MSE:', metrics.mean_squared_error(y_train, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
"Found input variables with inconsistent numbers of samples: [5472, 1824]"
I have tried a lot of things.
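A likely reading of the numbers (not stated in the question): 5472 is the length of y_train and 1824 the length of y_pred, i.e. predictions made on a 25% test split but scored against the training labels. A minimal sketch of the usual fix, with hypothetical variable names and synthetic data standing in for the asker's setup:

import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: 7296 rows = 5472 train + 1824 test (a 75/25 split).
X = np.random.rand(7296, 3)
y = np.random.rand(7296)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

knn = KNeighborsRegressor().fit(X_train, y_train)
y_pred = knn.predict(X_test)  # 1824 predictions, one per test row

# Score predictions against y_test (same length), not y_train.
print('R^2:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))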

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with PySpark and SparkML at the moment. To do so, I use the Titanic dataset to train a GLM for predicting the 'Fare' column in that dataset.
I'm following the Spark documentation closely. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
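As an illustration of that check (not part of the original answer), one way to spot exact collinearity before fitting is to compare the rank of the feature matrix with its number of columns, sketched here with numpy:

import numpy as np

# Hypothetical feature matrix: the third column is an exact linear
# combination of the first two, so the matrix is rank-deficient.
X = np.random.rand(100, 2)
X = np.hstack([X, 2 * X[:, [0]] + 3 * X[:, [1]]])

rank = np.linalg.matrix_rank(X)
if rank < X.shape[1]:
    print("Collinear features: rank", rank, "<", X.shape[1], "columns")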
The GeneralizedLinearRegressionModel docs note that a summary may not be available for a model.
However, you can do an initial check to avoid the error: glm_fit.hasSummary, a public boolean (exposed as a property in the Python API, so no parentheses).
Using it as:
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the PySpark source code, and to the GeneralizedLinearRegressionTrainingSummary class source code where the error is thrown.
Make sure the input variables for your one-hot encoder start from 0.
One mistake I made that prevented the summary from being created: I fed quarter values (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
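For illustration (an assumed setup, not the answerer's actual code), shifting the category values to a 0-based index before one-hot encoding might look like this; the Spark 3.x OneHotEncoder API is assumed:

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: quarter stored as 1-4.
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["quarter"])

# Shift to 0-based category indices; fed in as 1-4, index 0 would never
# occur and one column of the encoded vector would always be 0.
df = df.withColumn("quarter_idx", (df["quarter"] - 1).cast("double"))

encoder = OneHotEncoder(inputCols=["quarter_idx"], outputCols=["quarter_vec"])
encoded = encoder.fit(df).transform(df)
encoded.show()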

Calling function and getting - not enough input arguments, even though syntax is correct

I'm teaching myself classification. I read and understood the MATLAB online help for the simple LDA classifier, which uses the Fisher iris dataset.
I have now moved on to SVM. But even though I use the exact syntax from the help page, I get an error of either not enough or too many input arguments.
I trained my SVM classifier using svmtrain via the command:
SVMStruct = svmtrain(training,labels);
Where training is a 207-by-900 training matrix: 207 samples and 900 HoG descriptors (features). Similarly, labels is a 207-by-1 column vector consisting of either +1 or -1 for the respective samples.
I then wanted to test it and see if this works by calling:
Group = svmclassify(SVMStruct,sample,'Showplot',true)
Where sample is a 2-by-900 matrix containing 2 test samples. I was expecting to get +1 and -1, as these are the labels the test samples should receive. But I get the error:
Too many input arguments.
And when I use the command
Group = svmclassify(SVMStruct,sample)
I get the error
Not enough input arguments.
You might have overloaded the svmclassify function.
Try:
>> which svmclassify
to verify that you are actually calling the right function.
In case you did overload the function (that is, created a different function with the same name svmclassify) and it is located higher on your path, you'll need to rename the overloaded function and run svmclassify again.

Pybrain outputs same result for any input

I am trying to train a simple neural network with Pybrain. After training, I want to confirm that the network is working as intended, so I activate it on the same data that I used for training. However, every activation outputs the same result. Am I misunderstanding a basic concept about neural networks, or is this by design?
I have tried altering the number of hidden nodes, the hiddenclass type, the bias, the learningrate, the number of training epochs and the momentum to no avail.
This is my code...
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# Network with 2 inputs, 3 hidden nodes, 1 output.
net = buildNetwork(2, 3, 1)
net.randomize()

# Three training samples: two inputs mapping to one target each.
ds = SupervisedDataSet(2, 1)
ds.addSample([77, 78], 77)
ds.addSample([78, 76], 76)
ds.addSample([76, 76], 75)

# Train for up to 1000 epochs, stopping early once the error is small.
trainer = BackpropTrainer(net, ds)
for epoch in range(0, 1000):
    error = trainer.train()
    if error < 0.001:
        break

print net.activate([77, 78])
print net.activate([78, 76])
print net.activate([76, 76])
This is an example of what the results can be. As you can see, the output is the same even though the activation inputs are different:
[ 75.99893007]
[ 75.99893007]
[ 75.99893007]
I had a similar problem. I was able to improve the accuracy (i.e. get a different answer for each input) by normalizing/standardizing the input and output of the neural network.
Doing this allows the neural network to determine the internal weights and values of the network more accurately in order to predict the answers.
Here's an article that explains it in more detail: http://visualstudiomagazine.com/articles/2014/01/01/how-to-standardize-data-for-neural-networks.aspx
In the end I solved this by normalizing the data between 0 and 1 and also training until the error rate hit 0.00001. It takes much longer to train, but I do get accurate results now.
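As an illustrative sketch of that normalization (not code from the thread), min-max scaling the samples from the question into [0, 1] and mapping a prediction back might look like this:

import numpy as np

# Raw samples from the question: inputs and targets in the 75-78 range.
inputs = np.array([[77, 78], [78, 76], [76, 76]], dtype=float)
targets = np.array([77, 76, 75], dtype=float)

# Min-max scale everything into [0, 1] before training.
lo, hi = 75.0, 78.0
inputs_scaled = (inputs - lo) / (hi - lo)
targets_scaled = (targets - lo) / (hi - lo)

# ... train on the scaled values, then map a prediction back:
prediction_scaled = 0.5  # placeholder for the net.activate(...) output
prediction = prediction_scaled * (hi - lo) + lo
print(prediction)  # 76.5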

Matlab Weka Interface AdaBoost Issues: Out of Bounds Exception

I'm doing some cross-validation using a MATLAB-Weka interface that I got from File Exchange. My loop structure seems to work fine with Weka's Logistic classifier. However, when I try to do the exact same thing with AdaBoostM1, it throws the following error:
??? Java exception occurred: java.lang.ArrayIndexOutOfBoundsException
Error in ==> wekaClassify at 24 classProbs(t+1,:) = (classifier.distributionForInstance(testData.instance(t)))';
Error in ==> classifier_search at 225 [pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I have determined through some testing that this only occurs when the number of instances in the training set is greater than the number of instances in the test set. I am sure you can see why that is a problem, since in most situations the training set is larger than the test set.
Is there something different about how I should format my inputs when using AdaBoost rather than Logistic? Any information you can give regarding this problem would be very helpful.
I downloaded this code from this page: http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface
Emails bounce from the account of the guy who made it, and he doesn't seem to respond to comments on the page - I'm hoping that maybe someone here has used this.
EDIT: Here is the code that I use to train and test the classifier:
classifier = trainWekaClassifier(matlab2weka('training', featurelabels, train), 'meta.AdaBoostM1', { strcat('-P 100 -S 1 -I ', num2str(r), '-W weka.classifiers.trees.DecisionStump')});
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I haven't used this combination of software, so I can only take a guess at what could cause this.
Are your training/testing data matrices the right way round? They should be N-by-D (N instances, D features).
If you were passing in a D-by-N training matrix and a D-by-M testing matrix, then I would expect it to work only when M < N - which is what you describe - and even then, it wouldn't give a meaningful result.

How does List::Util 'shuffle' actually work?

I am currently working on building a classifier using C5.0. I have a dataset of 8000 entries, and each entry has its own ID number (1-8000). When testing the performance of the classifier I had to make 5 sets of 10:90 (training data : test data) splits. Of course, no training case can appear again among the test cases, and duplicates cannot occur in either set.
To solve the problem of picking examples at random for the training data, while making sure the same ones cannot be picked for the test data, I developed a horribly slow method:
fill a file with numbers from 1-8000 on separate lines.
randomly pick a line number (from a range of 1-8000) and use the contents of the line as the id number of the training example.
write all unpicked numbers to a new file
decrement the range of the random number generator by 1
redo
Then all unpicked numbers are used as test data. It works, but it's slow. To speed things up I could use List::Util 'shuffle' to just 'randomly' shuffle an array of these numbers. But how random is 'shuffle'? It is essential that the same level of accuracy is maintained. Sorry about the essay, but does anyone know how 'shuffle' actually works? Any help at all would be great.
Here is the shuffle algorithm used in List::Util::PP:
sub shuffle (@) {
    my @a = \(@_);        # references to the original elements
    my $n;
    my $i = @_;           # count of elements not yet placed
    map {
        $n = rand($i--);  # random index among the unplaced elements
        (${$a[$n]}, $a[$n] = $a[$i])[0];  # emit that element, move the tail into its slot
    } @_;
}
Which looks like a Fisher-Yates shuffle.
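For illustration (not part of the original answer), here is the same Fisher-Yates idea in Python, applied to the asker's use case of splitting 8000 IDs into a 10:90 train/test split; the split ratio and ID range are taken from the question:

import random

def fisher_yates_shuffle(items):
    """Shuffle a list in place: walking from the end to the front,
    swap each position with a random earlier-or-equal position."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)  # 0 <= j <= i, inclusive
        items[i], items[j] = items[j], items[i]

ids = list(range(1, 8001))  # ID numbers 1-8000, as in the question
fisher_yates_shuffle(ids)

# A 10:90 split: the first 800 shuffled IDs train, the rest test.
train_ids, test_ids = ids[:800], ids[800:]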