Test Data Prediction Error in SciPy sparse matrix

I load data in LIBSVM format into a SciPy sparse matrix like this. The training set is multi-label and multi-class, as described in this question I asked:
Understanding format of data in scikit-learn
from sklearn.datasets import load_svmlight_file
X,Y = load_svmlight_file("train-subset100.csv", multilabel = True, zero_based = True)
Then I employ OneVsRestClassifier with LinearSVC to train the data.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)
Now when I want to test the data, I do the following.
X_, Y_ = load_svmlight_file("train-subset10.csv", multilabel = True, zero_based = False)
predicted = clf.predict(X_)
Here it gives me an error. I dump the traceback here as it is.
Traceback (most recent call last):
File "test.py", line 36, in
predicted = clf.predict(X_)
File "/usr/lib/pymodules/python2.7/sklearn/multiclass.py", line 151, in predict
return predict_ovr(self.estimators_, self.label_binarizer_, X)
File "/usr/lib/pymodules/python2.7/sklearn/multiclass.py", line 67, in predict_ovr
Y = np.array([_predict_binary(e, X) for e in estimators])
File "/usr/lib/pymodules/python2.7/sklearn/multiclass.py", line 40, in _predict_binary
return np.ravel(estimator.decision_function(X))
File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 728, in decision_function
self._check_n_features(X)
File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 748, in _check_n_features
X.shape[1]))
ValueError: X.shape[1] should be 3421, not 690.
I do not understand why it is looking for more features when the input format is a sparse matrix. How can I get it to predict the test labels correctly?

I solved the issue myself. The problem was that when datasets are loaded one by one in SVMLIGHT/LIBSVM format, the number of features is inferred from each file separately, so the training and test matrices can end up with different widths. There are two workarounds. The first is to load all of the data at once using load_svmlight_files:
X, Y, X_, Y_ = load_svmlight_files(["train-subset100.csv", "train-subset10.csv"],
                                   multilabel=True, zero_based=False)
The second is to specify the number of features explicitly:
X, Y = load_svmlight_file("train-subset100.csv", multilabel=True, zero_based=False)
X_, Y_ = load_svmlight_file("train-subset10.csv", n_features=X.shape[1],
                            multilabel=True, zero_based=False)
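With either workaround the training and test matrices end up with the same number of columns, which is exactly what the classifier checks at prediction time. As a quick sanity check, a sketch reusing the clf fitted earlier in the question:
assert X.shape[1] == X_.shape[1]  # train and test now have matching feature counts
predicted = clf.predict(X_)       # the "X.shape[1] should be ..." error is gone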

Related

Matlab giving wrong operation for mathematical equation

I am running a script in MATLAB R2020b. The script contains an array with the following values:
a=[500, 500, 500, 1000, 750, 750, 567.79 613.04]
The script has this equation:
(a(1)*(a(8)-a(6)) + a(7)*(a(6)-a(2))+ a(5)*(a(2)-a(8)))
When run in MATLAB, the above equation gives the answer -11312 for the values of array a.
But when I calculate each term separately and add them up, MATLAB gives a different answer.
a(1)*(a(8)-a(6)) = -68480
a(7)*(a(6)-a(2)) = 1.419e+05
a(5)*(a(2)-a(8)) = -84780
>>(-68480) + (1.419e+05) +(-84780)
the answer for the above is -11310.
A screenshot of the commands is also attached.
Can you kindly tell me why MATLAB gives these different answers?
The problem is that MATLAB's default display format is 'short', which does not show you full precision. The term a(7)*(a(6)-a(2)) is really 141947.5, not the rounded 1.419e+05 you copied back in; re-adding rounded display values by hand is why your manual sum (-11310) differs from the -11312.5 that MATLAB computes internally. Try format long.
>> format long
>> a(7)*(a(6)-a(2))
ans =
1.419475000000000e+05
MATLAB is not giving you a wrong result; the displayed numbers are just rounded.
If you use format long g you can see the actual values:
format long g
a=[500, 500, 500, 1000, 750, 750, 567.79 613.04]
res1=(a(1)*(a(8)-a(6)) + a(7)*(a(6)-a(2))+ a(5)*(a(2)-a(8)))
a2=a(7)*(a(6)-a(2))
a1=a(1)*(a(8)-a(6))
a3=a(5)*(a(2)-a(8))
res2=a1+a2+a3
results in:
res1 =
-11312.5
a2 =
141947.5
a1 =
-68480
a3 =
-84780
res2 =
-11312.5

Optimizing the convolution of a function with lmfit.Model or scipy.optimize.curve_fit

Using either lmfit.Model or scipy.optimize.curve_fit I have to optimize a function whose output needs to be convolved with some experimental data before being fit to some other experimental data. To sum up, the workflow is something like this:
(1) Function A is defined (for example, a Gaussian function).
(2) The output of function A is convolved with an experimental signal called data B.
(3) The parameters of function A are optimized for the convolution mentioned in (2) to perfectly match some other experimental data called data C.
I am convolving the output of function A with data B using Fourier transforms as follows:
from scipy.fftpack import fft, ifft

def convolve(data_B, function_A):
    # FFT-based circular convolution of the model output (function A) with data B.
    convolved = ifft(fft(data_B) * fft(function_A)).real
    return convolved
How can I use lmfit.Model or scipy.optimize.curve_fit to fit "convolved" to data C?
EDIT: In response to the submitted answer, I have incorporated my convolution step into the equation used for the fit in the following manner:
# 1-component exponential distribution:
def ExpDecay_1(x, ampl1, tau1, y0, x0, args=(new_y_irf)):  # new_y_irf is a list.
    h = np.zeros(x.size)
    lengthVec = len(new_y_decay)
    shift_1 = np.remainder(np.remainder(x - np.floor(x0) - 1, lengthVec) + lengthVec, lengthVec)
    shift_Incr1 = (1 - x0 + np.floor(x0)) * new_y_irf[shift_1.astype(int)]
    shift_2 = np.remainder(np.remainder(x - np.ceil(x0) - 1, lengthVec) + lengthVec, lengthVec)
    shift_Incr2 = (x0 - np.floor(x0)) * new_y_irf[shift_2.astype(int)]
    irf_shifted = (shift_Incr1 + shift_Incr2)
    irf_norm = irf_shifted / sum(irf_shifted)
    h = ampl1 * np.exp(-(x) / tau1)
    conv = ifft(fft(h) * fft(irf_norm)).real  # This is the convolution step.
    return conv
However, when I try this:
gmodel = Model(ExpDecay_1)
I get this:
Traceback (most recent call last):
File "", line 1, in
gmodel = Model(ExpDecay_1)
File "C:\Users\lopez\Anaconda3\lib\site-packages\lmfit\model.py", line 273, in __init__
self._parse_params()
File "C:\Users\lopez\Anaconda3\lib\site-packages\lmfit\model.py", line 477, in _parse_params
if fpar.default == fpar.empty:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
EDIT 2: I managed to make it work as follows:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
import numpy as np
from lmfit import Model
from scipy.fftpack import fft, ifft

def Test_fit2(x, arg=new_y_irf, data=new_y_decay, num_decay=1):
    IRF = arg
    DATA = data

    def Exp(x, ampl1=1.0, tau1=3.0):  # This generates an exponential model.
        res = ampl1 * np.exp(-x / tau1)
        return res

    def Conv(IRF, decay):  # This convolves a model with the data (data = Instrument Response Function, IRF).
        conv = ifft(fft(decay) * fft(IRF)).real
        return conv

    if num_decay == 1:  # If the user chooses to use a model equation with one exponential term.
        def fitting(x, ampl1=1.0, tau1=3.0):
            exponential = Exp(x, ampl1, tau1)
            convolved = Conv(IRF, exponential)
            return convolved
        modelling = Model(fitting)
        res = modelling.fit(DATA, x=new_x_decay, ampl1=1.0, tau1=2.0)

    if num_decay == 2:  # If the user chooses to use a model equation with two exponential terms.
        def fitting(x, ampl1=1.0, tau1=3.0, ampl2=1.0, tau2=1.0):
            exponential = Exp(x, ampl1, tau1) + Exp(x, ampl2, tau2)
            convolved = Conv(IRF, exponential)
            return convolved
        modelling = Model(fitting)
        res = modelling.fit(DATA, x=new_x_decay, ampl1=1.0, tau1=2.0)

    if num_decay == 3:  # If the user chooses to use a model equation with three exponential terms.
        def fitting(x, ampl1=1.0, tau1=3.0, ampl2=2.0, tau2=1.0, ampl3=3.0, tau3=5.0):
            exponential = Exp(x, ampl1, tau1) + Exp(x, ampl2, tau2) + Exp(x, ampl3, tau3)
            convolved = Conv(IRF, exponential)
            return convolved
        modelling = Model(fitting)
        res = modelling.fit(DATA, x=new_x_decay, ampl1=1.0, tau1=2.0)

    if num_decay == 4:  # If the user chooses to use a model equation with four exponential terms.
        def fitting(x, ampl1=1.0, tau1=0.1, ampl2=2.0, tau2=1.0, ampl3=3.0, tau3=5.0, ampl4=1.0, tau4=10.0):
            exponential = Exp(x, ampl1, tau1) + Exp(x, ampl2, tau2) + Exp(x, ampl3, tau3) + Exp(x, ampl4, tau4)
            convolved = Conv(IRF, exponential)
            return convolved
        modelling = Model(fitting)
        res = modelling.fit(DATA, x=new_x_decay, ampl1=1.0, tau1=2.0)

    return res
It is always helpful to post a complete, minimal example of what you are trying to do. Without a complete example, only vague answers are possible.
You could simply do the convolutions in your model function that is wrapped by lmfit.Model, passing in the kernel array to use in the convolution. Or you could create a convolution kernel and function, and do the convolution as part of the modeling process, as described for example at https://lmfit.github.io/lmfit-py/examples/documentation/model_composite.html
I would imagine that the first approach is easier if the kernel is not actually meant to be changed during the fit, but it's hard to know that for sure without more details.
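Here is a minimal sketch of that first approach, with placeholder arrays standing in for the experimental data (x, kernel for data B, and data_C for data C are all hypothetical names, not from the original post). Passing the kernel as an extra independent variable keeps lmfit from trying to turn the array into a fit parameter, which is what raised the ValueError above:
import numpy as np
from lmfit import Model
from scipy.fftpack import fft, ifft

def gauss_conv(x, amp=1.0, cen=5.0, wid=1.0, kernel=None):
    # Function A (a Gaussian) convolved with a fixed kernel (data B) via FFT.
    gauss = amp * np.exp(-(x - cen)**2 / (2 * wid**2))
    return ifft(fft(gauss) * fft(kernel)).real

# Placeholder data: an exponential kernel and synthetic "data C" to fit against.
x = np.linspace(0, 10, 256)
kernel = np.exp(-x)
kernel /= kernel.sum()
data_C = gauss_conv(x, amp=2.0, cen=4.0, wid=0.7, kernel=kernel)

# Declaring 'kernel' as an independent variable means it is passed through
# unchanged instead of being treated as a parameter to optimize.
gmodel = Model(gauss_conv, independent_vars=['x', 'kernel'])
result = gmodel.fit(data_C, x=x, kernel=kernel, amp=1.0, cen=5.0, wid=1.0)
print(result.fit_report())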

Using Julia Flux to build a simple neural network

I have a dataset of images (https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria), and I want to use a neural network to tell whether a picture shows an uninfected cell or not.
So I arranged my data into 4 variables:
X_tests, Y_tests, X_training, Y_training
Each of these variables is of type Array{Array{Float64,1},1}.
And I have a function to build a simple neural network (adapted from this example: https://smist08.wordpress.com/2018/09/24/julia-flux-for-machine-learning/):
function simple_nn(X_tests, Y_tests, X_training, Y_training)
    input = 100*100*3
    hl1 = 32
    m = Chain(
        Dense(input, 32, relu),
        Dense(32, 2),
        softmax) |> gpu
    loss(x, y) = crossentropy(m(x), y)
    accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
    dataset = [(X_training, Y_training)]
    evalcb = () -> @show(loss(X_training, Y_training))
    opt = ADAM(params(m))
    Flux.train!(loss, dataset, opt, cb = throttle(evalcb, 10))
    println("acc X,Y ", accuracy(X_training, Y_training))
    println("acc tX, tY ", accuracy(X_tests, Y_tests))
end
And I get this error after executing simple_nn(X_tests, Y_tests, X_training, Y_training):
ERROR: DimensionMismatch("matrix A has dimensions (32,30000), vector B has length 2668")
...
The error is on this line: Flux.train!(loss, dataset, opt, cb = throttle(evalcb, 10))
I don't know what these functions are doing, what arguments they take, or what they return, and I can't find any documentation on the internet, only examples.
So I have two questions: how can I make this work for my dataset, and is there documentation for Flux functions, like there is for sklearn (for example: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)?
Can you provide a self-contained MWE? I think your X_training is not of dimension 3*100*100 by something; it is in fact 2688 by something.
Your first layer is Dense(input, 32, relu) and input is 3*100*100, so it expects an input where one of the dimensions is 3*100*100, which yours does not satisfy.
Maybe try to replace
dataset = [(X_training,Y_training)]
with
dataset = zip(X_training,Y_training)
zip pairs each element of X_training with the corresponding element of Y_training, turning a tuple of vectors into an iterator of (x, y) tuples. I would guess that your training data has 2688 samples?

Keras IndexError: indices are out-of-bounds

I'm new to Keras and I'm trying to train a binary MLP on a dataset, and I keep getting an indices-out-of-bounds error with no idea why.
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop')
model.fit(trainx, trainy, nb_epoch=20, batch_size=16) # THROWS INDICES ERROR
Error:
model.fit(trainx, trainy, nb_epoch=20, batch_size=16)
Epoch 1/20
Traceback (most recent call last):
File "<ipython-input-6-c81bd7606eb0>", line 1, in <module>
model.fit(trainx, trainy, nb_epoch=20, batch_size=16)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 646, in fit
shuffle=shuffle, metrics=metrics)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 271, in _fit
ins_batch = slice_X(ins, batch_ids)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 65, in slice_X
return [x[start] for x in X]
File "C:\Users\Thiru\Anaconda3\lib\site-packages\keras\models.py", line 65, in <listcomp>
return [x[start] for x in X]
File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1963, in __getitem__
return self._getitem_array(key)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2008, in _getitem_array
return self.take(indexer, axis=1, convert=True)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1371, in take
convert=True, verify=True)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3619, in take
indexer = maybe_convert_indices(indexer, n)
File "C:\Users\Thiru\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1750, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
Does anyone have any idea why this is happening? I'm able to run other models just fine.
Answer from the comments: trainx and trainy should be NumPy arrays. You can convert a data frame to a NumPy array using the as_matrix() method. I also faced this issue; it's odd that Keras does not give a meaningful error message.
I came here looking for a resolution to the same issue with auto-sklearn and a pandas DataFrame. The solution is to pass the DataFrame as X.values, i.e. fit(X.values, y).
From the official Keras Page:
Keras models are trained on Numpy arrays of input data and labels. For training a model, you will typically use the fit function.
To convert a pandas DataFrame to a NumPy array you can use np.array(dataframe). For example:
x_train = np.array(x_train)
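For example, a minimal sketch with hypothetical stand-in data (20 features and binary labels, shaped to match the model compiled in the question) that shows the conversion before calling fit:
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's trainx / trainy DataFrames.
trainx = pd.DataFrame(np.random.rand(100, 20))
trainy = pd.Series(np.random.randint(0, 2, size=100))

# Plain NumPy arrays avoid the DataFrame indexing that raises
# "IndexError: indices are out-of-bounds" inside Keras' batch slicing.
model.fit(np.array(trainx), np.array(trainy), nb_epoch=20, batch_size=16)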

Libsvm dummy labels interfering with prediction

I'm trying to simulate out-of-sample prediction of a binary classifier using libsvm in MATLAB. My target variable (i.e. my label) is binary (-1/+1), and for the series in my test set I don't know the label, so I created a new dummy label (747) for those observations. I found that this 747 label also appears in my predicted_label_test vector (see code below). Does that mean the prediction I get is influenced by the labels included in the test set, which are exactly what I am supposed to predict? The mistake may be in the way I use the libsvm read and write functions, but I can't find it. Many thanks!
%%%%%%%%%% GET DATA FROM THE CSV FILE AND CONVERT THEM TO LIBSVM
addpath('C:\libsvm1\matlab'); % add the libsvm MATLAB interface to the path
ALLDATA = csvread('DATACSV.csv'); % read a csv file
labels = ALLDATA(:, 1); % labels are included in the first column of data
labels_sparse = sparse(labels); % is this needed?
features = ALLDATA(:, 4:end); % features start at 4th column
features_sparse = sparse(features); % features must be in a sparse matrix
libsvmwrite('TTT.train', labels_sparse, features_sparse); % write the file to libsvm format
[label_vector, predictors_matrix] = libsvmread('C:\libsvm1\matlab\TTT.train'); % read the file that was recorded in Libsvm format
%%%%% DEFINE VECTOR AND MATRIX SIZE
label_vector_train = label_vector (1:143,:);
predictors_matrix_train = predictors_matrix (1:143,:);
label_vector_test = label_vector (144:193,:);
predictors_matrix_test = predictors_matrix (144:193,:);
%PREDICTION
param = ['-q -c 2 -g 3'];
bestModel = svmtrain(label_vector_test, predictors_matrix_test, param);
[predicted_label_test, accuracy, prob_values] = svmpredict(label_vector_test, predictors_matrix_test, bestModel);
You are training the SVM model on test data, when you should train it on training data. Because the model is trained on the test set, the dummy 747 label is learned as a class of its own, which is why it shows up in predicted_label_test. This line:
bestModel = svmtrain(label_vector_test, predictors_matrix_test, param);
should be:
bestModel = svmtrain(label_vector_train, predictors_matrix_train, param);