I am implementing multilabel text classification by training 4709 separate binary logistic regression classifiers in Sklearn, using HashingVectorizer(n_features=2**24, binary=True, ngram_range=(1,2)).
Accuracy is pretty good, but prediction latency is huge. The average sparsity ratio of the learned coefficient matrices is 0.967, and each has shape (1, 16777216). Using the built-in predict_proba function, prediction time for one entry is 147.9 secs (on a server with one Intel Xeon E5 2630v4). Most of the time (80%) is spent in scipy's sparse csc_tocsr function.
When I pre-process the matrices with:
cf[i] = sparse.csr_matrix(clf.coef_.T)
and infer the probabilities (I do not need normalization, just the order of the probabilities) directly with
prob[i] = x*cf[i]
it takes only 0.043 sec to run 407 classifiers (10% of them), but memory consumption is 25 GB, so I would need about 250 GB of RAM to keep all classifiers in memory.
Is there any way to speed up the decision function while keeping the matrices sparse, or some other way of pre-processing that would take less memory?
To answer my own question: I've found a simple solution. All coefficient matrices can be combined (stacked) into one matrix, which is then transposed and used for inference. So the loading/preparation code is:
coefs = sparse.vstack([c.coef_ for c in clfs])
coefs = coefs.T
where clfs are the logistic regression classifiers (but it works for any linear classifier). The inference code is simply:
x = vectorizer.fit_transform([q])
r = x*coefs
The inference time is less than 0.1 seconds in my case of 4709 classifiers.
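For completeness, here is a minimal end-to-end sketch of this approach (the explicit CSR conversion, the intercept handling, and the top-k selection are illustrative additions on my part, not required for the speed-up):

import numpy as np
from scipy import sparse

# Stack the (1, n_features) coefficient rows of all fitted linear classifiers
# and transpose, giving one (n_features, n_classes) matrix; keep it in CSR so
# the per-query product below stays a single sparse dot product.
coefs = sparse.vstack([sparse.csr_matrix(c.coef_) for c in clfs]).T.tocsr()

# Optional: include the intercepts so the scores equal decision_function values.
intercepts = np.array([c.intercept_[0] for c in clfs])

def scores(q):
    x = vectorizer.transform([q])          # HashingVectorizer is stateless
    s = (x * coefs).toarray().ravel()      # one score per classifier
    return s + intercepts                  # same ordering as predict_proba

# e.g. the ten most probable labels:
# top10 = np.argsort(scores(q))[::-1][:10]

Because the sigmoid is the same monotonic function for every classifier, ranking the raw scores gives the same label ordering as ranking the probabilities.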
Let's say that I have a neural network named 'NN' with 500 weights and biases (total parameters = 500).
For one training sample: it is fed through 'NN', it spits out an output (Out1), the output is compared to the training label, and with the backpropagation algorithm there is a small change (positive or negative) in every parameter of 'NN'. The cost function is represented by a vector of dimensions 1x500, with all the small modifications obtained by the backpropagation algorithm.
Let's say mini_batch_size = 10.
For one mini-batch: each and every one of the 10 training samples provides a cost function of dimensions 1x500.
In order to visualize and explain this better, let's say that we create a 10x500 matrix (called M), where every row is the cost function of one training sample.
Question: for the mini-batch training example, is the final cost function of the mini-batch the average of the elements in each column?
PS.
In case the question is not clear enough, I left some code showing exactly what I mean.
for j = 1:500
    Cost_mini_batch(j) = sum(M(:,j))/10;
end
The dimensions of Cost_mini_batch are 1x500.
"Cost" refers to the loss, i.e. the error between Out1 and the training label.
The cost function is represented by a vector of dimensions 1x500, with all the small modifications obtained by the backpropagation algorithm.
This is called "gradient", not cost function.
Question: for the mini-batch training example, is the final cost function of the mini-batch the average of the elements in each column?
Yes, both the gradient and the cost function for a mini-batch are averages over the examples in the mini-batch: the mini-batch cost is the average of the per-example losses, and the mini-batch gradient (your matrix M averaged over its rows, which is exactly what your loop computes) is the average of the per-example gradients.
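A small NumPy sketch of that averaging (M and the per-example losses are filled with dummy values here; they stand in for your backpropagation outputs):

import numpy as np

mini_batch_size, n_params = 10, 500

# One row per training sample: the per-example gradients from backpropagation
# (the 10x500 matrix M from the question, with dummy values here).
M = np.random.randn(mini_batch_size, n_params)

# Per-example losses, e.g. 0.5 * (desired - reached)^2 for each sample.
per_example_loss = np.random.rand(mini_batch_size)

minibatch_gradient = M.mean(axis=0)       # shape (500,), same result as the loop above
minibatch_loss = per_example_loss.mean()  # a single scalar

# One SGD step with learning rate 0.2 would then be:
# params = params - 0.2 * minibatch_gradient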
I am not familiar with deep learning, so this might be a beginner question.
In my understanding, the softmax function in multi-layer perceptrons is in charge of normalizing the outputs and distributing probability across the classes.
If so, why don't we use simple normalization?
Let's say we get a vector x = (10 3 2 1).
Applying softmax, the output will be y = (0.9986 0.0009 0.0003 0.0001).
Applying simple normalization (dividing each element by the sum, 16),
the output will be y = (0.625 0.1875 0.125 0.0625).
It seems like simple normalization could also distribute the probabilities.
So, what is the advantage of using the softmax function on the output layer?
Normalization does not always produce probabilities; for example, it doesn't work when you consider negative values. And what if the sum of the values is zero?
But taking the exponential of the logits changes that: the exponential is never zero, and it maps the full range of the logits into positive values that can always be normalized into probabilities. So softmax is preferred because it actually works.
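A quick NumPy illustration of both failure modes (negative values and a zero sum):

import numpy as np

def naive_normalize(x):
    return x / x.sum()

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

print(naive_normalize(np.array([10., 3., 2., 1.])))  # [0.625  0.1875 0.125  0.0625]
print(softmax(np.array([10., 3., 2., 1.])))          # [0.9986 0.0009 0.0003 0.0001]

# With negative logits, naive normalization leaves the [0, 1] range ...
print(naive_normalize(np.array([2., -1., 1.])))      # [ 1.  -0.5  0.5]
# ... and with a zero sum it divides by zero, while softmax is still well defined.
print(softmax(np.array([2., -1., -1.])))             # [0.909 0.045 0.045]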
This depends on the training loss function. Many models are trained with a log loss algorithm, so that the values you see in that vector estimate the log of each probability. Thus, SoftMax is merely converting back to linear values and normalizing.
The empirical reason is simple: SoftMax is used where it produces better results.
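To make the log-probability point concrete: if the outputs estimate log-probabilities up to an additive constant, softmax recovers exactly the original distribution, because the constant cancels:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p = np.array([0.7, 0.2, 0.1])     # some true class distribution
logits = np.log(p) + 5.0          # "log probabilities" shifted by an arbitrary constant

print(softmax(logits))            # [0.7 0.2 0.1] -- the shift cancels out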
So I have something like this,
y = l3*[sin(theta1)*cos(theta2)*cos(theta3) + cos(theta1)*sin(theta2)*cos(theta3) - sin(theta1)*sin(theta2)*sin(theta3) + cos(theta1)*cos(theta2)*sin(theta3)] + l2*[sin(theta1)*cos(theta2) + cos(theta1)*sin(theta2)] + l1*sin(theta1) + l0;
and something similar for x, where the theta_i are angles from specified intervals and the l_i are some coefficients. The task is to approximate the inverse of the equation: you set x and y, and the result should be the appropriate thetas. So I randomly generate thetas from the specified intervals and compute x and y. Then I normalize x and y into <-1, 1> and the thetas into <0, 1>. I use this data as the training set, so that the inputs of the network are the normalized x and y and the outputs are the normalized thetas.
I trained the network and tried different configurations, but the absolute error of the network was still around 24.9% after a whole night of training. That is so much that I don't know what to do.
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
Error backpropagation was used as the training algorithm. The neurons have sigmoid activation functions and the units are biased. I tried the topologies [2 50 3] and [2 100 50 3]; the training set has length 1000 and training ran for 1000 cycles (in one cycle I go through the whole dataset). The learning rate is 0.2.
The approximation error was computed as
sum of abs(desired_output - reached_output) / dataset_length.
The optimizer used is stochastic gradient descent.
The loss function is
1/2 * (desired - reached)^2
The network was implemented with my own Matlab template for neural networks. I know that is a weak point, but I'm confident the template is correct (it successfully solved the XOR problem, approximated differential equations, and approximated a state regulator). I show the template because this information may be useful.
Neuron class
Network class
EDIT:
I used 2500 unique data points within the theta ranges
theta1 <0, 180>, theta2 <-130, 130>, theta3 <-150, 150>.
I also experimented with a larger dataset, but the accuracy does not improve.
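For reference, here is a minimal NumPy sketch of the data generation and normalization described above. The y-equation simplifies to l3*sin(theta1+theta2+theta3) + l2*sin(theta1+theta2) + l1*sin(theta1) + l0; my assumptions are that the "similar" x-equation uses cos instead of sin, and the link coefficients l0..l3 below are made-up values:

import numpy as np

l0, l1, l2, l3 = 0.0, 1.0, 0.8, 0.5                 # made-up link coefficients
lo = np.radians([0.0, -130.0, -150.0])              # lower bounds of theta1..theta3
hi = np.radians([180.0, 130.0, 150.0])              # upper bounds

n = 2500
thetas = np.random.uniform(lo, hi, size=(n, 3))     # random joint angles
t1 = thetas[:, 0]
t12 = t1 + thetas[:, 1]
t123 = t12 + thetas[:, 2]

x = l3 * np.cos(t123) + l2 * np.cos(t12) + l1 * np.cos(t1) + l0
y = l3 * np.sin(t123) + l2 * np.sin(t12) + l1 * np.sin(t1) + l0

# Normalize inputs into <-1, 1> and targets into <0, 1>, as in the question.
xy = np.column_stack([x, y])
inputs = 2.0 * (xy - xy.min(0)) / (xy.max(0) - xy.min(0)) - 1.0
targets = (thetas - lo) / (hi - lo)

Note that with three joint angles and only two target coordinates the inverse mapping is not unique (several theta combinations can reach the same (x, y)), which by itself puts a floor on the error a plain regression network can reach on such data.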
I have read in the documentation of crossval that mcr = crossval('mcr',X,y,'Predfun',predfun) in Matlab calculates the misclassification rate. But if it is applied with 10-fold cross-validation, then we will have 10 different misclassification values, since we do 10 tests and each test produces a result, yet the value mcr is a single scalar. So does it take the average of the misclassification rates, or the minimum, etc.?
The average misclassification rate (across all folds and all Monte Carlo repartitions) is used. The following line of crossval shows the calculation of the average loss:
loss = sum(loss)/ (mcreps * sum(cvp.TestSize));
where loss is initially a vector of losses for each cross-validation fold and each repartition, mcreps is the number of repartitions and sum(cvp.TestSize) is the total size of the cross-validation test sets.
This is used for both the MSE (mean-squared error) and MCR loss functions.
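Not MATLAB, but the same weighted averaging can be illustrated in a couple of lines of Python (the per-fold numbers below are made up):

# Each fold contributes its summed loss (here: count of misclassified samples),
# and the final rate divides by the number of repartitions times the total test size.
fold_losses     = [3, 5, 2, 4, 1, 6, 2, 3, 4, 2]   # summed loss per fold (made up)
fold_test_sizes = [10] * 10                        # test samples per fold (made up)
mcreps = 1                                         # Monte Carlo repartitions

mcr = sum(fold_losses) / (mcreps * sum(fold_test_sizes))
print(mcr)   # 0.32 -- a single scalar, averaged over all folds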
I am working on people detection using two different features, HOG and LBP. I used an SVM to train on the positive and negative samples. Here I want to ask how to improve the accuracy of the SVM itself, because every time I add more positive and negative samples, the accuracy keeps decreasing. Currently my positive samples number 1500 and my negative samples 700.
%extract features
[fpos,fneg] = features(pathPos, pathNeg);
%train SVM
HOG_featV = loadingV(fpos,fneg); % loading and labeling each training example
fprintf('Training SVM..\n');
%L = ones(length(SV),1);
T = cell2mat(HOG_featV(2,:));
HOGtP = HOG_featV(3,:)';
C = cell2mat(HOGtP); % each row of C corresponds to a training example
%extract features from LBP
[LBPpos,LBPneg] = LBPfeatures(pathPos, pathNeg);
LBP_featV = loadingV(LBPpos, LBPneg);
LBPlabel = cell2mat(LBP_featV(2,:));
LBPtP = LBP_featV(3,:);
M = cell2mat(LBPtP)'; % each row of M corresponds to a training example
featureVector = [C M];
model = svmlearn(featureVector, T', '-t 2 -g 0.3 -c 0.5');
Does anyone know how to find the best C and gamma values for improving SVM accuracy?
Thank you.
To find the best C and gamma values for improving SVM accuracy you typically perform cross-validation. In short, you can leave one sample out, train on the rest, and test the SVM on that sample for each combination of parameters (the two parameters define a 2-D grid). Typically you would test each decade of the parameters over a certain range, for example C = [0.01, 0.1, 1, ..., 10^9] and gamma = [10^-5, 10^-4, ..., 1000]. Optimizing the hyper-parameters this way should also improve your SVM accuracy.
Looking again at your question, it seems you are using svmlearn from the machine learning toolbox (statistics toolbox) of Matlab, so you already have built-in functions for cross-validation. Have a look at: http://www.mathworks.co.uk/help/stats/support-vector-machines-svm.html
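For reference, the same grid-search-with-cross-validation idea sketched in Python/scikit-learn (the random data below is only a placeholder for your [C M] feature matrix and labels T):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for the concatenated HOG+LBP features and labels.
X = np.random.randn(200, 100)
y = np.random.choice([-1, 1], size=200)

# One value per decade for both hyper-parameters, as suggested above.
param_grid = {
    "C": 10.0 ** np.arange(-2, 10),       # 0.01 ... 1e9
    "gamma": 10.0 ** np.arange(-5, 4),    # 1e-5 ... 1e3
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)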
I followed ASantosRibeiro's method to optimize the parameters before and it works well.
In addition, you could try adding more negative samples until the proportion of negative to positive samples reaches 2:1. The reason is that in a real-time application you have to scan the whole image step by step, and the negative windows extracted this way are usually far more numerous than the windows that actually contain people.
Thus, adding more negative training samples is a straightforward but effective way to improve overall accuracy (fewer false positives and more true negatives).