I am trying to implement a convolutional neural network, and I don't understand why the im2col operation is more efficient. It basically stores the input patches to be multiplied by the filter in separate columns. But why shouldn't loops be used directly to calculate the convolution instead of first performing im2col?
Well, you are thinking in the right way. In AlexNet, almost 95% of the GPU time and 89% of the CPU time is spent in the convolutional and fully connected layers.
The convolutional and fully connected layers are implemented using GEMM, which stands for General Matrix to Matrix Multiplication.
So basically, we convert the convolution operation into a matrix multiplication by using a function called im2col(), which arranges the data in such a way that the convolution output can be obtained by matrix multiplication.
Now you may ask: instead of directly doing element-wise convolution, why add a step in between to arrange the data differently and then use GEMM?
The answer is that scientific programmers have spent decades optimizing code that performs large matrix-to-matrix multiplications, and the benefits of its very regular memory-access patterns outweigh any other losses.
We have an optimized CUDA GEMM API in the cuBLAS library, Intel MKL provides an optimized CPU GEMM, and clBLAS's GEMM API can be used on devices supporting OpenCL.
Element-wise convolution performs badly because of the irregular memory accesses it involves.
In contrast, im2col() arranges the data so that the memory accesses for the matrix multiplication are regular.
im2col() does add a lot of data redundancy, but the performance benefit of using GEMM outweighs that redundancy.
This is the reason for using the im2col() operation in neural nets.
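To make the rearrangement concrete, here is a minimal single-channel NumPy sketch (stride 1, no padding; the im2col function and toy sizes are our own illustration, not any particular library's API):

```python
import numpy as np

def im2col(x, kh, kw):
    # Rearrange every kh-by-kw patch of x into one column, so the whole
    # convolution collapses into a single matrix multiplication (GEMM).
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
k = np.ones((3, 3))                            # toy 3x3 filter
# One GEMM: (1 x 9) filter row times (9 x 9) patch matrix, then reshape.
# Note: like most deep-learning "convolutions", this is cross-correlation;
# flip the filter with k[::-1, ::-1] for convolution in the textbook sense.
out = (k.ravel() @ im2col(x, 3, 3)).reshape(3, 3)
print(out)
```

Each input pixel gets copied into up to kh*kw columns; that is exactly the data redundancy mentioned above.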
This link explains how Im2col() arranges the data for GEMM:
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
I have a matrix X, size 40-by-60000
while writing the SVM, I need to form a linear kernel: K = X'*X
And of course I get an error:
Requested 60000x60000 (26.8GB) array exceeds maximum array size preference.
How is it usually done? The data set is MNIST, so someone must have done this before. In this case rank(K) <= 40; I need a way to store K and later pass it to quadprog.
How is it usually done?
Usually, kernel matrices for big datasets are not precomputed. Since the optimisation methods used (like SMO or gradient descent) only need access to a subset of samples in each iteration, you simply need a data structure that acts as a lazy kernel matrix; in other words, each time the optimiser requests K[i,j], you compute K(x_i, x_j) at that moment. Often there are also caching mechanisms to make sure that frequently requested kernel values are already prepared.
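A minimal sketch of such a lazy kernel matrix with caching (Python/NumPy; the class and its names are our own illustration, not from any SVM library):

```python
import numpy as np

class LazyKernel:
    # A kernel "matrix" that computes entries on demand and caches them,
    # instead of materializing the full N x N array.
    def __init__(self, X, kernel):
        self.X = X            # one sample per row
        self.kernel = kernel
        self.cache = {}

    def __getitem__(self, idx):
        i, j = idx
        key = (min(i, j), max(i, j))   # kernel matrices are symmetric
        if key not in self.cache:
            self.cache[key] = self.kernel(self.X[i], self.X[j])
        return self.cache[key]

X = np.random.randn(60000, 40)        # samples as rows, as in the question
K = LazyKernel(X, kernel=np.dot)      # linear kernel K(x, z) = x'z
print(K[3, 17])                       # computed only when requested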
If you're willing to commit to a linear kernel (or any other kernel whose corresponding feature transformation is easily computed) you can avoid allocating O(N^2) memory by using a primal optimization method, which does not construct the full kernel matrix K.
Primal methods represent the model using a weighted sum of the training samples' features, and so take only O(N*D) memory, where N and D are the number of training samples and their feature dimension.
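For this specific question there is a further shortcut: K = X'*X has rank at most 40, so K never needs to be materialized; you only need its action on a vector, computed factor by factor. A NumPy/SciPy sketch (any solver that accepts matrix-vector products instead of an explicit 60000x60000 matrix can consume this):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

X = np.random.randn(40, 60000)   # 40 x 60000, as in the question

def K_matvec(v):
    # (X'X) v evaluated right-to-left: O(N*D) work, and the
    # 60000 x 60000 matrix K is never formed.
    return X.T @ (X @ v)

# Wrap as an implicit operator for solvers that only need K*v products.
K_op = LinearOperator((60000, 60000), matvec=K_matvec)
print(K_op.matvec(np.ones(60000)))
```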
You could also use liblinear (if you resolve the C++ issues).
Note this comment from their website: "Without using kernels, one can quickly train a much larger set via a linear classifier."
This problem occurs because of the large size of your data set: it exceeds the amount of RAM available on your system. Data processing performs better on 64-bit systems than on 32-bit ones, so you'll want to check which of the two your system is.
MATLAB is an effective tool for numerical experiments, so I find that many papers like using it to test the flop complexity of an algorithm (e.g., regression, SVD).
However, as I have learnt from others, MATLAB uses Intel MKL for matrix multiplication. This is highly optimized code that takes advantage of all the cores and their vector processing units (SSE/AVX) and is tuned for the CPU's cache layout.
This means that timing MATLAB directly cannot truly test flop complexity.
My question is then: how can I disable MKL, or something else in MATLAB, in order to test the flop complexity of an algorithm?
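One workaround is to estimate the empirical scaling exponent rather than disable MKL: time the operation at several sizes and fit the slope in log-log space (in MATLAB you can additionally call maxNumCompThreads(1) to rule out multithreading effects). A sketch, in Python/NumPy purely for illustration:

```python
import time
import numpy as np

# Time the operation at several sizes and fit the slope in log-log space;
# a dense matrix multiply should come out near exponent 3 even with an
# optimized BLAS, because the constant factor cancels out of the slope.
sizes = [256, 512, 1024, 2048]
times = []
for n in sizes:
    A, B = np.random.randn(n, n), np.random.randn(n, n)
    t0 = time.perf_counter()
    A @ B
    times.append(time.perf_counter() - t0)

slope = np.polyfit(np.log(sizes), np.log(times), 1)[0]
print(f"empirical exponent: {slope:.2f}")
```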
The theory from these links shows that the order in a convolutional network is: Convolutional Layer - Non-linear Activation - Pooling Layer.
Neural networks and deep learning (equation (125))
Deep learning book (page 304, 1st paragraph)
LeNet (the equation)
The source in this headline
But in the implementations from those same sites, the order is: Convolutional Layer - Pooling Layer - Non-linear Activation
network3.py
The source code, LeNetConvPoolLayer class
I've also tried to explore the conv2d operation's syntax, but there is no activation function there; it's only convolution with a flipped kernel. Can someone help me explain why this happens?
Well, max-pooling and monotonically increasing non-linearities commute. This means that MaxPool(Relu(x)) = Relu(MaxPool(x)) for any input, so the result is the same in that case. So it is technically better to first subsample through max-pooling and then apply the non-linearity (if it is costly, such as the sigmoid). In practice it is often done the other way round; it doesn't seem to change performance much.
As for conv2d, it does flip the kernel: it implements exactly the textbook definition of convolution. This is a linear operation, so you have to add the non-linearity yourself in the next step, e.g. theano.tensor.nnet.relu.
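A quick numeric check of the flip relationship, using SciPy's convolve2d/correlate2d as stand-ins for the Theano op (an assumption for illustration):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.random.randn(5, 5)
k = np.random.randn(3, 3)
# True convolution equals cross-correlation with the kernel flipped
# along both axes.
assert np.allclose(convolve2d(x, k, mode='valid'),
                   correlate2d(x, k[::-1, ::-1], mode='valid'))
```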
In many papers people use conv -> pooling -> non-linearity. It does not mean that you can't use another order and get reasonable results. In the case of a max-pooling layer and ReLU the order does not matter (both calculate the same thing):

ReLU(MaxPool(x)) = MaxPool(ReLU(x))

You can prove that this is the case by remembering that ReLU is an element-wise operation and a non-decreasing function, so

max(0, max(x_1, ..., x_k)) = max(max(0, x_1), ..., max(0, x_k))

The same thing happens for almost every activation function (most of them are non-decreasing). But it does not work for a general pooling layer (e.g. average-pooling).
Nonetheless, while both orders produce the same result, Activation(MaxPool(x)) does it significantly faster by performing fewer operations: for a pooling window of size k, it makes k^2 times fewer calls to the activation function.
Sadly, this optimization is negligible for a CNN, because the majority of the time is spent in the convolutional layers.
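A quick NumPy check of both claims (the pooling helpers below are our own toy implementations):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def maxpool2x2(x):
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def avgpool2x2(x):
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = np.random.randn(8, 8)
# Max-pooling commutes with the non-decreasing ReLU ...
assert np.allclose(relu(maxpool2x2(x)), maxpool2x2(relu(x)))
# ... but average-pooling generally does not.
print(np.allclose(relu(avgpool2x2(x)), avgpool2x2(relu(x))))  # usually False
```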
Max pooling is a sample-based discretization process. The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions.
My goal is to classify an image using a multi-class linear SVM (without a kernel). I would like to write my own SVM classifier.
I am using MATLAB and have trained a linear SVM using the image sets provided.
I have around 20 classes, 5 images in each class (total of 100 images) and I am using one-versus-all strategy.
Each image is a (112,92) matrix. That means 112*92=10304 values.
I am using quadprog(H,f,A,C) to solve the SVM's quadratic program (the decision function is y = w'x + b). One call to quadprog returns a w vector of size 10304 for one image, which means I have to call quadprog 100 times.
The problem is that one quadprog call takes 35 seconds to execute, so for 100 images it will take 3500 seconds. This might be due to the large size of the vectors and matrices involved.
I want to reduce the execution time of quadprog. Is there any way to do it?
First of all, when you do classification using an SVM, you usually extract a feature (like HOG) from the image, so that the dimensionality of the space the SVM has to operate on is reduced. You are using raw pixel values, which generates a 10304-D vector. That is not good; use some standard feature.
Secondly, you do not call quadprog 100 times; you call it only once. The idea behind the optimization is that you want to find a single weight vector w and bias b satisfying the margin constraints y_i(w'x_i + b) >= 1 for all training images x_i. Note that i goes from 1 to the number of examples in your training set, but w and b stay the same. If you wanted a (w,b) that satisfies only one x, you would not need any fancy optimization. So stack your x into a big matrix of dimension NxD, make y a vector of Nx1, and then call quadprog once. It will take longer than 35 seconds, but you do it only once. This is called training an SVM. When testing a new image, you just extract its feature and compute w'x + b to get its class.
Third, unless you are doing this as an exercise to understand SVMs, quadprog is not the best way to solve this problem. You should use techniques that work well with large data, for example Sequential Minimal Optimization (SMO).
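To illustrate the "stack everything and solve once" idea, here is a minimal NumPy sketch of one binary linear SVM trained in the primal with hinge-loss subgradient descent; it is a stand-in for the single QP, not the asker's quadprog call:

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-3, lr=1e-3, epochs=200):
    # Binary linear SVM trained in the primal by subgradient descent on
    # the regularized hinge loss. X: (N, D) stacked features for ALL
    # training samples at once; y: (N,) labels in {-1, +1}.
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # margin-violating samples
        grad_w = lam * w - (y[viol] @ X[viol]) / N
        grad_b = -y[viol].sum() / N
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

With one-vs-all over 20 classes, you would train 20 such (w, b) pairs, relabeling y each time.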
As for how one linear hyperplane can classify more than 2 classes: for multi-class classification, SVMs usually use two popular approaches (a short sketch follows the list):
One-vs-one SVM: for a C-class problem, you train C(C-1)/2 classifiers, and at test time you run all of them and choose the class that receives the most votes.
One-vs-all SVM: as the name suggests, you train one classifier per class, with positive samples from that class and negative samples from all other classes.
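As a hedged illustration of the two strategies, using scikit-learn (an assumption; the question itself is in MATLAB):

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X = np.random.randn(100, 64)         # stand-in features for the 100 images
y = np.repeat(np.arange(20), 5)      # 20 classes, 5 samples each

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_), len(ovr.estimators_))  # 190 vs. 20 classifiers
```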
From LIBSVM FAQs:
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
However, also note that a naive implementation of one-vs-one may not be practical for large-scale problems. LIBSVM website also lists this shortcoming and provides an extension.
LIBLINEAR does not support one-versus-one multi-classification, so we provide an extension here. If k is the number of classes, we generate k(k-1)/2 models, each of which involves only two classes of training data. According to Yuan et al. (2012), one-versus-one is not practical for large-scale linear classification because of the huge space needed to store k(k-1)/2 models. However, this approach may still be viable if model vectors (i.e., weight vectors) are very sparse. Our implementation stores models in a sparse form and can effectively handle some large-scale data.
I am trying to do parameter estimation of a nonlinear multidimensional dynamical model specified as an idnlgrey object. In particular, I'm using the 'lsqnonlin' estimator with the pem function.
I'm satisfied with both the accuracy and the performance when fitting a model of up to 8 dimensions.
The performance problem starts arising as the dimensionality grows (my objective would be scaling up to some hundreds of dimensions).
From the documentation I wasn't able to get a clear idea of whether pem itself can be run in parallel, nor whether it should be considered a memory-bound or a CPU-bound function.
I wonder if I can take advantage of the Parallel Computing Toolbox.