MATLAB idnlgrey multidimensional pem: parallelization

I am trying to do parameter estimation of a nonlinear multidimensional dynamical model specified as an idnlgrey object. In particular, I'm using the 'lsqnonlin' estimator with the pem function.
I'm satisfied with both the accuracy and the performance when fitting a model of up to 8 dimensions.
The performance problems start as the dimensionality grows (my objective would be to scale up to a few hundred dimensions).
From the documentation I wasn't able to get a clear idea of whether pem itself can be run in parallel, nor is it clear whether it should be considered a memory-bound or CPU-bound function.
I wonder if I can take advantage of the Parallel Computing Toolbox.
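For reference, here is a minimal self-contained sketch of the workflow in question. The toy first-order model and all numbers are purely illustrative, and setting Algorithm.SearchMethod applies to the older idnlgrey interface (newer releases use nlgreyest with nlgreyestOptions instead):

    % Toy continuous-time grey-box model: dx/dt = -p*x + u, y = x (illustrative only).
    f = @(t, x, u, p, varargin) deal(-p(1)*x + u, x);
    true_sys = idnlgrey(f, [1 1 1], 2.0);        % [ny nu nx], "true" p = 2
    u = ones(101, 1);
    y = sim(true_sys, iddata([], u, 0.1));       % simulate toy data, Ts = 0.1 s
    data = iddata(y.OutputData, u, 0.1);
    init = idnlgrey(f, [1 1 1], 0.5);            % initial guess p = 0.5
    init.Algorithm.SearchMethod = 'lsqnonlin';   % the estimator mentioned above
    est = pem(data, init);                       % parameter estimation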

Related

How is using im2col operation in convolutional nets more efficient?

I am trying to implement a convolutional neural network and I don't understand why using the im2col operation is more efficient. It basically stores the input patches to be multiplied by the filter in separate columns. But why shouldn't loops be used directly to calculate the convolution instead of first performing im2col?
Well, you are thinking in the right way. In AlexNet, almost 95% of the GPU time and 89% of the CPU time is spent in the convolutional and fully connected layers.
The convolutional and fully connected layers are implemented using GEMM, which stands for General Matrix-Matrix Multiplication.
So basically in GEMM, we convert the convolution operation into a matrix multiplication by using a function called im2col(), which arranges the data in such a way that the convolution output can be obtained by matrix multiplication.
Now, you may ask: instead of directly doing element-wise convolution, why are we adding a step in between to arrange the data differently and then using GEMM?
The answer is that scientific programmers have spent decades optimizing code for large matrix-matrix multiplications, and the benefits from the very regular memory-access patterns outweigh all other costs.
We have an optimized CUDA GEMM API in the cuBLAS library, Intel MKL has an optimized CPU GEMM, and clBLAS's GEMM API can be used for devices supporting OpenCL.
Element-wise convolution performs badly because of the irregular memory accesses it involves.
im2col(), in turn, arranges the data so that the memory accesses are regular for matrix multiplication.
im2col() does introduce a lot of data redundancy, but the performance benefit of using GEMM outweighs that redundancy.
This is the reason for using the im2col() operation in neural nets.
This link explains how im2col() arranges the data for GEMM:
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
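For a concrete picture, here is a minimal MATLAB sketch of the same idea. It assumes the Image Processing Toolbox's im2col; real frameworks apply this to 4-D batched tensors, but the mechanics are identical:

    A = magic(5);                           % 5x5 toy "image"
    k = [1 0 -1; 2 0 -2; 1 0 -1];           % 3x3 kernel
    cols = im2col(A, size(k), 'sliding');   % every 3x3 patch becomes a column
    % One matrix product replaces the nested convolution loops; the kernel is
    % flipped so the result matches true convolution (conv2) rather than
    % cross-correlation.
    out = reshape(rot90(k, 2), 1, []) * cols;
    out = reshape(out, size(A) - size(k) + 1);
    isequal(out, conv2(A, k, 'valid'))      % returns true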

Dealing with a large kernel matrix in SVM

I have a matrix X of size 40-by-60000.
While writing the SVM, I need to form the linear kernel: K = X'*X
And, of course, I get an error:
Requested 60000x60000 (26.8GB) array exceeds maximum array size preference.
How is this usually done? The data set is MNIST, so someone must have done this before. In this case rank(K) <= 40; I need a way to store K and later pass it to quadprog.
How is it usually done?
Usually, kernel matrices for big datasets are not precomputed. Since the optimisation methods used (like SMO or gradient descent) only need access to a subset of the samples in each iteration, you simply need a data structure that acts as a lazy kernel matrix: each time the optimiser requests K[i,j], you literally compute K(xi,xj) at that moment. There are often also caching mechanisms to make sure that frequently requested kernel values are already prepared, etc.
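As a sketch of the lazy idea in MATLAB, using the X from your question (kfun is an illustrative name, not any particular library's API):

    % Lazy linear kernel: entries of K are computed only when requested.
    kfun = @(i, j) X(:, i)' * X(:, j);      % K(i,j) for index vectors i, j
    Krow = kfun(17, 1:size(X, 2));          % one 1x60000 row of K on demand,
                                            % without ever forming the 26.8 GB K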
If you're willing to commit to a linear kernel (or any other kernel whose corresponding feature transformation is easily computed), you can avoid allocating O(N^2) memory by using a primal optimization method, which does not construct the full kernel matrix K.
Primal methods represent the model as a weighted sum of the training samples' features, and so only take O(N*D) memory, where N and D are the number of training samples and their feature dimension.
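Relatedly, since K = X'*X, any product with K can be formed from two thin products without materialising K. A one-line MATLAB sketch; whether you can use it depends on your solver accepting K as an operator (quadprog's 'trust-region-reflective' algorithm has a HessianMultiplyFcn option for this, though whether it fits your constraint set is an assumption):

    % K*v == X'*(X*v): O(N*D) time and memory; the 60000x60000 K never exists.
    Kv = @(v) X' * (X * v);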
You could also use liblinear (if you resolve the C++ issues).
Note this comment from their website: "Without using kernels, one can quickly train a much larger set via a linear classifier."
This problem occurs because of the large size of your data set: the requested array exceeds the memory that MATLAB can allocate on your system. A 64-bit system can address far more memory for data processing than a 32-bit one, so you'll want to check which of the two you are running.

Is there any way to disable MKL in MATLAB in order to test the FLOP complexity of an algorithm?

MATLAB is an effective tool for numerical experiments, and I find that many papers like using it to test the FLOP complexity of an algorithm (e.g., regression, SVD).
However, as I have learnt from others, MATLAB uses Intel MKL for matrix multiplication. This is highly optimized code that takes advantage of all the cores and their vector processing units (SSE/AVX), and is tuned for the CPU's cache layout.
This means that timing MATLAB directly cannot truly test FLOP complexity.
My question is then: how can I disable MKL, or do something else in MATLAB, in order to test the FLOP complexity of an algorithm?
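One partial workaround, sketched below under the assumption that single-threaded timing is close enough for your purpose, is MATLAB's documented maxNumCompThreads function. It removes MKL's multicore parallelism, though not its SIMD or cache optimizations:

    % Restrict MATLAB to one computational thread before timing.
    nPrev = maxNumCompThreads(1);      % returns the previous thread count
    n = 2000;
    A = randn(n); B = randn(n);
    tic; C = A * B; t = toc;           % ~2*n^3 FLOPs in t seconds
    fprintf('%.2f GFLOP/s\n', 2*n^3 / t / 1e9);
    maxNumCompThreads(nPrev);          % restore the original setting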

Interpretation of MATLAB's NaiveBayes 'posterior' function

After we have created a Naive Bayes classifier object nb (say, with a multivariate multinomial (mvmn) distribution), we can call the posterior function on testing data using the nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and what it means; cpre is the predicted class, based on the maximum over the posterior probabilities for each class.
The question is about logp. It is clear how it is computed (the logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of the Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.
The logp you are referring to is the log likelihood, which is one way to measure how well a model fits. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
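A quick MATLAB illustration of the underflow point (the numbers are arbitrary):

    p = 1e-5 * ones(1, 200);   % 200 small per-feature likelihoods
    prod(p)                    % 0: the product underflows double precision
    sum(log(p))                % -2302.59: the log likelihood is representable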
If you learned your classifier several times with different starting points, you would get different results, because the likelihood function is not log-concave, meaning there are local maxima that you can get stuck in. If you computed the likelihood of the posterior on your original data, you would get the likelihood of the model. Although the likelihood gives you a good measure of how well one set of parameters fits compared to another, you need to be careful not to overfit.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier fits the data. If you were trying to learn this model based on the test set, you would pick the parameters with the highest test likelihood; in general, though, it is better to use a validation set for this. What you are doing here is computing the predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers; in fact, it can be computed for any Bayesian model (Gaussian mixtures, latent Dirichlet allocation, etc.).

Deterministic Annealing Code

I would like to find an open-source example of code for deterministic annealing. It can be in almost any language: C, C++, MATLAB/Octave, Fortran. I have already found MATLAB code for simulated annealing, so MATLAB would be best. Here is a paper that describes the algorithm.
Deterministic annealing is an optimization technique that attempts to find a global minimum of a cost function. The technique is designed to be able to explore a large portion of the cost surface using randomness, while still performing optimization using local information. The procedure starts by changing the cost function to introduce a notion of randomness, allowing a large area to be explored. Each iteration, the amount of randomness (measured by Shannon entropy [2]) is constrained, and a local optimization is performed. Gradually, the amount of imposed randomness is lowered so that upon termination the algorithm optimizes over the original cost function, yielding a solution to the original problem.
The figures in the paper you link to look like MATLAB figures. I suggest you contact the authors and ask whether they're willing to share their code with you.
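If you want something to experiment with in the meantime, here is a minimal MATLAB sketch of deterministic annealing for clustering, loosely following Rose's formulation (soft Gibbs assignments at temperature T, cooled gradually). The toy data, fixed K, and cooling schedule are illustrative choices, not the linked paper's algorithm:

    rng(0);
    X = [randn(50, 2); randn(50, 2) + 4];               % two clusters in 2-D
    K = 2;
    Y = repmat(mean(X, 1), K, 1) + 0.01 * randn(K, 2);  % centroids near the mean
    T = 10; Tmin = 1e-2; cool = 0.9;
    while T > Tmin
        D = sum(X.^2, 2) + sum(Y.^2, 2)' - 2 * X * Y';  % squared distances, N x K
        D = D - min(D, [], 2);                          % stabilise exp() as T -> 0
        P = exp(-D / T);
        P = P ./ sum(P, 2);                             % Gibbs assignments p(j|i)
        for j = 1:K                                     % centroid update at this T
            Y(j, :) = (P(:, j)' * X) / sum(P(:, j));
        end
        T = cool * T;            % lower the imposed randomness (entropy)
    end
    disp(Y)                      % as T -> 0, assignments harden and Y
                                 % approaches the two cluster means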