Embedding Layers in PyTorch are listed under "Sparse Layers" with the limitation:
Keep in mind that only a limited number of optimizers support sparse gradients: currently it’s optim.SGD (cuda and cpu), and optim.Adagrad (cpu)
What is the reason for this? For example in Keras I can train an architecture with an Embedding Layer using any optimizer.
Upon closer inspection, sparse gradients on Embeddings are optional and can be turned on or off with the sparse parameter:
class torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2, scale_grad_by_freq=False, sparse=False)
Where:
sparse (boolean, optional) – if True, gradient w.r.t. weight matrix will be a sparse tensor. See Notes for more details regarding sparse gradients.
And the "Notes" mentioned are what I quoted in the question about a limited number of optimizers being supported for sparse gradients.
Update:
It is theoretically possible but technically difficult to implement some optimization methods on sparse gradients. There is an open issue in the PyTorch repo to add support for all optimizers.
Regarding the original question, I believe Embeddings can be treated as sparse because it is possible to operate on the input indices directly rather than converting them to one-hot encodings for input into a dense layer. This is explained in @Maxim's answer to my related question.
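To make this concrete, here is a minimal sketch (assuming a reasonably recent PyTorch; the sizes and indices are made up) of an Embedding created with sparse=True and trained with optim.SGD, one of the optimizers listed as supporting sparse gradients:

import torch
import torch.nn as nn
import torch.optim as optim

# Embedding with sparse gradients: only the rows indexed in the forward
# pass get gradient entries, so weight.grad is a sparse tensor.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=16, sparse=True)
optimizer = optim.SGD(embedding.parameters(), lr=0.1)  # SGD supports sparse grads

indices = torch.tensor([[1, 5, 7], [2, 5, 9]])  # a batch of index sequences
output = embedding(indices)                     # shape: (2, 3, 16)
loss = output.sum()
loss.backward()

print(embedding.weight.grad.is_sparse)  # True: gradient covers only the used rows
optimizer.step()

Only the rows that were actually indexed in the forward pass appear in the gradient, which is exactly why the indices never need to be expanded into one-hot vectors.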
Would it be appropriate to include an expert system in an image classification application? (I am working with Matlab, have some experience with image processing, and no experience with expert systems.)
What I'm planning on doing is adding an extra feature vector that is actually an answer to a question. Is this fine?
For example: assume I have two questions that I want the answers to, Question 1 and Question 2. Knowing the answers to these two questions should help classify the test image more accurately. I understand expert systems are coded differently from an image classifier, but my question is: would it be wrong to include the answers to these two questions in numerical form (1 for yes, 0 for no) and pass this information along with the other feature vectors into a classifier?
If it matters, my current classifier is an SVM.
Regarding training images: yes, they too will include the two extra feature values during training.
Converting a set of comments to an answer:
A similar question on Cross Validated already explains that this can be done as long as the data is properly preprocessed.
In short: you can combine them as long as the training (and testing) data is properly preprocessed (e.g. standardized). Standardization improves the performance of most linear classifiers because it scales the variables so they carry a similar weight in the learning process, and it improves numerical stability (and performance) when the variables follow Gaussian-like distributions (which standardization helps achieve).
With that, if continuous variables are standardized and categorical variables are encoded as (-1, +1), the SVM should work well. Whether or not it improves the performance of the classifier depends on the quality of those categorical variables.
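As a minimal sketch of that preprocessing (in Python with scikit-learn, used here purely for illustration; the feature counts are made up), the continuous image features are standardized and the yes/no answers are encoded as -1/+1 before everything is concatenated and passed to the SVM:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: 4 continuous image features plus 2 yes/no "expert" answers.
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(100, 4)) * [1.0, 10.0, 100.0, 0.1]  # mixed scales
X_answers = rng.integers(0, 2, size=(100, 2))                 # 0 = no, 1 = yes
y = rng.integers(0, 2, size=100)

# Standardize the continuous features (fit on training data only in practice)
# and encode the answers as -1/+1 so every variable carries a similar weight.
X_cont_std = StandardScaler().fit_transform(X_cont)
X_answers_pm = 2 * X_answers - 1

X = np.hstack([X_cont_std, X_answers_pm])
clf = SVC(kernel="rbf").fit(X, y)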
Answering the other question in the comments: when using a kernel SVM with, for example, a chi-square kernel, the rows of the training data are supposed to behave like histograms (all positive and usually L1-normalized), and therefore introducing a (-1, +1) feature breaks the kernel. With an RBF kernel the rows of the data are supposed to be L2-normalized, and again, introducing (-1, +1) features might introduce unexpected behaviour (I'm not entirely sure what the effect would be).
I worked on a similar problem. If multiple features can be extracted from your images, then you can train different classifiers using different features. You can think of these classifiers as experts that answer questions based on the features they were trained on. Instead of using labels as outputs, it is better to use confidence values; uncertainty can be very important here. You can use these experts to generate values, and these values can then be combined and used to train another classifier.
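A minimal sketch of that idea (in Python with scikit-learn, used only for illustration; the feature groups and sizes are made up): each "expert" is trained on its own feature group, its confidence values are collected, and a second-level classifier is trained on those confidences:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy setup: two different feature groups extracted from the same images.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
feat_color = rng.normal(size=(200, 8)) + y[:, None]            # "color" features
feat_texture = rng.normal(size=(200, 12)) + 0.5 * y[:, None]   # "texture" features

# One "expert" per feature group; keep confidence values, not hard labels.
expert_color = LogisticRegression().fit(feat_color, y)
expert_texture = LogisticRegression().fit(feat_texture, y)

confidences = np.column_stack([
    expert_color.predict_proba(feat_color)[:, 1],
    expert_texture.predict_proba(feat_texture)[:, 1],
])

# A second-level classifier combines the experts' confidence values.
combiner = SVC().fit(confidences, y)

In practice the confidences used to train the combiner should come from held-out folds, otherwise the second-level classifier overfits to the experts' training error.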
I have run LDA in MATLAB using the fitcdiscr and predict functions.
However, I have a feeling there may be some bugs in my code, and as a sanity check I would like to identify which features are most heavily weighted in the classification.
Can this be done?
There is a Coeffs field in your fitted object containing all the relevant information: http://uk.mathworks.com/help/stats/classificationdiscriminant-class.html
In particular, if you fit a linear LDA there will be a Linear field, which is the linear operator used for projection. However, bear in mind that the coefficient values of a linear model are not feature importances. There is much more to consider. A weight can be large because your feature has small values or because the distribution of its values is highly skewed. If you need feature selection, use feature selection methods (such as L1-regularized models); otherwise you might easily draw wrong conclusions from your data.
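To illustrate that last point, here is a minimal sketch (in Python with scikit-learn, used only because it makes the idea compact; the data is synthetic): an L1-regularized linear model drives the coefficients of uninformative features to exactly zero, which is a sounder basis for feature selection than raw coefficient magnitudes:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=300) > 0).astype(int)

# Standardize first; otherwise coefficient sizes partly reflect feature scales,
# which is exactly the pitfall described above.
X_std = StandardScaler().fit_transform(X)

# The L1 penalty pushes coefficients of irrelevant features to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_std, y)
print(np.flatnonzero(clf.coef_[0]))  # indices of the selected features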
I'm learning neural networks through Keras and would like to explore my sequential dataset on a recurrent neural network.
I was reading the docs and trying to make sense of the LSTM example.
My questions are:
What are the timesteps that are required for both layers?
How do I prepare a sequential dataset that works with Dense as an input for those recurrent layers?
What does the Embedding layer do?
Timesteps are a rather bothersome aspect of Keras. Because the data you provide as input to your LSTM must be a NumPy array, it needs (at least for Keras version <= 0.3.3) to have a specified shape, including the "time" dimension. You can only feed sequences of a specified length as input; if your inputs vary in length, you should either pad your sequences with artificial data or use the "stateful" mode (please read the Keras documentation carefully to understand what this approach means). Both solutions can be unpleasant, but that's the cost of Keras being so simple :) I hope that in version 1.0.0 they will do something about this.
There are two ways to apply non-recurrent layers after LSTM ones (see the sketch after this list):
You could set the return_sequences argument to False; then only the last activation from each sequence will be passed to the "static" layer.
You could use one of the "time distributed" layers to get more flexibility in what you do with your data.
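Here is a minimal sketch using the Keras 1.x/2.x-style API (layer and argument names may differ across Keras versions; the vocabulary size, sequence length and layer sizes are made up). Variable-length sequences are padded to a fixed number of timesteps, passed through an Embedding layer, and return_sequences controls whether the LSTM hands only its last activation or the whole sequence to the next layer:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed
from keras.preprocessing.sequence import pad_sequences

# Variable-length integer sequences, padded to a fixed number of timesteps.
sequences = [[3, 12, 7], [5, 1], [9, 4, 2, 8]]
X = pad_sequences(sequences, maxlen=10)  # shape: (3, 10)

model = Sequential()
# Embedding maps each integer index to a dense 16-dimensional vector.
model.add(Embedding(input_dim=1000, output_dim=16, input_length=10))
# return_sequences=False: only the last activation goes to the Dense layer.
model.add(LSTM(32, return_sequences=False))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam")

# Alternative ("time distributed" route): keep the whole sequence and apply
# a Dense layer to every timestep instead of the two layers above:
# model.add(LSTM(32, return_sequences=True))
# model.add(TimeDistributed(Dense(1, activation="sigmoid")))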
https://stats.stackexchange.com/questions/182775/what-is-an-embedding-layer-in-a-neural-network :)
I was reading this particular paper http://www.robots.ox.ac.uk/~vgg/publications/2011/Chatfield11/chatfield11.pdf and I find the Fisher Vector with GMM vocabulary approach very interesting and I would like to test it myself.
However, it is totally unclear (to me) how they apply the PCA dimensionality reduction to the data. I mean, do they build the feature space first and, once it is built, perform PCA on it? Or do they perform PCA on each image's SIFT descriptors and then create the feature space?
Is this supposed to be done for both the training and test sets? To me the answer is 'obviously yes', but it is not clearly stated.
I was thinking of creating the feature space from the training set and then running PCA on it. Then I could use the PCA coefficients from the training set to reduce each image's SIFT descriptors before they are encoded into a Fisher Vector for later classification, whether the image is a test or a training image.
EDIT 1:
Simplistic example:
[coef , reduced_feat_space]= pca(Feat_Space','NumComponents', 80);
and then (for both test and train images)
reduced_test_img = test_img * coef; (And then choose the first 80 dimensions of the reduced_test_img)
What do you think? Cheers
It looks to me like they do SIFT first and then do PCA. The article states in section 2.1: "The local descriptors are fixed in all experiments to be SIFT descriptors..."
Also, in the introduction section: "the following three steps: (i) extraction of local image features (e.g., SIFT descriptors), (ii) encoding of the local features in an image descriptor (e.g., a histogram of the quantized local features), and (iii) classification ... Recently several authors have focused on improving the second component". So it looks to me like the dimensionality reduction occurs after SIFT, and the paper is simply discussing a few different ways of doing this and the performance of each.
I would also guess (as you did) that you would have to run it on both sets of images. Otherwise you would be using two different metrics to classify the images; it really is like comparing apples to oranges. Comparing a reduced-dimensional representation to the full one (even for the exact same image) will show some variation. In fact, that is the whole premise of PCA: you are (usually) giving up some smaller features for computational efficiency. The real question with PCA, or any dimensionality reduction algorithm, is how much information you can give up and still reliably classify/segment different data sets.
And as a last point, you would have to treat both sets of images the same way, because your end goal is to use the Fisher Feature Vector for classification, whether an image is test or training. Now imagine you decided training images don't get PCA and test images do. Now I give you some image X: what would you do with it? How could you treat one set of images differently from another BEFORE you've classified them? Using the same technique on both sets means you'd process my image X and then decide where to put it.
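A minimal sketch of that workflow (in Python with scikit-learn rather than MATLAB, purely for illustration; the descriptor counts are made up): PCA is fitted on the pooled training SIFT descriptors, and the same fitted projection is applied to both training and test descriptors before the GMM / Fisher Vector encoding:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 128))  # stacked SIFT descriptors (train)
test_descriptors = rng.normal(size=(800, 128))    # descriptors from a test image

# Fit the projection on the training descriptors only...
pca = PCA(n_components=80).fit(train_descriptors)

# ...and apply the SAME projection to both sets, so every image is encoded
# in the same reduced space before the GMM / Fisher Vector step.
train_reduced = pca.transform(train_descriptors)  # shape: (5000, 80)
test_reduced = pca.transform(test_descriptors)    # shape: (800, 80)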
Anyway, I hope that helped and wasn't too rant-like. Good luck :-)
I am studying Support Vector Machines (SVM) by reading a lot of material. However, it seems that most of it focuses on how to classify the input 2D data by mapping it using several kernels such as linear, polynomial, RBF / Gaussian, etc.
My first question is, can SVM handle high-dimensional (n-D) input data?
According to what I found, the answer is YES!
If my understanding is correct, then:
(1) the n-D input data will be constructed in a Hilbert hyperspace, then
(2) those data will be simplified by using some approach (such as PCA?) to combine them / project them back onto a 2D plane, so that
(3) the kernel methods can map the data into an appropriate shape such that a line or curve can separate it into distinct groups.
This means most of the guides/tutorials focus on step (3). But some toolboxes I've checked cannot plot the data if the input has more than 2 dimensions. How can the data be projected to 2D afterwards?
If there is no projection of the data, how can they classify it?
My second question is: is my understanding correct?
My first question is, can SVM handle high-dimensional (n-D) input data?
Yes. I have dealt with data where n > 2500 when using LIBSVM software: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. I used linear and RBF kernels.
My second question is, is my understanding correct?
I'm not entirely sure on what you mean here, so I'll try to comment on what you said most recently. I believe your intuition is generally correct. Data is "constructed" in some n-dimensional space, and a hyperplane of dimension n-1 is used to classify the data into two groups. However, by using kernel methods, it's possible to generate this information using linear methods and not consume all the memory of your computer.
I'm not sure if you've seen this already, but if you haven't, you may be interested in some of the information in this paper: http://pyml.sourceforge.net/doc/howto.pdf. I've copied and pasted a part of the text that may appeal to your thoughts:
A kernel method is an algorithm that depends on the data only through dot-products. When this is the case, the dot product can be replaced by a kernel function which computes a dot product in some possibly high dimensional feature space. This has two advantages: First, the ability to generate non-linear decision boundaries using methods designed for linear classifiers. Second, the use of kernel functions allows the user to apply a classifier to data that have no obvious fixed-dimensional vector space representation. The prime example of such data in bioinformatics are sequence, either DNA or protein, and protein structure.
It would also help if you could explain which "guides" you are referring to. I don't think I've ever had to project data onto a 2-D plane, and it doesn't make sense to do so anyway for data with a ridiculous number of dimensions (or "features", as they are called in LIBSVM). Using suitably chosen kernel methods should be enough to classify such data.
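As a minimal sketch (in Python with scikit-learn, whose SVC is a wrapper around LIBSVM; the data here is synthetic), the SVM is trained directly on the n-dimensional feature vectors, and no projection to 2D appears anywhere in the pipeline:

import numpy as np
from sklearn.svm import SVC

# 200 samples with 2500 features each: no 2D projection is involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2500))
y = (X[:, :10].sum(axis=1) > 0).astype(int)  # labels depend on a few features

clf = SVC(kernel="rbf").fit(X, y)  # the kernel works in an implicit feature space
print(clf.predict(X[:5]))          # predictions for the first 5 samples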