Neural network for text documents invariant to sentence order - neural-network

Is there a neural network architecture that I can use to find a low dimensional mapping for documents comprised of multiple sentences such that the mapping is invariant to sentence order?
So, if Doc 1 is:
I like dogs. Cats are very nice.
and Doc 2 is:
Cats are very nice. I like dogs.
then in the new space, they would be represented by the same point?

Related

After loading a pretrained Word2Vec model, how do I get word2vec representations of new sentences?

I loaded a word2vec model trained on the Google News dataset. Now I want to get the Word2Vec representations of a list of sentences that I wish to cluster. After going through the documentation I found gensim.models.word2vec.LineSentence, but I'm not sure this is what I am looking for.
There should be a way to get word2vec representations of a list of sentences from a pretrained model, right? None of the links I searched had anything about it. Any leads would be appreciated.
Word2Vec only offers vector representations for words, not sentences.
One crude but somewhat effective (for some purposes) way to go from word-vectors to vectors for longer texts (like sentences) is to average all the word-vectors together. This isn't a function of the gensim Word2Vec class; you have to code this yourself.
For example, with the word-vectors already loaded as word_model, you'd roughly do:
import numpy as np

sentence_tokens = "I do not like green eggs and ham".split()

# sum the word-vectors for every token, then divide by the token count
sum_vector = np.zeros(word_model.vector_size)
for token in sentence_tokens:
    sum_vector += word_model[token]
sentence_vector = sum_vector / len(sentence_tokens)
Real code might add handling for when the tokens aren't all known to the model, or other ways of tokenizing/filtering the text, and so forth.
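For example, a slightly more defensive version (just a sketch, assuming the same word_model is loaded and simply skipping tokens the model doesn't know) might look like:

import numpy as np

def average_vector(text, word_model):
    # naive whitespace tokenization; real code might use a proper tokenizer
    tokens = text.lower().split()
    # keep only tokens that are in the model's vocabulary
    known = [t for t in tokens if t in word_model]
    if not known:
        return np.zeros(word_model.vector_size)
    return np.mean([word_model[t] for t in known], axis=0)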
There are other more sophisticated ways to get the vector for a length-of-text, such as the 'Paragraph Vectors' algorithm implemented by gensim's Doc2Vec class. These don't necessarily start with pretrained word-vectors, but can be trained on your own corpus of texts.
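For instance, a minimal Doc2Vec sketch (texts is assumed to be a list of raw document strings; the parameter values are only illustrative, and the attribute names follow the gensim 4.x API):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# texts is assumed to be a list of raw document strings
documents = [TaggedDocument(words=t.lower().split(), tags=[i])
             for i, t in enumerate(texts)]
model = Doc2Vec(documents, vector_size=100, min_count=2, epochs=20)

vec_0 = model.dv[0]                                          # vector for training doc 0
new_vec = model.infer_vector("cats are very nice".split())   # vector for unseen text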

What mechanism can be used to quantify similarity between non-numeric lists?

I have a database of recipes which is essentially structured as a list of ingredients and their associated quantities. Given a recipe, how would you identify similar recipes, allowing for variations and omissions? For example, using milk instead of water, honey instead of sugar, or entirely omitting something for flavour.
The current strategy is to do multiple inner joins for combinations of the main ingredients, but this can be exceedingly slow with a large database. Is there another way to do this? Something equivalent to perceptual hashing would be ideal!
How about cosine similarity?
This technique is commonly used in machine learning for comparing texts as a similarity measure. With it, you can calculate the distance between two texts (actually, between any two vectors), which can be interpreted as how alike those texts are (the closer, the more alike).
Take a look at this great question that explains cosine similarity in a simple way. In general, you could use any similarity measure to obtain a distance to compare your recipe. This article talks about different similarity measures, you can check it out if you wish to know more.
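As a rough sketch of the idea (the ingredient vocabulary and quantities below are made up; each recipe becomes a vector of quantities over a fixed ingredient list, with 0 for ingredients it doesn't use):

import numpy as np

# hypothetical ingredient vocabulary, in a fixed order
ingredients = ["flour", "sugar", "honey", "milk", "water", "eggs"]

# each recipe is a vector of quantities over that vocabulary
recipe_a = np.array([500.0, 100.0, 0.0, 250.0, 0.0, 2.0])  # uses sugar and milk
recipe_b = np.array([500.0, 0.0, 80.0, 0.0, 250.0, 2.0])   # honey and water instead

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(recipe_a, recipe_b))  # closer to 1.0 means more similar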

Word2vec model query

I trained a word2vec model on my dataset using the word2vec gensim package. My dataset has about 131,681 unique words but the model outputs a vector matrix of shape (47629,100). So only 47,629 words have vectors associated with them. What about the rest? Why am I not able to get a 100 dimensional vector for every unique word?
The gensim Word2Vec class uses a default min_count of 5, meaning any words appearing fewer than 5 times in your corpus will be ignored. If you enable INFO level logging, you should see logged messages about this and other steps taken by the training.
Note that it's hard to learn meaningful vectors from few (or non-varied) usage examples. So while you could lower min_count to 1, you shouldn't expect those vectors to be very good, and even trying to train them may worsen your other vectors. (Low-occurrence words can be essentially noise, interfering with the training of other word-vectors for which there are sufficiently numerous and varied examples.)
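For example, a quick sketch showing the effect of min_count (sentences is assumed to be your tokenized corpus, i.e. a list of token lists; parameter names follow gensim 4.x):

import logging
from gensim.models import Word2Vec

# INFO logging reports how many words are kept or dropped by min_count
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

model = Word2Vec(sentences, vector_size=100, min_count=5)  # default min_count is 5
print(len(model.wv))      # vocabulary size after the min_count cutoff

# lowering min_count keeps rare words, but their vectors will be poor
model_all = Word2Vec(sentences, vector_size=100, min_count=1)
print(len(model_all.wv))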

Sentence feature vectors

How could one, in an unsupervised setting, get a feature vector for a sentence? I believe that for image data one could build a convolutional autoencoder and take the hidden layer's outputs. What would be the best way to do this for RNN-type models (LSTM, GRU, etc.)?
Would this (https://papers.nips.cc/paper/5271-pre-training-of-recurrent-neural-networks-via-linear-autoencoders.pdf) be on the right track?
The easiest thing to do would be doing something with the word2vec representations of all the words (like summing them?).
See this post:
How to get vector for a sentence from the word2vec of tokens in sentence
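If you do want an RNN-based approach in the spirit of the question, one option is a sequence autoencoder whose encoder state is used as the sentence vector. This is only a minimal sketch (Keras layer names; it is not the exact method of the linked paper, and it assumes the inputs are already padded sequences of word vectors):

from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

seq_len, embed_dim, latent_dim = 20, 100, 64      # assumed sizes

inputs = Input(shape=(seq_len, embed_dim))        # sequences of word vectors
encoded = LSTM(latent_dim)(inputs)                # sentence feature vector
decoded = RepeatVector(seq_len)(encoded)
decoded = LSTM(latent_dim, return_sequences=True)(decoded)
decoded = TimeDistributed(Dense(embed_dim))(decoded)

autoencoder = Model(inputs, decoded)              # trained to reconstruct the input
encoder = Model(inputs, encoded)                  # used afterwards to extract features
autoencoder.compile(optimizer="adam", loss="mse")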

Text patterns from classification

Say I have some kind of multi-class text/conversation classifier (naive Bayes or similar) and I wanted to find text patterns that were significant for the classification. How would I best go about finding these text patterns? The motivation is that such patterns could help you better understand the process behind the classification.
A pattern is defined as a (multi)set of words s = {w1, ..., wn}; this pattern has a class probability for each class c, P(c|s), inferred by the classifier. A pattern is then significant if the inferred probability is high (a local maximum, top n, something like that).
Now it wouldn't be such a problem to run the classifier on parts of the text in the dataset you are looking at. However, these patterns do not have to be natural sentences; they can be any (multi)subset of the vocabulary. You would then be running the classification on all (multi)subsets of the vocabulary, which is computationally unrealistic.
I think what could work is to search the text space using a heuristic search algorithm such as hill climbing to maximize the likelihood of a certain class. You could run the hill climber a number of times from different initial conditions and then take the top 10 or so unique results as patterns.
Is this a good approach, or are there better ones? Thanks for any suggestions.
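A rough sketch of the hill-climbing idea described above (hypothetical names; it assumes a fitted scikit-learn CountVectorizer as vectorizer and a naive Bayes model as clf, and it greedily swaps one word at a time to raise P(c|s)):

import random

def class_prob(words, vectorizer, clf, class_index):
    # P(c|s) for a (multi)set of words, via the bag-of-words classifier
    x = vectorizer.transform([" ".join(words)])
    return clf.predict_proba(x)[0, class_index]

def hill_climb(vocab, vectorizer, clf, class_index, size=5, iters=200):
    # start from a random multiset of words and greedily improve it
    pattern = random.choices(vocab, k=size)
    best = class_prob(pattern, vectorizer, clf, class_index)
    for _ in range(iters):
        candidate = list(pattern)
        candidate[random.randrange(size)] = random.choice(vocab)  # swap one word
        score = class_prob(candidate, vectorizer, clf, class_index)
        if score > best:
            pattern, best = candidate, score
    return pattern, best

# run several restarts and keep the top unique patterns, e.g.:
# vocab = list(vectorizer.get_feature_names_out())
# runs = [hill_climb(vocab, vectorizer, clf, class_index=0) for _ in range(50)]
# top = sorted(runs, key=lambda r: -r[1])[:10]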