How to employ Learning-to-rank models (CNN, LSTM) in short-pair ranking? - neural-network

In common applied learning-to-rank tasks, the inputs are usually semantically rich and syntactically well-formed, as in question-answer ranking. In that scenario, a CNN or LSTM is a good architecture for capturing the latent information (local or long-range dependencies) in QA pairs.
But in practice we sometimes have only short pairs of discrete words. In that case, is a CNN or LSTM still a fair choice? Or is there a more appropriate method?

The bigger question is how much training data you have. There's a lot of interesting work, but deep neural network approaches tend to be evaluated on QA ranking tasks because those tasks typically have hundreds of thousands or millions of training examples.
When you have shorter queries, e.g. titles or web queries, you may need even more data to learn, because each training instance exercises less of the network. It is possible, but in general the method you choose should be based on the training data you have available rather than the length of your queries.
[0-50 queries] -> A hand-tuned, time-tested model such as Query Likelihood or BM25 (for better results, n-gram models such as SDM; for more recall, pseudo-relevance-feedback models such as RM3).
[50-1000 queries] -> Linear or Tree-based learning-to-rank methods
[1000-millions] -> Deep approach, or possibly still learning-to-rank. I'm not sure any of the deep papers have truly dominated a state-of-the-art gradient-boosted-regression-tree setup.
A recent paper by one of my labmates used pseudo-labels from BM25 to bootstrap a DNN. They got good results (better than BM25), but they literally had to be Google (training-time-wise) to pull it off.
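For the small-data end of that list, here is a minimal sketch of BM25 scoring in plain Python. It assumes pre-tokenized text, the commonly used defaults k1=1.2 and b=0.75, and the non-negative IDF variant; the toy documents and the function name are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy collection (hypothetical data).
docs = [
    "cheap flights to paris".split(),
    "paris hotel deals".split(),
    "learn to rank with gradient boosted trees".split(),
]
scores = bm25_scores("paris flights".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because there are only two parameters to tune, a model like this stays usable with a few dozen judged queries, which is the point of the first bucket above.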

Related

checking for convergence in complex hierarchical models JAGS

I have estimated a complex hierarchical model with many random effects, but I don't really know what the best approach is for checking convergence. I have complex longitudinal data from a few hundred individuals and estimate quite a few parameters for every individual. Because of that, I have way too many traceplots to inspect visually. Or should I really spend a day going through all the traceplots? What would be a better way to check for convergence? Do I have to calculate Gelman and Rubin's Rhat for every parameter on the person level? And when can I conclude that the model converged? When absolutely all of the thousands of parameters have reached convergence? Is it even sensible to expect that? Or is there something like "overall convergence"? And what does it mean when some person-level parameters did not converge? Does it make sense to use autorun.jags from the R2jags package with such a model, or will it just run forever? I know these are a lot of questions, but I just don't know how to approach this.
The measure I am using for convergence is a potential scale reduction factor (psrf)* using the gelman.diag function from the R package coda.
But nevertheless, I am also quickly visually inspecting all the traceplots, even though I also have tens/hundreds of them. It can be really fast if you put them in PNG files and then quickly go through them using e.g. IrfanView (let me know if you need me to expand on this).
The reason you should inspect the traceplots is pretty well described by an example from Marc Kéry (author of great Bayesian books): see "Never blindly trust Rhat for convergence in a Bayesian analysis". Here I include a self-explanatory image from that email:
Note that Rhat and the psrf reported by gelman.diag are the same statistic, so this caveat applies to my approach too; it is better to check the chains as well.
*) Gelman, A. & Rubin, D. B. Inference from iterative simulation using multiple sequences. Stat. Sci. 7, 457–472 (1992).
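With thousands of person-level parameters, one way to combine both ideas is to compute the psrf for every parameter and look only at the traceplots of those that exceed a cutoff. The sketch below is a plain-Python illustration of the basic statistic from the 1992 paper, without the degrees-of-freedom correction that gelman.diag applies. The 1.1 cutoff is a common rule of thumb rather than something from this thread, and the parameter names and simulated draws are hypothetical.

```python
import random
import statistics


def rhat(chains):
    """Basic potential scale reduction factor for one parameter.

    `chains` is a list of equal-length lists of draws, one list per chain.
    """
    n = len(chains[0])
    # W: average of the within-chain sample variances.
    within = statistics.mean(statistics.variance(c) for c in chains)
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


def flag_unconverged(draws, threshold=1.1):
    """Return {parameter name: Rhat} for parameters whose Rhat exceeds `threshold`."""
    flagged = {}
    for name, chains in draws.items():
        value = rhat(chains)
        if value > threshold:
            flagged[name] = value
    return flagged


# Simulated draws (hypothetical): three chains per parameter.
random.seed(0)
draws = {
    # All chains sample the same distribution.
    "theta[1]": [[random.gauss(0, 1) for _ in range(500)] for _ in range(3)],
    # One chain is stuck in a different region.
    "theta[2]": [[random.gauss(mu, 1) for _ in range(500)] for mu in (0, 0, 5)],
}
flagged = flag_unconverged(draws)
```

Only the flagged parameters then need a closer visual check, although the example from the email still argues for spot-checking some of the others.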

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from commoncrawl website) but I am getting same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is the best among them. How can I select the best model out of all these models, and why am I getting the same word vectors from different models?
Sorry if the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
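As a sketch of those two checks, the snippet below turns on logging and wraps a text file in a restartable iterable. One frequent cause of "no training happened" is passing a one-shot generator: gensim iterates over the corpus more than once (a vocabulary scan, then the training passes), and a generator is empty after the first pass. Whether that is the problem here is only a guess. The class name, file contents and log format are illustrative.

```python
import logging
import os
import tempfile

# gensim reports vocabulary and training progress through the standard logging module.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)


class LineCorpus:
    """Restartable iterable of tokenized sentences, one sentence per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # A fresh file handle is opened on every pass, so the corpus can be re-read.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


# Demo on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the soup is hot\nthe ice is cold\n")

corpus = LineCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)  # the same sentences again

one_shot = (sentence for sentence in first_pass)  # a generator, for comparison
gen_first = list(one_shot)
gen_second = list(one_shot)  # empty: nothing is left for a second pass

os.remove(tmp.name)

# With gensim installed, the restartable object would be passed as
# Word2Vec(sentences=LineCorpus(path), ...).
```

If the log shows a vocabulary of zero words or training that finishes instantly, the corpus object is the first thing to check.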
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive, rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.
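To make the idea concrete, here is a simplified plain-Python sketch of analogy scoring. It is not gensim's implementation; it only assumes the questions-words.txt layout (section lines starting with ':' and analogy lines of four words, `a b c d`), and the three-dimensional vectors are invented so that the arithmetic is easy to follow.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, lines):
    """Return (correct, total) over the analogy lines whose four words are all known."""
    correct = total = 0
    for line in lines:
        parts = line.split()
        if line.startswith(":") or len(parts) != 4:
            continue  # section header or malformed line
        if any(word not in vectors for word in parts):
            continue  # skip analogies with out-of-vocabulary words
        a, b, c, expected = parts
        total += 1
        if solve_analogy(vectors, a, b, c) == expected:
            correct += 1
    return correct, total


# Toy vectors (hypothetical): axis 2 encodes gender, axis 3 encodes royalty.
vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}
lines = [
    ": family",
    "man woman king queen",
    "king queen man woman",
    "man woman prince princess",  # skipped: words not in the toy vocabulary
]
correct, total = analogy_accuracy(vectors, lines)
```

Running the same scoring over each candidate model gives one comparable number per model, subject to the caveat above that analogy performance may not track your actual task.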

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. word2vec). My goal is to use these word embeddings as features for a NN to classify texts into topic A, not topic A, and then perform event extraction on them on documents of topic A (using a second NN).
My first instinct is to preprocess by removing stopwords, lemmatizing, stemming, etc. But as I learn a bit more about NNs I realize that, applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present (to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents because they do convey contextual information), change written numbers to numeric, do not lowercase everything (useful for entity extraction), no stemming, no lemmatizing.
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lots of nuances by preprocessing, but e.g. for text generation, it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "depends". You need to consider the classes you are trying to predict and the kind of documents you have. Predicting authorship (where you definitely need to keep all kinds of punctuation and case so that stylometry works) is not the same as sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)
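A minimal sketch of "the same preprocessing on both ends": a single function is defined once and applied both to the text used for training and to the text seen later. The choices inside it follow the light-touch option from the question (drop punctuation, keep accents, case and digits, no stemming), and converting written numbers to digits is left out. The function name and sample sentences are made up.

```python
import re
import unicodedata

# \w matches Unicode letters and digits (and underscore), so accented characters survive.
TOKEN = re.compile(r"\w+")


def preprocess(text):
    """Light normalization: drop punctuation; keep case, accents and digits; no stemming."""
    # NFC composes base letter + combining accent into one character before matching.
    text = unicodedata.normalize("NFC", text)
    return TOKEN.findall(text)


# The same function is used for the training corpus and for later input.
train_tokens = [preprocess(t) for t in ["El Niño llegó a Perú en 1998."]]
query_tokens = preprocess("¿Llegó El Niño?")

vocab = {token for sentence in train_tokens for token in sentence}
unseen = [token for token in query_tokens if token not in vocab]
```

Here `unseen` contains `Llegó`, because the training text only had the lowercase form. That is the trade-off of keeping case: more surface forms are preserved, but each one needs enough training data of its own.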

Factors that Impact Translation Time

I have run across issues in developing models where the translation time (simulates quickly but takes far too long to translate) has become a serious issue and could use some insight so I can look into resolving this.
So the question is:
What are some of the primary factors that impact the translation time of a model and ideas to address the issue?
For example, things that may have an impact:
for loops vs a vectorized method - a basic model testing this didn't seem to impact anything
using input variables vs parameters
impact of annotations (e.g., Evaluate=true)
or tough luck, this is tool dependent (Dymola, OMEdit, etc.) :(
use of many connect() - this seems to be a factor (perhaps the primary one) as it forces the translator to do all the heavy lifting
Any insight is greatly appreciated.
Clearly the answer to this question is naturally open ended. There are many things to consider when computation times may be a factor.
For distributed models (e.g., finite difference), building simple component models and then linking them in the appropriate order with connect equations is not the best approach. Experience has shown that this method increases the translation time to unbearable lengths. It is better to create distributed models with the approach used in the MSL Dynamic pipe (not exactly like it but similar).
Changing the approach as described gives a significantly shorter translation time (orders of magnitude for larger models, >~100,000 equations) than using connect statements, and the gap grows as the number of distributed elements increases. This was tested using Dymola 2017 and 2017FD01.
Some related materials pointed out by others that may be useful for more information have been included below:
https://modelica.org/events/modelica2011/Proceedings/pages/papers/07_1_ID_183_a_fv.pdf
Scalable Test Suite : https://dx.doi.org/10.3384/ecp15118459

Estimating possible # of actors in Scala

How can I estimate the number of actors that a Scala program can handle?
For context, I'm contemplating what is essentially a neural net that will be creating and forgetting cells at a high rate. I'm contemplating making each cell an actor, but there will be millions of them. I'm trying to decide whether this design is worth pursuing, but can't estimate the limits of number of actors. My intent is that it should totally run on one system, so distributed limits don't apply.
For that matter, I haven't definitely settled on Scala, if there's some better choice, but the cells do have state, as in, e.g., their connections to other cells, the weights of the connections, etc. Though this COULD be done as "Each cell is final. Changes mean replacing the current cell with a new one bearing the same id#."
P.S.: I don't know Scala. I'm considering picking it up to do this project. I'm also considering lots of other alternatives, including Java, Object Pascal and Ada. But actors seem a better map to what I'm after than thread-pools (and Java can't handle enough threads to make a thread/cell design feasible).
P.S.: At all times, most of the actors will be quiescent, but there will need to be a way of cycling through the entire collection of them. If there isn't one built into the language, then this can be managed via first/next links within each cell. (Both links are needed, to allow cells in the middle to be extracted for release.)
With a neural net simulation, the real question is how much of the computational effort will be spent communicating, and how much will be spent computing something within a cell? If most of the effort is in communication then actors are perhaps a good choice for correctness, but not a good choice at all for efficiency (even with Akka, which performs reasonably well; AsyncFP might do the trick, though). Millions of neurons sounds slow--efficiency is probably a significant concern. If the neurons have some pretty heavy-duty computations to do themselves, then the communications overhead is no big deal.
If communications is the bottleneck, and you have lots of tiny messages, then you should design a custom data structure to hold the network, and also custom thread-handling that will take advantage of all the processors you have and minimize the amount of locking that you must do. For example, if you have space, each neuron could hold an array of input values from the neurons linked to it. When calculating its output it would just read that array directly with no locking, and the input neurons would just update the values as they went, also with no locking. You can then just dump all your neurons into one big pool and have a master distribute them in chunks of, I don't know, maybe ten thousand at a time, each to its own thread. Scala will work fine for this sort of thing, but expect to do a lot of low-level work yourself, or wait a really long time for the simulation to finish.
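The structure described above can be sketched as follows. This is a rough illustration in Python rather than Scala, and it makes one change to the description: instead of neurons updating shared input arrays in place, each tick reads from one list and writes to another, so that workers never write to a slot another worker reads. The ReLU activation, the treatment of neurons without inputs as external inputs, and all names are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def step(weights, sources, current, chunk_size=10_000, workers=4):
    """Advance every neuron by one tick and return the new list of outputs.

    sources[i] lists the indices of the neurons feeding neuron i and weights[i]
    holds the matching weights. Workers only read `current`, and each one writes
    a disjoint slice of `nxt`, so no locks are needed.
    """
    n = len(current)
    nxt = [0.0] * n

    def update(start):
        for i in range(start, min(start + chunk_size, n)):
            if not sources[i]:
                nxt[i] = current[i]  # no inputs: treated as an external input neuron
                continue
            total = sum(w * current[j] for w, j in zip(weights[i], sources[i]))
            nxt[i] = max(0.0, total)  # ReLU as a placeholder activation

    # The "master": hands out one chunk of neuron indices per task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(update, range(0, n, chunk_size)))
    return nxt


# Tiny network (hypothetical): neurons 0 and 1 are inputs, 2 reads both, 3 reads 2.
sources = [[], [], [0, 1], [2]]
weights = [[], [], [0.5, 0.25], [2.0]]
state = [1.0, 2.0, 0.0, 0.0]

after_one = step(weights, sources, state, chunk_size=2)
after_two = step(weights, sources, after_one, chunk_size=2)
```

In standard CPython the global interpreter lock keeps these threads from running the arithmetic in parallel, so this shows the data layout and chunked scheduling only; on the JVM the same layout with a thread pool would use all cores.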