What are the different techniques for comparing two words semantically? Which one is the best among them?

Right now, I am at the starting point of a project in which I am supposed to compare two words semantically.
I came to know about WordNet, where we find the distance between words to measure how similar they are in meaning.
It would be really helpful if you could suggest some more techniques and tell me which method would be the best one.

Related

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors, you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality, stopwords are usually removed, and stemming, lemmatizing, etc. are applied to normalize the features you want to perform some NLP task on.
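For instance, a minimal one-hot sketch (the vocabulary and words here are made up):

# Illustrative one-hot encoding: one dimension per vocabulary entry.
vocabulary = ['cat', 'dog', 'sat', 'on', 'the', 'mat']
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    vec = [0] * len(vocabulary)        # dimension = cardinality of the vocabulary
    vec[word_to_index[word]] = 1
    return vec

print(one_hot('dog'))                  # [0, 1, 0, 0, 0, 0]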
I'm having trouble understanding whether and how to preprocess text before embedding it (e.g. with word2vec). My goal is to use these word embeddings as features for a NN that classifies texts as topic A or not topic A, and then to perform event extraction on the documents of topic A (using a second NN).
My first instinct is to preprocess by removing stopwords, lemmatizing, stemming, etc. But as I learn a bit more about NNs applied to natural language, I realize that the CBOW and skip-gram models would in fact require the whole set of words to be present: to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right? The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents, because they do convey contextual information), change written numbers to numeric, not lowercase everything (useful for entity extraction), and do no stemming or lemmatizing.
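Roughly something like this, I imagine (a minimal sketch; the regex and the example sentence are just illustrative, and converting written numbers to digits would need an extra step such as a lookup table):

import re

def light_preprocess(text):
    # Drop punctuation but keep letters (including accented ones), digits and whitespace;
    # no lowercasing, no stemming, no lemmatizing -- tokens keep their surface form.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

print(light_preprocess("¿Dónde está el Aeropuerto número 2?"))
# ['Dónde', 'está', 'el', 'Aeropuerto', 'número', '2']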
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lots of nuance by preprocessing, but e.g. for text generation it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Addi- tionally, word embeddings trained on multiword- grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "it depends". You need to consider the classes you are trying to predict and the kind of documents you have. Trying to predict authorship (where you definitely need to keep all kinds of punctuation and casing so stylometry will work) is not the same as doing sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing at both ends. The surface forms are your link, so you can't normalise them in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a task-specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks, and I don't think they had any fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)

Is there a high-precision logarithm calculator in elisp?

In my class, for a final project, we are working on improving an algorithm that matches a prefix with a power of 2
(i.e. input="25", output="2^8=256"; input="99", output="2^93=9903520314283042199192993792...")
Anyways, we are relying on logarithms to identify matching prefixes. Logarithm precision actually does matter and we are looking for better log functions. The standard log function and the calc-function both have the same precision. Are there any options if I wanted even better precision?
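For illustration, here is the log-based idea in Python rather than elisp, using the standard decimal module for extra precision (the function name and the digit budget are made up): 2^k starts with prefix P exactly when the fractional part of k*log10(2) falls between log10(P) and log10(P+1), shifted into [0, 1).

from decimal import Decimal, getcontext

getcontext().prec = 60                      # far more digits than a float's ~15-16

def first_power_of_two_with_prefix(prefix, max_exp=10000):
    d = len(prefix)
    log2 = Decimal(2).log10()
    lo = Decimal(prefix).log10() - (d - 1)  # lower bound on frac(k * log10(2))
    hi = Decimal(int(prefix) + 1).log10() - (d - 1)
    for k in range(1, max_exp):
        frac = (k * log2) % 1               # fractional part, at high precision
        if lo <= frac < hi:
            return k
    raise ValueError("no match below max_exp")

print(first_power_of_two_with_prefix("25"))  # 8  -> 2^8 = 256
print(first_power_of_two_with_prefix("99"))  # 93 -> 2^93 = 9903520...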
Based on several Google searches, I've come to the conclusion that one currently does not exist. If someone were to inform me otherwise, however, I would happily change my accepted answer.

linear probing vs separate chaining in hashes

I am well aware that there's another question about this, but my question is different.
I know for sure that searching using separate chaining takes O(N/M), and if we sort the lists we get O(log(N/M)).
However, the running time of searching or deleting using linear probing is not clear to me. As far as I know it depends on the load factor, but that's it.
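For reference, a hedged aside: the classic textbook analysis (usually attributed to Knuth, under uniform-hashing assumptions) expresses the expected number of probes purely in terms of the load factor $\alpha = N/M$:

linear probing, successful search: $\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$
linear probing, unsuccessful search / insert: $\frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$
separate chaining: roughly $1 + \alpha/2$ (successful) and $\alpha$ (unsuccessful)

So both schemes are governed by the load factor, but linear probing degrades much faster as $\alpha \to 1$.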
Additionally, when we have cases like a full array (the worst case), is it better to use separate chaining or linear probing?
And if we have a sparse array, which one should we choose?
I can't seem to figure out what the advantages of each are over the other.
Thank you.

Grouping similar words (bad, worse)

I know there are ways to find synonyms, either by using NLTK/pywordnet or the Pattern package in Python, but they aren't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest a possible way?
There has been a lot of research in this area in the past 20 years. Yes, computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.
Possible approaches:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words which preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are 'synsets' and whose edges are relations between them. Any two words that are close to each other in this graph are close in meaning, and words that fall within the same synset may mean exactly the same thing. 'Bag' and 'baggage' are close, which you can find by iteratively exploring node to node in a breadth-first style, starting with 'bag' and exploring its neighbors in an attempt to reach 'baggage'; you'll have to limit this search to a small number of iterations for any practical application. Another style is to start a random walk from one node and try to reach the other node within a bounded number of tries and distance. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
There are many other similar resources online.
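A minimal sketch of Method 1 with NLTK's WordNet interface (this assumes nltk is installed and the WordNet corpus has been downloaded; taking the first synset of each word is a simplification):

from nltk.corpus import wordnet as wn

bag = wn.synsets('bag')[0]          # first sense of 'bag' -- may not be the luggage sense
baggage = wn.synsets('baggage')[0]

# Path-based similarity in (0, 1]: higher means the synsets are closer in the graph.
print(bag.path_similarity(baggage))
# Wu-Palmer similarity, based on the depth of the lowest common ancestor.
print(bag.wup_similarity(baggage))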
Method 2:
Word2Vec. It is hard to explain here, but it works by creating a vector with a user-specified number of dimensions for each word, based on its context in the text. The idea that words appearing in similar contexts mean similar things has been around for two decades; e.g. "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in text. You can read the paper for the full explanation (link at the end).
So you can train a Word2Vec model over a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of this vector; you can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came out two years ago and immediately became the 'thing to use' in most NLP applications. The quality of this approach depends on the amount and quality of your data. A Wikipedia dump is generally considered good training data, as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
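A rough training sketch with gensim (the toy sentences are made up, and parameter/attribute names differ a bit across gensim versions, e.g. size vs vector_size):

from gensim.models import Word2Vec

# `sentences` should be an iterable of tokenized sentences from your corpus.
sentences = [['i', 'am', 'gonna', 'check', 'out', 'my', 'bags'],
             ['i', 'am', 'gonna', 'check', 'out', 'my', 'baggage']]

model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)

print(model['bags'])                        # the learned vector for a word
print(model.similarity('bags', 'baggage'))  # cosine similarity between two words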
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
The first example gives you the word closest (topn=1) to 'woman' and 'king' while also being far away from 'man'; the answer is 'queen'. The second example finds the odd one out. The third one tells you how similar the two words are in your corpus.
An easy-to-use tool for Word2vec:
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Graph/tree representation and recursion

I'm currently writing an optimization algorithm in MATLAB, at which I completely suck, so I could really use your help. I'm really struggling to find a good way of representing a graph (well, more like a tree with several roots) which would look more or less like this:
(image: http://img100.imageshack.us/img100/3232/graphe.png)
Basically 11/12/13 are our roots (stage 0), 2x is stage 1, 3x is stage 2 and 4x is stage 3. As you can see, nodes from stage X are only connected to some of the nodes from stage X+1 (so they don't have to be connected to all of them).
Important: each node has to hold several values (at least 3-4); one will be its number, and there will be at least two other variables (which will be used to optimize the decisions).
I do have a simple representation using matrices but it's really hard to maintain, so I was wondering is there a good way to do it?
Second question: once I'm done with that representation, I need to calculate how good each route (from a root to the end) is (say I need to compare whether 11-21-31-41 is best or whether 11-21-31-42 is better), and to do that I will be using the variables that each node holds. But the values will have to be calculated recursively: say we start at 11, but to calculate how good 11-21-31-41 is we first need to go to 41, do some calculations, go to 31, do some calculations, go to 21, do some calculations, and only then can we calculate 11 using all the previous results. Same with 11-21-31-42 (we start with 42, then 31 -> 21 -> 11). I need to check all the possible routes that way. And here's the question: how do I do it? Maybe a BFS/DFS? But I'm not quite sure how to store all the results.
Those are some lengthy questions, but I hope I'm not asking you to do my homework (I have all the algorithms; it's just that I'm not really good at MATLAB and my teacher wouldn't let me do it in Java).
Granted, it may not be the most efficient solution, but if you have access to MATLAB 2008+, you can define a node class to represent your graph.
The MATLAB documentation has a nice example on linked lists, which you can use as a template.
Basically, a node would have a property 'linksTo', which points to the indices of the nodes it links to, and a method to calculate the cost of each of the links (possibly with some additional properties that describe each link). Then all you need is a function that moves down each link and brings the cost(s) with it when it moves back up.
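To make the recursion concrete, here is a rough sketch of that idea in Python (the MATLAB classdef version would follow the same shape as the linked-list example in the documentation); the class name, fields and cost formula are all placeholders:

class Node:
    def __init__(self, number, value_a, value_b, links_to=None):
        self.number = number              # the node's label, e.g. 11, 21, 31, 41
        self.value_a = value_a            # the extra variables each node holds
        self.value_b = value_b
        self.links_to = links_to or []    # nodes in the next stage it connects to

def best_route(node):
    # Recursively evaluate every route from `node` down to a leaf, mirroring the
    # 41 -> 31 -> 21 -> 11 order: children are evaluated first, then combined here.
    if not node.links_to:
        return node.value_a * node.value_b, [node.number]    # placeholder leaf cost
    best = None
    for child in node.links_to:
        child_cost, child_route = best_route(child)
        cost = child_cost + node.value_a * node.value_b      # placeholder combination
        if best is None or cost < best[0]:
            best = (cost, [node.number] + child_route)
    return best

# Called once per root (11, 12, 13), this compares all routes below that root.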