Using Conditional Random Fields for Named Entity Recognition - metadata

What is Conditional Random Field?
How does exactly Conditional Random Field identify proper names as person, organization, or place in a structured or unstructured text?
For example: This product is ordered by StackOverFlow Inc.
What does Conditional Random Field do to identify StackOverFlow Inc. as an organization?

A CRF is a discriminative, batch, tagging model, in the same general family as a Maximum Entropy Markov model.
A full explanation is book-length.
A short explanation is as follows:
Humans annotate 200-500K words of text, marking the entities.
Humans select a set of features that they hope indicate entities. Things like capitalization, or whether the word was seen in the training set with a tag.
A training procedure counts all the occurrences of the features.
The meat of the CRF algorithm search the space of all possible models that fit the counts to find a pretty good one.
At runtime, a decoder (probably a Viterbi decoder) looks at a sentence and decides what tag to assign to each word.
The hard parts of this are feature selection and the search algorithm in step 4.

Well to understand that you got to study a lot of things.
For start
Understand the basic of markov and bayesian networks.
Online course available in coursera by daphne coller
https://class.coursera.org/pgm/lecture/index
CRF is a special type of markov network where we have observation and hidden states.
The objective is to find the best State Assignment to the unobserved variables also known as MAP problem.
Be Prepared for a lot of probability and Optimization. :-)

Related

Implementating spell drawing/casting mechanism in Luau (Roblox)

I am coding a spell-casting system where you draw a symbol with your wand (mouse), and it can recognize said symbol.
There are two methods I believe might work; neural networking and an "invisible grid system"
The problem with the neural networking system is that It would be (likely) suboptimal in Roblox Luau, and not be able to match the performance nor speed I wish for. (Although, I may just be lacking in neural networking knowledge. Please let me know whether I should continue to try implementing it this way)
For the invisible grid system, I thought of converting the drawing into 1s and 0s (1 = drawn, 0 = blank), then seeing if it is similar to one of the symbols. I create the symbols by making a dictionary like:
local Symbol = { -- "Answer Key" shape, looks like a tilted square
00100,
01010,
10001,
01010,
00100,
}
The problem is that user error will likely cause it to be inaccurate, like this "spell"'s blue boxes, showing user error/inaccuracy. I'm also sure that if I have multiple Symbols, comparing every value in every symbol will surely not be quick.
Do you know an algorithm that could help me do this? Or just some alternative way of doing this I am missing? Thank you for reading my post.
I'm sorry if the format on this is incorrect, this is my first stack-overflow post. I will gladly delete this post if it doesn't abide to one of the rules. ( Let me know if there are any tags I should add )
One possible approach to solving this problem is to use a template matching algorithm. In this approach, you would create a "template" for each symbol that you want to recognize, which would be a grid of 1s and 0s similar to what you described in your question. Then, when the user draws a symbol, you would convert their drawing into a grid of 1s and 0s in the same way.
Next, you would compare the user's drawing to each of the templates using a similarity metric, such as the sum of absolute differences (SAD) or normalized cross-correlation (NCC). The template with the lowest SAD or highest NCC value would be considered the "best match" for the user's drawing, and therefore the recognized symbol.
There are a few advantages to using this approach:
It is relatively simple to implement, compared to a neural network.
It is fast, since you only need to compare the user's drawing to a small number of templates.
It can tolerate some user error, since the templates can be designed to be tolerant of slight variations in the user's drawing.
There are also some potential disadvantages to consider:
It may not be as accurate as a neural network, especially for complex or highly variable symbols.
The templates must be carefully designed to be representative of the expected variations in the user's drawings, which can be time-consuming.
Overall, whether this approach is suitable for your use case will depend on the specific requirements of your spell-casting system, including the number and complexity of the symbols you want to recognize, the accuracy and speed you need, and the resources (e.g. time, compute power) that are available to you.

How to predict word using trained CBOW

I have a question about CBOW prediction. Suppose my job is to use 3 surrounding words w(t-3), w(t-2), w(t-1)as input to predict one target word w(t). Once the model is trained and I want to predict a missing word after a sentence. Does this model only work for a sentence with four words which the first three are known and the last is unknown? If I have a sentence in 10 words. The first nine words are known, can I use 9 words as input to predict the last missing word in that sentence?
Word2vec CBOW mode typically uses symmetric windows around a target word. But it simply averages the (current in-training) word-vectors for all words in the window to find the 'inputs' for the prediction neural-network. Thus, it is tolerant of asymmetric windows – if there are fewer words are available on either side, fewer words on that side are used (and perhaps even zero on that side, for words at the front/end of a text).
Additionally, during each training example, it doesn't always use the maximum-window specified, but some random-sized window up-to the specified size. So for window=5, it will sometimes use just 1 on either side, and other times 2, 3, 4, or 5. This is done to effectively overweight closer words.
Finally and most importantly for your question, word2vec doesn't really do a full-prediction during training of "what exact word does the model say should be heat this target location?" In either the 'hierarchical softmax' or 'negative-sampling' variants, such an exact prediction can be expensive, requiring calculations of neural-network output-node activation levels proportionate to the size of the full corpus vocabulary.
Instead, it does the much-smaller number-of-calculations required to see how strongly the neural-network is predicting the actual target word observed in the training data, perhaps in contrast to a few other words. In hierarchical-softmax, this involves calculating output nodes for a short encoding of the one target word – ignoring all other output nodes encoding other words. In negative-sampling, this involves calculating the one distinct output node for the target word, plus a few output nodes for other randomly-chosen words (the 'negative' examples).
In neither case does training know if this target word is being predicted in preference over all other words – because it's not taking the time to evaluate all others words. It just looks at the current strength-of-outputs for a real example's target word, and nudges them (via back-propagation) to be slightly stronger.
The end result of this process is the word-vectors that are usefully-arranged for other purposes, where similar words are close to each other, and even certain relative directions and magnitudes also seem to match human judgements of words' relationships.
But the final word-vectors, and model-state, might still be just mediocre at predicting missing words from texts – because it was only ever nudged to be better on individual examples. You could theoretically compare a model's predictions for every possible target word, and thus force-create a sort of ranked-list of predicted-words – but that's more expensive than anything needed for training, and prediction of words like that isn't the usual downstream application of sets of word-vectors. So indeed most word2vec libraries don't even include any interface methods for doing full target-word prediction. (For example, the original word2vec.c from Google doesn't.)
A few versions ago, the Python gensim library added an experimental method for prediction, [predict_output_word()][1]. It only works for negative-sampling mode, and it doesn't quite handle window-word-weighting the same way as is done in training. You could give it a try, but don't be surprised if the results aren't impressive. As noted above, making actual predictions of words isn't the usual real goal of word2vec-training. (Other more stateful text-analysis, even just large co-occurrence tables, might do better at that. But they might not force word-vectors into interesting constellations like word2vec.)

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from commoncrawl website) but I am getting same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is better among all. How can select the best model out of all these models and why I am getting same word vectors for different models?
Sorry, If the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive. rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. word2vec). My goal is to use these word embeddings as features for a NN to classify texts into topic A, not topic A, and then perform event extraction on them on documents of topic A (using a second NN).
My first instinct is to preprocess removing stopwords, lemmatizing stemming, etc. But as I learn about NN a bit more I realize that applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present --to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents because the do convey contextual information), change written numbers to numeric, do not lowercase everything (useful for entity extraction), no stemming, no lemmatizing.
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lot's of nuances by preprocessing, but e.g. for text generation, it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Addi- tionally, word embeddings trained on multiword- grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "depends". It needs to be considered the classes you are trying to predict and the kind of documents you have. It's not the same to try to predict authorship (then you definitely need to keep all kinds of punctuation and case so stylometry will work) than sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?
There have been numerous research in this area in past 20 years. Yes computers don't understand language but we can train them to find similarity or difference in two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try Wordnet. It is a human-curated network of words which preserves the relationship between words according to human understanding. In short, it is a graph with nodes as something called 'synsets' and edges as relations between them. So any two words which are very close to each other are close in meaning. Words that fall within the same synset might mean exactly the same. Bag and Baggage are close - which you can find either by iteratively exploring node-to-node in a breadth first style - like starting with 'baggage', exploring its neighbors in an attempt to find 'baggage'. You'll have to limit this search upto a small number of iterations for any practical application. Another style is starting a random walk from a node and trying to reach the other node within a number of tries and distance. It you reach baggage from bag say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walk is more helpful in much larger and complex graphs.
There are many other similar resources online.
Method 2:
Word2Vec. Hard to explain it here but it works by creating a vector of a user's suggested number of dimensions based on its context in the text. There has been an idea for two decades that words in similar context mean the same. e.g. I'm gonna check out my bags and I'm gonna check out my baggage both might appear in text. You can read the paper for explanation (link in the end).
So you can train a Word2Vec model over a large amount of corpus. In the end, you will be able to get 'vector' for each word. You do not need to understand the significance of this vector. You can this vector representation to find similarity or difference between words, or generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came up two years ago and immediately became the 'thing-to-use' in most of NLP applications. The quality of this approach depends on amount and quality of your data. Generally Wikipedia dump is considered good training data for training as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
First example tells you the closest word (topn=1) to words woman and king but meanwhile also most away from the word man. The answer is queen.. Second example is odd one out. Third one tells you how similar the two words are, in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)