How to auto-tag content, algorithms and suggestions needed - tags

I am working with some really large databases of newspaper articles, I have them in a MySQL database, and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles is accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure what type of content that we are working with. However, I also want to tag based on the article-text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter common non-descriptive words out like "them", "I", "this", "these", "their" etc.
When all the common words was filtered out, the only thing left is words that is tag-worthy.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by space, for example if 1.000 articles contains the name "John Doe", and 1.000 articles contains the name of "John Hanson", I would only get the word "John" out of it, not his first name, and last name.

Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. a sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have functions in-built for this.
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then try how the algorithm tags the remaining set of articles to see how well it works.

You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(D) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET, there's Lucene, for Python there's scikits.learn.
If you want to do better than this, use language models. That requires some knowledge of probability theory.

Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation

If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Direchlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.

Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging and if the count of the instances of this word/phrase is greater than a threshold (probably based on the length of the article) then include the tag.
Use a part of speech tagging algorithm to help reduce the article into a sensible set of phrases and use a sensible method to extract tags out of this. Once you have the articles reduced using such an algorithm, you would be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.

If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo site and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott

Assuming you have pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string

Are you talking about the name-entity recognition ? if so, Anupam Jain is right. it;s research problem with using deep learning & CRF. In 2017, the name-entity recognition problem is force on semi-surprise learning technology.
The below link is related ner of paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, The below link is key-phase extraction on twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf

Related

Imbalanced multiclass classification using company names

I have this classification scenario below in which Im getting a very low F1, precision, recall and other metrics.
Target is multiclass (about ~200 classes) which is highly imbalanced
I only use company names as classifier (mostly 1-2 words which have max of 8 words), no other fields (like description, etc.)
Training data ~ 100k+ records
Preprocessing: numeric and special characters and stopwords removal
I have very low resources for processing (thats why when I try to use oversampling techniques like smote, distance_smote for multiclass, etc., I always get memory error)
Tried using different vectorization/embedding/tokenizer like word2vec, tfidf, fasttext, bert, roberta, etc. but to no avail
Tried using (and fine-tuning) different algorithms (networks, svm, trees, boosting, etc.) but also getting low scores.
I also did cost-sensitive learning (using class weights) but it only decreased my scores.
Tried all options that I know but scores are not increasing. Can you recommend other options here or do you think any part of the process that may be wrong/discarded? Thank you!
Distribution of target labels:
Sample observations
There is essentially no way to know that 'Exxon' is an oil company, and 'Apple' a computer company, and 'McDonalds' a fast-food chain, just from their company names.
Even if you have a list of every other company in the world, by name and type, that's not enough to make the deduction for these last 3. Only other outside info – like a few sentences about them, or other data – could classify them.
In fact, while company names sometimes describe their exact field-of-commerce, often they're totally arbitrary, as that gives them more freedom to range over many products/services, or create their own unique associations with the name (aka branding).
So I strongly suspect your (unshown) names & (unshown) labels are just too arbitrary for the data you're using to get very good at the task you're attempting.
Is there a real-world situation where someone will only have a company name – no other info, or research options – and benefit from correctly guessing the class? If so, more specifics about the situation might help generate more specific tactical recommendations. But mainly such recommendations will be: get richer data about the targets of the classification.
You might squeeze a little more out of vague trends in corporate naming via better preprocessing/feature-extraction. You may want to keep numbers, special-characters, & punctuation in some form, as they might include extra slight hints. Using subwords (character n-grams) might also reveal some shared word-roots used even in made-up names.

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here does.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way I have multiple spellings of various permutations of a parent level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records of multiple spellings and short spellings (e.g. apple, appls) and I want to group them together in some way. In this example by fruits and vegetables.
Clustering would be challenging since each entry is most often 1 or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table created by a human (not likely with millions of rows), is there any approach I can take with this problem?
You will need to first solve the spelling problem, unless you have Google scale data that could allow you to learn fixing spelling with Google scale statistics.
Then you will still have the problem that "Apple" could be a fruit or a computer. Apple and "Granny Smith" will be completely different. You best guess at this second stage is something like word2vec trained on massive data. Then you get high dimensional word vectors, and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.

Efficiently extract WikiData entities from text

I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.
The problem is simple: How to extract every WikiData entity from a given text?
An entity is defined as every noun, proper or regular. I.e., names of people, organizations, locations and things like chair, potatoes etc.
So far I've tried the following:
Tokenize the text with OpenNLP, and use the pre-trained models to extract people, location, organization and regular nouns.
Apply Porter Stemming where applicable.
Match all extracted nouns with the wmflabs-API to retrieve a potential WikiData ID.
This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
Suggestions?
I tagged the question Scala because I'm using Spark for the task.
Some suggestions:
consider Stanford NER in comparison to OpenNLP to see how it compares on your corpus
I wonder at the value of stemming for most entity names
I suspect you might be losing information by dividing the task into discrete stages
although Wikidata is new, the task isn't, so you might look at papers for Freebase|DBpedia|Wikipedia entity recognition|disambiguation
In particular, DBpedia Spotlight is one system designed for exactly this task.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf

Tools for getting intent from Twitter statuses? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Improve this question
I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some tool I can use to at least narrow it down a bit?
Alternatively, I could just use hashtags. But that requires more work on behalf of the users. I'm not super familiar with Twitter - do most people use hashtags (even for smaller scale issues), or would relying on them cut off a large segment of data?
I'd also be interested in grabbing Facebook statuses (with permission from the poster, of course), and hashtag use is pretty rare on Facebook.
I could use simple keyword search to crudely narrow the field, but that's more likely to require human intervention to determine which tweets should actually be posted alongside the content.
Ideas? Has this been done before?
There are two straightforward ways to go about finding tweets relevant to your content.
The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.
Supervised Document Classification
Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using. e.g.,
tweet: NBA finals rocked label: sports
tweet: Googlers now allowed to use Ruby! label: programming
tweet: eating lunch label: other
If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL or #NBA could all be assigned a label of sports.
Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets. A few available packages include:
NLTK (Python) - see Chapter 6 in the NLTK book on Learning to Classify Text
Classifier4J (Java)
nBayes (C#)
Semantic Overlap
Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF/IDF, which up-weights rare terms and down-weights more common ones.
Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.
Better Semantic Overlap
One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between The small automobile and A little car.
There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.
Great question. I think for twitter your best bet is to use hashtags because otherwise you need to create algorithms or find existing algorithms that do language analysis and improve over time based on user input/feedback.
For facebook you can kind of do what bing implemented a while back. As I covered in this article here:
http://www.socialtimes.com/2010/06/bing-adds-facebook-and-twitter-features-steps-up-social-services/
I wrote: For example, a search for “NBA Finals” will return fan-page content from Facebook, including posts from a local TV station. So if you're trying to augmented NBA related content, you could do a similar search as Bing provides - searching publically available fan-page content the way spiders index them for search engines. I'm not a developer so i'm not sure of the intricacies but I know it can be done.
Also you can display popular shared links from users who are publishing to ‘everyone’ will be aggregated for all non-fan page content. I'm not sure if this is limited to being published to 'everyone' and/or being 'popular' although I would assume so - but you can double check that.
Hope this helps
The problem with NLP is not the algorithm (although that is an issue) the problem is the resources. There are some open source shallow parsing tools (that's all you would need to get intent) that you could use but parsing thousands or millions of tweets would cost a fortune in computer time.
On the other hand like you said not all tweets have hashtags and there is no promise they will be relevant.
Maybe you can use a mixture of keyword search to filter out a few possibilities (those with the highest keyword density) and then use a deeper data analysis to pick the top 1 or 2. This would keep computer resources at a minimum and you should be able to get relevant tweets.

How do I adapt my recommendation engine to cold starts?

I am curious what are the methods / approaches to overcome the "cold start" problem where when a new user or an item enters the system, due to lack of info about this new entity, making recommendation is a problem.
I can think of doing some prediction based recommendation (like gender, nationality and so on).
You can cold start a recommendation system.
There are two type of recommendation systems; collaborative filtering and content-based. Content based systems use meta data about the things you are recommending. The question is then what meta data is important? The second approach is collaborative filtering which doesn't care about the meta data, it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about what terms in the meta data are important. In fact you don't need any meta data to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data you can use content-based recommendations. You can provide recommendations that are based on both methods, and at the beginning have 100% content-based, then as you get more data start to mix in collaborative filtering based.
That is the method I have used in the past.
Another common technique is to treat the content-based portion as a simple search problem. You just put in meta data as the text or body of your document then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
Working on this problem myself, but this paper from microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/, but the general conclusion was it's better to avoid such recommendations. In various parts of the worls same names belong to different sexes, and so on ...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data the term used for 'feature' is often 'classification facets'). If you don't properly profile data as it comes in the door, your recommendation engine will stay 'cold' as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop with which users can review the recommendations the personalization engine's suggestions. For example, Yes/No button for 'Was This Suggestion Helpful?' should queue a review of participants in one training dataset (i.e. the 'Recommend' training dataset) to another training dataset (i.e. DO NOT Recommend training dataset).
The model used for (Recommend/DO NOT Recommend) suggestions should never be considered to be a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If functioning properly, one should expect that customers with different features will get different suggestions for (Recommend/DO NOT Recommend) in a given situation. That would the 'personalization' part of personalization engines.