Classifiers assembled with identical training sets using IBM Watson NLU and IBM Watson NLC services yield different results

Everyone actively using the Natural Language Classifier service from IBM Watson has seen the following message while using the API:
"On 9 August 2021, IBM announced the deprecation of the Natural Language Classifier service. The service will no longer be available from 8 August 2022. As of 9 September 2021, you will not be able to create new instances. Existing instances will be supported until 8 August 2022. Any instance that still exists on that date will be deleted.
For more information, see IBM Cloud Docs"
IBM actively encourages migrating NLC models to IBM's Natural Language Understanding (NLU) service. Today I migrated my first classification model from Natural Language Classifier to Natural Language Understanding. Since I did not dive into the technological background of either service, I wanted to compare the output of both services. In order to do so, I followed the migration guidelines provided by IBM (NLC --> NLU migration guidelines). To recreate the NLC classifier in NLU, I downloaded the complete set of training data used to create the initial classifier built in the NLC service, so the data sets used to train the NLC and NLU classifiers are identical. Recreating the classifier in NLU was straightforward, and the classifier training took about the same time as in NLC.
To compare the performance, I then assembled a test set of phrases that was not used for training in either the NLC or NLU service. The test set contains 100 phrases that were passed through both the NLC and the NLU classifier. To my surprise, the differences are substantial: 18 out of 100 results differ by more than 0.30 in confidence value, and 37 out of 100 differ by more than 0.20.
In my opinion, this difference is too large to blindly move on to migrating all NLC models to NLU without any hesitation. The results I obtained so far justify further investigation, using a manual curation step by an SME to validate the analysis results. I am not too happy about this. I was wondering whether more users have seen this issue and/or made the same observation. Perhaps someone can shed light on the differences in analysis results between the NLC and NLU services, and on how to close the gap between the results obtained with the two services.
Please find below an excerpt of the comparison results:
Title | NLC | NLU | Comparability
"Microbial Volatile Organic Compound (VOC)-Driven Dissolution and Surface Modification of Phosphorus-Containing Soil Minerals for Plant Nutrition: An Indirect Route for VOC-Based Plant-Microbe Communications" | 0.01 | 0.05 | comparable
"Valorization of kiwi agricultural waste and industry by-products by recovering bioactive compounds and applications as food additives: A circular economy model" | 0.01 | 0.05 | comparable
"Quantitatively unravelling the effect of altitude of cultivation on the volatiles fingerprint of wheat by a chemometric approach" | 0.70 | 0.39 | different
"Identification of volatile biomarkers for high-throughput sensing of soft rot and Pythium leak diseases in stored potatoes" | 0.01 | 0.33 | different
"Impact of Electrolyzed Water on the Microbial Spoilage Profile of Piedmontese Steak Tartare" | 0.08 | 0.50 | different
"Review on factors affecting Coffee Volatiles: From Seed to Cup" | 0.67 | 0.90 | different
"Chemometric analysis of the volatile profile in peduncles of cashew clones and its correlation with sensory attributes" | 0.79 | 0.98 | comparable
"Surface-enhanced Raman scattering sensors for biomedical and molecular detection applications in space" | 0.00 | 0.00 | comparable
"Understanding the flavor signature of the rice grown in different regions of China via metabolite profiling" | 0.26 | 0.70 | different
"Nutritional composition, antioxidant activity, volatile compounds, and stability properties of sweet potato residues fermented with selected lactic acid bacteria and bifidobacteria" | 0.77 | 0.87 | comparable
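A minimal sketch of how such comparability labels can be computed, assuming the per-title confidence values have already been collected from each service (the rows and values below are placeholders taken from the table):

# Label each test phrase as "comparable" or "different" based on the
# absolute gap between the NLC and NLU confidence values.
results = [
    # (title, nlc_confidence, nlu_confidence)
    ("Quantitatively unravelling the effect of altitude ...", 0.70, 0.39),
    ("Surface-enhanced Raman scattering sensors ...", 0.00, 0.00),
]

THRESHOLD = 0.30  # use 0.20 for the stricter comparison mentioned above

for title, nlc, nlu in results:
    gap = abs(nlc - nlu)
    label = "different" if gap > THRESHOLD else "comparable"
    print(f"{label}\t{gap:.2f}\t{title}")

n_different = sum(abs(nlc - nlu) > THRESHOLD for _, nlc, nlu in results)
print(f"{n_different} of {len(results)} phrases differ by more than {THRESHOLD}")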

We have also been migrating our classifiers from NLC to NLU and analysing the results to explain the differences. We explored several possible factors that might have an influence (upper case vs. lower case, text length, ...) but found no correlation in these cases.
We did, however, find some correlation between the gap in score between the 1st and 2nd class returned by NLU and the score drop relative to NLC. That is to say, the closer the score of the second class is to the first, the lower the NLU score on the first class. We call this confusion. In our data there are times when the confusion is 'real' (i.e. an SME would also classify the test phrase as borderline between two classes), but there were also times when we realized we could improve our training data to make the classes more distinct.
Bottom line: we cannot explain the NLU internals that generate the difference, and we still see a drop in scores between NLC and NLU, but it is across the board. We will move ahead with NLU despite the lower scores: it does not hinder our interpretation of the results.
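To make the 'confusion' check concrete, here is a rough sketch (the field names and values are hypothetical; each record is assumed to hold the NLC top-class score and the top two NLU class scores for the same phrase):

import numpy as np

# Hypothetical per-phrase records.
records = [
    {"nlc_top": 0.70, "nlu_top": 0.39, "nlu_second": 0.35},
    {"nlc_top": 0.79, "nlu_top": 0.98, "nlu_second": 0.01},
    # ... one record per test phrase
]

# Gap between the NLU top class and its runner-up ("confusion" is high
# when this gap is small).
gap = np.array([r["nlu_top"] - r["nlu_second"] for r in records])
# Score drop from NLC to NLU on the top class.
drop = np.array([r["nlc_top"] - r["nlu_top"] for r in records])

# A negative correlation (smaller gap, larger drop) would support the
# observation described above.
print(np.corrcoef(gap, drop)[0, 1])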

Related

Comparison between fasttext and LDA

Hi. Last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation (LDA) is another way to do topic modeling. My question is: has anyone done a comparison of the pros and cons of these two?
I haven't tried fastText, but here are a few pros and cons of LDA based on my experience.
Pros
Iterative model, with support for Apache Spark.
Takes in a corpus of documents and does topic modeling.
Not only finds out what a document is talking about but also finds related documents.
The Apache Spark community is continuously contributing to it; it first worked on MLlib and is now available in the ml library (a rough usage sketch is included after the example below).
Cons
Stopwords need to be defined well, and they have to be related to the context of the documents. For example, "document" is a word with a high frequency of appearance and may top the chart of recommended topic terms, but it may or may not be relevant, so we need to update the stopword list for it.
Sometimes the resulting classification can be irrelevant. In the example below it is hard to infer what this bucket is talking about:
Topic:
Term:discipline
Term:disciplines
Term:notestable
Term:winning
Term:pathways
Term:chapterclosingtable
Term:metaprograms
Term:breakthroughs
Term:distinctions
Term:rescue
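As a side note, here is a rough sketch of how such topics can be produced with Spark's ml LDA (file name, stopword handling and parameters are placeholders, not the setup used above):

# Topic modeling with Spark's ml LDA (the successor to the MLlib version).
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-sketch").getOrCreate()

docs = spark.read.text("documents.txt").withColumnRenamed("value", "text")

tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
# Context-specific stopwords (e.g. "document") would be added here.
filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="features",
                           vocabSize=10000).fit(filtered)
vectors = cv_model.transform(filtered)

lda_model = LDA(k=10, maxIter=20).fit(vectors)

# Print the top terms per topic, similar to the Topic/Term listing above.
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    print("Topic:", [vocab[i] for i in row.termIndices])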
If anyone has done research on fastText, can you please share your learnings?
fastText offers more than topic modelling: it is a tool for generating word embeddings and for text classification using a shallow neural network.
The authors state its performance is comparable with much more complex “deep learning” algorithms, but the training time is significantly lower.
Pros:
=> It is extremely easy to train your own fastText model,
$ ./fasttext skipgram -input data.txt -output model
Just provide your input and output files and the architecture to be used, and that's all. If you wish to customize your model a bit, fastText provides options to change the hyper-parameters as well.
=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams so that similar words have similar vectors even if they happen to occur in different contexts. For example, “supervised”, “supervise” and “supervisor” all are assigned similar vectors.
=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite. Even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
=> fastText also provides the option to generate vectors for paragraphs or sentences. Similar documents can be found by comparing the vectors of documents.
=> The option to predict likely labels for a piece of text has been included too.
=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.
Cons:
=> As fastText is command-line based, I struggled while incorporating it into my project; this might not be an issue for others, though (see the Python sketch below).
=> No in-built method to find similar words or paragraphs.
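On the command-line point: assuming the fasttext Python package from the same repo is installed (pip install fasttext), the word-vector and label-prediction features above can be sketched like this (file names are placeholders):

import fasttext

# Unsupervised skip-gram training, equivalent to the CLI call above.
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# Character n-grams let the model build vectors even for unseen words.
print(model.get_word_vector("supervised")[:5])
print(model.get_word_vector("supervisorish")[:5])  # out-of-vocabulary word

# Text classification: the training file uses lines like
#   __label__positive some example sentence
clf = fasttext.train_supervised(input="train.txt")
print(clf.predict("this is a piece of text to classify", k=2))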
For those who wish to read more, here are the links to the official research papers:
1) https://arxiv.org/pdf/1607.04606.pdf
2) https://arxiv.org/pdf/1607.01759.pdf
And link to the official repo:
https://github.com/facebookresearch/fastText

Why such bad performance for Moses using Europarl?

I have started playing around with Moses and tried to make what I believe would be a fairly standard baseline system. I have basically followed the steps described on the website, but instead of using news-commentary I have used Europarl v7 for training, with the WMT 2006 development set and the original Europarl common test set. My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
To summarise, my workflow was more or less this:
tokenizer.perl on everything
lowercase.perl (instead of truecase)
clean-corpus-n.perl
Train IRSTLM model using only French data from Europarl v7
train-model.perl exactly as described
mert-moses.pl using WMT 2006 dev
Testing and measuring performances as described
And the resulting BLEU score is .26... This leads me to two questions:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
Just to put things straight first: the .68 you are referring to has nothing to do with BLEU.
My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
The article you refer to only states that 68% of the pronouns (using co-reference resolution) were translated correctly. It nowhere mentions that a .68 BLEU score was obtained. As a matter of fact, no scores were given, probably because the qualitative improvement the paper proposes cannot be measured with statistical significance (which happens a lot if you only improve on a small number of words). For this reason, the paper uses a manual evaluation of the pronouns only:
A better evaluation metric is the number of correctly translated pronouns. This requires manual inspection of the translation results.
This is where the .68 comes into play.
Now to answer your questions with respect to the .26 you got:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Yes, it is. You can find the performance of WMT language pairs here: http://matrix.statmt.org/
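For scale: BLEU is usually reported on a 0-100 scale, so .26 corresponds to 26 BLEU. If you want to recompute the score outside the Moses scripts, here is a rough sketch using the sacrebleu package (file names are placeholders):

import sacrebleu

with open("test.translated.fr") as f:
    hypotheses = [line.strip() for line in f]
with open("test.reference.fr") as f:
    references = [line.strip() for line in f]

# sacrebleu reports on a 0-100 scale, so a "0.26" BLEU shows up as roughly 26.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)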
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
I assume that you trained your system correctly. With respect to the "undisclosed corpus" question: members of the academic community normally state for each experiment which data sets were used for training, testing and tuning, at least in peer-reviewed publications. The only exception is the WMT task (see for example http://www.statmt.org/wmt14/translation-task.html), where privately owned corpora may be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.

Clearing Mesh of Graph

When we do information visualization of documents, the graph generated across multiple documents often forms a dense mesh. To get a clear picture it is easier to work with a minimal data load, so summarization is a good thing. But if the document load reaches a million, the graph still forms a big mesh even with summarization.
I am a bit perplexed about how to clear up the mesh. Reading and working through http://www.jerrytalton.net/research/Talton04SSMSA.report/Talton04SSMSA.pdf has not been much help, as the data is huge.
I would appreciate it if any learned members could kindly help me out.
Regards,
SK
Are you talking about creating a graph or network of the documents? For example, you could have a network of documents linked by their citations, by having shared authors, by having the same terms appearing in them, etc. This isn't generally called a mesh problem; instead, it is an automatic graph layout problem.
You need either better layout algorithms or to do some kind of clustering and reduction. There are many clustering algorithms you can use, for example Wakita & Tsurumi's:
Ken Wakita and Toshiyuki Tsurumi. 2007. Finding community structure in mega-scale social networks: [extended abstract]. Proc. 16th international conference on World Wide Web (WWW '07). 1275-1276. DOI=10.1145/1242572.1242805.
One that is particularly targeted at reducing complexity through "graph summarization" is Navlakha et al. 2008:
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. Proc. 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). 419-432. DOI=10.1145/1376616.1376661.
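For a rough sense of the clustering-and-reduction route, here is a sketch that collapses each detected community into a single node using networkx (standing in for the cited algorithms; the example graph is a placeholder):

import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # placeholder for the real document graph

# Community detection as the clustering step.
communities = list(community.greedy_modularity_communities(G))

# Reduced graph: one node per community, edges weighted by the number of
# original edges running between two communities.
node_to_comm = {n: i for i, c in enumerate(communities) for n in c}
reduced = nx.Graph()
reduced.add_nodes_from(range(len(communities)))
for u, v in G.edges():
    cu, cv = node_to_comm[u], node_to_comm[v]
    if cu != cv:
        w = reduced[cu][cv]["weight"] + 1 if reduced.has_edge(cu, cv) else 1
        reduced.add_edge(cu, cv, weight=w)

print(len(G), "nodes reduced to", len(reduced), "community nodes")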
You could also check out my latest paper, which replaces common repeating patterns in the network with representative glyphs:
Dunne, C. & Shneiderman, B. 2013. Motif simplification: improving network visualization readability with fan, connector, and clique glyphs. Proc. 2013 SIGCHI Conference on Human Factors in Computing Systems (CHI '13). PDF.
Here's an example picture of the reduction possible:

Clustering or classification?

I am stuck deciding whether to apply classification or clustering to the data set I have. The more I think about it, the more confused I get. Here's what I am confronted with.
I have news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, the economy, quarterly income, etc. My goal is to have the news sorted in such a way that I know which news items correspond to which company, e.g. for the news item "Apple launches new iPhone", I need to associate the company Apple with it. A particular news item/document only contains a 'title' and a 'description', so I have to analyze the text in order to find out which company the news refers to. It could be multiple companies too.
To solve this, I turned to Mahout.
I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel', etc. as top terms in my clusters, and from there I would know that the news in a cluster corresponds to its cluster label, but things were a bit different. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense, as clustering algorithms look for common terms). There were some 'Apple' clusters, but the news items associated with them were very few. I thought maybe clustering is not for this kind of problem, as much of the company news goes into more general clusters (investment, profit) instead of the specific company cluster (Apple).
I started reading about classification, which requires training data. The name was convincing too, as I actually want to 'classify' my news items into 'company names'. As I read on, I got the impression that the name 'classification' is a bit deceiving and that the technique is used more for prediction purposes than for classification. The other confusion I had was how to prepare training data for news documents. Let's assume I have a list of companies that I am interested in, and I write a program to produce training data for the classifier: the program checks whether the news title or description contains the company name 'Apple', and if so, it is a news story about Apple. Is this how I can prepare training data? (Of course, I read that training data is actually a set of predictors and target variables.) If so, then why should I use Mahout classification in the first place? I should ditch Mahout and instead use this little program that I wrote for producing training data (which actually does the classification).
You can see how confused I am about how to address this issue. Another thing that concerns me is whether it is possible to make a system intelligent enough that, if a news item says 'iPhone sales at a record high' without using the word 'Apple', the system can still classify it as news related to Apple.
Thank you in advance for pointing me in the right direction.
Copying my reply from the mailing list:
Classifiers are supervised learning algorithms, so you need to provide a bunch of examples of positive and negative classes. In your example, it would be fine to label a bunch of articles as "about Apple" or not, then use feature vectors derived from TF-IDF as input, with these labels, to train a classifier that can tell when an article is "about Apple".

I don't think it will quite work to automatically generate the training set by labeling according to the simple rule that it is about Apple if 'Apple' is in the title. Well, if you do that, then there is no point in training a classifier. You can make a trivial classifier that achieves 100% accuracy on your test set by just checking if 'Apple' is in the title! Yes, you are right, this gains you nothing.

Clearly you want to learn something subtler from the classifier, so that an article titled "Apple juice shown to reduce risk of dementia" isn't classified as about the company. You'd really need to feed it hand-classified documents.

That's the bad news, but, sure, you can certainly train N classifiers for N topics this way.

Classifiers put items into a class or not. They are not the same as regression techniques, which predict a continuous value for an input. They're related but distinct.

Clustering has the advantage of being unsupervised. You don't need labels. However, the resulting clusters are not guaranteed to match up to your notion of article topics. You may see a cluster that has a lot of Apple articles, some about the iPod, but also some about Samsung and laptops in general. I don't think this is the best tool for your problem.
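A minimal sketch of the TF-IDF-plus-classifier idea using scikit-learn (the texts and labels are placeholders; real labels would be hand-assigned as discussed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Apple launches new iPhone with record preorders",
    "Apple juice shown to reduce risk of dementia",
    "Samsung unveils new laptop lineup",
    "Apple quarterly income beats expectations",
]
labels = [1, 0, 0, 1]  # 1 = about the company Apple, 0 = not

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Estimated probability that an unseen article is "about Apple".
print(clf.predict_proba(["iPhone sales at a record high"])[0][1])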
First of all, you don't need Mahout. 3000 documents is close to nothing; revisit Mahout when you hit a million. I've been processing 100,000 images on a single computer, so you really can skip the overhead of Mahout for now.
What you are trying to do sounds like classification to me, because you have predefined classes.
A clustering algorithm is unsupervised. It will (unless you overfit the parameters) likely break Apple into "iPad/iPhone" and "Macbook". Or on the other hand, it may merge Apple and Google, as they are closely related (much more than, say, Apple and Ford).
Yes, you need training data that reflects the structure you want to measure. There is other structure too (e.g. iPhones not being the same as MacBooks, and Google, Facebook and Apple being more similar companies than Kellogg's, Ford and Apple). If you want a company-level structure, you need training data at this level of detail.

Making predictions from a CV

I have a database with many CVs, including structured data on the gender, age, address, number of years of education, and many other parameters of each person.
For about 10% of the sample, I also have additional data about a certain action they've made at some point in time. For instance, that Jane took a home loan in July 1998 or that John started pilot training in Jan. 2007 and got his license in Dec. 2007.
I need an algorithm that will give, for each of the actions, the probability that it will happen for each person in future time increments. For instance, that the chance of Bill taking a home loan is 2% in 2011, 3.5% in 2012, etc.
How should I approach this? Regression analysis? SVM? Neural net? Something else?
Is there perhaps even some standard tool/library that I can use with just the obvious customizations?
The probability that X happens given that Y happened is right out of Bayesian inference, I think.
Lou is right; this is a case for Bayesian inference.
The best tool/library to solve this is the R statistical programming language (r-project.org).
Take a look at the Bayesian Inference Libraries in R:
http://cran.r-project.org/web/views/Bayesian.html
How many people are in the "10% of the sample"? If it's below 100 people or so, I would fear that the results of the analysis could not be significant. If it's 1000 or more people, the results will be quite good (rule of thumb).
I would first export the data to R (r-project.org) and do any necessary data cleaning. Then find a person familiar with R and advanced statistics; they will be able to solve this very quickly. Or try it yourself, but R takes some time to get used to in the beginning.
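To illustrate the setup outside R, here is a minimal Python sketch with a plain logistic regression standing in for a full Bayesian treatment (the column names and values are hypothetical):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per person: CV features plus whether the action (e.g. a home loan)
# was taken within a given one-year window.
df = pd.DataFrame({
    "age":                 [34, 41, 29, 52, 38],
    "years_education":     [16, 12, 18, 10, 14],
    "took_loan_next_year": [1, 0, 0, 1, 0],
})

X = df[["age", "years_education"]]
y = df["took_loan_next_year"]

model = LogisticRegression().fit(X, y)

# Estimated probability for a new person in the next one-year increment.
new_person = pd.DataFrame({"age": [45], "years_education": [15]})
print(model.predict_proba(new_person)[0][1])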
Concerning the tool/library choice, I suggest you give Weka a try. It's an open source tool for experimenting with data mining and machine learning. Weka has several tools for reading, processing and filtering your data, as well as prediction and classification tools.
However, you must have a strong foundation in the above-mentioned fields in order to arrive at a useful result.