Chinese Segmentation: ICTCLAS Training Corpora

I am using the ICTCLAS segmentation tool for Chinese. We can read in "Automatic Recognition of Chinese Unknown Words Based on Roles Tagging" (Zhang & Liu, 2002) that it has been trained on the Peking University Corpus (PKU): "The training corpus came from one-month news from the People's Daily with 2,305,896 Chinese characters, which are manually checked after word segmentation and POS tagging (It can be downloaded at icl.pku.edu.cn, the homepage of the Institute of Computational Linguistics, Peking University)."
But I have not found any other mention of the data they used since this 2002 paper, and I would like to confirm that they still train the segmenter on PKU.

Related

Classifiers assembled with identical training sets using IBM Watson NLU and IBM Watson NLC services yield different results

Everyone actively using the Natural Language Classifier service from IBM Watson has seen the following message while using the API:
"On 9 August 2021, IBM announced the deprecation of the Natural Language Classifier service. The service will no longer be available from 8 August 2022. As of 9 September 2021, you will not be able to create new instances. Existing instances will be supported until 8 August 2022. Any instance that still exists on that date will be deleted.
For more information, see IBM Cloud Docs"
IBM actively promotes migrating NLC models to IBM's Natural Language Understanding service. Today I migrated my first classification model from Natural Language Classifier to Natural Language Understanding. Since I have not dived into the technological background of either service, I wanted to compare the output of both. To do so, I followed the migration guidelines provided by IBM (NLC --> NLU migration guidelines). To recreate the NLC classifier in NLU, I downloaded the complete set of training data used to create the initial classifier built in the NLC service, so the data sets used to train the NLC and NLU classifiers are identical. Recreating the classifier in NLU was straightforward, and training took about the same time as in NLC.
To compare the performance, I then assembled a test set of 100 phrases that were not used for training in either the NLC or NLU service and passed them through both classifiers. To my great surprise, the differences are substantial: 18 out of 100 results differ by more than 0.30 in confidence value, and 37 out of 100 differ by more than 0.20.
In my opinion, this difference is too large to blindly migrate all NLC models to NLU without hesitation. The results I have obtained so far justify further investigation, using a manual curation step by an SME to validate the analysis results, and I am not too happy about this. I was wondering whether other users have seen this issue or made the same observation. Perhaps someone can shed light on the differences in analysis results between the NLC and NLU services, and on how to close the gap between them.
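For reference, here is a minimal sketch of how such a tally can be computed, assuming the per-title confidences are exported to a hypothetical semicolon-separated file scores.csv with one header row and the columns title;nlc;nlu (decimal points rather than decimal commas):

# count how many test phrases differ by more than 0.20 / 0.30 in confidence value
awk -F';' 'NR > 1 {
    d = $2 - $3; if (d < 0) d = -d      # absolute difference between NLC and NLU confidence
    if (d > 0.30) over30++
    if (d > 0.20) over20++
} END { printf "diff > 0.30: %d, diff > 0.20: %d (out of %d phrases)\n", over30, over20, NR - 1 }' scores.csv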
Please find below an excerpt of the comparison results:
title | NLC | NLU | Comparability
"Microbial Volatile Organic Compound (VOC)-Driven Dissolution and Surface Modification of Phosphorus-Containing Soil Minerals for Plant Nutrition: An Indirect Route for VOC-Based Plant-Microbe Communications" | 0,01 | 0,05 | comparable
"Valorization of kiwi agricultural waste and industry by-products by recovering bioactive compounds and applications as food additives: A circular economy model" | 0,01 | 0,05 | comparable
"Quantitatively unravelling the effect of altitude of cultivation on the volatiles fingerprint of wheat by a chemometric approach" | 0,70 | 0,39 | different
"Identification of volatile biomarkers for high-throughput sensing of soft rot and Pythium leak diseases in stored potatoes" | 0,01 | 0,33 | different
"Impact of Electrolyzed Water on the Microbial Spoilage Profile of Piedmontese Steak Tartare" | 0,08 | 0,50 | different
"Review on factors affecting Coffee Volatiles: From Seed to Cup" | 0,67 | 0,90 | different
"Chemometric analysis of the volatile profile in peduncles of cashew clones and its correlation with sensory attributes" | 0,79 | 0,98 | comparable
"Surface-enhanced Raman scattering sensors for biomedical and molecular detection applications in space" | 0,00 | 0,00 | comparable
"Understanding the flavor signature of the rice grown in different regions of China via metabolite profiling" | 0,26 | 0,70 | different
"Nutritional composition, antioxidant activity, volatile compounds, and stability properties of sweet potato residues fermented with selected lactic acid bacteria and bifidobacteria" | 0,77 | 0,87 | comparable
We have also been migrating our classifiers from NLC to NLU and doing analysis to explain the differences. We explored different possible factors to see what might have an influence (upper case/lower case, text length, ...), but found no correlation in these cases.
We did, however, find some correlation between the difference in score between the 1st and 2nd class returned by NLU and the score drop from NLC. That is, the closer the score of the second class is to that of the first, the lower the NLU score on the first class. We call this confusion. In the case of our data, there are times when the confusion is 'real' (i.e. an SME would also classify the test phrase as borderline between two classes), but there were also times when we realized we could improve our training data to have more 'distinct' classes.
Bottom line: we cannot explain the internals of NLU that generate the difference, and we do still see a drop in the scores between NLC and NLU, but it is across the board. We will move ahead to NLU despite the lower scores: it does not hinder our interpretation of the results.
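For anyone who wants to reproduce this check, here is a minimal sketch, assuming the scores are exported to a hypothetical semicolon-separated file confusion.csv with one header row and the columns nlc_top1;nlu_top1;nlu_top2, one row per test phrase:

# Pearson correlation between the NLU gap (top1 - top2) and the NLC->NLU drop (nlc_top1 - nlu_top1)
awk -F';' 'NR > 1 {
    gap  = $2 - $3                      # how close the second NLU class is to the first
    drop = $1 - $2                      # how much the top confidence fell from NLC to NLU
    n++; sx += gap; sy += drop; sxx += gap*gap; syy += drop*drop; sxy += gap*drop
} END {
    r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
    printf "Pearson r between NLU gap and NLC->NLU drop: %.3f\n", r
}' confusion.csv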

Comparison between fasttext and LDA

Hi, last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation (LDA) is another way to do topic modeling. My question is: has anyone compared the pros and cons of these two?
I haven't tried fastText, but here are a few pros and cons for LDA based on my experience.
Pro
Iterative model, with support for Apache Spark
Takes in a corpus of documents and does topic modeling.
Not only finds out what a document is talking about but also finds related documents
The Apache Spark community is continuously contributing to this. Earlier they made it work on MLlib, now on the ML library.
Con
Stopwords need to be defined well, and they have to be related to the context of the documents. For example, "document" is a word with a high frequency of appearance that may top the chart of recommended topics, but it may or may not be relevant, so we need to add such words to the stopword list.
Sometimes the classification might be irrelevant. In the example below it is hard to infer what this topic is about:
Topic:
Term:discipline
Term:disciplines
Term:notestable
Term:winning
Term:pathways
Term:chapterclosingtable
Term:metaprograms
Term:breakthroughs
Term:distinctions
Term:rescue
If anyone has done research on fastText, can you please share what you have learned?
fastText offers more than topic modelling: it is a tool for generating word embeddings and for text classification using a shallow neural network.
The authors state its performance is comparable with much more complex “deep learning” algorithms, but the training time is significantly lower.
Pros:
=> It is extremely easy to train your own fastText model:
$ ./fasttext skipgram -input data.txt -output model
Just provide your input and output files and the architecture to be used, and that's all; if you wish to customize your model a bit, fastText provides the option to change the hyper-parameters as well (see the short command sketch after the pros and cons below).
=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams, so that similar words have similar vectors even if they happen to occur in different contexts. For example, "supervised", "supervise" and "supervisor" are all assigned similar vectors.
=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite: even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
=> fastText also provides the option to generate vectors for paragraphs or sentences, so similar documents can be found by comparing their vectors.
=> The option to predict likely labels for a piece of text has been included too.
=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.
Cons:
=> As fastText is command-line based, I struggled to incorporate it into my project; this might not be an issue for others, though.
=> No in-built method to find similar words or paragraphs.
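A minimal command-line sketch illustrating a few of the points above (file names are placeholders; for the classifier, each line of train.txt is assumed to start with a __label__<class> prefix, as in the fastText documentation):

# train skip-gram embeddings with custom hyper-parameters
$ ./fasttext skipgram -input data.txt -output model -dim 100 -epoch 10 -minn 3 -maxn 6
# vector for an arbitrary (possibly out-of-vocabulary) word, built from its character n-grams
$ echo "supervisor" | ./fasttext print-word-vectors model.bin
# vectors for whole sentences/paragraphs, one per input line
$ ./fasttext print-sentence-vectors model.bin < sentences.txt
# train a text classifier and predict the single most likely label for each line of test.txt
$ ./fasttext supervised -input train.txt -output classifier
$ ./fasttext predict classifier.bin test.txt 1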
For those who wish to read more, here are the links to the official research papers:
1) https://arxiv.org/pdf/1607.04606.pdf
2) https://arxiv.org/pdf/1607.01759.pdf
And link to the official repo:
https://github.com/facebookresearch/fastText

Why such a bad performance for Moses using Europarl?

I have started playing around with Moses and have tried to build what I believe would be a fairly standard baseline system. I have basically followed the steps described on the website, but instead of using news-commentary I have used Europarl v7 for training, with the WMT 2006 development set and the original Europarl common test set. My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
To summarise, my workflow was more or less this (a rough command sketch follows the list):
tokenizer.perl on everything
lowercase.perl (instead of truecase)
clean-corpus-n.perl
Train IRSTLM model using only French data from Europarl v7
train-model.perl exactly as described
mert-moses.pl using WMT 2006 dev
Testing and measuring performance as described
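A rough sketch of those steps (not the exact commands I ran): the paths, the language pair (EN->FR), the LM order and the GIZA++ binary directory are all assumptions, and the IRSTLM invocation is abbreviated because its flags vary between versions:

# tokenize and lowercase both sides of Europarl v7
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < europarl.en > corpus.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < europarl.fr > corpus.tok.fr
~/mosesdecoder/scripts/tokenizer/lowercase.perl < corpus.tok.en > corpus.lc.en
~/mosesdecoder/scripts/tokenizer/lowercase.perl < corpus.tok.fr > corpus.lc.fr

# drop empty and overlong sentence pairs (1-80 tokens)
~/mosesdecoder/scripts/training/clean-corpus-n.perl corpus.lc en fr corpus.clean 1 80

# build the French language model with IRSTLM (build-lm.sh / compile-lm), producing lm.fr.arpa
# (IRSTLM flags are omitted here; see the Moses baseline page for the exact invocation)

# train the translation model; the LM spec is factor:order:file:type (type 1 = IRSTLM here)
~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
    -corpus corpus.clean -f en -e fr \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$PWD/lm.fr.arpa:1 -external-bin-dir ~/giza-bin

# tune on the WMT 2006 dev set, then decode the test set and score it with BLEU
~/mosesdecoder/scripts/training/mert-moses.pl dev2006.en dev2006.fr \
    ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin
~/mosesdecoder/bin/moses -f mert-work/moses.ini < test.en > test.out
~/mosesdecoder/scripts/generic/multi-bleu.perl test.fr < test.out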
And the resulting BLEU score is .26... This leads me to two questions:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Are there any typical pitfalls that someone just starting out with SMT and/or Moses may have fallen into? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
Just to put things straight first: the .68 you are referring to has nothing to do with BLEU.
My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
The article you refer to only states that 68% of the pronouns (using co-reference resolution) were translated correctly. It nowhere mentions that a .68 BLEU score was obtained. As a matter of fact, no scores were given, probably because the qualitative improvement the paper proposes cannot be measured with statistical significance (which happens a lot if you only improve on a small number of words). For this reason, the paper uses a manual evaluation of the pronouns only:
A better evaluation metric is the number of correctly translated pronouns. This requires manual inspection of the translation results.
This is where the .68 comes into play.
Now to answer your questions with respect to the .26 you got:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Yes, it is. You can find the performance of WMT language pairs at http://matrix.statmt.org/
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
I assume that you trained your system correctly. With respect to the "undisclosed corpus" question: members of the academic community normally state for each experiment which data sets were used for training, testing and tuning, at least in peer-reviewed publications. The only exception is the WMT task (see for example http://www.statmt.org/wmt14/translation-task.html), where privately owned corpora may be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.

Error while importing txt file into Mallet

I have been having trouble converting some txt files to Mallet format. I keep getting:
Exception in thread "main" java.lang.IllegalStateException: Line #39843 does not match regex:
and line #39843 reads:
24393584 |Title Validation of a Danish version of the Toronto Extremity Salvage Score questionnaire for 
patients with sarcoma in the extremities.The Toronto Extremity Salvage Score (TESS) questionnaire is a selfadministered questionnaire designed to assess physical disability in patients having undergone surgery of the extremities. The aim of this study was to validate a Danish translation of the TESS. The TESS was translated according to international guidelines. A total of 22 consecutive patients attending the regular outpatient control programme were recruited for the study. To test their understanding of the questionnaires, they were asked to describe the meaning of five randomly selected questions from the TESS. The psychometric properties of the Danish version of TESS were tested for validity and reliability. To assess the testretest reliability, the patients filled in an extra TESS questionnaire one week after they had completed the first one. Patients showed good understanding of the questionnaire. There was a good internal consistency for both the upper and lower questionnaire measured by Cronbach's alpha. A BlandAltman plot showed acceptable limits of agreement for both questionnaires in the testretest. There was also good intraclass correlation coefficients for both questionnaires. The validity expressed as Spearman's rank correlation coefficient comparing the TESS with the QLQC30 was 0.89 and 0.90 for the questionnaire on upper and lower extremities, respectively. The psychometric properties of the Danish TESS showed good validity and reliability. not relevant.not relevant.
This happens for quite a few of the lines, and when I remove the offending line, the rest of the file is imported into Mallet. What in this line could be causing the regex mismatch?
thanks,
Priya
Mallet has problems handling certain machine symbols, because of bad programming. Try running
tr -dc '[:alnum:] ,.\n' < ./inputfile.txt > ./inputfilefixed.txt
before running Mallet. This removes everything except alphanumeric characters, spaces, commas, periods and newlines, which usually solves the problem for me.
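If you want to see what Mallet is choking on before stripping anything, you can inspect the offending line first (GNU sed/grep/cat assumed; the line number comes from the error message):

# show line 39843 with non-printing characters made visible
sed -n '39843p' ./inputfile.txt | cat -A | head -c 300
# list all lines that contain non-ASCII bytes
grep -nP '[^\x00-\x7F]' ./inputfile.txt | head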

Flickr API: What are the "vision" tags?

When querying the Flickr API and checking the returned tags, I noticed that I receive additional tags which are not shown on the web interface. For example, for this image:
http://www.flickr.com/photos/77060598@N08/12078886973
Besides the tags shown on the webpage (Nikon F2AS, Nikon, Black and White, B&W, Mountains, Germany, Snow, Landscape, Sky, Clouds), the JSON response contains the tags vision:outdoor=0949 and vision:sky=051.
I assume that some computer vision processing is applied by Flickr to automatically assign those tags. Am I right in this assumption? I cannot find any documentation about these tags. Is there any description of the algorithms they employ and/or the kind of tags and the meaning of the numbers they assign?
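For reference, the extra tags show up when requesting the photo's metadata directly, e.g. with the flickr.photos.getInfo method (the API key is a placeholder; the photo id is the one from the URL above):

# request the photo metadata as JSON; the returned tags block should include the vision:* machine tags
curl "https://api.flickr.com/services/rest/?method=flickr.photos.getInfo&api_key=YOUR_API_KEY&photo_id=12078886973&format=json&nojsoncallback=1"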
Yes, your assumption is right. These tags are image classification tags.
They are part of ongoing research in the area of classification and computational learning.
The research goal is to reach precise category-based image classification with minimal learning effort.
Yahoo Large-Scale Flickr-tag Image Classification Challenge
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2), 303-338, 2010 - PDF
http://pascallin.ecs.soton.ac.uk/challenges/VOC/
Training and Test Data
Results & Leaderboard