High-performance computing (HPC) for language technology in the EU

I am looking to compile a list of HPC systems open to researchers in the field of language technology.
Can you please point me to the HPC facilities that are available to researchers and to small and medium enterprises?
For example:
LEONARDO
LUMI: https://www.lumi-supercomputer.eu/


Classifiers assembled with identical training sets using IBM Watson NLU and IBM Watson NLC services yield different results

Everyone actively using the Natural Language Classifier service from IBM Watson has seen the following message while using the API:
"On 9 August 2021, IBM announced the deprecation of the Natural Language Classifier service. The service will no longer be available from 8 August 2022. As of 9 September 2021, you will not be able to create new instances. Existing instances will be supported until 8 August 2022. Any instance that still exists on that date will be deleted.
For more information, see IBM Cloud Docs"
IBM actively encourages migrating NLC models to IBM's Natural Language Understanding (NLU) service. Today I migrated my first classification model from Natural Language Classifier to Natural Language Understanding. Since I did not dive into the technological background of either service, I wanted to compare the output of both. In order to do so, I followed the migration guidelines provided by IBM (NLC --> NLU migration guidelines). To recreate the NLC classifier in NLU, I downloaded the complete set of training data used to create the initial classifier built in the NLC service, so the data sets used to train the NLC and NLU classifiers are identical. Recreating the classifier in NLU was straightforward, and training took about the same time as in NLC.
To compare performance, I then assembled a test set of 100 phrases that had not been used for training in either the NLC or the NLU service and passed each phrase through both classifiers. To my great surprise, the differences are substantial: 18 out of 100 results differ by more than 0.30 in confidence value, and 37 out of 100 differ by more than 0.20.
In my opinion, this difference is too large to simply migrate all NLC models to NLU without hesitation. The results I have obtained so far justify further investigation, with a manual curation step by an SME to validate the analysis results. I am not too happy about this. I was wondering whether other users have seen this issue and/or made the same observation. Perhaps someone can shed some light on the differences in analysis results between the NLC and NLU services, and on how to close the gap between them.
Please find below an excerpt of the comparison results:
Title | NLC confidence | NLU confidence | Comparability
"Microbial Volatile Organic Compound (VOC)-Driven Dissolution and Surface Modification of Phosphorus-Containing Soil Minerals for Plant Nutrition: An Indirect Route for VOC-Based Plant-Microbe Communications" | 0.01 | 0.05 | comparable
"Valorization of kiwi agricultural waste and industry by-products by recovering bioactive compounds and applications as food additives: A circular economy model" | 0.01 | 0.05 | comparable
"Quantitatively unravelling the effect of altitude of cultivation on the volatiles fingerprint of wheat by a chemometric approach" | 0.70 | 0.39 | different
"Identification of volatile biomarkers for high-throughput sensing of soft rot and Pythium leak diseases in stored potatoes" | 0.01 | 0.33 | different
"Impact of Electrolyzed Water on the Microbial Spoilage Profile of Piedmontese Steak Tartare" | 0.08 | 0.50 | different
"Review on factors affecting Coffee Volatiles: From Seed to Cup" | 0.67 | 0.90 | different
"Chemometric analysis of the volatile profile in peduncles of cashew clones and its correlation with sensory attributes" | 0.79 | 0.98 | comparable
"Surface-enhanced Raman scattering sensors for biomedical and molecular detection applications in space" | 0.00 | 0.00 | comparable
"Understanding the flavor signature of the rice grown in different regions of China via metabolite profiling" | 0.26 | 0.70 | different
"Nutritional composition, antioxidant activity, volatile compounds, and stability properties of sweet potato residues fermented with selected lactic acid bacteria and bifidobacteria" | 0.77 | 0.87 | comparable
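For what it's worth, the bucketing can be scripted. Here is a minimal sketch using the ten confidence pairs from the excerpt above and the 0.20/0.30 thresholds mentioned earlier (plain Python, nothing service-specific):

# Count how many NLC/NLU confidence pairs differ by more than a given threshold.
nlc = [0.01, 0.01, 0.70, 0.01, 0.08, 0.67, 0.79, 0.00, 0.26, 0.77]
nlu = [0.05, 0.05, 0.39, 0.33, 0.50, 0.90, 0.98, 0.00, 0.70, 0.87]
deltas = [abs(a - b) for a, b in zip(nlc, nlu)]
print("differ by > 0.30:", sum(d > 0.30 for d in deltas))  # 4 of the 10 excerpted rows
print("differ by > 0.20:", sum(d > 0.20 for d in deltas))  # 5 of the 10 excerpted rows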
We have also been migrating our classifiers from NLC to NLU and doing analysis to explain the differences. We explored several factors that might have an influence (upper case vs. lower case, text length, ...) and found no correlation in those cases.
We did, however, find some correlation between the difference in score between the 1st and 2nd class returned by NLU and the score drop from NLC. That is to say, we noticed that the closer the score of the second class is to the first, the lower the NLU score on the first class. We call this confusion. In the case of our data there are times when the confusion is 'real' (i.e. an SME would also classify the test phrase as borderline between two classes), but there were also times when we realized we could improve our training data to have more 'distinct' classes.
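A rough way to check for such a correlation is sketched below; the numbers are placeholders, and statistics.correlation needs Python 3.10+ (older versions could use scipy.stats.pearsonr instead):

# Sketch: does a small gap between NLU's top two classes go with a larger NLC -> NLU score drop?
from statistics import correlation  # Pearson's r, Python 3.10+
nlu_top1 = [0.39, 0.33, 0.50, 0.90, 0.70]  # placeholder per-phrase values
nlu_top2 = [0.35, 0.30, 0.41, 0.05, 0.61]
nlc_top1 = [0.70, 0.01, 0.08, 0.67, 0.26]
gaps = [t1 - t2 for t1, t2 in zip(nlu_top1, nlu_top2)]  # small gap = high "confusion"
drops = [n - u for n, u in zip(nlc_top1, nlu_top1)]     # score drop from NLC to NLU
print("Pearson r between gap and drop:", correlation(gaps, drops))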
Bottom line: we cannot explain the internals of NLU that generate the difference, and we do still see a drop in the scores between NLC and NLU, but it is across the board. We will move ahead to NLU despite the lower scores: it does not hinder our interpretation of the results.

Why such a bad performance for Moses using Europarl?

I have started playing around with Moses and tried to build what I believe would be a fairly standard baseline system. I have basically followed the steps described on the website, but instead of using news-commentary I used Europarl v7 for training, with the WMT 2006 development set and the original Europarl common test set. My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
To summarise, my workflow was more or less this:
tokenizer.perl on everything
lowercase.perl (instead of truecase)
clean-corpus-n.perl
Train IRSTLM model using only French data from Europarl v7
train-model.perl exactly as described
mert-moses.pl using WMT 2006 dev
Testing and measuring performance as described
And the resulting BLEU score is .26... This leads me to two questions:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Are there any typical pitfalls that someone just starting with SMT and/or Moses may have fallen into? Or do researchers like Le Nagard & Koehn build their baseline systems differently from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
Just to put things straight first: the .68 you are referring to has nothing to do with BLEU.
My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.
The article you refer to only states that 68% of the pronouns (using co-reference resolution) were translated correctly. It nowhere mentions that a .68 BLEU score was obtained. As a matter of fact, no scores are given at all, probably because the qualitative improvement the paper proposes cannot be measured with statistical significance (which happens a lot when you only improve on a small number of words). For this reason, the paper uses a manual evaluation of the pronouns only:
A better evaluation metric is the number of correctly translated pronouns. This requires manual inspection of the translation results.
This is where the .68 comes into play.
Now to answer your questions with respect to the .26 you got:
Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Yes, it is. You can find the performance of submitted systems on the WMT language pairs here: http://matrix.statmt.org/
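For context, BLEU as reported here is a corpus-level score between 0 and 1 (often multiplied by 100). A minimal sketch of how such a score can be computed over the tokenized test output, assuming NLTK is available and using placeholder file names:

# Sketch only: corpus-level BLEU with one reference translation per sentence.
from nltk.translate.bleu_score import corpus_bleu
with open("test.output.fr") as hyp_file, open("test.reference.fr") as ref_file:
    hypotheses = [line.split() for line in hyp_file]
    references = [[line.split()] for line in ref_file]  # each sentence: a list of reference token lists
score = corpus_bleu(references, hypotheses)
print("BLEU: %.2f" % score)  # e.g. around 0.26 for the baseline described in the question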
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?
I assume that you trained your system correctly. With respect to the "undisclosed corpus" question: members of the academic community normally state, for each experiment, which data sets were used for training, testing, and tuning, at least in peer-reviewed publications. The only exception is the WMT task (see for example http://www.statmt.org/wmt14/translation-task.html), where privately owned corpora may be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.

Clearing the mesh of a graph

When we do information visualization of documents, the graph generated across multiple documents often forms a mesh. To get a clear picture, it is easy to build the graph with a minimal data load, so summarization helps. But if the document load reaches a million, the graph still forms a big mesh even with summarization.
I am a bit perplexed about how to clear up the mesh. Reading and working through http://www.jerrytalton.net/research/Talton04SSMSA.report/Talton04SSMSA.pdf has not been much help, as the data is huge.
I would be grateful if any learned members could help me out.
Regards,
SK
Are you talking about creating a graph or network of the documents? For example, you could have a network of documents linked by their citations, by shared authors, by the same terms appearing in them, etc. This isn't generally called a mesh problem; it is an automatic graph layout problem.
You need either better layout algorithms or some kind of clustering and reduction (a minimal clustering sketch follows the references below). There are many clustering algorithms you can use, for example Wakita & Tsurumi's:
Ken Wakita and Toshiyuki Tsurumi. 2007. Finding community structure in mega-scale social networks: [extended abstract]. Proc. 16th international conference on World Wide Web (WWW '07). 1275-1276. DOI=10.1145/1242572.1242805.
One that is particularly targeted at reducing complexity through "graph summarization" is Navlakha et al. 2008:
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. Proc. 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). 419-432. DOI=10.1145/1376616.1376661.
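As an illustrative sketch (not an implementation of the papers cited above), a community-detection pass can collapse a huge document graph into a much smaller summary graph before any layout is attempted. This assumes networkx and uses a built-in toy graph as a stand-in for a real document network:

# Collapse a large graph into one node per detected community before layout (sketch only).
import networkx as nx
from networkx.algorithms import community
G = nx.les_miserables_graph()  # stand-in for a real document graph
blocks = community.greedy_modularity_communities(G)
membership = {node: i for i, block in enumerate(blocks) for node in block}
summary = nx.Graph()
summary.add_nodes_from(range(len(blocks)))
for u, v in G.edges():
    cu, cv = membership[u], membership[v]
    if cu != cv:
        summary.add_edge(cu, cv)
print(G.number_of_nodes(), "nodes collapsed into", summary.number_of_nodes(), "cluster nodes")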
You could also check out my latest paper, which replaces common repeating patterns in the network with representative glyphs:
Dunne, C. & Shneiderman, B. 2013. Motif simplification: improving network visualization readability with fan, connector, and clique glyphs. Proc. 2013 SIGCHI Conference on Human Factors in Computing Systems (CHI '13). PDF.
Here's an example picture of the reduction possible: [image omitted]

What does PI in OSIsoft's 'PI System' stand for?

I can't tell if it stands for the symbol/number Pi, or if it stands for a previous name for the technology, like 'Process Intelligence'. PI is too close to the more common BI to be just a coincidence.
Note - There are Channel9 videos that demonstrate how MS uses OSIsoft to monitor some operations. Links to the C9 videos are from the 'SQL 2008 R2' CEP pages. The SQL CEP features are called StreamInsight.
PI used to stand for "Plant Information". Now that the PI System suite of products does much more and is used in many different environments, PI is just "PI".
PI stands for Plant Information or Process Information, depending on who you ask.
A little bit of history for you: before PI was used in various industries, it was mainly geared towards oil and gas. OSI used to stand for Oil Systems Incorporated.
It is mainly used to store large amounts of historical data for industrial purposes.
A good example is when you need to develop an enhancement to your process to reduce energy consumption: you can use the information from each piece of equipment and make plans to reduce its energy consumption.
Another example is when you need to make industrial information available to the entire corporation, such as the state of a piece of equipment (stopped or running). The PI System has its own interfaces to make this possible.
PI stands for Process Intelligence! In its storage of information, like an intersection of matrices, there is a small subset of the universe which represents the system, and that intersection has never been zero; like pi, it has information for as long as it is in connection with the plant.

Is there a tool that supports discrete mathematics?

Discrete mathematics (also finite mathematics) deals with topics such as logic, set theory, information theory, partially ordered sets, proofs, relations, and a number of other topics.
For other branches of mathematics, there are tools that support programming. For statistics, there are R and S, which have many useful statistical functions built in. For numerical analysis, Octave can be used as a language or integrated into C++.
I don't know of any languages or packages that deal specifically with discrete mathematics. Just about every language can be used to implement discrete-mathematics algorithms, but there should be libraries or environments out there designed specifically for these applications.
The current version of Mathematica is 7. License costs:
Home Edition: $295.
Standard: $2,495 Win/Mac/Linux PC ($3,120 for Solaris)
Government: $1,996 ($2,496 for Solaris)
Educational: $1,095 ($1,370 for Solaris)
Student: $139.95 (no Solaris)
Above, the Home Edition link says:
Mathematica Home Edition is a fully functional version of Mathematica Professional with the same features.
The current version of Maple is 12. License costs:
Student: $99
Commercial: $1,895
Academic: $995
Government: $1,795
And yes, check out Sage, mentioned above by Thomas Owens.
Mathematica
Mathematica has a Combinatorica package which, though quite venerable at this point, provides a good deal of support for combinatorics and graphs. Commands like these are available:
Needs["Combinatorica`"]            (* the package must be loaded first *)
NecklacePolynomial[8, m, Cyclic];  (* necklace-counting polynomial *)
GrayCodeSubsets[{1, 2, 3, 4}];     (* subsets in Gray-code order *)
IntegerPartitions[6]               (* built into the Mathematica kernel itself *)
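Not a dedicated discrete-mathematics environment, but for comparison, here is a rough plain-Python sketch of two of the operations above (power set and integer partitions), using nothing beyond the standard library:

from itertools import combinations

def subsets(items):
    # Power set of a collection, akin to Combinatorica's subset generators.
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def integer_partitions(n, max_part=None):
    # Partitions of n as descending tuples, like IntegerPartitions[6].
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest

print(list(subsets([1, 2, 3, 4])))   # 16 subsets
print(list(integer_partitions(6)))   # the 11 partitions of 6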
I'd say Mathematica is your best bet. Even if it does not come with some functionality out of the box, it has very well designed supplementary packages available for it on the net.
Check out http://www.wolfram.com/products/mathematica/analysis/
You might be interested in the links for Number Theory and Graph Visualizations.
I also found Sage. It appears to be the closest thing to Mathematica that's open source, but I'm not sure how well it handles discrete mathematics.
Maple and Matlab would be a couple of mathematical software packages that may cover part of what you want.
Stanford GraphBase, written primarily by Donald Knuth, is a great package for combinatorial computing. I wouldn't call it an extensive code base, but it has great support for graphs, and a great deal of discrete mathematics can be formulated in terms of graph theory. It's written in CWEB, which is (IMO) a more readable version of C.
EDIT: It's free.
I love Mathematica and used it to prototype ideas during my PhD in computational physics. However, Mathematica tries to be all things to all people, and there are a few downsides:
Because it is a for-profit product, bug fixes sometimes only arrive in the next major release: you pay.
Because it is proprietary, sharing code with non-Mathematica people (the world) is problematic.
New features are often half-baked and break when you try to take them beyond the embedded example.
Its user base (tutorials, advice, external libraries) is less active than, say, Python's.
Multipanel figures are difficult to generate; see the SciDraw library.
That being said, Mathematica's core functionality is amazing for the following reasons:
Its default math functionality is quite robust, allowing quick solutions.
It allows both functional and procedural programming.
One can quickly code & publish in a variety of formats: PDF, interactive website.
A new Discrete Book came out.
Bottom line
Apple users expecting ease of use will like Mathematica for its Apple-like, get-up-and-go feel.
Linux users wanting extensibility will find Mathematica frustrating for its Apple-like, box-welded-shut design.