Expressing complex relationships in ontologies - simulation

I'm from a medical background and new to the concept of ontologies. I'm using Protégé to create an ontology for a resuscitation simulator. I have no problem expressing that "Blood pressure", "Oxygen saturations" and "Left ventricular ejection fraction" are disjoint subclasses of "Patient", and that "Intravenous fluid" is a subclass of "Medical interventions". However, I also want to express that intravenous fluids will increase blood pressure and decrease oxygen saturation as a function of left ventricular ejection fraction. I have no problem putting this function into code, but how do I best express it in an ontology so that non-medics can see this relationship? Or is this simply the point where an ontology ends and computer programming begins?
Thanks in advance for any help.

If you wish to express a mathematical function that a reasoner should be able to calculate as part of reasoning, I don't believe OWL currently has any support for this. I believe there are proposals for such an extension, but I'm not sure of their status; see for example http://ceur-ws.org/Vol-921/openmath-01.pdf
For query languages such as SPARQL, it is possible to define built-in functions that could produce those values as part of query answering; however, this crosses your "out of ontology and into programming" threshold.
There's also SWRL: it has similar built-in support and can be used inside Protégé, but it has different restrictions on which individuals can be involved and what assertions can be created.
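To make the SPARQL option concrete, here is a minimal sketch in Python using rdflib: the ontology only states plain data properties, and the arithmetic lives in the query. The property names (ex:hasLVEF, ex:hasBaselineBP) and the formula itself are made-up placeholders, not a clinical model.

```python
# Computing a derived value at query time with SPARQL BIND, via rdflib.
# The vocabulary and the formula below are hypothetical placeholders.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/sim#")
g = Graph()

patient = EX.patient1
g.add((patient, RDF.type, EX.Patient))
g.add((patient, EX.hasLVEF, Literal(0.55)))        # ejection fraction
g.add((patient, EX.hasBaselineBP, Literal(80.0)))  # baseline blood pressure

# The "function" is expressed in the query, not in the OWL axioms:
q = """
PREFIX ex: <http://example.org/sim#>
SELECT ?p ?newBP WHERE {
    ?p ex:hasLVEF ?lvef ;
       ex:hasBaselineBP ?bp .
    BIND (?bp + (10.0 * ?lvef) AS ?newBP)   # toy fluid-response formula
}
"""
for row in g.query(q):
    print(row.p, float(row.newBP))
```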

Related

What is currently the best way to add a custom dictionary to a neural machine translator that uses the transformer architecture?

It's common to add a custom dictionary to a machine translator to ensure that terminology from a specific domain is correctly translated. For example, the term "server" should be translated differently when the document is about data centers versus when the document is about restaurants.
With a transformer model, this is not very obvious to do, since words are not aligned 1:1. I've seen a couple of papers on this topic, but I'm not sure which would be the best one to use. What are the best practices for this problem?
I am afraid you cannot easily do that. You cannot easily add new words to the vocabulary because you don't know what embeddings they would get during training. You can try to remove some words, or alternatively you can manually change the bias in the final softmax layer to prevent some words from appearing in the translation. Anything else would be pretty difficult to do.
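For the softmax-bias trick, here is a rough PyTorch sketch, assuming you can get at the pre-softmax logits of your model; the function and variable names are hypothetical:

```python
# Push selected vocabulary ids to -inf so they can never be emitted.
import torch

def mask_banned_words(logits: torch.Tensor, banned_ids: list) -> torch.Tensor:
    """logits: (batch, vocab_size) raw scores before the softmax."""
    logits = logits.clone()
    logits[:, banned_ids] = float("-inf")  # probability becomes exactly 0
    return logits

# usage: probs = torch.softmax(mask_banned_words(logits, [42, 7]), dim=-1)
```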
What you want to do is called domain adaptation. To get an idea of how domain adaptation is usually done, you can have a look at a survey paper.
The most commonly used approaches are probably model fine-tuning or ensembling with a language model. If you have parallel data in your domain, you can try to fine-tune your model on it (with simple SGD and a small learning rate).
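A minimal fine-tuning loop might look like this in PyTorch; `model` and `domain_batches` are placeholders for your trained translation model and your in-domain parallel data:

```python
import torch

# Assumes `model` returns the training loss for a (src, tgt) batch and
# `domain_batches` yields tensors from your in-domain parallel corpus.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # small learning rate

model.train()
for src, tgt in domain_batches:
    optimizer.zero_grad()
    loss = model(src, tgt)
    loss.backward()
    optimizer.step()
```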
If you only have monolingual data in the target language, you can train a language model on that data. During decoding, you can mix the probabilities from the domain-specific language model and the translation model. Unfortunately, I don't know of any tool that does this out of the box.
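If you want to roll this yourself, the mixing (often called shallow fusion) is only a few lines at each decoding step; `tm_log_probs` and `lm_log_probs` are assumed to be vocabulary-sized tensors from the two models:

```python
import torch

def fused_scores(tm_log_probs, lm_log_probs, lam=0.3):
    # Interpolate in log space: score = log P_tm + lam * log P_lm;
    # lam is a mixing weight you would tune on held-out data.
    return tm_log_probs + lam * lm_log_probs

# At each beam-search step, pick the next token from the fused scores:
# next_token = fused_scores(tm_lp, lm_lp).argmax(dim=-1)
```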

Attention Text Generation in Character-by-Character fashion

I have been searching the web for a couple of days for any text generation model that uses only attention mechanisms.
The Transformer architecture that made waves in the context of seq-to-seq models is actually based solely on attention mechanisms, but it is mainly designed and used for translation or chatbot tasks, so it doesn't fit the purpose directly; the principle, however, does.
My question is:
Does anyone know of, or has anyone heard of, a text generation model based solely on attention, without any recurrence?
Thanks a lot!
P.S. I'm familiar with PyTorch.
Building a character-level self-attentive model is a challenging task. Character-level models are usually based on RNNs. Whereas in a word/subword model it is clear from the beginning which units carry meaning (and are therefore the units the attention mechanism can attend to), a character-level model needs to learn word meaning in its later layers. This makes it quite difficult for the model to learn.
Text generation models are nothing more than conditional language models. Google AI recently published a paper on a Transformer character-level language model, but it is the only such work I know of.
Anyway, you should consider either using subword units (such as BPE or SentencePiece), or, if you really need to go character-level, using RNNs instead.
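That said, if you want to experiment, a character-by-character sampling loop around an attention-only model is short to write. Here is a toy, untrained sketch in PyTorch using the built-in TransformerEncoder as a causal language model (positional encodings omitted for brevity):

```python
import torch
import torch.nn as nn

class CharTransformerLM(nn.Module):
    def __init__(self, vocab_size=128, d_model=64, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                       # ids: (seq_len,)
        x = self.embed(ids).unsqueeze(1)          # (seq_len, 1, d_model)
        n = ids.size(0)
        causal = torch.triu(torch.full((n, n), float("-inf")), 1)
        h = self.encoder(x, mask=causal)          # attention only, no recurrence
        return self.out(h.squeeze(1))             # (seq_len, vocab_size)

model = CharTransformerLM()
ids = torch.tensor([ord(c) for c in "hello "])    # prompt as character ids
with torch.no_grad():
    for _ in range(20):                           # generate 20 characters
        next_id = model(ids)[-1].argmax()         # greedy; untrained = noise
        ids = torch.cat([ids, next_id.unsqueeze(0)])
print("".join(chr(i) for i in ids.tolist()))
```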

Translating a State Transition System to LTL property formulae

In the context of bounded model checking, one describes the system as a State Transition System and the properties that need to be checked.
When one needs to provide multiple system descriptions and properties to a model-checking tool, it can become tedious to write the properties by hand. In my case, I use some temporal logic.
How does one automate the process of parsing the system description (ideally a set of initial states, transitions, and a set of states) and deriving verifiable properties from it?
For example, consider the Microwave Example given here. Given such a system description, how can I arrive at the specifications in an efficient manner?
There is no open-source tool that I know of that can do this. Any approaches, in terms of ideas or theories, are welcome.
You can't automatically derive LTL formulae from automata as you suggest, because automata are more expressive than LTL formulae.
That leaves you with two main options: 1. find a verification tool that accepts specifications expressed directly as automata (I'm not sure which ones do, but it is worth checking SPIN and NuSMV for this feature), or 2. use a meta-specification language that makes writing specifications easier, for example SALT (https://www.isp.uni-luebeck.de/salt, doi: 10.1007/11901433_41) or IEEE 1850/PSL. While PSL is more of a language definition for tool implementors, SALT already offers a web front-end that translates your input straight into LTL.
(By the way, I find your approach methodologically questionable: you're not supposed to derive formulae from your model but from your initial system description, as it is this very model that you're going to verify. But I am not 100% sure I understood this point in your question correctly.)
I think the properties of a system, e.g. the microwave system, come from technical and common-sense expectations and requirements, not from the model. E.g., a microwave is supposed to cook the food, but it is not supposed to cook with the door open. Nevertheless, a repository of typical LTL patterns can be useful for defining properties; it also lists properties alongside the more familiar regex and automata formulations.
If you are certain you still want to translate automata to LTL automatically, check
https://mathoverflow.net/questions/96963/translate-a-buchi-automaton-to-ltl
Kansas Specification Property Repository
http://patterns.projects.cs.ksu.edu/documentation/patterns.shtml

Evaluating an NLP classifier with annotated data

If we want to evaluate a classifier for an NLP application on data annotated by two annotators who do not completely agree on the annotation, what is the procedure?
That is, should we compare the classifier output with just the portion of the data that the annotators agreed on? Or with just one annotator's data? Or with both of them separately, and then compute the average?
Taking the majority vote between annotators is common. Throwing out disagreements is also done.
Here's a blog post on the subject:
Suppose we have a bunch of annotators and we don’t have perfect agreement on items. What do we do? Well, in practice, machine learning evals tend to either (1) throw away the examples without agreement (e.g., the RTE evals, some biocreative named entity evals, etc.), or (2) go with the majority label (everything else I know of). Either way, we are throwing away a huge amount of information by reducing the label to artificial certainty. You can see this pretty easily with simulations, and Raykar et al. showed it with real data.
What's right for you depends heavily on your data and how the annotators disagree. For starters, why not evaluate on only the items they agree on, and then separately compare the model's output to the items they didn't agree on?
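A small sketch of both strategies in Python; `preds`, `ann1` and `ann2` are placeholder label lists of equal length:

```python
# Evaluate on the items the annotators agree on, and inspect the rest.
preds = ["pos", "neg", "pos", "neg"]
ann1  = ["pos", "neg", "neg", "neg"]
ann2  = ["pos", "neg", "pos", "pos"]

triples  = list(zip(preds, ann1, ann2))
agreed   = [(p, a) for p, a, b in triples if a == b]
disputed = [t for t in triples if t[1] != t[2]]

acc = sum(p == a for p, a in agreed) / len(agreed)
print(f"accuracy on agreed items: {acc:.2f}")
print(f"disputed items (compare separately): {disputed}")
```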

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles; I have them in a MySQL database and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter out common non-descriptive words like "them", "I", "this", "these", "their", etc.
Once all the common words are filtered out, the only thing left is tag-worthy words.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by spaces. For example, if 1,000 articles contain the name "John Doe", and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the full names.
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then try how the algorithm tags the remaining set of articles to see how well it works.
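Here is a minimal sketch of those suggestions with NLTK (it assumes the "punkt" and "stopwords" data packages have been fetched via nltk.download):

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "John Doe met John Hanson. John Doe spoke about the election."
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
content = [t for t in tokens if t not in set(stopwords.words("english"))]

stemmer = PorterStemmer()                       # collapse word variants
print(Counter(stemmer.stem(t) for t in content).most_common(5))

# Bigrams keep multi-word names like "john doe" together
print(Counter(zip(content, content[1:])).most_common(3))
```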
You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET, there's Lucene, for Python there's scikits.learn.
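For instance, with scikit-learn (note it uses the standard tf * idf weighting, a slightly different formula from the one above, but the idea is the same):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the mavericks won the basketball game",
    "the senate debated the budget bill",
]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)                # (n_docs, n_terms), sparse
terms = np.array(vec.get_feature_names_out())

k = 3
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    tags = terms[row.argsort()[::-1][:k]]      # top-k terms by tf-idf score
    print(doc, "->", list(tags))
```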
If you want to do better than this, use language models. That requires some knowledge of probability theory.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), or 'LDA'. I don't have personal experience with any of the available LDA implementations, so I can't recommend a specific system (perhaps others more knowledgeable than I am can recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership? E.g., consider an article discussing the economic outlook and its potential effect on the presidential race: can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not.
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
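To make those two points concrete, here is a minimal LDA sketch with scikit-learn (one of several implementations; gensim is another common choice). The topics come out as unlabeled word distributions, which illustrates the caveat about uninformative class labels, and the document-topic matrix shows the fractional membership:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the mavericks won the basketball game last night",
    "the senate debated the budget bill on tuesday",
    "the coach praised the team after the playoff win",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# doc_topic[i, k] = fraction of document i assigned to (unlabeled) topic k
doc_topic = lda.transform(counts)
print(doc_topic.round(2))
```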
Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging, and if the count of instances of a word/phrase is greater than a threshold (probably based on the length of the article), then include the tag.
Use a part-of-speech tagging algorithm to help reduce the article to a sensible set of phrases, and use a sensible method to extract tags from them. Once you have reduced the articles using such an algorithm, you will be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.
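A rough sketch of method 2 with NLTK's part-of-speech tagger (assumes the "averaged_perceptron_tagger" data package is downloaded; the threshold is a placeholder you would tune):

```python
import nltk
from collections import Counter

text = "The president discussed the economy and the presidential race."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Keep nouns and proper nouns (tags NN, NNS, NNP, NNPS) as tag candidates
candidates = [w.lower() for w, tag in tagged if tag.startswith("NN")]

threshold = 1  # in practice, scale with article length as suggested above
print([w for w, c in Counter(candidates).items() if c >= threshold])
```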
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and their variants
In the blog article above, I list the latest research papers to illustrate the solutions. Some of them even include a demo site and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have a pre-defined set of tags, you can use the Elasticsearch Percolator API, as this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
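A rough sketch of that approach with the elasticsearch-py client; treat this as an outline, since the exact API differs between Elasticsearch versions, and the index and tag names here are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# One document per tag: each stores the query that defines the tag.
es.indices.create(index="tags", body={"mappings": {"properties": {
    "query": {"type": "percolator"},
    "body": {"type": "text"},
}}})
es.index(index="tags", id="sports",
         body={"query": {"match": {"body": "basketball"}}})

# Percolate an article to find which stored tag queries match it.
res = es.search(index="tags", body={"query": {"percolate": {
    "field": "query",
    "document": {"body": "the mavericks won the basketball game"},
}}})
print([hit["_id"] for hit in res["hits"]["hits"]])
```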
Are you talking about named-entity recognition? If so, Anupam Jain is right: it's a research problem, typically approached with deep learning and CRFs. In 2017, work on named-entity recognition is focused on semi-supervised learning techniques.
The link below is a related NER paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below is about key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf