Convert IPTC taxonomy to Boolean expression

Can one convert the IPTC taxonomy to a Boolean expression? To ease the exchange of news, the International Press Telecommunications Council (IPTC) has developed the NewsML Architecture (NAR). As part of this architecture, specific controlled vocabularies, such as the IPTC NewsCodes, are used to categorize news items.
"The Subject Codes is a thesaurus of 1300 terms used for categorizing the main topics (subjects) of each news item."
As of 2021, there are more than 1400 terms. The IPTC subjectCodes (from 2012) form a tree-like structure with three layers. My assumption is that a group of vocabulary terms defines the category of a news item. My question: is it possible to convert the hierarchy to a Boolean expression like the following?
"armed conflict" OR "armed dispute" OR "civil riots" OR (("armed" OR "weapon") AND ("right-wing" OR "left-wing" OR "extremist" OR "dangerous" OR "confrontation"))
" ?

We at IPTC have looked at this question in the past when we built a rules-based classification engine as a Google News Initiative project. It's called IPTC EXTRA and it allows users to create rules based on boolean logic to classify documents against terms in the IPTC Media Topics controlled vocabulary (or any other CV).
The rule language, Extra Query Language (EQL), is more expressive than simple Boolean AND/OR/NOT operators. We also look at proximity of words and some other characteristics: see the EXTRA User Manual for details.
You can see a set of test rules created for the EXTRA project on our GitHub repository. But please note that this is just a small subset of the rules that would be required to classify any content against the IPTC Media Topics vocabulary. At present, we don't know of a full set of rules for classifying all Media Topics.
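For readers who want to experiment, here is a minimal sketch (plain Python, not EQL) of how a Boolean expression like the one in the question could be evaluated against a document's text. The nested-tuple rule encoding and the matches helper are invented for illustration:

```python
import re

def matches(text, rule):
    # A rule is either a term (string) or a tuple ("AND" | "OR", [subrules]);
    # terms are matched as whole words, case-insensitively.
    if isinstance(rule, str):
        return re.search(r"\b" + re.escape(rule) + r"\b", text, re.IGNORECASE) is not None
    op, subrules = rule
    results = (matches(text, r) for r in subrules)
    return all(results) if op == "AND" else any(results)

# The expression from the question, written as a nested structure:
rule = ("OR", [
    "armed conflict", "armed dispute", "civil riots",
    ("AND", [
        ("OR", ["armed", "weapon"]),
        ("OR", ["right-wing", "left-wing", "extremist", "dangerous", "confrontation"]),
    ]),
])

print(matches("Police clashed with armed extremist protesters today.", rule))  # True
```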

Related

What is the difference between metadata & microdata?

I am quite puzzled by these two terms. I know the basic meaning of metadata is "data about data".
Microdata, as I understand it, makes web pages more accessible to search engines.
But what separates these two terms?
Microdata is the name of a specific technology; metadata is a generic term.
Metadata is, like you explain, data about data. We’d typically want this metadata to be machine-readable/-understandable, so that search engines and other consumers can make use of it.
In the typical sense, metadata is data about the whole document (e.g., who wrote it, when it was published etc.). This goes into the head element (which "represents a collection of metadata for the Document"), where you have to use the meta element and its name attribute (unless the value is a URI, in which case you have to use the link element and its rel attribute), as this is defined to "represent document-level metadata".
Microdata is not involved here.
If the data is about entities described in that document (or the entity which represents the document itself), we typically speak of structured data. An example for such an entity could be a product and its price, manufacturer, weight etc.
Microdata is one of several ways how to provide structured data like that. Others are RDFa, Microformats, and also script elements used as data block (which can contain something like JSON-LD).
Metadata (small m) is a general descriptive term; Microdata (big M) is the name of a particular technology.
Microdata is a particular kind of metadata that can be attached to a particular kind of data (namely HTML) in a particular way (as defined by W3C's Microdata spec).
Metadata: using data to provide information about data. For instance, if you are collecting data about prices of different commodities and you added a small section at the top of the questionnaire to collect information about the name of the enumerator, time of interview, duration of interview, etc., such information is metadata.
Microdata: data from individual observations of interest.

DBpedia.org Ontology versus Schema.org Ontology

First off, I'm trying to define database tables with attributes from Schema.org. For example, I have a table named "JobPosting" that more or less has the same attributes as those defined in http://schema.org/JobPosting (baseSalary, etc.); the same goes for another table named "Organisation".
I have recently come across dbpedia.org (http://dbpedia.org/ontology/Organisation); the schema details seem much richer, but I am confused as to:
Is dbpedia.org ontology an extension of those listed in schema.org?
Are dbpedia.org schemas recognized by major search engines (as those from schema.org are)?
What's the difference between Microdata and RFDs?
I'm going a little stir-crazy trying to find the details... I couldn't find any comparisons of dbpedia.org vs. schema.org.
Schema.org is one of countless vocabularies (resp. ontologies). The DBpedia Ontology is another one. Both vocabularies are independent of each other. Another vocabulary, related to your example, would be The Organization Ontology.
Which search engines recognize which vocabularies is a question without a definite answer. Search engines might recognize vocabularies without documenting it, or they might not recognize some (parts of) vocabularies although their documentation says otherwise. On top of that, all this might change daily.
You asked for the difference between Microdata and RDFs, but it's likely that you mean RDFa in this context. Both are syntaxes which can be used to annotate content with the help of vocabularies. See my answer about differences between Microdata and RDFa.
(RDFS is "just" another vocabulary which can be used to describe vocabularies.)
I will try to answer all your questions, with understandable explanations.
Is dbpedia.org ontology an extension of those listed in schema.org?
No, it's not. There are countless ontologies available online, and any of them can be used combined, or alone, as long as their namespace (e.g., https://www.w3.org/2004/02/skos/ for SKOS or http://rdfs.org/sioc/spec/ for SIOC) is a valid URI.
Are dbpedia.org schemas recognized by major search engines (as those from schema.org)?
DBpedia schemas are as good as any other, and, as stated in the answer to the first question, it really doesn't matter which ontology you decide to use, as long as it best fits your content.
You can even create your own ontology in OWL-RDF.
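To illustrate that last point, here is a minimal sketch of defining a tiny OWL/RDF ontology with the Python rdflib library (the http://example.org namespace and the class/property names are made up, mirroring the JobPosting example from the question):

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL, RDF, RDFS

# A made-up namespace for illustration; any URI you control will do.
EX = Namespace("http://example.org/ontology#")

g = Graph()
g.bind("ex", EX)

# Declare a class and a property with its domain.
g.add((EX.JobPosting, RDF.type, OWL.Class))
g.add((EX.JobPosting, RDFS.label, Literal("Job Posting")))
g.add((EX.baseSalary, RDF.type, OWL.DatatypeProperty))
g.add((EX.baseSalary, RDFS.domain, EX.JobPosting))

print(g.serialize(format="turtle"))
```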
What's the difference between Microdata and RDFa (not RDFS)?
The only difference between these two attribute sets is the way they're written; they both do the same thing.
Other information:
RDFS stands for Resource Description Framework Schema, and it's a language used, together with OWL, to write ontologies
OWL stands for Web Ontology Language, and it was created especially for writing ontologies
RDFa stands for Resource Description Framework in Attributes, and it's an attribute set used to create structured data mapped onto existing HTML code
Microdata is an attribute set used to create structured data mapped onto existing HTML code

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles, I have them in a MySQL database, and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible from a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter common non-descriptive words out like "them", "I", "this", "these", "their" etc.
Once all the common words were filtered out, the only things left would be tag-worthy words.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by spaces. For example, if 1,000 articles contain the name "John Doe", and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the first and last name together.
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e., sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
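As a rough sketch of those suggestions (the example sentence is made up, and the NLTK data package names may vary slightly between NLTK versions):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # common-word list

text = "John Doe met John Hanson. They tagged articles; the tags and the tagging helped."

# Proper tokenisation instead of splitting on whitespace.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Drop stopwords and stem, so "tags"/"tagged"/"tagging" all collapse to "tag".
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens if t not in stop]
print(nltk.FreqDist(stems).most_common(3))

# Bigram counts keep multi-word names such as "john doe" together.
print(nltk.FreqDist(nltk.bigrams(tokens)).most_common(3))
```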
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then try how the algorithm tags the remaining set of articles to see how well it works.
You should use a metric such as tf-idf to get the tags out:
Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
Various implementations of tf-idf are available; for Java and .NET, there's Lucene, for Python there's scikits.learn.
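A minimal sketch of the recipe above with scikit-learn (the modern descendant of scikits.learn); the toy documents and the choice of k are made up, and note that scikit-learn's TfidfVectorizer uses a slightly different tf-idf formula than the one given above, though the ranking idea is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the mavericks won the nba finals",
    "the senate passed the budget bill",
    "the mavericks traded a star before the nba draft",
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)          # rows = documents, columns = terms
terms = vec.get_feature_names_out()

# For each document, take the k terms with the highest tf-idf score as tags.
k = 2
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    tags = [terms[j] for j in row.argsort()[::-1][:k]]
    print(f"{doc!r} -> {tags}")
```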
If you want to do better than this, use language models. That requires some knowledge of probability theory.
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa:
http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g., consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not.
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
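For a concrete feel of both bullets, here is a minimal LDA sketch using scikit-learn's LatentDirichletAllocation (an arbitrary implementation choice; the toy corpus and the number of topics are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the mavericks won the nba finals last night",
    "the gop presidential race tightened again today",
    "the economy may decide the presidential election",
    "the nba draft follows the finals every summer",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)

# Two latent topics; fit_transform returns *fractional* topic membership per
# document (rows sum to 1), illustrating the first bullet above.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics.round(2))

# The learned classes are unnamed word distributions (topic 0, topic 1, ...),
# which is exactly the caveat in the second bullet above.
```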
Your approach seems sensible and there are two ways you can improve the tagging.
Use a known list of keywords/phrases for your tagging, and if the count of instances of a word/phrase is greater than a threshold (probably based on the length of the article), then include the tag.
Use a part-of-speech tagging algorithm to help reduce the article to a sensible set of phrases, and use a sensible method to extract tags from it. Once you have the articles reduced using such an algorithm, you will be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.
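A rough sketch of method 2 using NLTK's part-of-speech tagger and a simple regular-expression chunker (the grammar and the example sentence are made up; real articles will need a more careful grammar):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The city council approved a new stadium for the Dallas Mavericks."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Chunk simple noun phrases: optional determiner, adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees() if subtree.label() == "NP"]
print(phrases)  # candidate phrases for the keyword/phrase list in method 1
```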
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo sites and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
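For illustration, a hedged sketch of the percolator approach with the Python Elasticsearch client (this assumes a local cluster and the older 7.x-style client calls; the "tags" index name and the stored queries are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# An index whose documents are *stored queries*, one per tag.
es.indices.create(index="tags", body={"mappings": {"properties": {
    "query": {"type": "percolator"},
    "body": {"type": "text"},
}}})
es.index(index="tags", id="sports",
         body={"query": {"match": {"body": "nba finals mavericks"}}})
es.index(index="tags", id="politics",
         body={"query": {"match": {"body": "senate election budget"}}})
es.indices.refresh(index="tags")

# Percolate a new article: which stored tag queries match it?
res = es.search(index="tags", body={"query": {"percolate": {
    "field": "query",
    "document": {"body": "The Mavericks won the NBA finals."},
}}})
print([hit["_id"] for hit in res["hits"]["hits"]])  # e.g. ['sports']
```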
Are you talking about named-entity recognition? If so, Anupam Jain is right: it's a research problem tackled with deep learning and CRFs. As of 2017, work on named-entity recognition focuses on semi-supervised learning techniques.
The link below is a related NER paper:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below is about key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf
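As a rough baseline (far simpler than the papers above), NLTK ships a built-in named-entity chunker; this sketch shows how NER addresses the "John Doe"/"John Hanson" problem from the original question (data package names may differ slightly between NLTK versions):

```python
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = "John Doe met John Hanson in Dallas yesterday."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# Collect PERSON entities as full names, not bare first names.
people = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees() if subtree.label() == "PERSON"]
print(people)  # should yield full names like 'John Doe', not just 'John'
```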

Content Repository, Document Repository, what's the difference if any?

What is the difference between CMSs and DMSs? Both store data and give access to it; where do they differ? Can Apache Jackrabbit be used in place of Alfresco?
I would differentiate the two based on the mutability of the data under management:
In a Document Management System, the Documents are immutable (and often opaque) blobs created by external applications
A Content Management system contains mutable data (the content) and provides an interface to mutate said content.
Of course, DMSs have evolved to break this rule - for example, by adding document properties to a Word Document... however, people seem comfortable with calling this "metadata" and therefore it can break all the rules.
Given the immutable nature of the data, a DMS can make assumptions that a CMS cannot - given these assumptions, I would be careful stating (as per Wolfwyrd's comment) that DMS is a subset of CMS.
Content management refers to a system that stores content of any type. It tends to involve a workflow (i.e., creators, editors, publishers). Content management also often deals with fragments of data applied to templates. For example, a template for a page may be created with an editable body, subtitle, title, etc.
Document management refers to a system that stores electronic documents or files of any type. Document management can be considered a subset of content management - a more specialised form of content management as it approaches the management only of electronic files, not necessarily the potential to store fragments of content.
Jackrabbit and Alfresco both supply content management services, so they can also be used to support document management by the simple fact that one is a subset of the other. So in this case, it's more down to which provides the features you need.

Tools for getting intent from Twitter statuses? [closed]

I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant Tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there's some tool I can use to at least narrow it down a bit?
Alternatively, I could just use hashtags. But that requires more work on behalf of the users. I'm not super familiar with Twitter - do most people use hashtags (even for smaller scale issues), or would relying on them cut off a large segment of data?
I'd also be interested in grabbing Facebook statuses (with permission from the poster, of course), and hashtag use is pretty rare on Facebook.
I could use simple keyword search to crudely narrow the field, but that's more likely to require human intervention to determine which tweets should actually be posted alongside the content.
Ideas? Has this been done before?
There are two straightforward ways to go about finding tweets relevant to your content.
The first would be to treat this as a supervised document classification task, whereby you would train a classifier to annotate tweets with a certain predetermined set of topic labels. You could then use the labels to select tweets that are appropriate for whatever content you'll be augmenting. If you don't like using a predetermined set of topics, another approach would be to simply score tweets according to their semantic overlap with your content. You could then display the top n tweets with the most semantic overlap.
Supervised Document Classification
Using supervised document classification would require that you have a training set of tweets labeled with the set of topics you'll be using. e.g.,
tweet: NBA finals rocked label: sports
tweet: Googlers now allowed to use Ruby! label: programming
tweet: eating lunch label: other
If you want to collect training data without having to manually label the tweets with topics, you could use hashtags to assign topic labels to the tweets. The hashtags could be identical with the topic labels, or you could write rules to map tweets with certain hashtags to the desired label. For example, tweets tagged either #NFL or #NBA could all be assigned a label of sports.
Once you have the tweets labeled by topic, you can use any number of existing software packages to train a classifier that assigns labels to new tweets (see the sketch after this list). A few available packages include:
NLTK (Python) - see Chapter 6 in the NLTK book on Learning to Classify Text
Classifier4J (Java)
nBayes (C#)
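For a concrete feel of the supervised route, here is a minimal sketch using scikit-learn (swapped in for the packages listed above purely for brevity; the tweets and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny hand-labeled set; in practice the labels could come from hashtags.
tweets = ["NBA finals rocked", "the mavericks looked unstoppable",
          "Googlers now allowed to use Ruby!", "shipping a new python release",
          "eating lunch", "stuck in traffic again"]
labels = ["sports", "sports", "programming", "programming", "other", "other"]

# Vectorize the tweets and train a Naive Bayes classifier in one pipeline.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(tweets, labels)

print(clf.predict(["watching the finals tonight", "learning ruby on rails"]))
```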
Semantic Overlap
Finding tweets using their semantic overlap with your content avoids the need for a labeled training set. The simplest way to estimate the semantic overlap between your content and the tweets that you're scoring is to use a vector space model. To do this, represent your document and each tweet as a vector with each dimension in the vector corresponding to a word. The value assigned to each vector position then represents how important that word is to the meaning of the document. One way to estimate this would be to simply use the number of times the word occurs in the document. However, you'll likely get better results by using something like TF/IDF, which up-weights rare terms and down-weights more common ones.
Once you've represented your content and the tweets as vectors, you can score the tweets by their semantic similarity to your content by taking the cosine similarity of the vector for your content and the vector for each tweet.
There's no need to code any of this yourself. You can just use a package like Classifier4J, which includes a VectorClassifier class that scores document similarity using a vector space model.
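If Java isn't a constraint, the same vector-space scoring is only a few lines with scikit-learn (an alternative to Classifier4J's VectorClassifier; the article and tweets here are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

content = "The Mavericks beat the Heat to win the NBA championship."
tweets = ["NBA finals rocked",
          "eating lunch",
          "mavericks bring the championship home"]

vec = TfidfVectorizer()
X = vec.fit_transform([content] + tweets)  # row 0 is the article

# Score each tweet by cosine similarity to the article and rank descending.
scores = cosine_similarity(X[:1], X[1:]).ravel()
for score, tweet in sorted(zip(scores, tweets), reverse=True):
    print(f"{score:.2f}  {tweet}")
```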
Better Semantic Overlap
One problem you might run into with vector space models that use one term per dimension is that they don't do a good job of handling different words that mean roughly the same thing. For example, such a model would say that there is no similarity between The small automobile and A little car.
There are more sophisticated modeling frameworks such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) that can be used to construct more abstract representations of the documents being compared to each other. Such models can be thought of as scoring documents not based on simple word overlap, but rather in terms of overlap in the underlying meaning of the words.
In terms of software, the package Semantic Vectors provides a scalable LSA-like framework for document similarity. For LDA, you could use David Blei's implementation or the Stanford Topic Modeling Toolbox.
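As a toy illustration of the LSA idea (using scikit-learn's TruncatedSVD rather than the packages above; the four documents are made up), note that LSA needs co-occurrence evidence to link synonyms, hence the "bridge" document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the small automobile",
        "a little car",
        "my car is a small automobile",  # bridges "car" and "automobile"
        "the senate budget vote"]

# Project tf-idf vectors into a low-dimensional "concept" space (LSA).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# In the reduced space, "a little car" can score as similar to
# "the small automobile" even though they share no surface terms.
print(cosine_similarity(Z[:1], Z[1:]).round(2))
```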
Great question. I think for twitter your best bet is to use hashtags because otherwise you need to create algorithms or find existing algorithms that do language analysis and improve over time based on user input/feedback.
For Facebook, you can kind of do what Bing implemented a while back, as I covered in this article:
http://www.socialtimes.com/2010/06/bing-adds-facebook-and-twitter-features-steps-up-social-services/
I wrote: For example, a search for “NBA Finals” will return fan-page content from Facebook, including posts from a local TV station. So if you're trying to augment NBA-related content, you could do a similar search to the one Bing provides - searching publicly available fan-page content the way spiders index it for search engines. I'm not a developer, so I'm not sure of the intricacies, but I know it can be done.
You can also display popular shared links: content from users who are publishing to ‘everyone’ will be aggregated for all non-fan-page content. I'm not sure if this is limited to being published to 'everyone' and/or being 'popular', although I would assume so - but you can double-check that.
Hope this helps
The problem with NLP is not the algorithm (although that is an issue); the problem is the resources. There are some open-source shallow parsing tools (that's all you would need to get intent) that you could use, but parsing thousands or millions of tweets would cost a fortune in computer time.
On the other hand, as you said, not all tweets have hashtags, and there is no promise they will be relevant.
Maybe you can use a mixture of keyword search to filter down to a few possibilities (those with the highest keyword density) and then use deeper analysis to pick the top one or two. This would keep computer resource usage at a minimum, and you should be able to get relevant tweets.