What is the difference between metadata & microdata?

I am quite puzzled with these two terminologies. I know the basic meaning of metadata is "data about the data".
Microdata means the webpages are now more accessible to the search engines.
But what separates these two terms?

Microdata is the name of a specific technology, metadata is a generic term.
Metadata is, like you explain, data about data. We’d typically want this metadata to be machine-readable/-understandable, so that search engines and other consumers can make use of it.
In the typical sense, metadata is data about the whole document (e.g., who wrote it, when it was published etc.). This goes into the head element (which "represents a collection of metadata for the Document"), where you have to use the meta element and its name attribute (unless the value is a URI, in which case you have to use the link element and its rel attribute), as this is defined to "represent document-level metadata".
Microdata is not involved here.
If the data is about entities described in that document (or the entity which represents the document itself), we typically speak of structured data. An example for such an entity could be a product and its price, manufacturer, weight etc.
Microdata is one of several ways to provide structured data like that. Others are RDFa, Microformats, and script elements used as data blocks (which can contain something like JSON-LD).
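To make the "script element as data block" option concrete, here is a minimal sketch in Python that builds a JSON-LD block for the product example mentioned above. The function name and the sample values are made up for illustration; the property names (name, manufacturer, offers, price, priceCurrency) are taken from the Schema.org vocabulary.

```python
import json


def product_json_ld(name, price, currency, manufacturer):
    """Return a <script> data block holding JSON-LD for one product.

    Assumption: the values do not contain "</script>", so no extra
    escaping is done in this sketch.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "manufacturer": {"@type": "Organization", "name": manufacturer},
        "offers": {
            "@type": "Offer",
            "price": price,
            "priceCurrency": currency,
        },
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical sample values, only to show the shape of the output.
block = product_json_ld("Example Widget", "19.99", "EUR", "Example Corp")
print(block)
```

The same entity could be expressed with Microdata or RDFa attributes on the visible HTML instead; only the syntax differs.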

Metadata (small m) is a general descriptive term, Microdata (big M) is the name of a particular technology.
Microdata is a particular kind of metadata that can be attached to a particular kind of data (namely HTML) in a particular way (as defined by W3C's Microdata spec).

Metadata: using data to provide information about data. For instance, if you are collecting data about prices of different commodities and you add a small section at the top of the questionnaire to collect the name of the enumerator, the time of the interview, the duration of the interview etc., that information is metadata.
Microdata: data from individual observations of interest.


REST API URL pattern for path parameters

I am building a Backbone app which displays interactive facsimiles of diagrams from certain technical manuals. Each manual has a number of diagram types (say A-Z), distributed over its pages. Each diagram might occur more than once across the pages, and sometimes a single page might contain more than one instance of a given diagram type.
I have a Django backend serving a REST API which my frontend consumes. What I have been struggling with is the design of the url for the request. I have tried several patterns, none of which satisfy me. My Django model looks something like this:
class Diagram(models.Model):
    type = models.CharField(max_length=1)
    page = models.IntegerField(default=1)
    order = models.IntegerField(default=1)
    data = JSONField(default='{}')
The order field relates to a situation where there is more than one instance of the given diagram type on a page. The table for this model is read-only, so I am just doing simple GETs. Users only view one diagram instance at a time. A diagram is selected by type, page, and (where relevant) order. My initial url design was this:
example.org/api/diagrams/A/pages/1/order/2/
Although there is a plurality of diagrams, the diagrams param suggests a collection - but the diagrams don't 'contain' pages. Same with the pages param. Obviously order can only be singular. So perhaps:
example.org/api/diagrams/type=A/page=1/order=2/
Or perhaps just go with query params:
example.org/api/diagrams/?type=A&page=1&order=2
Personally I prefer path parameters, but the main complication of this is that the order param is actually redundant most of the time - there are only a small number of cases of repetition of a diagram on a page (currently I default order to '1', both on the backend and in the request). So perhaps a combination of both path and query parameters:
example.org/api/diagrams/A/page/1/?order=2
Is this a good pattern? Are there other alternatives I could consider?
Edit: After some additional reading (notably the URI Standard) I think the answer is that a path parameter design is suited for a hierarchical structure... which seems intuitive. But I don't have that, so the right candidate is the pure query parameter design. Right?
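For what it's worth, the pure query parameter design can be sketched on the client side like this. It is only an illustration: the https scheme, the function names, and the choice to omit order when it equals the default of 1 are my assumptions, not something the backend requires.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; the scheme is an assumption.
BASE = "https://example.org/api/diagrams/"


def diagram_url(diagram_type, page, order=1):
    """Build the query-parameter URL, leaving out order when it is the default."""
    params = {"type": diagram_type, "page": page}
    if order != 1:
        params["order"] = order
    return BASE + "?" + urlencode(params)


def parse_diagram_url(url):
    """Read the selection back out of a URL, defaulting order to 1."""
    query = parse_qs(urlsplit(url).query)
    return {
        "type": query["type"][0],
        "page": int(query["page"][0]),
        "order": int(query.get("order", ["1"])[0]),
    }


print(diagram_url("A", 1))
print(diagram_url("A", 1, order=2))
```

Because order is optional in the query string, the common case stays short and the rare repeated-diagram case needs no separate URL shape.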
Could I suggest a different approach? I know that may not be the answer you are looking for, but instead of trying to publish the exact object model from your code, think about what "kind" of resource the client needs to see and what it is related to.
For example, if the client needs to "browse" the diagrams, you could have 2 media-types: one for listing all diagrams, and one for a single diagram itself. The URIs could be:
/api/diagrams/ <-- list of all diagrams with titles
/api/diagrams/1 <-- a single diagram
/api/diagrams/2
...
If the client needs to browse per manual per page, then you can offer those too with additional media-types representing a manual (list of pages), and the pages with links to the diagrams that are on it. For example:
/api/manuals <-- list of all manuals
/api/manuals/1 <-- list of pages, maybe a list of all diagrams in manual
/api/manuals/1/page2 <-- list of diagrams on page2
The same for your case about browsing per order and diagram type.
If you only need a "search" API, and not a "browse" API, then the proper solution would be to create a "form" in which you can submit the information (order, type, page, etc.). So that would be 2 media-types, one for the search description, and probably one for diagrams.
The point is, URI should not be fixed if you are trying to create a REST API. The server should provide the URIs to the client (except for the start URI, the search page for example).
This has several advantages, one being that you can control your URIs on the server. You don't have to be RESTful, though, if you don't want to; but even then the URI itself does not really matter if you control the client anyway. None of your approaches is wrong by any objective measure.
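A rough sketch of what "the server provides the URIs" could look like, in Python. The data, the function name, and the key names (self, pages, href, diagrams) are invented for illustration; only the URI shapes follow the examples above.

```python
import json

# Hypothetical in-memory data: page number -> ids of the diagrams on that page.
MANUAL_PAGES = {1: [1, 2], 2: [3]}


def manual_representation(manual_id, pages):
    """Build a manual representation whose links the client simply follows."""
    base = f"/api/manuals/{manual_id}"
    return {
        "self": base,
        "pages": [
            {
                "href": f"{base}/page{number}",
                "diagrams": [f"/api/diagrams/{d}" for d in diagram_ids],
            }
            for number, diagram_ids in sorted(pages.items())
        ],
    }


print(json.dumps(manual_representation(1, MANUAL_PAGES), indent=2))
```

The client never assembles a URI itself here; it reads href values out of the response, so the server is free to change the URI layout later.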
Sorry if that does not help. :)

DBpedia.org Ontology versus Schema.org Ontology

First off, I'm trying to define database tables with attributes from Schema.org. For example, I have a table named "JobPosting" that has more or less the same attributes as those defined in http://schema.org/JobPosting (baseSalary, etc.); the same goes for another table named "Organisation".
I have recently come across dbpedia.org (http://dbpedia.org/ontology/Organisation). The schema details seem to be much richer, but I am confused as to:
Is dbpedia.org ontology an extension of those listed in schema.org?
Are dbpedia.org schemas recognized by major search engines (as those from schema.org)
What's the difference between Microdata and RDFs?
I'm going a little stir crazy trying to find the details... I couldn't find any comparisons of dbpedia.org vs schema.org.
Schema.org is one of countless vocabularies (resp. ontologies). The DBpedia Ontology is another one. Both vocabularies are independent of each other. Another vocabulary, related to your example, would be The Organization Ontology.
Which search engines recognize which vocabularies is a question without a definite answer. Search engines might recognize vocabularies without documenting it, or they might not recognize some (parts of) vocabularies although their documentation says otherwise. On top of that, all this might change daily.
You asked for the difference between Microdata and RDFs, but it’s likely that you mean RDFa in this context. Both are syntaxes which can be used to annotate content with the help of vocabularies. See my answer about differences between Microdata and RDFa.
(RDFS is "just" another vocabulary which can be used to describe vocabularies.)
I will try to answer all your questions, with understandable explanations.
Is dbpedia.org ontology an extension of those listed in schema.org?
No, it's not. There are countless ontologies available online, and any of them can be used alone or in combination, as long as their namespace (e.g. https://www.w3.org/2004/02/skos/ for SKOS or http://rdfs.org/sioc/spec/ for SIOC) is a valid URI.
Are dbpedia.org schemas recognized by major search engines (as those from schema.org)?
dbpedia schemas are as good as any other, and, as stated in the answer for the first question, it really doesn't matter which ontology you decide to use, as long as it best fits your content.
You can even create your own ontology in OWL-RDF.
What's the difference between Microdata and RDFa (not RDFs)?
The only difference between these 2 attribute sets is the way they're written, while they both do the same thing.
Other information:
RDFS stands for Resource Description Framework Schema (RDF Schema), and it's a vocabulary used to write ontologies, together with OWL
OWL stands for Web Ontology Language, and it was created especially for writing ontologies
RDFa stands for Resource Description Framework in Attributes, and it's an attribute set used to create structured data mapped onto the existing HTML code
Microdata is an attribute set used to create structured data mapped onto the existing HTML code

Strategies for embedding change tracking in a structured data document format

When designing a specialized structured-data document format (perhaps upon XML): part of the requirements for this document format are that it accommodates, in a metadata section, a history of meaningful (app-level) changes to the structured data at a field level.
At minimum, useful tracked information would be:
an author identifier
time stamp
type of change
what it was changed from
Both data items and any lists of such data items are to be tracked meaningfully and efficiently. The data schema should be separable from, and unaware of, the metadata tracking it - although facilitating annotations such as node identifiers could be required. A trusted application could be required to enforce the tracking; however, it would be a benefit to be able to calculate the "deltas" at intervals by comparing data-sections between versions rather than requiring the editor to track each change live.
"Meaningful" tracking may involve the metadata schema treating higher-level data changes atomically - such as an update to a group of fields which is treated at the application level as one data-point.
For character-by-character or byte-by-byte data, diff/patch type algorithms work. Structured data (to be treated as structured) seems to me to require more complex solutions.
I realize that I don't have very well-defined requirements - the purpose of my question here is to find out where these problems have been considered with more elegance.
What strategies exist for embedding change tracking in a structured data document format?
Thanks!
You might be interested in XML patch formats (e.g. as described by RFC 5261).
You could, for example, build a list of such patches embedded at the top of your structured XML file and annotate each patch with its author, a feature request/bug number and so on, possibly adding semantic-level patch information (such as "added such object", "removed such rule"...). Using such a format could help you obtain old versions of your document rather easily, as tools exist to process it.
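Separately from the patch format itself, the "calculate the deltas at intervals" idea from the question can be sketched in a few lines of Python. This is not RFC 5261; it is a simplified illustration that assumes every tracked field has a stable identifier and that each version can be flattened to a mapping from identifier to value. The function and key names are hypothetical.

```python
from datetime import datetime, timezone


def field_changes(old, new, author, when=None):
    """Compare two {field_id: value} snapshots and return change records.

    Each record carries the minimum listed in the question: author,
    time stamp, type of change, and what the value was changed from.
    """
    when = when or datetime.now(timezone.utc).isoformat()
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            kind, before = "removed", old[key]
        elif key not in old:
            kind, before = "added", None
        elif old[key] != new[key]:
            kind, before = "modified", old[key]
        else:
            continue  # unchanged field, nothing to record
        changes.append(
            {
                "field": key,
                "author": author,
                "timestamp": when,
                "type": kind,
                "was": before,
            }
        )
    return changes


# Hypothetical snapshots of the data section at two points in time.
v1 = {"title": "Draft", "price": 10}
v2 = {"title": "Final", "price": 10, "note": "checked"}
for record in field_changes(v1, v2, author="alice"):
    print(record)
```

Grouping several of these records under one application-level change would be the next step for the "atomic" higher-level updates mentioned in the question.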

Content Repository, Document Repository: what's the difference, if any?

What is the difference between CMSs and DMSs? Both store data and give access to the data, so where do they differ? Can Apache Jackrabbit be used in place of Alfresco?
I would differentiate the two based on the mutability of the data under management:
In a Document Management System, the Documents are immutable (and often opaque) blobs created by external applications
A Content Management system contains mutable data (the content) and provides an interface to mutate said content.
Of course, DMSs have evolved to break this rule - for example, by adding document properties to a Word Document... however, people seem comfortable with calling this "metadata" and therefore it can break all the rules.
Given the immutable nature of the data, a DMS can make assumptions that a CMS can not - given these assumptions, I would be careful stating (as per Wolfwyrd's comment) that DMS is a subset of CMS.
Content management refers to a system that stores content of any type. It tends to involve a workflow (i.e. creators, editors, publishers). Content management also often deals with fragments of data applied to templates. For example, a template for a page may be created with an editable body, subtitle, title etc.
Document management refers to a system that stores electronic documents or files of any type. Document management can be considered a subset of content management - a more specialised form of content management as it approaches the management only of electronic files, not necessarily the potential to store fragments of content.
Jackrabbit and Alfresco both supply content management services, so they can also be used to support document management by the simple fact that one is a subset of the other. So in this case, it comes down to which one provides the features you need.

REST Media type explosion

In my attempt to redesign an existing application using the REST architectural style, I came across a problem which I would like to term "Mediatype Explosion". However, I am not sure if this is really a problem or an inherent benefit of REST. To explain what I mean, take the following example.
One tiny part of our application looks like:
collection-of-collections->collections-of-items->items
i.e. the top level is a collection of collections, and each of these collections is again a collection of items.
Also, each item has 8 attributes which can be read and written individually. Trying to expose the above hierarchy as RESTful resources leaves me with the following media types:
application/vnd.mycompany.collection-of-collections+xml
application/vnd.mycompany.collection-of-items+xml
application/vnd.mycompany.item+xml
Furthermore, since each item has 8 attributes which can be read and written individually, it will result in another 8 media types. E.g. one such media type for the "value" attribute of an item would be:
application/vnd.mycompany.item_value+xml
As I mentioned earlier, this is just a tiny part of our application, and I expect several different collections and items that need to be exposed in this way.
My questions are:
Am I doing something wrong by having this huge number of media types?
What is the alternative design method to avoid this explosion of media types?
I am also aware that the design above is highly granular, especially exposing individual attributes of the item and having separate media types for each of them. However, making it coarse means I will end up transferring unnecessary data over the wire when in reality the client only needs to read or write a single attribute of an item. How would you approach such a design issue?
One approach that would reduce the number of media types required is to use a media type defined to hold lists of other media-types. This could be used for all of your collections. Generally lists tend to have a consistent set of behavior.
You could roll your own vnd.mycompany.resourcelist or you could reuse something like an Atom collection.
With regards to the specific resource representations like vnd.mycompany.item, what you can do depends a whole lot on the characteristics of your client. Is it in a browser? can you do code-download? Is your client a rich UI, or is it a data processing client?
If the client is going to do specific data processing then you pretty much need to stick with the precise media types and you may end up with a large number of them. But look on the bright side, you will have fewer media-types than you would have namespaces if you were using SOAP!
Remember, the media-type is your contract, if your application needs to define lots of contracts with the client, then so be it.
However, I would not go as far as defining contracts to exchange single attribute values. If you feel the need to do that, then you are doing something else wrong in your design. Distributed interface design needs to have chunky conversations, not chatty ones.
I think I finally got the clarification I sought for the above question from Ian Robinson's presentation and thought I should share it here.
Recently, I came across the statement "media type for helping tune the hypermedia engine, schema for structure" in a blog entry by Jim Webber. I then found this presentation by Ian Robinson of Thoughtworks. This presentation is one of the best that I have come across for a very clear understanding of the roles and responsibilities of media types and schema languages (the entire presentation is a treat and I highly recommend it to all). Especially look out for the slides titled "You've Chosen application/xml, you bstrd." and "Custom media types". Ian clearly explains the different roles of the schemas and the media types. In short, this is my takeaway from Ian's presentation:
A media type description includes the processing model that identifies hypermedia controls and defines what methods are applicable for resources of that type. Identifying hypermedia controls means answering "How do we identify links?" In XHTML, links are identified by specific tags, while RDF has different semantics for the same thing. The next thing that media types help identify is which methods are applicable to resources of a given media type. A good example is the Atom (application/atom+xml) specification, which gives a very rich description of hypermedia controls: it tells us how the link element is defined and what we can expect to be able to do when we dereference a URI, so it actually tells us something about the methods we can expect to be able to apply to the resource. The structural information of a resource representation is NOT part of the media type description; it is provided by an appropriate schema for the actual representation, i.e. the media type specification won’t necessarily dictate anything about the structure of the representation.
So what does this mean to us? Simply that we don't need a separate media type for describing each resource, as described above in my original question. We just need one media type for the entire application. This could be a totally new custom media type, or a custom media type which reuses existing standard media types, or better still, simply a standard media type that can be reused without change in our application.
Hope this helps.
In my opinion, this is the weak link of the REST concept. As an architectural and interface style, REST is outstanding and the work done by Roy F. and others has advanced the state of the art considerably. But there is an upper limit to what can be communicated (not just represented) by standard media types.
For people to understand and use your REST-ish API, they need to understand the meaning of the data. There are APIs where the media types tell most of the story; e.g. if you have a text-to-speech API, the input media type is text/plain and the output media type is audio/mp4, then someone familiar with the subject matter could probably make do. Text in, audio out, probably enough to go on in this case.
But many APIs can't communicate much of their meaning with just media type. Let's say you have an API that handles airline ticketing. The inputs and outputs will mostly be data. The media types on input and output of every API could be application/json or application/xml, so the media type doesn't transmit a lot of information. So then you would look at the individual fields in the inputs & outputs. Maybe there's a field called "price". Is that in dollars or pennies? USD or some other currency? I don't know how a user would answer those questions without either (a) very descriptive names, like "price_pennies_in_usd", or (b) documentation. Not to mention format conventions. Is an account number provided with or without dashes, must letters be all-caps and so on. There is no standard media type that defines these issues.
It's one thing when we're in situations where the client doesn't need a semantic understanding of the data. That works well. The fact that browsers can visually render any compliant document, and interact with any compliant resource, is really great. That's basically the "media" use case.
But it's entirely different when the client (or actually, the developer/user behind the client) needs to understand the semantics of the data. DATA IS NOT MEDIA. There is no way to explain data in all its real-world meaning and subtlety other than documenting it. This is the "data" use case.
The overly-academic definition of REST works in the media use case. It doesn't work, and needs to be supplemented with non-pure but useful things like documentation, for other use cases.
You're using the media type to convey details of your data that should be stored in the representation itself. So you could have just one media type, say "application/xml", and then your XML representations would look like:
<collection-of-collections>
  <collection-of-items>
    <item>
    </item>
    <item>
    </item>
  </collection-of-items>
  <collection-of-items>
    <item>
    </item>
    <item>
    </item>
  </collection-of-items>
</collection-of-collections>
If you're concerned about sending too much data, substitute JSON for XML. Another way to save on bytes written and read is to use gzip encoding, which cuts things down about 60-70%. Unless you have ultra-high performance needs, one of these approaches ought to work well for you. (For better performance, you could use very terse hand-crafted strings, or even drop down to a custom binary TCP/IP protocol.)
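If you want to check the gzip effect on your own payloads, a quick measurement along these lines may help. It is only a sketch: the element names and the generated content are made up, and the ratio you see depends entirely on how repetitive your real representations are.

```python
import gzip


def build_xml(item_count):
    """Build a made-up collection document with item_count items."""
    items = "".join(
        f"<item><value>{i}</value></item>" for i in range(item_count)
    )
    return f"<collection-of-items>{items}</collection-of-items>".encode("utf-8")


payload = build_xml(200)
compressed = gzip.compress(payload)

# Compare the raw and compressed sizes in bytes.
print(len(payload), len(compressed))

# The client gets the identical document back after decompression.
assert gzip.decompress(compressed) == payload
```

In practice the HTTP server and client usually negotiate this through the Accept-Encoding and Content-Encoding headers, so application code rarely compresses by hand.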
Edit One of your concerns is that:
making [the representation] coarse means I will end up transferring unnecessary data over the wire when in reality the client only needs to read or write a single attribute of an item
In any web service there is quite a lot of overhead in sending messages (each HTTP request might cost several hundred bytes for the start line and request headers and ditto for each HTTP response as in this example). So in general you want to have less granular representations. So you would write your client to ask for these bigger representations and then cache them in some convenient in-memory data structure where your program could read data from them many times (but be sure to honor the HTTP expiration date your server sets). When writing data to the server, you would normally combine a set of changes to your in-memory data structure, and then send the updates as a single HTTP PUT request to the server.
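The cache-then-batch idea can be sketched without any networking code. Everything here is hypothetical (the class name, the method names, passing the lifetime in seconds); a real client would take the lifetime from the response's caching headers and send the body from build_put_body with its HTTP library of choice.

```python
import time


class CachedItem:
    """Holds one coarse representation in memory and batches local edits."""

    def __init__(self, representation, max_age_seconds, now=time.time):
        self._now = now  # injectable clock, which makes testing easier
        self._data = dict(representation)
        self._expires_at = now() + max_age_seconds
        self._pending = {}

    def is_fresh(self):
        """True while the cached copy is still within its lifetime."""
        return self._now() < self._expires_at

    def get(self, name):
        """Read an attribute; a pending local change wins over the cached value."""
        if name in self._pending:
            return self._pending[name]
        return self._data[name]

    def set(self, name, value):
        """Record a change locally instead of sending one request per attribute."""
        self._pending[name] = value

    def build_put_body(self):
        """Return the full representation to send in a single PUT request."""
        return {**self._data, **self._pending}


item = CachedItem({"value": 1, "label": "x"}, max_age_seconds=60)
item.set("value", 2)
item.set("label", "y")
print(item.build_put_body())
```

Two attribute changes end up in one request body, which is the "chunky, not chatty" trade-off described above.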
You should grab a copy of Richardson and Ruby's RESTful Web Services, which is a truly excellent book on how to design REST web services and explains things much more clearly than I could. If you're working in Java I highly recommend the Restlet framework, which very faithfully models the REST concepts. Roy Fielding's UC Irvine dissertation defining the REST principles may also be helpful.
A media type should seldom be created, and time should be invested in making sure the format can survive change.
As you're relying on xml, there is no particular reason why you couldn't create one media type, provided that media type is described in one source.
Choosing ATOM over having one host media type that supports multiple root elements doesn't necessarily bring you anything: you'll still need to start reading the message within the context of a specific operation before deciding if enough information is present to process the request.
So I would suggest that you could happily have one media type, represented by one root element, and use a schema language to specify which of the elements can be contained.
In other words, a language like XSD can let you type your media type to support one of multiple root elements. There is nothing inherently wrong with application/vnd.acme.humanresources+xml describing an XML document that can take either of two elements as its root element.
So to answer your question, create as few media types as you can possibly afford, by questioning whether what you put in the documentation of the media type will be understandable and implementable by a developer.
Unless you intend to register these media types, you should pick one of the existing MIME types instead of trying to make up your own formats. As Jim mentions, application/xml, text/xml, or application/json works for most of what gets transmitted in a REST design.
In reply to Darrel here is Roy's full post. Aren't you trying to define typed resources by creating your own mime types?
Suresh, why isn't HTTP+POX RESTful?