Format to use for exposing structured metadata (Dublin Core, RDF, Atom)? - metadata

In an altruistic manner I would like to expose as much structured data about my website as possible.
I also wouldn't mind an SEO boost, but it's secondary.
Seems there are a couple of options:
Full on RDF (kill me now XML)
Atom with your own custom tags (liking that)
RDFa in your webpage (might help SEO)
Dublin Core Meta tags
Dublin Core using RDFa
Atom with RDFa
I'm just trying to make it easy for people to get data off my site.
The nice thing about standards is that there are so many of them to choose from.
Which one do you think I should use?

RDF is not just XML; RDF is a data model that relies on sets of triples (subject, predicate, object) and URIs to unambiguously refer to things. In fact, people working with RDF tend to run away from RDF/XML and prefer RDF/Turtle or RDF/N-Triples, or even RDF in JSON format. These serializations are more readable, easier to construct and easier to parse. Moreover, there are many tools that let you convert between the whole range of RDF flavors (e.g., rapper or Jena).
When it comes to publishing information in RDF, you generally have three different choices:
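To make the triple model concrete, here is a minimal sketch in plain Python (no RDF library) that holds a few triples and writes them out as N-Triples. The review and user URIs are hypothetical placeholders, and the literal escaping only covers backslashes and double quotes, so treat it as an illustration rather than a complete serializer.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # URI of the thing being described
    predicate: str                # URI of the property
    obj: str                      # URI, or a literal value
    obj_is_literal: bool = False  # True when obj is plain text, not a URI


def to_ntriples(triples):
    """Serialize triples as N-Triples: one '<s> <p> o .' statement per line."""
    lines = []
    for t in triples:
        if t.obj_is_literal:
            # Simplified escaping (assumption): only backslash and double quote.
            escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{t.obj}>"
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return "\n".join(lines)


REVIEW = "http://example.org/review/1"  # hypothetical URI
triples = [
    Triple(REVIEW, "http://purl.org/dc/terms/title", "Great hot sauce", True),
    Triple(REVIEW, "http://purl.org/dc/terms/creator", "http://example.org/user/alice"),
]
print(to_ntriples(triples))
```

In practice you would likely hand this job to a toolkit such as the ones mentioned above; the point is only that each statement is a subject, a predicate and an object.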
To provide RDF dumps of your data.
To publish RDF following the Linked Data rules.
To add metadata to your existing Web pages with RDFa.
... these are not mutually exclusive, and you can go for any combination of them. The most important thing is choosing the correct structure of URIs (see Cool URIs don't change).
Following your SO profile I see that you're working on a social taste recommendation website (http://evocatus.com/). I assume that you might want to expose information about those reviews. So for a review like http://evocatus.com/sauce/cholula-chipolte-hot-sauce/272645/ you can provide different serializations and give back not just HTML but also:
.../cholula-chipolte-hot-sauce/272645/rdf-turtle
.../cholula-chipolte-hot-sauce/272645/rdf-xml
.../cholula-chipolte-hot-sauce/272645/rdf-json
and one for any other type of format you want to expose.
In addition, the HTML version could be enhanced with RDFa. Depending on the type of client that consumes your data, following content negotiation rules, you'll redirect the HTTP request to whichever format is accepted by the client. This is established by the HTTP header Accept. So a request like the one below with curl would be redirected by your application giving back the RDF/XML version:
curl -H 'Accept: application/rdf+xml' .../cholula-chipolte-hot-sauce/272645/
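On the server side, the negotiation step could look roughly like the sketch below: parse the Accept header, order the media types by their q-value, and pick the first one you can serve. The suffixes come from the URLs listed above; the mapping from media types to suffixes (in particular application/json for the JSON flavor) and the function names are assumptions made for illustration.

```python
# Media types this hypothetical site can serve, mapped to the URL suffix
# to redirect to. The mapping itself is an assumption.
SUPPORTED = {
    "text/html": "html",
    "application/rdf+xml": "rdf-xml",
    "text/turtle": "rdf-turtle",
    "application/json": "rdf-json",
}


def parse_accept(header):
    """Return the media types of an Accept header, highest q-value first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            entries.append((-q, position, media_type))
    # Sort by descending q, keeping header order for ties.
    return [media_type for _, _, media_type in sorted(entries)]


def choose_suffix(accept_header, default="html"):
    """Pick the suffix of the first acceptable format we support."""
    for media_type in parse_accept(accept_header):
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
    return default  # wildcards and unknown types fall back to HTML


print(choose_suffix("application/rdf+xml"))
print(choose_suffix("text/html;q=0.5, text/turtle"))
```

Wildcards such as */* are not matched explicitly here; they simply fall through to the HTML default, which is usually what a browser expects.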
In the future, people would be able to say things about existing reviews in your site by just reusing your URIs in their RDF data. That's the power of RDF and Linked Data.
About Dublin Core, you could use Dublin Core with either RDF or RDFa. But, in your case there are some other interesting ontologies to consider and the right thing would be to use a mix of all of them:
FOAF: Friend Of A Friend, to express user personal information and relations between users.
Tag Ontology: A very simple ontology to express tag information.
RDF Review Vocabulary: Vocabulary for expressing reviews and ratings using RDF.
GoodRelations: An ontology to express product information and eCommerce.
Vcard/RDF: for addresses, normally used in combination with FOAF.
There is one site called http://revyu.com/ that uses all these ontologies (except GoodRelations), so you could use it as a guideline. See for instance:
http://revyu.com/reviews/342b55e79f64d5ca37f633b93c246c6ad6e14b04/about/html
http://revyu.com/reviews/342b55e79f64d5ca37f633b93c246c6ad6e14b04/about/rdf
... these are HTML and RDF versions of the same review.
Unlike with Atom, as you can see, with RDF you can reuse existing ontologies, and since RDF is based on URIs everything is interlinked.
Linked Data Added Value
What would happen if you invested some time linking your products and reviews to other data sources (e.g., dbpedia.org or freebase.com)? Let's imagine that you start linking all your beer reviews (http://evocatus.com/beer/) to the brewery that manufactures each product (http://dbpedia.org/page/Alcoholic_beverage). By following the links you would be able to find out, for instance, where the preferred beers are manufactured. DBpedia holds that information.
Freebase, which also provides RDF versions, lets you link to manufacturers as well. For instance, see http://rdf.freebase.com/rdf/en.budweiser in RDF or http://www.freebase.com/view/en/budweiser in HTML.

The Dublin Core Schema is a small set of vocabulary terms that can be used to describe web resources (video, images, web pages, etc.).
Example of Dublin Core code
<meta name="DC.Format" content="video/mpeg; 10 minutes">
<meta name="DC.Language" content="en" >
<meta name="DC.Publisher" content="publisher-name" >
Link to Generate DC.Meta tags : http://www.dublincoregenerator.com/generator_nq.html
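If you would rather generate these tags from your own data than use an online generator, a small helper is enough. This is only a sketch: the function name and the input dictionary are hypothetical, and it emits the same DC.* naming style as the example above.

```python
from html import escape


def dublin_core_meta(fields):
    """Render a dict of Dublin Core element -> value as <meta> tags."""
    return "\n".join(
        # Escape both name and value so quotes cannot break the attribute.
        f'<meta name="DC.{escape(name, quote=True)}" '
        f'content="{escape(value, quote=True)}">'
        for name, value in fields.items()
    )


print(dublin_core_meta({"Language": "en", "Publisher": "publisher-name"}))
```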
For SEO purposes, DC meta tags are obsolete.
"It was found that using Dublin Core elements did not improve the retrieval rank of the web pages" and that "Dublin Core metadata, as a well-known metadata schema, is not widely accepted and used by search engine designers and the spiders do not consider its elements while ranking the web pages."
Google does NOT use Dublin Core in its indexing, and there is no mention of Dublin Core in Google's or other search engines' indexing documentation.
In the UK, government organisations use DC to provide standardised access to tags.
That's not to say Google, Bing, Yahoo, etc will never implement them. Google is using more metadata and rich snippets these days.

Can a product's structured data be split into separate sections?

I'm working on a site optimizing their structured data and noticed they use YotPo to pull in ratings and reviews. YotPo is defining the Product and only has a couple of values for AggregateRating that are being injected via JavaScript.
I have all of the other product data coming from the CMS, so I defined all the other information there, but when I run Google's testing tool on the page, it sees it as 2 products and says it's missing fields for the YotPo markup that are already defined in my markup.
Is there some way to let Google know that they're both chunks of data for the same product so it only sees it as a single product with the combined data?
You need to make sure both Yotpo and your CMS use the same format, e.g. JSON-LD or microdata.
You can then indicate that they relate to the same product by setting both up to use the same id.
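As a rough illustration in JSON-LD terms, the two chunks would each declare a Product node with the same @id. The sketch below builds both as Python dictionaries and merges nodes that share an @id, which is how a JSON-LD consumer may treat them; the product URL, property values and the merge_nodes helper are hypothetical, and whether a particular search engine or testing tool actually merges the two is not guaranteed.

```python
import json

PRODUCT_ID = "https://example.com/products/123#product"  # hypothetical @id

# Markup generated by the CMS.
cms_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "@id": PRODUCT_ID,
    "name": "Example product",
    "sku": "ABC-123",
}

# Markup injected by the reviews widget, pointing at the same @id.
reviews_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "@id": PRODUCT_ID,
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.5",
        "reviewCount": "12",
    },
}


def merge_nodes(*nodes):
    """Combine nodes that share an @id into one node per @id."""
    merged = {}
    for node in nodes:
        merged.setdefault(node["@id"], {}).update(node)
    return list(merged.values())


print(json.dumps(merge_nodes(cms_markup, reviews_markup), indent=2))
```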

Bluemix Embedded Reports REST architecture

When one wishes to use the Bluemix Embedded Reports one first creates a package and then a report definition. After that, one is supposed to use the REST APIs that are documented using Swagger here:
https://erservice-impl.ng.bluemix.net/ers/swagger-ui/
Unfortunately, I am unable to find any architectural definitions for these APIs. To elaborate, there are APIs to get connections, packages, definitions, reports, models, datasources and visualizations ... however I am unable to find any documentation describing when I would use which. In addition, some fundamental APIs, such as those relating to operations on "reports", seem to want a "reportId", and I am lost on how to retrieve or obtain one of those. Other mysteries are "What are report links?" and what are the semantics of obtaining a "report instance"? For a report "rendered in a format" ... what are the allowable formats, and when would I use one vs. another?
Again ... the REST API isn't bad and Swagger provides useful syntax documentation but without the associated semantic comprehension, it leaves the reader cold on quite how to use the technology.
I am hoping that there is additional documentation either existing somewhere or else planned for release as soon as practicable. If anyone knows where to find such or has additional information on how to interpret the semantics of the APIs, that would be a fantastic answer to the question.
Some information around the REST API, particularly around running of reports, is available on the documentation page for the service, found here: https://console.ng.bluemix.net/docs/services/EmbeddableReporting/index.html#gettingstartedtemplate
Though the full API is provided in swagger, users are expected to use only 3 resources: connection, definitions, and reports. The other endpoints deal with the management of report artifacts and their related resources (datasources, models, packages)
The first step in using ERS is to define datasources and report specifications (definitions) within the admin dashboard. Then, each definition will be given an ID that you can copy/paste into your RESTful calls.
1. Connect to ERS using basic auth and the /connection endpoint. This sends back cookies (including a JSESSIONID) that you are expected to send with all other calls.
POST /connection
with an empty json body {} and basic auth headers
2. Run a report in a particular format (2 flavours)
2.1 For 'vanilla' reports with no special options or parameters, you can use the shortcut call, which both creates a report resource and runs it in the format you choose:
GET /definitions/{definition_id}/reports/{format}
where definition_id is taken from the admin dashboard, and format is one of html, phtml (partial HTML for embedding; the most common choice), pdf, json, xml, csv
2.2 For more complex cases, you need to first create a report instance. This holds state for the report that is being run, so you can do a next-page or check parameter values and options. Then you can run the report in a format.
POST /definitions/{definition_id}/reports
with a body with your options/parameters. You can also send an empty json body ({}) for all the defaults. This returns a json payload with a reportId and location to run the report from
GET /reports/{report_id}/{format}
You might also want to look at the sample that is included in the documentation (in javascript, java and node) to see how to do this in an app. The documentation mentioned above also has curl examples.
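For orientation, the call sequence above could be sketched in Python as follows. Only the request objects are built here; nothing is sent. The base URL is a placeholder (an assumption, use the one from your service credentials), the helper names are hypothetical, and the cookie string stands for whatever the /connection call returns.

```python
import base64
import json
import urllib.request

BASE_URL = "https://ers.example.invalid/ers/v1"  # placeholder, not the real host


def connection_request(user, password):
    """Step 1: POST /connection with basic auth and an empty JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{BASE_URL}/connection",
        data=json.dumps({}).encode(),
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def shortcut_report_request(definition_id, fmt, cookie):
    """Step 2.1: create and run a report in one call."""
    return urllib.request.Request(
        f"{BASE_URL}/definitions/{definition_id}/reports/{fmt}",
        headers={"Cookie": cookie},
        method="GET",
    )


def create_report_request(definition_id, options, cookie):
    """Step 2.2a: create a report instance; the response carries a reportId."""
    return urllib.request.Request(
        f"{BASE_URL}/definitions/{definition_id}/reports",
        data=json.dumps(options).encode(),
        headers={"Cookie": cookie, "Content-Type": "application/json"},
        method="POST",
    )


def run_report_request(report_id, fmt, cookie):
    """Step 2.2b: run an existing report instance in the given format."""
    return urllib.request.Request(
        f"{BASE_URL}/reports/{report_id}/{fmt}",
        headers={"Cookie": cookie},
        method="GET",
    )


req = shortcut_report_request("my-definition-id", "phtml", "JSESSIONID=abc")
print(req.get_method(), req.full_url)
```

Each request would then be passed to urllib.request.urlopen (or the equivalent in your HTTP client), with the cookies from step 1 carried along on every later call.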

Will Google accept combined JSON-LD and HTML meta/microdata?

I have a situation where I can put 99% of my structured data into JSON-LD in the <head> of my product pages. But the only way I can get the UPC is to place it inline as microdata.
Will Google aggregate the product data from the JSON-LD and the inline microdata?
Most likely yes.
Google’s Structured Data Testing Tool works fine if you are using JSON-LD and Microdata (and RDFa).
Google does not say otherwise (they did in the past), see their Structured Data Policies:
The data may be embedded in your webpage using any of three supported formats: JSON-LD, RDFa, and microdata.
Some of Google’s structured data features are (currently) only documenting JSON-LD (for example, TV and Movie Watch Actions); for others, Google recommends using RDFa/Microdata, see for example their "About schema.org":
[…] Google recommends the use of JSON-LD for those features. For the remaining Rich Snippets types and breadcrumbs, Google recommends the use of microdata or RDFa.
It wouldn’t make sense for Google to restrict authors so that they can’t make use of all the features (using different syntaxes) in the same document.
That said, you can never know for sure (their documentation is not always up-to-date, and their rules might change each day.)
One can use both in the same page, but one may not split the information about a single entity into parts. For example, putting some information about the product in JSON-LD and some in microdata does not work.
Also, two separate blocks mean two entities. One may use @id in JSON-LD and itemid in microdata to mark them as the same entity, but Google’s Structured Data Testing Tool still shows them as two entities.
You can combine all three formats in a single page, but Google gives more priority to JSON-LD and will take the data from JSON-LD should the other one (or two) have different values.

CMS for managing plain-text content, with tagging

We have some quite specific requirements for our app that a CMS may help us with, and were hoping that someone may know of a CMS that matches these requirements (it's quite a laborious task to download each CMS and verify this manually).
We want a CMS to allow users to create and manage articles, but storing the articles in plain-text only. All of the CMSs that we have looked at so far are geared towards creating HTML pages. We want the CMS to manage workflow (approval process), and tracking of history.
The requirements for plain text only is that the intent is to allow business people to generate content which we are going to display in our Silverlight application - we don't want to go down the route of hosting and displaying arbitrary HTML in the app as we want the styling to be seamless with our app, amongst other reasons.
We would also want to allow the user to be able to link to media stored on the server, but not to external sites (i.e. HTML with no formatting, or some other way of specifying article links), and the third requirement is the ability to tag articles and search on articles.
Does anyone know of any non-HTML targeted CMS systems that may match these requirements?
I would expect several CMS systems to allow this, but eZ Publish stores content as plain XML. You have a way of allowing certain tags if you wish, and of explicitly preventing, for example, external links. You then have options for how to present that content according to the templates you choose to use.
You also have control via a /layout/set/myLayout directive.
You could for example retrieve the content as a plain xml feed or a print layout or whatever custom format you choose at the time. With appropriate headers.
http://doc.ez.no/eZ-Publish/Technical-manual/3.10/Reference/Modules/layout/(language)/eng-GB
vs.
http://doc.ez.no/layout/set/print/eZ-Publish/Technical-manual/3.10/Reference/Modules/layout/(language)/eng-GB
You could define a layout such as /layout/set/xml/....
Workflow as in content approval processes, versioning, tagging and search are standard.
You can give Statamic a try.
http://statamic.com/
Not sure if you can disallow external links, though.

Comparing data with RESTful API

For a website I am working on, I am defining a RESTful API. I believe I got it (mostly) correct using proper resource URIs and proper use of GET/POST/PUT/DELETE.
However there is one point where I can't quite figure out what the proper way to do it "in" REST would be - comparison of lists.
Let's say I have a bookstore and a customer can have a wishlist. The wishlist consists of books (their full Book record, i.e. name, synopsis, etc) and a full copy of the list exists on the client. What would be a good way to design the RESTful API to allow a client to query the correctness of its local wishlist (i.e. get to know what books have been added/removed on the wishlist on the server side)?
One option would be to just download the full wishlist from the server and compare it locally. However this is quite a large amount of data (due to the embedded content) and this is a mobile client with a low-bandwidth connection, so this would cause a lot of problems.
Another option would be to download not the whole wishlist (i.e. not including book infos) but only a list of the books' identifiers. This would be not much data (compared to the previous option) and the client could compare the lists locally. However to get the full book record for newly added books, a REST call would have to be made for every single new book. Again, as this is a mobile client with bad network connectivity, this could be problematic.
A third option, and my favorite, would be that the client sends its list of identifiers to the server and the server compares it to the wishlist and returns which books were removed and the data for books that were added. This would mean a single roundtrip and only the necessary amount of data. As the wishlist size is estimated to be less than 100 entries, sending just the IDs would be a minimal amount of data (~0.5kb). However I don't know what kind of call would be appropriate - it can't be GET as we are sending data (and putting it all in the URL does not feel right), it can't be POST/PUT as we do not change anything on the server. Obviously it's not DELETE either.
How would you implement this third option?
Side-question: how would you solve this problem (i.e. why is option 3 stupid or what better, simple solutions may there be)?
Thank you.
P.S.: A fourth option would be to implement a more sophisticated protocol where the server keeps track of changes to the list (additions/deletes) and the client can e.g. query for changes based on a version identifier or simply a timestamp. However I like the third option better as implementation-wise it is much simpler and less error-prone on both client and server.
There is nothing in HTTP that says that POST must update the server. People seem to forget the following line in RFC2616 regarding one use of POST:
Providing a block of data, such as the result of submitting a
form, to a data-handling process;
There is nothing wrong with taking your client side wishlist and POSTing to a resource whose sole purpose is to return a set of differences.
POST /Bookstore/WishlistComparisonEngine
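The handler behind such a resource only needs to compute a set difference. A minimal sketch, assuming the client POSTs a list of book identifiers and the server holds the wishlist as a mapping from identifier to full book record (both the function name and the data shapes are assumptions):

```python
def diff_wishlist(client_ids, server_books):
    """Compare the client's id list with the server's wishlist.

    client_ids:   iterable of book ids the client currently holds
    server_books: dict mapping book id -> full book record on the server
    """
    client = set(client_ids)
    server = set(server_books)
    return {
        # Ids the client still has but the server no longer lists.
        "removed": sorted(client - server),
        # Full records for books the client has not seen yet.
        "added": [server_books[book_id] for book_id in sorted(server - client)],
    }


server_wishlist = {
    "b1": {"id": "b1", "title": "First book"},
    "b2": {"id": "b2", "title": "Second book"},
}
print(diff_wishlist(["b1", "b3"], server_wishlist))
```

The response carries only identifiers for removals and full records for additions, so the client gets everything it needs in a single roundtrip.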
The whole concept behind REST is that you leverage the power of the underlying HTTP protocol.
In this case there are two HTTP headers that can help you find out if the list on your mobile device is stale. An added benefit is that the client on your mobile device probably supports these headers natively, which means you won't have to add any client side code to implement them!
If-Modified-Since: check to see if the server's copy has been updated since your client first retrieved it
ETag: check to see if a unique identifier for your client's local copy matches that which is on the server (the client sends the identifier back in an If-None-Match request header). An easy way to generate the unique string required for ETags on your server is to just hash the service's text output using MD5.
You might try reading Mark Nottingham's excellent HTTP caching tutorial for information on how these headers work.
If you are using Rails 2.2 or greater, there is built in support for these headers.
Django 1.1 supports conditional view processing.
And this MIX video shows how to implement with ASP.Net MVC.
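Independent of any framework, the ETag check boils down to a few lines. The sketch below hashes the response body with MD5, as suggested above, and answers 304 Not Modified when the client's If-None-Match value matches; the function names and the tuple-shaped response are assumptions for illustration only.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the MD5 hash of the response body."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def conditional_response(body: bytes, if_none_match=None):
    """Return (status, headers, body), honouring an If-None-Match header."""
    etag = make_etag(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        # The client's copy is current: send no body at all.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


wishlist = b'["isbn-1", "isbn-2"]'
status, headers, _ = conditional_response(wishlist)
print(status, headers["ETag"])
print(conditional_response(wishlist, headers["ETag"])[0])
```

A real server would also handle lists of ETags and the * wildcard in If-None-Match; this sketch compares a single value only.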
I think the key problems here are the definitions of Book and Wishlist, and where the authoritative copies of Wishlists are kept.
I'd attack the problem this way. First you have Books, which are keyed by ISBN number and have all the metadata describing the book (title, authors, description, publication date, pages, etc.) Then you have Wishlists, which are merely lists of ISBN numbers. You'll also have Customer and other resources.
You could name Book resources something like:
/book/{isbn}
and Wishlist resources:
/customer/{customer}/wishlist
assuming you have one wishlist per customer.
The authoritative Wishlists are on the server, and the client has a local cached copy. Likewise the authoritative Books are on the server, and the client has cached copies.
The Book representation could be, say, an XML document with the metadata. The Wishlist representation would be a list of Book resource names (and perhaps snippets of metadata). The Atom and RSS formats seem good fits for Wishlist representations.
So your client-server synchronization would go like this:
GET /customer/{customer}/wishlist
for ( each Book resource name /book/{isbn} in the wishlist )
GET /book/{isbn}
This is fully RESTful, and lets the client later on do PUT (to update a Wishlist) and DELETE (to delete it).
This synchronization would be pretty efficient on a wired connection, but since you're on a mobile you need to be more careful. As @marshally points out, HTTP 1.1 has a lot of optimization features. Do read that HTTP caching tutorial, and be sure to have your web server properly set Expires headers, ETags, etc. Then make sure the client has an HTTP cache. If your app were browser-based, you could leverage the browser cache. If you're rolling your own app, and can't find a caching library to use, you can write a really basic HTTP 1.1 cache that stores the returned representations in a database or in the file system. The cache entries would be indexed by resource names, and hold the expiration dates, entity tags, etc. This cache might take a couple of days or a week or two to write, but it is a general solution to your synchronization problems.
You can also consider using GZIP compression on the responses, as this cuts down the sizes by maybe 60%. All major browsers and servers support it, and there are client libraries you can use if your programming language doesn't already (Java has GZIPInputStream, for instance).
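To get a feel for the savings on your own data, you can compress a sample payload and compare sizes. The wishlist below is made up, and because it is highly repetitive it compresses better than real book records would, so measure with representative data before relying on any particular ratio.

```python
import gzip
import json

# A made-up wishlist of 100 book records (hypothetical data).
books = [
    {
        "isbn": f"978000000{i:04d}",
        "title": f"Book {i}",
        "synopsis": "A long synopsis. " * 20,
    }
    for i in range(100)
]
payload = json.dumps(books).encode("utf-8")

compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

# Print the raw and compressed sizes in bytes.
print(len(payload), len(compressed))
```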
If I strip out the domain-specific details from your question, here's what I get:
In your RESTful client-server application, the client stores a local copy of a large resource. Periodically, the client needs to check with the server to determine whether its copy of the resource is up-to-date.
marshally's suggestion is to use HTTP caching, which IMO is a good approach provided it can be done within your app's constraints (e.g., authentication system).
The downside is that if the resource is stale in any way, you'll be downloading the entire list, which sounds like it's not feasible in your situation.
Instead, how about re-evaluating the need to keep a local copy of the Wishlist in the first place:
How is your client currently using the local Wishlist?
If you had to, how would you replace the local copy with data fetched from the server?
What have you done to minimize your client's data requirements when building its Wishlist view(s) and executing business logic?
Your third alternative sounds nice, but I agree that it doesn't feel too RESTful ...
Here's another suggestion that may or may not work: If you keep a version history of your list, you could ask for updates since a specific version. This feels more like something that can be a GET operation. The version identifiers could either be simple version numbers (like in e.g. svn), or if you want to support branching or other non-linear history they could be some kind of checksums (like in e.g. monotone).
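A rough sketch of the version-number variant: the server appends every change to a log, and a GET for "changes since version N" replays the log entries after N and returns the net result. The class and method names are hypothetical, and the sketch assumes a linear history with no duplicate adds.

```python
class VersionedWishlist:
    """Wishlist that logs every change under an increasing version number."""

    def __init__(self):
        self.version = 0
        self._log = []  # entries of (version, operation, book_id)

    def add(self, book_id):
        self.version += 1
        self._log.append((self.version, "add", book_id))

    def remove(self, book_id):
        self.version += 1
        self._log.append((self.version, "remove", book_id))

    def changes_since(self, version):
        """Net additions and removals after the given version."""
        added, removed = set(), set()
        for entry_version, operation, book_id in self._log:
            if entry_version <= version:
                continue
            if operation == "add":
                if book_id in removed:
                    removed.discard(book_id)  # removed, then re-added: no net change
                else:
                    added.add(book_id)
            else:
                if book_id in added:
                    added.discard(book_id)  # added, then removed: no net change
                else:
                    removed.add(book_id)
        return {
            "version": self.version,
            "added": sorted(added),
            "removed": sorted(removed),
        }


wishlist_log = VersionedWishlist()
wishlist_log.add("a")
wishlist_log.add("b")
client_version = wishlist_log.version  # the client synced at this point
wishlist_log.remove("a")
wishlist_log.add("c")
print(wishlist_log.changes_since(client_version))
```

The client stores the returned version number and sends it with its next request, for example as a query parameter on a GET.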
Disclaimer: I'm not an expert on REST philosophy or implementation by any means.
Edit: Did you add that PS after I loaded the question? Or did I simply not read your question all the way through before writing an answer? Sorry. I still think the versioning might be a good idea, though.