Graph databases vs. triple stores

What's currently the best choice to persist graph-like structures? Graph databases (e.g. Neo4j) or RDF triple stores (e.g. Virtuoso)?
For example, we have the following use case:
a weakly connected graph (similar to the citation graph of scholarly papers in a collection) with nearly 10M nodes;
quite rare updates;
critical operations: retrieving particular sub-graphs, updating nodes in a given sub-graph, re-computing link analysis measures (e.g. HITS or PageRank) after updating some nodes.
Providing a standard API for third-party applications to query the data (à la Facebook's or Twitter's) is also desirable.
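To make those critical operations concrete, here is a storage-agnostic sketch using networkx, purely for illustration and not tied to either a graph database or a triple store; the node ids and the "reviewed" attribute are made up.

    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([("paper:1", "paper:2"),       # citation-style edges
                      ("paper:2", "paper:3"),
                      ("paper:1", "paper:3")])

    # Retrieve a particular sub-graph (here: a node and its outgoing neighbourhood).
    sub = G.subgraph(["paper:1"] + list(G.successors("paper:1")))

    # Update nodes in that sub-graph.
    for node in sub.nodes:
        G.nodes[node]["reviewed"] = True

    # Re-compute the link analysis measures after the update.
    pagerank = nx.pagerank(G)
    hubs, authorities = nx.hits(G)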

With Virtuoso you have the following working for you:
-- SPARQL, SQL, SPASQL (SPARQL inside SQL), and SQL-inside-SPARQL support (e.g. for dealing with N-ary relations via magic/function predicates/properties).
-- works as a compact engine (e.g. as exploited by the KDE Desktop) or as a massive DBMS, as demonstrated by the live 17-billion-triples-plus LOD Cloud Cache or the smaller DBpedia Live instance.
-- includes full-text indexing and text patterns in SPARQL (via bif:contains); it also includes XPath/XQuery (via xcontains).
-- ACID or non-ACID mode, and likewise schema-last operation, when dealing with the property graph store.
-- via transformation middleware it can pull data from 80+ data sources (including REST APIs, SOAP services, hypermedia resources, ODBC- or JDBC-accessible relational data sources, etc.) and transform it into transient or persistent Linked Data graphs.
-- Linked Data publishing is automatic, i.e. post DBMS record creation you have built-in Linked Data pages that act as views into the DBMS. No messing around with URL-rewrite rules, 303 redirects, or anything like that. InterWeb-scale super keys just work!
That's it for now :-)
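As a rough illustration of the SPARQL plus full-text combination mentioned above, here is a minimal sketch using the SPARQLWrapper Python library against a local Virtuoso endpoint; the endpoint URL, prefix, and data are assumptions, not part of the original answer.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8890/sparql")   # Virtuoso's default SPARQL endpoint
    sparql.setQuery("""
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?paper ?title WHERE {
            ?paper dc:title ?title .
            ?title bif:contains "'graph' AND 'database'" .    # Virtuoso full-text predicate
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["paper"]["value"], row["title"]["value"])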

For graph traversals on databases that fit on a single machine (thus small to medium sized), graph databases like Neo4j will currently give better performance. Triplestores are catching up, though. The big advantage of a triple store compared to a graph database is that its data dumps and query language are standardized, which makes it a lot easier to move to another product and prevents vendor lock-in.

Related

What are the pros and cons of DynamoDB with respect to Google Cloud Datastore

My understanding is that DynamoDB behaves like a giant table for which you must specify a hash key and a range key.
Google Cloud Datastore, by contrast, is entity-based (like Cassandra) and more flexible, i.e. it can use more than one index.
But is there a more in-depth comparison?
AWS DynamoDB is a pretty simple flat key-value store. It has support for conditional writes and sets, which allow for some cool features. You specify the amount of horsepower you want (which you can only adjust a few times a day) and AWS splits your dataset uniformly across enough database nodes to meet your demand. You have to make sure your key values are sufficiently random to guarantee balanced access across your dataset. AWS almost guarantees single-digit-millisecond latencies. Transactions are not supported. You specify the consistency of read operations.
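A minimal sketch of that model using boto3, assuming a hypothetical table named "pageviews" with a hash key (user_id) and a range key (ts); the conditional write and the per-read consistency flag are the features mentioned above.

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("pageviews")            # hypothetical table: hash key user_id, range key ts

    # Conditional write: fail if an item with this primary key already exists.
    table.put_item(
        Item={"user_id": "u123", "ts": 1700000000, "url": "/home"},
        ConditionExpression="attribute_not_exists(user_id)",
    )

    # You choose the consistency per read; strongly consistent reads cost more throughput.
    resp = table.get_item(Key={"user_id": "u123", "ts": 1700000000},
                          ConsistentRead=True)
    print(resp.get("Item"))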
Google Cloud Datastore is a more sophisticated key-value-ish store with built-in transaction support and entity hierarchy. You don't have to worry about the capacity of the system; it automatically scales to your data size and access patterns. You have less control over some things, so you have to pay attention. You cannot mark an individual read as strongly consistent, but you can force consistency by structuring your entities in a certain way (e.g. ancestor queries).
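A minimal sketch of the entity hierarchy and transaction support using the google-cloud-datastore Python client; the "Account"/"Task" kinds and key names are made up, and the ancestor query is one way of getting the stronger consistency mentioned above.

    from google.cloud import datastore

    client = datastore.Client()                    # assumes default project and credentials

    # Entity hierarchy: a Task entity keyed under an Account parent.
    parent_key = client.key("Account", "alice")
    task_key = client.key("Task", "task-1", parent=parent_key)
    task = datastore.Entity(key=task_key)
    task.update({"done": False})

    with client.transaction():                     # built-in transaction support
        client.put(task)

    # Ancestor queries like this one are strongly consistent.
    query = client.query(kind="Task", ancestor=parent_key)
    tasks = list(query.fetch())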
One downside of Google Cloud products I have experienced is that documentation and language support are not very uniform. Sometimes you have to read the documentation for another language to understand the system fully, and many features are not supported in certain languages.
There are a lot of other differences. Look at the API reference for your favorite language on both documentation pages and you'll get a decent feel for the specific features of each.

Which NoSQL solution lets us easily create an analytics product?

Assume we want to build a simple Google Analytics clone, tracking pageviews. We will place javascript on websites that tracks pageviews.
Can the javascript dump data directly into the database without having to go through a server (preferred)?
We obviously want to dump a lot of data in there. Billions of rows.
Does the database scale easily with as little interference as possible? (DynamoDB's model is perfect: 0 overhead).
Can we do somewhat flexible querying: limit by date, and filter/limit by a number of tags?
Can the javascript dump data directly into the database without having to go through a server (preferred)?
For the databases I'm aware of, that would require browser clients to have write access to the database, which would make it trivial for an attacker to pollute your database with a bit of simple JavaScript. If that's tolerable, then it's certainly possible. For something like CouchDB or Cloudant, you'd just make the DB globally writable (but not readable or editable), so clients can push events as they occur.
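For illustration, a browser would push an event with a plain HTTP PUT; the sketch below uses Python's requests as a stand-in and assumes a local CouchDB with a database named "pageviews" that has been made writable for clients, as described above.

    import json
    import uuid

    import requests

    COUCH = "http://localhost:5984"                # local CouchDB (placeholder URL)
    DB = "pageviews"                               # a database left writable for clients

    # One pageview event; in the browser this PUT would be an XHR/fetch call.
    event = {"url": "/pricing", "ts": "2013-05-01T12:00:00Z"}
    doc_id = uuid.uuid4().hex
    resp = requests.put(f"{COUCH}/{DB}/{doc_id}",
                        data=json.dumps(event),
                        headers={"Content-Type": "application/json"})
    resp.raise_for_status()                        # CouchDB answers 201 Created on success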
Does the database scale easily with as little interference as possible? (DynamoDB's model is perfect: 0 overhead).
Cloudant specifically is built on BigCouch, which the creators built to deal with systems generating petabytes of data per second. So, it scales. It uses Dynamo's concept of quorum to maximize consistency between nodes.
FYI: BigCouch is merging with CouchDB later this year.
Can we do somewhat flexible querying: limit by date, and filter/limit by a number of tags?
CouchDB, BigCouch, and Cloudant all use MapReduce for queries. The views are built incrementally as your data enters the system, so accessing the results of a MapReduce query takes O(log n) time. Each system also provides special methods for streaming information about changes to the database as they occur, which is perfect for a dashboard.
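A rough sketch of what that looks like against CouchDB's HTTP API (the view, field names, and date keys are invented for illustration): a design document defines the MapReduce view, the view is queried with a date range, and the _changes feed streams updates.

    import requests

    COUCH = "http://localhost:5984"
    DB = "pageviews"

    # Design document: the map function is JavaScript, as CouchDB expects.
    design = {
        "views": {
            "by_date": {
                "map": "function (doc) { emit(doc.ts.slice(0, 10), 1); }",
                "reduce": "_count",
            }
        }
    }
    requests.put(f"{COUCH}/{DB}/_design/stats", json=design).raise_for_status()

    # Query the view, limiting by date and grouping counts per day.
    per_day = requests.get(
        f"{COUCH}/{DB}/_design/stats/_view/by_date",
        params={"startkey": '"2013-05-01"', "endkey": '"2013-05-31"', "group": "true"},
    ).json()

    # Stream changes as they happen -- handy for feeding a live dashboard.
    changes = requests.get(f"{COUCH}/{DB}/_changes",
                           params={"feed": "longpoll", "since": "0"}).json()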

Combining MongoDB and a GraphDB like Neo4J

As part of a CMS I'm developing, I've got MongoDB as the primary datastore, which feeds into ElasticSearch and Redis. All of this is configured declaratively.
I'm currently trying to develop a declarative API in JSON (a DSL of sorts) which, when implemented, will enable me to write uniform queries in JSON while, at the backend, these datastores work in tandem to come up with the result. Federated search, if you will.
Now, while fleshing out the supported types of queries for this JSON API, I've come across a class of queries not (efficiently) supported by my current setup: graph-based queries, like friend-of-friend, RDF queries, etc. This is something I'd like to support as well.
So I'm looking for a way to introduce a graph database into this ecosystem with the best fit. I should probably add that the app layer sits in Node.js.
I've come across lots of articles comparing Neo4j (a popular graph database) with MongoDB, but not many actual use cases or real-world scenarios in which the two complement each other.
Any pointers highly appreciated.
You might want to take a look at structr[1], which has a RESTful graph database backend that you can configure using Java beans. In future versions, there will be a configuration option using REST calls only, so that you can fire up a structr server and configure and use it as a standalone graph database backend.
Just contact us on twitter or via email.
(disclaimer: I'm one of the developers of structr, so this comment may not be 100% impartial :))
[1] http://structr.org
The databases are very much complementary.
Use MongoDB to store your raw data / system of record and load that raw data into Neo4j for additional insights/analysis. When you are dealing with unstructured data, you want to store the information in a datastore that is conducive to unstructured data - MongoDB fits the bill (as do other similar NoSQL databases). While Neo4j is considered a NoSQL database, it doesn't fit the bill for unstructured data: because you have to determine what is a relationship, what is a node, and what properties are stored for each, it's better suited to semi-structured data and to cases where you have some understanding of the type of analysis you want to do.
A great architecture is to store your unstructured data in MongoDB and use jobs to load it into Neo4j. This allows you to re-load your graph if you figure out there are new pieces of information you'd like to store in the graph for additional analysis.
They are definitely NOT replacements for each other. They fit very different use cases.
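A minimal sketch of such a load job using pymongo and the Neo4j Python driver; the "cms.articles" collection, the field names, and the credentials are placeholders, not from the answer above.

    from pymongo import MongoClient
    from neo4j import GraphDatabase

    mongo = MongoClient("mongodb://localhost:27017")
    articles = mongo["cms"]["articles"]            # hypothetical collection of raw CMS documents

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    # Periodic job: (re)build the graph from the system of record,
    # keeping the MongoDB _id as the cross-store key on each node.
    with driver.session() as session:
        for doc in articles.find({}, {"author": 1}):
            session.run(
                "MERGE (a:Article {mongo_id: $id}) "
                "MERGE (u:Author {name: $author}) "
                "MERGE (u)-[:WROTE]->(a)",
                id=str(doc["_id"]),
                author=doc.get("author", "unknown"),
            )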

Using Multiple Database Types to Model Data in a single application

Does it make sense to break up the data model of an application into different database systems? For example, the application stores all user data and relationships in a graph database (ideal for storing relationships), while storing other data in a document database, such as CouchDB or MongoDB? This would require the user graph database to reference unique ids in the document databases and vice versa.
Is this over complicating the data model and application? Or is this using the best uses of both types of database systems for scaling your application?
It definitely can make sense; it depends fully on the requirements of your application and on whether you can use the other database systems for the things they are really good at.
Take, for example, full-text search. Of course you can do more or less complex full-text searches with a relational database like MySQL. But there are systems, e.g. Lucene/Solr, which are optimized for exactly this and can search millions of documents quickly. So you use such a system for its special task (here: a nifty full-text search), have it return the identifiers, and then load the structured relational data from the RDBMS.
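For illustration, a sketch of that pattern using Solr's HTTP select API with sqlite3 standing in for the RDBMS; the core name, field names, and schema are assumptions.

    import sqlite3

    import requests

    # 1) Ask the search engine for matching document ids only.
    resp = requests.get(
        "http://localhost:8983/solr/articles/select",   # hypothetical Solr core
        params={"q": "body:graph", "fl": "id", "wt": "json", "rows": 20},
    ).json()
    ids = [doc["id"] for doc in resp["response"]["docs"]]

    # 2) Load the structured records from the relational database by those ids.
    db = sqlite3.connect("app.db")                      # sqlite3 stands in for MySQL here
    rows = []
    if ids:
        placeholders = ",".join("?" * len(ids))
        rows = db.execute(
            f"SELECT id, title, created_at FROM articles WHERE id IN ({placeholders})",
            ids,
        ).fetchall()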
Or CouchDB. I use CouchDB in some projects as a caching system, in combination with a relational database. Of course I need to take care of consistency, but it's definitely worth the effort. It boosted performance in those projects a lot and, for example, decreased the load on the server from 2 to 0.2. :)
Something like this is for instance called cross-store persistence. As you mentioned you would store certain data in your relational database, social relationships in a graphdb, user-generated data (documents) in a document-db and user provided multimedia files (pictures, audio, video) in a blob-store like S3.
It is mainly about looking at the use cases and making sure that, wherever you need it, you can get at the "primary" or index key of each store (in both directions). You can encapsulate the actual lookup in your domain or DAO layer.
Some frameworks, like the Spring Data projects, provide an initial kind of cross-store persistence out of the box, mostly integrating JPA with a different NoSQL datastore. For instance, Spring Data Graph allows you to store your entities in JPA, add social graphs or other highly interconnected data as a secondary concern, and leverage a graph database for the typical traversals and other graph operations (e.g. ranking, suggestions, etc.).
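Outside the Spring/JPA world, the same encapsulation can be sketched in a few lines; the example below is a toy Python DAO (not Spring Data) that keeps profile data in SQLite and the social graph in Neo4j, joined on a shared user id. All names, fields, and credentials are invented.

    import sqlite3

    from neo4j import GraphDatabase

    class UserDao:
        """Toy cross-store DAO: profile data lives in SQL, the social graph in Neo4j.
        The shared user id is the key that ties the two stores together."""

        def __init__(self):
            self.sql = sqlite3.connect("users.db")
            self.graph = GraphDatabase.driver("bolt://localhost:7687",
                                              auth=("neo4j", "secret"))

        def friends_of(self, user_id):
            # 1) Traverse the graph store to collect the related ids...
            with self.graph.session() as session:
                ids = [record["fid"] for record in session.run(
                    "MATCH (:User {id: $id})-[:FRIEND]->(f:User) RETURN f.id AS fid",
                    id=user_id)]
            if not ids:
                return []
            # 2) ...then hydrate the full records from the relational store.
            marks = ",".join("?" * len(ids))
            return self.sql.execute(
                f"SELECT id, name, email FROM users WHERE id IN ({marks})",
                ids,
            ).fetchall()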
Another term for this is polyglot persistence.
Here are two contrary positions on the question:
Pro:
"Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your usecases. For example file storages, SQL, graph databases, data ware houses, in-memory databases, network caches, NoSQL. Today there are mostly two storages used, files and SQL databases. Both are not optimal for every usecase."
http://codemonkeyism.com/nosql-polyglott-persistence/
Con:
"I don’t think I need to say that I’m a proponent of polyglot persistence. And that I believe in Unix tools philosophy. But while adding more components to your system, you should realize that such a system complexity is “exploding” and so will operational costs grow too (nb: do you remember why Twitter started to into using Cassandra?) . Not to mention that the more components your system has the more attention and care must be invested figuring out critical aspects like overall system availability, latency, throughput, and consistency."
http://nosql.mypopescu.com/post/1529816758/why-redis-and-memcached-cassandra-lucene

NoSQL database and many semi-large blobs

Is there a NoSQL (or other type of) database suitable for storing a large number (i.e. >1 billion) of "medium-sized" blobs (i.e. 20 KB to 2 MB)? All I need is a mapping from A (an identifier) to B (a blob), the ability to retrieve B given A, a consistent external API for access, and the ability to "just add another computer" to scale the system.
Something simpler than a database, e.g. a distributed key-value system, may be just fine, and I'd appreciate any thoughts along that vein as well.
If your API requirements are purely along the lines of "Get(key), Put(key,blob), Remove(key)" then a key-value store (or more accurately a "Persistent distributed hash table") is exactly what you are looking for.
There are quite a few of these available, but without additional information it is hard to make a solid recommendation: What OS are you targeting? Which language(s) are you developing with? What are the I/O characteristics of your app (cold/immutable data such as images? high write loads, a.k.a. tweets?)?
Some of the KV systems worth looking into:
- MemcacheDB
- Berkeley DB
- Voldemort
You may also want to look into document stores such as CouchDB or RavenDB*. Document stores are similar to KV stores, but they understand the persistence format (usually JSON), so they can provide additional services such as indexing.
If you are developing in .NET, then skip directly to RavenDB (you'll thank me later).
What about Jackrabbit?
Apache Jackrabbit™ is a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and 283).
A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
I got to know Jackrabbit when I worked with the Liferay CMS. Liferay uses Jackrabbit to implement its Document Library; it stores user files in the server's file system.
You'll also want to take a look at Riak. Riak is very focused on doing exactly what you're asking for (just add a node; easy to access).
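To show how the Get(key)/Put(key, blob)/Remove(key) contract from the earlier answer maps onto Riak, here is a sketch against Riak's HTTP interface using Python's requests; the bucket and key names are made up and the default port is assumed.

    import requests

    RIAK = "http://localhost:8098"                 # Riak's default HTTP interface

    # Put(key, blob)
    with open("photo.jpg", "rb") as f:
        requests.put(f"{RIAK}/buckets/blobs/keys/photo42",
                     data=f.read(),
                     headers={"Content-Type": "image/jpeg"}).raise_for_status()

    # Get(key)
    blob = requests.get(f"{RIAK}/buckets/blobs/keys/photo42").content

    # Remove(key)
    requests.delete(f"{RIAK}/buckets/blobs/keys/photo42").raise_for_status()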