I am working on an application that requires features offered by both a graph database (to store raw data) and a document database (to store reports extracted from the raw data). I planned to use Neo4j and MongoDB, but I am having second thoughts and am now looking at OrientDB. Is it better to have a single multi-model database than two separate databases? The reason I leaned towards Neo4j is its native graph storage, which might come in handy for memory locality with large graphs. OrientDB doesn't store graphs natively, or does it?
OrientDB stores graphs natively. Its engine is 100% a graph database, like Neo4j. Actually, OrientDB and Neo4j are the only graph databases with index-free adjacency. Some other graph databases act as a layer on top of an existing model (RDBMS, column or document stores).
So there is nothing you can do with Neo4j that you can't do with OrientDB. But OrientDB also lets you model more complex data, as a document DBMS (MongoDB) does. For example, every vertex and edge in OrientDB is a document (JSON), so you can store complex types in vertices and edges, such as embedded properties, lists, sets, dates, decimals, etc.
Don't be dazzled by terminology. "Index-free adjacency" simply means that graph vertices are stored "with" their edges. Each database does this in a slightly different way. Neo4j stores them on disk in a linked list. If they're in memory, and there aren't too many of them, they're fast. If you have to hit them on disk, then you may need an index. Titan stores them as columns in a wide-column database such as Cassandra. If they're in memory, they're fast. If you have to hit them on disk, the underlying database's range queries make them fast to load in bulk, and extra indexing can decrease the cost of searching large edge lists.
This discussion is fairly valuable: How does Titan achieve constant time lookup using HBase / Cassandra?
Whether you're using OrientDB or any other database, your efficiency at graph queries will rely in large part on the indexing you put in place so that you start your graph queries on, and traverse through, a relatively small set of nodes. Be sure to model some of the queries you're doing to make sure that whatever database you choose will support the right indexes, whether they're across the whole graph, or local to each vertex.
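To make the distinction concrete, here is a minimal Python sketch (not how any of the databases above is actually implemented) contrasting edges stored "with" the vertex against a lookup in a global sorted edge table. The toy graph and the names `Vertex`, `build_adjacent` and `neighbours_via_index` are hypothetical, made up for illustration.

```python
from bisect import bisect_left

# Hypothetical toy graph: vertex ids and edges are made up for illustration.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


class Vertex:
    """A vertex that keeps direct references to its neighbours."""

    def __init__(self, vid):
        self.vid = vid
        self.out = []  # outgoing neighbours, stored "with" the vertex


def build_adjacent(edges):
    """Adjacency stored on the vertex: no global index is consulted to traverse."""
    vertices = {}
    for src, dst in edges:
        s = vertices.setdefault(src, Vertex(src))
        d = vertices.setdefault(dst, Vertex(dst))
        s.out.append(d)
    return vertices


def neighbours_via_index(sorted_edges, vid):
    """Index-based alternative: binary-search a global sorted edge table first."""
    i = bisect_left(sorted_edges, (vid, ""))
    result = []
    while i < len(sorted_edges) and sorted_edges[i][0] == vid:
        result.append(sorted_edges[i][1])
        i += 1
    return result


vertices = build_adjacent(EDGES)
direct = [v.vid for v in vertices["a"].out]          # follow stored references
indexed = neighbours_via_index(sorted(EDGES), "a")   # search the edge table, then scan
```

Both approaches return the same neighbours. The difference is that the second one pays for a search in a structure whose size grows with the whole graph before it can start reading edges.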
I have successfully implemented the time series use case as shown in the documentation. The data (Event class) pointed to by the smallest time unit is indexed with a Lucene spatial index.
I have two types of events: private and public.
How should I design my database and clusters for my use case?
Would it be better to use an edge or a linklist to make the link from Min to Event?
I am worried that my Lucene spatial index will get too big in the future.
From reading the documentation, it looks like having clusters for the geolocation data would be a good strategy.
It is possible to use the index only on the subquery:
select from (
    select expand(month["12"].day["25"].hour["17"].min["07"].events)
    from Year
    where year = 2015
)
where [lat,lng,$spatial] NEAR [66,66,{"maxDistance":1}]
The documentation on indexes tells me it is possible to use indexes on edge properties. The downside is that it takes more storage than a linked list. I tested it and it works:
select expand(inV()) from (
    select expand(month["12"].day["25"].hour["17"].min["07"].outE('MinPublicEvent'))
    from Year
    where year = 2015
)
where [lat,lng,$spatial] NEAR [66,66,{"maxDistance":1}]
Regarding edge vs. link: according to the OrientDB documentation on lightweight edges, the first difference is that edges can store properties and links can't.
These are the PROS and CONS of Lightweight Edges vs Regular Edges:
PROS:
faster in creation and traversing, because don't need an additional document to keep the relationships between 2 vertices
CONS:
cannot store properties
harder working with Lightweight edges from SQL, because there is no a regular document under the edge
Since you already mentioned using properties on edges, which makes sense to me because you can use those properties to traverse the graph, you can't use a link to store that relationship.
If you want to embed these properties in the Event vertex instead, that is also fine, and you'd be able to use links, losing the ability to use edge properties when traversing the graph in exchange for improved performance.
The edge approach is more expressive. When performance really matters and there is a risk of a bottleneck, you should monitor the metrics and refactor to the embed + link approach if a performance issue shows up.
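The trade-off between the two approaches can be sketched in plain Python. This is only a data-structure illustration, not OrientDB code: `Event`, `Edge`, `Minute` and the `visibility` property are hypothetical names chosen to mirror the private/public events in the question.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    visibility: str  # property embedded in the vertex (needed by the link approach)


@dataclass
class Edge:
    target: Event
    props: dict = field(default_factory=dict)  # properties stored on the edge itself


@dataclass
class Minute:
    edges: list = field(default_factory=list)  # edge approach: Minute -> Edge -> Event
    links: list = field(default_factory=list)  # link approach: Minute -> Event


def public_via_edges(minute):
    """Filter on edge properties; only matching targets are dereferenced."""
    return [e.target.name for e in minute.edges
            if e.props.get("visibility") == "public"]


def public_via_links(minute):
    """Follow every link and filter on the property embedded in the event."""
    return [ev.name for ev in minute.links if ev.visibility == "public"]


concert = Event("concert", "public")
dinner = Event("dinner", "private")
minute = Minute(
    edges=[Edge(concert, {"visibility": "public"}),
           Edge(dinner, {"visibility": "private"})],
    links=[concert, dinner],
)
```

Both functions return the same events. The edge version carries an extra object per relationship but can decide from the edge alone, while the link version is lighter but has to look inside each event.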
Update:
Clusters are basically a mechanism for splitting data in OrientDB (Clusters tutorial), and they work for both edges and vertices.
You may also find it beneficial to locate different clusters on different servers, physically separating where you store records in your database. The advantages of this include:
Optimization: Faster query execution against clusters, given that you need only search a subset of the clusters in a class.
Indexes: With good partitioning, you can reduce or remove the use of indexes.
Parallel Queries: Queries can be run in parallel when made to data on multiple disks.
Sharding: You can shard large data-sets across multiple instances.
Unless you can clearly identify a good way to partition your data, and can distribute your database across different servers, I suggest you start with the default, since OrientDB already creates one cluster for each class in the schema, and add more clusters as your database grows.
When to add more clusters? Metrics, metrics and metrics. Keep track of how your application accesses your database: what kinds of queries, how many queries, etc.
Does OrientDB support efficient computation of connected components?
I am not experienced with graph databases. My naive intuition is that this operation should be quite efficient.
If it is efficiently supported, what would a query to find all connected components look like?
I had the same issue and ended up writing an OSQL query to compute connected components in a graph; here is my solution.
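Independently of the query language, the computation itself is a plain graph traversal. As a rough Python sketch (not OSQL, and not the query referred to above), connected components can be found with a breadth-first search that visits every vertex and every edge once. The function name `connected_components` and the sample data are illustrative only.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for n in adjacency[v]:
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        components.append(sorted(component))
    return components


sample = connected_components(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
# sample == [["a", "b", "c"], ["d", "e"]]
```

The traversal is linear in the number of vertices and edges, which matches the intuition that the operation should be cheap as long as following an edge is cheap.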
Below is an excerpt from the OrientDB website. I've highlighted a few relevant portions.
OrientDB can embed documents like any other document database, but
also supports relationships. It doesn’t use the costly JOIN. Instead,
OrientDB uses super-fast, persistent pointers between records, taken
from the graph database world. You can traverse parts of or entire
trees and graphs of records in just a few milliseconds.
This illustration shows how the original document has been
split into two documents linked using the Customer’s Record ID #8:124
to connect the Order to the Customer document. Links can be thought of
as in-memory pointers, but persistent on disk.
[snip]
With document and relational DBMS, the more data you
have, the slower the database will be. Joins have a heavy runtime
cost. Instead, OrientDB handles relationships as physical links to the
records, assigned only once, when the edge is created O(1). Compare
this to an RDBMS that “computes“ the relationship every single time
you query a database O(LogN). With OrientDB, traversing speed is not
affected by the database size. It is always constant, whether for one
record or 100 billion records. This is critical in the age of Big
Data!
And here is an example query taken from the tutorial document, which will get all the friends of the person called Luca.
SELECT EXPAND( BOTH( 'Friend' ) ) FROM Person WHERE name = 'Luca'
In this presentation there was a chart that showed the following horizontal scalability ceiling as data gets larger:
key-value > column family > document database > graph database
http://youtu.be/UodTzseLh04?t=13m36s
In other words, as data gets more connected (i.e. complex), the limit to which you can let the database grow gets lower.
Why is data size not as scalable for document databases compared to key-value stores? Have I answered my own question by saying "the more freedom in connecting data, the harder it is to partition data"?
(The "what I'm trying to do" part which everyone usually asks: I have a database with a schema that is MOSTLY tree-like but occasionally has nodes with 2 parents. I used Neo4j in my prototype, but for a production-scale app I'd need to think more about partitioning. I'm going to have to use MongoDB since graph databases cannot easily be partitioned, and it will be harder to write code for my "multiple parents" relationships in MongoDB. So I'm wondering if it's worth going the extra mile and using a key-value store, or at least a column family store.)
For graph databases ... I would consider looking at Titan for scalability. https://github.com/thinkaurelius/titan.
They wrote a good blog post recently about how their database engine stores data for scaling/performance: http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
Titan also can be configured to work hand in hand with Cassandra, so you get the benefit of a columnar database as well.
I think you hit the nail on the head with your understanding of relationships (one entity relating to another) and scalability.
The more "joins" or "connections" you have to manage, the harder it is to scale.
Key/value systems assume you will relate data in your application. There are no concepts of queries, so to scale, you can shard based on the key. Pretty easy and very scalable.
If you read some of the articles about Titan it's easy to see why it's hard to scale something like a graph database.
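The point about sharding on the key can be sketched in a few lines of Python. This is an assumed, minimal hash-partitioning scheme, not the mechanism of any particular key/value product; `shard_for` and `ShardedStore` are hypothetical names, and in-memory dicts stand in for separate servers.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash of the key alone."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key/value store: each dict stands in for one independent server."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # Only one shard is consulted; no other shard needs to be involved.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
store.put("user:1", {"name": "Ann"})
store.put("user:2", {"name": "Bob"})
```

Every operation touches exactly one shard because the key alone decides where the data lives. A graph traversal has no such guarantee: following an edge may land on a vertex held by a different shard, which is where the scaling difficulty comes from.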
I'm getting ready to start a project where I will be building a recommendation engine for restaurants. I have been waffling between Neo4j (graph DB) and MongoDB (document DB). My nodes/documents will be things like restaurant and person. I know I will want some edges, something like person->likes->restaurant or person->ate_at->restaurant. My main query, however, will be to find restaurants within X miles of location Y.
If I have 20 restaurants within X miles of Y that are not connected by any edges, how will Neo4j handle the spatial query? I know that with MongoDB I can index on lat/long and query all restaurant types. Does Neo4j offer the same functionality in a disconnected graph?
When it comes to answering questions like 'which restaurants do my friends eat at most often?', is Neo4j (graph DB) the way to go? Or will MongoDB (document DB) provide similar functionality?
Neo4j Spatial introduces a spatial RTree (or other) index that is part of the graph itself. That means even disconnected domain entities will be found via the spatial search if you index them (that is, relationships will connect the spatial index to the restaurants). It is also flexible enough that you can combine the raw BBox search in the RTree with other things, like a check on the restaurants' categories, in the same go, since you can hop in and out of the different parts of the graph.
This way, Neo4j Spatial supports the full range of search capabilities that you would expect from a full topology, like combined searches, searches on polygons with holes, etc.
Be aware that Neo4j Spatial is at version 0.7, so be gentle and ask on http://groups.google.com/group/neo4j/about :)
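As a rough illustration of combining a bounding-box prefilter with an exact distance check and a category filter in one pass, here is a plain Python sketch. It does not use Neo4j Spatial or MongoDB, and a real spatial index (such as an RTree) would replace the linear scan. The function names, the sample restaurants and the mean Earth radius constant are assumptions for illustration, and the box ignores the poles and the antimeridian.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def bounding_box(lat, lng, radius_miles):
    """Box (min_lat, max_lat, min_lng, max_lng) that contains the search circle."""
    ang = radius_miles / EARTH_RADIUS_MILES  # angular radius in radians
    dlat = math.degrees(ang)
    ratio = math.sin(ang) / max(math.cos(math.radians(lat)), 1e-12)
    dlng = math.degrees(math.asin(ratio)) if ratio < 1 else 180.0
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng


def nearby(restaurants, lat, lng, radius_miles, category=None):
    """Names of restaurants within radius_miles, optionally of one category."""
    min_lat, max_lat, min_lng, max_lng = bounding_box(lat, lng, radius_miles)
    hits = []
    for r in restaurants:
        # Cheap rectangle test first (the part an index would answer).
        if not (min_lat <= r["lat"] <= max_lat and min_lng <= r["lng"] <= max_lng):
            continue
        if category is not None and r["category"] != category:
            continue
        # Exact distance check only for the surviving candidates.
        if haversine_miles(lat, lng, r["lat"], r["lng"]) <= radius_miles:
            hits.append(r["name"])
    return hits


SAMPLE = [
    {"name": "A", "lat": 40.01, "lng": -75.0, "category": "pizza"},
    {"name": "B", "lat": 40.0, "lng": -75.02, "category": "sushi"},
    {"name": "C", "lat": 40.5, "lng": -75.0, "category": "pizza"},
]
close_by = nearby(SAMPLE, 40.0, -75.0, 5)
close_pizza = nearby(SAMPLE, 40.0, -75.0, 5, category="pizza")
```

Note that none of the sample restaurants is connected to any other, which is the situation the question asks about: the spatial lookup depends only on coordinates, not on edges between domain entities.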
I'm not that familiar with Neo4j Spatial, but MongoDB seems at the very least a good fit, since it's the database Foursquare uses for exactly the purpose you describe. MongoDB geo indexing is extremely fast and scales up nicely.
Another possible solution is Couchbase. It uses a document model as well, though you need to be much more comfortable with MapReduce for queries. It has better spatial capabilities than MongoDB right now, but that may change over time.
Suggestions aside, I agree that, of the two choices you have given, Mongo will suit your needs fine and is probably more appropriate for your spatial queries.
Neo4j geospatial doesn't scale up that well. I created a geospatial layer in Neo4j and added nodes to this layer. Beyond 10,000 nodes, adding nodes to the layer becomes very slow, even when using Neo4j 2.0.
On the other hand, MongoDB geolocation is comparatively much faster and more scalable.
What is the difference between a Graph Database (e.g. Neo4J) and a Network Database (e.g. IDS, CODASYL)? In principle are they the same thing?
Network databases like CODASYL are still more or less based on a hierarchical data model, thinking in terms of parent-child (or owner-member in CODASYL terminology) relationships. This also means that in a network database you can't relate arbitrary records to each other, which makes it hard to work with graph-oriented datasets. For example, you may use a graph database to analyze what relationships exist between entities.
Also, network databases use fixed records with a predefined set of fields, while graph databases use the more flexible Property Graph Model, allowing for arbitrary key/value pairs on both nodes/vertices and relationships/edges.
Copying from the book Designing Data-Intensive Applications by Martin Kleppmann.
In the network model, the database had a schema that specified which record type could be nested within which other record type. In a graph database, there is no such restriction: any vertex can have an edge to any other vertex. This gives much greater flexibility for applications to adapt to changing requirements.
In the network model, the only way to reach a particular record was to traverse one of the access paths to it. In a graph database, you can refer directly to any vertex by its unique ID, or you can use an index to find vertices with a particular value.
In the network model, the children of a record were an ordered set, so the database had to maintain that ordering (which had consequences for the storage layout) and applications that inserted new records into the database had to worry about the positions of the new records in these sets. In a graph database, vertices and edges are not ordered (you can only sort the results when making a query).
In the network model, all queries were imperative, difficult to write and easily broken by changes in the schema. In a graph database, you can write your traversal in imperative code if you want to, but most graph databases also support high-level, declarative query languages such as Cypher or SPARQL.
First, let's ask the question correctly. There are TWO types of graph databases: RDF Graph (standard) and Property Graph (non-standard). Neo4j is a Property Graph database, not a "standard" RDF graph.
Then, if you read Sumit Sethia's answer above, you will have the right answer in terms of the relationship between the Network Model and the Graph DB (which, by default, should be understood as an RDF graph).
It helps to think of the relationships as a development timeline, where each step "improves" on the previous one. Then it would be something like the Hierarchical DB first, then the Network Model, then Graph, and then Property Graph. This is not "strict", by the way.