What is the difference between query:
I want to find all data with city name XYZ,
When I use Titan graph query like xyzCityVertex.query().direction(IN).label("hasCity").iterator()
2 .When I directly execute Elastic Search query without titan like cityField:"XYZ"
Yes one is Graph query other is plain ES query. But for the moment forget that and imagine i have set of data stored using Titan and i want to run both query for the same purpose.
Are both same internally and performance wise?
It is hard to "forget" that one is a vertex query and the other is a ES query. That is what makes them so markedly different. I would say they also represent possibly different use cases. You typically use a "graph query" or "ES query" for the same thing to find a vertex or set of vertices to start a traversal from. A "graph query" will likely be faster than an ES query but a graph query should have high selectivity in that it typically should not return "lots of things." An ES query has less restriction that way and is better suited to return lots of things.
In your example, you show a "vertex query" which means that you've already found your vertex of interest and want to traverse from that. Traversals over incident edges (such as your example shows) should be very fast and I would assume that it would be faster than a similar lookup of the same adjacent vertices with "graph query" or "ES query".
In other words:
g.V('city','XYZ').in('hasCity').cityField
would be faster than a graph query where cityField was indexed to return lots of vertices:
g.V('cityField','XYZ')
and would be faster than the equivalent ES query over cityField
Related
I'm working on a querying tool that would allow to make complex relational queries along to fulltext search on quite big datasets. My data is stored in a Postgres database with a elasticsearch engine for convenient and efficient fulltext search. However, some queries require complex join, cardinality tests, filters on joined data ...
My dilemma is that I cannot use only ElasticSearch or only Postgres. I need both to answer specific needs. But combining them seems to be a really difficult task.
My approach
... was to perform the ES query first and then use the results' id as a filter for the SQL query. The problem is that ES has a max_result_window that prevents to get all the matching data at once. Even worse, ES may return the first 10K results matching the fulltext search, but the subsequent SQL query may narrow those results number to something ridiculously small, while there's actually much more matching items but the 10K limit acted as bottleneck.
Taking the other way around is no better since if we use the result of the SQL query as the document ids as filters for the ElasticSearch query, the max_clause_count limit will be easily reached and the ES query wouldn't be able to be performed on more than (default) 1024 items.
Maybe my logic isn't good. Is there any other approach to combine both ES and Postgres queries simultaneously ? Thanks.
Does OrientDB's support efficient computations of connected components?
I am not experienced with graph databases. My naiive intuition is that this operation should be quite efficient.
If it is efficiently supported, how would a query look like to find all connected components?
I had your same issue but I finally ended up writing an OSQL query to compute connected components in a graph, here is my solution
Below is an excerpt from the OrientDB website. I've highlighted a few relevant portions.
OrientDB can embed documents like any other document database, but
also supports relationships. It doesn’t use the costly JOIN. Instead,
OrientDB uses super-fast, persistent pointers between records, taken
from the graph database world. You can traverse parts of or entire
trees and graphs of records in just a few milliseconds.
This illustration shows how the original document has been
split into two documents linked using the Customer’s Record ID #8:124
to connect the Order to the Customer document. Links can be thought of
as in-memory pointers, but persistent on disk.
[snip]
Equipped With document and relational DBMS, the more data you
have, the slower the database will be. Joins have a heavy runtime
cost. Instead, OrientDB handles relationships as physical links to the
records, assigned only once, when the edge is created O(1). Compare
this to an RDBMS that “computes“ the relationship every single time
you query a database O(LogN). With OrientDB, traversing speed is not
affected by the database size. It is always constant, whether for one
record or 100 billion records. This is critical in the age of Big
Data!
And here is an example query taken from the tutorial document, which will get all the friends of the person called Luca.
SELECT EXPAND( BOTH( 'Friend' ) ) FROM Person WHERE name = 'Luca'
I'm using Titan with Cassandra and have several (related) questions about querying the database with Gremlin:
1.) Is there an faster way to count all vertices than
g.V.count()
Titan claims to use an index. But how can I use an index without property?
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [<>]. For better performance, use indexes
2.) Is there an faster way to count all vertices with property 'myProperty' than
g.V.has('myProperty').count()
Again Titan means following:
WARN c.t.t.g.transaction.StandardTitanTx - Query requires iterating over all vertices [(myProperty<> null)]. For better performance, use indexes
But again, how can I do this? I already have an index of 'myProperty', but it needs a value to query fast.
3.) And the same questions with edges...
Iterating all vertices with g.V.count() is the only way to get the count. It can't be done "faster". If your graph is so large that it takes hours to get an answer or your query just never returns at all, you should consider using Faunus. However, even with Faunus you can expect to wait for your answer (such is the nature of Hadoop...no sub-second response here), but at least you will get one.
Any time you do a table scan (i.e. iterate all Vertices) you get that warning of "iterating over all vertices". Generally speaking, you don't want to do that, as you will never get a response. Adding an index won't help you count all vertices any faster.
Edges have the same answer. Use g.E.count() in Gremlin if you can. If it takes too long, then try Faunus so you can at least get an answer.
doing a count is expensive in big distributed graph databases. You can have a node that keeps track of many of the databases frequent aggregate numbers and update it from a cron job so you have it handy. Usually if you have millions of vertices having the count from the previous hour is not such disaster.
I'm getting ready to start a project where I will be building a recommendation engine for restaurants. I have been waffling between neo4j (graph db) and mongodb (document db). my nodes/documents will be things like restaurant and person. i know i will want some edges, something like person->likes->restaurant, or person->ate_at->restaurant. my main query, however, will be to find restaurants within X miles of location Y.
if i have 20 restaurant's within X miles of Y, but not connected by any edges, how will neo4j be able to handle the spatial query? i know with mongodb i can index on lat/long and query all restaurant types. does neo4j offer the same functionality in a disconnected graph?
when it comes to answering questions like, 'which restaurants do my friends eat at most often?', is neo4j (graph db) the way to go? or will mongodb (document db) provide me similar functionality?
Neo4j Spatial introduces a Spatial RTree (or other means) index that is part of the graph itself. That means, even disconnected domain entities will be found via the spatial search, if you index them (that is relationships will connect the Spatial index to the Restaurants). Also, this is flexible enough that you can combine the Raw BBox search in the RTree with other things like check on the restaurants categories in the same go, since you can hop out and in the different parts of the graph.
This way, neo4j Spatial is supporting the full range of search capabilities that you would expect form a full Topology, like combined searches and searches on polygons with holes etc.
Be aware that Neo4j Spatial is in 0.7, so be gentle and ask on http://groups.google.com/group/neo4j/about :)
I'm not that familiar with Neo4J Spatial but it would seem that MongoDB is at the very least a good fit since it's the database Foursquare uses with exactly the purpose you describe. MongoDB geo indexing is extremely fast and scales up nicely.
Another possible solution is to use CouchBase. It uses a document model as well - though you need to be much more comfortable with MapReduce for queries. It has better spatial capabilities right now thank MongoDB but that may change over time.
Suggestion aside, I agree that of the two choices you have given Mongo will suit your needs fine and probably more appropriate for your spatial queries.
Neo4j geospatial doesn't scale up that good. I created a geospatial layer in neo4j and added nodes to this layer. Beyond 10,000 nodes the addition of nodes to the layer becomes very slow even when using neo4j2.0
On the other hand, mongodb geo-location works comparatively much faster and is more scalable.
I have an undirected graph where each node contains an array. Data can be added/deleted from the array. What's the best way to store this in Mongodb and be able to do this query effectively: given node A, select all the data contained in the adjacent nodes of A.
In relational DB, you can create a table representing the edges and another table for storing the data in each node this so.
table 1
NodeA, NodeB
NodeA, NodeC
table 2
NodeA, item1
NodeA, item2
NodeB, item3
And then you join the tables when you query for the data in adjacent nodes. But join is not possible in MongoDB, so what's the best way to setup this database and efficiently query for data in adjacent nodes (favoring performance slightly over space).
Specialized Distributed Graph Databases
I know this is sounds a little far afield from the OPs question about Mongo, but these days there are more specialized graph databases that excel at this kind of work and may be much easier for you to use, especially on large graphs.
There is a comparison of 7 such offerings here: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0
Of the three most significant open source offerings (Titan, OrientDB, and Neo4J), all of them support the Tinkerpop Blueprints interface. So for a graph that looks like this...
... a query for "all the people that Juno greatly admires who she has known since the year 2011" would look like this:
Iterable<Vertex> results = juno.query().labels("knows").has("since",2011).has("stars",5).vertices()
This, of course, is just the tip of the iceberg. Pretty powerful stuff!
If you have to stay with Mongo
Think of Tinkerpop Blueprints as the "JDBC of storing graph structures" in various databases. The Tinkerpop Blueprints API has a specific MongoDB implementation that would work for you I'm sure. Then using Tinkerpop Gremlin, you have all sorts of advanced traversal and search methods at your disposal.
I'm picking up mongo, looking into this sort of schema as well (undirected graphs, querying for information from neighbors) I think the way that I favor so far looks something like this:
Each node contains an array of neighbor keys, like so.
{
nodeIndex: 4
myData: "data"
neighbors: [8,15,16,23,42]
}
To find data from neighbors, use the $in "operator":
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}});
You can use field selection to limit results to the relevant data.
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}}, {myData:1});
See http://www.mongodb.org/display/DOCS/Trees+in+MongoDB for inspiration.
MongoDB will introduce native graph capabilities in version 3.4 and it could be used to store graph stuctures and do analytics on them although performance might not be that good compared to native graph databases like Neo4j depending on the cases but it is too early to judge.
Check those links for more information:
$graphLookup (aggregation)
MongoDB 3.4 Accelerates Digital Transformation for the Modern Enterprise
MongoDB can simulate a graph using a flexible tree hierarchy. You may want to consider neo4j for strict graphing needs.