It is mentioned as a feature of Apache AGE that it achieves both fast indexing and efficient query processing. Can anyone explain how Apache AGE achieves this?
Apache AGE stores data as nodes and relationships rather than as tables or documents.
Information is stored much the way you might sketch ideas on a whiteboard: your data is not confined to a pre-defined model, which allows a very flexible way of thinking about and using it.
This flexibility is the main reason it can achieve fast indexing and efficient querying.
Apache AGE achieves both fast indexing and efficient query processing.
Apache AGE is a graph database that uses nodes to represent entities and edges to represent relationships. A relational database stores data in tables with rows and columns and relies on JOINs to connect related data at query time.
Apache AGE stays quick even for large datasets, while relational databases tend to slow down as JOINs multiply.
Graph databases such as Apache AGE typically use index-free adjacency: each node holds direct references to its adjacent nodes, so traversals don't require index lookups, whereas relational databases follow indexed pointers to connect related data.
Graph databases are also more scalable.
You can model your data however you want with a graph database, so you're not limited to the rigid structures of a relational database.
Graph databases can more accurately capture an intricate web of relationships by representing data as a set of interconnected nodes.
That is how AGE achieves fast indexing and query processing.
Apache AGE achieves fast indexing through a combination of techniques, including:
Native graph storage: AGE stores graph data natively as edges and vertices in a PostgreSQL database. This allows for efficient indexing and querying of graph data.
Multi-level indexing: AGE uses multi-level indexing techniques to speed up graph queries. This includes both indexing of nodes and edges, as well as indexing of graph properties.
Graph query optimization: AGE optimizes graph queries to make use of the underlying multi-level index structures. This includes techniques such as early termination, path expansion, and filtering.
Parallel query processing: AGE supports parallel query processing, which allows for faster query response times on large graphs.
Overall, Apache AGE uses a combination of graph-specific indexing and query optimization techniques to achieve fast indexing and querying of graph data. It also leverages the power of PostgreSQL as a reliable and scalable relational database management system to support graph data and operations.
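As a minimal sketch of how this looks in practice, assuming a graph named demo with a Person label (the GIN index on the label table's properties column follows the approach suggested in the AGE docs):

-- load the extension and create a graph
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;
SELECT create_graph('demo');

-- create and match vertices with openCypher embedded in SQL
SELECT * FROM cypher('demo', $$ CREATE (:Person {name: 'Ann'}) $$) AS (v agtype);
SELECT * FROM cypher('demo', $$ MATCH (p:Person {name: 'Ann'}) RETURN p $$) AS (p agtype);

-- add a GIN index over the Person label's properties to speed up property filters
CREATE INDEX ON demo."Person" USING gin (properties);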
I have successfully implemented the Time Series use case as shown in the documentation. The data (Event class) pointed to by the smallest time unit is indexed with a Lucene spatial index.
I have two types of events : private or public.
How should I design my database and clusters for my use case ?
Would it be better to use an Edge or a linklist to make the link from Min to Event?
I am worried that my lucene spatial index will get too big in the future.
From reading the documentation, it looks like having clusters for the geolocation data would be a good strategy.
It is possible to use the index only on the subquery:
SELECT FROM (
    SELECT expand(month["12"].day["25"].hour["17"].min["07"].events)
    FROM Year
    WHERE year = 2015
) WHERE [lat,lng,$spatial] NEAR [66,66,{"maxDistance":1}]
The documentation on indexes tells me it is possible to use indexes on edge properties. The downside is that this takes more storage than a linklist. I tested it and it works:
SELECT expand(inV()) FROM (
    SELECT expand(month["12"].day["25"].hour["17"].min["07"].outE('MinPublicEvent'))
    FROM Year
    WHERE year = 2015
) WHERE [lat,lng,$spatial] NEAR [66,66,{"maxDistance":1}]
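For reference, creating the Lucene spatial index on the edge class would look something like this (a sketch assuming lat/lng properties on the MinPublicEvent edge class, using OrientDB's standard Lucene index syntax):

CREATE INDEX MinPublicEvent.lat_lng ON MinPublicEvent(lat,lng) SPATIAL ENGINE LUCENE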
Regarding edge vs. link, taken from the OrientDB docs on lightweight edges: the first difference is that edges can store properties and links can't.
These are the PROS and CONS of Lightweight Edges vs. Regular Edges:
PROS:
faster to create and traverse, because they don't need an additional document to keep the relationship between the two vertices
CONS:
cannot store properties
harder to work with from SQL, because there is no regular document under the edge
Since you already mentioned using properties on edges, which makes sense to me because you can use those edge properties to traverse the graph, this means you can't use a link to store that relationship.
If you instead want to embed these properties on the Event vertex, that is also fine, and you'd be able to use links, losing the ability to use edge properties during traversal in exchange for better performance.
The edge approach is more expressive, but when performance really matters and there is a risk of a bottleneck, monitor your metrics and refactor to the embed + link approach if performance becomes an issue.
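As a rough sketch of the two options in OrientDB SQL (the record IDs #13:1 and #14:7 are hypothetical placeholders for a Min vertex and an Event vertex):

-- option 1: a regular edge carrying its own properties
CREATE EDGE MinPublicEvent FROM #13:1 TO #14:7 SET visibility = 'public'

-- option 2: embed the property on the Event vertex and keep only a link
UPDATE #14:7 SET visibility = 'public'
UPDATE #13:1 ADD events = #14:7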
Update:
Clusters are basically a mechanism to split data in OrientDB (see the Clusters tutorial), and they work for both edges and vertices.
You may also find it beneficial to locate different clusters on different servers, physically separating where you store records in your database. The advantages of this include:
Optimization: faster query execution against clusters, given that you need only search a subset of the clusters in a class.
Indexes: with good partitioning, you can reduce or remove the use of indexes.
Parallel Queries: queries can be run in parallel when made to data on multiple disks.
Sharding: you can shard large data-sets across multiple instances.
Unless you can clearly identify a good way to partition your data and can distribute your database between different servers, I suggest you start with the default, since OrientDB already creates one cluster for each class in the schema, and add more clusters as your database grows.
When to add more clusters? Metrics, metrics, and metrics. Keep track of how your application accesses your database: what kind of queries, the volume of queries, etc.
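For illustration, adding and targeting a custom cluster looks roughly like this (the cluster and class names here are made up):

-- add a second cluster to the Event class and write into it explicitly
ALTER CLASS Event ADDCLUSTER event_archive
CREATE VERTEX Event CLUSTER event_archive SET name = 'NYE party'

-- query only that cluster instead of the whole class
SELECT FROM cluster:event_archive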
I am working on an application which requires features offered by both a graph database (to store raw data) and a document database (extracted reports from raw data). I planned to use Neo4j and MongoDB. I am having second thoughts and am looking at OrientDB. Is it better to have a single multi-model database than two separate databases? The reason I leaned towards Neo4j is its native graph storage, which might come in handy for memory locality on large graphs. OrientDB doesn't store graphs natively, or does it?
OrientDB stores graphs natively. Its engine is 100% a graph database, like Neo4j. Actually, OrientDB and Neo4j are the only graph databases with index-free adjacency; some other graph databases act as a layer on top of an existing model (RDBMS, column or document stores).
So there is nothing you can do with Neo4j that you can't do with OrientDB. But OrientDB also lets you model more complex data, as a document DBMS (MongoDB) can. For example, every vertex and edge in OrientDB is a document (JSON), so you can store complex types in a vertex or edge: embedded properties, lists, sets, dates, decimals, etc.
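A small sketch of what that means in practice (the class and fields here are made up for illustration):

CREATE VERTEX Person CONTENT {"name": "Ann", "address": {"city": "Rome", "zip": "00100"}, "tags": ["customer", "vip"]}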
Don't be dazzled by terminology. "Index-free adjacency" is a term that simply means graph vertices are stored "with" their edges. Each database does this in a slightly different way. Neo4J stores them on disk in a linked list. If you have them in memory, and there's not too many of them, they're fast. If you have to hit them on disk, then you may need an index. Titan stores them as columns in a wide-column database such as Cassandra. If they're in memory, they're fast. If you have to hit them on disk, the underlying database's range queries make them fast to load in bulk, and extra indexing can decrease the cost of searching large edge lists.
This discussion is fairly valuable: How does Titan achieve constant time lookup using HBase / Cassandra?
Whether you're using OrientDB or any other database, your efficiency at graph queries will rely in large part on the indexing you put in place so that you start your graph queries on, and traverse through, a relatively small set of nodes. Be sure to model some of the queries you're doing to make sure that whatever database you choose will support the right indexes, whether they're across the whole graph, or local to each vertex.
Does OrientDB support efficient computation of connected components?
I am not experienced with graph databases. My naive intuition is that this operation should be quite efficient.
If it is efficiently supported, what would a query to find all connected components look like?
I had the same issue, but I finally ended up writing an OSQL query to compute connected components in a graph; here is my solution.
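The core building block is OrientDB's TRAVERSE: starting from any vertex and following edges in both directions visits exactly that vertex's connected component. A minimal sketch of that building block, with #12:0 as a hypothetical starting record ID:

SELECT FROM (TRAVERSE both() FROM #12:0)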
Below is an excerpt from the OrientDB website. I've highlighted a few relevant portions.
OrientDB can embed documents like any other document database, but also supports relationships. It doesn't use the costly JOIN. Instead, OrientDB uses super-fast, persistent pointers between records, taken from the graph database world. You can traverse parts of or entire trees and graphs of records in just a few milliseconds.
This illustration shows how the original document has been split into two documents linked using the Customer's Record ID #8:124 to connect the Order to the Customer document. Links can be thought of as in-memory pointers, but persistent on disk.
[snip]
With a document or relational DBMS, the more data you have, the slower the database will be. Joins have a heavy runtime cost. Instead, OrientDB handles relationships as physical links to the records, assigned only once, when the edge is created: O(1). Compare this to an RDBMS that "computes" the relationship every single time you query a database: O(log N). With OrientDB, traversing speed is not affected by the database size. It is always constant, whether for one record or 100 billion records. This is critical in the age of Big Data!
And here is an example query taken from the tutorial document, which will get all the friends of the person called Luca.
SELECT EXPAND( BOTH( 'Friend' ) ) FROM Person WHERE name = 'Luca'
I have a relational database with about 300M customers and their attributes from several perspectives (a 360° view).
To perform some analytics, I intend to make an extract to a MongoDB instance in order to have a 'flat' representation that is better suited to data mining techniques.
Would that make sense? Why?
Thanks!
No.
It's not storage that would be the concern here; it's your flattening strategy.
How and where you store the flattened data is a secondary concern. Note that MongoDB is a document database and not inherently flat anyway.
Once you have your data in a shape that suits your analytics, then look at storage strategies. MongoDB might be suitable, or you might find that something with easy map-reduce-style functionality would be better for analysis (HBase, for example).
It may make sense. One thing you can do is set up MongoDB in a horizontal scale-out configuration. Then, with the right data structures, you can run queries in parallel across the shards (which it can do for you automatically):
http://www.mongodb.org/display/DOCS/Sharding
This could make real-time analysis possible when it otherwise wouldn't have been.
If you choose your data models right, you can speed up your queries by avoiding any sorts of joins (again good across horizontal scale).
Finally, there is plenty you can do with map/reduce on your data too.
http://www.mongodb.org/display/DOCS/MapReduce
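As a minimal sketch of the map/reduce idea in the mongo shell (the events collection and its type field are hypothetical):

// count documents per type across the (possibly sharded) collection
var map = function () { emit(this.type, 1); };
var reduce = function (key, values) { return Array.sum(values); };
db.events.mapReduce(map, reduce, { out: "event_counts" });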
One caveat to be aware of is there is nothing like SQL Reporting Services for MongoDB AFAIK.
I find MongoDB's mapreduce to be slow (however they are working on improving it, see here: http://www.dbms2.com/2011/04/04/the-mongodb-story/ ).
Maybe you can use Infobright's community edition for analytics? See here: http://www.infobright.com/Community/
A relational DB like PostgreSQL can do analytics too (AFAIK MySQL can't do a hash join, but other relational DBs can).
I have an undirected graph where each node contains an array. Data can be added/deleted from the array. What's the best way to store this in Mongodb and be able to do this query effectively: given node A, select all the data contained in the adjacent nodes of A.
In a relational DB, you could create a table representing the edges and another table storing the data in each node, like so:
table 1 (edges):
NodeA, NodeB
NodeA, NodeC
table 2 (node data):
NodeA, item1
NodeA, item2
NodeB, item3
And then you join the tables when you query for the data in adjacent nodes. But a join is not possible in MongoDB, so what's the best way to set up this database and efficiently query for data in adjacent nodes (favoring performance slightly over space)?
Specialized Distributed Graph Databases
I know this sounds a little far afield from the OP's question about Mongo, but these days there are more specialized graph databases that excel at this kind of work and may be much easier for you to use, especially on large graphs.
There is a comparison of 7 such offerings here: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0
Of the three most significant open source offerings (Titan, OrientDB, and Neo4J), all of them support the Tinkerpop Blueprints interface. So for a graph that looks like this...
... a query for "all the people that Juno greatly admires who she has known since the year 2011" would look like this:
Iterable<Vertex> results = juno.query().labels("knows").has("since",2011).has("stars",5).vertices()
This, of course, is just the tip of the iceberg. Pretty powerful stuff!
If you have to stay with Mongo
Think of Tinkerpop Blueprints as the "JDBC of storing graph structures" in various databases. The Tinkerpop Blueprints API has a specific MongoDB implementation that would work for you I'm sure. Then using Tinkerpop Gremlin, you have all sorts of advanced traversal and search methods at your disposal.
I'm picking up Mongo and looking into this sort of schema as well (undirected graphs, querying for information from neighbors). The approach I favor so far looks something like this:
Each node contains an array of neighbor keys, like so.
{
    nodeIndex: 4,
    myData: "data",
    neighbors: [8, 15, 16, 23, 42]
}
To find data from neighbors, use the $in operator:
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}});
You can use field selection to limit results to the relevant data.
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}}, {myData:1});
See http://www.mongodb.org/display/DOCS/Trees+in+MongoDB for inspiration.
MongoDB will introduce native graph capabilities in version 3.4, so it could be used to store graph structures and run analytics on them, although performance might not match native graph databases like Neo4j depending on the use case; it is too early to judge.
Check those links for more information:
$graphLookup (aggregation)
MongoDB 3.4 Accelerates Digital Transformation for the Modern Enterprise
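As a sketch of what the new stage enables, reusing the nodes collection and the nodeIndex/neighbors fields from the answer above (maxDepth: 0 restricts the lookup to direct neighbors):

db.nodes.aggregate([
    { $match: { nodeIndex: 4 } },
    { $graphLookup: {
        from: "nodes",
        startWith: "$neighbors",
        connectFromField: "neighbors",
        connectToField: "nodeIndex",
        as: "adjacentNodes",
        maxDepth: 0
    } }
])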
MongoDB can simulate a graph using a flexible tree hierarchy. You may want to consider neo4j for strict graphing needs.