Are there any open-source graph databases around that can store binary data, scale horizontally, and optionally provide versioning of stored data?
I am overwhelmed by the sheer number of databases out there, but none of them seems to offer all the desired features.
Look at OrientDB: open source (Apache 2 license) and very fast. It supports SQL as well as the Gremlin graph language.
The binary storage, horizontal scaling, and versioning requirements all sound like good candidates for a BigTable-style store like Cassandra or HBase. If you really need a graph database, however, those may not be a good fit. If you can expand a bit more on what the requirements are, we could make a better suggestion.
http://en.wikipedia.org/wiki/NoSQL
For example:
InfiniteGraph - a high-performance, scalable, distributed graph database.
For horizontal scaling, look at Titan (which uses Cassandra underneath): see the Titan homepage and the Titan presentation video.
For versioning your graph (if that's what you really need), you could try using Antiquity on top of a graph store.
From the Titan site:
Titan is a highly scalable graph database optimized for storing and querying massive-scale graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals.
In addition, Titan provides the following features:
Elastic and linear scalability for a growing data and user base.
Data distribution and replication for performance and fault tolerance.
Multi-datacenter high availability and hot backups.
Support for ACID and eventual consistency.
Support for various storage backends:
Apache Cassandra
Apache HBase
Oracle BerkeleyDB
Support for geo, numeric range, and full-text search via:
ElasticSearch
Apache Lucene
Native integration with the TinkerPop graph stack:
Gremlin graph query language
Frames object-to-graph mapper
Rexster graph server
Blueprints standard graph API
Open source with the liberal Apache 2 license.
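To give a feel for the API, here is a minimal sketch of opening a Titan graph backed by Cassandra and writing a tiny graph through the Blueprints interface. This assumes the Titan 0.x / Blueprints 2.x line; the properties file path and the property values are illustrative, not taken from any particular setup.

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

public class TitanSketch {
    public static void main(String[] args) {
        // Open a Titan graph backed by Cassandra; the config file path is illustrative.
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");

        // Create two vertices and an edge between them (Blueprints 2.x style).
        Vertex alice = graph.addVertex(null);
        alice.setProperty("name", "alice");
        Vertex bob = graph.addVertex(null);
        bob.setProperty("name", "bob");
        Edge knows = graph.addEdge(null, alice, bob, "knows");

        // Titan is transactional: commit to make the changes durable.
        graph.commit();
        graph.shutdown();
    }
}
```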
I'm trying to understand the key differences between MongoDB and Hadoop.
I understand that MongoDB is a database, while Hadoop is an ecosystem that contains HDFS. There are some similarities in the way data is processed using either technology, but major differences as well.
I'm confused as to why someone would use MongoDB over a Hadoop cluster; mainly, what advantages does MongoDB offer over Hadoop? Both perform parallel processing, and both can be used with Spark for further data analytics, so what is the value-add of one over the other?
Now, if you were to combine both, why would you want to store data in MongoDB as well as HDFS? MongoDB has map/reduce, so why would you want to send data to Hadoop for processing? And again, both are compatible with Spark.
First, let's look at what we're talking about:
Hadoop - an ecosystem. Its two main components are HDFS and MapReduce.
MongoDB - a document-oriented NoSQL database.
Let's compare them on two types of workloads:
High latency, high throughput (batch processing) - dealing with the question of how to process and analyze large volumes of data. The processing is done in a parallel, distributed way in order to produce results as efficiently as possible. Hadoop is the best way to deal with such a problem, managing and processing data in a distributed and parallel way across several servers.
Low latency, low throughput (immediate access to data, real-time results, many concurrent users) - when you need to show immediate results as quickly as possible, or run small parallel computations that deliver near-real-time (NRT) results to several concurrent users, a NoSQL database is the best way to go.
A simple example in a stack would be to use Hadoop to process and analyze massive amounts of data, then store your end results in MongoDB (see the sketch after this list) in order for you to:
Access them in the quickest way possible
Reprocess them now that they are on a smaller scale
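As a hedged illustration of that last step, here is a minimal sketch that stores a hypothetical aggregate produced by a batch job into MongoDB and reads it back for a low-latency lookup, using the MongoDB Java driver. The database, collection, and field names are made up.

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class StoreResults {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (host/port are illustrative).
        MongoClient client = new MongoClient("localhost", 27017);
        MongoCollection<Document> results =
            client.getDatabase("analytics").getCollection("daily_aggregates");

        // Store an aggregate that a Hadoop batch job (hypothetically) produced.
        results.insertOne(new Document("userId", 42L)
            .append("pageViews", 1337L)
            .append("day", "2015-06-01"));

        // Serve the low-latency lookup directly from MongoDB.
        Document doc = results.find(new Document("userId", 42L)).first();
        System.out.println(doc.toJson());
        client.close();
    }
}
```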
The bottom line is that you shouldn't look at Hadoop and MongoDB as competitors: each has its own best use cases and approach to data, and they complement each other in your work with data.
Hope this makes sense.
Firstly, we should know what these two terms mean.
HADOOP
Hadoop is an open-source tool for Big Data analytics developed by the Apache foundation. It is the most widely used tool for both storing and analyzing Big Data, using a clustered architecture for both tasks.
Hadoop has a vast ecosystem, and this ecosystem comprises some robust tools.
MongoDB
MongoDB is an open-source, general-purpose, document-based, distributed NoSQL database built for storing Big Data. MongoDB has a very rich query language, which results in high performance. Being document-based means it stores data as JSON-like documents, for example: { "name": "Alice", "friends": ["Bob", "Carol"] }.
DIFFERENCES
Both these tools are good enough for harnessing Big Data; it depends on your requirements. For some projects Hadoop is a good option, while for others MongoDB fits well.
Hope this helps you to distinguish between the two.
I need some help to confirm my choice... and would appreciate any information you can give me.
My storage database is TitanDB with Cassandra.
I have a very large graph. My goal is to use MLlib on the graph later.
My first idea: use Titan with GraphX, but I did not find anything beyond work in progress... TinkerPop is not ready yet.
So I had a look at Giraph. Titan can communicate with Rexster from TinkerPop.
My question is:
What is the benefit of using Giraph? Gremlin seems to do the same thing and is distributed.
Thank you very much for explaining. I don't think I really understand the difference between Gremlin and Giraph (or GraphX).
Have a nice day.
Gremlin is a graph traversal language, while Giraph and GraphX are graph processing systems.
I believe you're asking about the difference between GraphX or Giraph on one hand and Titan on the other. To be more specific: why should you use a graph processing system when you already have your data in a graph database?
So it is essentially the difference between a graph database and a graph processing system.
A graph database is your friend when your application requires frequently querying the data. E.g., for a Facebook-like application: given a user, return all of his/her friends. This is suitable for a graph database, and you can use Gremlin to query it (see the sketch below).
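For instance, the friends and friend-of-friend lookups could look like this with the TinkerPop 2.x GremlinPipeline Java API. This is a sketch under the assumption that vertices are connected by edges labeled "friend" and carry a "name" property; neither the label nor the property comes from the original question.

```java
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.gremlin.java.GremlinPipeline;

public class FriendQueries {
    // Return the names of a user's direct friends.
    static Iterable<Object> friendNames(Vertex user) {
        return new GremlinPipeline<Vertex, Object>(user)
            .out("friend")       // follow outgoing "friend" edges
            .property("name");   // project the "name" property
    }

    // Return distinct friends-of-friends.
    static Iterable<Vertex> friendsOfFriends(Vertex user) {
        return new GremlinPipeline<Vertex, Vertex>(user)
            .out("friend")
            .out("friend")
            .dedup();            // remove duplicate vertices
    }
}
```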
Now, if you want to compute the rank of each user on Facebook, you need to run the PageRank algorithm over the whole graph. In other words, PageRank processes your whole graph and returns you a map from vertex to rank. This is a suitable application for a graph processing system. Yes, you can write queries using the Gremlin framework to do this, but 1. it won't be as user-friendly as the underlying Pregel model used by Giraph or GraphX, and 2. it won't be efficient. A sketch of what this looks like in Giraph follows.
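To make the contrast concrete, here is a minimal PageRank computation in Giraph's Pregel-style API, closely following the SimplePageRankComputation example that ships with Giraph. The 30-superstep cutoff and the 0.85 damping factor are the usual illustrative choices, not requirements.

```java
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each superstep, a vertex sums the rank mass sent by its neighbors, updates
// its own rank, and sends its new rank divided over its outgoing edges.
public class PageRankComputation extends BasicComputation<
        LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

    private static final int MAX_SUPERSTEPS = 30; // illustrative cutoff

    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                        Iterable<DoubleWritable> messages) {
        if (getSuperstep() >= 1) {
            double sum = 0;
            for (DoubleWritable message : messages) {
                sum += message.get();
            }
            // Standard damping factor of 0.85.
            vertex.setValue(new DoubleWritable(
                0.15 / getTotalNumVertices() + 0.85 * sum));
        }
        if (getSuperstep() < MAX_SUPERSTEPS) {
            sendMessageToAllEdges(vertex,
                new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
        } else {
            vertex.voteToHalt();
        }
    }
}
```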
To summarize, it really depends on your application. If your application is query-like, don't bother loading and unloading into any graph processing system. If your application is more like PageRank (which requires processing the whole graph) and you have a large graph (at least 1M edges), go for Giraph or GraphX.
Giraph and GraphX have their own graph input formats. You can dump your data into that form in a file and feed it into one of these systems, or you can write your own input format.
P.S. It would be good to have an input format added to Giraph/GraphX that accepts data stored in Titan.
Interesting question. I am on the same track.
First, your question about MLlib. I assume you mean Apache Spark MLlib, the machine learning (ML) implementation on top of Apache Spark. So my conclusion is: you want to run ML algorithms, for purposes such as clustering and classification, on the data in your Titan/Cassandra-based graph database.
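For example, k-means clustering with MLlib could look roughly like the sketch below (Spark's Java API). It assumes you have already exported numeric feature vectors from your Titan/Cassandra graph to a text file; the path, the vector format, and the value of k are all made up.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class GraphClustering {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("GraphClustering"));

        // Hypothetical input: one whitespace-separated feature vector per line,
        // previously exported from the graph database.
        JavaRDD<Vector> features = sc.textFile("hdfs:///export/vertex-features.txt")
            .map(line -> {
                String[] parts = line.trim().split("\\s+");
                double[] values = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    values[i] = Double.parseDouble(parts[i]);
                }
                return Vectors.dense(values);
            });
        features.cache();

        // Train k-means with k = 10 clusters and 20 iterations (illustrative values).
        KMeansModel model = KMeans.train(features.rdd(), 10, 20);
        System.out.println("cluster centers: " + model.clusterCenters().length);
        sc.stop();
    }
}
```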
Please note that you could also use graph processing algorithms, like the PageRank mentioned by spidy, to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.
Apache Spark MLlib seems to be future-proof and widely supported; its most recent announcements were about new ML algorithms, although Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Apache Mahout has also adopted Apache Spark as an execution backend, which is why I mention it in this post.
Apache Spark offers, in addition to in-memory computing, the mentioned MLlib for machine learning, Spark SQL (which is like Hive on Spark), GraphX (which is a graph processing system, as explained by spidy), and Spark Streaming for processing streaming data.
I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets) on top of storage layers such as Cassandra, Hadoop/HCatalog, and HBase. Apache Spark offers a connector to Cassandra. Note that RDDs are immutable: you cannot alter data using Spark, you can only process and analyze it.
Regarding the Apache Spark logical storage layer: you could compare an RDD to a view in the good old SQL times; RDDs give you a view on, for example, a table in Cassandra or HBase (a sketch follows below). Note also that Apache Spark offers APIs for three development environments: Scala, Java, and Python.
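As a hedged sketch of that "RDD as a view" idea, here is what reading a Cassandra table into Spark looks like with the DataStax spark-cassandra-connector's Java API. The keyspace and table names are hypothetical.

```java
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class CassandraView {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("CassandraView")
            .set("spark.cassandra.connection.host", "127.0.0.1"); // illustrative host
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An immutable RDD "view" over a Cassandra table: you can analyze it in
        // Spark, but you cannot alter the underlying data through it.
        JavaRDD<CassandraRow> vertices =
            javaFunctions(sc).cassandraTable("graphdb", "vertices");

        System.out.println("vertices: " + vertices.count());
        sc.stop();
    }
}
```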
Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark GraphX. Apache Giraph uses Hadoop as its data storage layer. Since you are using Titan/Cassandra, you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question regarding ML using MLlib, and Apache Giraph is not an ML solution.
Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both work with graphs. Giraph is a solution for graph processing, as spidy explained; using Giraph you can execute graph analysis algorithms such as PageRank (e.g. who has the most followers). Gremlin, on the other hand, is meant for traversing, e.g. querying the graph database along the relationships (edges) between entities (vertices) and obtaining result sets of vertex and edge properties.
I am architecting a social network, incorporating various features, many powered by big-data-intensive workloads (such as machine learning), e.g. recommender systems, search engines, and time-series sequence matchers.
Given that I currently have only a handful of users, but foresee significant growth, what metrics should I use to decide between:
Spark (with/without HBase over Hadoop)
MongoDB or Postgres
I'm looking at Postgres as a means of reducing porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL, and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features quite limiting.
I think you are heading in the right direction by searching for a software stack/architecture that can:
handle different types of load: batch, real time computing etc.
scale in size and speed along with business growth
be a live software stack which is well maintained and supported
have common library support for domain specific computing such as machine learning, etc.
To those merits, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in a batch manner. It supports reliable and scalable storage (HDFS) and computation (MapReduce/YARN). With the addition of Spark, you can leverage HDFS for storage plus the real-time computing (performance) added by Spark, as in the sketch below.
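A minimal sketch of that combination: a Spark job (Java API) reading a dataset directly from HDFS and running a distributed computation over it. The HDFS path and the filter condition are illustrative.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsErrorCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("HdfsErrorCount"));

        // Read a dataset straight from HDFS (the path is illustrative) and
        // run a distributed filter-and-count over the cluster.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events/*.log");
        long errors = lines.filter(line -> line.contains("ERROR")).count();

        System.out.println("error lines: " + errors);
        sc.stop();
    }
}
```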
In terms of development, both systems are natively supported by Java/Scala. Library support and performance-tuning advice are abundant here on Stack Overflow and everywhere else. There are at least a few machine learning libraries (Mahout, MLlib) working with Hadoop and Spark.
For deployment, AWS and other cloud providers offer hosted solutions for Hadoop/Spark. Not an issue there either.
I guess you should separate data storage and data processing. In particular, "Spark or MongoDB?" is not a good question to ask; rather, ask "Spark or Hadoop or Storm?" and also "MongoDB or Postgres or HDFS?"
In any case, I would refrain from having the database do processing.
I have to admit that I'm a little biased, but if you want to learn something new, have serious spare time, are willing to read a lot, and have the resources (in terms of infrastructure), go for HBase*, you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of atomic counters updated in real time (see the sketch below).
*Alongside Hadoop, Hive, Spark...
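For a taste of those real-time atomic counters, here is a hedged sketch using the plain HBase client API. The table, column family, qualifier, and row key are hypothetical, and the table is assumed to already exist.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) {
            // Atomically bump a per-user page-view counter; this stays correct
            // under heavy concurrent writes without any client-side locking.
            long views = table.incrementColumnValue(
                Bytes.toBytes("user-42"),   // row key (hypothetical)
                Bytes.toBytes("c"),         // column family
                Bytes.toBytes("views"),     // qualifier
                1L);                        // increment amount
            System.out.println("views = " + views);
        }
    }
}
```

The point of incrementColumnValue is that the read-modify-write happens server-side, so concurrent clients never clobber each other's updates.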
In my opinion, it depends more on your requirements and the data volume you will have than on the number of users (which is also a requirement). Hadoop (aka Hive/Impala, HBase, MapReduce, Spark, etc.) works fine with large amounts of data (GB/TB per day) and scales horizontally very well.
In the Big Data environments I have worked with, I have always used Hadoop HDFS to store the raw data and leveraged the distributed file system to analyze the data with Apache Spark. The results were stored in a database system like MongoDB to serve low-latency queries or fast aggregates to many concurrent users. Then we used Impala for on-demand analytics. The main question when using so many technologies is how to scale the infrastructure and the resources given to each one. For example, Spark and Impala consume a lot of memory (they are in-memory engines), so it's a bad idea to put a MongoDB instance on the same machine.
I would also suggest a graph database, since you are building a social network architecture; but I don't have any experience with those...
Are you looking to stay purely open-source? If you are going to go enterprise at some point, many of the enterprise distributions of Hadoop include Spark analytics bundled in.
I have a bias, but, there is also the Datastax Enterprise product, which bundles Cassandra, Hadoop and Spark, Apache SOLR, and other components together. It is in use at many of the major internet entities, specifically for the applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
You want to think about how you will be hosting this as well.
If you are staying in the cloud, you will not have to choose: you will be able to (depending on your cloud environment, but with AWS, for example) use Spark for continuous batch processing, Hadoop MapReduce for long-timeline analytics (analyzing data accumulated over a long time), etc., because storage is decoupled from collection and processing. Put data in S3 and process it later with whatever engine you need.
If you will be hosting the hardware, building a Hadoop cluster will give you the ability to mix hardware (heterogeneous hardware is supported by the framework), a robust and flexible storage platform, and a mix of tools for analysis, including HBase and Hive; it also has ports for most of the other things you've mentioned, such as Spark on Hadoop (not a port, actually the original design of Spark). It is probably the most versatile platform and can be deployed/expanded cheaply, since the hardware does not need to be the same for every node.
If you are self-hosting, going with other cluster options will force hardware requirements on you that may be difficult to scale with later.
We use Spark + HBase + Apache Phoenix + Kafka + Elasticsearch, and scaling has been easy so far.
*Phoenix is a JDBC driver for HBase; it allows you to use java.sql with HBase, Spark (via JdbcRDD), and Elasticsearch (via the JDBC river), which really simplifies integration. A sketch follows below.
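A hedged sketch of what that looks like with plain java.sql: the ZooKeeper host in the JDBC URL and the table/column names are illustrative, but the jdbc:phoenix:&lt;quorum&gt; URL format is Phoenix's standard one.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PhoenixQuery {
    public static void main(String[] args) throws SQLException {
        // Phoenix JDBC URL format: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT user_id, views FROM metrics LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```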
As part of a CMS I'm developing, I've got MongoDB as the primary datastore, which feeds into Elasticsearch and Redis. All this is configured declaratively.
I'm currently trying to develop a declarative API in JSON (a DSL of sorts) which, when implemented, will enable me to write uniform queries in JSON, while at the backend these datastores work in tandem to come up with the result. Federated search, if you will.
Now, while fleshing out the supported types of queries for this JSON API, I've come across a class of queries not (efficiently) supported by my current setup: graph-based queries, like friend-of-friend, RDF queries, etc. This is something I'd like to support as well.
So I'm looking for a way to introduce a GraphDB into this ecosystem with the best fit. I should probably say the app-layer sits in Node.js.
I've come across lots of articles comparing Neo4j (a popular graph database) with MongoDB, but not many actual use cases or real-world scenarios in which the two complement each other.
Any pointers highly appreciated.
You might want to take a look at structr[1], which has a RESTful graph database backend that you can configure using Java beans. In future versions, there will be a configuration option using REST calls only, so that you can fire up a structr server and configure and use it as a standalone graph database backend.
Just contact us on twitter or via email.
(disclaimer: I'm one of the developers of structr, so this comment may not be 100% impartial :))
[1] http://structr.org
The databases are very much complementary.
Use MongoDB to store your raw data/system of record and load the raw data into Neo4j for additional insights/analysis. When you are dealing with unstructured data, you want to store the information in a datastore that is conducive to unstructured data: MongoDB fits the bill (as do other similar NoSQL databases). While Neo4j is considered a NoSQL database, it doesn't fit the bill for unstructured data. Because you have to determine what is a relationship, what is a node, and what properties are stored for each, it's better suited when you have semi-structured data and some understanding of the type of analysis you want to do.
A great architecture is to store your unstructured data in MongoDB and use jobs to load it into Neo4j (see the sketch below). This allows you to re-load your graph if you figure out there are new pieces of information you'd like to store in the graph for additional analysis.
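Here is a hedged sketch of such a load job, assuming an embedded Neo4j instance (3.x-era API) and the MongoDB Java driver. The database, collection, label, and field names are all hypothetical; creating relationships between the loaded nodes would follow the same transactional pattern.

```java
import java.io.File;

import com.mongodb.MongoClient;
import org.bson.Document;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class MongoToNeo4jJob {
    public static void main(String[] args) {
        MongoClient mongo = new MongoClient("localhost", 27017);
        GraphDatabaseService neo4j =
            new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db"));

        try (Transaction tx = neo4j.beginTx()) {
            // Mirror each raw document as a node, keeping the MongoDB id so the
            // graph can be re-built or re-linked to the source later.
            for (Document doc : mongo.getDatabase("cms").getCollection("users").find()) {
                Node node = neo4j.createNode(Label.label("User"));
                node.setProperty("mongoId", doc.getObjectId("_id").toHexString());
                node.setProperty("name", doc.getString("name"));
            }
            tx.success();
        }

        neo4j.shutdown();
        mongo.close();
    }
}
```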
They are definitely NOT replacements for each other. They fit very different use cases.
What's currently the best choice to persist graph-like structures? Graph databases (e.g. Neo4j) or RDF triple stores (e.g. Virtuoso)?
For example, we have the following use case:
a weakly connected graph (similar to the graph of scholarly papers in a collection) with nearly 10M nodes;
quite rare updates;
critical operations: retrieving particular sub-graphs, updating nodes in a given sub-graph, re-computing link analysis measures (e.g. HITS or PageRank) after updating some nodes.
Providing the standard API to query the data for third party applications (a la Facebook's or Twitter's) is desired as well.
With Virtuoso you have the following working for you:
-- SPARQL, SQL, SPASQL (SPARQL inside SQL), and SQL-inside-SPARQL support (e.g. for dealing with N-ary relations via magic/function predicates/properties); see the sketch after this list
-- works as a compact engine (e.g., as exploited via the KDE Desktop) or as a massive DBMS, as demonstrated by the live 17-billion-triples+ LOD Cloud Cache or the smaller DBpedia live instance
-- includes full-text indexing and text patterns in SPARQL (via bif:contains); it also includes XPath/XQuery (via xcontains)
-- ACID or non-ACID mode, ditto schema-last, when dealing with the property graph store
-- via transformation middleware it can pull data from 80+ data sources (including REST APIs, SOAP services, hypermedia resources, ODBC- or JDBC-accessible relational data sources, etc.) and transform it into transient or persistent Linked Data graphs
-- Linked Data publishing is automatic, i.e., post DBMS-record creation you have built-in Linked Data pages that act as views into the DBMS. No messing around with URL-rewrite rules, 303 redirects, or anything like that. InterWeb-scale super keys just work!
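To illustrate the SPARQL side, here is a minimal sketch that queries a Virtuoso-backed SPARQL endpoint using Apache Jena ARQ. The example uses the public DBpedia endpoint (which runs on Virtuoso); any Virtuoso instance exposes the same kind of /sparql endpoint, and the resource and property IRIs are only examples.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class VirtuosoSparqlSketch {
    public static void main(String[] args) {
        String sparql =
            "SELECT ?label WHERE { " +
            "  <http://dbpedia.org/resource/Graph_database> " +
            "  <http://www.w3.org/2000/01/rdf-schema#label> ?label " +
            "} LIMIT 5";

        // DBpedia's endpoint is served by Virtuoso; swap in your own
        // instance's /sparql URL for a private deployment.
        QueryExecution qe = QueryExecutionFactory.sparqlService(
            "http://dbpedia.org/sparql", sparql);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("label"));
            }
        } finally {
            qe.close();
        }
    }
}
```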
That's it for now :-)
At this scale (small to medium sized databases), graph databases like Neo4j will currently give better performance for graph traversals. Triple stores are catching up, though. The big advantage of a triple store compared to a graph database is that data dumps and the query language are standardized, which makes it a lot easier to move to another product and prevents vendor lock-in.