Social networking: Hadoop, HBase, Spark over MongoDB or Postgres?

I am architecting a social network incorporating various features, many powered by data-intensive workloads (such as machine learning): e.g. recommender systems, search engines and time-series sequence matchers.
Given that I currently have fewer than 5 users (but foresee significant growth), what metrics should I use to decide between:
Spark (with/without HBase over Hadoop)
MongoDB or Postgres
I'm looking at Postgres as a means of reducing porting pressure between it and Spark (using a SQL abstraction layer that works on both). Spark seems quite interesting; I can imagine various ML, SQL and graph questions it could be made to answer speedily. MongoDB is what I usually use, but I've found its scaling and map-reduce features quite limiting.
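To make the porting idea concrete, here is a minimal sketch of the abstraction layer I have in mind, using Spark's JDBC data source from the Scala shell (the host, database, table and credentials are made up, and the Postgres JDBC driver needs to be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("postgres-via-spark").getOrCreate()

    // Read a Postgres table through Spark's generic JDBC data source.
    val users = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/social") // hypothetical host/db
      .option("dbtable", "users")
      .option("user", "app")
      .option("password", "secret")
      .load()

    // The same Spark SQL runs whether the rows live in Postgres or on HDFS.
    users.createOrReplaceTempView("users")
    spark.sql("SELECT country, count(*) AS n FROM users GROUP BY country").show()

The point is that the DataFrame/SQL layer stays the same if the table is later re-materialized on HDFS or elsewhere.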

I think you are on the right track in searching for a software stack/architecture that can:
handle different types of load: batch, real-time computing, etc.
scale in size and speed along with business growth
be a live software stack that is well maintained and supported
have common library support for domain-specific computing such as machine learning, etc.
On those merits, Hadoop + Spark can give you the edge you need. Hadoop is by now relatively mature at handling large-scale data in a batch manner. It provides reliable and scalable storage (HDFS) and computation (MapReduce/YARN). With the addition of Spark, you can pair that storage (HDFS) with the near-real-time, in-memory computing Spark brings.
In terms of development, both systems natively support Java/Scala. Library support and performance-tuning advice for both are abundant here on Stack Overflow and everywhere else. There are at least a few machine learning libraries (Mahout, MLlib) that work with Hadoop and Spark.
For deployment, AWS and other cloud providers offer hosted solutions for Hadoop/Spark. Not an issue there either.
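For instance, here is a minimal spark-shell sketch of a recommender built with MLlib's ALS over ratings stored in HDFS (the path and column names are hypothetical):

    import org.apache.spark.ml.recommendation.ALS

    // Ratings written to HDFS by an upstream batch job: userId, itemId, rating.
    val ratings = spark.read.parquet("hdfs:///data/ratings")

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(10)      // number of latent factors
      .setMaxIter(10)

    val model = als.fit(ratings)
    model.recommendForAllUsers(5).show() // top 5 items per user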

I suggest you separate data storage from data processing. In particular, "Spark or MongoDB?" is not quite the right question to ask; better to ask "Spark or Hadoop or Storm?" and also "MongoDB or Postgres or HDFS?"
In any case, I would refrain from having the database do the processing.

I have to admit that I'm a little biased, but if you want to learn something new, you have serious spare time, you're willing to read a lot, and you have the resources (in terms of infrastructure), go for HBase*; you won't regret it. A whole new universe of possibilities and interesting features opens up when you can have billions of atomic counters updated in real time.
*Alongside Hadoop, Hive, Spark...
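To illustrate, a minimal sketch of such an atomic counter using the HBase Java client from Scala (the table, column family and qualifier are made up):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.util.Bytes

    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("user_counters"))

    // Atomically add 1 to this user's "likes" counter; safe under heavy
    // concurrent writes, with no client-side read-modify-write cycle.
    val likes = table.incrementColumnValue(
      Bytes.toBytes("user:123"), // row key
      Bytes.toBytes("m"),        // column family
      Bytes.toBytes("likes"),    // qualifier
      1L)

    println(s"likes = $likes")
    table.close(); conn.close()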

In my opinion, it depends more on your requirements and the data volume you will have than on the number of users (which is also a requirement). Hadoop (i.e. Hive/Impala, HBase, MapReduce, Spark, etc.) works fine with large amounts of data (GB/TB per day) and scales horizontally very well.
In the Big Data environments I have worked with, I have always used Hadoop HDFS to store the raw data and leveraged the distributed file system to analyse the data with Apache Spark. The results were stored in a database system like MongoDB to obtain low-latency queries and fast aggregates with many concurrent users. Then we used Impala for on-demand analytics. The main question when using so many technologies is sizing the infrastructure and the resources given to each one appropriately. For example, Spark and Impala consume a lot of memory (they are in-memory engines), so it's a bad idea to put a MongoDB instance on the same machine.
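As a rough sketch of that last hop (Spark aggregates pushed into MongoDB for serving), using the MongoDB Spark connector - the URI, database, collection and path are made up, and the exact format/option names depend on the connector version:

    // Precompute an aggregate on the cluster...
    val dailyStats = spark.read.parquet("hdfs:///raw/events")
      .groupBy("userId", "date")
      .count()

    // ...and persist it where the application can query it with low latency.
    dailyStats.write
      .format("mongodb")
      .option("connection.uri", "mongodb://mongo-host:27017")
      .option("database", "analytics")
      .option("collection", "daily_user_stats")
      .mode("overwrite")
      .save()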
I would also suggest a graph database, since you are building a social network architecture; but I don't have any experience with those...

Are you looking to stay purely open-source? If you are going to go enterprise at some point, a lot of the enterprise distributions of Hadoop include Spark analytics bundled in.
I have a bias, but, there is also the Datastax Enterprise product, which bundles Cassandra, Hadoop and Spark, Apache SOLR, and other components together. It is in use at many of the major internet entities, specifically for the applications you mention. http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
You want to think about how you will be hosting this as well.
If you are staying in the cloud, you will not have to choose; you will be able to (depending on your cloud environment, but with AWS, for example) use Spark for continuous/batch processing, Hadoop MapReduce for long-timeline analytics (analyzing data accumulated over a long time), etc., because the storage is decoupled from the collection and processing. Put data in S3, and then process it later with whatever engine you need.
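A small sketch of that decoupling, assuming a hypothetical bucket and event layout:

    // Collectors land raw events in S3; any engine can pick them up later.
    val events = spark.read.json("s3a://my-bucket/events/2015/07/*")

    events.createOrReplaceTempView("events")
    spark.sql("SELECT date, count(*) AS n FROM events GROUP BY date ORDER BY date").show()

    // Tomorrow the same files could just as well feed MapReduce or Hive,
    // because nothing about the storage is tied to Spark.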
If you will be hosting the hardware, building a Hadoop cluster will give you the ability to mix hardware (heterogeneous hardware is supported by the framework), a robust and flexible storage platform, and a mix of tools for analysis, including HBase and Hive; it also has ports for most of the other things you've mentioned, such as Spark on Hadoop (not a port; that was actually the original design of Spark). It is probably the most versatile platform, and it can be deployed/expanded cheaply, since the hardware does not need to be the same for every node.
If you are self-hosting, going with other cluster options will force hardware requirements on you that may be difficult to scale with later.

We use Spark + HBase + Apache Phoenix + Kafka + ElasticSearch, and scaling has been easy so far.
*Phoenix is a JDBC driver for HBase; it lets you use java.sql with HBase, Spark (via JdbcRDD) and ElasticSearch (via the JDBC river), which really simplifies integration.
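For a flavour of how plain java.sql looks against Phoenix (the ZooKeeper quorum and table are made up):

    import java.sql.DriverManager

    // Phoenix JDBC URLs point at the HBase ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    val rs = conn.createStatement()
      .executeQuery("SELECT user_id, event_count FROM events LIMIT 10")

    while (rs.next())
      println(s"${rs.getString("user_id")} -> ${rs.getLong("event_count")}")

    conn.close()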


Hadoop with MongoDB, and Hadoop vs MongoDB

I'm trying to understand the key differences between MongoDB and Hadoop.
I understand that MongoDB is a database, while Hadoop is an ecosystem that contains HDFS. There are some similarities in the way data is processed using either technology, but major differences as well.
I'm confused as to why someone would use MongoDB over a Hadoop cluster; mainly, what advantages does MongoDB offer over Hadoop? Both perform parallel processing, both can be used with Spark for further data analytics, so what is the value-add of one over the other?
Now, if you were to combine both, why would you want to store data in MongoDB as well as HDFS? MongoDB has map/reduce, so why would you want to send data to Hadoop for processing, when, again, both are compatible with Spark?
First, let's look at what we're talking about:
Hadoop - an ecosystem. Its two main components are HDFS and MapReduce.
MongoDB - a document-type NoSQL database.
Let's compare them on two types of workload:
High latency, high throughput (batch processing) - dealing with the question of how to process and analyze large volumes of data. Processing is done in a parallel and distributed way in order to finalize and retrieve results as efficiently as possible. Hadoop is the best way to deal with such a problem, managing and processing data in a distributed and parallel way across several servers.
Low latency, low throughput (immediate access to data, real-time results, a lot of users) - when dealing with the need to show immediate results in the quickest way possible, or to do small parallel processing resulting in NRT results for several concurrent users, a NoSQL database is the best way to go.
A simple example in a stack would be to use Hadoop in order to process and analyze massive amounts of data, then store your end results in MongoDB in order for you to:
Access them in the quickest way possible
Reprocess them now that they are on a smaller scale
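A sketch of the serving side of that stack: the web tier reading precomputed aggregates back out of MongoDB (using the synchronous Java driver from Scala; the database, collection and field names are hypothetical):

    import com.mongodb.client.MongoClients
    import org.bson.Document

    val client = MongoClients.create("mongodb://localhost:27017")
    val stats  = client.getDatabase("analytics").getCollection("daily_user_stats")

    // Single-document lookup: millisecond latency, fine for many concurrent users.
    val doc = stats.find(new Document("userId", "u123").append("date", "2016-01-01")).first()
    println(doc)

    client.close()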
The bottom line is that you shouldn't look at Hadoop and MongoDB as competitors: each one has its own best use case and approach to data, and they complement and complete each other in your work with data.
Hope this makes sense.
Firstly, we should know what these two terms mean.
HADOOP
Hadoop is an open-source tool for Big Data analytics developed by the Apache Software Foundation. It is the most widely used tool for both storing and analyzing Big Data, and it uses a clustered architecture to do so.
Hadoop has a vast ecosystem, and this ecosystem comprises some robust tools.
MongoDB
MongoDB is an open-source, general-purpose, document-based, distributed NoSQL database built for storing Big Data. It has a very rich query language, which results in high performance. Being document-based, it stores data in JSON-like documents.
DIFFERENCES
Both these tools are good enough for harnessing Big Data; it depends on your requirements. For some projects Hadoop is a good option, and for others MongoDB fits well.
Hope this helps you to distinguish between the two.

Which NoSQL ... again :), but a different use case

Suggestions for a NoSQL datastore so that we can push data and generate real time Qlikview reports easily?
Easily means:
1. Qlikview support for reads (a MongoDB connector is available; otherwise maybe we can write a JDBC connector, or a custom QVX connector to the datastore)
2. Easily adaptable to changes in schema, or schemaless. We change our schema quite frequently ...
3. Java support for writes
4. Super fast reads - real-time incremental access, as well as batch access to old data within a time range. I read that Cassandra excels at ranges.
5. Reasonably fast writes
6. Reasonably big data storage - 20 million rows stored per day, about 200 bytes each (roughly 4 GB/day, or about 1.5 TB/year)
7. Would be nice if it can scale to a year's worth of data; elasticity is not so important.
8. Easy to use, install, and run. Looking at minimal setup and configuration time.
9. Matlab support for ad-hoc querying
Initially I don't think we need a distributed system; however, a cluster is a possibility.
I've looked at MongoDB, Cassandra and HBase. I don't think going over REST is a good idea due to (theoretically) slower performance.
I'm leaning towards MongoDB at the moment due to its ease of use, Matlab support, being totally schemaless, and Qlikview support (a beta connector is available). However, if anyone can suggest something better, that would be great!
Depending on the server infrastructure you will use, I guess the best choice is Amazon's NoSQL service, available at aws.amazon.com.
The fact is that any DB will have poor performance on cloud infrastructure due to the way it stores data; Amazon EC2 with EBS, for instance, is VERY slow for this task, requiring you to connect up to 20 EBS volumes in RAID to achieve decent speed. They solved this issue by creating this NoSQL service, which I have never used, but it seems nice.

Advantages of databases like Greenplum or Vertica compared to MongoDB or Cassandra [closed]

I am currently working on a few projects with MongoDB and Apache Cassandra respectively. I am also using Solr a lot, and I am handling "lots" of data with them (approx. 1-2 TB). I heard of Greenplum and Vertica for the first time last week, and I am not really sure where to put them in my brain. They seem to me like data warehouse (DWH) solutions, and I haven't really worked with DWH. And they seem to cost lots of money (e.g. $60k for 1 TB of storage in Greenplum). I am currently not handling petabytes of data and I don't think I will, but products like Cassandra also seem to be able to handle this:
"Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data."
via http://www.datastax.com/why-cassandra
So my question: Why should people use Greenplum & Co? Is there a huge advantage in comparison to these other products?
Thanks.
Cassandra, Greenplum and Vertica all handle huge amounts of data, but in very different ways.
Some made-up use cases where each database has its strengths:
Use cassandra for:
tweets.insert(key:user, data:blob);
tweets.get(key:user)
Use greenplum for:
begin;
update account set balance = balance - 10 where account_id = 1;
update account set balance = balance + 10 where account_id = 2;
commit;
Use Vertica for:
select sum(balance)
over (partition by region order by account rows unbounded preceding)
from transactions;
I work in the telecom industry. We deal with large data sets and complex EDW (enterprise data warehouse) models. We started with Teradata, and it was good for a few years. Then the data increased exponentially, and as you know, expansion in Teradata is expensive. So we evaluated EMC Greenplum, Oracle Exadata, HP Vertica and IBM Netezza.
In speed, generation of 20 reports ranked like this: 1. Vertica, 2. Netezza, 3. Greenplum, 4. Oracle.
In compression ratio, Vertica had a natural advantage. Among the others, IBM is good too.
The worst, as per the benchmarks, were EMC and Oracle. As expected, since both want to sell tons of storage and hardware.
Scalability: all scale well.
Loading time: EMC was the best here; the others (Teradata, Vertica, Oracle, IBM) were good too.
Concurrent user queries: Vertica, EMC, Greenplum, then IBM. Oracle Exadata is comparatively slow on any type of query, but much better than its old-school 10g.
Price: Teradata > Oracle > IBM > HP > EMC
Note: you need to compare apples to apples - same number of cores, RAM, data volume, and reports.
We chose Vertica for its hardware-independent pricing model, lower price and good performance. Now all 40+ users are happy to generate reports without waiting, and it all fits on low-cost HP DL380 servers. It is great for the OLAP/EDW use case.
All this analysis applies only to the EDW/analytics/OLAP case. I am still an Oracle fanboy for all OLTP, rich PL/SQL, connectivity, etc. on any hardware or system. Exadata handles a mixed workload decently, but it is unreasonable in price/performance ratio, and you still need to migrate 10g code to Exadata best practice (sort of MPP-like, bulk processing, etc.), which is more time-consuming than they claim.
We've been working in Hadoop for 4 years, and Vertica for 2. We had massive loading and indexing problems with our tables in MySQL. We were running on fumes with our home-grown sharding solution. We could have invested heavily in developing a more sophisticated sharding solution, which would have been quite painful, imo. We could have thought harder about what data we absolutely needed to keep in a SQL database.
But at the end of the day, switching from MySQL to Vertica was what we chose. Vertica performance patterns are quite different from MySQL's, which comes with its own headaches. But it can load a lot of data very quickly, and it is good at heavy duty queries that would make MySQL's head spin.
The way I see it, Vertica is a solution when you are already invested in SQL and need a heavier duty SQL database. I'm not an expert, so I couldn't tell you what a transition to Oracle or DB2 would have been like compared to Vertica, neither in terms of integration effort or monetary cost.
Vertica offers a lot of features we've barely looked into. Those might be very attractive to others with use cases different to ours.
I'm a Vertica DBA and prior to that was a developer with Vertica. Michael Stonebraker (the guy behind Ingres, Vertica, and other databases) has some critiques of NoSQL that are worth listening to.
Basically, here are the advantages of Vertica as I see them:
it's rather fast on large amounts of data
its performance is similar (as far as I can gather) to other data warehousing solutions, but its advantage is clustering on commodity hardware, so you can scale by adding more commodity machines. It looks cheap in terms of overall cost per TB. (Going from memory, not an exact quote.)
Again, it's for data warehousing.
You get to use traditional SQL and tables. It's under the hood that's different.
I can't speak to the other products, but I'm sure a lot of them are fine too.
Edit: Here's a talk from Stonebraker: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb
Pivotal, formerly Greenplum, is the well-funded spinoff from EMC, VMware and GE. Pivotal's market is enterprises (and Homeland Cybersecurity agencies) with multi-petabyte databases needing complex analytics and high-speed ETL. Greenplum's origin is a PostgreSQL DB redesigned for map-reduce-style MPP, with later additions for columnar support and HDFS. It marries the best of SQL + NoSQL, making NewSQL.
Features:
In 2015H1 most of their code, including Greenplum DB & HAWQ, will go Open Source. Some advanced management & performance features at the top of the stack will remain proprietary.
MPP (Massively Parallel Processing) shared-nothing RDBMS designed for multi-terabyte to multi-petabyte environments.
Full SQL Compliance - supporting all versions of SQL: ‘92, ‘99, 2003 OLAP, etc. 100% compatible with PostgreSQL 8.2.
The only SQL-over-Hadoop engine capable of handling all 99 queries used by the TPC-DS benchmark standard without rewriting. The competition cannot run many of them and is significantly slower. See the SIGMOD whitepaper.
ACID compliance.
Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files.
Solr/Lucene integration for multi-lingual full-text search embedded in the SQL.
Incorporates Open Source Software: Spring, Cloud Foundry, Redis.io, RabbitMQ, Grails, Groovy, Open Chorus, Pig, ZooKeeper, Mahout, MADlib, MapR. Some of these are used at EBSCO.
Native connectivity to HBase, which is a popular column-store-like technology for Hadoop.
VMware's participation in the $150m investment in MongoDB will likely lead to integration of petabyte-scale XML files.
Table-by-table specification of distribution keys allows you to design your table schemas to take advantage of node-local joins and group-bys, but it will perform well even without this.
Row and/or Column-oriented data storage. It is the only database where a table can be polymorphic with both columnar and row-based partitions as defined by the DBA.
A column-store table can have a different compression algorithm per column because different datatypes have different compression characteristics to optimize their storage.
Advanced Map-Reduce-like CBO Query Optimizer – queries can be run on hundreds of thousands of nodes.
It is the only database with a dynamic distributed pipeline execution model for query processing. While older databases rely on materialized execution, Greenplum doesn't have to write data to disk with every intermediate query step. It streams data to the next stage of the query plan in memory, and never has to materialize the data to disk, so it's much faster than anything that has been demonstrated on Hadoop.
Complex queries on large data sets are solved in seconds or even sub-seconds.
Data management – provides table statistics, table security.
Deep analytics – including data mining or machine learning algorithms using MADlib. Deep Semantic Textual Analytics using GPText.
Graphical Analysis - billion edge distributed in-memory graph database and algorithms using GraphLab.
Integration of SQL, Solr indexes, GPText, MADlib and GraphLab in a single query for massive syntactical parsing and graph/matrix affinity analysis for deep search analytics.
Fully ODBC/JDBC compliant.
Distributed ETL rate of 16 TB/hr!! Integration with Talend available.
Cloud support: Pivotal plans to package its Cloud Foundry software so that it can be used to host Pivotal atop other clouds as well, including Amazon Web Services' EC2. Pivotal data management will be available for use in a variety of cloud settings and will not be dependent on a proprietary VMware system. Will target OpenStack, vSphere, vCloud Director, or private brands. IBM announced it has standardized on Cloud Foundry for its PaaS. Confluence page.
Two hardware "appliance" offerings: Isilon NAS & Greenplum DCA.
There is a lot of confusion about when to use a row database like MySQL or Oracle, a columnar DB like Infobright or Vertica, a NoSQL variant, or Hadoop. We wrote a white paper to try to help sort out which technologies are best suited to which use cases - you can download Emerging Database Landscape (scroll half way down) or watch an on-demand webinar on the same topic.
Hope either is useful for you.

Difference between Memcached and Hadoop?

What is the basic difference between Memcached and Hadoop? Microsoft seems to do memcached with the Windows Server AppFabric.
I know memcached is a giant key-value hash spread across multiple servers. What is Hadoop, and how is Hadoop different from memcached? Is it used to store data? Objects? I need to store giant in-memory objects, but it seems like I need some way of splitting these giant objects into "chunks" like people are talking about. When I look into splitting an object into bytes, Hadoop keeps popping up.
I have a giant class in memory, upwards of 100 MB. I need to replicate this object, or cache it in some fashion. When I look into caching this monster object, it seems like I need to split it the way Google does. How does Google do this? How can Hadoop help me in this regard? My objects are not simple structured data; they have references up and down the classes inside, etc.
Any idea, pointers, thoughts, guesses are helpful.
Thanks.
memcached [ http://en.wikipedia.org/wiki/Memcached ] is a single-purpose distributed caching technology.
apache hadoop [ http://hadoop.apache.org/ ] is a framework for distributed data processing - targeted at Google/Amazon scale, many terabytes of data. It includes sub-projects for the different areas of this problem - a distributed database, algorithms for distributed processing, reporting/querying, a data-flow language.
The two technologies tackle different problems. One is for caching (small or large items) across a cluster. The other is for processing large items across a cluster. From your question it sounds like memcached is more suited to your problem.
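For the caching side, a minimal sketch with the spymemcached client (one Java client among several; the server address, key and TTL are made up):

    import java.net.InetSocketAddress
    import net.spy.memcached.MemcachedClient

    val cache = new MemcachedClient(new InetSocketAddress("cache-host", 11211))

    // set(key, ttlSeconds, value): values must be serializable and under the
    // item size limit (1 MB by default) - hence the advice to chunk big objects.
    cache.set("user:42:profile", 3600, "{\"name\":\"Ada\"}")
    println(cache.get("user:42:profile"))

    cache.shutdown()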
Memcached won't work for you because of its limit on the size of stored values (see the memcached FAQ). I read somewhere that this limit can be increased to 10 MB, but I am unable to find the link.
For your use case I suggest giving MongoDB a try (see the MongoDB FAQ). MongoDB can be used as an alternative to memcached, and it provides GridFS for storing large files in the DB.
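A sketch of the GridFS route for an object of roughly 100 MB (database and file names are made up); GridFS splits the payload into chunks (around 256 KB each by default) stored as ordinary documents:

    import java.io.ByteArrayInputStream
    import com.mongodb.client.MongoClients
    import com.mongodb.client.gridfs.GridFSBuckets

    val client = MongoClients.create("mongodb://localhost:27017")
    val bucket = GridFSBuckets.create(client.getDatabase("cache"))

    // Stand-in for your ~100 MB serialized object.
    val payload = new Array[Byte](100 * 1024 * 1024)
    val id = bucket.uploadFromStream("monster-object", new ByteArrayInputStream(payload))
    println(s"stored as $id") // later: bucket.openDownloadStream(id)

    client.close()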
You need to use pure Hadoop for what you need (no HBase, Hive etc.). The MapReduce mechanism will split your object into many chunks and store them in Hadoop. The tutorial for MapReduce is here. However, don't forget that Hadoop is, first of all, a solution for massive compute and storage. In your case I would also recommend checking out Membase, which is an implementation of Memcached with additional storage capabilities. You will not be able to map-reduce with memcached/Membase, but those are still distributed, and your object can be cached in a cloud fashion.
Picking a good solution depends on requirements of the intended use, say the difference between storing legal documents forever to a free music service. For example, can the objects be recreated or are they uniquely special? Would they be requiring further processing steps (i.e., MapReduce)? How quickly does an object (or a slice of it) need to be retrieved? Answers to these questions would affect the solution set widely.
If objects can be recreated quickly enough, a simple solution might be to use Memcached as you mentioned across many machines totaling sufficient ram. For adding persistence to this later, CouchBase (formerly Membase) is worth a look and used in production for very large game platforms.
If objects CANNOT be recreated, determine whether S3 and other cloud file providers would meet your requirements for now. For high-throughput access, consider one of the several distributed, parallel, fault-tolerant filesystem solutions: DDN (has GPFS and Lustre gear) or Panasas (pNFS). I've used DDN gear and it had a better price point than Panasas. Both provide good solutions that are much more supportable than a DIY BackBlaze.
There are some mostly free implementations of distributed, parallel filesystems such as GlusterFS and Ceph that are gaining traction. Ceph touts an S3-compatible gateway and can use BTRFS (future replacement for Lustre; getting closer to production ready). Ceph architecture and presentations. Gluster's advantage is the option for commercial support, although there could be a vendor supporting Ceph deployments. Hadoop's HDFS may be comparable but I have not evaluated it recently.

HBase cassandra couchdb mongodb..any fundamental difference?

I just wanted to know if there is a fundamental difference between HBase, Cassandra, CouchDB and MongoDB. In other words, are they all competing in the exact same market and trying to solve the exact same problems? Or do they each fit best in different scenarios?
All this comes down to the question: what should I choose, and when? Is it a matter of taste?
Thanks,
Federico
Those are some long answers from @Bohzo (but they are good links).
The truth is, they're "kind of" competing. But they definitely have different strengths and weaknesses and they definitely don't all solve the same problems.
For example, Couch and Mongo both provide map-reduce engines as part of the main package. HBase is (basically) a layer on top of Hadoop, so you also get M-R via Hadoop. Cassandra is highly focused on being a key-value store and has plug-ins to "layer" Hadoop on top (so you can map-reduce).
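For a taste of Mongo's built-in map-reduce, here's a sketch using the Java driver from Scala - the map and reduce functions are JavaScript passed as strings, and the collection and field names are made up (note that server-side map-reduce has been deprecated in recent MongoDB releases):

    import com.mongodb.client.MongoClients

    val coll = MongoClients.create("mongodb://localhost:27017")
      .getDatabase("social").getCollection("posts")

    // Count posts per user; this runs inside the database's JS engine.
    val map    = "function() { emit(this.userId, 1); }"
    val reduce = "function(key, values) { return Array.sum(values); }"

    val cursor = coll.mapReduce(map, reduce).iterator()
    while (cursor.hasNext) println(cursor.next())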
Some of the DBs provide MVCC (Multi-version concurrency control). Mongo does not.
All of these DBs are intended to scale horizontally, but they do it in different ways. All of these DBs are also trying to provide flexibility in different ways. Flexible document sizes or REST APIs or high redundancy or ease of use, they're all making different trade-offs.
So to your question: In other words, are they all competing in the exact same market and trying to solve the exact same problems?
Yes: they're all trying to solve the issue of database-scalability and performance.
No: they're definitely making different sets of trade-offs.
What should you start with?
Man, that's a tough question. I work for a large company pushing tons of data, and we've been through a few of these over the years. We tried Cassandra at one point a couple of years ago and it couldn't handle the load. We're using Hadoop everywhere, but it definitely has a steep learning curve and it hasn't worked out in some of our environments. More recently we've tried to do Cassandra + Hadoop, but it turned out to be a lot of configuration work.
Personally, my department is moving several things to MongoDB. Our reasons for this are honestly just simplicity.
Setting up Mongo on a linux box takes minutes and doesn't require root access or a change to the file system or anything fancy. There are no crazy config files or java recompiles required. So from that perspective, Mongo has been the easiest "gateway drug" for getting people on to KV/Document stores.
CouchDB and MongoDB are document stores
Cassandra and HBase are key-value based
Here is a detailed comparison between HBase and Cassandra
Here is a (biased) comparison between MongoDB and CouchDB
Short answer: test before you use in production.
I can offer my experience with both HBase (extensive) and MongoDB (just starting).
Even though they are not the same kind of store, they solve the same problems:
scalable storage of data
random access to the data
low latency access
We were very enthusiastic about HBase at first. It is built on Hadoop (which is rock-solid), it is under Apache, it is active... what more could you want? Our experience:
HBase is fragile
an administrator's nightmare (full of configuration settings where the defaults are less than perfect, nontransparent configuration, changes from version to version, ...)
loses data (unless you have set configuration X and changed Y to... you get the point :)) - we found that out when HBase crashed and we lost 2 hours (!!!) of data because the WAL was not set up properly
lacks secondary indexes
lacks any way to back up the database without shutting it down
All in all, HBase was a nightmare. Wouldn't recommend it to anyone except to our direct competitors. :)
MongoDB solves all these problems and many more. It is a delight to set up, it makes administration a simple and transparent job, and the default configuration settings actually make sense. You can perform (hot) backups, and you can have secondary indexes. From what I read, I wouldn't recommend MapReduce on MongoDB (JavaScript, only 1 thread per node), but you can use Hadoop for that.
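For example, adding a secondary index is a one-liner (collection and field names are hypothetical):

    import com.mongodb.client.MongoClients
    import org.bson.Document

    val coll = MongoClients.create("mongodb://localhost:27017")
      .getDatabase("social").getCollection("events")

    // Ascending index on userId; queries filtering on userId stop scanning.
    coll.createIndex(new Document("userId", 1))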
And it is also VERY active when compared to HBase.
Also:
http://www.google.com/trends?q=HBase%2CMongoDB
Need I say more? :)
UPDATE: many months later I must say MongoDB delivered on all accounts and more. The only real downside is that hosting companies do not offer it the way they offer MySQL. ;)
It also looks like MapReduce is bound to become multi-threaded in 2.2. Still, I wouldn't use MR this way. YMMV.
Cassandra is good at writing data; it has the advantage that "writes never fail", and it has no single point of failure.
HBase is very good for data processing. HBase is based on the Hadoop File System (HDFS), so HBase doesn't need to worry about data replication or data consistency. HBase has a single point of failure. I am not really sure what it means for it to have one; in that sense it is somehow similar to an RDBMS, where we also have a single point of failure. I might be wrong, since I am quite new to this.
How about Riak? Does anyone have experience using Riak? I read somewhere that you need to pay for it, but I am not sure; an explanation would be welcome.
One more thing: which one would you prefer when you only care about reading a lot of data and have no concern with writing? Just imagine you have a database with petabytes of data and you want to make fast searches - which NoSQL database would you prefer?