My usage of MongoDB is quite simple. I have only a replica set and handle some basic queries without using Map Reduce.
I have heard that Hadoop is a great data-processing tool and that connecting MongoDB to it can improve performance. It handles MapReduce well, but is it useful in my case, which does not involve any MapReduce functions at all?
Moreover, if I did use MapReduce in MongoDB and connected it to Hadoop, how would performance be improved?
Hadoop is good for batch processing on huge volumes of data (GBs to TBs).
So if you are not expecting that volume of data and you need instant query results, you are better off doing it with Mongo alone; Hadoop would likely be overkill for the job.
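For example, basic lookups and small aggregations like the ones described can be answered directly by the replica set itself. A minimal pymongo sketch (the connection string, collection, and field names are hypothetical):

```python
from pymongo import MongoClient

# Hypothetical connection string, collection, and fields.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]
coll.create_index("customerId")  # index the field you filter on

# An indexed find returns "instantly" for this kind of workload...
recent = list(coll.find({"customerId": 42}).sort("created", -1).limit(20))

# ...and a small server-side aggregation covers simple summaries without Hadoop.
totals = list(coll.aggregate([
    {"$match": {"customerId": 42}},
    {"$group": {"_id": None, "spent": {"$sum": "$amount"}}},
]))
print(recent, totals)
```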
Related
I need to implement a big data storage + processing system.
The data grows daily (up to about 50 million rows/day); each row is a very simple JSON document of about 10 fields (date, numbers, text, IDs).
The data should then be queryable online (if possible), with arbitrary groupings on some of the document fields (date-range queries, IDs, etc.).
I'm thinking of using a MongoDB cluster to store all this data, building indexes on the fields I need to query, and then processing the data in an Apache Spark cluster (mostly simple aggregations + sorting). Maybe use spark-jobserver to build a REST API around it.
I have concerns about MongoDB's ability to scale (i.e. storing 10B+ rows), its throughput (quickly sending 1B+ rows to Spark for processing), and its ability to maintain indexes on such a large database.
In contrast, I'm considering Cassandra or HBase, which I believe are more suitable for storing large datasets but offer weaker query performance, which I'd ultimately need if I am to provide online querying.
1 - Is MongoDB + Spark a proven stack for this kind of use case?
2 - Is MongoDB's scalability (storage + query performance) effectively unbounded?
thanks in advance
As mentioned previously there are a number of NoSQL solutions that can fit your needs. I can recommend MongoDB for use with Spark*, especially if you have operational experience with large MongoDB clusters.
There is a white paper about turning analytics into realtime queries from MongoDB. Perhaps more interesting is the blog post from Eastern Airlines about their use of MongoDB and Spark and how it powers their 1.6 billion flight searches a day.
Regarding the data size: managing a MongoDB cluster with that much data is pretty normal. The performance challenge for any solution will be quickly sending 1B+ documents to Spark for processing. Parallelism and taking advantage of data locality are key here. Your Spark algorithm will also need to be written to exploit that parallelism - shuffling lots of data is expensive.
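For the Spark side, here is a minimal PySpark sketch of reading a collection through the MongoDB Spark Connector and running a simple aggregation. It assumes the connector jar is on the Spark classpath; the exact format and option names vary between connector versions, and the URI, collection, and field names are placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical replica-set URI and namespace; adjust to your deployment.
spark = (SparkSession.builder
         .appName("mongo-spark-aggregation")
         .config("spark.mongodb.input.uri", "mongodb://host1,host2/mydb.events")
         .getOrCreate())

# The connector partitions the read, so each Spark task pulls its own slice of
# the collection in parallel; co-locating executors with shards helps locality.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# A simple aggregation + sort of the kind mentioned in the question.
(df.groupBy("id")          # "id" is a hypothetical field
   .count()
   .orderBy("count", ascending=False)
   .show(20))
```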
Disclaimer: I'm the author of the MongoDB Spark Connector and work for MongoDB.
Almost any NoSQL database can fit your needs for storing the data. And you are right that MongoDB offers some extras over HBase and Cassandra when it comes to querying the data. But Elasticsearch is a proven solution for high-speed storage and retrieval/querying of data (metrics).
Here is some more information on using elasticsearch with Spark:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
I would actually use the complete ELK stack, since Kibana will let you explore the data easily with its visualization capabilities (charts, etc.).
I bet you already have Spark, so I would recommend installing the ELK stack on the same machine/cluster to test whether it suits your needs.
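For reference, a minimal sketch of reading an Elasticsearch index into Spark from Python; it assumes the elasticsearch-hadoop (elasticsearch-spark) jar is on the Spark classpath, and the node address, index, and field names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-spark").getOrCreate()

# The es-hadoop connector pushes the read down to Elasticsearch and splits it
# across the index's shards. Node address and index name are hypothetical.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .load("metrics"))

df.groupBy("status").count().show()   # "status" is a hypothetical field
```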
Traditional definitions of MapReduce state that it is a "programming model for processing large data sets with a parallel, distributed algorithm on a cluster."
Over the weekend, I was trying out MongoDB and tried a few simple MapReduce queries. (The basic word count problem in a book). MongoDB performed really well, but then I began to wonder if it was actually a MapReduce operation, or just a simple group-by aggregation, hence my question:
In case of a single-node "cluster", does it make sense to use MapReduce?
Traditional definitions of MapReduce state that it is a "programming model for processing large data sets with a parallel, distributed algorithm on a cluster."
Well, it does not need to be on a cluster; a single node can implement map and reduce functions with threads or processes. I do not know the inner dynamics of MongoDB while doing map-reduce, but I do know that map-reduce in MongoDB does not have the same dynamics as in Hadoop.
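To illustrate that point, here is a minimal single-machine word count written in the map/reduce style using Python's multiprocessing; the input file name is a placeholder:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(lines):
    # Map phase: count words in one chunk of lines.
    return Counter(word for line in lines for word in line.split())

def reduce_counts(total, partial):
    # Reduce phase: merge partial counts into the running total.
    total.update(partial)
    return total

if __name__ == "__main__":
    with open("book.txt") as f:          # hypothetical input file
        lines = f.readlines()
    chunks = [lines[i::4] for i in range(4)]   # split the work across 4 processes
    with Pool(4) as pool:
        partials = pool.map(map_chunk, chunks)
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(10))
```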
In case of a single-node "cluster", does it make sense to use MapReduce?
For large amounts of data, map and reduce functions need to be executed in a distributed environment, so no, it does not make sense for processing big data (for small data sizes it is fine).
Opinion
Sorry, this section is not an answer but a statement open to discussion. In a software system, MongoDB's responsibility should be to keep data, not to process it. If there is a data-processing requirement (there will be, 99% of the time), MongoDB MapReduce can be used up to a certain data size (it is important to determine that threshold beforehand); beyond that, processing should be handed off to a Hadoop cluster (or a similar distributed solution).
Mongo supports Map/Reduce queries, but they don't seem to be map-reduce in the Hadoop sense (running in parallel). What is the best way to run queries on a massive Mongo database? Do I need to export it somewhere else?
Depending on what exactly you need to do, your options (while staying within Mongo) are:
1) Keep using map/reduce in Mongo, but fire up some secondaries for m/r purposes. This is one somewhat easy way of parallelizing map/reduce. There are limits, though: you can only use the "out: inline" option, so the results need to be ~16MB or less. This is only really feasible if you haven't sharded yet.
2) Look into the aggregation framework coming in 2.2 (2.2.0-rc0 is out; we've found it to be pretty stable at MongoHQ). This is better optimized at the db level, mostly keeps you out of the janky JavaScript engine, and is one of the more interesting features 10gen has added. It will also work in a sharded environment (a short aggregation sketch follows below).
For either of the above, you want to make sure you have enough RAM (or really fast disks) to hold all the input data, the intermediate steps, and the result. Otherwise you're bound by IO speeds and not getting much out of your CPU.
If you want to step outside of Mongo, you can try the Mongo Hadoop adapter. Hadoop is a much better way of doing map/reduce, and this will let you use your Mongo data as input. It can be operationally complicated, though, which means either high effort or a fragile setup.
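To make option 2 concrete, here is a minimal sketch of a server-side group-by using the aggregation framework from pymongo; the connection string, collection, and field names are all placeholders:

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["events"]  # hypothetical names

# The whole pipeline runs inside mongod (and across shards), so no JavaScript
# engine is involved and only the final results reach the client.
pipeline = [
    {"$match": {"status": "ok"}},                                   # filter first so an index can be used
    {"$group": {"_id": "$userId", "total": {"$sum": "$amount"}}},   # hypothetical fields
    {"$sort": {"total": -1}},
    {"$limit": 10},
]
for doc in coll.aggregate(pipeline):
    print(doc)
```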
I'm using MongoDB with Node.js. Is there any speed advantage to using a MapReduce in Mongo as opposed to getting the full result set and doing a map and reduce in JS on my own?
There is usually no performance advantage to retrieving the entire resultset and performing the m/r app-side. In fact, in almost all situations cramming the entire resultset in memory on your node server is a particularly bad idea.
Doing the map/reduce on MongoDB will make sure no bandwidth between the database and your app server is wasted on retrieving the resultset and writing back the results of your m/r. MongoDB's map/reduce can also be easily scaled up.
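For a sense of what running it server-side looks like, here is a minimal sketch using pymongo's generic command helper (the same mapReduce command is available from the Node.js driver); the host, collection, and field names are placeholders:

```python
from bson.code import Code
from pymongo import MongoClient

db = MongoClient("mongodb://dbhost:27017")["mydb"]   # hypothetical host and database

# The map and reduce functions are JavaScript strings executed inside mongod,
# so only the (small) result set crosses the network back to the app server.
map_fn = Code("function () { emit(this.category, 1); }")               # hypothetical field
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

result = db.command(
    "mapReduce", "items",     # hypothetical collection name
    map=map_fn,
    reduce=reduce_fn,
    out={"inline": 1},        # return results inline rather than writing a collection
)
print(result["results"])
```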
TL;DR : Always do it in MongoDB
If your database is on a different host than your app server, doing the work on the database side means far less data has to be transferred, wasting less bandwidth and time.
The actual transfer of data can be costly and time-consuming. Imagine if every time you wanted to do an inventory count you shipped all your items to another warehouse.
Also you have to factor in how things will scale.
With MongoDB you will typically want at least one replica of your data anyway, and that adds capacity for read-based tasks.
With Node you probably won't need to add a second server for a good while, given how well it scales. Adding an intensive task to it could force you to expand the number of outward-facing Node servers sooner.
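On the replica point above, a minimal sketch of spreading reads across a replica set with pymongo; hosts, database, and collection are placeholders:

```python
from pymongo import MongoClient

# Hypothetical replica-set connection; secondaryPreferred sends reads to a
# secondary when one is available, spreading read load across the set.
client = MongoClient(
    "mongodb://host1,host2,host3/?replicaSet=rs0",
    readPreference="secondaryPreferred",
)
coll = client["inventory"]["items"]   # hypothetical db/collection
print(coll.count_documents({"warehouse": "A"}))
```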
I have a relational database with about 300M customers and their attributes from several perspectives (360).
To perform some analytics, I intend to extract the data into MongoDB in order to have a 'flat' representation that is better suited to applying data-mining techniques.
Would that make sense? Why?
Thanks!
No.
It's not storage that would be the concern here; it's your flattening strategy.
How and where you store the flattened data is a secondary concern; note that MongoDB is a document database and is not inherently flat anyway.
Once you have your data in a shape that is suitable for your analytics, then look at storage strategies. MongoDB might be suitable, or you might find that something with easy MapReduce-style functionality would be better for the analysis (HBase, for example).
It may make sense. One thing you can do is set up MongoDB in a horizontal scale-out configuration. Then, with the right data structures, you can run queries in parallel across the shards (which it does for you automatically):
http://www.mongodb.org/display/DOCS/Sharding
This could make real-time analysis possible when it otherwise wouldn't have been.
If you choose your data models right, you can speed up your queries by avoiding joins altogether (again, a win for horizontal scale).
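As a small illustration of the "avoid joins" point, here is a hypothetical denormalized order document and the single-collection query it enables:

```python
from pymongo import MongoClient

coll = MongoClient()["shop"]["orders"]   # hypothetical db/collection

# Customer and line items are embedded, so the query below needs no join.
coll.insert_one({
    "_id": 1001,
    "customer": {"id": 42, "name": "Acme Corp", "region": "EU"},
    "items": [
        {"sku": "A-1", "qty": 3, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
    "total": 54.47,
})

# One query over one collection: "EU orders containing SKU A-1".
for doc in coll.find({"customer.region": "EU", "items.sku": "A-1"}):
    print(doc["_id"], doc["total"])
```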
Finally, there is plenty you can do with map/reduce on your data too.
http://www.mongodb.org/display/DOCS/MapReduce
One caveat to be aware of is that there is nothing like SQL Reporting Services for MongoDB, AFAIK.
I find MongoDB's mapreduce to be slow (however they are working on improving it, see here: http://www.dbms2.com/2011/04/04/the-mongodb-story/ ).
Maybe you can use Infobright's community edition for analytics? See here: http://www.infobright.com/Community/
A relational DB like PostgreSQL can do analytics too (AFAIK MySQL can't do a hash join, but other relational DBs can).