I need to implement a big data storage + processing system.
The data grows on a daily basis (up to about 50 million rows/day); each row is a very simple JSON document of about 10 fields (date, numbers, text, IDs).
The data should then be queryable online (if possible), with arbitrary groupings on some of the fields of the document (date range queries, IDs, etc.).
I'm thinking of using a MongoDB cluster to store all this data, building indices on the fields I need to query, and then processing the data in an Apache Spark cluster (mostly simple aggregations + sorting). Maybe use spark-jobserver to build a REST API around it.
I have concerns about MongoDB's scalability (i.e. storing 10b+ rows), its throughput (quickly sending 1b+ rows to Spark for processing), and its ability to maintain indices in such a large database.
In contrast, I have considered Cassandra or HBase, which I believe are more suitable for storing large datasets, but which offer less query performance, which I'd ultimately need if I am to provide online querying.
1 - Is MongoDB + Spark a proven stack for this kind of use case?
2 - Is MongoDB scalability (storage + query performance) effectively unbounded?
Thanks in advance.
As mentioned previously there are a number of NoSQL solutions that can fit your needs. I can recommend MongoDB for use with Spark*, especially if you have operational experience with large MongoDB clusters.
There is a white paper from MongoDB about turning analytics into real-time queries. Perhaps more interesting is the blog post from Eastern Airlines about their use of MongoDB and Spark, and how it powers their 1.6 billion flight searches a day.
Regarding the data size, managing a cluster with that much data in MongoDB is pretty normal. The performance-critical part for any solution will be quickly sending 1b+ documents to Spark for processing. Parallelism and taking advantage of data locality are key here. Your Spark job will also need to be written to take advantage of that parallelism - shuffling lots of data is expensive.
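For reference, reading from MongoDB into Spark and running the kind of aggregation you describe looks roughly like the sketch below. It is a minimal PySpark example against the connector's DataFrame API; the connection URI, database/collection, and field names are placeholders for whatever your documents actually contain.

```python
# Minimal PySpark sketch using the MongoDB Spark Connector.
# The URI, collection, and field names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("daily-aggregation")
         # Point the connector at the collection holding the JSON documents.
         .config("spark.mongodb.input.uri",
                 "mongodb://mongos-host:27017/analytics.events")
         .getOrCreate())

# The connector splits the collection into Spark partitions, so the read
# is parallelized across the cluster rather than funneled through one node.
events = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# A simple aggregation + sort, e.g. counts per id within a date range.
daily = (events
         .filter((F.col("date") >= "2016-01-01") & (F.col("date") < "2016-02-01"))
         .groupBy("some_id", "date")
         .count()
         .orderBy(F.desc("count")))

daily.show(20)
```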
Disclaimer: I'm the author of the MongoDB Spark Connector and work for MongoDB.
Almost any NoSQL database can fit your needs when it comes to storing the data, and you are right that MongoDB offers some extras over HBase and Cassandra when it comes to querying that data. But Elasticsearch is also a proven solution for high-speed storage and retrieval/querying of data (metrics).
Here is some more information on using Elasticsearch with Spark:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
I would actually use the complete ELK stack, since Kibana lets you explore the data easily and visualize it (charts etc.).
I bet you already have Spark, so I would recommend installing the ELK stack on the same machines/cluster to test whether it suits your needs.
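If you want to try the Elasticsearch route with Spark, the read path looks roughly like the sketch below. It assumes the elasticsearch-hadoop connector is on the Spark classpath; the node address, index name, and field names are placeholders.

```python
# Minimal PySpark sketch reading from Elasticsearch via the
# elasticsearch-hadoop Spark SQL data source.
# Node address, index, and field names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-spark-test").getOrCreate()

# Filters on indexed fields are pushed down to Elasticsearch, so only
# matching documents are shipped over the wire to Spark.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")
      .load("events"))

df.filter("some_id = 42").groupBy("date").count().show()
```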
Related
I am trying to figure out which DB to use for a project with the following requirements,
Requirements:
Scalability should be high, availability should be high
Data format is a JSON document, several MBs in size
Query capabilities are my least concern; it's more of a key-value use case
High performance / low latency
I have considered MongoDB, Cassandra, Redis, Postgres (jsonb), a few other document-oriented DBs, and embedded databases (a small footprint would be a plus).
Please help me find out which DB will be the best choice.
I won't need document/row-wise comparison queries at all; at most I will need to pick a subset of fields from a document. What I am looking for is a lightweight DB with a small footprint, low latency, and high scalability. Very limited query capabilities are acceptable. Should I be choosing an embedded DB? What are the points to consider here?
Thanks for the help!
If you use documents (JSON), use a document database, especially if the documents differ in structure.
PostgreSQL does not scale horizontally out of the box. Have a look at CockroachDB if you like.
Cassandra can do key-value at scale, as can Redis, but neither is really a document database.
I would suggest MongoDB or CouchDB; which one is a good match depends on your needs. In CAP terms, MongoDB favors consistency and partition tolerance, while CouchDB favors partition tolerance and availability.
If you can live with some limits on querying and want high availability, try out CouchDB.
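Whichever document store you pick, the "subset pick from the document" requirement maps onto a projection on a key lookup. A minimal sketch with pymongo, where the database, collection, and field names are made up:

```python
# Key-value style access with a field projection in MongoDB (pymongo).
# Database, collection, and field names are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
docs = client.mydb.documents

# Look up by key and return only a small subset of the multi-MB document,
# so the full payload never leaves the server.
doc = docs.find_one(
    {"_id": "some-key"},
    projection={"metadata": 1, "status": 1, "_id": 0},
)
print(doc)
```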
At my company we are about to store a large amount of geolocation data coming from mobile GPS devices.
The requirements are:
1) To keep this data in our database for at least six months (history)
2) Clients can perform search queries in real time, which means we need to run some spatial functions on the data
3) To be able to analyze the data and the paths of points, so that we can compute good averages over the older points in these six months
We are thinking about the Hadoop file system (HDFS) for storing the data and MapReduce for analyzing it. For real-time queries we are considering Elasticsearch (spatial functions and indexes), MongoDB, or Cassandra.
What do you think the approach should be in this scenario?
Yes, MongoDB can be used for real-time analytics, as it provides built-in geospatial queries such as "$geoNear". Have a look at geospatial data and how it is addressed by MongoDB at this link: http://blog.mlab.com/2014/08/a-primer-on-geospatial-data-and-mongodb/
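For a rough idea, a 2dsphere index plus a $near query covers the "points near a client" case. A sketch with pymongo follows, where the collection, field names, and coordinates are placeholders.

```python
# Sketch of a MongoDB geospatial query with pymongo: a 2dsphere index on the
# GPS point plus a $near query. Collection and field names are placeholders.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
points = client.tracking.gps_points

# A 2dsphere index enables spherical-geometry queries on GeoJSON points.
points.create_index([("location", GEOSPHERE)])

# Find points within 1 km of a given coordinate (longitude, latitude).
nearby = points.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [23.7275, 37.9838]},
            "$maxDistance": 1000,  # metres
        }
    }
}).limit(10)

for p in nearby:
    print(p)
```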
I am looking for a NoSQL technology that meets the requirement of being able to process geospatial as well as time queries on a large scale with decent performance. I want to batch-process several hundred GBs to TBs of data with the proposed NoSQL technology along with Spark. This will obviously be run on a cluster with several nodes.
Types of queries I want to run:
"normal" queries for attributes like "field <= value"
Basic geospatial queries like querying all data that lies within a bbox.
Time queries like "date <= 01.01.2011" or "time >= 11:00 and time <= 14:00"
a combination of all three query types (something like "query all data where the location is within a bbox, on date 01.01.2011, and time <= 14:00 and field_x <= 100")
I am currently evaluating which technologies are possible for my use case, but I'm overwhelmed by the sheer number of technologies available. I have thought about popular technologies like MongoDB and Cassandra. Both seem applicable to my use case (Cassandra only with Stratio's Lucene index), but there might be a different technology that works even better.
Is there any technology that will heavily outperform others based on these requirements?
I want to batch-process several hundred GBs to TBs of data
That's not really a Cassandra use case. Cassandra is optimized first and foremost for write performance; if you have a really huge volume of writes, Cassandra could be a good option for you. Cassandra isn't a database for exploratory queries; it is a database for known queries. At the read level, Cassandra is optimized for sequential reads: it can only query data sequentially, in the order it was stored. It's possible to work around this, but it's not recommended. A huge amount of data, combined with the wrong data model, can be a problem in Cassandra. Maybe a Hadoop-based database system is a better option for you.
Time queries like "date <= 01.01.2011" or "time >= 11:00 and time <= 14:00"
Cassandra is really good for time series data.
"normal" queries for attributes like "field <= value"
If you know the queries before you model your database, Cassandra is also a good choice.
a combination of all three query types (something like "query all data where the location is within a bbox, on date 01.01.2011, and time <= 14:00 and field_x <= 100")
Cassandra could be a good solution. Why "could"? As I said: you have to know these queries before you create your tables. If you know that you will have thousands of queries where you need a time range and a location (city, country, continent, etc.), it is a good solution for you.
time queries on a large scale with decent performance.
Cassandra will have the best performance in this use case, because the data is already stored in the order you need. MongoDB is a nice replacement for MySQL-style use cases if you need more scale and flexibility and you care about consistency, although scaling MongoDB is not as simple as scaling Cassandra. Cassandra offers eventual consistency, is scalable, and treats performance as the priority. MongoDB also supports relations; Cassandra does not: in Cassandra everything is denormalized, because performance comes first.
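To make the "known queries first" point concrete, here is a rough sketch of a query-first Cassandra table for the time-range case, using the Python driver; the keyspace, table, and column names are invented for illustration.

```python
# Query-first Cassandra modeling sketch (Python driver). The keyspace,
# table, and column names are invented for illustration.
from datetime import date, datetime
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("geo")

# Partition by (region, day) so a time-range query for one region on one day
# hits a single partition; cluster by timestamp so the rows come back in
# time order (a sequential read).
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_region_day (
        region   text,
        day      date,
        ts       timestamp,
        field_x  int,
        lat      double,
        lon      double,
        PRIMARY KEY ((region, day), ts)
    )
""")

# The query that was planned for up front is cheap: one partition, one ordered slice.
rows = session.execute(
    """
    SELECT ts, field_x, lat, lon
    FROM readings_by_region_day
    WHERE region = %s AND day = %s AND ts <= %s
    """,
    ("eu-west", date(2011, 1, 1), datetime(2011, 1, 1, 14, 0)),
)
for row in rows:
    print(row)
```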
Sharding provides scalable throughput and storage, and scalable throughput and storage are kind of a paradise for analytics. However, there is a huge trade-off that I am thinking about.
If I use a hashed shard key:
- writes will be very scalable
- however, if I am doing a sequential read of the facts, it will be expensive, since it has to access every shard
If I use a ranged shard key, e.g. on field A:
- writes might be scalable, as long as we are not using a timestamp field as the key
- however, sequential reads will not be scalable if we are not filtering on field A
In my opinion, it won't be very scalable as a data warehouse. However, I have no idea what other approach would make a MongoDB data warehouse scalable.
Is MongoDB sharding really suitable for making a data warehouse scalable?
Erm, if you read a lot of data, it is most likely that you will exhaust the physical read capacity of one server. You want the reads to be done in parallel - unless I have a very wrong understanding of data warehousing and the limitations of the HDDs and SSDs around nowadays.
What you would do first is select a subset of the data you want to analyze, right? If you have a lot of data, it makes sense for that matching to be done in parallel. Once the subset is selected, the further analysis is applied, right? This is exactly what MongoDB does in the aggregation framework: an early $match is run on all of the affected shards, and the results are sent to the primary shard for that database, where the further steps of the aggregation pipeline are applied.
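As a rough sketch of what that looks like from the client side (pymongo; the database, collection, and field names are placeholders):

```python
# Aggregation with an early $match on a sharded collection (pymongo).
# Database, collection, and field names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
facts = client.dwh.facts

pipeline = [
    # The $match is executed in parallel on every shard that owns matching
    # chunks (ideally covered by the shard key or an index)...
    {"$match": {"order_date": {"$gte": "2016-01-01", "$lt": "2016-02-01"}}},
    # ...and only the matching documents are grouped and merged afterwards.
    {"$group": {"_id": "$customer_id", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]

for doc in facts.aggregate(pipeline, allowDiskUse=True):
    print(doc)
```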
I've read in the Cassandra docs that it's recommended to stick with randomly generated GUIDs for IDs to prevent hotspots, instead of implementing my own IDs for each document. From what I know that is much slower (see this presentation). How can Cassandra help me achieve very high write performance while still following this guideline?
PlayOrm uses its own generator that keeps the IDs small for Cassandra. The important thing is that your generator be random, that is all, so that it provides a good distribution of keys. We are doing 10,000 writes/second with PlayOrm and Cassandra ourselves, and that is while the data is being indexed so we can query it using PlayOrm's Scalable SQL.
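For reference, plain randomly generated UUIDs give the same even key distribution without any extra library. A tiny sketch with the Python driver, where the keyspace, table, and column names are made up:

```python
# Tiny sketch: randomly generated UUID keys distribute writes evenly across
# the ring, so no single node becomes a write hotspot.
# Keyspace, table, and column names are made up for illustration.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("app")

session.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id      uuid PRIMARY KEY,
        payload text
    )
""")

insert = session.prepare("INSERT INTO documents (id, payload) VALUES (?, ?)")

# uuid.uuid4() is random, so consecutive inserts hash to different token
# ranges and land on different nodes.
for i in range(1000):
    session.execute(insert, (uuid.uuid4(), "doc-%d" % i))
```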