I am looking for a NoSQL technology that can handle geospatial as well as time queries on a large scale with decent performance. I want to batch-process several hundred GBs to TBs of data with the proposed NoSQL technology along with Spark. This will obviously be run on a cluster with several nodes.
Types of queries I want to run:
"normal" queries for attributes like "field <= value"
Basic geospatial queries, like querying all data that lies within a bbox.
Time queries like "date <= 01.01.2011" or "time >= 11:00 and time <= 14:00"
a combination of all three query types (something like "query all data where location is within a bbox, date is 01.01.2011, time <= 14:00, and field_x <= 100")
I am currently evaluating which technologies are suitable for my use case, but I'm overwhelmed by the sheer number of technologies available. I have thought about popular technologies like MongoDB and Cassandra. Both seem applicable for my use case (Cassandra only with Stratio's Lucene index), but there might be a different technology that works even better.
Is there any technology that will heavily outperform others based on these requirements?
I want to batch-process several hundred GBs to TBs of data
That's not really a Cassandra use case. Cassandra is optimized first for write performance; if you have a really huge amount of writes, Cassandra could be a good option for you. Cassandra isn't a database for exploratory queries; it's a database for known queries. At the read level, Cassandra is optimized for sequential reads and can only query data sequentially. It's possible to ignore this, but it's not recommended. With the wrong data model, a huge amount of data can become a problem in Cassandra. Maybe a Hadoop-based database system is a better option for you.
Time queries like "date <= 01.01.2011" or "time >= 11:00 and time <= 14:00"
Cassandra is really good for time series data.
"normal" queries for attributes like "field <= value"
If you know the queries before you model your database, Cassandra is also a good choice.
a combination of all three query types (something like "query all data where location is within a bbox, date is 01.01.2011, time <= 14:00, and field_x <= 100")
Cassandra could be a good solution. Why "could"? As I said: you have to know these queries before you create your tables. If you know that you will have thousands of queries that need a time range and a location (city, country, continent, etc.), it is a good solution for you.
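For illustration, here is a minimal sketch of such a query-first table using the DataStax Python driver. All keyspace, table, and column names are hypothetical; the idea is to put a precomputed geo cell and the day into the partition key, so the time-range predicate becomes a cheap clustering-key scan, while the field_x filter happens client-side (or via a secondary index):

from datetime import date, datetime
from cassandra.cluster import Cluster

# Hypothetical query-first model: partition by (geo_cell, day),
# cluster by timestamp, so time ranges are sequential reads.
session = Cluster(["127.0.0.1"]).connect("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_cell_day (
        geo_cell text,      -- precomputed grid cell / geohash prefix
        day      date,
        ts       timestamp,
        field_x  int,
        lat      double,
        lon      double,
        PRIMARY KEY ((geo_cell, day), ts)
    )
""")

rows = session.execute(
    "SELECT ts, field_x, lat, lon FROM events_by_cell_day "
    "WHERE geo_cell = %s AND day = %s AND ts <= %s",
    ("9q5c", date(2011, 1, 1), datetime(2011, 1, 1, 14, 0)),
)
# field_x is not part of the key, so it is filtered client-side here
hits = [r for r in rows if r.field_x <= 100]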
time queries on a large scale with decent performance.
Cassandra will have the best performance in this use case, because the data is already stored in the order it is needed. MongoDB is a nice replacement for MySQL-style use cases if you need more scale and flexibility and you care about consistency, but scaling MongoDB is not as simple as scaling Cassandra. Cassandra offers eventual consistency, scales well, and puts performance first. MongoDB also has relations; Cassandra does not: in Cassandra everything is denormalized, because performance matters.
I am trying to figure out which DB to use for a project with the following requirements,
Requirements:
scalability should be high, availability should be high
Data format is JSON documents, several MBs in size
Query capabilities are my least concern; it is more of a key-value use case
High performance/ low latency
I considered MongoDB, Cassandra, Redis, Postgres (jsonb), a few other document-oriented DBs, and embedded databases (a small footprint would be a plus).
Please help me find out which DB will be the best choice.
I won't need document/row-wise comparison queries at all; at most I will need to pick a subset of fields from a document. What I am looking for is a lightweight DB with a small footprint, low latency, and high scalability. Very limited query capabilities are acceptable. Should I be choosing embedded DBs? What are the points to consider here?
Thanks for the help!
If you use documents (JSON), use a document database, especially if the documents differ in structure.
PostgreSQL does not scale horizontally. Have a look at CockroachDB if you like.
Cassandra can do key-value at scale, as can Redis, but neither is really a document database.
I would suggest MongoDB or CouchDB; which one is a good match depends on your needs. In CAP terms, MongoDB is consistent and partition tolerant, while CouchDB is available and partition tolerant.
If you can live with some limits on querying and want high availability, try out CouchDB.
I need to implement a big data storage + processing system.
The data increases on a daily basis (about 50 million rows/day at most); the data consists of very simple JSON documents with about 10 fields (dates, numbers, text, IDs).
The data could then be queried online (if possible), with arbitrary groupings on some of the fields of the document (date-range queries, IDs, etc.).
I'm thinking of using a MongoDB cluster to store all this data, building indices on the fields I need to query, and then processing the data in an Apache Spark cluster (mostly simple aggregations + sorting). Maybe use spark-jobserver to build a REST API around it.
I have concerns about MongoDB's scaling capabilities (i.e. storing 10B+ rows), its throughput (quickly sending 1B+ rows to Spark for processing), and its ability to maintain indices in such a large database.
In contrast, I'm considering Cassandra or HBase, which I believe are more suitable for storing large datasets, but offer less query performance, which I'd ultimately need if I am to provide online querying.
1 - Is MongoDB + Spark a proven stack for this kind of use case?
2 - Is MongoDB's scalability (storage + query performance) unbounded?
Thanks in advance.
As mentioned previously there are a number of NoSQL solutions that can fit your needs. I can recommend MongoDB for use with Spark*, especially if you have operational experience with large MongoDB clusters.
There is a white paper from MongoDB about turning analytics into real-time queries. Perhaps more interesting is the blog post from China Eastern Airlines about their use of MongoDB and Spark and how it powers their 1.6 billion flight searches a day.
Regarding the data size: managing a cluster with that much data in MongoDB is pretty normal. The challenging part for any solution will be quickly sending 1B+ documents to Spark for processing. Parallelism and taking advantage of data locality are key here. Also, your Spark algorithm will need to be designed to take advantage of that parallelism; shuffling lots of data is expensive.
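To make the Spark side concrete, here is a minimal, hedged PySpark sketch of reading a collection through the MongoDB Spark Connector. Option names follow the 10.x connector and change between versions, and the connection URI, database, and collection are made up:

from pyspark.sql import SparkSession

# Requires the mongo-spark-connector package on the Spark classpath.
spark = (
    SparkSession.builder
    .appName("mongo-batch")
    # hypothetical connection string pointing at the source collection
    .config("spark.mongodb.read.connection.uri",
            "mongodb://host1,host2/mydb.events")
    .getOrCreate()
)

df = spark.read.format("mongodb").load()

# The simple aggregation + sorting described in the question.
daily = df.groupBy("date").count().orderBy("date")
daily.show()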
Disclaimer: I'm the author of the MongoDB Spark Connector and work for MongoDB.
Almost any NoSQL database can fit your needs for storing data. And you are right that MongoDB offers some extras over HBase and Cassandra when it comes to querying the data. But Elasticsearch is a proven solution for high-speed storage and retrieval/querying of data (metrics).
Here is some more information on using Elasticsearch with Spark:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
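As a rough sketch of what that looks like (the node address and index name are placeholders), elasticsearch-hadoop exposes an index as a Spark DataFrame:

from pyspark.sql import SparkSession

# Requires the elasticsearch-spark jar on the Spark classpath.
spark = SparkSession.builder.appName("es-batch").getOrCreate()

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")  # placeholder node address
    .load("metrics")                       # placeholder index name
)
df.printSchema()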
I would actually use the complete ELK stack, since Kibana will allow you to easily explore the data with its visualization capabilities (charts, etc.).
I bet you already have Spark, so I would recommend installing the ELK stack on the same machines/cluster to test whether it suits your needs.
I'm evaluating a database for my next project. I want to store all the cities in the world (2.5 million) and save a weather forecast for every city, every day. So you can imagine that the dataset will get quite big, fast.
I also need to perform geo queries - get me the city and temperature for this day in this bounding box.
So far I've looked at HBase and CouchDB. HBase looked interesting, but the hardware requirements for production are too expensive for me (a presentation said you need 5 separate servers).
I'd like to keep the costs as low as possible, it's my personal project.
So what other options do I have? Can Mongo handle this amount of data? Anything else?
TL;DR
The requirements are
Large amount of data
Fast bounding box queries
Low/cheap hardware requirements
Optimized for reads, but needs to handle inserting 2.5 million records daily
Yes, you can go with MongoDB. MongoDB was designed for scaling (sharding, replication). In addition, MongoDB supports geospatial search.
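As a hedged illustration (the collection and field names are invented), a 2dsphere index supports exactly the bounding-box-plus-day query from the question in PyMongo:

from datetime import datetime
from pymongo import MongoClient

coll = MongoClient()["weather"]["forecasts"]
coll.create_index([("location", "2dsphere")])

# GeoJSON polygon for the bounding box; the ring must be closed.
bbox = {
    "type": "Polygon",
    "coordinates": [[
        [-119.0, 33.0], [-117.0, 33.0], [-117.0, 35.0],
        [-119.0, 35.0], [-119.0, 33.0],
    ]],
}
for doc in coll.find({
    "location": {"$geoWithin": {"$geometry": bbox}},
    "date": datetime(2011, 1, 1),
}):
    print(doc["city"], doc["temperature"])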
What is an example of data that is "predistilled or aggregated in runtime"? (And why isn't MongoDB very good with it?)
This is a quote from the MongoDB docs:
Traditional Business Intelligence. Data warehouses are more suited to new, problem-specific BI databases. However note that MongoDB can work very well for several reporting and analytics problems where data is pre-distilled or aggregated in runtime -- but classic, nightly batch load business intelligence, while possible, is not necessarily a sweet spot.
Let's take something simple like counting clicks. There are a few ways to report on clicks.
Store the clicks in a single place (file, database table, collection). When somebody wants stats, you run a query on that table and aggregate the results. Of course, this doesn't scale very well, so typically you use...
Batch jobs. Store your clicks as in #1, but summarize them every 5 minutes or so. When people want stats, you query the summary table. Note that "clicks" may have millions of rows, but "summary" may only have a few thousand rows, so it's much quicker to query.
Count the clicks in real-time. Every time there's a click you increment a counter somewhere. Typically this means incrementing the "summary" table(s).
Now most big systems use #2. There are several systems that are very good for this specifically (see Hadoop).
#3 is difficult to do with SQL databases (like MySQL), because there's a lot of disk locking happening. However, MongoDB isn't constantly locking the disk and tends to have much better write throughput.
So MongoDB ends up being very good for such "real-time counters". This is what they mean by predistilled or aggregated in runtime.
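A minimal sketch of that counter pattern in PyMongo (collection and field names are illustrative): every click performs one atomic upsert into a per-minute summary bucket, which is exactly the kind of cheap write described above:

from datetime import datetime, timezone
from pymongo import MongoClient

summary = MongoClient()["analytics"]["click_summary"]

def record_click(page_id: str) -> None:
    # one summary document per page per minute
    bucket = datetime.now(timezone.utc).replace(second=0, microsecond=0)
    summary.update_one(
        {"page_id": page_id, "minute": bucket},
        {"$inc": {"clicks": 1}},  # atomic increment, no read-modify-write
        upsert=True,
    )

record_click("home")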
But if MongoDB has great write throughput, shouldn't it be good at doing batch jobs?
In theory, this may be true and MongoDB does support Map/Reduce. However, MongoDB's Map/Reduce is currently quite slow and not on par with other Map/Reduce engines like Hadoop. On top of that, the Business Intelligence (BI) field is filled with many other tools that are very specific and likely better-suited than MongoDB.
What is an example of data that is "predistilled or aggregated in runtime"?
An example of this can be any report that requires data from multiple collections.
And why isn't MongoDB very good with it?
In document databases you can't do joins, and because of this it's hard to build reports. Usually a report aggregates data from many tables/collections.
And since MongoDB (and document databases in general) are a good fit for data distribution and denormalization, it's better to prebuild reports whenever possible and just display data from that collection at runtime.
For some tasks/reports it is not possible to prebuild the data; in this case MongoDB gives you map/reduce, grouping, etc.
We are looking at using a NoSQL database system for a large project. Currently, we have read a bit about MongoDB and Cassandra, though we have absolutely no experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL, but the NoSQL (key/value store) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically we have tens of thousands of devices that are reporting:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading char(2), speed (int)
Every minute. So, at peak times we need to be able to process hundreds of writes a second.
Then, we also have users that query this information in the form of: give me all messages from device_id 1234 for the last day, or last week. Users also run other queries like: give me all messages from device_id 1234 where speed is greater than 50 and the date is today.
So, our initial thought is that MongoDB or Cassandra will allow us to scale this much more easily than a traditional database would.
A document or value in MongoDB or Cassandra for us, might look like:
{
    device_id: 1234,
    location: [-118.12719739973545, 33.859012351859946],
    datetime: 1282274060,
    heading: "N",
    speed: 34
}
Which system do you guys recommend? Thanks greatly.
MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
As an example, to find the 10 closest devices to a location you can just do
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)
I have a post on a location-based app using MongoDB, just like the one you described. MongoDB, with its strong query and index support, might be a better choice for you. Just like Cassandra, MongoDB has partitioning and replication for scaling reads and writes, but their underlying architectures are very different.
Although you have not mentioned any location-based queries, if you are interested in queries like "give me all the devices within radius r of location l and between time t1 and t2", you will find MongoDB's geospatial querying and indexing extremely useful.
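For illustration, a combined query over the document shape from the question might look like this in PyMongo (the index choice, box coordinates, and time window are assumptions):

from pymongo import MongoClient, GEO2D

coll = MongoClient()["tracking"]["devices"]
coll.create_index([("location", GEO2D)])  # legacy 2d index for [lng, lat] pairs

t2 = 1282274060          # timestamp from the sample document
t1 = t2 - 24 * 60 * 60   # one day earlier
docs = coll.find({
    "location": {"$geoWithin": {"$box": [[-119.0, 33.0], [-117.0, 35.0]]}},
    "datetime": {"$gte": t1, "$lte": t2},
    "speed": {"$gt": 50},
})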
I have done some work with MongoDB and geospatial data, but not at the scale mentioned above. The geospatial searches are very fast, much more so than in MySQL.
I suggest looking into MongoDB's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding on the device identifier may be a good way to deal with the write volume; if you're interested in proximity of events, then sharding on lat/lng may be more appropriate.
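A hedged sketch of enabling that sharding (database and collection names are hypothetical, and the commands must run against a mongos router):

from pymongo import MongoClient

admin = MongoClient("mongodb://mongos-host:27017").admin
admin.command("enableSharding", "tracking")
# hashed shard key spreads the per-device write load across shards
admin.command("shardCollection", "tracking.devices",
              key={"device_id": "hashed"})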
jack
Go with MongoDB for geo-location search. Release 2.4 improved the core geo features. Lots of big sites use it for geolocation search.
You might consider using Elasticsearch. ES keeps the JSON of the original document stored along with all the indexed fields. The JSON can be instantiated into any modern language's variables/objects. In Java, one could even disable that and store native Java persistence data in a field. After search retrieval, just loop through and instantiate a collection of the original object types.
Using Elasticsearch gives you trie indexes for high-speed numeric range queries; obviously you also get full-text searches of every flavor and geographic bounding-box queries, all combinable with AND or OR filtering. Date searches are also native (although Java's handling of dates is so painful that I switched to BIGINT representations of timestamps to represent dates).
UNLIKE some past and maybe present NoSQL solutions, the geographic indexing and querying is PART of any query, and no extra steps are required. For example, one MongoDB solution in the recent past required a geospatial search to collect conforming document IDs, then you used those IDs inside another query and searched within those for your other criteria. In reality, that's what happens in all solutions anyway, but it's much faster and cached in Elasticsearch.
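A hedged sketch of that single-query combination with the Elasticsearch Python client (8.x-style call; the index and field names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(index="events", query={
    "bool": {
        "filter": [
            {"geo_bounding_box": {"location": {
                "top_left":     {"lat": 35.0, "lon": -119.0},
                "bottom_right": {"lat": 33.0, "lon": -117.0},
            }}},
            {"range": {"date":    {"lte": "2011-01-01"}}},
            {"range": {"field_x": {"lte": 100}}},
        ]
    }
})
print(resp["hits"]["total"])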