At my company we are about to store a large amount of geolocation data coming from mobile GPS.
The requirements are:
1) Keep this data in our database for at least six months (history).
2) Clients can perform search queries in real time, which means we need to run spatial functions on the data.
3) Analyze the data and the paths of points so that we can compute a good average of the older points within these six months.
We are thinking about the Hadoop file system (HDFS) to store the data and MapReduce to analyze it. For real-time queries we are thinking about Elasticsearch (spatial functions and index), MongoDB, or Cassandra.
What do you think the approach should be in this scenario?
Yes, MongoDB can be used for real-time analytics, since it provides built-in "geoNear"-based geospatial queries. Have a look at geospatial data and how MongoDB handles it at this link: http://blog.mlab.com/2014/08/a-primer-on-geospatial-data-and-mongodb/
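As a rough illustration of what a "geoNear" query looks like, here is a minimal sketch that only builds the aggregation pipeline as a plain Python structure. The function name, the `distance_m` output field, and the sample coordinates are assumptions made for this example; they are not from the linked post.

```python
def geo_near_pipeline(lng, lat, max_meters, limit=10):
    """Build an aggregation pipeline that returns the documents nearest a point."""
    return [
        {
            "$geoNear": {
                # GeoJSON points are [longitude, latitude], in that order.
                "near": {"type": "Point", "coordinates": [lng, lat]},
                # Hypothetical name for the field that receives the computed distance.
                "distanceField": "distance_m",
                # With a GeoJSON point, the distance limit is given in meters.
                "maxDistance": max_meters,
                "spherical": True,
            }
        },
        {"$limit": limit},
    ]


# Sample coordinates are placeholders chosen for the example.
pipeline = geo_near_pipeline(-118.1272, 33.8590, max_meters=500)
print(pipeline)
```

With PyMongo, a pipeline like this would typically be passed to `collection.aggregate(pipeline)`, assuming the collection has a geospatial index on its location field.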
Related
Scenario:
Many, many devices
Each device sends values every few seconds
We want to store the values sent
I can use MongoDB with time series collections (https://www.mongodb.com/docs/manual/core/timeseries-collections/), but this needs a large MongoDB database. Since the objective is only to save and read these values as historical data, never to modify them, I have thought about using Elasticsearch instead: for example, take a VPS with a lot of disk space and run Elasticsearch on it (a dedicated MongoDB deployment is more expensive than this option).
MongoDB vs. Elasticsearch in this scenario?
IMO, Elasticsearch seems to be the better choice for the reasons below; I don't know whether the same is available in MongoDB.
Since you have time series data, you can easily roll over your indices using index lifecycle management (ILM). That also saves cost: after some time (depending on your use case), you can move your data to cheaper storage or even delete it automatically.
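To make the rollover idea concrete, here is a minimal sketch of an ILM policy body built as a Python dictionary. The one-day rollover and 180-day retention values are assumptions for illustration only; pick them according to your own use case.

```python
import json


def ilm_policy(rollover_after="1d", delete_after="180d"):
    """Build an ILM policy body: roll over hot indices, delete old ones."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start writing to a new index once the current one is this old.
                        "rollover": {"max_age": rollover_after}
                    }
                },
                "delete": {
                    # Indices older than this are removed automatically.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


print(json.dumps(ilm_policy(), indent=2))
```

A body like this is what you would send to the `_ilm/policy/<policy-name>` endpoint; a warm or cold phase could be added in the same way to move older indices to cheaper storage.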
You can also visualise your data stored in Elasticsearch using Kibana.
I need to implement a big data storage + processing system.
The data grows daily (at most about 50 million rows/day). Each row is a very simple JSON document of about 10 fields (dates, numbers, text, IDs).
The data could then be queried online (if possible) with arbitrary groupings on some of the document's fields (date range queries, IDs, etc.).
I'm thinking of using a MongoDB cluster to store all this data, building indices for the fields I need to query on, and then processing the data in an Apache Spark cluster (mostly simple aggregations and sorting). Maybe I would use Spark-jobserver to build a REST API around it.
I have concerns about MongoDB's scaling possibilities (i.e. storing 10b+ rows), its throughput (quickly sending 1b+ rows to Spark for processing), and its ability to maintain indices in such a large database.
As an alternative, I am considering Cassandra or HBase, which I believe are more suitable for storing large datasets but offer less query performance, which I would ultimately need if I am to provide online querying.
1 - Is MongoDB + Spark a proven stack for this kind of use case?
2 - Is MongoDB's scalability (storage + query performance) unbounded?
Thanks in advance.
As mentioned previously there are a number of NoSQL solutions that can fit your needs. I can recommend MongoDB for use with Spark*, especially if you have operational experience with large MongoDB clusters.
There is a white paper about turning analytics into real-time queries from MongoDB. Perhaps more interesting is the blog post from China Eastern Airlines about their use of MongoDB and Spark and how it powers their 1.6 billion flight searches a day.
Regarding the data size, managing a cluster with that much data in MongoDB is pretty normal. The performance challenge for any solution will be quickly sending 1b+ documents to Spark for processing. Parallelism and taking advantage of data locality are key here. Your Spark algorithm will also need to be written to take advantage of that parallelism, since shuffling lots of data is expensive in time.
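For a rough sense of the volume involved, here is a back-of-envelope calculation using the figures from the question. The 200-byte average document size is purely an assumption for illustration; measure your own documents before relying on the storage figure.

```python
ROWS_PER_DAY = 50_000_000      # upper bound quoted in the question
TARGET_ROWS = 10_000_000_000   # the "10b+ rows" mentioned in the question
ASSUMED_BYTES_PER_DOC = 200    # assumption, not a figure from this thread


def days_to_reach(target_rows, rows_per_day):
    """Days of ingestion needed to accumulate target_rows."""
    return target_rows / rows_per_day


def avg_writes_per_second(rows_per_day):
    """Average insert rate if the load were spread evenly over a day."""
    return rows_per_day / 86_400


def raw_size_tb(rows, bytes_per_doc):
    """Uncompressed data size in terabytes, ignoring indexes and replication."""
    return rows * bytes_per_doc / 1e12


print(days_to_reach(TARGET_ROWS, ROWS_PER_DAY))            # 200.0
print(round(avg_writes_per_second(ROWS_PER_DAY)))          # 579
print(raw_size_tb(TARGET_ROWS, ASSUMED_BYTES_PER_DOC))     # 2.0
```

So at the maximum rate, 10 billion rows is roughly 200 days of data, and the average write rate is modest; peak rates, indexes, and replication will push the real numbers higher.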
Disclaimer: I'm the author of the MongoDB Spark Connector and work for MongoDB.
Almost any NoSQL database can fit your needs when storing data. And you are right that MongoDB offers some extras over HBase and Cassandra when it comes to querying the data. But Elasticsearch is a proven solution for high-speed storage and retrieval/querying of data (metrics).
Here is some more information on using elasticsearch with Spark:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
I would actually use the complete ELK stack, since Kibana will let you explore the data easily with its visualization capabilities (charts etc.).
I bet you already have Spark, so I would recommend to install the ELK stack on the same machine/cluster to test if it suits your needs.
Very newbie question, I assume. I started playing around with ES and MongoDB, and I'm trying to move data out of a SQL DB as an exercise.
I can't help but wonder, what data would I store in Mongo and what in ES? Can I store everything in ES? Assume big data load, as in price trends.
To begin with, MongoDB is a so-called document store. The key feature of this concept is that it stores schema-dynamic documents:
Each record in a document collection can have a different structure
Field types can differ from record to record
Document properties (columns) can have nested structures
It's not schema-free, it's schema-dynamic (or flexible schema). To get into the concept, you can find a great tutorial here: https://docs.mongodb.org/manual/data-modeling/
MongoDB is the most widely used document store - please, see http://db-engines.com/en/system/MongoDB.
It has "drivers" for most programming languages, enabling rapid development. You can dive into Mongo quite quickly: there are a lot of tutorials, and the official MongoDB University offers great courses for developers and DBAs.
In short, it supports indexing, aggregations, filters, load balancing, sharding, replication (replica sets) etc. Data is stored and transferred in the BSON format (http://bsonspec.org/).
A good comparison of MongoDB vs RDBMS concepts can be found in this official reference: https://docs.mongodb.org/manual/reference/sql-comparison/
What is it good for? It enables agile development, where schema can change over a period of time, especially form based data, user generated content, location based data, user profiles and more. It also enables storing large documents (up to 16MB each).
Now, Elasticsearch is not a database. It is a search engine with some great aggregation capabilities (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html - make sure you check out Metrics, Buckets and Pipeline aggregations).
A typical RDBMS is not designed for full-text searches or loosely structured data. For these workloads, queries in ES can return results much faster than a typical database (e.g. seconds in an RDBMS compared to milliseconds in ES). Remember that the key is to design your indexes well, and that they will take up disk space.
There is a very detailed article comparing both in regards to performance, you may find it useful: http://blog.quarkslab.com/mongodb-vs-elasticsearch-the-quest-of-the-holy-performances.html.
You can actually use both successfully - MongoDB will store your data, while ES will be used as the serving layer (search, aggregations etc.).
There is a big difference between mongodb and ES.
MongoDB is a database that was designed to store data and query it, while Elasticsearch is a Lucene-based indexer in which you should only index data for searches; you should not trust Elasticsearch as your primary data store. Even though you can use store:true in Elasticsearch, it is not recommended, and I wouldn't rely on it for important data.
I'm evaluating a database for my next project. I want to store all the cities in the world (2.5 million) and save a weather forecast for every city every day. So you can imagine that the dataset will get quite big fast.
I also need to perform geo queries - get me the city and temperature for this day in this bounding box.
So far I've looked at hbase and couchdb. Hbase looked interesting, but the hardware requirement for production is too expensive for me (a presentation said you need 5 separate servers).
I'd like to keep the costs as low as possible, it's my personal project.
So what other options do I have? Can MongoDB handle this amount of data? Anything else?
TL;DR
The requirements are
Large amount of data
Fast bounding box queries
Low/cheap hardware requirements
Optimized for reads, but needs to handle inserts of 2.5 million records daily
Yeah, you can go with MongoDB. MongoDB was designed for scaling (sharding, replication). In addition, MongoDB supports geospatial search.
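As a minimal sketch of the "city and temperature for this day in this bounding box" query, the snippet below builds a MongoDB filter with `$geoWithin` and `$box`, and includes a plain-Python check that shows what the box test means. The field names `location` and `day`, and the sample coordinates, are assumptions for this example; `$box` expects legacy `[longitude, latitude]` pairs.

```python
from datetime import date


def forecast_in_box_filter(south_west, north_east, day):
    """Build a find() filter: forecasts for `day` inside a bounding box.

    Corners are (longitude, latitude) pairs: bottom-left first, then top-right.
    """
    return {
        "location": {"$geoWithin": {"$box": [list(south_west), list(north_east)]}},
        "day": day.isoformat(),  # hypothetical field holding the forecast date
    }


def in_box(point, south_west, north_east):
    """Plain-Python containment check (ignores boxes crossing the antimeridian)."""
    lng, lat = point
    return (south_west[0] <= lng <= north_east[0]
            and south_west[1] <= lat <= north_east[1])


# Placeholder box and date, chosen only for the example.
query = forecast_in_box_filter((13.0, 52.3), (13.8, 52.7), date(2024, 1, 15))
print(query)
print(in_box((13.405, 52.52), (13.0, 52.3), (13.8, 52.7)))  # True
```

Adding a geospatial index on the location field, plus the date in a compound index, is what keeps a query like this fast as the collection grows.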
We are looking at using a NoSQL database system for a large project. Currently, we have read a bit about MongoDB and Cassandra, though we have absolutely no experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL, but the NoSQL (key/value store) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically we have tens of thousands of devices that are reporting:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading char(2), speed (int)
Every minute. So, at peak times we need to be able to process hundreds of writes a second.
Then, we also have users, that are querying this information in the form of, give me all messages from device_id 1234 for the last day, or last week. Also, users do other querying like, give me all messages from device_1234 where speed is greater than 50 and date is today.
So, our initial thoughts are that MongoDB or Cassandra are going to allow us to scale this much more easily than a traditional database.
A document or value in MongoDB or Cassandra for us, might look like:
{
device_id: 1234,
location: [-118.12719739973545, 33.859012351859946],
datetime: 1282274060,
heading: "N",
speed: 34
}
Which system do you guys recommend? Thanks greatly.
MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
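As an example, to find the 10 closest devices to that location, create a geospatial index once and then query with $near:
db.devices.createIndex({location: "2d"})  // $near needs a geospatial index on the field
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)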
I have a post on a location-based app using MongoDB, just like the one you described. MongoDB, with its strong query and index support, might be a better choice for you. Just like Cassandra, MongoDB has partitioning and replication for scaling reads and writes. Their underlying architectures are very different.
Although you have not mentioned any location based queries, if you are interested in queries like "give me all the devices within the radius r of location l and between time t1 and t2", you will find MongoDB's geospatial query and indexing extremely useful.
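A query of that shape ("within radius r of location l, between t1 and t2") could be sketched as the filter below. It reuses the `location` and `datetime` fields from the sample document in the question and assumes `datetime` stays an epoch-seconds integer. The function name and the 5 km / one hour values are made up for the example.

```python
EARTH_RADIUS_KM = 6378.1  # approximate equatorial radius, used to convert km to radians


def within_radius_and_time(lng, lat, radius_km, t1, t2):
    """Build a find() filter for points within radius_km of (lng, lat) during [t1, t2]."""
    return {
        "location": {
            "$geoWithin": {
                # $centerSphere takes the center and a radius expressed in radians.
                "$centerSphere": [[lng, lat], radius_km / EARTH_RADIUS_KM]
            }
        },
        "datetime": {"$gte": t1, "$lte": t2},
    }


t2 = 1282274060       # timestamp from the sample document
t1 = t2 - 3600        # one hour earlier
query = within_radius_and_time(-118.12719739973545, 33.859012351859946, 5, t1, t2)
print(query)
```

Both conditions go into a single filter, so the database can combine the geospatial and time constraints in one query.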
I have done some work with mongodb and geospatial data, but not on the scale mentioned above. The geospatial searches are very fast, much more so than mysql.
I suggest looking into mongodb's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding across device identifier may be a good way to deal with the write volume. If you're interested in proximity of events then sharding across lat/lng may be more appropriate.
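To illustrate why sharding on the device identifier spreads the write load, here is a small, self-contained simulation. It is not MongoDB's actual hashing or chunk-balancing logic; the MD5-based `shard_for` function and the four-shard setup are assumptions used only to show the idea.

```python
import hashlib
from collections import Counter


def shard_for(device_id, num_shards):
    """Map a device id to a shard using a stable hash (illustrative only)."""
    digest = hashlib.md5(str(device_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Simulate one report from each of 10,000 hypothetical devices across 4 shards.
counts = Counter(shard_for(device_id, 4) for device_id in range(10_000))
for shard, n in sorted(counts.items()):
    print(f"shard {shard}: {n} writes")
```

Every report from the same device lands on the same shard, which suits "all messages from device 1234" queries, while different devices spread out roughly evenly. Sharding on location would instead group nearby events together, at the risk of hot spots in busy areas.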
jack
Go with MongoDB for geolocation search. Release 2.4 improves on the core geo features. Lots of big sites use it for geolocation search.
You might consider using ElasticSearch. ES keeps the JSON of the original document stored, along with all the indexed fields. JSON can be instantiated into any modern language's variables/arguments. In Java, one could even disable that and store native Java persistence data in a field. After search retrieval, just loop through and instantiate a collection of the original object types.
Using ElasticSearch gives you trie indexes for high-speed numeric range queries; obviously you also get full-text searches of every flavor and geographic bounding box queries, all combinable with AND or OR filtering. Date searches are also native (although Java's handling of dates sucks, so I switched to BIG INT representations of timestamps to represent dates).
UNLIKE some past and maybe present NoSQL solutions, the geographic indexing and querying is PART of any query and no extra steps are required. I.E., one MongoDB solution in the recent past required a geospatial search to collect conforming document IDs, then you used those IDs inside another query and searched within those for your other criteria. In reality, that's what happens in all solutions anyways, but it's much faster and cached in ElasticSearch.
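As a sketch of what "part of any query" looks like, the request body below combines a geographic bounding box with ordinary range filters in a single Elasticsearch `bool` query. It reuses the `location`, `speed` and `datetime` field names from the question, and assumes `location` is mapped as a `geo_point`. The function name and sample values are made up for the example.

```python
def es_geo_query(top_left, bottom_right, min_speed, since_epoch):
    """Build a search body: inside the box AND speed > min_speed AND datetime >= since_epoch.

    top_left and bottom_right are (lat, lon) pairs.
    """
    return {
        "query": {
            "bool": {
                # Every clause under "filter" must match (logical AND).
                "filter": [
                    {
                        "geo_bounding_box": {
                            "location": {
                                "top_left": {"lat": top_left[0], "lon": top_left[1]},
                                "bottom_right": {"lat": bottom_right[0], "lon": bottom_right[1]},
                            }
                        }
                    },
                    {"range": {"speed": {"gt": min_speed}}},
                    {"range": {"datetime": {"gte": since_epoch}}},
                ]
            }
        }
    }


body = es_geo_query((34.0, -118.3), (33.7, -118.0), min_speed=50, since_epoch=1282274060)
print(body)
```

The geo filter and the other criteria are evaluated together in one request, with no separate step to collect matching document IDs first.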