Cassandra or MongoDB for Our Location-Based Application

We are looking at using a NoSQL database for a large project. We have read a bit about MongoDB and Cassandra, though we have no hands-on experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL Server, but NoSQL (key/value and document stores) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically, we have tens of thousands of devices, each reporting the following every minute:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading char(2), speed (int)
At peak times, we need to be able to process hundreds of writes per second.
Then we also have users querying this information, in forms like: give me all messages from device_id 1234 for the last day, or the last week. Users also run other queries, like: give me all messages from device_id 1234 where speed is greater than 50 and the date is today.
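For illustration, and assuming a "messages" collection shaped like the sample document below (with datetime stored as a Unix timestamp), those queries might look like this in the mongo shell:
// index to support per-device, time-ordered lookups
db.messages.ensureIndex({device_id: 1, datetime: -1})
// all messages from device 1234 for the last day
var oneDayAgo = Math.floor(Date.now() / 1000) - 86400
db.messages.find({device_id: 1234, datetime: {$gte: oneDayAgo}})
// messages from device 1234 in the last day with speed greater than 50
db.messages.find({device_id: 1234, speed: {$gt: 50}, datetime: {$gte: oneDayAgo}})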
So our initial thought is that MongoDB or Cassandra will let us scale this much more easily than a traditional database would.
For us, a document or value in MongoDB or Cassandra might look like:
{
    device_id: 1234,
    location: [-118.12719739973545, 33.859012351859946],
    datetime: 1282274060,
    heading: "N",
    speed: 34
}
Which system do you guys recommend? Thanks greatly.

MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
As an example, to find the 10 devices closest to that location, you can just do:
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)
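Note that $near requires a geospatial index on the queried field; for the legacy [lng, lat] pairs shown above, that would be something like:
db.devices.ensureIndex({location: "2d"})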

I have a post on a location-based app using MongoDB, just like the one you described. MongoDB, with its strong query and index support, may be the better choice for you. Like Cassandra, MongoDB offers partitioning and replication for scaling reads and writes, though their underlying architectures are very different.
Although you have not mentioned any location-based queries, if you are interested in queries like "give me all the devices within radius r of location l, between times t1 and t2", you will find MongoDB's geospatial querying and indexing extremely useful.
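A sketch of such a query, assuming a "messages" collection with a 2dsphere index (collection and variable names are illustrative):
var r = 5000                              // radius in meters
var t1 = 1282270000, t2 = 1282280000      // Unix-timestamp window
db.messages.ensureIndex({location: "2dsphere"})
db.messages.find({
    location: {$geoWithin: {$centerSphere: [[-118.127, 33.859], r / 6378100]}},  // radians = meters / earth radius
    datetime: {$gte: t1, $lte: t2}
})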

I have done some work with MongoDB and geospatial data, though not at the scale mentioned above. The geospatial searches are very fast, much more so than MySQL's.
I suggest looking into MongoDB's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding on the device identifier may be a good way to handle the write volume; if you're interested in the proximity of events, then sharding on lat/lng may be more appropriate.
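As a rough sketch (the database and collection names here are made up), sharding on the device identifier could look like this from the mongo shell, connected to a mongos:
sh.enableSharding("tracking")
// each device's writes land on a predictable shard
sh.shardCollection("tracking.messages", {device_id: 1})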
jack

Go with MongoDB for geo-location search. Release 2.4 improves on the core geo features, and lots of big sites use it for geolocation search.

You might consider using Elasticsearch. ES stores the JSON of the original document along with all the indexed fields, and that JSON can be deserialized into any modern language's objects. In Java, one could even disable that and store native Java persistence data in a field; after retrieving search results, just loop through and instantiate a collection of the original object types.
Elasticsearch gives you trie-based indexes for high-speed numeric range queries, full-text searches of every flavor, and geographic bounding-box queries, all combinable with AND/OR filtering. Date searches are also native (although Java's handling of dates is painful, so I switched to big-integer timestamp representations of dates).
Unlike some past (and maybe present) NoSQL solutions, geographic indexing and querying is part of any query, with no extra steps required. For example, one MongoDB solution in the recent past required a geospatial search to collect matching document IDs, which you then fed into a second query to search within for your other criteria. In reality, that's what happens internally in all solutions anyway, but it's much faster and cached in Elasticsearch.
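As a hedged sketch of such a combined query against a hypothetical "devices" index (Elasticsearch REST API, Node 18+ with built-in fetch, run as an ES module; the location field is assumed to be mapped as geo_point):
const body = {
    query: {
        bool: {
            filter: [
                {geo_bounding_box: {location: {
                    top_left:     {lat: 34.0, lon: -118.5},
                    bottom_right: {lat: 33.5, lon: -118.0}
                }}},
                {range: {speed: {gt: 50}}}   // numeric range filter in the same query
            ]
        }
    }
};
const res = await fetch("http://localhost:9200/devices/_search", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify(body)
});
console.log(await res.json());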

Related

ElasticSearch & Mongo

A very newbie question, I assume. I started playing around with ES and MongoDB, and I'm trying to move data out of a SQL DB as an exercise.
I can't help but wonder: what data should I store in Mongo, and what in ES? Can I store everything in ES? Assume a big data load, as in price trends.
To begin with, MongoDB is a so-called document store. The key feature of this concept is that it stores schema-dynamic documents:
Each record in a document collection can have a different structure
The types of each record's fields can differ
Document properties (columns) can have nested structures
It's not schema-free, it's schema-dynamic (or flexible-schema). To get into the concept, you can find a great tutorial here: https://docs.mongodb.org/manual/data-modeling/
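For example, a minimal sketch in the mongo shell (the collection and field names are made up): two records in the same collection with different structures and a nested property.
db.people.insert({name: "Alice", age: 30})
db.people.insert({
    name: "Bob",
    contact: {email: "bob@example.com", phones: ["555-0100"]}   // nested structure
})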
MongoDB is the most widely used document store; see http://db-engines.com/en/system/MongoDB.
It has "drivers" for most programming languages, enabling rapid development. You can dive into Mongo quite quickly, there are a lot of tutorials and official Mongo University - a great course for developers and DBAs.
In short, it supports indexing, aggregations, filters, load balancing, sharding, replication (replica sets), etc. Data is stored and transferred in the BSON format (http://bsonspec.org/).
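For instance, a minimal aggregation sketch in the mongo shell (collection and field names are made up, keeping with the price-trends example):
// average price per product since a given date, highest first
db.prices.aggregate([
    {$match: {date: {$gte: ISODate("2016-01-01")}}},
    {$group: {_id: "$product", avgPrice: {$avg: "$price"}}},
    {$sort: {avgPrice: -1}}
])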
A good comparison of MongoDB vs RDBMS concepts can be found in this official reference: https://docs.mongodb.org/manual/reference/sql-comparison/
What is it good for? It enables agile development where the schema can change over time, especially for form-based data, user-generated content, location-based data, user profiles, and more. It also allows storing large documents (up to 16 MB each).
Now, Elasticsearch is not a database. It is a search engine with some great aggregation capabilities (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html - make sure you check out Metrics, Buckets and Pipeline aggregations).
A typical RDBMS is not designed for full-text search or loosely structured data. Queries in ES can return results much faster than in a database (e.g., seconds in an RDBMS compared to milliseconds in ES). Just remember that the key is to design your indexes well, and that they will take up disk space.
There is a very detailed article comparing the two in terms of performance, which you may find useful: http://blog.quarkslab.com/mongodb-vs-elasticsearch-the-quest-of-the-holy-performances.html
You can actually use both successfully: MongoDB stores your data, while ES is used as the serving layer (search, aggregations, etc.).
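A minimal sketch of that pattern in Node.js, assuming the official mongodb driver (4.x) and a local Elasticsearch 7.x node; the database, collection, and index names are illustrative:
const {MongoClient} = require("mongodb");

async function saveAndIndex(doc) {
    // MongoDB is the system of record
    const client = new MongoClient("mongodb://localhost:27017");
    await client.connect();
    const {insertedId} = await client.db("app").collection("items").insertOne(doc);
    await client.close();

    // mirror the document into Elasticsearch as the serving layer
    const {_id, ...source} = doc;   // drop the BSON _id before indexing
    await fetch("http://localhost:9200/items/_doc/" + insertedId.toHexString(), {
        method: "PUT",
        headers: {"Content-Type": "application/json"},
        body: JSON.stringify(source)
    });
}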
There is a big difference between MongoDB and ES.
MongoDB is a database designed to store data and query it, while Elasticsearch is a Lucene-based indexer in which you should only index data for search; don't treat Elasticsearch as your system of record. Even though you can use store:true in Elasticsearch, it is not recommended, and I wouldn't rely on it for important data.

Relational or MongoDB in a network-of-locations application?

I'm starting a project where a user can add a location, and locations can be related to each other. It will be a web application, and I'm planning to use Java EE 7 + JPA + WildFly.
Basically, a location will have at least name, text, longitude, and latitude fields. Some other entities will be linked to each location. There will also be queries like: find the nearest x locations to y.
I've encountered MongoDB several times before, but I've always been on another part of the system, so I'm not really sure whether there's a big benefit to using it in this kind of system. The application is not a store, so there's no checkout process and therefore no transactions.
If I decide to use MongoDB, I'm looking at:
http://hibernate.org/ogm/ and
MongoDB geospatial queries
I would like to ask for opinions. Thanks.
From your description, I think MongoDB is a pretty good choice:
Your data model is fairly simple. You might even think of embedding the other entities in the location documents, reducing the number of queries needed to retrieve the entities near a specific point to exactly one:
db.collection.ensureIndex({fieldHoldingGeoJSON: "2dsphere"})
db.collection.find({
    fieldHoldingGeoJSON: {
        $near: {
            $geometry: {
                type: "Point",
                // GeoJSON order is [longitude, latitude]
                coordinates: [longitude, latitude]
            },
            $maxDistance: distanceInMeters
        }
    }
})
Keep in mind, though, that the maximum size of a BSON document is 16 MB. So if there are a lot of related entities, it might be worth actually embedding the locations into the entities instead, eliminating the need for a dedicated locations collection (one can still be created on demand using the aggregation framework). With the locations embedded in the entities, you can scale more or less indefinitely, and all the properties of an entity are stored where they belong (very NoSQL-ish).
Although the second approach introduces redundancy, we are only talking about a few bytes per record; even with 100M records, that would be less than 1 GB, and disk space is cheap.
Also keep in mind that MongoDB comes with powerful features such as (relatively) easy scaling and an awesome aggregation framework, to name a few.
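For example, a dedicated locations collection could be materialized on demand with something like this (collection and field names are illustrative):
db.entities.aggregate([
    {$project: {_id: 0, location: 1}},   // keep only the embedded location
    {$out: "locations"}                  // write the results to a "locations" collection
])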
MongoDB supports queries where you can ask for the "nearest" documents within a radius -- very useful, if that's what you are looking for.
Actually, PostgreSQL has the PostGIS extension, which implements spatial and geographical objects that can be queried using SQL. An example taken from their site:
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';

Can DynamoDB or SimpleDB replace my MongoDB use-case?

I was wondering if DynamoDB or SimpleDB can replace my MongoDB use-case? Here is how I use MongoDB
15k entries, and I add 200 entries per hour
15 columns, each of which is indexed (using ensureIndex)
Half of the columns are integers, the others are text fields (which basically have no more than 10 unique values)
I run about 10k DB reads per hour, and they are super fast with MongoDB right now. It's an online dating site. So the average Mongo query is doing a range search on 2 of the columns (e.g. age and height), and an "IN" search on about 4 columns (e.g. ethnicity is A, B, or C; religion is A, B, or C).
I use limit and skip very frequently (e.g. get me the first 40 entries, the next 40 entries, etc)
I use Perl to read/write
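For reference, the kind of query described above might look like this in the mongo shell (the collection and field names here are illustrative):
db.profiles.find({
    age:       {$gte: 25, $lte: 35},
    height:    {$gte: 160, $lte: 185},
    ethnicity: {$in: ["A", "B", "C"]},
    religion:  {$in: ["A", "B", "C"]}
}).skip(40).limit(40)   // the second page of 40 entries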
I'm assuming you're asking because you want to migrate directly to an AWS-hosted persistence solution? Both DynamoDB and SimpleDB are k/v stores and therefore will not be a "drop-in" replacement for a document store such as MongoDB.
With the exception of the limit/skip requirement (which requires a more k/v-compatible approach), all your functional requirements can easily be met by either of the two solutions you mentioned (although DynamoDB is, in my opinion, the better option), but that's not going to be the main issue. The main issue is moving from a document store, especially one with extensive query capabilities, to a k/v store. You will need to rework your schema, rethink your indexes, work within the constraints of a k/v store, make use of the benefits of a k/v store, etc.
Frankly, if your current solution works and MongoDB feels like a good functional fit, I'd certainly not migrate unless you have very strong non-technical reasons to do so (such as, say, your boss wanting you to ;) )
What would you say is the reason you're considering this move or are you just exploring whether or not it's possible in the first place?
If you are planning to have your complete application on AWS, then you might also consider Amazon RDS (hosted, managed MySQL). It's not clear from your description whether you actually need MongoDB's document model, so considering only the querying capabilities, RDS might come close to what you need.
Going with either SimpleDB or DynamoDB will most probably require you to rethink some of the features built around aggregation queries. As for choosing between SimpleDB and DynamoDB, there are many important differences, but I'd say the most interesting ones from your point of view are:
SimpleDB indexes all attributes
there are lots of tips and tricks you'll need to learn about SimpleDB (see what the guys from Netflix learned while using SimpleDB)
DynamoDB's pricing model is based on actual write/read operations (see my notes about DynamoDB)

Database to handle a huge amount of data

I'm evaluating a database for my next project. I want to store all the cities in the world (2.5 million) and save a weather forecast for every city every day. So you can imagine that the dataset will get quite big, fast.
I also need to perform geo queries - get me the city and temperature for this day in this bounding box.
So far I've looked at HBase and CouchDB. HBase looked interesting, but the hardware requirements for production are too expensive for me (a presentation said you need 5 separate servers).
I'd like to keep the costs as low as possible, it's my personal project.
So what other options do I have? Can Mongo handle this amount of data? Anything else?
TL;DR
The requirements are
Large amount of data
Fast bounding box queries
Low/cheap hardware requirements
Optimized for reads, but needs to handle inserts of 2.5 million records daily
Yes, you can go with MongoDB. MongoDB was designed for scaling (sharding, replication), and it supports geospatial search.
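A minimal bounding-box sketch in the mongo shell (the collection and field names here are made up), using a 2d index over legacy [lng, lat] pairs:
db.cities.ensureIndex({loc: "2d"})
db.cities.find({
    loc: {$geoWithin: {$box: [[-10.0, 40.0], [5.0, 50.0]]}},   // [bottom-left], [top-right]
    forecastDate: ISODate("2016-01-01")
})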

Do CouchDB and MongoDB really search over each document with JavaScript?

From what I understand about these two "Not only SQL" databases, they search over each record and pass it to a JavaScript function you write, which calculates which results are to be returned by looking at each one.
Is that actually how it works? That sounds worse than using a plain RDBMS without any indexed keys.
I built my schemas so they don't require join operations, which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) OR revision = 4) gives the database a simple list of IDs, which it uses to find the actual rows in the massive data collection.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works). Perhaps someone can correct me, because I know I must be missing something.
@Xeoncross
I built my schemas so they don't require join operations, which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) OR revision = 4)
Well then, you'll love MongoDB. MongoDB supports secondary indexes, so you can index user_id and revision, and this query will return relatively quickly.
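For example (the collection name here is made up), the WHERE clause from the question translates roughly to:
db.posts.ensureIndex({user_id: 1})
db.posts.ensureIndex({revision: 1})
// each $or clause can use its own index instead of scanning every document
db.posts.find({$or: [{user_id: {$in: [12, 43, 5, 2]}}, {revision: 4}]})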
However, please note that many NoSQL DBs only support key lookups and don't necessarily support "secondary indexes", so do your homework on this one.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works).
Well, if you run a query in an SQL-based database and you don't have an index, the database will perform a table scan (i.e., look through every row).
They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
In practice, most NoSQL databases do support this. But please never use it for real-time queries; this option is primarily for map-reduce operations that summarize data.
Here's maybe a different take on NoSQL. SQL is really good at relational operations, but relational operations don't scale very well. Many NoSQL systems focus on key/value or document-oriented concepts instead.
SQL works on the premise that you want normalized, non-repeated data and that you grab that data in big sets. NoSQL works on the premise that you want fast queries for certain "chunks" of data, but that you're willing to wait for data that depends on "big sets" (running map-reduces in the background).
It's a big trade-off, but it makes a lot of sense for modern web apps. Most of the time is spent loading one page (blog post, wiki entry, SO question), and most of the data is really tied to, or "hanging off", that element. So the concept of grabbing everything you need with one horizontally scalable query is really useful.
It's not the solution for everything, but it is a really good option for lots of use cases.
In terms of CouchDB, the map function can be JavaScript, but it can also be Erlang (or another language altogether, if you pull in a third-party view server).
Additionally, views are calculated incrementally. In other words, the map function is run on all the documents in the database when the view is created, but further updates to the database only affect the related portions of the view.
The contents of a view are, in some ways, similar to an indexed field in an RDBMS. The output is a set of key/value pairs that can be searched very quickly, as they are stored as B-trees, which is how some RDBMSs store their indexes.
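A minimal CouchDB map function, for illustration (the field name is made up); the emitted key/value pairs become the view's B-tree entries:
function (doc) {
    if (doc.user_id) {
        emit(doc.user_id, null);   // key: user_id; use include_docs=true to fetch the documents
    }
}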
Think of it this way: CouchDB stores the view output in a B-tree keyed by the "index" (the view) and just walks that tree, so it isn't searching through documents at query time.
See http://guide.couchdb.org/draft/btree.html
You should study them a bit more. It's not "worse" than an RDBMS, it's different. In fact, for certain domains and workloads, the NoSQL paradigm works out to be much quicker than traditional (and, in some opinions, outdated) RDBMS implementations. Think of Google's Bigtable platform and you get what MongoDB, Riak, CouchDB, Cassandra (Facebook) and many, many others are trying to accomplish.
The primary difference is that most of these NoSQL solutions focus on key/value stores (some call these "document" databases) and have limited or no concept of relationships (in the primary/foreign-key sense) or joins. Join operations on tables can be very expensive. Also, let's not forget the object-relational impedance mismatch issue: you don't need an ORM to access MongoDB. It can store your code object (or document) essentially as it is in memory. Can you imagine the savings in lines of code and complexity!? db4o is another lightweight solution that does this.
I'm not sure what you mean by a "Not only SQL" database. It's the NoSQL paradigm, wherein no SQL is used to query the underlying data store. NoSQL also means it's not an RDBMS, which SQL is generally built on top of. That said, MongoDB does have an SQL-like syntax that can be used from .NET when retrieving data; it's called NoRM.
I will say I've only really worked with Riak and MongoDB; I'm by no means familiar with Cassandra or CouchDB beyond a reading level and feature-set comprehension. I prefer MongoDB over them all. Riak was nice too, but not for what I needed. You should download a few of these NoSQL solutions and you will get the concept. Check out db4o, MongoDB, and Riak, as I've found them to be the easiest, with more support for .NET-based languages. It will just make sense for certain applications. All in all, NoSQL, or document database, or OODBMS, whatever you want to call it, is very appealing and gaining a lot of momentum.
I also forgot about your JavaScript question. MongoDB has JavaScript "bindings" that enable it to be used as one method of searching for data. Riak handles data via a JSON format; MongoDB uses BSON, I believe, and I can't remember what the others use. In any case, the point is that instead of SQL (Structured Query Language) to "ask" the database for information, some of these (MongoDB being one) use JavaScript and/or RESTful syntax to ask the NoSQL system for data. I believe CouchDB and Riak can be queried over HTTP too, which makes them very accessible. Not to mention, that's pretty frickin' cool.
Do your research: download them, they are all free and OSS.
db4o: http://www.db4o.com/ (Java & .NET versions)
MongoDB: http://www.mongodb.org/
Riak: http://www.basho.com/Riak.html
NoRM: http://thechangelog.com/post/436955815/norm-bringing-mongodb-to-net-linq-and-mono