MongoDB sharding for data warehouse

Sharding provides scalable throughput and storage, which sounds like a paradise for analytics. However, there is a big trade-off that I keep thinking about.
If I use a hashed shard key:
- writes will be very scalable
- however, sequential reads over the fact data will be expensive, since every query has to hit all the servers
If I use a ranged shard key, e.g. on field A:
- writes might be scalable, as long as the key is not a monotonically increasing field such as a timestamp
- however, sequential reads will not be scalable unless they filter on field A
In my opinion, neither option makes it very scalable as a data warehouse, but I have no idea what else would make a MongoDB data warehouse scalable.
Is MongoDB sharding really suitable for making a data warehouse scalable?
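For reference, this is roughly how I imagine the two options would be declared (a sketch only; the warehouse.facts namespace and field A are made-up names, and a collection can of course only be sharded one way):

# Sketch of the two shard-key choices I am weighing (run against a mongos router).
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
client.admin.command("enableSharding", "warehouse")

# Option 1: hashed shard key - writes spread evenly, but range scans hit every shard.
client.admin.command("shardCollection", "warehouse.facts", key={"_id": "hashed"})

# Option 2: ranged shard key on field A - reads filtered on A are targeted, but a
# monotonically increasing A (e.g. a timestamp) funnels all writes to one shard.
# client.admin.command("shardCollection", "warehouse.facts", key={"A": 1})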

Erm, if you read a lot of data, you will most likely exhaust the physical read capacity of a single server, so you actually want the reads to be done in parallel - unless I have a very wrong understanding of data warehousing and of the limitations of today's HDDs and SSDs.
What you would do first is select a subset of the data you want to analyze, right? If you have a lot of data, it makes sense for that matching to be done in parallel. Once the subset is selected, further analysis is applied, right? This is exactly what MongoDB does in the aggregation framework: an early match is executed on all of the affected shards, and the results are sent to the primary shard for that database, where the remaining stages of the aggregation pipeline are applied.
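For illustration only (not part of the original answer): a pipeline whose first stage is a $match lets each shard filter its own chunks before the partial results are merged. A rough PyMongo sketch, with made-up database, collection, and field names:

# Rough sketch: the $match runs in parallel on every shard that owns matching
# chunks, and only the filtered results are combined afterwards.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # hypothetical mongos router
facts = client["warehouse"]["facts"]                  # hypothetical fact collection

pipeline = [
    {"$match": {"region": "EU", "year": 2014}},                      # pushed down to the shards
    {"$group": {"_id": "$product", "total": {"$sum": "$amount"}}},   # results merged afterwards
    {"$sort": {"total": -1}},
]
for row in facts.aggregate(pipeline):
    print(row)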


MongoDB Cache or Not Cache using Redis

In my project the main database is MongoDB, and I have Redis for caching.
For long and complex queries it obviously makes sense to cache the results in Redis.
But I'm wondering whether I should also cache simple queries, like a lookup by id or by some other indexed MongoDB field. Does it make sense to use Redis for this kind of indexed lookup?
Or should I just not cache this kind of query, because MongoDB already has good internal caching mechanisms?
Is a lookup on an indexed MongoDB field faster, or is a lookup in Redis faster?
Lookup in Redis is definitely faster (because of the key-value nature of Redis).
MongoDB can't cache query results:
MongoDB is a database and can't cache query results for you, because the data may change at any time. Managing the cache is therefore the developer's responsibility.
That said, MongoDB does have good internal mechanisms for using RAM to improve performance (check this question for more info).
Database queries are expensive:
When you execute a query in MongoDB, a lot of work happens to find the data, even for simple queries, whereas Redis can find a key very, very fast. So it's clear you should keep hot data in Redis and use MongoDB for permanent storage and queries.
My recommendation:
Cache the results of any high-usage or heavy query in Redis, Memcached, or another in-memory key-value store.
(It doesn't make sense to look up the same simple post in the database/MongoDB a thousand times per day; that's just a waste of resources. The first duty of a cache is to keep high-usage data close.)
Also note that you need a good cache-invalidation mechanism to keep the data cached in Redis up to date.
I recommend the write-through technique to keep models and data in Redis, roughly as sketched below.
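A minimal sketch of what I mean, using redis-py and PyMongo (the blog.posts collection, the key format, and the one-hour TTL are just examples, not a prescription):

# Cache-aside read plus a write-through update for a single post. Sketch only.
import redis
from bson import ObjectId, json_util
from pymongo import MongoClient

r = redis.Redis(host="localhost", port=6379, db=0)
posts = MongoClient("mongodb://localhost:27017")["blog"]["posts"]

def get_post(post_id: str):
    key = f"post:{post_id}"
    cached = r.get(key)
    if cached is not None:                         # cache hit: MongoDB is not touched
        return json_util.loads(cached)
    doc = posts.find_one({"_id": ObjectId(post_id)})
    if doc is not None:
        r.set(key, json_util.dumps(doc), ex=3600)  # keep it for an hour
    return doc

def update_post(post_id: str, fields: dict):
    posts.update_one({"_id": ObjectId(post_id)}, {"$set": fields})
    doc = posts.find_one({"_id": ObjectId(post_id)})
    r.set(f"post:{post_id}", json_util.dumps(doc), ex=3600)  # write-through: refresh the cache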
I hope this helps.

Choosing a Database to store Report Json

I am trying to figure out which DB to use for a project with the following requirements.
Requirements:
High scalability and high availability
Data format is a JSON document, several MB in size
Query capabilities are my least concern; it is more of a key-value use case
High performance / low latency
I have considered MongoDB, Cassandra, Redis, Postgres (jsonb), a few other document-oriented DBs, and embedded databases (a small footprint would be a plus).
Please help me find out which DB will be the best choice.
I won't need document/row-wise comparison queries at all; at most I will need to pick a subset of fields from a document. What I am looking for is a lightweight DB with a small footprint, low latency, and high scalability. Very limited query capabilities are acceptable. Should I be choosing an embedded DB? What are the points to consider here?
Thanks for the help!
If you are storing documents (JSON), use a document database, especially if the documents differ in structure.
PostgreSQL does not scale horizontally. Have a look at CockroachDB if you like.
Cassandra can do key-value at scale, as can Redis, but neither is really a document database.
I would suggest MongoDB or CouchDB - which one is a good match depends on your needs. In CAP terms, MongoDB is consistent and partition-tolerant, while CouchDB is partition-tolerant and available.
If you can live with some limits for querying and like high availability try out CouchDB.

Comment system: RDBMS vs NoSQL

What is the best way to implement a commenting system (with a huge volume of writes)?
1) Use an RDBMS such as MySQL, with two tables: one for the topics and one for the comments.
Pros: inserting a new comment is fast, efficient, and simple, and indexing is efficient. Cons: scaling out (horizontal scaling) is hard.
2) Use a NoSQL database such as CouchDB or MongoDB. Pros: scaling out (horizontal scaling) is easy, huge write volumes are supported, and it is schemaless. Cons: I think that inserting new data is not as fast and efficient as in an RDBMS.
For example, to update a CouchDB document you need to fetch the whole document, update it locally, and then submit it again; since the document will be huge, this consumes a lot of bandwidth.
Also, I think that CouchDB updates (which are not in-place) and MongoDB updates would be slow and not as efficient as in an RDBMS.
Also, when you want to get the comments of each user across various topics, I think the search would be faster in an RDBMS than in the NoSQL system.
Here is a sample CouchDB document (one document per topic):
{"_id":"doc id",
"_rev":"45521231465421"
"topic_title":"the title of the topic"
"topic_body":"the body of the topic"
"comments":[
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla1"}, {"user":"user1"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla2"}, {"user":"user2"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla3"}, {"user":"user3"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla4"}, {"user":"user4"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla5"}, {"user":"user5"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla6"}, {"user":"user6"}
]
}
I think that inserting new data is not as fast and efficient as in an RDBMS
You have hit on something there. The insertion speed of NoSQL databases depends on your scenario. I cannot make that clear enough: so many people expect MongoDB to just perform magically faster than SQL and are sorely disappointed when it does not. In fact, the mongodb-user Google group has long been filled with such people.
For example to update couchdb
Not only that, but CouchDB also uses versioning and JSON, which is not as efficient as SQL storage and will consume more space per record.
MongoDB updates would be slow and not as efficient as in an RDBMS
Schema, Queries, Schema, Queries...
That is what it comes down to. Ask yourself one question.
Will I be expecting a lot of comments per post?
If so, the in-memory (yes, in-memory) $push, $pull, and other subdocument operators may get slow on a large subdocument (let's be honest: they will).
Not only that, but constantly growing documents can be a problem, causing heavy fragmentation and wasted space and creating a "swiss cheese" effect that slows your system down massively (bringing it to a grinding halt). This presentation should help you understand more about how storage really works: http://www.10gen.com/presentations/storage-engine-internals
So you can already see that, used wrongly, subdocuments can be a bad idea. That being said, you could partially remedy it with power-of-2 size allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes - but if you are getting far too many comment insertions, it won't help much.
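To be concrete about the pattern being discussed (my own illustration, not code from the question): the embedded design means every new comment is a $push into one ever-growing topic document, something like:

# Illustration of the embedded design being warned about: each comment is pushed
# into the topic document, which keeps growing and may need to be relocated on disk.
from datetime import datetime, timezone
from pymongo import MongoClient

topics = MongoClient("mongodb://localhost:27017")["forum"]["topics"]

topics.update_one(
    {"_id": "topic-42"},                      # hypothetical topic id
    {"$push": {"comments": {
        "user": "user1",
        "comment": "bla1",
        "date": datetime.now(timezone.utc),
    }}},
)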
I personally would not embed this relationship.
So I would go for the same set-up as an RDBMS, and now you start to see the problem. Insertions would probably be about the same speed if it weren't for MongoDB's fsync queue, unlike SQL, which writes straight to disk. You can set up MongoDB with journalled writes, but then you will probably get roughly the same performance metrics as SQL at the end of the day.
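To sketch what "the same set-up as an RDBMS" looks like here (my own example with made-up names): a separate comments collection, one document per comment, indexed on the topic id and the user:

# Sketch of the non-embedded design: comments live in their own collection,
# indexed by topic_id (and user) so they can be fetched or counted cheaply.
from datetime import datetime, timezone
from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["forum"]
db["comments"].create_index([("topic_id", ASCENDING), ("date", DESCENDING)])
db["comments"].create_index([("user", ASCENDING)])

db["comments"].insert_one({
    "topic_id": "topic-42",
    "user": "user1",
    "comment": "bla1",
    "date": datetime.now(timezone.utc),
})

# All comments on one topic, newest first:
latest = db["comments"].find({"topic_id": "topic-42"}).sort("date", DESCENDING)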
As for querying, this is where MongoDB can still come out on top, providing your working set fits into RAM. I cannot stress that last bit enough!!
Unlike SQL, MongoDB maps everything (your entire data set) into virtual memory - not RAM, and definitely not to be confused with RAM. This makes it faster for larger lookups; for smaller lookups the speed will be about the same, because both will be serving from an in-memory cache.
Also, when you want to get the comments of each user across various topics, I think the search would be faster in an RDBMS than in the NoSQL system.
If the topic id is in the comment document, it would definitely be faster in MongoDB, provided your working set is in RAM.
What is meant by the working set? Here is a good answer: What does it mean to fit "working set" into RAM for MongoDB?
Hope this helps,
I can only speak about MongoDB, and you are indeed wrong about inserts. Here is a nice comparison of Mongo with MSSQL, where Mongo performs 100x better than MSSQL. So it's quite suitable for large-scale data processing.
Searching is also much faster (what would be the point of NoSQL if inserting and searching weren't faster?) - but with one caveat: you can't perform joins in queries, so you have to join collections manually in your application (though the recommended workaround is nested documents).
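By joining manually in the application I mean something like this (a rough sketch; the collection and field names are invented): fetch the topic, then fetch its comments in a second query, and stitch the two together in code.

# Rough sketch of an application-side "join": two queries instead of one SQL JOIN.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["forum"]

topic = db["topics"].find_one({"_id": "topic-42"})               # query 1: the topic
comments = list(db["comments"].find({"topic_id": "topic-42"}))   # query 2: its comments
topic["comments"] = comments                                     # joined in the application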

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary-key index would take forever, but as far as I'm aware Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is -- it partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's also mongorestore, if the data is in BSON format.
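For example, a batched load with the driver might look roughly like this (the 10,000-document batch size is an arbitrary starting point to tune, and the bigram document shape is an assumption about your data):

# Sketch of a batched bulk load into a bigrams collection.
from pymongo import MongoClient

bigrams = MongoClient("mongodb://localhost:27017")["corpus"]["bigrams"]

def load(pairs, batch_size=10_000):
    """pairs: any iterable of (word1, word2, count) tuples."""
    batch = []
    for w1, w2, count in pairs:
        batch.append({"w1": w1, "w2": w2, "count": count})
        if len(batch) >= batch_size:
            bigrams.insert_many(batch, ordered=False)  # unordered: keeps going past errors
            batch = []
    if batch:
        bigrams.insert_many(batch, ordered=False)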
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. There are plenty of people with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have as well, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you, but bear in mind that sharding is typically used for scaling across multiple servers, and a lot of people do it because they want to scale their writes or are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a sharded cluster or replica set as your data grows or you need extra redundancy and resilience.
However, some users use multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying: a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement from sharding on a single server. As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency via database-level locking, I suspect there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It would be counter-productive to be moving data around to keep things balanced, which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
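As a rough sketch of what pre-splitting can look like (admin commands issued through PyMongo; the corpus.bigrams namespace, the shard-key field, and the split points are entirely made up, and the balancerStop command assumes a reasonably recent MongoDB - older versions toggle the balancer via sh.stopBalancer() instead):

# Sketch only: stop the balancer, then pre-split an empty sharded collection
# at chosen boundaries so the initial load doesn't trigger chunk migrations.
from pymongo import MongoClient

admin = MongoClient("mongodb://mongos-host:27017").admin

admin.command("balancerStop")
admin.command("enableSharding", "corpus")
admin.command("shardCollection", "corpus.bigrams", key={"w1": 1})

for point in ["d", "h", "l", "p", "t"]:    # invented split points for illustration
    admin.command("split", "corpus.bigrams", middle={"w1": point})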
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (it partitions across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

What operations are cheap/expensive in mongodb?

I'm reading up on MongoDB, and trying to get a sense of where it's best used. One question that I don't see a clear answer to is which operations are cheap or expensive, and under what conditions.
Can you help clarify?
Thanks.
It is often claimed that MongoDB has insanely fast writes. While they are indeed not slow, this is quite an overstatement. Write throughput in MongoDB is limited by the global write lock. Yes, you heard me right: there can be only ONE* write operation happening on the server at any given moment.
Also, I suggest you take advantage of the schemaless nature of MongoDB and store your data denormalized. Often it is possible to fetch all the required data with just one disk seek (because it is all in the same document). Fewer disk seeks means faster queries.
If the data sits in RAM, no disk seeks are required at all and the data is served straight from memory. So make sure you have enough RAM.
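For example, this is the kind of denormalized document I mean (a toy illustration with invented names):

# Toy illustration of denormalization: the order embeds its line items and the
# shipping address, so a single query (ideally one seek) returns everything needed.
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

orders.insert_one({
    "_id": "order-1001",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "shipping_address": {"city": "Berlin", "zip": "10115"},
})

order = orders.find_one({"_id": "order-1001"})   # no joins, no extra lookups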
Map/Reduce, group, $where queries are slow.
It is not fast to keep writing to one big document (using $push, for example). The document will outgrow its disk boundaries and will have to be copied to another place, which involves more disk operations.
And I agree with @AurelienB: some basic principles are universal across all databases.
Update
* Since 2011, several major versions of MongoDB have been released, improving the situation with locking (from server-wide, to database-level, to collection-level). A new storage engine, WiredTiger, was introduced, which uses document-level locking. All in all, writes should be significantly faster now, in 2018.
From my practice, one thing that should be mentioned is that MongoDB is not a very good fit for reporting, because reports usually need data from different collections ('joins'), and MongoDB does not provide a good way to aggregate data across multiple collections (and is not supposed to). For some reports, map/reduce or incremental map/reduce can work well, but those are rare situations.
For reporting, some people suggest migrating the data into a relational database, which has plenty of tools for reporting.
This is not very different from other database systems.
Queries on indexed data are fast. Queries over a lot of unindexed data are... slow.
Thanks to denormalization, writing to the database is fast when there are no indexes to maintain; that's why logging is the classic use case.
On the other hand, reading data that is on disk (not in RAM) without an index can be very slow when you have billions of documents.
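To make that concrete (a toy sketch, not from the original answer; collection and field names are invented):

# Toy sketch: the same kind of filter is cheap with an index and expensive without
# one, because the unindexed case has to scan every document (possibly from disk).
from pymongo import ASCENDING, MongoClient

logs = MongoClient("mongodb://localhost:27017")["app"]["logs"]

# Fast path: an index on "level" turns the filter into an index lookup.
logs.create_index([("level", ASCENDING)])
errors = logs.find({"level": "error"})

# Slow path: filtering on an unindexed field forces a full collection scan.
slow_requests = logs.find({"request_ms": {"$gt": 500}})

# explain() shows which plan was chosen (an index scan vs. a collection scan).
print(logs.find({"level": "error"}).explain()["queryPlanner"])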