MongoDB: cache or not cache using Redis

In my project, the main database is MongoDB, and I use Redis for caching.
For long, complex queries, it is obviously better to cache the results in Redis.
But I'm wondering whether I should also cache simple queries, such as a lookup by id or by some other indexed MongoDB field. Does it make sense to use Redis for this kind of indexed lookup?
Or should I skip caching for this kind of query because MongoDB already has a good internal caching mechanism?
Which is faster: looking up an indexed field in MongoDB, or looking up a key in Redis?

A lookup in Redis is definitely faster, because Redis is an in-memory key-value store.
MongoDB can't cache query results:
MongoDB is a database and can't cache the results of queries for you, because the data may change at any time. Managing the cache is therefore the developer's responsibility.
That said, MongoDB does have good internal mechanisms that use RAM for better performance. (Check this question for more info.)
Database queries are expensive:
When you execute a query in MongoDB, a lot of work goes into finding the data, even for simple queries. Redis, by contrast, can find a key very quickly. So use Redis to keep frequently read data close at hand, and use MongoDB for permanent storage and queries.
My recommendation:
Cache the results of any high-usage or heavy query in Redis, Memcached, or another in-memory key-value store.
(It doesn't make sense to look up a simple post in MongoDB a thousand times per day. That is just a waste of resources. The first duty of a cache is to keep high-usage data closer.)
Also note that you need a good "cache invalidation" mechanism to keep the cached data in Redis up to date.
I recommend using the write-through technique to keep models and data in Redis.
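As a rough illustration of the write-through idea, here is a minimal sketch in which plain Python dicts stand in for Redis and MongoDB. The class name, method names and the "post:1" key are hypothetical; a real setup would call a Redis client and a MongoDB driver in the same places.

```python
from typing import Optional


class WriteThroughStore:
    """Write-through cache sketch; plain dicts stand in for Redis and MongoDB."""

    def __init__(self) -> None:
        self.db: dict[str, dict] = {}     # stand-in for a MongoDB collection
        self.cache: dict[str, dict] = {}  # stand-in for Redis
        self.db_reads = 0                 # counts how often a read hits the database

    def save(self, doc_id: str, doc: dict) -> None:
        # Write-through: persist first, then refresh the cached copy in the same step.
        self.db[doc_id] = dict(doc)
        self.cache[doc_id] = dict(doc)

    def get(self, doc_id: str) -> Optional[dict]:
        cached = self.cache.get(doc_id)
        if cached is not None:
            return cached
        # Cache miss (for example an evicted entry): read the database and repopulate.
        self.db_reads += 1
        doc = self.db.get(doc_id)
        if doc is not None:
            self.cache[doc_id] = dict(doc)
        return doc


store = WriteThroughStore()
store.save("post:1", {"title": "Hello"})
store.save("post:1", {"title": "Hello, edited"})
latest = store.get("post:1")  # served from the cache, no database read
```

Because every write refreshes the cached copy, reads after an update see the new data without a separate invalidation step; the database is only read when an entry is missing from the cache.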
I hope this helps.

Related

How do I make the most use of MongoDB and Redis Caching for a high scalable application?

I want to use the best features of Redis caching and the MongoDB database for my current product.
I have a very heavy database, and would like to avoid multiple database calls.
Can I cache my documents in Redis and query those instead?
What level of caching would be suggested for the best performance?

MongoDB sharding for data warehouse

Sharding provides scalable throughput and storage, which is a kind of paradise for analytics. However, there is a huge trade-off that I keep thinking about.
If I use a hashed shard key,
- writes will be very scalable
- however, a sequential read of facts will be exhaustive, since it has to access all servers
If I use a ranged shard key, e.g. on field A,
- writes might be scalable, as long as the key is not a timestamp field
- however, sequential reads will not be scalable unless they filter on field A
In my opinion, it won't be very scalable as a data warehouse. However, I have no idea what other solution would make a MongoDB data warehouse scalable.
Is MongoDB sharding really suitable for making a data warehouse scalable?
Erm, if you read a lot of data, it is most likely that you will exhaust the physical read capacity of one server. You want the reads to be done in parallel - unless I have a very wrong understanding of data warehousing and the limitations of the HDDs and SSDs around nowadays.
What you would do first is to select a subset of the data you want to analyze, right? If you have a lot of data, it makes sense that this matching is done in parallel. When the subset is selected, further analysis should be made, right? This is exactly what MongoDB does in the aggregation framework. An early match is done on all of the affected shards and the result is sent to the primary shard for that database, where further steps of the aggregation pipeline are applied.
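The "match on every shard, then merge" flow described above can be sketched roughly as follows. This is a toy simulation, not MongoDB code: the three in-memory lists stand in for shards, and the field names and function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Three hypothetical shards, each holding part of a "facts" collection.
shards = [
    [{"region": "eu", "amount": 10}, {"region": "us", "amount": 5}],
    [{"region": "eu", "amount": 7}],
    [{"region": "us", "amount": 3}, {"region": "eu", "amount": 1}],
]


def match_on_shard(docs, region):
    # Early match stage: each shard filters only its own documents.
    return [d for d in docs if d["region"] == region]


def total_for_region(region):
    # Scatter: run the match on every shard in parallel.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda docs: match_on_shard(docs, region), shards))
    # Gather: a single node merges the partial results and finishes the pipeline.
    return sum(d["amount"] for part in partials for d in part)


eu_total = total_for_region("eu")
```

The point of the sketch is that the expensive filtering step is spread over all shards, so touching every server is what makes a large read parallel rather than a drawback in itself.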

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is; this partly depends on the size of the objects you're inserting and on other factors that are hard to measure. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's also mongorestore, if the data is in BSON format.
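To experiment with different batch sizes, something like the following sketch may help. The `batches` and `bulk_load` helpers are hypothetical; `insert_many` is passed in as a callable so the example runs without a database, and in practice it could be a driver method such as pymongo's `collection.insert_many`.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batches(docs: Iterable[dict], size: int) -> Iterator[list]:
    """Yield lists of at most `size` documents without loading everything into memory."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_load(docs: Iterable[dict], insert_many: Callable[[list], object], size: int) -> int:
    """Send documents to `insert_many` one batch at a time; returns the number of batches."""
    count = 0
    for chunk in batches(docs, size):
        insert_many(chunk)
        count += 1
    return count


# `received.append` is a stand-in for a real driver call; it just records each batch.
received = []
bigrams = ({"_id": i, "bigram": f"w{i} w{i + 1}"} for i in range(10))
n_batches = bulk_load(bigrams, received.append, size=4)
```

Timing `bulk_load` with a few values of `size` against a real collection is one way to find the range that works best for a given document size.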
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. Plenty of people run MongoDB with billions of documents, and there is a lot of discussion about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want multiple collections instead. The more collections you have, the more indexes you will have too, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, other users run multiple mongods to get around the locking limits of a single mongod under heavy writes. It's obvious but still worth saying: a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see an improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency through database-level locking, I suspect there will be much less reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

MongoDB with redis

Can anyone give example use cases of when you would benefit from using Redis and MongoDB in conjunction with each other?
Redis and MongoDB can be used together with good results. A company well known for running MongoDB and Redis (along with MySQL and Sphinx) is Craigslist. See this presentation from Jeremy Zawodny.
MongoDB is interesting for persistent, document oriented, data indexed in various ways. Redis is more interesting for volatile data, or latency sensitive semi-persistent data.
Here are a few examples of concrete usage of Redis on top of MongoDB.
Pre-2.2 MongoDB does not yet have an expiration mechanism. Capped collections cannot really be used to implement a real TTL. Redis has a TTL-based expiration mechanism, making it convenient to store volatile data. For instance, user sessions are commonly stored in Redis, while user data will be stored and indexed in MongoDB. Note that MongoDB 2.2 has introduced a low-accuracy expiration mechanism at the collection level (to be used for purging data, for instance).
Redis provides a convenient set datatype and its associated operations (union, intersection, difference on multiple sets, etc.). It is quite easy to implement a basic faceted search or tagging engine on top of this feature, which is an interesting addition to MongoDB's more traditional indexing capabilities.
Redis supports efficient blocking pop operations on lists. This can be used to implement an ad-hoc distributed queuing system. It is more flexible than MongoDB tailable cursors IMO, since a backend application can listen to several queues with a timeout, transfer items to another queue atomically, etc ... If the application requires some queuing, it makes sense to store the queue in Redis, and keep the persistent functional data in MongoDB.
Redis also offers a pub/sub mechanism. In a distributed application, an event propagation system may be useful. This is again an excellent use case for Redis, while the persistent data are kept in MongoDB.
Because it is much easier to design a data model with MongoDB than with Redis (Redis is more low-level), it is interesting to benefit from the flexibility of MongoDB for main persistent data, and from the extra features provided by Redis (low latency, item expiration, queues, pub/sub, atomic blocks, etc ...). It is indeed a good combination.
Please note you should never run a Redis and MongoDB server on the same machine. MongoDB memory is designed to be swapped out, Redis is not. If MongoDB triggers some swapping activity, the performance of Redis will be catastrophic. They should be isolated on different nodes.
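To make the set-based tagging idea mentioned above a little more concrete, here is a small sketch in which Python's built-in sets stand in for Redis sets. The function names and the post ids are hypothetical; the comments name the Redis commands (SADD, SINTER, SUNION) that would play the same role.

```python
from collections import defaultdict

# tag -> set of document ids; in Redis each tag would be one set key.
tag_index = defaultdict(set)


def tag_document(doc_id, tags):
    # Comparable to SADD on the set of every tag.
    for tag in tags:
        tag_index[tag].add(doc_id)


def docs_with_all(tags):
    # Comparable to SINTER over the tag sets.
    sets = [tag_index.get(t, set()) for t in tags]
    return set.intersection(*sets) if sets else set()


def docs_with_any(tags):
    # Comparable to SUNION over the tag sets.
    return set().union(*(tag_index.get(t, set()) for t in tags))


tag_document("post1", ["mongodb", "redis"])
tag_document("post2", ["mongodb"])
tag_document("post3", ["redis", "caching"])

both = docs_with_all(["mongodb", "redis"])
either = docs_with_any(["mongodb", "caching"])
```

In this arrangement the ids returned by the set operations would then be used to fetch the full documents from MongoDB.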
Obviously there are far more differences than this, but for an extremely high overview:
For use-cases:
Redis is often used as a caching layer or shared whiteboard for distributed computation.
MongoDB is often used as a swap-out replacement for traditional SQL databases.
Technically:
Redis is an in-memory db with disk persistence (the whole db needs to fit in RAM).
MongoDB is a disk-backed db which only needs enough RAM for the indexes.
There is some overlap, but it is extremely common to use both. Here's why:
MongoDB can store more data more cheaply.
Redis is faster for the entire dataset.
MongoDB's culture is "store it all, figure out access patterns later"
Redis's culture is "carefully consider how you'll access data, then store"
Both have open source tools that depend on them, many of which are used together.
Redis can be used as a replacement for a traditional datastore, but it's most often used alongside another, long-term data store such as Mongo, PostgreSQL, MySQL, etc.
Redis works excellently with MongoDB as a caching server. Here is what happens.
Any time mongoose issues a query, it first goes to the cache server.
The cache server checks whether that exact query has been issued before.
If it hasn't, the cache server takes the query and sends it over to MongoDB, and MongoDB executes the query.
The result of that query then goes back to the cache server, which stores it.
In effect it says, "any time I execute that query, I get this response", so it maintains a record of the queries that have been issued and the responses that came back from them.
The cache server takes the response and sends it back to mongoose, mongoose gives it to express, and it eventually ends up inside the application.
Any time the exact same query is issued again, mongoose sends it to the cache server, but if the cache server sees that this query was issued before, it does not send the query on to MongoDB. Instead it takes the response it got last time and immediately sends it back to mongoose. There are no indices here, no full table scan, nothing.
We are doing a simple lookup to ask: has this query been executed? Yes? Okay, take the response and send it back immediately, and don't send anything to Mongo.
We have the mongoose server, the cache server (Redis) and MongoDB.
On the cache server there might be a key-value data store where each key is some query issued before and each value is the result of that query.
So maybe we are looking up a bunch of blog posts by _id.
So maybe the keys in here are the _id values of the records we have looked up before.
So let's imagine that mongoose issues a new query where it tries to find a blog post with an _id of 123. The query flows into the cache server, and the cache server checks whether it has a result for any query that was looking for an _id of 123.
If it does not exist in the cache server, the query is sent on to the MongoDB instance. MongoDB executes the query, gets a response and sends it back.
This result is sent back to the cache server, which immediately passes it on to mongoose so we get as fast a response as possible.
Right after that, the cache server also adds the query to its collection of issued queries and stores the result right alongside it.
So we can imagine that in the future we issue the same query again. It hits the cache server, which looks at all the keys it has and says, "oh, I already found that blog post." It doesn't reach out to Mongo; it just takes the result of the query and sends it directly to mongoose.
We are not doing complex query logic, no indices, nothing like that. It's as fast as possible. It's a simple key-value lookup.
That's an overview of how the cache server (Redis) works with MongoDB.
Now there are other concerns. Are we caching data forever? How do we update records?
We don't want to keep storing data in the cache and reading from it forever.
The cache server is not used for any write actions. The cache layer is only used for reading data. Whenever we write data, the write always goes to the MongoDB instance, and any time we write we need to clear any data stored on the cache server that is related to the record we just updated in Mongo.
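The read path and the clear-on-write rule described above can be sketched roughly like this. It is written in Python rather than with mongoose, and dicts stand in for both Redis and MongoDB; the class name, the "blogpost:" key prefix and the query counter are assumptions made for the example.

```python
import json


class CachedPosts:
    """Cache-aside sketch: dicts stand in for the Redis cache and the MongoDB collection."""

    def __init__(self, db):
        self.db = db          # stand-in for MongoDB: _id -> document
        self.cache = {}       # stand-in for Redis: key -> JSON string
        self.db_queries = 0   # counts how many reads actually reach the database

    @staticmethod
    def _key(post_id):
        return f"blogpost:{post_id}"

    def find_by_id(self, post_id):
        key = self._key(post_id)
        hit = self.cache.get(key)
        if hit is not None:
            # Same query seen before: answer from the cache, skip the database.
            return json.loads(hit)
        # Miss: run the query against the database, then remember the result.
        self.db_queries += 1
        doc = self.db.get(post_id)
        if doc is not None:
            self.cache[key] = json.dumps(doc)
        return doc

    def update(self, post_id, fields):
        # Writes always go to the database; the stale cache entry is cleared.
        self.db[post_id] = {**self.db.get(post_id, {}), **fields}
        self.cache.pop(self._key(post_id), None)


posts = CachedPosts({123: {"title": "First post"}})
first = posts.find_by_id(123)    # miss, reaches the database
second = posts.find_by_id(123)   # hit, served from the cache
posts.update(123, {"title": "Edited"})
third = posts.find_by_id(123)    # miss again after the entry was cleared
```

With a real Redis client, an expiry on each cached key would also address the "are we caching data forever?" concern, in addition to clearing entries on write.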

What operations are cheap/expensive in mongodb?

I'm reading up on MongoDB, and trying to get a sense of where it's best used. One question that I don't see a clear answer to is which operations are cheap or expensive, and under what conditions.
Can you help clarify?
Thanks.
It is often claimed that MongoDB has insanely fast writes. While they are indeed not slow, this is quite an overstatement. Write throughput in MongoDB is limited by a global write lock. Yes, you heard me right: there can be only ONE* write operation happening on the server at any given moment.
Also, I suggest you take advantage of the schemaless nature of MongoDB and store your data denormalized. Often it is possible to do just one disk seek to fetch all the required data (because it is all in the same document). Fewer disk seeks mean faster queries.
If the data sits in RAM, no disk seeks are required at all; the data is served right from memory. So make sure you have enough RAM.
Map/Reduce, group, and $where queries are slow.
It is not fast to keep writing to one big document (using $push, for example). The document will outgrow its allocated space on disk and will have to be copied to another place, which involves more disk operations.
And I agree with @AurelienB: some basic principles are universal across all databases.
Update
* Since 2011, several major versions of MongoDB have been released, improving the situation with locking (from server-wide to database-level to collection-level). A new storage engine, WiredTiger, was introduced, which has document-level locks. All in all, writes should be significantly faster now, in 2018.
From my experience, one thing that should be mentioned is that MongoDB is not a very good fit for reporting, because reports usually need data from different collections (a 'join'), and MongoDB does not provide a good way to aggregate data across multiple collections (and is not supposed to). For some reports, map/reduce or incremental map/reduce can work well, but those are rare situations.
For reports, some people suggest migrating the data into relational databases, which have a lot of reporting tools.
This is not very different from other database systems.
Queries on indexed data are fast. Queries on a lot of data are... slow.
Because of denormalization, writing to the database is fast if there is no index; that's why logging is the basic use case.
Conversely, reading data that is on disk (not in RAM) without an index can be very slow when you have billions of documents.
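The gap between an indexed and an unindexed read can be illustrated with a toy in-memory model. This is only an analogy, not how MongoDB is implemented: a sorted key list searched with `bisect` stands in for a B-tree index, and the collection size, field names and functions are made up for the example.

```python
from bisect import bisect_left

docs = [{"_id": i, "user": f"user{i:05d}"} for i in range(10_000)]


def find_by_scan(user):
    """No index: examine documents one by one until a match is found."""
    examined = 0
    for doc in docs:
        examined += 1
        if doc["user"] == user:
            return doc, examined
    return None, examined


# A sorted list of (key, position) pairs stands in for an index on "user".
index = sorted((doc["user"], pos) for pos, doc in enumerate(docs))
keys = [k for k, _ in index]


def find_by_index(user):
    """With an index: binary search the keys, then fetch exactly one document."""
    i = bisect_left(keys, user)
    if i < len(keys) and keys[i] == user:
        return docs[index[i][1]], 1
    return None, 0


scan_doc, scan_examined = find_by_scan("user09999")
idx_doc, idx_examined = find_by_index("user09999")
```

The scan has to examine every document before reaching the last one, while the indexed lookup fetches a single document; with billions of documents on disk, that difference is what separates a fast query from a very slow one.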