Mongoose: Updating many entries starves connection pool - mongodb

So I receive on my node server 80,000+ records at a time that need to be written to Mongo as updates. I am aware that mongoose doesn't support this functionality, so each one has to be updated individually.
When I do this, however, even with the connection pool set to, say, 100, it still overwhelms the pool, and the result is that any other web or system traffic that needs to make a database call cannot complete. Is there any way to have a model limit the number of connections it uses, or any other good way to work around this? For now we are resource-limited to a single db/node instance handling both front-end and back-end work.
Any comments or suggestions to try are welcome.
Thanks

You can perform bulk updates with Mongoose using Model.bulkWrite().
The Mongoose documentation doesn't specify how many documents you can update per batch, but the MongoDB documentation for the underlying bulk write mechanism suggests that you may be able to send all 80K updates at once.
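For illustration, a minimal sketch of the batched bulkWrite approach; the `Item` model and the `status` field are placeholders for your own schema:

```js
// Minimal sketch, assuming an existing `Item` model and that each
// incoming record carries an _id plus the fields to update.
async function applyUpdates(records) {
  const ops = records.map((r) => ({
    updateOne: {
      filter: { _id: r._id },
      update: { $set: { status: r.status } }, // illustrative field
    },
  }));

  // One round trip per batch instead of one pooled connection per record.
  // Chunking is optional; the driver also splits oversized batches itself.
  const BATCH = 10000;
  for (let i = 0; i < ops.length; i += BATCH) {
    await Item.bulkWrite(ops.slice(i, i + BATCH), { ordered: false });
  }
}
```

With `ordered: false` the server keeps processing the batch even if individual updates fail, which is usually what you want for independent records.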

Related

Does a running MongoDB aggregation pipeline slow down reads and writes to the affected collection?

As the title suggests, I'd like to know if reads and writes to a collection are delayed/paused while a MongoDB aggregation pipeline is running. I'm considering adding a pipeline in a user collection, and I think the query could sometimes affect a lot of users (possibly tens of thousands), or just run for longer than I expect. So I'm wondering if that will "block" reads and writes to the collection. The server isn't live, so I don't have real user data to inform this decision. I'd appreciate any feedback or suggestions, thanks!
Each server has a certain resource capacity. If you are sending a query to the server, it has less capacity remaining for other work (be that other queries or writes).
For locking and concurrency in MongoDB, see https://docs.mongodb.com/manual/faq/concurrency/.
If you are planning for high load/high throughput, you need to benchmark your specific use case.

MongoDB concurrency best practices

I am new to MongoDB. I am creating an application that manages a very big list of items (resources), and for each resource the application should manage a kind of booking.
My idea is to embed the booking documents inside the resource document, and to avoid concurrency problems I need to lock the resource during booking.
I see that MongoDB allows locks at the collection level, but this will create a bottleneck on the booking functionality, because all resources inside the collection would be locked while any single booking is in progress; for a large number of users and resources this solution will perform poorly.
In addition, if a deadlock occurred while booking a resource, all resources would be locked.
Are there alternative solutions or best practices to improve performance and scalability of this use case?
A possible solution would be to lock not at the collection level but at the document level (the resource in my example); that way one user booking a resource doesn't block another user from booking a different resource. Even then I am not sure of the final result, because write commands are not executed in parallel: I suppose I'll probably also need a cluster of servers to handle multiple writes in parallel.
You are absolutely right, you should definitely not lock the entire collection for just updating a single document.
Now this problem depends on how you update your document.
If you update your document with a single update query, then since document updates are atomic, you have no problem.
But if you first have to read the document, change it, and then save it, you do have a concurrency problem: just before you save the changed document, it could be updated by another request. The document you read would no longer be up to date, so your update would not be right either.
The simple solution to this concurrency problem is to store a version number (Mongoose's default is __v) in each of your documents and increment it on every update. Then every time you do a read & change & update, you make sure that the version of the document you read and the version of that document in the database are identical. When the version numbers differ, the update fails and you can simply try again.
If you are using node.js, then you are probably using mongoose, which generates __v and performs version checks behind the scenes, so you do not have to do any extra work to solve this concurrency issue.
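To make the pattern concrete, here is a rough sketch of the version-checked update; `Resource`, `bookings`, and the retry count are illustrative choices, and `modifiedCount` assumes a recent Mongoose/driver:

```js
// Optimistic concurrency sketch: only apply the update if the version
// we read is still the version stored in the database.
async function bookResource(resourceId, booking) {
  for (let attempt = 0; attempt < 3; attempt++) {
    const doc = await Resource.findById(resourceId);

    const res = await Resource.updateOne(
      { _id: resourceId, __v: doc.__v },        // version must still match
      { $push: { bookings: booking }, $inc: { __v: 1 } }
    );
    if (res.modifiedCount === 1) return true;   // our read was still current

    // Another request updated the document in between; read again and retry.
  }
  throw new Error('could not book resource after 3 attempts');
}
```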

Does MongoDB have features such as the triggers and stored procedures of a relational database?

As the title suggests (leaving out the map-reduce framework):
If I want to trigger an event to run a consistency check or security operations before a record is inserted, how can I do that with MongoDB?
MongoDB does not support triggers, but people have created solutions around them, mostly using the oplog, though this will only help you if you are running with replica sets, as the oplog is a capped collection that keeps track of data changes for the purposes of replication.
For a nodejs solution see: https://www.npmjs.org/package/mongo-watch or see an earlier SO thread: How to listen for changes to a MongoDB collection?
If you are concerned with consistency, read about write concern in MongoDB: http://docs.mongodb.org/manual/core/write-concern/ You can be as relaxed or as strict as you want by setting the insert's write concern level, from fire-and-forget to an acknowledgement from all members of the replica set.
So, if you want to run a consistency check before inserting data, you probably will have to move that logic to the client application and set your write concern level to a level that will ensure consistency.
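As a sketch of the oplog-tailing approach those solutions rely on, here is a tailable cursor with the native Node.js driver; the connection string and namespace are placeholders, and this only notifies you after the fact rather than acting as a true before-insert trigger:

```js
// Requires a replica set: the oplog only exists on replica set members.
const { MongoClient } = require('mongodb');

async function watchInserts() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const oplog = client.db('local').collection('oplog.rs');

  const cursor = oplog.find(
    { op: 'i', ns: 'mydb.mycollection' },   // 'i' = insert entries
    { tailable: true, awaitData: true }     // keep the cursor open
  );

  for await (const entry of cursor) {
    // entry.o is the inserted document; react to it here.
    console.log('inserted:', entry.o);
  }
}
```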
MongoDB does not have triggers or stored procedures. While there are solutions that some have used to try to emulate the behavior, since it is not a built-in feature you'll need to decide whether those solutions are effective for you. Searching for "triggers and mongodb" should find dozens. All depend on the oplog and replica sets.
But, given the nature of MongoDB and a typical 3-tier architecture, I would expect the necessary consistency and security checks to run at the point of data insertion, which could be on a web server for example. You wouldn't allow a client such as a mobile application to write directly to a database collection without some checks.
Many MongoDB drivers and extended libraries already have validation and consistency checks built in, so there is less to do. Using unique indexes on some fields can also provide a level of consistency that you cannot get from the driver alone. Look at calls like findAndModify, which perform atomic updates.
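For instance, a unique index plus an atomic findOneAndUpdate covers many cases without any trigger; the collection and field names here are illustrative, and `returnDocument` assumes driver v4+:

```js
// Enforce a constraint at the database level, not in application code.
await db.collection('users').createIndex({ email: 1 }, { unique: true });

// Atomic check-and-update in one operation: decrement credits only if
// the user still has enough, with no read-modify-write window.
const updated = await db.collection('users').findOneAndUpdate(
  { _id: userId, credits: { $gte: cost } },
  { $inc: { credits: -cost } },
  { returnDocument: 'after' }
);
```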

Configure a Mongo replica set to only replicate certain collections

I have a ~3GB mongo database with several dozen collections. Three of these collections handle ~300 queries per second, while the rest sustain a much lower volume. I expect the traffic to continue to grow quickly.
I'd like to set up a replica set to handle the high-traffic collections. It isn't necessary for this new instance to replicate the rest of the database. Is this possible?
This doesn't seem possible at the moment with MongoDB's built-in features; the only way is to come up with your own manual replication algorithm or to use third-party tools.
The https://github.com/wordnik/wordnik-oss project might help you achieve this, according to the following post:
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/Ap9V4ArGuFo
That thread describes a workaround for filtering which documents get replicated:
Replicate only documents where {'public':true} in MongoDB
Or just replicate the data yourself manually, which might be worth trying.
Good luck.
No, that isn't possible right now. What you could do is move the low-traffic collections into another, unreplicated database. But this will cause headaches once those collections see higher traffic too, at which point you would need to move them back into your replicated db.
But in general, replication isn't the way to go if you need to scale; it is intended more for DR/failover. Replica set secondaries can only (optionally) answer read queries, never writes, which is something you should keep in mind. So if you have a high write load, replication may not cure your problem.
Once you allow your application to read from secondaries, you have to live with eventual consistency, meaning that your application is not guaranteed to always see the latest data. This is caused by the asynchronous replication to the secondaries.
Indeed you can cure this problem by configuring your write concern so that a write must succeed on all replicas before it is considered written and your driver returns. But this may slow down your write operations significantly.
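For example, with the Node.js driver the write concern is passed per operation; `w: 'majority'` is the usual middle ground rather than requiring every member:

```js
// Block until a majority of replica set members acknowledge the write
// (and it is journaled): stronger durability, slower writes.
await collection.insertOne(
  { name: 'example' },
  { writeConcern: { w: 'majority', j: true } }
);
```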
So for scaling query execution capabilities, I would go with sharding. Sharding works at a per-collection level; all unsharded collections remain on the database's primary shard.
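In the mongo shell that looks roughly like this; the database, collection, and shard key names are placeholders, and the shard key choice is the critical design decision:

```js
// Run against a mongos router.
sh.enableSharding("mydb");

// A hashed shard key spreads writes evenly across shards.
sh.shardCollection("mydb.highTraffic", { userId: "hashed" });
```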
Not possible, but if the data size is small and those collections are rarely updated, the only overhead of replicating them is a little extra storage on the secondary. That is a relatively small price to pay, especially since the collections won't grow much, compared with writing your own replication logic.
Alternatively, archive the data: keep only the latest data set on the production server and archive the rest on the new server.

MongoDB with Redis

Can anyone give example use cases of when you would benefit from using Redis and MongoDB in conjunction with each other?
Redis and MongoDB can be used together with good results. A company well known for running MongoDB and Redis (along with MySQL and Sphinx) is Craigslist. See this presentation from Jeremy Zawodny.
MongoDB is interesting for persistent, document oriented, data indexed in various ways. Redis is more interesting for volatile data, or latency sensitive semi-persistent data.
Here are a few examples of concrete usage of Redis on top of MongoDB.
Pre-2.2 MongoDB does not yet have an expiration mechanism, and capped collections cannot really be used to implement a real TTL. Redis has a TTL-based expiration mechanism, making it convenient to store volatile data. For instance, user sessions are commonly stored in Redis, while user data is stored and indexed in MongoDB. Note that MongoDB 2.2 introduced a low-accuracy expiration mechanism at the collection level (to be used for purging data, for instance).
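A typical split, sketched with the node-redis client; the key naming and the 30-minute TTL are arbitrary choices, and `redis`/`User` are assumed to be a connected client and a mongoose model:

```js
// Volatile session in Redis, auto-expiring after 30 minutes.
const session = JSON.stringify({ userId, cart: [] });
await redis.set(`sess:${sessionId}`, session, { EX: 1800 });

// The durable user record stays in MongoDB, indexed as usual.
await User.updateOne({ _id: userId }, { $set: { lastSeen: new Date() } });
```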
Redis provides a convenient set datatype and its associated operations (union, intersection, difference on multiple sets, etc.). It is quite easy to implement a basic faceted search or tagging engine on top of this feature, which is an interesting addition to MongoDB's more traditional indexing capabilities.
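A sketch of such a tagging engine; `Post` is an illustrative mongoose model, and the _ids are stored in the sets as strings:

```js
// One Redis set per tag, each holding MongoDB _ids.
await redis.sAdd('tag:mongodb', postId);
await redis.sAdd('tag:redis', postId);

// Set intersection = posts carrying both tags; fetch full docs from Mongo.
const ids = await redis.sInter(['tag:mongodb', 'tag:redis']);
const posts = await Post.find({ _id: { $in: ids } });
```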
Redis supports efficient blocking pop operations on lists. This can be used to implement an ad-hoc distributed queuing system. It is more flexible than MongoDB's tailable cursors IMO, since a backend application can listen to several queues with a timeout, transfer an item to another queue atomically, etc. If the application requires some queuing, it makes sense to store the queue in Redis and keep the persistent functional data in MongoDB.
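For instance, a minimal worker loop built on a blocking pop; `processDocument` is a hypothetical handler, and the 5-second timeout is arbitrary:

```js
// Job payloads carry MongoDB ids rather than full documents.
async function worker() {
  while (true) {
    // Block for up to 5 seconds waiting for a job on the 'jobs' list.
    const job = await redis.brPop('jobs', 5);
    if (!job) continue; // timed out; loop and block again

    const { docId } = JSON.parse(job.element);
    await processDocument(docId); // e.g. load and update the doc in MongoDB
  }
}
```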
Redis also offers a pub/sub mechanism. In a distributed application, an event propagation system may be useful. This is again an excellent use case for Redis, while the persistent data are kept in MongoDB.
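A minimal pub/sub sketch with node-redis; the channel name and payload are illustrative, and note that a subscriber needs its own dedicated connection:

```js
// Subscriber side: duplicate the client, since a subscribed connection
// cannot issue regular commands.
const sub = redis.duplicate();
await sub.connect();
await sub.subscribe('user-events', (message) => {
  console.log('event received:', message);
});

// Publisher side, after persisting the change to MongoDB:
await redis.publish('user-events', JSON.stringify({ type: 'signup', userId }));
```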
Because it is much easier to design a data model with MongoDB than with Redis (Redis is more low-level), it is interesting to benefit from the flexibility of MongoDB for the main persistent data, and from the extra features provided by Redis (low latency, item expiration, queues, pub/sub, atomic blocks, etc.). It is indeed a good combination.
Please note you should never run a Redis and MongoDB server on the same machine. MongoDB memory is designed to be swapped out, Redis is not. If MongoDB triggers some swapping activity, the performance of Redis will be catastrophic. They should be isolated on different nodes.
Obviously there are far more differences than this, but for an extremely high-level overview:
For use-cases:
Redis is often used as a caching layer or shared whiteboard for distributed computation.
MongoDB is often used as a drop-in replacement for traditional SQL databases.
Technically:
Redis is an in-memory db with disk persistence (the whole db needs to fit in RAM).
MongoDB is a disk-backed db which only needs enough RAM for the indexes.
There is some overlap, but it is extremely common to use both. Here's why:
MongoDB can store more data cheaper.
Redis is faster for the entire dataset.
MongoDB's culture is "store it all, figure out access patterns later"
Redis's culture is "carefully consider how you'll access data, then store"
Both have open source tools that depend on them, many of which are used together.
Redis can be used as a replacement for a traditional datastore, but it's most often used with another normal "long" data store, like Mongo, Postgresql, MySQL, etc.
Redis works excellently with MongoDB as a caching server. Here is what happens.
Anytime mongoose issues a query, it will first go over to the cache server.
The cache server will check to see if that exact query has ever been issued before.
If it hasn't, the cache server will take the query and send it over to mongodb, and Mongo will execute the query.
The result of that query then goes back to the cache server, and the cache server stores the result of the query on itself.
It will say: anytime I execute that query, I get this response. So it's going to maintain a record between queries that are issued and responses that come back from those queries.
The cache server will take the response and send it back to mongoose; mongoose will give it to express, and it eventually ends up inside the application.
Anytime the same exact query is issued again, mongoose will send it to the cache server, but if the cache server sees that this query was issued before, it will not send the query on to mongodb. Instead it takes the response it got for that query last time and immediately sends it back over to mongoose. There are no indices here, no full collection scan, nothing.
We are doing a simple lookup that asks: has this query been executed? Yes? Okay, take the response and send it back immediately, and don't send anything to mongo.
We have the mongoose server, the cache server (Redis), and MongoDB.
On the cache server there is a key-value datastore, where every key is some query that has been issued before and the value is the result of that query.
So maybe we are looking up a bunch of blogposts by _id.
So maybe the keys in here are the _id of the records we have looked up before.
So let's imagine that mongoose issues a new query where it tries to find a blogpost with an _id of 123. The query flows into the cache server, and the cache server checks whether it has a result for any query that was looking for an _id of 123.
If it does not exist in the cache server, the query is taken and sent on to the mongodb instance. MongoDB will execute the query, get a response, and send it back.
This result is sent back over to the cache server, which takes it and immediately sends it back to mongoose, so we get as fast a response as possible.
Right after that, the cache server also adds the issued query to its collection of queries and stores the result right up against it.
So we can imagine that in the future we issue the same query again: it hits the cache server, the cache server looks at all the keys it has and says, I already found that blogpost. It doesn't reach out to mongo; it just takes the result of the query and sends it directly to mongoose.
We are not doing complex query logic, no indices, nothing like that. It's as fast as possible. It's a simple key-value lookup.
That's an overview of how the cache server (Redis) works with MongoDB.
Now there are other concerns. Are we caching data forever? How do we update records?
We don't always want to be storing data in the cache and reading stale results from it.
The cache server is not used for any write actions; the cache layer is only used for reading data. If we ever write data, the write always goes to the mongodb instance, and we need to ensure that anytime we write data, we clear any data stored on the cache server that is related to the record we just updated in Mongo.
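Putting the whole flow together, here is a rough sketch of such a read-through cache with invalidation, using node-redis; the key scheme, the 60-second TTL, and the `Blogpost` model are illustrative choices, not a fixed recipe:

```js
// Read path: check Redis first, fall back to MongoDB, then cache the result.
async function cachedFind(model, query) {
  const key = `${model.modelName}:${JSON.stringify(query)}`;

  const hit = await redis.get(key);
  if (hit) return JSON.parse(hit); // cache hit: no MongoDB round trip

  const docs = await model.find(query).lean(); // cache miss: ask MongoDB
  await redis.set(key, JSON.stringify(docs), { EX: 60 });
  return docs;
}

// Write path: write to MongoDB, then invalidate related cache entries.
async function updateBlogpost(id, changes) {
  await Blogpost.updateOne({ _id: id }, { $set: changes });

  // Simplest invalidation strategy: drop every cached query for this model.
  for await (const key of redis.scanIterator({ MATCH: 'Blogpost:*' })) {
    await redis.del(key);
  }
}
```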