Store many documents with Mongoose fast - mongodb

I need to insert 10K documents as fast as possible but it's taking a long time.
I am currently using Model.create([<huge array here>]) to do this.
Would it help to use multiple connections to the database? For example have 10 connections saving 1K each?

You can use Model.insertMany(docs, options).
A few things to note below.
Connection Pool
10 connections is usually sufficient, but it greatly depends on your hardware. Opening up more connections may slow down your server.
In some cases, the number of connections between the applications and the database can overwhelm the ability of the server to handle requests.
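As a rough sketch, the pool size can be set when connecting; the URI below is a placeholder, and the option is named poolSize on older Mongoose/driver versions and maxPoolSize on newer ones:

    const mongoose = require('mongoose');

    // A pool of ~10 connections is usually plenty for a single app server;
    // raising it further can add load without improving throughput.
    mongoose.connect('mongodb://localhost:27017/test', {
      maxPoolSize: 10, // use poolSize instead on older Mongoose/driver versions
    });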
Options
There are a couple of options for insertMany that can speed up insertion.
[options.lean «Boolean» = false] If true, skips hydrating and validating the documents. This option is useful if you need the extra performance, but Mongoose won't validate the documents before inserting.
[options.limit «Number» = null] This limits the number of documents being processed (validation/casting) by Mongoose in parallel; it does NOT send the documents in batches to MongoDB. Use this option if you're processing a large number of documents and your app is running out of memory.
Write concern
Setting options on writeConcern in options can also affect performance.
If applications specify write concerns that include the j option, mongod will decrease the duration between journal writes, which can increase the overall write load.
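Putting the above together, here is a minimal sketch of a fast bulk insert with Mongoose (the Item model and its schema are assumptions made up for illustration):

    const mongoose = require('mongoose');

    // Hypothetical model; substitute your own schema/model.
    const Item = mongoose.model('Item', new mongoose.Schema({ name: String }));

    async function bulkInsert(docs) {
      // lean: true skips hydration/validation for extra speed,
      // limit: 1000 caps how many documents Mongoose casts in parallel.
      return Item.insertMany(docs, { lean: true, limit: 1000 });
    }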

Use db.collection.insertMany([])
insertMany accepts an array of objects and is used to perform bulk insertions.
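For illustration, a minimal shell-style call (collection name and fields are placeholders):

    // Insert several documents in one round trip.
    db.items.insertMany([
      { name: "a" },
      { name: "b" },
      { name: "c" }
    ]);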

Related

Does a running MongoDB aggregation pipeline slow down reads and writes to the affected collection?

As the title suggests, I'd like to know if reads and writes to a collection are delayed/paused while a MongoDB aggregation pipeline is running. I'm considering adding a pipeline in a user collection, and I think the query could sometimes affect a lot of users (possibly tens of thousands), or just run for longer than I expect. So I'm wondering if that will "block" reads and writes to the collection. The server isn't live, so I don't have real user data to inform this decision. I'd appreciate any feedback or suggestions, thanks!
Each server has a certain resource capacity. If you are sending a query to the server, it has less capacity remaining to do other work (be that other queries or writes).
For locking and concurrency in MongoDB, see https://docs.mongodb.com/manual/faq/concurrency/.
If you are planning for high load/high throughput, you need to benchmark your specific use case.

MongoDB concurrency - reduced performance

I understand that MongoDB does locking on read and write operations.
My Use case:
Only read operations. No write operations.
I have a collection of about 10 million documents. The storage engine is WiredTiger.
Mongo version is 3.4.
I made a request which should return 30k documents - it took 650ms on average.
When I made the same request concurrently - 100 times - handling all requests took from a few seconds up to 2 minutes.
I have a single node to serve the data.
How do I access the data:
Each document contains 25 to 40 fields. I indexed a few fields. I query based on one indexed field.
The API will return all the matching documents in JSON form.
Other information: the API is written using Spring Boot.
Concurrency was tested through a JMeter shell script from the command line on a remote machine.
So,
My question:
Am I missing any optimizations? [storage engine level, version]
Can't I achieve having all read requests served in less than a second?
If so, what SLA can I keep for this use case?
Any suggestions?
Edit:
I enabled the database profiler in MongoDB at level 2.
My single query is internally converted to 4 queries:
Initial read
getMore
getMore
getMore
These are the queries found through profiler.
In total, it takes less than 100ms. Is that really true?
My concurrent queries:
Now, when I hit 100 requests, nearly 150 operations take more than 100ms, 100 operations take more than 200ms, and 90 operations take more than 300ms.
As per my single-query analysis, 100 requests will be converted to 400 queries internally. It is a fixed pattern, which I verified by checking the query tag in the profiler output.
I suspect this is what affects my request performance.
My single query is internally converted to 4 queries:
Initial read
getMore
getMore
getMore
It's the way Mongo cursors work. The documents are transferred from the db to the app in batches. IIRC the first batch is around 100 documents + a cursor id, then consecutive getMore calls retrieve the next batches by cursor id.
You can define the batch size (number of documents in the batch) from the application. A batch cannot exceed 16MB, e.g. if you set a batch size of 30,000 it will fit into a single batch only if the document size is less than about 500 bytes.
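As a sketch with the Node.js driver (collection name, filter, and URI are assumptions; the Java driver exposes an equivalent batchSize() option on its cursors):

    const { MongoClient } = require('mongodb');

    async function fetchAll() {
      const client = await MongoClient.connect('mongodb://localhost:27017');

      // Request up to 30,000 documents per batch, so a 30k-result query can
      // come back in the first reply instead of several getMore calls
      // (still subject to the 16MB batch cap).
      const docs = await client.db('test').collection('items')
        .find({ status: 'active' })
        .batchSize(30000)
        .toArray();

      await client.close();
      return docs;
    }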
Your investigation clearly shows performance degradation under load. There are too many factors, and I believe locking is not one of them. WiredTiger takes exclusive document-level locks for regular write operations, and you are doing only reads during your tests, aren't you? If in any doubt, you can compare the results of db.serverStatus().locks before and after the tests to see how many write locks were acquired. You can also run db.serverStatus().globalLock during the tests to check the queue. More details about locking and concurrency are here: https://docs.mongodb.com/manual/faq/concurrency/#for-wiredtiger
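For example, a quick check along these lines in the mongo shell, before, during, and after the test run:

    // Snapshot lock counters before/after the test and compare them.
    db.serverStatus().locks

    // While the test is running, see whether readers/writers are queueing.
    db.serverStatus().globalLock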
The bottleneck is likely somewhere else. There are a few generic things to check:
Query optimisation. Ensure you use indexes. The profiler should show no "COLLSCAN" stage in the execStats field (see the explain() sketch after this list).
System load. If your database shares system resources with the application, it may affect performance of the database. E.g. BSON to JSON conversion in your API is quite CPU hungry and may affect performance of the queries. Check the system's load average with top or htop on *nix systems.
MongoDB resources. Use mongostat and mongotop to check whether the server has enough RAM, IO, file descriptors, connections, etc.
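As a quick illustration of the index check mentioned above, in the mongo shell (collection name and filter are placeholders):

    // Inspect the winning plan: an IXSCAN stage means an index is used,
    // a COLLSCAN stage means a full collection scan.
    db.items.find({ status: 'active' }).explain('executionStats')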
If you cannot spot anything obvious, I'd recommend seeking professional help. I find the simplest way to get it is by exporting the data to Atlas and running your tests against the cluster. Then you can ask the support team whether they can advise any improvements to the queries.

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query for a collection. I am filtering by one field. I thought I could speed up the query if, based on this field, I made many separate collections whose name encodes the value of the field I previously filtered on. Practically, I could remove the filter component from the query, because I would only need to pick the right collection and return its documents as the response. But this way documents will be stored redundantly: a document that was previously stored only once might now be stored in several collections. Is this approach worth following? I use Heroku as the cloud provider. By increasing the number of dynos, it is easy to serve more user requests. As far as I know, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course an index exists for that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each collection also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelism.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.
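If you do go the sharding route, the basic shell commands look roughly like this (database, collection, and shard key are placeholders; choosing the key still needs the planning mentioned above):

    // Enable sharding for the database, then shard the busy collection.
    sh.enableSharding("mydb")

    // A hashed key on a high-cardinality field spreads load across shards.
    sh.shardCollection("mydb.users", { userId: "hashed" })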

MongoDB: what is faster: single find() query or many find_one()?

I have the following problem connected to the MongoDB database design. Here is my situation:
I have a collection with about 50k documents (15kB each),
every document has a dictionary storing data samples,
my query always gets all the data from the document,
every query uses an index,
the collection has only one index (based on a single datetime field),
in most cases, I need to get data from many documents (typically 25 < N < 100),
it is easier for me to perform many SELECT queries than a single one,
I have a lot of updates in my database, though far fewer than SELECTs,
I use the WiredTiger engine (the newest version of MongoDB),
server instance and web application are on the same machine.
I have two possibilities for making a SELECT query:
perform a single query retrieving all documents I am interested in,
perform N queries, everyone gets a single document, where typically 25 < N < 100 (what about a different scenario when 100 < N < 1k or 1k < N < 10k?)
So the question is whether there is any additional overhead when I perform many small queries instead of a single one. In relational databases making many queries is very bad practice - but in NoSQL? I am asking about general practice - should I avoid that many queries?
In the documentation, I read that the number of queries is not important, but rather the number of searches over documents - is that true?
Thanks for help ;)
There is a similar question to the one you asked: Is it ok to query mongodb multiple times
IMO, for your use case, i.e. 25 < N < 100, one should definitely go with batching.
In the case of single queries:
looping in a single thread will not suffice; you'll have to make parallel requests, which creates additional overhead,
every request creates TCP/IP overhead,
there is a certain amount of setup and teardown for each query, creating and exhausting cursors, which creates unnecessary overhead.
As explained in the answer above, there appears to be a sweet spot for how many values to batch up vs. the number of round trips, and that depends on your document type as well.
In broader terms, anything 10 < N < 1000 should go with batching (with the remaining records forming further batches); querying a single document at a time would definitely create unnecessary overhead.
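For instance, a minimal sketch of both approaches with the Node.js driver (database, collection, and URI are assumptions):

    const { MongoClient } = require('mongodb');

    async function fetchSamples(ids) {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      const samples = client.db('test').collection('samples');

      // N round trips: one findOne() per id pays network latency N times.
      // for (const id of ids) { await samples.findOne({ _id: id }); }

      // One round trip: batch the ids into a single $in query.
      const docs = await samples.find({ _id: { $in: ids } }).toArray();

      await client.close();
      return docs;
    }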
The problem when you perform many small queries instead of one query is network overhead, that is, the network latency of each round trip.
For a single request in batch processing it may not be much, but if you make many requests like these, or use this technique on the frontend, it will decrease performance.
Also, you may need to preprocess the data yourself, e.g. sorting and aggregating it manually.

How long will Mongo's internal cache last?

I would like to know how long Mongo's internal cache lasts. I have a scenario in which I have about one million records and I have to perform a search on them using the mongo-java driver.
The initial search takes a lot of time (nearly one minute), whereas consecutive runs of the same query take much less (a few seconds) due to Mongo's internal caching mechanism.
But I do not know how long this cache lasts - is it until the system reboots, until the collection undergoes a write operation, or something like that?
Any help in understanding this is appreciated!
PS:
Regarding the fields with which the search is performed, some are indexed and some are not.
Mongo version used: 2.6.1
It will depend on a lot of factors, but the most prominent are the amount of memory in the server and how active the server is as MongoDB leaves much of the caching to the OS (by MMAP'ing files).
You need to take a long hard look at your log files for the initial query and try to figure out why it takes nearly a minute.
In most cases there is some internal cache-invalidation mechanism that will drop the cached internal record for your query when a write operation occurs. That is the simplest description of the process, just from my own experience.
But, as mentioned earlier, there are many factors besides simple invalidation that can play a role.
MongoDB automatically uses all free memory on the machine as its cache. It would be better to use MongoDB 3.0+ because it ships with two storage engines, MMAPv1 and WiredTiger.
The major difference between the two is that whenever you perform a write operation with MMAPv1 the whole database is locked, whereas locking is at the document level in WiredTiger.
If you are using MongoDB 2.6 you can also check query performance and execution time with the explain() method, and in version 3.0+ with explain("executionStats") in the shell.
You need to index the particular field you will query on to get results faster. A single collection cannot have more than 64 indexes. The more indexes you use on a collection, the greater the performance impact on write/update operations.
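As a quick illustration of that last point, creating an index in the shell (collection and field names are placeholders; on 2.6 the helper is ensureIndex() rather than createIndex()):

    // Build an index on the queried field; remember that every extra index
    // adds write/update overhead, and a collection caps out at 64 indexes.
    db.records.createIndex({ sampleDate: 1 })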