Updating data in Mongo sorted by a particular field - mongodb

I originally posted this question on the Software Engineering portal without conducting any tests, and it was brought to my notice that it should be posted on SO instead. Thanks for the help in advance!
I need Mongo to return documents sorted by a field value. The easiest way to achieve this would be to run db.collectionName.find().sort({field:priority}). However, when I tried this on a dummy collection of 1000 documents, it ran in 22ms, while db.collectionName.find() on the same data ran in 3ms, which means Mongo is spending time sorting the documents before returning them (which is understandable). Both tests were done in the same environment by adding .explain("executionStats") to the query.
I will be working with a large amount of data and concurrent requests to access DB, so I need the querying to be faster. My question is, is there a way to always keep the data sorted by a field in the DB so that I don't have to sort it over and over for all requests? For instance, some sort of update command that could sort the entire DB once a week or so?

A non-unique index on that field in this collection will give you the results you're after and avoid the inefficient in-memory sort.
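For example, assuming the sort field is called priority and the collection is collectionName (names taken from the question, so adjust them to your schema), a rough sketch would be:

// Create a non-unique ascending index on the sort field (one-time operation)
db.collectionName.createIndex({ priority: 1 })

// The sort can now be satisfied by walking the index instead of sorting in memory
db.collectionName.find().sort({ priority: 1 }).explain("executionStats")

In the executionStats output the winning plan should contain an IXSCAN stage and no in-memory SORT stage.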

Related

How to efficiently query MongoDB for documents when I know that 95% are not used

I have a collection of ~500M documents.
Every time I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever that document is returned by a query. After a few months of running the system in production, I discover that the counter is greater than 0 (zero) for only 5% of the documents. Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
~500M documents? That is quite a solid figure, good job if that's true. So here is how I see the solution to the problem:
If you want to rewrite/refactor and rebuild the app's DB, you could use the versioning pattern.
What does it look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you would run find only on the relevant-docs collection (which stores the 5% of your docs that are actually used), and only if nothing is found fall back to Irrelevant.find(). This pattern also lets you keep old/historical data and manage it via a TTL index or a capped collection.
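A minimal sketch of that fallback, with Relevant/Irrelevant as illustrative collection names and someId standing in for whatever you query by:

// Look in the small "hot" collection first (the ~5% of documents actually used)
var doc = db.Relevant.findOne({ _id: someId });

// Only fall back to the historical collection when the hot lookup misses
if (doc === null) {
    doc = db.Irrelevant.findOne({ _id: someId });
}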
You could also add some Redis magic on top of it (which uses precisely the same logic).
This article can also be helpful (as can many others, like this SO question).
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
Yes, that will deal with your problem. To see it in action, download MongoDB Compass, add this boolean field to your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on a single field, measure the performance by querying only that field.
If your index is being used (and actually speeds queries up), Compass will show you that.
To measure the performance of the queries (with and without indexing), use the Explain tab.
Actually, all of this can be done without Compass, via .explain() and the index commands in the shell, but Compass has better visuals for the process, so it's worth using. Especially since it has become completely free for everyone.
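For reference, a rough shell-only version of the above, assuming the flag is called consumed and yourCollection is a placeholder name:

// Backfill the flag with a default value on existing documents
db.yourCollection.updateMany({ consumed: { $exists: false } }, { $set: { consumed: false } })

// Index the flag (keep compound indexes in mind if your queries filter on more fields)
db.yourCollection.createIndex({ consumed: 1 })

// Compare execution stats with and without the index
db.yourCollection.find({ consumed: true }).explain("executionStats")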

Performance loss with big size of collections

I have a collection named "test" that has 132K documents in it. When I get the first document of the collection it takes 2-5ms, but it's not the same for the last document: it takes 100-200ms to pull.
So I've decided to ask the community.
My questions
What is the best document amount in one collection for the performance?
Why does it take so long to get the last document from the collection? (I actually don't fully understand how mongo works.)
What should I do for this issue and future problems?
After some research into how MongoDB works, I found the solution. I wasn't using any indexes on my collection, so every pull scanned every document. After creating some indexes for my needs, queries are much faster than before, down to about 1ms.
Conclusion
Create indexes for your collection and your needs. They make read operations far more efficient (at a small cost to writes for each index you add). Also keep researching, because there are options like background that prevent blocking operations while the index is being created.
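As a sketch, assuming the collection is test (as in the question) and the slow query filters on a hypothetical createdAt field:

// Build the index in the background so it does not block reads/writes
// (on MongoDB 4.2+ this flag is ignored; builds no longer block the collection for their full duration)
db.test.createIndex({ createdAt: 1 }, { background: true })

// Confirm the query now uses the index (look for IXSCAN in the winning plan)
db.test.find({ createdAt: { $gte: ISODate("2020-01-01T00:00:00Z") } }).explain("executionStats")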

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This makes it extremely fast and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time because $out deletes the whole results table before pushing the new query in.
I see three possible workarounds:
Run the aggregation, but manually push the results into the results collection
$out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
$out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after say 24 hours (like I do with specific query results in the main collection today).
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just immediately remove them because a user might want to have multiple tabs open with different reports. Even still, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!
Dropping a collection in MongoDB is a very efficient operation, and in any case much more efficient than deleting a subset of documents from a larger collection.
The maximum number of collections is quite high, limited only by the namespace file size in MMAPv1, while no hard limit exists in the WiredTiger engine.
So I would favor your solution #3.
Some improvements/alternatives you can think:
Consider creating the collections in a separate database (say, one per day); then you can drop the entire database in a single operation without having to drop individual collections (see the sketch after these suggestions).
Use an endpoint for the result set, cache the results, then drop the $out collection. Let the cache handle user requests and only rerun the aggregation once the cache has expired.
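A rough sketch of the per-day database idea; source, queryId and reports_2017_05_01 are illustrative, and note that before MongoDB 4.4 $out can only write into the database the aggregation runs in:

// Write each report's results into its own dynamically named collection
db.source.aggregate([
    { $match: { /* report criteria */ } },
    { $out: "results_" + queryId }
])

// Daily cleanup: if the temporary collections live in a per-day database,
// a single dropDatabase() removes all of them at once
db.getSiblingDB("reports_2017_05_01").dropDatabase()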
This kind of activity is done very easily in relational databases such as mysql or pgsql. You might consider synchronising your data to a separate relational database for the purposes of reporting.
There is a package https://github.com/perak/mysql-shadow which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just a one-way sync is more likely to succeed.
The other option is to use GraphQL over a mongo/mysql hybrid database, which can be done with the Apollo stack http://www.apollodata.com/

Is it possible to run queries on 200GB data on mongodb with 16GB RAM?

I am trying to run a simple query to find number of all records with a particular value using:
db.ColName.find({id_c:1201}).count()
I have 200GB of data. When I run this query, mongodb takes up all the RAM and my system starts lagging. After an hour of futile waiting, I give up without getting any results.
What can be the issue here and how can I solve it?
I believe the right approach in the NoSQL world isn't to run a full query like that, but to accumulate stats over time.
For example, you could have a stats collection of small objects, each owning a kind or id property that takes a value like "totalUserCount". Whenever you add a user, you also update this count.
This way you'll get instant results. It's just getting a property value in a small collection of stats.
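A minimal sketch of such a counter, using a hypothetical stats collection and totalUserCount key:

// Bump the counter atomically whenever a user is added (upsert creates it on first use)
db.stats.updateOne(
    { _id: "totalUserCount" },
    { $inc: { value: 1 } },
    { upsert: true }
)

// Reading the total is then a single lookup on _id
db.stats.findOne({ _id: "totalUserCount" })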
BTW, the slowness is most likely caused by querying on a non-indexed property in your collection. Try indexing id_c and you'll probably get results much more quickly.
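For the query from the question that would be:

// Index the filtered field so the count no longer has to scan every document
db.ColName.createIndex({ id_c: 1 })
db.ColName.find({ id_c: 1201 }).count()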
That amount of data can easily be managed by MySQL, MSSQL or Oracle with the given hardware specification. You don't need a NoSQL database for that, NoSQL databases are made for much larger storing needs which actually require lots of hardware (RAM, harddisks) to be efficient.
You need to define an index to read that id and use a normal SQL database.

Prioritize specific long-running operation

I have a mongo collection with a little under 2 million documents in it, and I want to run a query that will delete around 700,000 of them, based on a date field.
The remove query looks something like this:
db.collection.remove({'timestamp': { $lt: ISODate('XXXXX') }})
The exact date is not important in this case, the syntax is correct and I know it will work. However, I also know it's going to take forever (last time we did something similar it took a little under 2 hours).
There is another process inserting and updating records at the same time that I cannot stop. However, as long as those insertions/updates "eventually" get executed, I don't mind them being deferred.
My question is: Is there any way to set the priority of a specific query / operation so that it runs faster / before all the queries sent afterwards? In this case, I assume mongo has to do a lot of swapping data in and out of the database which is not helping performance.
I don't know whether the priority can be fine-tuned, so there might be a better answer.
A simple workaround might be what is suggested in the documentation:
Note: For large deletion operations it may be more effect [sic] to copy the documents that you want to save to a new collection and then use drop() on the original collection.
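A rough sketch of that copy-and-drop approach, using the collection and field names from the question (the cutoff placeholder is kept as-is); note that writes arriving in the original collection during the copy would have to be stopped or reconciled afterwards:

// Copy only the documents worth keeping into a new collection
db.collection.aggregate([
    { $match: { timestamp: { $gte: ISODate('XXXXX') } } },
    { $out: "collection_keep" }
])

// Drop the original and move the trimmed copy into its place
db.collection.drop()
db.collection_keep.renameCollection("collection")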
Another approach is to write a simple script that fetches e.g. 500 elements and then deletes them using $in. You can add some kind of sleep() to throttle the deletion process. This was recommended in the newsgroup.
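A minimal sketch of such a throttled batch delete in the shell (batch size and sleep interval are arbitrary):

var cutoff = ISODate('XXXXX');   // same cutoff as the remove query above
while (true) {
    // Fetch a small batch of _ids matching the deletion criteria
    var ids = db.collection.find({ timestamp: { $lt: cutoff } }, { _id: 1 })
                           .limit(500)
                           .toArray()
                           .map(function (d) { return d._id; });
    if (ids.length === 0) break;

    // Delete just that batch, then pause so concurrent writes can proceed
    db.collection.remove({ _id: { $in: ids } });
    sleep(100);   // milliseconds
}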
If you will encounter this problem in the future, you might want to
Use a day-by-day collection so you can simply drop the entire collection once data becomes old enough (this makes aggregation harder), or
use a TTL-Collection where items will time out automatically and don't need to be deleted in a bunch.
If your application needs to delete data older than a certain amount of time, I suggest using TTL indexes. Example (from the MongoDB site):
db.log.events.ensureIndex( { "status": 1 }, { expireAfterSeconds: 3600 } )
This works like a capped collection, except that data is deleted by age. The biggest win for you is that deletion happens in a background thread, so your inserts/updates are mostly unaffected. I use this technique on a SaaS-based product in production and it works like a charm.
This may not be your use case, but I hope it helps.