MongoDB big collection aggregation is slow

I'm having a problem with the execution time of my MongoDB query, issued from a Node backend using Mongoose. I have a collection called people with 10M records; every record is queried from the backend, and inserted from another part of the system that is written in C++ and needs to be very fast.
This is my Mongoose schema:
{
  _id: { type: String, index: { unique: true } }, // We generate our own _id! Might it be related to the slowness?
  age: { type: Number },
  id_num: { type: String },
  friends: { type: Object }
}
schema.index({'id_num': 1}, { unique: true, collation: { locale: 'en_US', strength: 2 } })
schema.index({'age': 1})
schema.index({'id_num': 'text'});
friends is an object that looks like this: {"Adam": true, "Eve": true... etc.}.
The values have no meaning; we use a dictionary so the C++ side can reject duplicates quickly.
Also, we didn't find a set/unique-list type of field in MongoDB.
The Problem:
We display people in a table with pagination. The table supports sorting, searching, and choosing the number of results per page.
At first, I queried all the people and did the searching, sorting and paging in JS, but when there are a lot of documents this becomes problematic (memory issues).
The next thing I tried was pushing those manipulations (searching, sorting and paging) into the query itself.
I used Mongo's text search, but it doesn't match partial words. Is there any way to search for a partial, case-insensitive string? (I'd prefer not to use regex, to avoid unexpected problems.)
I have to sort before paging, so I tried to use Mongo's sort. The problem is that when the user wants to sort by "Friends", we want to return the people sorted by their number of friends (the number of entries in the object).
The only way I managed to pull it off was with $addFields in an aggregation:
{$addFields: {friends_count: {$size: {$ifNull: [{$objectToArray: '$friends'}, [] ]}}}}
This addition takes forever! When sorting by friends, the query takes about 40s for 8M people; without this part it takes less than a second.
I used limit and skip for pagination. It works OK, but the user has to wait through another very long query every time they request the next page.
In the end, this is the interesting part of the code:
const { sortBy, sortDesc, search, page, itemsPerPage } = req.query
// Search never matches a partial string
const match = search ? { $text: { $search: search } } : {}
const sortByInDB = ['age', 'id_num']
let sort = { $sort: {} }
const aggregate = [{ $match: match }]
// If sortBy is a simple field, we just use Mongo's sort;
// otherwise we sort by friends and add a friends_count field.
if (sortByInDB.includes(sortBy)) {
  sort.$sort[sortBy] = sortDesc === 'true' ? -1 : 1
} else {
  sort.$sort[sortBy + '_count'] = sortDesc === 'true' ? -1 : 1
  // The problematic part of the query:
  aggregate.push({ $addFields: { friends_count: { $size: {
    $ifNull: [{ $objectToArray: '$friends' }, []]
  } } } })
}
const numItems = parseInt(itemsPerPage)
const numPage = parseInt(page)
aggregate.push(sort, { $skip: (numPage - 1) * numItems }, { $limit: numItems })
// Takes a long time (when sorting by "friends")
let users = await User.aggregate(aggregate)
I tried indexing all the simple fields, but the query time is still too long.
The only other solution I could think of is having Mongo maintain a "friends_count" field every time a document is created or updated, but I have no idea how to do that without slowing down the C++ side that writes to the DB.
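For illustration, this is roughly what I imagine a one-off backfill would look like (just a sketch, untested; it needs MongoDB 4.2+ update pipelines, and db.people stands for whatever collection the model maps to):
// Backfill friends_count once; afterwards it could be indexed and kept up to date on writes
db.people.updateMany({}, [
  { $set: { friends_count: { $size: { $ifNull: [{ $objectToArray: "$friends" }, []] } } } }
])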
Do you have any creative ideas to help me? I'm lost, and I have to cut the time drastically.
Thank you!
P.S. Some useful information: the C++ side writes the people to the DB in bulk once in a while. We can sync once in a while and mostly rely on the data being correct. So if that gives any of you an idea for a performance boost, I'd love to hear it.
Thanks!

Related

MongoDB query is performing slow when using $in operator

With some complex queries and aggregation, I got the list of object ids as below:
var objectIdsCollection = [...] // an array of ObjectIds
I have another collection, let's say Collection1, which has a field OtherCollectionObjectId. Now I need to filter Collection1 based on the ObjectIds I got in that array.
Collection1 {
  _id: ObjectId('xyz'),
  name: ...,
  someOtherAttributes: ...,
  OtherCollectionObjectId: ObjectId('abc')
}
Below is the query I am trying. I need to use await because the result of this query depends on another query.
let queryData = await Document1.aggregate([
  {
    $match: {
      OtherCollectionObjectId: {
        $in: objectIdsCollection,
      },
      Deleted: false,
    },
  },
]);
But this query performs very slowly; sometimes it takes around a minute to fetch the results.
I tried a couple of other suggestions from the internet, but nothing seems to work for this kind of scenario.
Please let me know anything which can improve the performance of this query.
Thanks

Best way to count documents in mongoDB

We have a collection with a large number of documents, let's say around 100k. We now want to count the number of documents that have the key x set.
If I try it with Collection.countDocuments({ x: { $exists: true } }) I get the result, but it instantly produces a warning in the console: Query Targeting: Scanned Objects / Returned has gone above 1000.
So, is there a better way to count the documents? There is an index on the field; is it possible to get the length of the index?
Thanks
There's no real way of viewing the index trees in Mongo; what other people have linked just returns the size of the tree, and I'm not sure how useful that information is in this context.
Now to your question: is this the best way to count?
The answer is Yes ... -ish.
countDocuments is a wrapper function; it just runs the following pipeline:
db.collection.aggregate([
  { $match: <query> },
  { $group: { _id: null, n: { $sum: 1 } } }
])
This pipeline is the most efficient way to go, but the difference between running this aggregation yourself and using the wrapper function is only about 100-200 milliseconds, depending on your machine spec.
Meaning if you're looking for "way" better performance, you're not going to find it.
With that said, this warning is not very meaningful here: it just means more than 1000 documents were scanned per result returned, i.e. you have well over 1000 documents with that field. Its true purpose is to alert you in the case where you're trying to query 1-20 documents without a proper index.
You can use the indexSizes field returned by the stats() method.
The stats() method "Returns statistics about the collection".
See the example here:
https://docs.mongodb.com/manual/reference/method/db.collection.stats/#basic-stats-lookup
{
...,
"indexSizes" : {
"_id_" : 237568,
"cuisine_1" : 143360,
"borough_1_cuisine_1" : 151552,
"borough_1_address.zipcode_1" : 151552
},
...
}
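For example, with the restaurants collection from the linked docs page, the sizes can be read directly:
db.restaurants.stats().indexSizes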
The indexSizes field returns the size as in the storage space used, not a document count.
Check with explain() whether the index is actually being used (and update the question with the output).
You can also pass the hint option to check the performance after specifying an index explicitly.
Alternatively, pre-calculating the count with the $inc operator might be a good option, if that is possible in your use case.
You could also try cursor.count() to see if it's faster; countDocuments should be faster, but there's no harm in checking:
https://docs.mongodb.com/manual/reference/method/cursor.count/
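For example, something like the following (a sketch; the counters collection and its key are illustrative, not part of the question):
// Does the count use the { x: 1 } index? Check the plan of the equivalent pipeline:
db.collection.explain("executionStats").aggregate([
  { $match: { x: { $exists: true } } },
  { $group: { _id: null, n: { $sum: 1 } } }
])
// Compare timings while forcing the index explicitly:
db.collection.countDocuments({ x: { $exists: true } }, { hint: { x: 1 } })
// Sketch of a pre-calculated counter: bump it whenever a document with x is inserted
db.counters.updateOne({ _id: "docs_with_x" }, { $inc: { n: 1 } }, { upsert: true })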

How to avoid memory limit when skipping a large amount of records with Mongoose?

On a collection with over 100k records, when I query with Mongoose options like so:
contact.find({}, {}, {
  collation: {
    locale: 'en_US',
    strength: 1
  },
  skip: 90000,
  limit: 10,
  sort: {
    email: 1
  }
});
I get this error:
MongoError: Executor error during find command: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.
But I do have an index on the email field:
{
"v" : 2,
"key" : {
"email" : 1
},
"name" : "email_1",
"ns" : "leadfox.contact",
"background" : true
}
On the other hand when I query in the Mongo shell it works fine:
db.contact.find().sort({email: 1}).skip(90000).limit(10)
What you are experiencing is because of skip. As you can see in the documentation:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
You should find a better approach than skip. Since you are sorting documents by the email field, you can write a range query on email instead of using skip, like this:
contact.find({ "email": { $gt: the_last_email_from_previous_query } }, {}, {
collation: {
locale: 'en_US',
strength: 1
},
limit: 10,
sort: {
email: 1
}
});
Update:
First of all, like I said above, what you want is not possible; MongoDB says it, not me.
Secondly, I suggest you look into modern pagination methods and real user behaviour. The example in your comment is unrealistic: no user would jump directly to the 790th page of any data. If they do go directly to a page like that, it most probably means they have already covered the data up to the 790th page and want to continue. So even if you are building a stateless system (like all modern systems these days), you should store some information about the user's last viewing point in your paginated data. This is one example of an approach (I am not saying it is the best, just an example) based on user behavior.
Another approach you can use (like most modern pagination tables) is to only allow the user to navigate 5-6 pages forward or backward. That way you only skip 50-60 documents in your query, combined with $gt and $lt on the email field, as sketched below.
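A rough sketch of that idea, assuming the last email value shown on the current page is kept around (lastEmail below is illustrative, not from the question):
// Jump at most a couple of pages ahead of the page the user is on
contact.find({ email: { $gt: lastEmail } }, {}, {
  collation: {
    locale: 'en_US',
    strength: 1
  },
  skip: 2 * 10,   // e.g. two pages ahead instead of 90000 documents
  limit: 10,
  sort: {
    email: 1
  }
});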
Another approach can be caching data in memory with some other tools.
I think you get the picture. Happy coding.

Is it possible to physically reorganize a mongoDB collection to avoid using a sort()?

I have a collection that stores information about articles. The collection is for archival purposes so it is read only. Only two fields are being used at the moment: "title" and "page_length". Because I am always interested in getting longer articles first, I have the following index in place: { title: 1, page_length: -1}.
I have found that sorts are still slow because the collection is very large and won't fit into memory.
Assuming that almost every query I use on this collection will require a sort({page_length:-1}), is there any way to simply have the records stored on disk in order of page_length descending? In other words, is there a simple way to make the first record in the collection the largest page_length value, the second record the second largest, and so on?
That way I could just grab the first n records using limit(n) without having to run a sort. Any ideas?
Updating with more information:
I'm using this for a search autocomplete feature so speed is critical. The query I've been using looks like this:
db.articles.find({"title": /^SomeKeyword/}).sort({page_length:-1})
I'm happy to create multiple indexes since inserts are not a concern, I just want to maximize read speed.
EDIT: For reference, I actually was able to reorganize the records by copying them with a find().forEach() into a new collection. I then searched that collection and grabbed the first N results without needing any sort, which worked very well. Note that this ONLY works because my dataset never changes.
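Roughly, that reorganization could look like this (a sketch only; the target collection name is illustrative, and it assumes an index on { page_length: -1 } so the sort can stream from the index instead of hitting the in-memory sort limit):
db.articles.find().sort({ page_length: -1 }).forEach(function (doc) {
  db.articles_by_length.insert(doc);
});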
Your index { title: 1, page_length: -1 } is not used for a query that looks like this:
db.collection.find( {} ).sort( { page_length: -1 } );
MongoDB can only use compound indexes from left to right, so in order for the index to be used, you need to have the "title" as a find or sort argument:
db.collection.find( { title: 'foo' } ).sort( { page_length: -1 } );
db.collection.find().sort( { title: 1, page_length: -1 } );
Explain will tell you:
db.so.find( {} ).sort( { page_length: -1 } ).explain();
{
"cursor" : "BasicCursor",
…
If you change your index to:
db.so.ensureIndex({ page_length: -1, title: 1 } );
Then the index will be used for sorting, but you can't use it for a lookup by title alone; you will need an additional index for that. If you're really only interested in those two fields, making sure you use a covered index helps. You will need the compound index { page_length: -1, title: 1 }, and you can make sure it is used by adding a projection:
db.collection.find( {}, { page_length: 1, title: 1, _id: 0 } ).sort( { page_length: -1 } );
But you can not decide or influence how MongoDB stores things on disk.
I can think of a solution that uses two queries.
First, you can do a covered query to get the list of documents you care about. Second, you can use the list of documents retrieved and the $in operator to get the final result.
The covered query will operate within memory (or at least sequentially on disk), so it should be fast, and the $in can utilize the _id index and should be tolerably efficient with a reasonable number of documents.
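A rough sketch of the two-query idea, using the collection and keyword from the question (note the first query is only fully covered if the index also contains _id, e.g. { title: 1, page_length: -1, _id: 1 }):
// 1) Get the _ids of the top-N longest matching articles
var ids = db.articles.find(
  { title: /^SomeKeyword/ },
  { _id: 1 }
).sort({ page_length: -1 }).limit(10).toArray().map(function (doc) { return doc._id; });
// 2) Fetch the full documents via the _id index
// (re-order client-side to match ids if the original order matters)
db.articles.find({ _id: { $in: ids } });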

how do I do 'not-in' operation in mongodb?

I have two collections: shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and a person can be on the beach, shopping, doing both, or doing neither on any given day. I now want to run the query: all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL DBs. I saw similar questions around joins, and in most cases denormalizing was suggested. So one solution I could think of is to create a collection, activity, indexed on date, with the user's actions embedded. Something like:
{
  user_id,
  date,
  actions: [action_type, ...]
}
Insertion now becomes costly, as now I will have to query before insert.
A few suggestions:
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When new activity comes in you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, activity then your inserts are much faster but your queries now have to do a LOT of work querying for both users, dates and activities.
With the proposed schema, here is the insert/update statement:
db.coll.update({"user": "username", "date": "somedate"}, {$inc: {"shopped": 1}}, true)
What that says is: for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist, a.k.a. an "upsert" (that's the last true argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date": "somedate", "shopped": 0, "beached": {$gt: 1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section in the linked docs for an explanation, and keep in mind that large documents will keep getting pulled into your working data set; if they are huge and full of useless (old) data, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes for each user, where you track how many friends they have or other semi-stable information about them, and also have a user_activity collection where you add a record per day per user for the activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are is the first suggestion I made.
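For illustration, that split might look roughly like this (field names are just an example, not from the question):
// users: one small, mostly stable document per user
{ _id: "user1", name: "...", friend_count: 42 }
// user_activity: one small document per user, per day, per activity
{ user_id: "user1", date: ISODate("2012-12-01"), action: "shopping" }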
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with RDBMS, insertion can be (relatively) costly when there are indices in place on the table (ie, usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
  user_id,
  date,
  action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27);
var end = new Date(2012, 6, 3);
db.activities.find({
  date: { $gte: start, $lt: end },
  action: { $in: ["beach", "shopping"] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
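That last step could look something like this (Node.js; results is assumed to be the array returned by the query above):
const shoppers = new Set();
const beachGoers = new Set();
for (const doc of results) {
  if (doc.action === 'shopping') shoppers.add(doc.user_id);
  if (doc.action === 'beach') beachGoers.add(doc.user_id);
}
// user_ids that went shopping but never to the beach in the date range
const shoppedNotBeached = [...shoppers].filter(id => !beachGoers.has(id));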
One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
{ action_type: "beach", date: "6/1/2012" },
{ action_type: "shopping", date: "6/2/2012" }
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find({
  actions: {
    $elemMatch: {
      action_type: { $in: ["shopping"] },
      date: { $gt: start }
    }
  }
});
Expanding on this, you can use the $and operator to find all people who went shopping but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find({
  $and: [
    { actions: {
        $elemMatch: {
          action_type: { $in: ["shopping"] },
          date: { $gt: start }
        }
    } },
    { actions: {
        $not: {
          $elemMatch: {
            action_type: { $in: ["beach"] },
            date: { $gt: start }
          }
        }
    } }
  ]
});