var product = db.GetCollection<Product>("Product");
var lookup1 = new BsonDocument(
    "$lookup",
    new BsonDocument {
        { "from", "Variant" },
        { "localField", "Maincode" },
        { "foreignField", "Maincode" },
        { "as", "variants" }
    }
);
var pipeline = new[] { lookup1 };
var result = product.Aggregate<Product>(pipeline).ToList();
The data in the collection is very large, so it takes 30 seconds to put the results in the list.
What should I do to make a faster lookup?
What that query is doing is retrieving every document from the Product collection, and then, for each document found, performing a find query in the Variant collection. If there is no index on the Maincode field in the Variant collection, it will read the entire collection for each document.
This means that if there are, say, 1000 total products, with 3000 total variants (3 per product, on average), this query will be reading all 1000 documents from Product, and if that index isn't there, it would read all 3000 documents from Variant 1000 times, i.e. it will be examining 3 million documents.
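The arithmetic above can be modeled in a few lines of plain JavaScript (no server needed; the counts are the example's, the cost model is a simplification that counts documents read):

```javascript
// Back-of-the-envelope cost model for the numbers above
const products = 1000;   // documents in Product
const variants = 3000;   // documents in Variant (~3 per product)

// Without an index on Variant.Maincode: a full Variant scan per product
const withoutIndex = products + products * variants;

// With the index: only the ~3 matching variants are fetched per product
const withIndex = products + variants;

console.log(withoutIndex, withIndex);
```

The gap between the two numbers is exactly why the index suggestion below comes first.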
Some ways to possibly speed this up:
create an index on {Maincode:1} in the Variant collection
This will reduce the number of documents that must be read in order to complete the lookup
change the schema
If the variants are stored in the same document with the product, there is no need for a lookup
filter the products prior to lookup
Again, reducing the documents read during the lookup
use a cursor to retrieve the documents in batches
If you perform any necessary sorting first, and the lookup last, you can return the documents to the application in batches, which would allow the application to display or begin processing the first batch before the second batch is available. This doesn't make the query itself faster, but it can reduce the perceived wait in the application.
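The first and third suggestions can be sketched as follows. The filter field "Category" is a made-up example; the index would be created once in the mongo shell with db.Variant.createIndex({ Maincode: 1 }), and the pipeline is shown as a plain JavaScript value:

```javascript
// Filter products first, so fewer lookups run at all,
// then let the Maincode index serve each lookup cheaply.
const pipeline = [
  { $match: { Category: "books" } },      // hypothetical pre-lookup filter
  { $lookup: {
      from: "Variant",
      localField: "Maincode",
      foreignField: "Maincode",
      as: "variants"
  } }
];
// In the shell this would run as: db.Product.aggregate(pipeline)
console.log(Object.keys(pipeline[0])[0], Object.keys(pipeline[1])[0]);
```

The ordering matters: $match before $lookup shrinks the set of documents for which a lookup is performed at all.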
In my PostgresDB, I'm performing a deletion operation using another table as below.
DELETE FROM user_records
USING to_delete_records
WHERE user_records.record_id = to_delete_records.record_id
The user_records table contains around 200 million records, while the to_delete_records table contains around 5-10 million records. Every day the to_delete_records table is updated with a new set of records, and I have to perform the above deletion operation. (Similar to deletion, insertion operations of around 5-10 million records take place as well, so the total size of user_records remains around 200 million.)
Now I'm replacing the PostgresDB with a MongoDB, and following is the script I'm using for deleting records in user_records collection:
db.to_delete_records.find({}, {_id: 0}).forEach(function(doc) {
    db.user_records.deleteOne({record_id: doc.record_id});
});
As this runs a separate deleteOne for each document in a loop, it seems inefficient.
Is there a better way to delete documents of a collection using another collection in Mongo?
If record_id is a unique field in both user_records and to_delete_records, you can build a unique index for the field for each collection if you have not done so.
db.user_records.createIndex({record_id: 1}, {unique:true});
db.to_delete_records.createIndex({record_id: 1}, {unique:true});
Afterwards, you can use a $merge stage to add an auxiliary field toDelete to the user_records collection, based on the content of to_delete_records:
db.to_delete_records.aggregate([
    {
        "$merge": {
            "into": "user_records",
            "on": "record_id",
            "whenMatched": [
                { "$set": { "toDelete": true } }
            ]
        }
    }
])
Finally run a deleteMany on user_records
db.user_records.deleteMany({toDelete: true});
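An alternative to tagging with $merge is to delete in $in batches: read the record_ids from to_delete_records and issue one deleteMany per batch, as in db.user_records.deleteMany({ record_id: { $in: batch } }). The chunking logic is plain JavaScript (batch size is arbitrary here):

```javascript
// Split an array of ids into fixed-size batches for deleteMany({$in: ...})
function chunk(arr, size) {
  const out = [];
  for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size));
  return out;
}

const ids = [101, 102, 103, 104, 105];  // record_ids read from to_delete_records
const batches = chunk(ids, 2);
console.log(batches.length);            // 5 ids at batch size 2 gives 3 batches
```

With the unique index on record_id in place, each batched deleteMany is an indexed lookup rather than a collection scan.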
I am new to MongoDB and trying to understand the 100MB limit for aggregation pipelines. What does this actually mean? Does it apply to the size of the collection we are performing the aggregate on?
A bit of background: we have the following query on an inventory ledger, where we take a data set and run a group sum to find out which products are still in stock (i.e. the amount sum is greater than 0). Based on that result, we return the in-stock records by running a lookup against the original collection. The query is provided below.
Assume each inventory document contains about 10 subfields per record, and assume roughly 1000 records per 1MB.
QUESTION
My question is: if the inventory collection reaches 100MB as a JSON object array, does this mean the call will fail? I.e. is the maximum we can run the aggregate on 100MB, at 1000 records per MB = 100,000 records?
BTW we are on a server that does not support writing to disk, hence the question.
db.inventory.aggregate([
    {
        $group: {
            _id: { "group_id": "$product" },
            "quantity": { $sum: "$quantity" }
        }
    },
    {
        "$match": {
            "quantity": { $gt: 0 }
        }
    },
    {
        $lookup: {
            from: "inventory",
            localField: "_id.group_id",
            foreignField: "product",
            as: "records"
        }
    }
])
The 100MB limit is a restriction on the amount of memory used by an aggregation stage.
The pipeline in your question first needs to read every document from the collection. It does this by requesting the documents from the storage engine, which will read each document from the disk and store it in the in-memory cache. The cache does not count against the 100MB limit.
The aggregation process will receive documents individually from the storage engine and pass each one through the pipeline to the first blocking stage ($group is a blocking stage).
The group stage will examine the input document, update the fields in the matching group, and then discard the input document.
This means the memory required by the group stage will be the sum of:
the size of 1-2 documents
total storage size for each result group
any scratch space needed for the operations to build each result
The specific group stage in the question returns a product identifier and an integer.
Using the Object.bsonsize function in the mongo shell, we can see that a null product ID produces a 43-byte object:
> Object.bsonsize({_id:{group_id:null},quantity:0})
43
So the total memory required will be:
<number of distinct `product` values> x (<size of a product value> + 43)
Note that the values will be stored in BSON, so a string will be length+5, a UUID would be 21 bytes, etc.
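Plugging hypothetical numbers into that formula shows how far the stage typically is from the limit. The distinct-value count and string length below are assumptions, not figures from the question:

```javascript
// Worked example of: <distinct values> x (<size of a value> + 43)
const distinctProducts = 10000;    // hypothetical distinct `product` values
const productValueSize = 20 + 5;   // a 20-character string in BSON: length + 5
const perGroupBytes = productValueSize + 43;
const totalBytes = distinctProducts * perGroupBytes;
console.log(totalBytes);           // well under the 100MB stage limit
```

Even at a million distinct products this estimate stays well below 100MB; the limit bites when groups accumulate large values (e.g. $push-ing whole documents), not when they hold a counter.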
I have a simple query that I want to run on two different collections (each collection has around 50K records):
db.collectionA.updateMany({}, { $mul: { score: 0.3 } });
db.collectionB.updateMany({}, { $mul: { score: 0.8 } });
So basically I want to multiply the score field by a certain amount. On collectionA it takes 2-3s, and on collectionB it takes ~50s. I noticed that I don't have any index on the score field on collectionA (it is a dynamic field, updated very often), but for some reason I have a compound index including this field on collectionB.
My question is simple: why does it take so much time for mongo to execute this query, and what can I do to make it faster?
Thanks
I think you have answered the question yourself: there is an index on that field in the second collection, and on changing the value in a document, that index needs recalculating. You could try to drop that index and re-create it after the update has finished.
I am using the below query on my MongoDB collection which is taking more than an hour to complete.
db.collection.find({language:"hi"}).sort({_id:-1}).skip(5000).limit(1)
I am trying to get the results in batches of 5000, to process in either ascending or descending order, for documents with "hi" as the value of the language field. So I am using this query, in which I skip the already-processed documents each time by incrementing the "skip" value.
The document count in this collection is just above 20 million.
An index on the field "language" is already created.
The MongoDB version I am using is 2.6.7.
Is there a more appropriate index for this query which can get the result faster?
When you want to sort descending, you should create a multi-field index which uses the field(s) you sort on as descending field(s). You do that by setting those field(s) to -1.
This index should greatly increase the performance of your sort:
db.collection.ensureIndex({ language: 1, _id: -1 });
When you also want to speed up the other case - retrieving sorted in ascending order - create a second index like this:
db.collection.ensureIndex({ language: 1, _id: 1 });
Keep in mind that when you do not sort your results, you receive them in natural order. Natural order is often insertion order, but there is no guarantee for that. There are various events which can cause the natural order to get messed up, so when you care about the order you should always sort explicitly. The only exception to this rule are capped collections which always maintain insertion order.
In order to efficiently "page" through results in the way that you want, it is better to use a "range query" and keep the last value you processed.
Your desired "sort key" here is _id, so that makes things simple:
First you want your index in the correct order, created with .createIndex() (not the deprecated .ensureIndex() method):
db.collection.createIndex({ "language": 1, "_id": -1 })
Then you want to do some simple processing, from the start:
var lastId = null;
var cursor = db.collection.find({language:"hi"});
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
    // do something with your document. But always set the next line
    lastId = doc._id;
});
That's the first batch. Now when you move on to the next one:
var cursor = db.collection.find({ "language": "hi", "_id": { "$lt": lastId } });
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
    // do something with your document. But always set the next line
    lastId = doc._id;
});
This way the lastId value is always considered when making the selection. You store it between each batch, and continue on from the last one.
That is much more efficient than processing with .skip(), which regardless of the index will "still" need to "skip" through all data in the collection up to the skip point.
Using the $lt operator here "filters" all the results you already processed, so you can move along much more quickly.
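The paging logic itself can be modeled in plain JavaScript, with an array standing in for the sorted, filtered cursor (the ids here are made up):

```javascript
// Model of range-query paging: descending by _id, $lt carries the position
const allIds = [9, 7, 5, 3, 1];  // stand-ins for _id values, sorted descending

function nextBatch(lastId, limit) {
  // null lastId means "start from the top"; otherwise apply the $lt filter
  const remaining = lastId === null ? allIds : allIds.filter(id => id < lastId);
  return remaining.slice(0, limit);
}

const first = nextBatch(null, 2);
const second = nextBatch(first[first.length - 1], 2);
console.log(first, second);
```

Each call does a cheap "filter then take" rather than walking past everything already seen, which is the whole advantage over .skip().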
I have this Document:
from mongoengine import Document, IntField, ListField, ReferenceField

class Store(Document):
    store_id = IntField(required=True)
    items = ListField(ReferenceField(Item, required=True))
    meta = {
        'indexes': [
            {
                'fields': ['store_id'],
                'unique': True
            },
            {
                'fields': ['items']
            }
        ]
    }
I want to set up indexes on items and store_id. Is my configuration right?
Your second index declaration looks like it should do what you want. But to make sure the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find query which should use that index, followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis: the time in milliseconds the query required
indexOnly: (self-explanatory)
n: the number of returned documents
nscannedObjects: the number of objects which had to be examined without using an index. For an index-only query this should be equal to n; when it is higher, some documents could not be excluded by an index and had to be scanned manually.