indexing vs normalization when optimizing for read speed - mongodb

I'm having an architecture discussion with a coworker and we need to find an answer for this. Given a set of millions of data points that look like:
data =
[{
    "v" : 1.44,
    "tags" : {
        "account" : {
            "v" : "1055",
            "name" : "Circle K"
        },
        "region" : "IL-East"
    }
}, {
    "v" : 2.25,
    "tags" : {
        "account" : {
            "v" : "1055",
            "name" : "Circle K"
        },
        "region" : "IL-West"
    }
}]
and that we need to query on the fields in the tags collection (e.g. where account.name == "Circle K"), would there be any speed benefit to normalizing the account field to this:
accounts =
[{
    "_id" : "507f1f77bcf86cd799439011",
    "v" : "1055",
    "name" : "Circle K"
}]
data =
[{
    "v" : 1.44,
    "tags" : {
        "account" : "507f1f77bcf86cd799439011",
        "region" : "IL-East"
    }
}, {
    "v" : 2.25,
    "tags" : {
        "account" : "507f1f77bcf86cd799439011",
        "region" : "IL-West"
    }
}]
I suspect I'll have to build 2 db's for this and just see what the speed looks like. The question is, is mongo better at querying on BSON IDs vs. strings? The db in question will be about 1:10 write vs. read.

The most important thing here is to make sure that you have enough RAM for your working set. That includes the space for the "tags.account.name" index and the expected query result set.
As for key size: you use an ObjectID-as-string above, which you should not do. Leave the real ObjectIDs in, as their size is quite a bit smaller (12 bytes versus 24 for the hex string). If you really have a lot of small documents, then you might even want to think about shortening your field names as well.
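A quick shell sketch of both points, using the collection and field names assumed from the question; the sizes reported by stats() make the ObjectId-vs-string difference easy to measure:
// index the embedded field the queries filter on
db.data.createIndex({ "tags.account.name" : 1 })
// check that the query actually uses the index
db.data.find({ "tags.account.name" : "Circle K" }).explain("executionStats")
// compare footprints: an ObjectId is 12 bytes, its hex-string form is 24,
// so real ObjectIds shrink both the documents and any index built on them
db.data.stats().totalIndexSize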

Related

How to reduce execution time in this mongo db find query?

A sample document looks like this:
{
    "_id" : ObjectId("62317ae9d007af22f984c0b5"),
    "productCategoryName" : "Product category 1",
    "productCategoryDescription" : "Description about product category 1",
    "productCategoryIcon" : "abcd.svg",
    "status" : true,
    "productCategoryUnits" : [
        {
            "unitId" : ObjectId("61fa5c1273a4aae8d89e13c9"),
            "unitName" : "kilogram",
            "unitSymbol" : "kg",
            "_id" : ObjectId("622715a33c8239255df084e4")
        }
    ],
    "productCategorySizes" : [
        {
            "unitId" : ObjectId("61fa5c1273a4aae8d89e13c9"),
            "unitName" : "kilogram",
            "unitSize" : 10,
            "unitSymbol" : "kg",
            "_id" : ObjectId("622715a33c8239255df084e3")
        }
    ],
    "attributes" : [
        {
            "attributeId" : ObjectId("62136ed38a35a8b4e195ccf4"),
            "attributeName" : "Country of Origin",
            "attributeOptions" : [],
            "isRequired" : true,
            "_id" : ObjectId("622715ba3c8239255df084f8")
        }
    ]
}
The collection is indexed on "_id". Without the sub-documents the execution time drops, but all document fields are required.
db.getCollection('product_categories').find({})
The collection contains 30,000 records and this query takes more than 30 seconds to execute. How can I solve this issue? Can anybody suggest a better solution? Thanks.
Indexes (including compound indexes) let MongoDB serve a query from cached index entries instead of scanning every document each time you query. 30,000 documents is nothing for MongoDB; it can handle millions in a second. If these fields are populated during the query, that is another heavy operation on top of it.
Check whether your schema is efficiently structured and whether you are throttling your connection to the server. Another thing to consider is to project only the fields that you require, using the aggregation pipeline (sketched below).
Although the question is not very clear, you can follow this article for some best practices.
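A minimal sketch of the projection idea, keeping the field names from the sample document (which fields you actually need is an assumption here):
db.getCollection('product_categories').aggregate([
    // keep only active categories
    { $match : { "status" : true } },
    // return just the summary fields, dropping the large sub-document arrays
    { $project : { "productCategoryName" : 1, "productCategoryIcon" : 1 } }
])
The same projection works with a plain find({}, { productCategoryName: 1, productCategoryIcon: 1 }) if no other pipeline stages are needed.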

Find by sub-documents field value with case insensitive

I know this is a bit of a newb question, but I'm having a hard time figuring out how to write a query to find some information. I have several documents (or orders), much like the one below, and I am trying to see if there is any athlete with the name I place in my query.
How do I write a query to find all records where the athleteLastName = Doe (without case sensitivity)?
{
    "_id" : ObjectId("57c9c885950f57b535892433"),
    "userId" : "57c9c74a0b61b62f7e071e42",
    "orderId" : "1000DX",
    "updateAt" : ISODate("2016-09-02T18:44:21.656Z"),
    "createAt" : ISODate("2016-09-02T18:44:21.656Z"),
    "paymentsPlan" : [
        {
            "_id" : ObjectId("57c9c885950f57b535892432"),
            "customInfo" : {
                "formData" : {
                    "athleteLastName" : "Doe",
                    "athleteFirstName" : "John",
                    "selectAttribute" : ""
                }
            }
        }
    ]
}
You need to use dot notation to access the embedded documents, and a regex because you want a case-insensitive match.
db.collection.find({'paymentsPlan.customInfo.formData.athleteLastName': /Doe/i})
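Note that /Doe/i matches anywhere in the string ("Doeman" would match too); to match the whole last name case-insensitively, anchor the pattern:
db.collection.find({'paymentsPlan.customInfo.formData.athleteLastName': /^Doe$/i})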

Why does MongoDB take so long to sort a result with only one object in it?

I am optimizing a fairly complicated query of the form:
db.foo.find({
    "$or" : [
        { "bar1" : ObjectId("123123"), "baz" : false },
        { "bar2" : ObjectId("123123"), "baz" : false }
    ],
    "deleted" : false
}).sort("modified_on")
And the sort appears to be killing my performance. I have indexes on the modified_on field, like so:
{
    "v" : 1,
    "key" : {
        "modified_on" : 1
    },
    "ns" : "realtalk.customer_profile",
    "name" : "modified_on_1"
},
{
    "v" : 1,
    "key" : {
        "modified_on" : -1
    },
    "ns" : "realtalk.customer_profile",
    "name" : "modified_on_-1"
}
These do not seem to speed up the query at all. What is really driving me crazy, though, is the size of my result. Only a handful of objects are ever returned by this query, often only one. How can it take Mongo so long to sort one thing, and how can I speed it up?
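One way to see what is happening (a sketch, assuming the db.foo collection and placeholder values from the query above): explain() reveals whether the sort runs in memory, and a compound index per $or branch lets each branch return documents already ordered by modified_on, so no blocking sort is needed.
db.foo.find({
    "$or" : [
        { "bar1" : ObjectId("123123"), "baz" : false },  // placeholder id from the question
        { "bar2" : ObjectId("123123"), "baz" : false }
    ],
    "deleted" : false
}).sort({ "modified_on" : 1 }).explain("executionStats")
// a SORT stage in the plan means the matching documents were fetched first
// and sorted in memory before the first result could be returned
db.foo.createIndex({ "bar1" : 1, "baz" : 1, "deleted" : 1, "modified_on" : 1 })
db.foo.createIndex({ "bar2" : 1, "baz" : 1, "deleted" : 1, "modified_on" : 1 })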

Optimize query performance in MongoDB

I have a collection named App and need to query those active (active: true) apps that belong to a particular user (user_id) or are available to all users (by their _id). I use a query like this:
{
    "active" : true,
    "$or" : [
        {
            "user_id" : "111111111111111111111111"
        },
        {
            "_id" : {
                "$in" : [
                    ObjectId("222222222222222222222222"),
                    ObjectId("333333333333333333333333"),
                    ObjectId("444444444444444444444444")
                ]
            }
        }
    ]
}
However in db.currentOp(true) I see that this query is running very slowly: lockStats.timeLockedMicros.r is about 3000.
How can I optimize performance of this query? I already have the following indexes on App:
> db.App.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "mydb.App"
    },
    {
        "v" : 1,
        "key" : {
            "active" : 1,
            "created_at" : -1
        },
        "name" : "active_1_created_at_-1",
        "ns" : "mydb.App",
        "background" : true
    },
    {
        "v" : 1,
        "key" : {
            "active" : 1,
            "user_id" : 1
        },
        "name" : "active_1_user_id_1",
        "ns" : "mydb.App",
        "background" : true
    }
]
Two issues I see here:
1) You do not need an index on the boolean field active: it has low selectivity, so it does not benefit query performance.
"If overall selectivity is low, and if MongoDB must read a number of documents to return results, then some queries may perform faster without indexes." source
2) You need an index on user_id by itself, because a clause on user_id alone cannot use the compound index active_1_user_id_1 (user_id is not a prefix of that index).
Edit: You can always check index efficiency by running explain(true) and looking at which indexes are used for that query.
I would try the following (see the sketch after this list):
remove all your indexes: the active field has low cardinality (boolean) and does not help you at all, and you are not using created_at, so there is no reason for it
add an index on the user_id key only
store values that are numbers as numbers rather than strings
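A minimal shell sketch of those suggestions (the explain form is an assumption, adjust to your server version):
// drop the custom indexes; the built-in _id index always remains
db.App.dropIndexes()
// index the only extra field the $or actually filters on
db.App.createIndex({ "user_id" : 1 })
// confirm which index each $or branch uses
db.App.find({
    "active" : true,
    "$or" : [
        { "user_id" : "111111111111111111111111" },
        { "_id" : { "$in" : [ ObjectId("222222222222222222222222") ] } }
    ]
}).explain(true)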

Resolving MongoDB DBRef array using Mongo Native Query and working on the resolved documents

My MongoDB database is made up of 2 main collections:
1) Maps
{
    "_id" : ObjectId("542489232436657966204394"),
    "fileName" : "importFile1.json",
    "territories" : [
        {
            "$ref" : "territories",
            "$id" : ObjectId("5424892224366579662042e9")
        },
        {
            "$ref" : "territories",
            "$id" : ObjectId("5424892224366579662042ea")
        }
    ]
},
{
    "_id" : ObjectId("542489262436657966204398"),
    "fileName" : "importFile2.json",
    "territories" : [
        {
            "$ref" : "territories",
            "$id" : ObjectId("542489232436657966204395")
        }
    ],
    "uploadDate" : ISODate("2012-08-22T09:06:40.000Z")
}
2) Territories, which are referenced in "Map" objects:
{
    "_id" : ObjectId("5424892224366579662042e9"),
    "name" : "Afghanistan",
    "area" : 653958
},
{
    "_id" : ObjectId("5424892224366579662042ea"),
    "name" : "Angola",
    "area" : 1252651
},
{
    "_id" : ObjectId("542489232436657966204395"),
    "name" : "Unknown",
    "area" : 0
}
My objective is to list every map with its cumulative area and number of territories. I am trying the following query:
db.maps.aggregate(
    { '$unwind' : '$territories' },
    { '$group' : {
        '_id' : '$fileName',
        'numberOf' : { '$sum' : '$territories.name' },
        'locatedArea' : { '$sum' : '$territories.area' }
    } }
)
However the results show 0 for each of these values :
{
    "result" : [
        {
            "_id" : "importFile2.json",
            "numberOf" : 0,
            "locatedArea" : 0
        },
        {
            "_id" : "importFile1.json",
            "numberOf" : 0,
            "locatedArea" : 0
        }
    ],
    "ok" : 1
}
I probably did something wrong when trying to access the member variables of Territory (name and area), but I couldn't find an example of such a case in the Mongo docs. area is stored as an integer, and name as a string.
Yes indeed: the field "territories" holds an array of database references, not the actual documents. DBRefs are objects that contain information with which we can locate the actual documents.
You can see this clearly in the example above; run the following mongo query:
db.maps.find({ "_id" : ObjectId("542489232436657966204394") }).forEach(function(doc) { print(doc.territories[0]); })
It will print the DBRef object rather than the document itself:
o/p: DBRef("territories", ObjectId("5424892224366579662042e9"))
So '$sum' : '$territories.name' and '$sum' : '$territories.area' show you 0, since the DBRef objects have no fields named name or area.
You need to resolve each reference to its document before doing something like $territories.name.
To achieve what you want, you can make use of the map() function, since neither aggregation nor map-reduce supports sub-queries, and you already have a self-contained map document with references to its territories.
Steps to achieve it:
a) get each map
b) resolve the `DBRef`
c) calculate the total area and the number of territories
d) build and return the desired structure
Mongo shell script:
db.maps.find().map(function(doc) {
    var refName;
    var territory_refs = doc.territories.map(function(terr_ref) {
        refName = terr_ref.$ref;    // the referenced collection, here "territories"
        return terr_ref.$id;
    });
    var areaSum = 0;
    db[refName].find({
        "_id" : {
            $in : territory_refs
        }
    }).forEach(function(i) {
        areaSum += i.area;
    });
    return {
        "id" : doc.fileName,
        "noOfTerritories" : territory_refs.length,
        "areaSum" : areaSum
    };
})
o/p:
[
    {
        "id" : "importFile1.json",
        "noOfTerritories" : 2,
        "areaSum" : 1906609
    },
    {
        "id" : "importFile2.json",
        "noOfTerritories" : 1,
        "areaSum" : 0
    }
]
Map-reduce functions should not be, and cannot be, used to resolve DBRefs on the server side.
See what the documentation has to say:
The map function should not access the database for any reason.
The map function should be pure, or have no impact outside of the function (i.e. side effects).
The reduce function should not access the database, even to perform read operations. The reduce function should not affect the outside system.
Moreover, a reduce function, even if used (and it cannot work here anyway), would never be called for your problem, since in your dataset a group keyed on "fileName" or "ObjectId" always contains only one document.
MongoDB will not call the reduce function for a key that has only a
single value
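A minimal sketch of that last point, under the assumption that each fileName is unique in the maps collection: every key is emitted exactly once, so the reduce function is never invoked.
db.maps.mapReduce(
    function() { emit(this.fileName, 1); },               // one emit per unique fileName
    function(key, values) { return Array.sum(values); },  // never called for single-value keys
    { out : { inline : 1 } }
)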