MongoDB design for fast queries on ranges

Currently, I have a MongoDB collection with documents of the following type (~1000 docs):
{ _id : "...",
id : "0000001",
gaps : [{start:100, end:110}, {start:132, end:166}...], // up to ~1k elems
bounds: [68, 88, 126, 228...], // up to 100 elems
length: 300,
seq : "ACCGACCCGAGGACCCCTGAGATG..."
}
"gaps" and "bounds" values in the array refer to coordinates in "seq" and "length" refers to "seq" length.
I have defined indexes on "id", "gaps" and "bounds".
I need to query based on coordinate ranges. For example, given "from=100" and "to=200" I want to retrieve for each document a sub-array of "gaps", a sub-array of "bounds", and the substring of "seq" that lie inside the range (between 100 and 200 in this case).
Right now, this is done using the following aggregation pipeline:
db.annot.aggregate([
    {$match : {"id" : "0000001"}},
    {$unwind : "$gaps"},
    {$unwind : "$bounds"},
    {$match : {
        $or : [
            {"gaps.start" : {$gte: from, $lte: to}},
            {"gaps.end"   : {$gte: from, $lte: to}},
            {"bounds"     : {$gte: from, $lte: to}}
        ]
    }},
    {$project : {
        id     : 1,
        gaps   : 1,
        bounds : 1,
        subseq : {$substr: ["$seq", from, (to - from)]}
    }},
    {$group : {
        _id    : "$id",
        gaps   : {"$addToSet" : "$gaps"},
        bounds : {"$addToSet" : "$bounds"},
        subseq : {"$first" : "$subseq"}
    }}
])
Querying the whole collection (leaving out the first "$match" in the pipeline) takes ~14 seconds.
Querying individually all the documents sequentially takes at most 50msec each (~19 secs in total).
Querying individually all the documents in parallel takes in total ~7 secs.
Querying with a pipeline that only matches the id (ie, using the first "$match" in the pipeline) takes ~5 secs in total.
What would be the best db and query design to maximize the performance of this kind of queries?

What would be the best db and query design to maximize the performance of this kind of queries?
Since you ask about improving your code and design, I suggest you switch to the latest version of MongoDB if you have not already; that should be a good starting point. For this type of problem, the basic idea is to reduce the number of documents being input to each pipeline stage.
I suggest adding an additional variable named range which contains all the numbers between from and to, inclusive of both. This allows us to apply operators like $setIntersection on the bounds array.
So the variables the aggregate operation needs from the environment are:
var from = x; var to = y; var range=[x,...,y];
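For example, the range array can be built in the mongo shell like this (a minimal sketch; materialising the full array is only practical for modest from/to spans):
// assuming from and to are already set, e.g. from = 100 and to = 200
var range = [];
for (var i = from; i <= to; i++) {
    range.push(i); // every coordinate in [from, to], later consumed by $setIntersection
}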
The first step is to match only the documents that have the given id and whose gaps sub-documents and bounds values fall in our range. This reduces the number of documents being input to the next stage to, say, 500.
The next step is to $redact the non-conforming gaps sub-documents. This stage now works on the 500 documents filtered in the previous step.
The third step is to $project our fields as needed.
Notice that we have not needed to use the $unwind operator anywhere to achieve the task.
db.collection.aggregate([
    // stage 1: index-assisted filter on whole documents
    {$match : {"id" : "0000001",
               "gaps.start" : {$gte: from},
               "gaps.end"   : {$lte: to},
               "bounds"     : {$gte: from, $lte: to}}},
    // stage 2: prune gaps sub-documents whose start/end fall outside [from, to];
    // at the root level start/end are missing, so $ifNull keeps the document and descends
    {$redact : {
        $cond : [
            {$and : [
                {$gte : [{$ifNull : ["$start", from]}, from]},
                {$lte : [{$ifNull : ["$end", to]}, to]}
            ]},
            "$$DESCEND", "$$PRUNE"
        ]
    }},
    // stage 3: shape the output; bounds is reduced via set intersection with the range array
    {$project : {
        "bounds" : {$setIntersection : [range, "$bounds"]},
        "id"     : 1,
        "gaps"   : 1,
        "length" : 1,
        "subseq" : {$substr : ["$seq", from, (to - from)]}
    }}
])
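As a side note (my own addition, not part of the original answer): on MongoDB 3.2+ the same per-array filtering can also be expressed with $filter inside $project, which avoids building the range array and skips both $unwind and $redact:
db.collection.aggregate([
    {$match : {"id" : "0000001"}},
    {$project : {
        id : 1,
        length : 1,
        // keep only the gap sub-documents that fall inside [from, to]
        gaps : {$filter : {
            input : "$gaps",
            as : "g",
            cond : {$and : [{$gte : ["$$g.start", from]}, {$lte : ["$$g.end", to]}]}
        }},
        // keep only the bound coordinates that fall inside [from, to]
        bounds : {$filter : {
            input : "$bounds",
            as : "b",
            cond : {$and : [{$gte : ["$$b", from]}, {$lte : ["$$b", to]}]}
        }},
        subseq : {$substr : ["$seq", from, (to - from)]}
    }}
])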

Related

Mongodb: Query on the last N documents(some portion) of a collection only

In my collection that has, say, 100 documents, I want to run the following query:
collection.find({"$text" : {"$search" : "some_string"}})
Assume that a suitable "text" index already exists, and thus my question is: how can I run this query on the last 'n' documents only?
All the questions that I found on the web ask how to get the last n docs, whereas my question is how to search on the last n docs only.
More generally, my question is: how can I run a mongo query on some portion, say 20%, of a collection?
What I tried
I'm using pymongo, so I tried to use skip() and limit() to get the last n documents, but I didn't find a way to run a query on the cursor that those functions return.
After #hhsarh's answer, here's what I tried, to no avail:
# here's what I tried after initial answers
recents = information_collection.aggregate([
{"$match" : {"$text" : {"$search" : "healthline"}}},
{"$sort" : {"_id" : -1}},
{"$limit" : 1},
])
The result is still coming from the whole collection instead of just the last record/document as the above code attempts.
The last document doesn't contain "healthline" in any field, therefore the intended result of the query should be empty []. But I get a document.
Can someone please tell me how this is possible?
What you are looking for can be achieved using MongoDB Aggregation
Note: As pointed out by #turivishal, $text won't work if it is not in the first stage of the aggregation pipeline.
collection.aggregate([
{
"$sort": {
"_id": -1
}
},
{
"$limit": 10 // `n` value, where n is the number of last records you want to consider
},
{
"$match" : {
// All your find query goes here
}
},
], {allowDiskUse: true}) // just in case the computation exceeds 100MB
Since _id is indexed by default, the above aggregation query should be fast. But its performance degrades as the n value grows.
Note: Replace the last line in the code example with the below line if you are using pymongo
], allowDiskUse=True)
It is not possible with the $text operator, because there is a restriction:
The $match stage that includes a $text must be the first stage in the pipeline
This means we can't limit documents before the $text operator; read more about the $text operator restrictions.
A second option: this might be possible if you use the $regex regular expression operator instead of the $text operator for searching.
If you need to search the same way the $text operator does, you have to modify your search input as below:
let's assume searchInput is your input variable
the list of search fields is in searchFields
split the search input string on spaces, loop over the resulting words, and convert each word into a regular expression
loop over the search fields in searchFields and prepare an $in condition for each
import re

searchInput = "This is search"
searchFields = ["field1", "field2"]
searchRegex = []
searchPayload = []
for s in searchInput.split(): searchRegex.append(re.compile(s, re.IGNORECASE))
for f in searchFields: searchPayload.append({ f: { "$in": searchRegex } })
print(searchPayload)
Now your searchPayload would look like this:
[
{'field1': {'$in': [/This/i, /is/i, /search/i]}},
{'field2': {'$in': [/This/i, /is/i, /search/i]}}
]
Use that searchPayload variable with the $or operator in the search query at the last stage:
recents = information_collection.aggregate([
# 1 = ascending, -1 descending you can use anyone as per your requirement
{ "$sort": { "_id": 1 } },
# use any limit of number as per your requirement
{ "$limit": 10 },
{ "$match": { "$or": searchPayload } }
])
print(list(recents))
Note: The $regex regular expression search will cause performance issues.
To improve search performance you can create a compound index on your search fields like,
information_collection.createIndex({ field1: 1, field2: 1 });

How to get a small amount of data rather than the complete data in MongoDB

Let's assume I have a collection of 100000 entries.
So what is the approach to get only 50 documents each time rather than 100000? Because fetching the whole dataset is foolishness.
My Dataset is kind of this type:
{
"_id" : ObjectId("5a2e282417d0b91708fa83b5"),
"post" : "Hello world",
"createdate" : ISODate("2017-12-11T06:39:32.035Z"),
"__v" : 0
}
What are the techniques I have to append to my query?
// What filter do I have to add?
db.collection.find({}).sort({'createdate': 1}).exec(function(err, data){
console.log(data);
});
db.collection.find({}).sort({'createdate': 1}).skip(0).limit(50).exec(function(err, data){
console.log(data);
});
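As a side note (my addition, not part of the original answer): if the collection is large, the sort on createdate benefits from an index on that field:
db.collection.createIndex({ createdate: 1 })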
There are two more ways to do pagination:
one is the mongoose-paginate npm module: https://www.npmjs.com/package/mongoose-paginate (a usage sketch follows after the examples below)
second is an aggregation pipeline with the $skip and $limit stages
eg:
//from 1 to 50 records
db.col.aggregate([{$match:{}},{$sort:{_id:-1}},{$skip:0},{$limit:50}]);
//from 51 to 100 records
db.col.aggregate([{$match:{}},{$sort:{_id:-1}},{$skip:50},{$limit:50}]);
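A rough sketch of the mongoose-paginate usage mentioned above (my addition; the page/limit/sort options follow the module's documented API, but check its README for the exact shape of the result object):
var mongoose = require('mongoose');
var mongoosePaginate = require('mongoose-paginate');

var postSchema = new mongoose.Schema({ post: String, createdate: Date });
postSchema.plugin(mongoosePaginate); // adds a paginate() method to the model
var Post = mongoose.model('Post', postSchema);

// page 1 of 50 documents, newest first
Post.paginate({}, { sort: { createdate: -1 }, limit: 50, page: 1 }, function(err, result) {
    console.log(result.docs); // the 50 documents; result.total and result.pages are also available
});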
First, we have to sort the data and then apply limit and skip.
db.collection.aggregate([{"$sort": {f2: -1}}, {$limit : 2}, { $skip : 5 }]);
Using limit with find:
db.collection.find().limit(3)
Using limit with aggregate:
db.collection.aggregate({$limit : 2})
Usually aggregate is used when we need the pipelined output, for example when we need to have limit and sort together:
// sorting happens only on the pipelined output from limit.
db.collection.aggregate([{$limit : 50},{"$sort": {_id: -1}}]);
// cursor methods - sorting happens on the entire result set even though it comes last.
db.collection.find().limit(50).sort({_id:-1});
The same with skip added to get an offset:
db.collection.aggregate([{$limit : 50},{ $skip : 50 },{"$sort": {_id: -1}}]);
db.collection.find().skip(50).limit(50).sort({_id:-1});

What does $sum:1 mean in Mongo

I have a collection foo:
{ "_id" : ObjectId("5837199bcabfd020514c0bae"), "x" : 1 }
{ "_id" : ObjectId("583719a1cabfd020514c0baf"), "x" : 3 }
{ "_id" : ObjectId("583719a6cabfd020514c0bb0") }
I use this query:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:1}}})
Then I get a result:
{ "_id" : 1, "avg" : 2, "sum" : 3 }
What does {$sum:1} mean in this query?
From the official docs:
When used in the $group stage, $sum has the following syntax and returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key:
{ $sum: <expression> }
Since in your example the expression is 1, it will aggregate a value of one for each document in the group, thus yielding the total number of documents per group.
Basically it will add up the value of the expression for each document. In this case, since the number of documents is 3, it will be 1+1+1 = 3. For more details please check the MongoDB documentation: https://docs.mongodb.com/v3.2/reference/operator/aggregation/sum/
For example if the query was:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:"$x"}}})
then the sum value would be 1+3 = 4 (the document without an "x" field contributes nothing to the sum).
I'm not sure what MongoDB version was around 6 years ago or whether it had all these goodies, but it seems to stand to reason that {$sum:1} is nothing but a hack for {$count:{}}.
In fact, $sum here is more expensive than $count, as it is performed as an extra computation, whereas $count is closer to the engine. And even if you don't give much stock to performance, think of why you're even asking: because it is a less-than-obvious hack.
My option would be:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$count:{}}}})
I just tried this on Mongo 5.0.14 and it runs fine.
The good old "Just because you can, doesn't mean you should." is still a thing, no?

How to incorporate $lt and $divide in mongodb

I am a newbie in MongoDB, but I am trying to write a query that identifies whether any of my fields meet the requirements.
Consider the following:
I have a collection where EACH document is formatted as:
{
"nutrition" : [
{
"name" : "Energy",
"unit" : "kcal",
"value" : 150.25,
"_id" : ObjectId("fdsfdslkfjsdf")
},
{---then there's more in the array---}
],
"serving" : 4,
"id": "Food 1"
}
My current code looks something like this:
db.recipe.find(
{"nutrition": {$elemMatch: {"name": "Energy", "unit": "kcal", "value": {$lt: 300}}}},
{"id":1, _id:0}
)
Under the nutrition array, there's an element whose name is Energy and whose value is a number. The query checks if that value is less than 300 and outputs all the documents that meet this requirement (I made it output only the field called id).
Now my question is the following:
1) For each document, I have another field called "serving" and I am supposed to find out if "value"/"serving" is still less than 300. (As in divide value by serving and see if it's still less than 300)
2) Since I am using .find, I am guessing I can't use the $divide operator from aggregation?
3) I have been trying to play around with aggregation operators like $divide + $cond, but no luck so far.
4) Normally in other languages, I would just create a variable a = value/serving, then run it through an if statement to check if it's less than 300, but I am not sure if that's possible in MongoDB.
Thank you.
In case anyone is struggling with a similar problem, I figured out how to do this.
db.database.aggregate([
    {$unwind: "$nutrition"}, // breaks open the nutrition array: one document per array element
    {$match: {"nutrition.name": "Energy", "nutrition.unit": "kcal"}}, // keeps only the Energy (kcal) entries
    {$project: {
        "Calories per serving": {$divide: ["$nutrition.value", "$serving"]}, // the math evaluation: value divided by serving
        "id": 1, _id: 0}}, // show only the food id and the computed value
    {$match: {"Calories per serving": {$lt: 300}}} // filters out any documents whose calories per serving are 300 or more
])
So basically, you open the array and filter out any sub-fields you don't want in the document, then display it using $project along with whatever math evaluation needs to be done. Then you filter on whatever condition you had, which for me was that I don't want to see any foods with more than 300 calories per serving.
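As a side note (my own addition, assuming MongoDB 3.6+ and that every nutrition element has both a name and a value): the same check can also be done directly in find() with $expr, which allows aggregation operators such as $divide inside a query:
db.recipe.find({
    // make sure an Energy (kcal) entry exists at all
    "nutrition": { $elemMatch: { "name": "Energy", "unit": "kcal" } },
    // divide that entry's value by "serving" and compare
    $expr: {
        $lt: [
            { $divide: [
                { $arrayElemAt: [
                    "$nutrition.value",
                    { $indexOfArray: ["$nutrition.name", "Energy"] } // position of the Energy entry
                ] },
                "$serving"
            ] },
            300
        ]
    }
}, { "id": 1, "_id": 0 })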

MongoDB: Should You Pre-Allocate a Document if Using $addToSet or $push?

I've been studying up on MongoDB and I understand that it is highly recommended that document structures be completely built out (pre-allocated) at the point of insert, so that future changes to that document do not require the document to be moved around on disk. Does this apply when using $addToSet or $push?
For example, say I have the following document:
"_id" : "rsMH4GxtduZZfxQrC",
"createdAt" : ISODate("2015-03-01T12:08:23.007Z"),
"market" : "LTC_CNY",
"type" : "recentTrades",
"data" : [
{
"date" : "1422168530",
"price" : 13.8,
"amount" : 0.203,
"tid" : "2435402",
"type" : "buy"
},
{
"date" : "1422168529",
"price" : 13.8,
"amount" : 0.594,
"tid" : "2435401",
"type" : "buy"
},
{
"date" : "1422168529",
"price" : 13.79,
"amount" : 0.594,
"tid" : "2435400",
"type" : "buy"
}
]
}
And I am using one of the following commands to add a new array of objects (newData) to the data field:
$addToSet to add to the end of the array:
Collection.update(
{ _id: 'rsMH4GxtduZZfxQrC' },
{
$addToSet: {
data: {
$each: newData
}
}
}
);
$push (with $position) to add to the front of the array:
Collection.update(
{ _id: 'rsMH4GxtduZZfxQrC' },
{
$push: {
data: {
$each: newData,
$position: 0
}
}
}
);
The data array in the document will grow due to new objects that were added from newData. So will this type of document update cause the document to be moved around on the disk?
For this particular system, the data array in these documents can grow to upwards of 75k objects within, so if these documents are indeed being moved around on disk after every $addToSet or $push update, should the document be defined with 75k nulls (data: [null,null...null]) on insert, and then perhaps use $set to replace the values over time? Thanks!
I understand that it is highly recommended that documents structures are completely built-out (pre-allocated) at the point of insert, this way future changes to that document do not require the document to be moved around on the disk. Does this apply when using $addToSet or $push?
It's recommended if it's feasible for the use case, which it usually isn't. Time series data is a notable exception. It doesn't really apply with $addToSet and $push because they tend to increase the size of the document by growing an array.
the data array in these documents can grow to upwards of 75k objects within
Stop. Are you sure you want constantly growing arrays with tens of thousands of entries? Are you going to query wanting specific entries back? Are you going to index any fields in the array entries? You probably want to rethink your document structure. Maybe you want each data entry to be a separate document with fields like market, type, createdAt replicated in each? You wouldn't be worrying about document moves.
Why will the array grow to 75K entries? Can you do less entries per document? Is this time series data? It's great to be able to preallocate documents and do in-place updates with the mmap storage engine, but it's not feasible for every use case and it's not a requirement for MongoDB to perform well.
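For illustration, a one-trade-per-document layout along those lines might look like the sketch below (the trades collection name and the renamed source field are my own hypothetical choices, not from the original post):
// one trade per document; the parent's fields are simply repeated in each
db.trades.insertOne({
    market    : "LTC_CNY",
    source    : "recentTrades", // the old top-level "type", renamed so it doesn't clash with the trade's own "type"
    createdAt : ISODate("2015-03-01T12:08:23.007Z"),
    date      : "1422168530",
    price     : 13.8,
    amount    : 0.203,
    tid       : "2435402",
    type      : "buy"
})
// query and index specific trades directly instead of scanning a 75k-element array
db.trades.createIndex({ market: 1, date: -1 })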
should the document be defined with 75k nulls (data: [null,null...null]) on insert, and then perhaps use $set to replace the values over time?
No, this is not really helpful. The document size will be computed based on the BSON size of the null values in the array, so when you replace null with another type the size will increase and you'll get document rewrites anyway. You would need to preallocate the array with objects with all fields set to a default value for its type, e.g.
{
"date" : ISODate("1970-01-01T00:00:00Z") // use a date type instead of a string date
"price" : 0,
"amount" : 0,
"tid" : "000000", // assuming 7 character code - strings icky for default preallocation
"type" : "none" // assuming it's "buy" or "sell", want a default as long as longest real values
}
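A rough sketch of how one of those preallocated slots would later be filled in place (my addition; the positional path data.0 is illustrative, and this only avoids a document move as long as the real value is no larger than the preallocated default):
Collection.update(
    { _id: 'rsMH4GxtduZZfxQrC' },
    { $set: {
        "data.0": { // overwrite the first preallocated slot
            date   : new Date(1422168530 * 1000), // store the epoch seconds as a real Date
            price  : 13.8,
            amount : 0.203,
            tid    : "2435402",
            type   : "buy"
        }
    } }
);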
MongoDB uses the power-of-two allocation strategy to store your documents, which means it rounds the record size up to the next power of two (for example, a 700-byte document gets a 1024-byte record). Therefore, if your nested arrays don't lead to total growth beyond that allocated power-of-two size, mongo will not have to reallocate the document.
See: http://docs.mongodb.org/manual/core/storage/
Bottom line here is that any "document growth" is pretty much always going to result in the "physical move" of the storage allocation unless you have "pre-allocated" by some means on the original document submission. Yes there is "power of two" allocation, but this does not always mean anything valid to your storage case.
The additional "catch" here is on "capped collections", where indeed the "hidden catch" is that such "pre-allocation" methods are likely not to be "replicated" to other members in a replica set if those instructions fall outside of the "oplog" period where the replica set entries are applied.
Growing any structure beyond what is allocated from an "initial allocation" or the general tricks that can be applied will result in that document being "moved" in storage space when it grows beyond the space it was originally supplied with.
In order to ensure this does not happen, then you always "pre-allocate" to the expected provisions of your data on the original creation. And with the obvious caveat of the condition already described.