I think I have a pretty complex one here - not sure if I can do this or not.
I have data that has an address and a data field. The data field is a hex value. I would like to run an aggregation that groups the data by address and then by the length of the hex data. All of the data comes in as 16 characters long, but the length of that data should be calculated in bytes.
I think I have to take the data, strip the trailing 00's (using the regex 00+$), and divide the length of what remains by 2 to get the length in bytes. After that, I would have to group by address and final byte length.
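In Python terms, the per-document computation I'm after is roughly this (just to illustrate the idea):
import re

data = "4100004822000000"
stripped = re.sub(r"00+$", "", data)  # strip the trailing 00's -> '4100004822'
length_bytes = len(stripped) // 2     # two hex chars per byte -> 5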
An example dataset would be:
{addr:829, data:'4100004822000000'}
{addr:829, data:'4100004813000000'}
{addr:829, data:'4100004804000000'}
{addr:506, data:'0000108000000005'}
{addr:506, data:'0000108000000032'}
{addr:229, data:'0065005500000000'}
And my desired output would be:
{addr:829, length:5}
{addr:506, length:8}
{addr:229, length:4}
Is this even possible in an aggregation query without having to use external code?
This is not too complicated if your "data" values are in fact strings, as shown in your sample data. Assuming data exists and is set to something (you can add error checking as needed), you can get the result you want like this:
db.coll.aggregate([
    {$addFields: {lastNonZero: {$add: [2, {$reduce: {
        initialValue: -2,
        input: {$range: [0, {$strLenCP: "$data"}, 2]},
        in: {$cond: {
            if: {$eq: ["00", {$substr: ["$data", "$$this", 2]}]},
            then: "$$value",
            else: "$$this"
        }}
    }}]}}},
    {$group: {_id: {
        addr: "$addr",
        length: {$divide: ["$lastNonZero", 2]}
    }}}
])
I used two stages, but of course they could be combined into a single $group if you wish. In the $reduce I step through data two characters at a time, checking whether they equal "00". Every time they do not, I update the value to the current position in the sequence. Since that yields the position of the last non-"00" pair, we add 2 to it to find where the trailing run of zeros starts, and later in $group we divide that by 2 to get the length in bytes.
On your sample data, this returns:
{ "_id" : { "addr" : 229, "length" : 4 } }
{ "_id" : { "addr" : 506, "length" : 8 } }
{ "_id" : { "addr" : 829, "length" : 5 } }
You can add a $project stage to transform the field names into ones you want returned.
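For instance, a pymongo rendering of the whole pipeline with such a $project tacked on might look like this (a sketch; the connection details and the output field names addr/length are my assumptions):
from pymongo import MongoClient

coll = MongoClient()["test"]["coll"]  # connection details are assumptions
results = coll.aggregate([
    {"$addFields": {"lastNonZero": {"$add": [2, {"$reduce": {
        "initialValue": -2,
        "input": {"$range": [0, {"$strLenCP": "$data"}, 2]},
        "in": {"$cond": {
            "if": {"$eq": ["00", {"$substr": ["$data", "$$this", 2]}]},
            "then": "$$value",
            "else": "$$this",
        }},
    }}]}}},
    {"$group": {"_id": {
        "addr": "$addr",
        "length": {"$divide": ["$lastNonZero", 2]},
    }}},
    # flatten the group key back into top-level fields
    {"$project": {"_id": 0, "addr": "$_id.addr", "length": "$_id.length"}},
])
for doc in results:
    print(doc)  # e.g. {'addr': 829, 'length': 5.0}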
Related
In my collection that has say 100 documents, I want to run the following query:
collection.find({"$text" : {"$search" : "some_string"})
Assume that a suitable "text" index already exists and thus my question is : How can I run this query on the last 'n' documents only?
All the questions that I found on the web ask how to get the last n docs, whereas my question is how to search on the last n docs only.
More generally: how can I run a mongo query on some portion, say 20%, of a collection?
What I tried
I'm using pymongo, so I tried to use skip() and limit() to get the last n documents, but I didn't find a way to run a query on the cursor those functions return.
After #hhsarh's answer, here's what I tried, to no avail:
# here's what I tried after initial answers
recents = information_collection.aggregate([
{"$match" : {"$text" : {"$search" : "healthline"}}},
{"$sort" : {"_id" : -1}},
{"$limit" : 1},
])
The result still comes from the whole collection instead of just the last record/document as the above code attempts.
The last document doesn't contain "healthline" in any field, so the intended result of the query should be empty ([]), but I get a document back.
Can someone please explain how this is possible?
What you are looking for can be achieved using MongoDB Aggregation
Note: As pointed out by #turivishal, $text won't work if it is not in the first stage of the aggregation pipeline.
collection.aggregate([
{
"$sort": {
"_id": -1
}
},
{
"$limit": 10 // `n` value, where n is the number of last records you want to consider
},
{
"$match" : {
// All your find query goes here
}
},
], {allowDiskUse: true}) // just in case the computation exceeds 100MB
Since _id is indexed by default, the above aggregation query should be fast. But its performance degrades as the n value grows.
Note: Replace the last line in the code example with the below line if you are using pymongo
], allowDiskUse=True)
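Put together, a pymongo version might look like this (a sketch; the final $match is an ordinary field query because, per the note below, $text cannot appear at this stage of the pipeline, and the field name "source" is purely illustrative):
recents = information_collection.aggregate([
    {"$sort": {"_id": -1}},          # newest first
    {"$limit": 10},                  # n = 10: only the 10 most recent documents
    {"$match": {"source": "healthline"}},  # any non-$text query works here
], allowDiskUse=True)
print(list(recents))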
It is not possible with the $text operator, because of the following restriction:
The $match stage that includes a $text must be the first stage in the pipeline
This means we can't limit documents before the $text operator; read more about the $text operator's restrictions.
As a second option, this is possible if you use the $regex regular expression operator instead of the $text operator for searching.
If you need to search the same way the $text operator does, you have to modify your search input as below.
Let's assume searchInput is your input variable and searchFields is your list of search fields:
split the search input string on spaces, loop over the resulting words, and convert each to a regular expression
loop over the fields in searchFields and prepare an $in condition for each
searchInput = "This is search"
searchFields = ["field1", "field2"]
searchRegex = []
searchPayload = []
for s in searchInput.split(): searchRegex.append(re.compile(s, re.IGNORECASE));
for f in searchFields: searchPayload.append({ f: { "$in": searchRegex } })
print(searchPayload)
Conceptually, the payload now looks like this (shown in shell regex notation):
[
{'field1': {'$in': [/This/i, /is/i, /search/i]}},
{'field2': {'$in': [/This/i, /is/i, /search/i]}}
]
Use that searchPayload variable with the $or operator in the $match stage at the end of the pipeline:
recents = information_collection.aggregate([
# 1 = ascending, -1 = descending; use whichever your requirement calls for
{ "$sort": { "_id": 1 } },
# use any number as the limit, per your requirement
{ "$limit": 10 },
{ "$match": { "$or": searchPayload } }
])
print(list(recents))
Note: The $regex regular expression search will cause performance issues.
To improve search performance, you can create a compound index on your search fields, like:
information_collection.createIndex({ field1: 1, field2: 1 });
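Since the surrounding code here is pymongo, the equivalent call there would be:
# pymongo equivalent of the shell createIndex call above
information_collection.create_index([("field1", 1), ("field2", 1)])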
I have a collection foo:
{ "_id" : ObjectId("5837199bcabfd020514c0bae"), "x" : 1 }
{ "_id" : ObjectId("583719a1cabfd020514c0baf"), "x" : 3 }
{ "_id" : ObjectId("583719a6cabfd020514c0bb0") }
I use this query:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:1}}})
Then I get a result:
{ "_id" : 1, "avg" : 2, "sum" : 3 }
What does {$sum:1} mean in this query?
From the official docs:
When used in the $group stage, $sum has the following syntax and returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key:
{ $sum: <expression> }
Since in your example the expression is 1, it will aggregate a value of one for each document in the group, thus yielding the total number of documents per group.
Basically it will add up the value of the expression for each document. In this case, since the number of documents is 3, it will be 1+1+1 = 3. For more details, see the MongoDB documentation: https://docs.mongodb.com/v3.2/reference/operator/aggregation/sum/
For example, if the query was:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:"$x"}}})
then the sum value would be 1+3 = 4 (the document without an x field contributes nothing, since $sum ignores missing and non-numeric values).
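A quick pymongo check of that corrected pipeline (a sketch; it assumes a local mongod with the three sample documents already inserted into foo, and the database name is an assumption):
from pymongo import MongoClient

db = MongoClient()["test"]
result = list(db.foo.aggregate([
    {"$group": {"_id": 1, "avg": {"$avg": "$x"}, "sum": {"$sum": "$x"}}}
]))
print(result)  # [{'_id': 1, 'avg': 2.0, 'sum': 4}]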
I'm not sure what MongoDB version was current 6 years ago or whether it had all these goodies, but it seems to stand to reason that {$sum:1} is nothing but a hack for {$count:{}}.
In fact, $sum here is more expensive than $count, as it is performed as an extra computation, whereas $count is closer to the engine. And even if you don't give much stock to performance, think of why you're even asking: because it is a less-than-obvious hack.
My option would be:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$count:{}}}})
I just tried this on Mongo 5.0.14 and it runs fine.
The good old "Just because you can, doesn't mean you should." is still a thing, no?
I am a newbie in MongoDB but I am trying to query to identify if any of my field meets the requirements.
Consider the following:
I have a collection where EACH document is formatted as:
{
    "nutrition" : [
        {
            "name" : "Energy",
            "unit" : "kcal",
            "value" : 150.25,
            "_id" : ObjectId("fdsfdslkfjsdf")
        },
        {---then there's more in the array---}
    ],
    "serving" : 4,
    "id" : "Food 1"
}
My current code looks something like this:
db.recipe.find(
{"nutrition": {$elemMatch: {"name": "Energy", "unit": "kcal", "value": {$lt: 300}}}},
{"id":1, _id:0}
)
Under the nutrition array there's an element whose name is Energy, with its value being a number. The query checks whether that value is less than 300 and outputs all the documents that meet this requirement (I made it output only the id field).
Now my question is the following:
1) For each document, I have another field called "serving" and I am supposed to find out if "value"/"serving" is still less than 300 (as in, divide value by serving and see if it's still less than 300).
2) Since I am using .find, I am guessing I can't use the $divide operator from aggregation?
3) I've been trying to play around with aggregation operators like $divide + $cond, but no luck so far.
4) Normally in other languages, I would just create a variable a = value/serving, then run it through an if statement to check whether it's less than 300, but I am not sure if that's possible in MongoDB.
Thank you.
In case anyone is struggling with a similar problem, I figured out how to do this.
db.recipe.aggregate([
    {$unwind: "$nutrition"}, // breaks open the nutrition array: one document per element
    {$match: {"nutrition.name": "Energy", "nutrition.unit": "kcal"}}, // keeps only the Energy/kcal elements
    {$project: {"Calories per serving": {$divide: ["$nutrition.value", "$serving"]}, // compute kcal per serving
        "id": 1, _id: 0}}, // show only the food id alongside the computed value
    {$match: {"Calories per serving": {$lt: 300}}} // drop anything at 300 kcal per serving or above
])
So basically, you unwind the array and filter out any elements you don't want, then use $project to display the fields you need along with any math that needs doing. Finally you filter on your condition, which for me was that I don't want to see any foods with 300 or more calories per serving.
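The same pipeline in pymongo might look like this (a sketch, using the collection and field names from the question; connection details are assumptions):
from pymongo import MongoClient

db = MongoClient()["test"]
results = db.recipe.aggregate([
    {"$unwind": "$nutrition"},
    {"$match": {"nutrition.name": "Energy", "nutrition.unit": "kcal"}},
    {"$project": {"_id": 0, "id": 1,
                  "Calories per serving": {"$divide": ["$nutrition.value", "$serving"]}}},
    {"$match": {"Calories per serving": {"$lt": 300}}},
])
print(list(results))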
Currently, I have a mongoDb collection with documents of the following type (~1000 docs):
{ _id : "...",
id : "0000001",
gaps : [{start:100, end:110}, {start:132, end:166}...], // up to ~1k elems
bounds: [68, 88, 126, 228...], // up to 100 elems
length: 300,
seq : "ACCGACCCGAGGACCCCTGAGATG..."
}
"gaps" and "bounds" values in the array refer to coordinates in "seq" and "length" refers to "seq" length.
I have defined indexes on "id", "gaps" and "bounds".
I need to query based on coordinates ranges. For example given "from=100" and "to=200" I want to retrieve for each document a sub-array of "gaps", a sub-array of "bounds" and the "seq" substring that lay inside the range (between 100 and 200 in this case).
Right now, this is done using the following aggregation pipeline:
db.annot.aggregate([
    {$match : {"id" : "000001"}},
    {$unwind : "$gaps"},
    {$unwind : "$bounds"},
    {$match: {
        $or : [
            {"gaps.start" : {$gte: from, $lte: to}},
            {"gaps.end" : {$gte: from, $lte: to}},
            {"bounds" : {$gte: from, $lte: to}}
        ]
    }},
    {$project: {
        id: 1,
        gaps: 1,
        bounds: 1,
        subseq: {$substr: ["$seq", from, (to - from)]}}},
    {$group : {
        _id : "$id",
        gaps : {"$addToSet" : "$gaps"},
        bounds : {"$addToSet" : "$bounds"},
        subseq : {"$first" : "$subseq"}}}
])
Querying the whole collection (leaving out the first "$match" in the pipeline) takes ~14 seconds.
Querying individually all the documents sequentially takes at most 50msec each (~19 secs in total).
Querying individually all the documents in parallel takes in total ~7 secs.
Querying with a pipeline that only matches the id (ie, using the first "$match" in the pipeline) takes ~5 secs in total.
What would be the best db and query design to maximize the performance of this kind of queries?
What would be the best db and query design to maximize the performance of this kind of queries?
Since you ask about improving your code and design, I suggest you switch to the latest version of MongoDB if you have not yet; that should be a good starting point. For these types of problems, the basic idea is to reduce the number of documents being input to each pipeline operation.
I also suggest adding a variable named range which contains all the numbers between from and to, inclusive of both. This allows us to apply operators like $setIntersection on the bounds array.
So the variables, the aggregate operation needs from the environment should be:
var from = x; var to = y; var range=[x,...,y];
The first step is to match the documents that have the id, the gaps sub-documents, and bounds values in our range. This reduces the number of documents being input to the next stage, to say 500.
The next step is to $redact the non-conforming gaps sub-documents. This stage now works on the 500 documents filtered in the previous step.
The third step is to $project the fields as we need them.
Notice that we never needed the $unwind operator to achieve the task.
db.collection.aggregate([
    {$match : {"id" : "0000001",
        "gaps.start": {$gte: from},
        "gaps.end": {$lte: to},
        "bounds" : {$gte: from, $lte: to}}},
    {$redact: {
        $cond: [
            {$and: [
                {$gte: [{$ifNull: ["$start", from]}, from]},
                {$lte: [{$ifNull: ["$end", to]}, to]}
            ]}, "$$DESCEND", "$$PRUNE"
        ]
    }},
    {$project: {
        "bounds": {$setIntersection: [range, "$bounds"]},
        "id": 1,
        "gaps": 1,
        "length": 1,
        "subseq": {$substr: ["$seq", from, (to - from)]}}}
])
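For completeness, here is how the from/to/range inputs could be prepared and the same pipeline run from pymongo (a sketch; the connection details and the example bounds are assumptions, and the collection name is taken from the question):
from pymongo import MongoClient

db = MongoClient()["test"]
from_, to_ = 100, 200  # "from" is a Python keyword, hence the underscores
range_list = list(range(from_, to_ + 1))  # all integers from "from" to "to", inclusive

cursor = db.annot.aggregate([
    {"$match": {"id": "0000001",
                "gaps.start": {"$gte": from_},
                "gaps.end": {"$lte": to_},
                "bounds": {"$gte": from_, "$lte": to_}}},
    {"$redact": {"$cond": [
        {"$and": [{"$gte": [{"$ifNull": ["$start", from_]}, from_]},
                  {"$lte": [{"$ifNull": ["$end", to_]}, to_]}]},
        "$$DESCEND", "$$PRUNE"]}},
    {"$project": {"bounds": {"$setIntersection": [range_list, "$bounds"]},
                  "id": 1, "gaps": 1, "length": 1,
                  "subseq": {"$substr": ["$seq", from_, to_ - from_]}}},
])
print(list(cursor))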
I would like to store some very large integers in mongodb, exactly (several thousand decimal digits). This will not work of course with the standard types supported by BSON, and I am trying to think of the most elegant workaround, considering that I would like to perform range searches and similar things. This requirement excludes storing the integers as strings, as it makes range searches impractical.
One way I can think of is to encode the 2^32-expansion using (variable-length) arrays of standard ints, and add to this array a first entry for the length of the array itself. That way lexicographical ordering on these arrays corresponds to the usual ordering of arbitrarily large integers.
For instance, in a collection I could have the 5 documents
{"name": "me", "fortune": [1,1000]}
{"name": "scrooge mcduck", "fortune": [11,1,0,0,0,0,0,0,0,0,0,0]}
{"name": "bruce wayne","fortune": [2, 10,0]}
{"name": "bill gates", "fortune": [2,1,1000]}
{"name": "francis", "fortune": [0]}
Thus Bruce Wayne's net worth is 10*2^32, Bill Gates' 2^32+1000 and Scrooge McDuck's 2^320.
I can then do a sort using {"fortune":1} and on my machine (with pymongo) it returns them in the order francis < me < bill < bruce < scrooge, as expected.
However, I am making assumptions that I haven't seen documented anywhere about the way BSON arrays compare, and the range searches don't seem to work the way I think (for instance,
find({"fortune":{$gte:[2,5,0]}})
returns no document, but I would wish for bruce and scrooge).
Can anyone help me? Thanks
You can instead store left-padded strings which represent the exact integer value of the fortune.
e.g. "1000000" = 1 million
"0010000" = 10 thousand
"2000000" = 2 million
"0200000" = 2 hundred thousand
Left-padding with zeroes ensures that lexicographical comparison of these strings corresponds directly to their comparison as numeric values. You will have to assume a safe MAXIMUM possible value of the fortune here, say a 20-digit number, and pad with 0s accordingly.
So sample documents would be:
{"name": "scrooge mcduck", "fortune": "00001100000000000000" }
{"name": "bruce wayne", "fortune": "00000200000000000000" }
querying:
> db.test123.find()
{ "_id" : ObjectId("4f87e142f1573cffecd0f65e"), "name" : "bruce wayne", "fortune" : "00000200000000000000" }
{ "_id" : ObjectId("4f87e150f1573cffecd0f65f"), "name" : "donald", "fortune" : "00000150000000000000" }
{ "_id" : ObjectId("4f87e160f1573cffecd0f660"), "name" : "mickey", "fortune" : "00000000000000100000" }
> db.test123.find({ "fortune" : {$gte: "00000200000000000000"}});
{ "_id" : ObjectId("4f87e142f1573cffecd0f65e"), "name" : "bruce wayne", "fortune" : "00000200000000000000" }
> db.test123.find({ "fortune" : {$lt: "00000200000000000000"}});
{ "_id" : ObjectId("4f87e150f1573cffecd0f65f"), "name" : "donald", "fortune" : "00000150000000000000" }
{ "_id" : ObjectId("4f87e160f1573cffecd0f660"), "name" : "mickey", "fortune" : "00000000000000100000" }
The querying/sorting will work naturally, as mongodb compares strings lexicographically.
However, to do other numeric operations on your data, you will have to write custom logic in your data-processing script (PHP, Python, Ruby, etc.).
For querying and data storage, this string version should do fine.
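A minimal Python sketch of that encoding, assuming the 20-digit maximum suggested above:
MAX_DIGITS = 20  # the assumed safe maximum width

def encode_fortune(n: int) -> str:
    """Zero-pad a non-negative integer so lexicographic order matches numeric order."""
    s = str(n)
    if len(s) > MAX_DIGITS:
        raise ValueError("fortune exceeds the assumed maximum width")
    return s.zfill(MAX_DIGITS)

print(encode_fortune(2 * 10**14))  # '00000200000000000000' (bruce wayne's fortune above)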
Unfortunately your assumption about array comparison is incorrect. Range queries that, for example, query for all array values smaller than 3 ({array:{$lt:3}}) will return all arrays where at least one element is less than three, regardless of the element's position. As such your approach will not work.
What does work, but is a bit less obvious, is using binary blobs for your very large integers, since those are compared byte by byte. That requires you to set an upper bit limit for your integers, but that should be fairly straightforward. You can test it in the shell using the BinData(subType, base64) notation:
db.col.find({fortune:{$gt:BinData(0, "e8MEnzZoFyMmD7WSHdNrFJyEk8M=")}})
So all you'd have to do is create methods to convert your big integers from, say, strings to two's-complement binary and you're set. Good luck
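A sketch of that conversion in Python, assuming non-negative integers (where plain unsigned big-endian bits coincide with two's-complement) and an arbitrary 512-bit cap; bson.Binary ships with pymongo:
from bson import Binary

WIDTH_BYTES = 64  # 512-bit cap; an assumption, pick whatever upper limit you need

def encode_bigint(n: int) -> Binary:
    # fixed-width big-endian bytes compare the same way the integers do
    return Binary(n.to_bytes(WIDTH_BYTES, "big"))

def decode_bigint(b: Binary) -> int:
    return int.from_bytes(b, "big")

doc = {"name": "scrooge mcduck", "fortune": encode_bigint(2**320)}
assert decode_bigint(doc["fortune"]) == 2**320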