I am new to MongoDB and trying to count the number of Delta values in DELTA_GROUP. In the example below there are three "Delta" values under "DELTA_GROUP", so the count for this document is 3. I need to satisfy two conditions, though.
First, I only want to count data collected within a specific time range that I set (e.g. between a start point and an end point, using ISODate with $gte/$lte).
Second, within that time range I want to count the number of Delta values for every document, and of course there are a handful of documents in the specified range. So if I assume that each document has three Delta values (as in the example) and there are 10 documents in total, the count result should be 30. How can I create a query that satisfies both conditions?
{
    "_id" : ObjectId("5f68a088135c701658c24d62"),
    "DELTA_GROUP" : [
        { "Delta" : 105 },
        { "Delta" : 108 },
        { "Delta" : 103 }
    ],
    "YEAR" : 2020,
    "MONTH" : 9,
    "DAY" : 21,
    "RECEIVE_TIME" : ISODate("2020-09-21T21:46:00.323Z")
}
What I have tried so far is shown below. This lists the count for each document, but I still need to work out how to get the totalized count for the specified range of dates.
db.DELTA_DATA.aggregate([
    {$match: {
        "RECEIVE_TIME": {
            $gte: ISODate("2020-09-10T00:00:00"),
            $lte: ISODate("2020-10-15T23:59:59")
        }
    }},
    {$project: {"total": {count: {"$size": "$DELTA_GROUP"}}}}
])
I think I have a pretty complex one here - not sure if I can do this or not.
I have data that has an address field and a data field. The data field is a hex value. I would like to run an aggregation that groups the data by address and then by the length of the hex data. All of the data comes in as 16 characters long, but the length of that data should be calculated in bytes.
I think I have to take the data, strip the trailing 00's (using the regex 00+$), and divide the remaining character count by 2 to get the length. After that, I would group by address and final byte length.
An example dataset would be:
{addr:829, data:'4100004822000000'}
{addr:829, data:'4100004813000000'}
{addr:829, data:'4100004804000000'}
{addr:506, data:'0000108000000005'}
{addr:506, data:'0000108000000032'}
{addr:229, data:'0065005500000000'}
And my desired output would be:
{addr:829, length:5}
{addr:506, length:8}
{addr:229, length:4}
Is this even possible in an aggregation query, without having to use external code?
This is not too complicated if your "data" is in fact a string, as shown in your sample data. Assuming data exists and is set to something (you can add error checking as needed), you can get the result you want like this:
db.coll.aggregate([
    {$addFields: {lastNonZero: {$add: [2, {$reduce: {
        initialValue: -2,
        input: {$range: [0, {$strLenCP: "$data"}, 2]},
        in: {$cond: {
            if: {$eq: ["00", {$substr: ["$data", "$$this", 2]}]},
            then: "$$value",
            else: "$$this"
        }}
    }}]}}},
    {$group: {_id: {
        addr: "$addr",
        length: {$divide: ["$lastNonZero", 2]}
    }}}
])
I used two stages, but of course they could be combined into a single $group if you wish. In the $reduce I step through data two characters at a time, checking whether the pair equals "00". Every time it does not, I update the accumulated value to the current position in the sequence. Since that returns the position of the last non-"00" pair, we add 2 to it to find where the run of zeros that continues to the end of the string starts, and later in $group we divide that by 2 to get the true length in bytes. For example, in "4100004822000000" the last non-"00" pair ("22") starts at position 8, so lastNonZero becomes 10 and the length is 10 / 2 = 5.
On your sample data, this returns:
{ "_id" : { "addr" : 229, "length" : 4 } }
{ "_id" : { "addr" : 506, "length" : 8 } }
{ "_id" : { "addr" : 829, "length" : 5 } }
You can add a $project stage to transform the field names into ones you want returned.
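As a sketch, assuming you want the addr and length names from your desired output, that stage could look like:

{$project: {_id: 0, addr: "$_id.addr", length: "$_id.length"}}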
I am storing time-series data across multiple fixed-sized, pre-allocated documents. When one fills up, another is created. Each document has two pre-calculated values:
prevEnd (stores the value at the last index of the previous document's values)
nextStart (stores the value at the first index of the next document's values)
I want to rely on these pre-aggregated values to find a range of documents when searching by a time range. The following example uses integers in place of timestamps or dates for clarity.
Question: How can I select the two documents below knowing only the time range of interest (111-114)?
{
"prevEnd"; 107,
"nextStart": 110,
"time" : [
NumberLong(107),
NumberLong(108),
NumberLong(109)
]
},
//-----------------Select Start
{
"prevEnd": 109,
"nextStart": 113,
"time" : [
NumberLong(110),
NumberLong(111),
NumberLong(112)
]
},
{
"prevEnd": 112,
"nextStart": 116,
"time" : [
NumberLong(113),
NumberLong(114),
NumberLong(115)
]
},
//-----------------Select End
{
"prevEnd": 115,
"nextStart": 99999999999999999999999999999999,
"time" : [
NumberLong(116),
NumberLong(117),
NumberLong(118)
]
}
The following find() call will work:
db.collection.find({"time": {"$elemMatch": {$gt: 111, $lt: 114}}})
because it uses the $elemMatch operator to match the documents which contain a time field with at least one element that matches both the upper and lower limits.
But since your question explicitly refers to prevEnd and nextStart, I suspect you are looking for a solution that filters on those attributes. For example:
db.collection.find({$or: [{"prevEnd": {$gt: 111}}, {"nextStart": {$gt: 111}}], "prevEnd": {$lt: 114}})
I am a newbie in MongoDB, but I am trying to write a query that identifies whether any of my fields meet the requirements.
Consider the following:
I have a collection where EACH document is formatted as:
{
    "nutrition" : [
        {
            "name" : "Energy",
            "unit" : "kcal",
            "value" : 150.25,
            "_id" : ObjectId("fdsfdslkfjsdf")
        },
        {---then there's more in the array---}
    ],
    "serving" : 4,
    "id" : "Food 1"
}
My current code looks something like this:
db.recipe.find(
{"nutrition": {$elemMatch: {"name": "Energy", "unit": "kcal", "value": {$lt: 300}}}},
{"id":1, _id:0}
)
Under the nutrition array there is an element whose name is Energy and whose value is a number. The query checks whether that value is less than 300 and outputs all the documents that meet this requirement (I made it output only the id field).
Now my question is the following:
1) Each document has another field called "serving", and I am supposed to find out if "value"/"serving" is still less than 300 (as in, divide value by serving and see if the result is still less than 300).
2) Since I am using .find, I am guessing I can't use the $divide operator from aggregation?
3) I have been trying to play around with aggregation operators like $divide + $cond, but no luck so far.
4) Normally in other languages I would just create a variable a = value/serving and run it through an if statement to check whether it's less than 300, but I am not sure if that's possible in MongoDB.
Thank you.
In case anyone was struggling with a similar problem, I figured out how to do this.
db.recipe.aggregate([
    {$unwind: "$nutrition"}, // break open the nutrition array, one document per element
    {$match: {"nutrition.name": "Energy", "nutrition.unit": "kcal"}}, // keep only the Energy-in-kcal elements
    {$project: {
        "Calories per serving": {$divide: ["$nutrition.value", "$serving"]}, // compute kcal per serving
        "id": 1, _id: 0 // show only the food id alongside the computed value
    }},
    {$match: {"Calories per serving": {$lt: 300}}} // drop anything at 300 kcal per serving or above
])
So basically, you open up the array, filter out any elements you don't want, then use $project to display the result of whatever math needs to be done, and finally filter on your condition, which for me was that I don't want to see any foods whose calories per serving are 300 or more.
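On the sample document above this computes 150.25 / 4 = 37.5625, which is under 300, so the pipeline would return roughly:

{ "Calories per serving" : 37.5625, "id" : "Food 1" }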
Currently, I have a MongoDB collection with documents of the following type (~1000 docs):
{ _id : "...",
id : "0000001",
gaps : [{start:100, end:110}, {start:132, end:166}...], // up to ~1k elems
bounds: [68, 88, 126, 228...], // up to 100 elems
length: 300,
seq : "ACCGACCCGAGGACCCCTGAGATG..."
}
"gaps" and "bounds" values in the array refer to coordinates in "seq" and "length" refers to "seq" length.
I have defined indexes on "id", "gaps" and "bounds".
I need to query based on coordinate ranges. For example, given "from=100" and "to=200", I want to retrieve for each document the sub-array of "gaps", the sub-array of "bounds", and the substring of "seq" that lie inside the range (between 100 and 200 in this case).
Right now, this is done using the following aggregation pipeline:
db.annot.aggregate([
    {$match : {"id" : "0000001"}},
    {$unwind : "$gaps"},
    {$unwind : "$bounds"},
    {$match : {
        $or : [
            {"gaps.start" : {$gte: from, $lte: to}},
            {"gaps.end" : {$gte: from, $lte: to}},
            {"bounds" : {$gte: from, $lte: to}}
        ]
    }},
    {$project : {
        id: 1,
        gaps: 1,
        bounds: 1,
        subseq: {$substr: ["$seq", from, (to - from)]}
    }},
    {$group : {
        _id : "$id",
        gaps : {"$addToSet" : "$gaps"},
        bounds : {"$addToSet" : "$bounds"},
        subseq : {"$first" : "$subseq"}
    }}
])
Querying the whole collection (leaving out the first "$match" in the pipeline) takes ~14 seconds.
Querying all the documents individually and sequentially takes at most 50 msec each (~19 secs in total).
Querying all the documents individually in parallel takes ~7 secs in total.
Querying with a pipeline that only matches the id (ie, using the first "$match" in the pipeline) takes ~5 secs in total.
What would be the best database and query design to maximize the performance of this kind of query?
Since you ask about improving your code and design, I suggest you switch to the latest version of MongoDB if you have not already; that should be a good starting point. For this type of problem, the basic idea is to reduce the number of documents being input to each pipeline stage.
I suggest you keep an additional variable named range which contains all the numbers between from and to, inclusive of both. This allows us to apply operators like $setIntersection on the bounds array.
So the variables the aggregate operation needs from the environment are:
var from = x; var to = y; var range = [x, ..., y];
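As a small sketch (assuming integer coordinates), the range array can be built in the shell like this:

var range = [];
for (var i = from; i <= to; i++) { // inclusive of both endpoints
    range.push(i);
}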
The first step is to match the documents that have the right id, gaps sub-documents and bounds values in our range. This reduces the number of documents being input to the next stage to, say, 500.
The next step is to $redact the non-conforming gaps sub-documents. This stage now works on the 500 documents filtered in the previous step.
The third step is to $project our fields as needed.
Notice that we did not need the $unwind operator anywhere to achieve the task.
db.collection.aggregate([
    {$match : {
        "id" : "0000001",
        "gaps.start" : {$gte: from},
        "gaps.end" : {$lte: to},
        "bounds" : {$gte: from, $lte: to}
    }},
    {$redact : {
        $cond : [
            {$and : [
                {$gte : [{$ifNull : ["$start", from]}, from]},
                {$lte : [{$ifNull : ["$end", to]}, to]}
            ]},
            "$$DESCEND",
            "$$PRUNE"
        ]
    }},
    {$project : {
        "bounds" : {$setIntersection : [range, "$bounds"]},
        "id" : 1,
        "gaps" : 1,
        "length" : 1,
        "subseq" : {$substr : ["$seq", from, (to - from)]}
    }}
])
I have a MongoDB collection with over 5 million items. Each item has "start" and "end" fields containing integer values.
Items don't have overlapping starts and ends.
e.g. this would be invalid:
{start:100, end:200}
{start:150, end:250}
I am trying to locate an item where a given value is between start and end
start <= VALUE <= end
The following query works, but it takes 5 to 15 seconds to return
db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1);
I've added the following indexes for testing with very little improvement
db.blocks.ensureIndex({start:1});
db.blocks.ensureIndex({end:1});
//also a compounded one
db.blocks.ensureIndex({start:1,end:1});
** Edit **
The result of explain() on the query results in:
> db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1).explain();
{
"cursor" : "BtreeCursor end_1",
"nscanned" : 1160982,
"nscannedObjects" : 1160982,
"n" : 0,
"millis" : 5779,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"end" : [
[
3232235521,
1.7976931348623157e+308
]
]
}
}
What would be the best approach to speeding this specific query up?
Actually, I'm working on a similar problem and my friend found a nice way to solve it.
If you don't have overlapping data, you can do this:
query using the start field and a descending sort
validate with the end field
For example you can do:
var x = 100;
var cursor = db.collection.find({start: {$lte: x}}).sort({start: -1}).limit(1);
if (cursor.hasNext()) {
    // candidate: the range with the largest start that is still <= x
    var result = cursor.next();
    if (result.end >= x) {
        return result; // x falls inside this range
    }
}
return null; // no range contains x
If you are sure that there will always be a range containing x, then you do not have to validate the result.
By using this piece of code, you only have to index on either the start or the end field, and your query becomes a lot faster.
--- edit
I did some benchmarking: using the compound index takes 100-100,000 ms per query, while using a single-field index takes 1-5 ms per query.
I guess a compound index should work faster for you:
db.blocks.ensureIndex({start:1, end:1});
You can also use explain() to see the number of scanned objects, etc., and choose the best index.
Also, if you are using MongoDB < 2.0, you should update to 2.0+, because indexes work faster there.
You can also limit the results to optimize the query.
This might help: how about introducing some redundancy? If there is not a big variance in the lengths of the intervals, you can add a tag field to each record. This tag is a single value or string that represents a large interval: say, tag 50,000 marks all records whose intervals lie at least partially in the range 0-50,000, tag 100,000 marks all intervals in the range 50,000-100,000, and so on. Now you can index on the tag as primary and one of the endpoints of the record's range as secondary.
Records on the edge of a big interval would have more than one tag, so we are talking about multikey indexes. In your query you would of course calculate the big-interval tag and use it in the query.
You would roughly want SQRT of the total records per tag; that is just a starting point for tests, and you can then fine-tune the big-interval size.
Of course this would make writes a bit slower.
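A minimal sketch of the idea in the shell, assuming a hypothetical bucket size of 50,000 (BUCKET, tagsFor and the tags field are illustrative names, not from the original):

var BUCKET = 50000; // hypothetical big-interval size; tune toward ~sqrt(total records) per tag

// every bucket tag that the interval [start, end] touches
function tagsFor(start, end) {
    var tags = [];
    for (var t = Math.floor(start / BUCKET); t <= Math.floor(end / BUCKET); t++) {
        tags.push(t);
    }
    return tags;
}

// on insert, store the redundant tags alongside the interval:
// db.blocks.insert({start: s, end: e, tags: tagsFor(s, e)})
db.blocks.ensureIndex({tags: 1, start: 1}); // multikey index: tag primary, one endpoint secondary

// a lookup only scans the single bucket that contains the value
var x = 3232235521;
db.blocks.find({tags: Math.floor(x / BUCKET), start: {$lte: x}, end: {$gte: x}}).limit(1);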