I am caching data from an online resource for future use in machine learning. This data is canonical and has no missing entries.
In the event that the real-time connection is dropped or the machine rebooted, I have a safeguard in place that does a historical search for a range of ids that are missing from the cache.
What I have yet to implement, however, is a mechanism for searching through the collection and identifying ranges where id values have been skipped.
For instance:
{"entry_id": 27497713, ...}
{"entry_id": 27497761, ...}
This data has a clear gap where entries are missing between 27497713 and 27497761.
Is there a way I can find such a gap using queries? Perhaps at least narrowing it down by selecting values between two bounds and checking the count of returned entries? Given how many entries the collection contains, I am trying to avoid lots of queries for efficiency.
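Something like this count check is what I have in mind, though it is only a sketch; the collection name and the window bounds are placeholders, and it assumes ids are consecutive integers with no duplicates:
// hypothetical window bounds
var lo = 27497000, hi = 27498000;
var expected = hi - lo;
var actual = db.entries.find({entry_id : {$gte : lo, $lt : hi}}).count();
if (actual < expected) {
    // at least one gap lies somewhere in [lo, hi); split the window and repeat to narrow it down
    print((expected - actual) + " ids are missing in [" + lo + ", " + hi + ")");
}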
You can try this aggregation:
$group - get the $min and $max entry_id
$addFields - generate the candidate ids with $range from $min to $max
$lookup - self-lookup the generated range ids against the existing entry_id values
$project - keep only the non-matching range ids using $setDifference
pipeline
db.entries.aggregate(
[
{$group : {_id : null, min : {$min : "$entry_id"}, max : {$max : "$entry_id"}}},
{$addFields : {rangeIds : {$range : ["$min", "$max"]}}},
{$lookup : {from : "entries", localField : "rangeIds", foreignField : "entry_id", as : "entries"}},
{$project : {_id :0, missingIds : {$setDifference : ["$rangeIds", "$entries.entry_id"]}}}
]
)
collection
> db.entries.find()
{ "_id" : ObjectId("5a6fea9b7346ce591a17ad22"), "entry_id" : 27497713 }
{ "_id" : ObjectId("5a6fea9b7346ce591a17ad23"), "entry_id" : 27497761 }
{ "_id" : ObjectId("5a6fea9b7346ce591a17ad24"), "entry_id" : 27497750 }
>
aggregate result
> db.entries.aggregate( [ {$group : {_id : null, min : {$min : "$entry_id"}, max : {$max : "$entry_id"}}}, {$addFields : {rangeIds : {$range : ["$min", "$max"]}}}, {$lookup : {from : "entries", localField : "rangeIds", foreignField : "entry_id", as : "entries"}}, {$project : {_id :0, missingIds : {$setDifference : ["$rangeIds", "$entries.entry_id"]}}} ] )
{ "missingIds" : [ 27497714, 27497715, 27497716, 27497717, 27497718, 27497719, 27497720, 27497721, 27497722, 27497723, 27497724, 27497725, 27497726, 27497727, 27497728, 27497729, 27497730, 27497731, 27497732, 27497733, 27497734, 27497735, 27497736, 27497737, 27497738, 27497739, 27497740, 27497741, 27497742, 27497743, 27497744, 27497745, 27497746, 27497747, 27497748, 27497749, 27497751, 27497752, 27497753, 27497754, 27497755, 27497756, 27497757, 27497758, 27497759, 27497760 ] }
>
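Note that $range materializes every id between the overall min and max in memory, so if the id span is very wide you may want to bound it first and run the pipeline once per window. A sketch of the same pipeline over one hypothetical window:
var lo = 27497000, hi = 27498000;   // hypothetical window bounds
db.entries.aggregate(
[
    {$match : {entry_id : {$gte : lo, $lt : hi}}},
    {$group : {_id : null, min : {$min : "$entry_id"}, max : {$max : "$entry_id"}}},
    {$addFields : {rangeIds : {$range : ["$min", "$max"]}}},
    {$lookup : {from : "entries", localField : "rangeIds", foreignField : "entry_id", as : "entries"}},
    {$project : {_id : 0, missingIds : {$setDifference : ["$rangeIds", "$entries.entry_id"]}}}
]
)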
This is a document stored in my collection:
{
"_id" : UUID("61a2053c-1a79-4649-8793-df6c4dc1973"),
"NotificationId" : UUID("ad068e4e-10e2-528c-a74a-df6c4dd9211"),
"DistributionId" : UUID("f5445ea1-e6cb-4acd-9881-c4122df6c4d"),
"CreationDateTime" : ISODate("2016-07-13T04:20:38.697Z"),
"ExpirationDateTime" : ISODate("2099-01-01T00:00:00.000Z"),
"DeliveryType" : 1,
"DeliveryParams" : [],
"Address" : "Topics/Messages/Global",
"Payload" : "{\"Id\":\"ad067896-10e2-528c-er87-df6c4d123654\",\"CreationDateTime\":\"\\/Date(1468324824751)\\/\",\"DeviceId\":\"456987456985\",\"UserId\":\"64545678-1234-4834-4321-123456789012\",\"UserFullName\":\"test-user\",\"SystemId\":\"com.messaging\",\"SystemTitle\":\"message\",\"EventId\":\"messaging.message\",\"EventTitle\":\"ارسال پیام\",\"EventData\":[],\"BusinessCode\":\"1-2-4-4-5-6-9\",\"ProcessId\":\"55333333-4433-3333-7733-113333333399\",\"WorkItemId\":423458,\"WKT\":\"\"}",
"SendAttempts" : null,
"Sent" : ISODate("2016-11-10T10:01:22.140Z"),
"Delivered" : ISODate("1970-01-01T00:00:00.000Z")
}
My question is: how can I extract \"BusinessCode\":\"1-2-4-4-5-6-9\" from inside the
Payload field? I just need BusinessCode: 1-2-4-4-5-6-9 so I can store it in another field.
I used this script:
db.Messages.find().forEach(function(item)
{
    var id = item._id;
    var payload = item.Payload;
    var matched = payload.match(/\"BusinessCode\":\"(([1-2]?[0-9])-([0-9]*)-([0-9]*)-([0-9]*)-([0-9]*)-([0-9]*)-([0-9]*))\"/);
    if (matched) {
        // matched[1] is the full "1-2-4-4-5-6-9" capture; the target field name below is assumed
        db.Messages.updateOne({_id : id}, {$set : {BusinessCode : matched[1]}});
    }
});
This payload.match returns:
BusinessCode":"1-2-4-4-5-6-9",1-2-4-4-5-6-9,1,2,4,4,5,6,9
This script is too slow and not appropriate for a collection with many documents, so I want to use the aggregation pipeline instead.
How can I get exactly the same result as payload.match with an aggregation pipeline?
There is no regex substring operator in MongoDB's aggregation pipeline as of 3.6.2.
We can use the substring operators to extract the business code instead; since the payload contains Unicode characters, we need the code point (CP) variants ($indexOfCP, $substrCP) to get the expected results.
db.col.aggregate(
[
    {$addFields : {
        // code point index where "BusinessCode" begins
        start : {$indexOfCP : ["$Payload", "BusinessCode"]},
        // code point index where the next key, "ProcessId", begins
        end : {$indexOfCP : ["$Payload", "ProcessId"]}
        }
    },
    // the value starts 15 code points in (the length of 'BusinessCode":"'), and its length is
    // (end - start) minus that 15-char prefix and the 3 trailing characters '","', i.e. minus 18
    {$project : {BusinessCode : {$substrCP : ["$Payload", {$sum : ["$start", 15]}, {$subtract : [{$subtract : ["$end", "$start"]}, 18]}]}}}
]
)
result
{ "_id" : "61a2053c-1a79-4649-8793-df6c4dc1973", "BusinessCode" : "1-2-4-4-5-6-9" }
I'm looking to optimize MongoDB performance by minimizing the number of records to unwind.
I do like:
unwind(injectionRecords),
match("machineID" : "machine1"),
count(counter)
But because of the huge amount of data, the unwind operation takes a lot of time, and only then does the match run on its output.
It unwinds all 4 records, then matches machineID on the result and gives me the count.
Instead I would like to do something like :
match("machineID": "machine1"),
unwind(injectionRecords)
count(counter)
So it would match the documents with that machineID first, unwind only 2 records instead of 4, and give me the count.
Is this possible? How can I do this?
Here are sample docs,
{
"_id" : ObjectId("5981c24b90a7c215e4f166dd"),
"machineID" : "machine1",
"injectionRecords" : [
{
"startTime" : ISODate("2017-08-02T17:45:04.779+05:30"),
"endTime" : ISODate("2017-08-02T17:45:07.763+05:30"),
"counter" : 1
},
{
"startTime" : ISODate("2017-08-02T17:45:24.417+05:30"),
"endTime" : ISODate("2017-08-02T17:45:27.402+05:30"),
"counter" : 2
}
]
},
{
"_id" : ObjectId("5981c24b90a7c215e4f166de"),
"machineID" : "machine2",
"injectionRecords" : [
{
"startTime" : ISODate("2017-08-02T17:46:04.779+05:30"),
"endTime" : ISODate("2017-08-02T17:46:07.763+05:30"),
"counter" : 1
},
{
"startTime" : ISODate("2017-08-02T17:46:24.417+05:30"),
"endTime" : ISODate("2017-08-02T17:46:27.402+05:30"),
"counter" : 2
}
]
}
The following query will return a count of injectionRecords for a given machineID. I think this is what you are asking for.
db.collection.aggregate([
{$match: {machineID: 'machine1'}},
{$unwind: '$injectionRecords'},
{$group:{_id: "$_id",count:{$sum:1}}}
])
Of course, this query (where the unwind takes place before the match) is functionally equivalent:
db.collection.aggregate([
{$unwind: '$injectionRecords'},
{$match: {machineID: 'machine1'}},
{$group:{_id: "$_id",count:{$sum:1}}}
])
However, running that query with explain ...
db.collection.aggregate([
{$unwind: '$injectionRecords'},
{$match: {machineID: 'machine1'}},
{$group:{_id: "$_id",count:{$sum:1}}}
], {explain: true})
... shows that the $unwind stage applies to the entire collection, whereas if you match before unwinding, only the matched documents are unwound.
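As an aside, if all you ultimately need is the number of injectionRecords per matched document, a sketch using $size avoids unwinding altogether (it assumes injectionRecords is always an array):
db.collection.aggregate([
    {$match: {machineID: 'machine1'}},
    // $size counts the array elements directly, so no $unwind/$group is needed
    {$project: {count: {$size: '$injectionRecords'}}}
])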
I am trying to return events where someone has not been invited. However,
all my queries are returning data; nothing should be returned when I run the query. What am I missing?
"__v" : 0,
"_id" : ObjectId("565cca79a9baa9b1522b57eb"),
"attendees" : [
{
"_id" : ObjectId("565cca79a9baa9b1522b57ec"),
"attendee" : ObjectId("557dfb4fc8c9ecbb07c2f98c"),
"statustext" : "Accepted",
"status" : 1
},
{
"attendee" : ObjectId("55dec11f38180102145d0060"),
"_id" : ObjectId("565f6bacdcbac0a6a354420c"),
"statustext" : "Pending",
"status" : 0
}
]
}
db.events.find({attendees:{$elemMatch:{attendee:{$ne:"55dec11f38180102145d0060"}}}})
db.events.find({attendees:{$elemMatch:{attendee:{$ne:'55dec11f38180102145d0060'}}}})
db.events.find({attendees:{$elemMatch:{attendee:{$ne:ObjectId('55dec11f38180102145d0060')}}}})
Quoting the docs:
The $elemMatch operator matches documents that contain an array field
with at least one element that matches all the specified query
criteria.
That means $elemMatch with $ne matches any event that has at least one attendee other than the given one, which is why your queries still return data. It is not suited for this case; query the array field directly instead, comparing against an ObjectId rather than a string:
db.events.find({"attendees.attendee":{$ne: ObjectId("55dec11f38180102145d0060")}})
I have a mongo document with a structure like this:
{
"_id" : "THIS_IS_A_DHP_USER_ID+2014-11-26",
"_class" : "weight",
"items" : [
{
"dateTime" : ISODate("2014-11-26T08:08:38.716Z"),
"value" : 98.5
},
{
"dateTime" : ISODate("2014-11-26T08:18:38.716Z"),
"value" : 95.5
},
{
"dateTime" : ISODate("2014-11-26T08:28:38.663Z"),
"value" : 90.5
}
],
"source" : "MANUAL",
"to" : ISODate("2014-11-26T08:08:38.716Z"),
"from" : ISODate("2014-11-26T08:08:38.716Z"),
"userId" : "THIS_IS_A_DHP_USER_ID",
"createdDate" : ISODate("2014-11-26T08:38:38.776Z")
}
{
"_id" : "THIS_IS_A_DHP_USER_ID+2014-11-25",
"_class" : "weight",
"items" : [
{
"dateTime" : ISODate("2014-11-25T08:08:38.716Z"),
"value" : 198.5
},
{
"dateTime" : ISODate("2014-11-25T08:18:38.716Z"),
"value" : 195.5
},
{
"dateTime" : ISODate("2014-11-25T08:28:38.716Z"),
"value" : 190.5
}
],
"source" : "MANUAL",
"to" : ISODate("2014-11-25T08:08:38.716Z"),
"from" : ISODate("2014-11-25T08:08:38.716Z"),
"userId" : "THIS_IS_A_DHP_USER_ID",
"createdDate" : ISODate("2014-11-26T08:38:38.893Z")
}
The query I want to run against this document structure:
finding documents for a particular user id
unwinding the embedded array
grouping the documents by _id, while
summing the items.value of the embedded array
getting the minimum of the items.dateTime of the embedded array
Note: I want the sum and min as an object, i.e. { value : sum, dateTime : min of the items.dateTime }, inside an array of items.
Can this be achieved in a single aggregation call using $push or some other technique?
When you group on a particular _id and apply accumulators such as $min and $sum, there is only one record per group (_id), and it holds the sum and the minimum date for that group. So there is no way to obtain a different sum and a different minimum date for the same _id; that would make no logical sense.
What you would want to do is:
db.collection.aggregate([
{$match:{"userId":"THIS_IS_A_DHP_USER_ID"}},
{$unwind:"$items"},
{$group:{"_id":"$_id",
"values":{$sum:"$items.value"},
"dateTime":{$min:"$items.dateTime"}}}
])
But when you do not query for a particular userId, you will have multiple groups, each with its own sum and min date. In that case it makes sense to accumulate all of these results into an array using the $push operator.
db.collection.aggregate([
{$unwind:"$items"},
{$group:{"_id":"$_id",
"result":{$sum:"$items.value"},
"dateTime":{$min:"$items.dateTime"}}},
{$group:{"_id":null,"result":{$push:{"value":"$result",
"dateTime":"$dateTime",
"id":"$_id"}}}},
{$project:{"_id":0,"result":1}}
])
You could use the following aggregation; it may work for you:
db.collectionName.aggregate(
[
    {"$unwind":"$items"},
    {"$match":{"userId":"THIS_IS_A_DHP_USER_ID"}},
    {"$group":{"_id":"$_id","sum":{"$sum":"$items.value"},
        "minDate":{"$min":"$items.dateTime"}}}
]
)
Given the following data set:
{ "_id" : ObjectId("510458b188ce1d16e616129b"), "codes" : [ "oxtbyr", "xstute" ], "name" : "Ciao Mambo", "permalink" : "ciaomambo", "visits" : 1 }
{ "_id" : ObjectId("510458b188ce1d16e6161296"), "codes" : [ "zpngwh", "odszfy", "vbvlgr" ], "name" : "Anthony's at Spokane Falls", "permalink" : "anthonysatspokanefalls", "visits" : 0 }
How can I convert this python/pymongo sort into something that will work with the MongoDB aggregation framework? I'm sorting results based on the number of codes within the codes array.
z = [(x['name'], len(x['codes'])) for x in restaurants]
sorted_by_second = sorted(z, key=lambda tup: tup[1], reverse=True)
for x in sorted_by_second:
    print x[0], x[1]
This works in python, I just want to know how to accomplish the same goal on the MongoDB query end of things.
> db.z.aggregate([
      { $unwind : '$codes' },
      { $group : { _id : '$_id', count : { $sum : 1 } } },
      { $sort : { count : -1 } }
  ])
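If you also want the name alongside the count, and to avoid unwinding entirely, a sketch using $size (it assumes every document has a codes array):
db.z.aggregate([
    { $project : { name : 1, count : { $size : '$codes' } } },
    { $sort : { count : -1 } }
])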