MongoDB aggregate query count records for large dataset

I'm attempting to query all data from the errorlog collection, and in the same query grab a count of relevant irs_documents for each errorlog entry.
The problem is that there are too many records in the irs_documents collection to perform a $lookup.
Is there a performant method of doing this in one MongoDB query?
Failed attempt
db.getCollection('errorlog').aggregate(
[
{
$lookup: {
from: "irs_documents",
localField: "document.ssn",
foreignField: "ssn",
as: "irs_documents"
}
},
{
$group: {
_id: { document: "$document", error: "$error" },
logged_documents: { $sum : 1 }
}
}
]
)
Error
Total size of documents in $lookup exceeds maximum document size
Clearly this solution won't work. MongoDB is literally attempting to gather whole documents with $lookup, where I just want a count.
"errorlog" collection sample data:
/* 1 */
{
"_id" : ObjectId("56d73955ce09a5a32399f022"),
"document" : {
"ssn" : 1
},
"error" : "Error 1"
}
/* 2 */
{
"_id" : ObjectId("56d73967ce09a5a32399f023"),
"document" : {
"ssn" : 2
},
"error" : "Error 1"
}
/* 3 */
{
"_id" : ObjectId("56d73979ce09a5a32399f024"),
"document" : {
"ssn" : 3
},
"error" : "Error 429"
}
/* 4 */
{
"_id" : ObjectId("56d73985ce09a5a32399f025"),
"document" : {
"ssn" : 9
},
"error" : "Error 1"
}
/* 5 */
{
"_id" : ObjectId("56d73990ce09a5a32399f026"),
"document" : {
"ssn" : 1
},
"error" : "Error 8"
}
"irs_documents" collection sample data
/* 1 */
{
"_id" : ObjectId("56d73905ce09a5a32399f01e"),
"ssn" : 1,
"name" : "Sally"
}
/* 2 */
{
"_id" : ObjectId("56d7390fce09a5a32399f01f"),
"ssn" : 2,
"name" : "Bob"
}
/* 3 */
{
"_id" : ObjectId("56d7391ace09a5a32399f020"),
"ssn" : 3,
"name" : "Kelly"
}
/* 4 */
{
"_id" : ObjectId("56d7393ace09a5a32399f021"),
"ssn" : 9,
"name" : "Pippinpaddle-Oppsokopolis"
}

The error is self-explanatory. $lookup embeds every matching irs_documents document in an array on the errorlog document, so the joined result has to fit within MongoDB's 16 MB BSON document size limit, and here it does not.
You need to ask yourself whether it is absolutely necessary to do both things in one operation. If it is not, you can do it the way it had to be done in earlier versions of MongoDB, which had no $lookup:
run two queries and merge the results in your client.
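A minimal sketch of that client-side merge, in plain JavaScript. The arrays are hard-coded stand-ins for the two query results (an assumption: errorlogDocs would come from a find on errorlog, and ssnCounts from a $group on irs_documents by ssn), and the helper name attachCounts is hypothetical:

```javascript
// Stand-ins for the two query results (hypothetical sample data).
const errorlogDocs = [
  { document: { ssn: 1 }, error: "Error 1" },
  { document: { ssn: 2 }, error: "Error 1" },
  { document: { ssn: 7 }, error: "Error 8" },
];
// Shape produced by { $group: { _id: "$ssn", count: { $sum: 1 } } }.
const ssnCounts = [
  { _id: 1, count: 3 },
  { _id: 2, count: 1 },
];

// Index the counts by ssn, then attach one number to each errorlog entry.
function attachCounts(logs, counts) {
  const bySsn = new Map(counts.map((c) => [c._id, c.count]));
  return logs.map((log) => ({
    ...log,
    // An ssn with no irs_documents at all gets a count of 0.
    logged_documents: bySsn.get(log.document.ssn) || 0,
  }));
}

const merged = attachCounts(errorlogDocs, ssnCounts);
console.log(merged);
```

Only one small { _id, count } pair per ssn crosses the wire, so no joined document ever approaches the size limit.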
OPTION #1: Aggregate on irs_documents and write the computed result to another collection with $out. Since each resulting document holds only an ssn and a count, I don't think the $lookup will hit the size limit. The $group itself may run into the aggregation memory limit on a large collection and be forced to use disk (the allowDiskUse option). Try the following solution and see if it works.
db.irs_documents.aggregate([
{
$group:{_id:"$ssn", count:{$sum:1}}
},
{
$out:"irs_documents_group"
}]);
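db.errorlog.aggregate([
{
$lookup: {
from: "irs_documents_group",
localField: "document.ssn",
// the $group above stored each ssn in _id, so join on _id rather than ssn
foreignField: "_id",
as: "irs_documents"
}
},
{
// read the precomputed count from the single joined document (0 if no match)
$project: {
document: 1,
error: 1,
logged_documents: { $ifNull: [ { $arrayElemAt: ["$irs_documents.count", 0] }, 0 ] }
}
}
])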
OPTION #2: If the above solution does not work, you can always fall back to map-reduce. It is not an elegant solution, but it will work.

Related

Individual search result in multiple values in arrays

I have following model:
{
"_id" : ObjectId("5d61aaf8108e185191552bbb"),
"serials" : [
"e127av48-0697-4977-b096-5ce79c89a414",
"d163f80a-55ff-40fe-90b4-331ece5bebd5",
"4740021f-e9b5-4ca5-bf0e-8554c123bb94",
"320ffd42-f101-4b1d-8ff4-80bc693a29e6",
"fef5e68b-aed0-4a96-9488-7941c41d1c1f",
"2c0752ba-bf7a-4a3b-bd9f-14db4b2f8bae",
"6c5ff44d-5979-4bff-af12-9e6d282c3789",
"9c91bf91-72d7-4b71-827b-924947d6e93d",
"fb34b28e-afb1-4b6a-a3c1-5a1fe44246ee",
"91ab22ef-702f-4cbd-8919-a67a2b9a684c",
"ee1a7cb2-e088-47e6-a824-c8697df7d94c",
"0dc4c687-4db2-481e-a1a6-491320dede11",
"34612148-3e01-44ee-b262-de2035e63691",
"5ba85baf-e48a-40af-8578-55ff1a873c76",
"19fe3672-b6cb-4bb6-8d21-93412b938584",
"1d0d6f6d-1b49-461b-8661-ecbf43a6595e",
"d9a5455c-65ee-45e1-ae49-33cc15dec841",
"4a690a00-a76c-4d3e-aee3-78b2bb731b0c",
"ae331830-40b4-457c-8cc4-5d548f769c3e",
"fe3e460b-c89d-4ace-8a36-5ba2b53bf4d0",
"2cc6a2a0-e029-475f-a7fc-a46a79afb605",
"a7d07767-eada-4ce3-b083-9b048e9ae9f4"
],
"name" : "ApiCard",
"producer" : "Farmina",
"form" : "syrop",
"__v" : 0
}
I would like to retrieve (multiple) documents from the collection based on these serial numbers (the "serials" field). For example, I am searching for:
[
"e127av48-0697-4977-b096-5ce79c89a414",
"d163f80a-55ff-40fe-90b4-331ece5bebd5",
"4740021f-e9b5-4ca5-bf0e-8554c123bb94",
"key that doesn't exist",
]
We have to assume that one of the serial numbers doesn't exist, so I would like to get a result for each individual serial. Expected output:
[
{
"serial":"e127av48-0697-4977-b096-5ce79c89a414",
"doc":{
....whole document where above serial is in array field "serials"
}
},
{
"serial":"e127av48-0697-4977-b096-5ce79c89a414",
"doc":{
....whole document where above serial is in array field "serials"
}
},
{
"serial":"4740021f-e9b5-4ca5-bf0e-8554c123bb94",
"doc":{
....whole document where above serial is in array field "serials"
}
},
{
"serial":"key that doesn't exist",
"doc": null
}
]
I tried the simplest solution, a MongoDB find by multiple array items, but unfortunately it doesn't return a result for each individual serial number. I'm not sure it's possible to write this kind of query. I think some complex aggregation could do it, but I don't know pipelines of that kind.
Of course, I could get a simple solution by running multiple aggregate or even find calls, but that could hurt performance when the application is looking up 10000 records per request.

The following query can do the trick:
db.collection.aggregate([
{
$limit:1
},
{
$project:{
"_id":0,
"serialsToSearch":[
"e127av48-0697-4977-b096-5ce79c89a414",
"d163f80a-55ff-40fe-90b4-331ece5bebd5",
"4740021f-e9b5-4ca5-bf0e-8554c123bb94",
"key that doesn't exist",
]
}
},
{
$unwind:"$serialsToSearch"
},
{
$lookup:{
"from":"collection",
"let":{
"serial":"$serialsToSearch"
},
"pipeline":[
{
$match:{
$expr:{
$in:["$$serial","$serials"]
}
}
},
{
$project:{
"serials":0
}
}
],
"as":"searialsLookup"
}
},
{
$unwind:{
"path":"$searialsLookup",
"preserveNullAndEmptyArrays":true
}
},
{
$project:{
"serial":"$serialsToSearch",
"doc":{
$ifNull:["$searialsLookup",null]
}
}
}
]).pretty()
Data Set:
{
"_id" : ObjectId("5d61aaf8108e185191552bbb"),
"serials" : [
"e127av48-0697-4977-b096-5ce79c89a414",
"d163f80a-55ff-40fe-90b4-331ece5bebd5",
"4740021f-e9b5-4ca5-bf0e-8554c123bb94",
"320ffd42-f101-4b1d-8ff4-80bc693a29e6",
"fef5e68b-aed0-4a96-9488-7941c41d1c1f",
"2c0752ba-bf7a-4a3b-bd9f-14db4b2f8bae",
"6c5ff44d-5979-4bff-af12-9e6d282c3789",
"9c91bf91-72d7-4b71-827b-924947d6e93d",
"fb34b28e-afb1-4b6a-a3c1-5a1fe44246ee",
"91ab22ef-702f-4cbd-8919-a67a2b9a684c",
"ee1a7cb2-e088-47e6-a824-c8697df7d94c",
"0dc4c687-4db2-481e-a1a6-491320dede11",
"34612148-3e01-44ee-b262-de2035e63691",
"5ba85baf-e48a-40af-8578-55ff1a873c76",
"19fe3672-b6cb-4bb6-8d21-93412b938584",
"1d0d6f6d-1b49-461b-8661-ecbf43a6595e",
"d9a5455c-65ee-45e1-ae49-33cc15dec841",
"4a690a00-a76c-4d3e-aee3-78b2bb731b0c",
"ae331830-40b4-457c-8cc4-5d548f769c3e",
"fe3e460b-c89d-4ace-8a36-5ba2b53bf4d0",
"2cc6a2a0-e029-475f-a7fc-a46a79afb605",
"a7d07767-eada-4ce3-b083-9b048e9ae9f4"
],
"name" : "ApiCard",
"producer" : "Farmina",
"form" : "syrop",
"__v" : 0
}
Output:
{
"serial" : "e127av48-0697-4977-b096-5ce79c89a414",
"doc" : {
"_id" : ObjectId("5d61aaf8108e185191552bbb"),
"name" : "ApiCard",
"producer" : "Farmina",
"form" : "syrop",
"__v" : 0
}
}
{
"serial" : "d163f80a-55ff-40fe-90b4-331ece5bebd5",
"doc" : {
"_id" : ObjectId("5d61aaf8108e185191552bbb"),
"name" : "ApiCard",
"producer" : "Farmina",
"form" : "syrop",
"__v" : 0
}
}
{
"serial" : "4740021f-e9b5-4ca5-bf0e-8554c123bb94",
"doc" : {
"_id" : ObjectId("5d61aaf8108e185191552bbb"),
"name" : "ApiCard",
"producer" : "Farmina",
"form" : "syrop",
"__v" : 0
}
}
{ "serial" : "key that doesn't exist", "doc" : null }
Note: The query returns nothing if the collection is empty, because the first stage needs at least one document to carry the input array.
Aggregation stages details:
STAGE I: Limit the input to 1 document. Its only purpose is to give the pipeline a document into which the input array can be injected, so this stage is cheap.
STAGE II: Project the input array as serialsToSearch.
STAGE III: Now that the input array is a field, unwind it into one document per serial.
STAGE IV: Look up the same collection for each serial and check whether the searched serial is present in the serials array.
STAGE V: Unwind the lookup output, keeping the serials that matched nothing.
STAGE VI: Project the fields required for the response.
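To make the shape of the result easier to see, here is a plain JavaScript model of what the stages compute. It is only a sketch: the in-memory array, the short serial strings and the helper name lookupSerials are all hypothetical stand-ins, and none of this is MongoDB API.

```javascript
// Hypothetical in-memory stand-in for the collection.
const collection = [
  { _id: "doc1", serials: ["s-1", "s-2", "s-3"], name: "ApiCard" },
  { _id: "doc2", serials: ["s-4"], name: "OtherCard" },
];
const serialsToSearch = ["s-1", "s-4", "missing"];

// For every searched serial, emit one row per matching document
// (minus its serials array), or a single row with doc: null.
function lookupSerials(docs, serials) {
  return serials.flatMap((serial) => {
    const matches = docs
      .filter((d) => d.serials.includes(serial))
      .map(({ serials: _omitted, ...rest }) => rest);
    return matches.length > 0
      ? matches.map((doc) => ({ serial, doc }))
      : [{ serial, doc: null }];
  });
}

const rows = lookupSerials(collection, serialsToSearch);
console.log(JSON.stringify(rows, null, 2));
```

As in the pipeline, a serial that appears in several documents produces several rows, and a serial that appears nowhere still produces one row.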

Aggregation on an attribute which could be null [duplicate]

This question already has answers here:
Before $unwind check if sub document is not empty
I am trying to aggregate attributes from two collections. One of them contains a field that may or may not be present in a document. When the attribute is missing, the query doesn't return any document at all. So I need some kind of null check: if the attribute is not there, ignore it, otherwise include it. Below is my query:
db.collection(collectionName).aggregate(
[{
$match: selector
}, {
$lookup: {
from: 'status',
localField: 'candidateId',
foreignField: 'candidateId',
as: 'profile'
}
}, {
$project: {
'_id': 0,
'currentStatus': '$profile.currentStatus',
'lastContacted': '$profile.lastContacted',
'lastWorkingDay': '$profile.lastWorkingDay',
'remarks': '$profile.remarks'
}
},{
$unwind: '$lastWorkingDay'
}
In this case, if lastWorkingDay is not present, the whole query returns nothing. Any pointers would be helpful.

I believe something else is wrong with your query; a projection by itself shouldn't remove any results.
This is a bit hard to analyse without any input data, so I made up my own and tried it on my local box just now. It executes the way you'd expect. Here is my example:
Collection c1:
/* 1 */
{
"_id" : ObjectId("5c780eea79e5bed2bd00f85e"),
"candidateId" : "id1",
"currentStatus" : "a",
"lastContacted" : "b"
}
/* 2 */
{
"_id" : ObjectId("5c780efb79e5bed2bd00f863"),
"candidateId" : "id2",
"currentStatus" : "a",
"lastContacted" : "b",
"lastWorkingDay" : "yesterday"
}
Collection c2:
/* 1 */
{
"_id" : ObjectId("5c780f0a79e5bed2bd00f874"),
"candidateId" : "id1"
}
/* 2 */
{
"_id" : ObjectId("5c780f2879e5bed2bd00f87b"),
"candidateId" : "id2"
}
Aggregation:
db.getCollection('c2').aggregate( [
{$match: {}},
{ $lookup: {
from: "c1",
localField: "candidateId",
foreignField: "candidateId",
as : "profile"
} },
{$project: {
_id: 0,
"currentStatus" : "$profile.currentStatus",
"lastWorkingDay" : "$profile.lastWorkingDay"
} }
] )
Results:
/* 1 */
{
"currentStatus" : [
"a"
],
"lastWorkingDay" : []
}
/* 2 */
{
"currentStatus" : [
"a"
],
"lastWorkingDay" : [
"yesterday"
]
}
As you can see, lastWorkingDay is projected correctly for both documents in my aggregation.
Note that the lookup creates an array for profile, since there could be multiple results for the lookup. You may need to unwind this if you need it in more detail.
I hope this helps.
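One thing the example above leaves out is the final $unwind from the question. By default, $unwind emits nothing for a document whose field is missing, null, or an empty array, which is the likely reason documents disappear. The sketch below models that behaviour in plain JavaScript; unwind is a hypothetical helper written for illustration, not MongoDB API, and it only handles a top-level field.

```javascript
// Simplified model of $unwind on one top-level field (hypothetical helper).
function unwind(docs, field, preserveNullAndEmptyArrays = false) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    const isEmpty =
      value === undefined ||
      value === null ||
      (Array.isArray(value) && value.length === 0);
    if (isEmpty) {
      // Default behaviour: the document is dropped from the output.
      if (!preserveNullAndEmptyArrays) return [];
      // With the option: a null field is kept, an empty array is omitted.
      if (value === null) return [doc];
      const { [field]: _dropped, ...rest } = doc;
      return [rest];
    }
    // A non-array value is treated like a one-element array.
    const items = Array.isArray(value) ? value : [value];
    return items.map((item) => ({ ...doc, [field]: item }));
  });
}

// The two result documents shown above.
const projected = [
  { currentStatus: ["a"], lastWorkingDay: [] },
  { currentStatus: ["a"], lastWorkingDay: ["yesterday"] },
];

const dropped = unwind(projected, "lastWorkingDay");
const kept = unwind(projected, "lastWorkingDay", true);
console.log(dropped.length, kept.length);
```

In the pipeline itself, the corresponding stage would be written as { $unwind: { path: '$lastWorkingDay', preserveNullAndEmptyArrays: true } }, an option available from MongoDB 3.2.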

MongoDB distinct problems

Here is one of my documents, in JSON format:
{
"_id" : ObjectId("5bfdb412a80939b6ed682090"),
"accounts" : [
{
"_id" : ObjectId("5bf106eee639bd0df4bd8e05"),
"accountType" : "DDA",
"productName" : "DDA1"
},
{
"_id" : ObjectId("5bf106eee639bd0df4bd8df8"),
"accountType" : "VSA",
"productName" : "VSA1"
},
{
"_id" : ObjectId("5bf106eee639bd0df4bd8df9"),
"accountType" : "VSA",
"productName" : "VSA2"
}
]
}
I want to write a query that gets all productName values (without duplicates) where accountType = VSA.
I wrote this mongo query:
db.Collection.distinct("accounts.productName", {"accounts.accountType": "VSA" })
I expect: ['VSA1', 'VSA2']
I get: ['DDA1', 'VSA1', 'VSA2']
Does anybody know why the query doesn't work with distinct?

The second parameter of the distinct method represents:
A query that specifies the documents from which to retrieve the distinct values.
The filter selects whole documents, not array elements. Your document has at least one account with accountType "VSA", so the whole document matches the condition "accounts.accountType": "VSA", and distinct then collects productName from every element of its accounts array, including the DDA one.
To fix that, use the Aggregation Framework: $unwind the nested array before you apply the filter, and then use $group with $addToSet to get the unique values. Try:
db.col.aggregate([
{
$unwind: "$accounts"
},
{
$match: {
"accounts.accountType": "VSA"
}
},
{
$group: {
_id: null,
uniqueProductNames: { $addToSet: "$accounts.productName" }
}
}
])
which prints:
{ "_id" : null, "uniqueProductNames" : [ "VSA2", "VSA1" ] }
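The difference between the two approaches can be modelled in plain JavaScript. This is only a sketch of the semantics over an in-memory copy of the sample document (with _id fields left out), not MongoDB API; the variable names viaDistinct and viaUnwind are hypothetical.

```javascript
const docs = [
  {
    accounts: [
      { accountType: "DDA", productName: "DDA1" },
      { accountType: "VSA", productName: "VSA1" },
      { accountType: "VSA", productName: "VSA2" },
    ],
  },
];

// distinct(field, query): the query picks whole documents, then every
// accounts.productName in those documents is collected.
const viaDistinct = [
  ...new Set(
    docs
      .filter((d) => d.accounts.some((a) => a.accountType === "VSA"))
      .flatMap((d) => d.accounts.map((a) => a.productName))
  ),
];

// $unwind + $match + $addToSet: the filter is applied per array element.
const viaUnwind = [
  ...new Set(
    docs
      .flatMap((d) => d.accounts)
      .filter((a) => a.accountType === "VSA")
      .map((a) => a.productName)
  ),
];

console.log(viaDistinct); // [ 'DDA1', 'VSA1', 'VSA2' ]
console.log(viaUnwind); // [ 'VSA1', 'VSA2' ]
```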

I need to count how many children orgs are assigned to a parent org in MongoDB

I'm new to the MongoDB world. I'm trying to figure out how to count the number of children organizations assigned to a parent organization. I have documents that have this general structure:
{
"_id" : "001",
"parentOrganization" : {
"organizationId" : "pOrg1"
},
"childOrganization" : {
"organizationId" : "cOrg1"
}
},
{
"_id" : "002",
"parentOrganization" : {
"organizationId" : "pOrg1"
},
"childOrganization" : {
"organizationId" : "cOrg2"
}
},
{
"_id" : "003",
"parentOrganization" : {
"organizationId" : "pOrg2"
},
"childOrganization" : {
"organizationId" : "cOrg3"
}
}
Each document has a parentOrganization with an associated childOrganization. There may be multiple documents with the same parentOrganization, but different childOrganizations. There may also be multiple documents with the same parent/child relationship. Additionally, there may even be a case where a child org may associate with multiple parent orgs.
I'm trying to group by parentOrganization and then count the number of unique childOrganization's associated with each parentOrganization, as well as display the unique id's.
I have tried using the aggregation framework with $match and $group, but I'm still not reaching the child organization parts to count them. Here is what I'm currently attempting:
var s1 = {$match: {"parentOrganization.organizationId": {$exists: true}}};
var s2 = {$group: {_id: "$parentOrganization.organizationId", count: {$sum: "$childOrganization.organizationId"}}};
db.collection.aggregate(s1, s2);
My results are returning the parentOrganization, but my $sum is not returning the number of associated childOrganizations:
/* 1 */
{
"_id" : "pOrg1",
"count" : 0
}
/* 2 */
{
"_id" : "pOrg2",
"count" : 0
}
I get the feeling it is a bit more complicated than my limited knowledge has access to at this time. What details am I missing in this query?

Your $sum is referencing the childOrganization.organizationId value, which is a string. $sum ignores non-numeric values, so when every value is a string it returns 0.
I was unsure of exactly what you were asking for, but I believe these aggregations can help you on your way.
This returns a count of documents grouped by parentOrganization.organizationId:
db.collection.aggregate({$group: {"_id":"$parentOrganization.organizationId", "count": {"$sum": 1}}})
Output:
{ "_id" : "pOrg2", "count" : 1 }
{ "_id" : "pOrg1", "count" : 2 }
This returns the number of documents for each unique parent/child pair:
db.collection.aggregate(
{$group: {"_id": {"parentOrganization": "$parentOrganization.organizationId", "childOrganization": "$childOrganization.organizationId"}, "count":{$sum:1}}})
Output:
{ "_id" : { "parentOrganization" : "pOrg2", "childOrganization" : "cOrg3" }, "count" : 1 }
{ "_id" : { "parentOrganization" : "pOrg1", "childOrganization" : "cOrg2" }, "count" : 1 }
{ "_id" : { "parentOrganization" : "pOrg1", "childOrganization" : "cOrg1" }, "count" : 1 }
This returns the number of unique child organizations for each parent, together with the set itself, using $addToSet. The first $group builds the set of child organizations for each parent organization; the $project then simply adds the size of that set to the result. One caveat of $addToSet is that MongoDB's 16 MB limit on document size still holds: if a parent has so many children that the set pushes one result document past 16 MB, the command will fail.
db.collection.aggregate([
{$group: {"_id" : "$parentOrganization.organizationId", "childOrgs" : { "$addToSet" : "$childOrganization.organizationId"}}},
{$project: {"_id" : "$_id", "uniqueChildOrgsCount": {"$size" : "$childOrgs"}, "uniqueChildOrgs": "$childOrgs"}}])
Output:
{ "_id" : "pOrg2", "uniqueChildOrgsCount" : 1, "uniqueChildOrgs" : [ "cOrg3" ]}
{ "_id" : "pOrg1", "uniqueChildOrgsCount" : 2, "uniqueChildOrgs" : [ "cOrg2", "cOrg1" ]}
I left the $match stage you included out of these aggregations for simplicity, but you could add it back as well.
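For comparison, the same grouping can be sketched in plain JavaScript with a Map of Sets. This is a model of what the last pipeline computes over an in-memory array, not MongoDB API; the extra repeated pOrg1/cOrg2 document is an addition to the sample data, included only to show that duplicates are counted once.

```javascript
const docs = [
  { parentOrganization: { organizationId: "pOrg1" }, childOrganization: { organizationId: "cOrg1" } },
  { parentOrganization: { organizationId: "pOrg1" }, childOrganization: { organizationId: "cOrg2" } },
  // Repeated parent/child pair (hypothetical extra document).
  { parentOrganization: { organizationId: "pOrg1" }, childOrganization: { organizationId: "cOrg2" } },
  { parentOrganization: { organizationId: "pOrg2" }, childOrganization: { organizationId: "cOrg3" } },
];

// Counterpart of $group with $addToSet: one Set of child ids per parent id.
const childrenByParent = new Map();
for (const doc of docs) {
  const parent = doc.parentOrganization.organizationId;
  if (!childrenByParent.has(parent)) childrenByParent.set(parent, new Set());
  childrenByParent.get(parent).add(doc.childOrganization.organizationId);
}

// Counterpart of the $project with $size.
const summary = [...childrenByParent].map(([parent, children]) => ({
  _id: parent,
  uniqueChildOrgsCount: children.size,
  uniqueChildOrgs: [...children],
}));
console.log(summary);
```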

MongoDB - How to search BSON composite key exactly?

I have a collection that stores information about devices like the following:
/* 1 */
{
"_id" : {
"startDate" : "2012-12-20",
"endDate" : "2012-12-30",
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount"]
},
"data" : {
"results" : "1"
}
}
/* 2 */
{
"_id" : {
"startDate" : "2012-12-20",
"endDate" : "2012-12-30",
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount", "noOfUsers"]
},
"data" : {
"results" : "2"
}
}
/* 3 */
{
"_id" : {
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount", "noOfUsers"]
},
"data" : {
"results" : "3"
}
}
I am trying to query the documents using the _id field, which is unique. The problem appears when I query on the individual attributes, as in:
db.collection.find({$and: [{"_id.dimensions":{ $all: ["manufacturer","model"], $size: 2}}, {"_id.metrics": { $all:["noOfUsers","deviceCount"], $size: 2}}]});
This matches documents 2 and 3 (I don't care about the order of the values within each attribute), but I would like to get only document 3 back. How can I say that _id must not contain any attributes other than the ones I specify in the search query?
Please advise. Thanks.

Unfortunately, I think the closest you can get to matching only on unordered _id.dimensions and unordered _id.metrics requires you to know the other possible fields in the _id subdocument, e.g. startDate and endDate, and to exclude them explicitly:
db.collection.find({$and: [
{"_id.dimensions":{ $all: ["manufacturer","model"], $size: 2}},
{"_id.metrics": { $all:["noOfUsers","deviceCount"], $size: 2}},
{"_id.startDate":{$exists:false}},
{"_id.endDate":{$exists:false}}
]});
If you don't know the set of possible fields in _id, then the other option is to specify the exact _id that you want, e.g.
db.collection.find({"_id" : {
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount", "noOfUsers"]
}})
but this means that the order of the values in _id.dimensions and _id.metrics is significant, as is the order of the fields themselves. This last query matches on the exact BSON representation of _id.
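If neither query fits, one further possibility is to let the $all/$size query return its candidates and do the exact check in the client. The sketch below is an assumption about how that could look, not something MongoDB provides: sameMembers and idMatchesExactly are hypothetical helpers, the candidates array stands in for the query result, and the comparison assumes every _id field is an array of strings.

```javascript
// True when two arrays hold the same values, ignoring order.
function sameMembers(a, b) {
  if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) return false;
  const sortedA = [...a].sort();
  const sortedB = [...b].sort();
  return sortedA.every((value, i) => value === sortedB[i]);
}

// True when id has exactly the keys of wanted (no extras such as startDate)
// and every array value matches regardless of element order.
function idMatchesExactly(id, wanted) {
  const idKeys = Object.keys(id);
  const wantedKeys = Object.keys(wanted);
  return (
    idKeys.length === wantedKeys.length &&
    wantedKeys.every((key) => sameMembers(id[key], wanted[key]))
  );
}

// Stand-in for documents 2 and 3 returned by the $all/$size query.
const candidates = [
  {
    _id: {
      startDate: "2012-12-20",
      endDate: "2012-12-30",
      dimensions: ["manufacturer", "model"],
      metrics: ["deviceCount", "noOfUsers"],
    },
    data: { results: "2" },
  },
  {
    _id: {
      dimensions: ["manufacturer", "model"],
      metrics: ["deviceCount", "noOfUsers"],
    },
    data: { results: "3" },
  },
];

const wanted = {
  dimensions: ["model", "manufacturer"],
  metrics: ["noOfUsers", "deviceCount"],
};
const matches = candidates.filter((doc) => idMatchesExactly(doc._id, wanted));
console.log(matches);
```

This keeps the order-insensitive matching the question asks for, at the cost of transferring the extra candidates to the client.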
