Complex Sort on multiple very large MongoDB Collections - mongodb

I have a MongoDB database with currently about 30 collections, ranging from 1.5 GB to 2.5 GB each, and I need to reformat and sort the data into nested groups and dump them to a new collection. This database will eventually have about 2,000 collections of the same type and formatting of data.
Data is currently available like this:
{
"_id" : ObjectId("598392d6bab47ec75fd6aea6"),
"orderid" : NumberLong("4379116282"),
"regionid" : 10000068,
"systemid" : 30045305,
"stationid" : 60015036,
"typeid" : 7489,
"bid" : 0,
"price" : 119999.91,
"minvolume" : 1,
"volremain" : 6,
"volenter" : 8,
"issued" : "2015-12-31 09:12:29",
"duration" : "14 days, 0:00:00",
"range" : 65535,
"reportedby" : 0,
"reportedtime" : "2016-01-01 00:22:42.997926"} {...} {...}
I need to group these by regionid > typeid > bid like this:
{"regionid": 10000176,
"orders": [
{
"typeid": 34,
"buy": [document, document, document, ...],
"sell": [document, document, document, ...]
},
{
"typeid": 714,
"buy": [document, document, document, ...],
"sell": [document, document, document, ...]
}]
}
Here's a more verbose sample of my ideal output format: https://gist.github.com/BatBrain/cd3426c29ce8ca8152efd1fa06ca1392
I have been trying to use db.collection.aggregate() to do this, running this command as an initial test step:
db.day_2016_01_01.aggregate( [{ $group : { _id : "$regionid", entries : { $push: "$$ROOT" } } },{ $out : "test_group" }], { allowDiskUse:true, cursor:{} })
But I have been getting this message, "errmsg" : "BufBuilder attempted to grow() to 134217728 bytes, past the 64MB limit."
I tried looking into how to use the cursor object, but I'm pretty confused about how to apply it in this situation, or even if that is a viable option. Any advice or solutions would be great.

Related

Mongodb multiple subdocument

I need a collection with structure like this:
{
"_id" : ObjectId("5ffc3e2df14de59d7347564d"),
"name" : "MyName",
"pays" : "de",
"actif" : 1,
"details" : {
"pt" : {
"title" : "MongoTime PT",
"availability_message" : "In stock",
"price" : 23,
"stock" : 1,
"delivery_location" : "Portugal",
"price_shipping" : 0,
"updated_date" : ISODate("2022-03-01T20:07:20.119Z"),
"priority" : false,
"missing" : 1,
},
"fr" : {
"title" : "MongoTime FR",
"availability_message" : "En stock",
"price" : 33,
"stock" : 1,
"delivery_location" : "France",
"price_shipping" : 0,
"updated_date" : ISODate("2022-03-01T20:07:20.119Z"),
"priority" : false,
"missing" : 1,
}
}
}
How can I create an index for each subdocument in 'details'?
Or maybe it's better to use an array?
A query like the one below currently takes very long (about 1 hour). How can I speed it up?
query = {"details.pt.missing": {"$in": [0, 1, 2, 3]}, "pays": 'de'}
db.find(query, {"_id": false, "name": true}, sort=[("details.pt.updated_date", 1)], limit=300)
An array type would be better, as it has several advantages.
(1) You can include a new field which has values like pt, fr, xy, ab, etc. For example:
details: [
{ type: "pt", title : "MongoTime PT", missing: 1, other_fields: ... },
{ type: "fr", title : "MongoTime FR", missing: 1, other_fields: ... },
{ type: "xy", title : "MongoTime XY", missing: 2, other_fields: ... },
// ...
]
Note the introduction of the new field type (this can be any name representing the field data).
(2) You can also index the array's sub-document fields, which can improve query performance. Indexes on array fields are referred to as Multikey Indexes.
The index can be on a field used in a query filter, for example "details.missing". This key can also be part of a Compound Index. This can help with a query filter like the one below:
{ pays: "de", "details.type": "pt", "details.missing": { $in: [ 0, 1, 2, 3 ] } }
NOTE: You can verify that a query uses an index by generating a query plan, applying the explain method on the find.
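For example, a compound multikey index covering that filter might be created along these lines. This is only a sketch: the collection name is generic and the exact key order should be confirmed against your own queries with explain():
// Equality fields first, then the array sub-document fields (the multikey part)
db.collection.createIndex({ "pays": 1, "details.type": 1, "details.missing": 1 })
// Check the query plan to confirm the index is being used
db.collection.find(
{ "pays": "de", "details.type": "pt", "details.missing": { "$in": [0, 1, 2, 3] } }
).explain("executionStats")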
(3) Also, see the Embedded Document Pattern as explained in Model One-to-Many Relationships with Embedded Documents.

count the subdocument field and total amount in mongodb

I have a collection with the documents below:
{
"_id" : ObjectId("54acfb67a81bf9509246ed81"),
"Billno" : 1234,
"details" : [
{
"itemcode" : 12,
"itemname" : "Paste100g",
"qty" : 2,
"price" : 50
},
{
"itemcode" : 14,
"itemname" : "Paste30g",
"qty" : 4,
"price" : 70
},
{
"itemcode" : 12,
"itemname" : "Paste100g",
"qty" : 4,
"price" : 100
}
]
}
{
"_id" : ObjectId("54acff86a81bf9509246ed82"),
"Billno" : 1237,
"details" : [
{
"itemcode" : 12,
"itemname" : "Paste100g",
"qty" : 3,
"price" : 75
},
{
"itemcode" : 19,
"itemname" : "dates100g",
"qty" : 4,
"price" : 170
},
{
"itemcode" : 22,
"itemname" : "dates200g",
"qty" : 2,
"price" : 160
}
]
}
I need to display the output below. Please help.
Required Output:
--------------------------------------------------------
itemcode      itemname       totalprice      totalqty
--------------------------------------------------------
12            Paste100g      225             9
14            Paste30g       70              4
19            dates100g      170             4
22            dates200g      160             2
The MongoDB aggregation pipeline is available to solve your problem. You get the details out of the array by processing with $unwind and then using $group to "sum" the totals:
db.collection.aggregate([
// Unwind the array to de-normalize as documents
{ "$unwind": "$details" },
// Group on the key you want and provide other values
{ "$group": {
"_id": "$details.itemcode",
"itemname": { "$first": "$details.itemname" },
"totalprice": { "$sum": "$details.price" },
"totalqty": { "$sum": "$details.qty" }
}}
])
Ideally you want a $match stage in there to filter out any irrelevant data first. This is basically a MongoDB query and takes all the same arguments and operators.
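For instance, a sketch that restricts the pipeline to a single bill before unwinding might look like this (the Billno value is purely illustrative):
db.collection.aggregate([
// Filter first so the later stages process fewer documents
{ "$match": { "Billno": 1234 } },
{ "$unwind": "$details" },
{ "$group": {
"_id": "$details.itemcode",
"itemname": { "$first": "$details.itemname" },
"totalprice": { "$sum": "$details.price" },
"totalqty": { "$sum": "$details.qty" }
}}
])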
Most of this is really simple. The $unwind is sort of like a "JOIN" in SQL, except that in an embedded structure the "join" is already made, so you are just "de-normalizing", much as a join would between "one to many" table relationships, but within the document itself. It basically "repeats" the "parent" document parts for each array member as a new document.
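For instance, unwinding the first sample bill should produce one document per array member, with the parent fields repeated, roughly like this:
{ "_id" : ObjectId("54acfb67a81bf9509246ed81"), "Billno" : 1234, "details" : { "itemcode" : 12, "itemname" : "Paste100g", "qty" : 2, "price" : 50 } }
{ "_id" : ObjectId("54acfb67a81bf9509246ed81"), "Billno" : 1234, "details" : { "itemcode" : 14, "itemname" : "Paste30g", "qty" : 4, "price" : 70 } }
{ "_id" : ObjectId("54acfb67a81bf9509246ed81"), "Billno" : 1234, "details" : { "itemcode" : 12, "itemname" : "Paste100g", "qty" : 4, "price" : 100 } }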
Then the $group works off a key, as in "GROUP BY", where the "key" is the _id value. Everything there is "distinct", and all other values are gathered by "grouping operators".
This is where operations like $first come in. As described on the manual page, this takes the "first" value from the "grouping boundary" mentioned in the "key" earlier. You want this because all values of this field are "likely" to be the same, so this is a logical choice to just pick the "first" match.
Finally there is the $sum grouping operator which does what should be expected. All supplied values under the "key" are "added" or "summed" together to provide a total. Just like SQL SUM().
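Run against the two sample documents, the pipeline should therefore return results along these lines (document order is not guaranteed without a $sort stage):
{ "_id" : 12, "itemname" : "Paste100g", "totalprice" : 225, "totalqty" : 9 }
{ "_id" : 14, "itemname" : "Paste30g", "totalprice" : 70, "totalqty" : 4 }
{ "_id" : 19, "itemname" : "dates100g", "totalprice" : 170, "totalqty" : 4 }
{ "_id" : 22, "itemname" : "dates200g", "totalprice" : 160, "totalqty" : 2 }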
Also note that the $-prefixed names are how the aggregation framework refers to the values of "field/property" names within the current document being processed. "Dot notation" is used to reference the embedded "fields/properties" nested within a parent property name.
It is useful to learn aggregation in MongoDB. It is to general queries what anything beyond a basic "SELECT" statement is to SQL. Not just for "grouping" but for other manipulation as well.
Read through the documentation of all the aggregation operators, and also take a look at SQL to Aggregation Mapping in the documentation as a general guide if you have some familiarity with SQL to begin with. It helps explain concepts and shows some things that can be done.

Update an array element with inc mongo update

Hi all, I have this data in Mongo:
{"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 1
}
],
"count" : NumberLong(1),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
I want to update it using this new data
{"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 3
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(6),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
What I need in the output is:
{"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 4
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(7),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
Could you please suggest how I can achieve this using an update operation?
My update query will have the tags field as the query parameter.
You'll never get this in a single query operation, as presently there is no way for MongoDB updates to refer to the existing values of fields. The exception of course is operators such as $inc, but this case has a bit more going on than $inc alone can handle.
You need multiple updates, but there is a consistent model to follow and the Bulk Operations API can at least help with sending all of those updates in a single request:
var updoc = {
"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 3
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(6),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
};
var bulk = db.collection.initializeOrderedBulkOp();
// Inspect the document variable for update
// For each array entry
updoc.articleId.forEach(function(doc) {
// First try to match the document and array entry to update
bulk.find({
"tags": updoc.tags,
"articleId.articleId": doc.articleId
}).update({
"$inc": { "articleId.$.articleCount": doc.articleCount }
});
// Then try to "push" the array entry where it does not exist
bulk.find({
"tags": updoc.tags,
"articleId.articleId": { "$ne": doc.articleId }
}).update({
"$push": { "articleId": doc }
});
})
// Finally increment the overall count
bulk.find({ "tags": updoc.tags }).update({
"$inc": { "count": updoc.count }
});
bulk.execute();
Now that is not "truly" atomic and there is a very small chance that the modified document could be read without all of the modifications in place. And the Bulk API sends these over to the server to process all at once, then that is a lot better than individual operations between the client and server where the chance of the document being read in a non-consistent state would be much higher.
So for each array member in the document to "merge", you both try to $inc where the member is matched by the query and $push a new member where it was not. Finally you just $inc again for the total count, merging it with the existing document's count.
For this sample that is a total of 5 update operations, all sent in one package. Note though that the response will confirm that only 3 operations were applied here, as 2 of the operations would not actually match a document due to the conditions specified:
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 0,
"nUpserted" : 0,
"nMatched" : 3,
"nModified" : 3,
"nRemoved" : 0,
"upserted" : [ ]
})
So that is one way to handle it. Another may be to just submit each document individually and then periodically "merge" the data into grouped documents using the aggregation framework. It depends on how "real time" you want to do this. The above is as close to "real time" updates as you can generally get.
Delayed Processing
As mentioned, there is another approach where you can consider "delayed" processing of this "merging", for when you do not need the data to be updated in real time. The approach uses the aggregation framework to perform the "merge"; you could even use the aggregation as the general query for the data, but you probably want to accumulate into a collection instead.
The basic premise of the aggregation is that you store each "change" document as a separate document in the collection, rather than merge in real time. So two documents in the collection would be represented like this:
{
"_id" : ObjectId("548fe1c78ad2c25d4c952eee"),
"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 1
}
],
"count" : NumberLong(1),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
},
{
"_id" : ObjectId("548fe2286032bac607405eb3"),
"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 3
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(6),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
In order to "merge" these results for a given "tags" value, you want an aggregation pipeline like this:
db.collection.aggregate([
// Unwinds the array members to de-normalize
{ "$unwind": "$articleId" },
// Group the elements by "tags" value and "articleId"
{ "$group": {
"_id": {
"tags": "$tags",
"articleId": "$articleId.articleId",
},
"articleCount": { "$sum": "$articleId.articleCount" },
"timeStamp": { "$max": "$timeStamp" },
"interval": { "$max": "$interval" },
}},
// Now group again creating the array of "merged" items
{ "$group": {
"_id": "$tags",
"articleId": {
"$push": {
"articleId": "$_id.articleId",
"articleCount": "$articleCount"
}
},
"count": { "$sum": "$articleCount" },
"timeStamp": { "$max": "$timeStamp" },
"interval": { "$max": "$interval" },
}}
])
So using "tags" and "articleId" ( the inner value ) you group the results together, taking the $sum of the "articleCount" fields where both of those fields are the same and the $max value for the rest of the fields, which makes sense.
In a second $group pass you then just break the result documents down to "tags", pushing each matching "articleId" value under that into an array. To avoid any duplication the document "count" is summed at this stage and the other values are just taken from the same groupings.
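For the two change documents shown above, that pipeline should produce a merged result roughly like this (the grouping key becomes _id, and the order of the articleId entries is not guaranteed):
{
"_id" : "famous",
"articleId" : [
{ "articleId" : "9514666", "articleCount" : 4 },
{ "articleId" : "9514667", "articleCount" : 3 }
],
"count" : 7,
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1
}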
The result is the same "merged" document, which you could either use the above aggregation query to simply return your results from such a collection, or use those results to either just create a new collection for the merged documents ( see the $out operator for one option ) or use a similar process to the first example to "merge" these "merged" results with an existing "merged" collection.
Accumulating data like this is generally a wide topic, even though it is a common use case for many. There is a reference project maintained by MongoDB solutions architecture called HVDF, or High Volume Data Feed. It is aimed at providing a framework, or at least a reference example, for handling volume feeds (for which change document accumulation is one case) and aggregating them in a time-series manner for analysis.
The actual approach depends on the overall needs of your application. Concepts such as these are employed internally by a framework like HVDF; it's just a matter of how much complexity you need and which approach suits your application best for how you need to access the data.

Mongodb Update/Upsert array exact match

I have a collection :
gStats : {
"_id" : "id1",
"criteria" : ["key1":"value1", "key2":"value2"],
"groups" : [
{"id":"XXXX", "visited":100, "liked":200},
{"id":"YYYY", "visited":30, "liked":400}
]
}
I want to be able to update a document in the stats array for a given array of criteria (exact match).
I try to do this in 2 steps:
Pull the stat document with a given "id" from the array:
db.gStats.update({
"criteria" : {$size : 2},
"criteria" : {$all : [{"key1" : "2096955"},{"value1" : "2015610"}]}
},
{
$pull : {groups : {"id" : "XXXX"}}
}
)
Push the new document
db.gStats.findAndModify({
query : {
"criteria" : {$size : 2},
"criteria" : {$all : [{"key1" : "2015610"}, {"key2" : "2096955"}]}
},
update : {
$push : {groups : {"id" : "XXXX", "visited" : 29, "liked" : 144}}
},
upsert : true
})
The Pull query works perfectly.
The Push query gives an error:
2014-12-13T15:12:58.571+0100 findAndModifyFailed failed: {
"value" : null,
"errmsg" : "exception: Cannot create base during insert of update. Cause
d by :ConflictingUpdateOperators Cannot update 'criteria' and 'criteria' at the
same time",
"code" : 12,
"ok" : 0
} at src/mongo/shell/collection.js:614
Neither query is working in reality. You cannot use a key name like "criteria" more than once unless under an operator such as $and. You are also specifying different fields (i.e. groups) and querying elements that do not exist in your sample document.
So it is hard to tell what you really want to do here. But the error is essentially caused by the first issue I mentioned, with a little something extra. So really your { "$size": 2 } condition is being ignored and only the second condition is applied.
A valid query form should look like this:
query: {
"$and": [
{ "criteria" : { "$size" : 2 } },
{ "criteria" : { "$all": [{ "key1": "2015610" }, { "key2": "2096955" }] } }
]
}
As each set of conditions is specified within the array provided by $and, the document structure of the query is valid and one hash key name does not overwrite the other. That's the proper way to write your two conditions, but there is a trick to making this work where the "upsert" fails because those conditions do not match a document. We need to override what happens when it tries to apply the $all arguments on creation:
update: {
"$setOnInsert": {
"criteria" : [{ "key1": "2015610" }, { "key2": "2096955" }]
},
"$push": { "stats": { "id": "XXXX", "visited": 29, "liked": 144 } }
}
That uses $setOnInsert so that when the "upsert" is applied and a new document is created, the values specified here are used instead of the field values set in the query portion of the statement.
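Putting those pieces together, a minimal sketch of the complete findAndModify call would look like this (it simply combines the query and update fragments above with the original upsert option):
db.gStats.findAndModify({
query : {
"$and": [
{ "criteria" : { "$size" : 2 } },
{ "criteria" : { "$all": [{ "key1": "2015610" }, { "key2": "2096955" }] } }
]
},
update : {
// Only applied when the upsert inserts a new document
"$setOnInsert": {
"criteria" : [{ "key1": "2015610" }, { "key2": "2096955" }]
},
"$push": { "stats": { "id": "XXXX", "visited": 29, "liked": 144 } }
},
upsert : true
})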
Of course, if what you are really looking for is truly an exact match of the content in the array, then just use that for the query instead:
query: {
"criteria" : [{ "key1": "2015610" }, { "key2": "2096955" }]
}
Then MongoDB will be happy to apply those values when a new document is created and does not get confused on how to interpret the $all expression.

Upsert with pymongo and a custom _id field

I'm attempting to store pre-aggregated performance metrics in a sharded mongodb according to this document.
I'm trying to update the minute sub-documents in a record that may or may not exist with an upsert like so (self.collection is a pymongo collection instance):
self.collection.update(query, data, upsert=True)
query:
{ '_id': u'12345CHA-2RU020130304',
'metadata': { 'adaptor_id': 'CHA-2RU',
'array_serial': 12345,
'date': datetime.datetime(2013, 3, 4, 0, 0, tzinfo=<UTC>),
'processor_id': 0}
}
data:
{ 'minute': { '16': { '45': 1.6693091}}}
The problem is that in this case the 'minute' subdocument only ever contains the most recent { hour: { minute: metric } } entry; it does not gain new entries for other hours, it always overwrites the single existing entry.
I've also tried this with a $set style data entry:
{ '$set': { 'minute': { '16': { '45': 1.6693091}}}}
but it ends up being the same.
What am I doing wrong?
In both of the examples listed you are simply setting a field ('minute') to a particular value; the only reason it is an addition the first time you update is that the field itself does not exist and so must be created.
It's hard to determine exactly what you are shooting for here, but I think what you could do is alter your schema a little so that 'minute' is an array. Then you could use $push to add values regardless of whether they are already present or $addToSet if you don't want duplicates.
I had to alter your document a little to make it valid in the shell, so my _id (and some other fields) are slightly different to yours, but it should still be close enough to be illustrative:
db.foo.find({'_id': 'u12345CHA-2RU020130304'}).pretty()
{
"_id" : "u12345CHA-2RU020130304",
"metadata" : {
"adaptor_id" : "CHA-2RU",
"array_serial" : 12345,
"date" : ISODate("2013-03-18T23:28:50.660Z"),
"processor_id" : 0
}
}
Now let's add a minute field with an array of documents instead of a single document:
db.foo.update({'_id': 'u12345CHA-2RU020130304'}, { $addToSet : {'minute': { '16': {'45': 1.6693091}}}})
db.foo.find({'_id': 'u12345CHA-2RU020130304'}).pretty()
{
"_id" : "u12345CHA-2RU020130304",
"metadata" : {
"adaptor_id" : "CHA-2RU",
"array_serial" : 12345,
"date" : ISODate("2013-03-18T23:28:50.660Z"),
"processor_id" : 0
},
"minute" : [
{
"16" : {
"45" : 1.6693091
}
}
]
}
Then, to illustrate the addition, add a slightly different entry (since I am using $addToSet this is required for a new entry to be added):
db.foo.update({'_id': 'u12345CHA-2RU020130304'}, { $addToSet : {'minute': { '17': {'48': 1.6693391}}}})
db.foo.find({'_id': 'u12345CHA-2RU020130304'}).pretty()
{
"_id" : "u12345CHA-2RU020130304",
"metadata" : {
"adaptor_id" : "CHA-2RU",
"array_serial" : 12345,
"date" : ISODate("2013-03-18T23:28:50.660Z"),
"processor_id" : 0
},
"minute" : [
{
"16" : {
"45" : 1.6693091
}
},
{
"17" : {
"48" : 1.6693391
}
}
]
}
I ended up setting the fields like this:
query:
{ '_id': u'12345CHA-2RU020130304',
'metadata': { 'adaptor_id': 'CHA-2RU',
'array_serial': 12345,
'date': datetime.datetime(2013, 3, 4, 0, 0, tzinfo=<UTC>),
'processor_id': 0}
}
I'm setting the metrics like this:
data = {"$set": {}}
for metric in csv:
date_utc = metric['date'].astimezone(pytz.utc)
data["$set"]["minute.%d.%d" % (date_utc.hour,
date_utc.minute)] = float(metric['metric'])
which creates data like this:
{"$set": {'minute.16.45': 1.6693091,
'minute.16.46': 1.566343,
'minute.16.47': 1.22322}}
So that when self.collection.update(query, data, upsert=True) is run it updates those fields.
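For reference, a rough mongo shell equivalent of that upsert would be something like the following sketch (the collection name foo follows the earlier shell examples, and the metadata fields are omitted for brevity). Because dot notation is used, each hour.minute key is merged into the embedded 'minute' document on each update rather than replacing it:
db.foo.update(
{ "_id": "12345CHA-2RU020130304" },
{ "$set": {
"minute.16.45": 1.6693091,
"minute.16.46": 1.566343,
"minute.16.47": 1.22322
}},
{ "upsert": true }
)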