Here is the dataset:
// Data 1
{ name : 111,
factors : [
{name:"f1", value:"dog", unit : "kg"},
{name:"f2", value:"0"}
]
},// data2
{ name : 112,
factors :
[
{name:"f1", value:"cat", unit : "g"},
{name:"f2", value:"13"}
]
}
// 100,000 more data ...
I would like to convert the value of factor f2 to a number.
db.getCollection('cases').find({
factors : {
$elemMatch : {
name : "f2",
value : {$type : 2}
}
}
}).forEach(function(doc, i){
doc.factors.forEach(function(factor){
if(factor.name == "f2"){
factor.value = !isNaN(factor.value) ? parseInt(factor.value) : factor.value;
}
});
db.cases.save(doc);
});
However, it only updates about 75~77 documents per execution. I am not sure why; my guess is that save() is asynchronous, so we cannot issue too many save() calls at the same time.
What should I do?
The concept here is to loop through your collection with a cursor and, for each document in the cursor, gather the index positions of the factors array elements you want to change.
You then use that index data later in the loop as the update operation parameters, so the update targets exactly the desired field.
Provided your collection is not too large, this can be implemented with the cursor's forEach() method, as you have already done in your attempt, iterating the array and collecting the index data.
The following demonstrates this approach for small datasets:
db.cases.find({"factors.value": { "$exists": true, "$type": 2 }}).forEach(function(doc){
var factors = doc.factors,
updateOperatorDocument = {};
for (var idx = 0; idx < factors.length; idx++){
var val;
if(factors[idx].name == "f2"){
val = !isNaN(factors[idx].value) ? parseInt(factors[idx].value) : factors[idx].value;
updateOperatorDocument["factors."+ idx +".value"] = val;
}
};
db.cases.updateOne(
{ "_id": doc._id },
{ "$set": updateOperatorDocument }
);
});
For better performance, especially with large collections, take advantage of the Bulk() API to update the collection in bulk.
This is far more efficient than the operations above because the Bulk API sends the operations to the server in batches (say, 1000 at a time), so instead of one round trip per update you make one round trip per 1000 updates, which makes the whole process considerably faster.
The following example demonstrates the Bulk() API, available in MongoDB versions >= 2.6 and < 3.2.
var bulkUpdateOps = db.cases.initializeUnorderedBulkOp(),
counter = 0;
db.cases.find({"factors.value": { "$exists": true, "$type": 2 }}).forEach(function(doc){
var factors = doc.factors,
updateOperatorDocument = {};
for (var idx = 0; idx < factors.length; idx++){
var val;
if(factors[idx].name == "f2"){
val = !isNaN(factors[idx].value) ? parseInt(factors[idx].value) : factors[idx].value;
updateOperatorDocument["factors."+ idx +".value"] = val;
}
};
bulkUpdateOps.find({ "_id": doc._id }).update({ "$set": updateOperatorDocument })
counter++; // increment counter for batch limit
if (counter % 1000 == 0) {
// execute the bulk update operation in batches of 1000
bulkUpdateOps.execute();
// Re-initialize the bulk update operations object
bulkUpdateOps = db.cases.initializeUnorderedBulkOp();
}
})
// Clean up remaining operation in the queue
if (counter % 1000 != 0) { bulkUpdateOps.execute(); }
The next example applies to MongoDB 3.2, which deprecated the Bulk() API and introduced a newer set of APIs based on bulkWrite().
It uses the same cursor as above, but builds an array of bulk operations with the same forEach() cursor method, pushing one bulk write document per document in the cursor. Because write commands can accept no more than 1000 operations, the operations are grouped into batches of at most 1000 and the array is re-initialised when the loop reaches the 1000th iteration:
var cursor = db.cases.find({"factors.value": { "$exists": true, "$type": 2 }}),
bulkUpdateOps = [];
cursor.forEach(function(doc){
var factors = doc.factors,
updateOperatorDocument = {};
for (var idx = 0; idx < factors.length; idx++){
var val;
if(factors[idx].name == "f2"){
val = !isNaN(factors[idx].value) ? parseInt(factors[idx].value) : factors[idx].value;
updateOperatorDocument["factors."+ idx +".value"] = val;
}
};
bulkUpdateOps.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": { "$set": updateOperatorDocument }
}
});
if (bulkUpdateOps.length == 1000) {
db.cases.bulkWrite(bulkUpdateOps);
bulkUpdateOps = [];
}
});
if (bulkUpdateOps.length > 0) { db.cases.bulkWrite(bulkUpdateOps); }
Write result for the sample data:
{
"acknowledged" : true,
"deletedCount" : 0,
"insertedCount" : 0,
"matchedCount" : 2,
"upsertedCount" : 0,
"insertedIds" : {},
"upsertedIds" : {}
}
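As a quick sanity check (not part of the update itself), re-running the question's more specific filter afterwards should return 0, assuming every f2 value was a numeric string; non-numeric f2 values are intentionally left untouched by the scripts above:
// count documents that still have a string-typed f2 value
db.cases.find({
    "factors": {
        "$elemMatch": { "name": "f2", "value": { "$type": 2 } }
    }
}).count()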
Related
I've got a collection consisting of millions of documents that resemble the following:
{
_id: ObjectId('...'),
value: "0.53"
combo: [
{
h: 0,
v: "0.42"
},
{
h: 1,
v: "1.32"
}
]
}
The problem is that the values are stored as strings and I need to convert them to float/double.
I'm trying this and it's working but this'll take days to complete, given the volume of data:
db.collection.find({}).forEach(function(obj) {
if (typeof(obj.value) === "string") {
obj.value = parseFloat(obj.value);
db.collection.save(obj);
}
obj.combo.forEach(function(hv){
if (typeof(hv.value) === "string") {
hv.value = parseFloat(hv.value);
db.collection.save(obj);
}
});
});
I came across bulk update reading the Mongo docs and I'm trying this:
var bulk = db.collection.initializeUnorderedBulkOp();
bulk.find({}).update(
{
$set: {
"value": parseFloat("value"),
}
});
bulk.execute();
This runs... but I get NaN as the value, because it thinks I'm trying to convert the literal string "value" to a float. I've tried different variations like this.value and "$value", but to no avail. Plus, this approach only attempts to correct the value at the document level, not the values inside the array.
I'd appreciate any help. Thanks in advance!
Figured it out the following way:
1) To convert at the document level, I came across this post and the reply by Markus paved the way to my solution:
var bulk = db.collection.initializeUnorderedBulkOp()
var myDocs = db.collection.find()
var ops = 0
myDocs.forEach(
function(myDoc) {
bulk.find({ _id: myDoc._id }).updateOne(
{
$set : {
"value": parseFloat(myDoc.value),
}
}
);
if ((++ops % 1000) === 0){
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
}
}
)
bulk.execute();
2) The second part involved updating the array object values, and I discovered the syntax to do so in the accepted answer on this post. In my case, I knew that there were 24 values in the array, so I ran this separately from the first query, and it looked like this:
var bulk = db.collection.initializeUnorderedBulkOp()
var myDocs = db.collection.find()
var ops = 0
myDocs.forEach(
function(myDoc) {
bulk.find({ _id: myDoc._id }).update(
{
$set : {
"combo.0.v": parseFloat(myDoc.combo[0].v),
"combo.1.v": parseFloat(myDoc.combo[1].v),
"combo.2.v": parseFloat(myDoc.combo[2].v),
"combo.3.v": parseFloat(myDoc.combo[3].v),
"combo.4.v": parseFloat(myDoc.combo[4].v),
"combo.5.v": parseFloat(myDoc.combo[5].v),
"combo.6.v": parseFloat(myDoc.combo[6].v),
"combo.7.v": parseFloat(myDoc.combo[7].v),
"combo.8.v": parseFloat(myDoc.combo[8].v),
"combo.9.v": parseFloat(myDoc.combo[9].v),
"combo.10.v": parseFloat(myDoc.combo[10].v),
"combo.11.v": parseFloat(myDoc.combo[11].v),
"combo.12.v": parseFloat(myDoc.combo[12].v),
"combo.13.v": parseFloat(myDoc.combo[13].v),
"combo.14.v": parseFloat(myDoc.combo[14].v),
"combo.15.v": parseFloat(myDoc.combo[15].v),
"combo.16.v": parseFloat(myDoc.combo[16].v),
"combo.17.v": parseFloat(myDoc.combo[17].v),
"combo.18.v": parseFloat(myDoc.combo[18].v),
"combo.19.v": parseFloat(myDoc.combo[19].v),
"combo.20.v": parseFloat(myDoc.combo[20].v),
"combo.21.v": parseFloat(myDoc.combo[21].v),
"combo.22.v": parseFloat(myDoc.combo[22].v),
"combo.23.v": parseFloat(myDoc.combo[23].v)
}
}
);
if ((++ops % 1000) === 0){
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
}
}
)
bulk.execute();
Just to give an idea regarding performance: the forEach was going through around 900 documents a minute, which for 15 million records would have taken days, literally! Not only that, but it was only converting the types at the document level, not the array level. For that, I would have had to loop through each document and loop through each array (15 million x 24 iterations)! With this approach (running both queries side by side), both completed in under 6 hours.
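If the combo array length varies between documents, roughly the same approach could build the $set document dynamically instead of hard-coding the 24 positions; a minimal sketch, using the same collection and field names as above:
var bulk = db.collection.initializeUnorderedBulkOp()
var ops = 0
db.collection.find().forEach(
    function(myDoc) {
        var setDoc = {};
        // one "combo.<index>.v" entry per array element, whatever the length
        myDoc.combo.forEach(function(item, idx) {
            setDoc["combo." + idx + ".v"] = parseFloat(item.v);
        });
        bulk.find({ _id: myDoc._id }).updateOne({ $set: setDoc });
        if ((++ops % 1000) === 0){
            bulk.execute();
            bulk = db.collection.initializeUnorderedBulkOp();
        }
    }
)
// flush any remaining queued operations
if (ops % 1000 !== 0) { bulk.execute(); }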
I hope this helps someone else.
I have a collection that looks like this:
{'flags': {'flag_1': True, 'flag_2': False, 'flag_3': True},
'other_data': {....}}
In a single operation, I want to add a list of flags to the existing flags. If a flag already exists, I want to leave its value as is; otherwise it should be set to False.
For example, after adding ['flag_3', 'flag_4'], the document should look like this:
{'flags': {'flag_1': True, 'flag_2': False, 'flag_3': True, 'flag_4': False},
'other_data': {....}}
Thanks
You can use the Bulk API to streamline your updates, with some logic to work out which flags need to be added. Something like this:
var bulk = db.collection.initializeOrderedBulkOp(),
counter = 0,
flagList = ['flag_3', 'flag_4'];
db.collection.find().forEach(function(doc){
var existingFlags = Object.keys(doc.flags), // get the existing flags in the document
newFlags = flagList.filter(function(n) { // use filter to return an array of flags which do not exist
return existingFlags.indexOf(n) < 0;
}),
update = newFlags.reduce(function(obj, k) { // set the update object
obj["flags."+ k] = false;
return obj;
}, { });
bulk.find({ "_id": doc._id }).updateOne({
"$set": update
});
counter++;
if (counter % 1000 == 0) {
// Execute per 1000 operations and re-initialize every 1000 update statements
bulk.execute();
bulk = db.collection.initializeOrderedBulkOp();
}
})
// Clean up queues
if (counter % 1000 != 0){
bulk.execute();
}
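On MongoDB 3.2 or newer, roughly the same logic could be expressed with bulkWrite() instead of the Bulk API; a sketch under the same assumptions about the document shape (note the guard against an empty $set for documents that already have every flag):
var flagList = ['flag_3', 'flag_4'],
    bulkUpdateOps = [];
db.collection.find().forEach(function(doc){
    var existingFlags = Object.keys(doc.flags),
        update = flagList.filter(function(n) {
            return existingFlags.indexOf(n) < 0;
        }).reduce(function(obj, k) {
            obj["flags."+ k] = false;
            return obj;
        }, { });
    // skip documents that already have every flag ($set may not be empty)
    if (Object.keys(update).length === 0) { return; }
    bulkUpdateOps.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": update }
        }
    });
    if (bulkUpdateOps.length == 1000) {
        db.collection.bulkWrite(bulkUpdateOps);
        bulkUpdateOps = [];
    }
})
if (bulkUpdateOps.length > 0) { db.collection.bulkWrite(bulkUpdateOps); }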
I have some tweets downloaded to my MongoDB.
A tweet document looks something like this:
{
"_id" : NumberLong("542499449474273280"),
"retweeted" : false,
"in_reply_to_status_id_str" : null,
"created_at" : ISODate("2014-12-10T02:02:02Z"),
"hashtags" : [
"Canucks",
"allhabs",
"GoHabsGo"
]
...
}
I want to construct a query/aggregation/map-reduce that will give me the count of tweets that share the same two hashtags. For every pair of distinct hashtags it should give me the count of tweets, e.g.:
{'count': 12, 'pair': ['malaria', 'Ebola']}
{'count': 1, 'pair': ['Nintendo', '8bit']}
{'count': 1, 'pair': ['guinea', 'Ebola']}
{'count': 1, 'pair': ['fitness', 'HungerGames']}
...
I've made a python script to do this:
hashtags = set()
tweets = db.tweets.find({}, {'hashtags': 1})
# gather all hashtags from every tweet
for t in tweets:
    hashtags.update(t['hashtags'])
hashtags = list(hashtags)
hashtag_count = []
for i, h1 in enumerate(hashtags):
    for j, h2 in enumerate(hashtags):
        if i > j:
            count = db.tweets.find({'hashtags': {'$all': [h1, h2]}}).count()
            if count > 0:
                pair = {'pair': [h1, h2], 'count': count}
                print(pair)
                db.hashtags_pairs.insert(pair)
But I want to do this with just a query, or with JS functions via map-reduce.
Any ideas?
There's no aggregation pipeline or query that can compute this from your given document structure, so you'll have to use map/reduce if you don't want to drastically change the collection structure or build a secondary collection. The map/reduce, however, is straightforward: in the map phase, emit a key/value pair of (hashtag pair, 1) for each pair of hashtags in the document, then sum the values for each key in the reduce phase.
var map = function() {
var tags = this.hashtags;
var k = tags.length;
for (var i = 0; i < k; i++) {
for (var j = 0; j < i; j++) {
if (tags[i] != tags[j]) {
var ts = [tags[i], tags[j]].sort();
emit({ "t0" : ts[0], "t1" : ts[1] }, 1)
}
}
}
}
var reduce = function(key, values) { return Array.sum(values) }
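You would then run it with mapReduce, writing the results to a collection; the hashtags_pairs name below is just taken from the question's script:
// run the map/reduce over the tweets collection
db.tweets.mapReduce(map, reduce, { out: "hashtags_pairs" })
// each result document has the form, e.g.
// { "_id" : { "t0" : "Ebola", "t1" : "malaria" }, "value" : 12 }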
I wrote a mapreduce function where the records are emitted in the following format
{userid:<xyz>, {event:adduser, count:1}}
{userid:<xyz>, {event:login, count:1}}
{userid:<xyz>, {event:login, count:1}}
{userid:<abc>, {event:adduser, count:1}}
where userid is the key and the rest is the value for that key.
After the MapReduce, I want to get the result in the following format:
{userid:<xyz>,{events: [{adduser:1},{login:2}], allEventCount:3}}
To achieve this I wrote the reduce function below. (I know this can be done with a group-by in both the aggregation framework and map-reduce, but we need similar functionality for a more complex scenario, so I am taking this approach.)
var reducefn = function(key,values){
var result = {allEventCount:0, events:[]};
values.forEach(function(value){
var notfound=true;
for(var n = 0; n < result.events.length; n++){
eventObj = result.events[n];
for(ev in eventObj){
if(ev==value.event){
result.events[n][ev] += value.allEventCount;
notfound=false;
break;
}
}
}
if(notfound==true){
var newEvent={}
newEvent[value.event]=1;
result.events.push(newEvent);
}
result.allEventCount += value.allEventCount;
});
return result;
}
This runs perfectly when I run it over 1000 records, but with 3k or 10k records the result I get is something like this:
{ "_id" : {...}, "value" :{"allEventCount" :30, "events" :[ { "undefined" : 1},
{"adduser" : 1 }, {"remove" : 3 }, {"training" : 1 }, {"adminlogin" : 1 },
{"downgrade" : 2 } ]} }
I am not able to understand where this undefined came from, and the sum of the individual events is less than allEventCount. All the docs in the collection have a non-empty event field, so there is no chance of undefined.
MongoDB version: 2.2.1
Environment: local machine, no sharding.
In the reduce function, why does the operation result.events[n][ev] += value.allEventCount; fail when the similar operation result.allEventCount += value.allEventCount; works?
The corrected code, as suggested by johnyHK.
Reduce function:
var reducefn = function(key,values){
var result = {totEvents:0, event:[]};
values.forEach(function(value){
value.event.forEach(function(eventElem){
var notfound=true;
for(var n = 0; n < result.event.length; n++){
eventObj = result.event[n];
for(ev in eventObj){
for(evv in eventElem){
if(ev==evv){
result.event[n][ev] += eventElem[evv];
notfound=false;
break;
}
}}
}
if(notfound==true){
result.event.push(eventElem);
}
});
result.totEvents += value.totEvents;
});
return result;
}
The shape of the object you emit from your map function must be the same as the object returned from your reduce function, as the results of a reduce can get fed back into reduce when processing large numbers of docs (like in this case).
So you need to change your emit to emit docs like this:
{userid:<xyz>, {events:[{adduser: 1}], allEventCount:1}}
{userid:<xyz>, {events:[{login: 1}], allEventCount:1}}
and then update your reduce function accordingly.
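For illustration only, a map function emitting that shape might look like the sketch below (assuming each document carries userid and event fields, as in the question; the field names must match whatever your reduce function uses):
var mapfn = function() {
    // build an events array with a single {<eventName>: 1} entry so the
    // emitted value has the same shape as the value returned by reduce
    var eventEntry = {};
    eventEntry[this.event] = 1;
    emit(this.userid, { events: [eventEntry], allEventCount: 1 });
};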
I'm designing a system that should be able to process millions of documents and report on them in different ways.
A MongoDB map/reduce task is what I'm trying to implement (currently doing some investigation on it).
The very basic document structure is
db.test.insert(
{
"_id" : ObjectId("4f6063601caf46303c36eb27"),
"verbId" : NumberLong(1506281),
"sentences" : [
{
"sId" : NumberLong(2446630),
"sentiment" : 2,
"categories" : [
NumberLong(3257),
NumberLong(3221),
NumberLong(3291)
]
},
{
"sId" : NumberLong(2446631),
"sentiment" : 0,
"categories" : [
NumberLong(2785),
NumberLong(2762),
NumberLong(2928),
NumberLong(2952)
]
},
{
"sId" : NumberLong(2446632),
"sentiment" : 0,
"categories" : [
NumberLong(-2393)
]
},
{
"sId" : NumberLong(2446633),
"sentiment" : 0,
"categories" : [
NumberLong(-2393)
]
}
]
})
So each document contains sentences, which can belong to different categories.
The report I'm trying to get is the number of sentences per category (with the percentage of verbatims).
I'm running the following map-reduce job with a finalize method to compute the different averages.
var map = function() {
var docCategories = new Array();
var catValues = new Array();
for (var i = 0; i < this.sentences.length; i++) { //iterate over sentences.
sentence = this.sentences[i];
for (var j = 0; j < sentence.categories.length; j++) {//iterate over categories
catId= sentence.categories[j].toNumber();
if (docCategories.indexOf(catId) < 0) {
docCategories.push(catId);
catValues.push({sentiment : sentence.sentiment, sentenceCnt: 1});
} else {
categoryIdx = docCategories.indexOf(catId);
catValue = catValues[categoryIdx];
catValue.sentiment = catValue.sentiment + sentence.sentiment;
catValue.sentenceCnt = catValue.sentenceCnt + 1;
}
}
}
totalCount++; //here we do try to count distinctCases see scope.
for (var i = 0; i < docCategories.length; i ++) {
emit(docCategories[i], {count: 1, sentenceCnt: catValues[i].sentenceCnt, sentiment: catValues[i].sentiment, totalCnt : totalCount});
}
};
var reduce = function(key, values) {
var res = {count : 0, sentenceCnt : 0, sentiment : 0};
for ( var i = 0; i < values.length; i ++ ) {
res.count += values[i].count;
res.sentenceCnt += values[i].sentenceCnt;
res.sentiment += values[i].sentiment;
}
return res;
};
var finalize = function(category, values) {
values.sentimentAvg = values.sentiment / values.sentenceCnt;
values.percentOfVerbatim = values.count / totalCount //scope variable (global)
return values;
};
var res = db.runCommand( { mapreduce:'test',
map:map,
reduce:reduce,
out: 'cat_volume',
finalize:finalize,
scope:{totalCount : 0},
});
The most interesting part here is that I'm using totalCount to count the number of verbatims I'm emitting; totalCount is a scope (global) variable.
Everything went well on a single MongoDB installation, but when moving to sharded instances I get "Infinity" for percentOfVerbatim.
In this particular case totalCount would simply be db.test.count() (the number of documents), but in the future I'm going to add different conditions on which documents get counted.
Running any additional query is very undesirable since the db is very heavy.
Are there any other approaches to using global (scope) variables on a multi-instance MongoDB installation? Or should I use something else?
Scope variables are not shared among the shards; you can treat them as global constants. Updates to the value won't be visible to map or reduce functions running on different shards.
Finally I've found a way to count the number of documents I'm emitting.
The only way that worked for me is emitting the document id and putting the ids into an array during reduce.
On the client side (I'm writing a Java program) I then just count all the distinct ids.
So, in the map I emit:
emit(docCategories[i], {verbIds : [this.verbId.toNumber()], count: 1, sentenceCnt: catValues[i].sentenceCnt, sentiment: catValues[i].sentiment, totalCnt : totalCount});
Reduce function is the following:
var reduce = function(key, values) {
var res = {verbIds : [], count : 0, sentenceCnt : 0, sentiment : 0};
for ( var i = 0; i < values.length; i ++ ) {
// res.verbIds = res.verbIds.concat(values[i].verbIds); //works slow
for ( var j = 0; j < values[i].verbIds.length; j ++ ) {
res.verbIds.push(values[i].verbIds[j]);
}
res.count += values[i].count;
res.sentenceCnt += values[i].sentenceCnt;
res.sentiment += values[i].sentiment;
}
return res;
};
The Java-side program just counts the distinct ids over all of the results.
For 1.1M documents, though, execution slows down significantly.
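For reference, a rough mongo-shell equivalent of that client-side distinct count, assuming the same cat_volume output collection as in the earlier runCommand:
// count distinct verbatim ids per category from the map/reduce output
db.cat_volume.find().forEach(function(doc) {
    var seen = {};
    doc.value.verbIds.forEach(function(id) { seen[id] = true; });
    print(doc._id + ": " + Object.keys(seen).length + " distinct verbatims");
});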