performance issue for updating mongo documents - mongodb

I have a mongo collections with about 4M records, 2 of my document fields of this collection are date as string, and I need to change them to ISODate, so arote this small script to do that:
db.vendors.find().forEach(function(el){
el.lastEdited = new Date(el.lastEdited);
el.creationDate = new Date(el.creationDate)
db.vendors.save(el)
})
but it takes foreverrrr...and I added indexes to those fields, can't I do this in some other way which will be reasonable time to complete?

Currently you do a find against whole collection which could take a while, also for each save call it has to do a network roundtrip from the client (shell) to the server.
EDIT: Removed suggestion to use $match and do this in batches, because $out in fact replaces the collection on each run as noted in #Stennie's comment.
(Needless to say, test it in a sample dataset first in a test environment. I haven't tested behavior of new Date() as I don't know your data format)
db.vendors.aggregate([
{
$project: {
_id: 1,
lastEdited: { $add : [new Date('$lastEdited')]},
creationDate: { $add : [new Date('$creationDate')]},
field1: 1,
field2: 1,
//.. (important to repeat all necessary fields)
},
},
{
$out: 'vendors',
},
]);

Related

Meteor collection get last document of each selection

Currently I use the following find query to get the latest document of a certain ID
Conditions.find({
caveId: caveId
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
How can I use the same using multiple ids with $in for example
I tried it with the following query. The problem is that it will limit the documents to 1 for all the found caveIds. But it should set the limit for each different caveId.
Conditions.find({
caveId: {$in: caveIds}
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
One solution I came up with is using the aggregate functionality.
var conditionIds = Conditions.aggregate(
[
{"$match": { caveId: {"$in": caveIds}}},
{
$group:
{
_id: "$caveId",
conditionId: {$last: "$_id"},
diveDate: { $last: "$diveDate" }
}
}
]
).map(function(child) { return child.conditionId});
var conditions = Conditions.find({
_id: {$in: conditionIds}
},
{
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
You don't want to use $in here as noted. You could solve this problem by looping through the caveIds and running the query on each caveId individually.
you're basically looking at a join query here: you need all caveIds and then lookup last for each.
This is a problem of database schema/denormalization in my opinion: (but this is only an opinion!):
You could as mentioned here, lookup all caveIds and then run the single query for each, every single time you need to look up last dives.
However I think you are much better off recording/updating the last dive inside your cave document, and then lookup all caveIds of interest pulling only the lastDive field.
That will give you immediately what you need, rather than going through expensive search/sort queries. This is at the expense of maintaining that field in the document, but it sounds like it should be fairly trivial as you only need to update the one field when a new event occurs.

Aggregation with meteorhacks:aggregate (why would I ever use $out)?

This is a significant edit to this question, as I have changed the publication to a method and narrowed the scope of the question.
I am using meteorhacks:aggregate to calculate and publish the average and median of company valuation data for a user-selected series of companies. The selections are saved in a Valuations collection for reference and the data for aggregation comes from the Companies collection.
The code below works fine for one-time use (although it's not reactive). However, users will rerun this aggregation for thousands of valuationIds. Since $out will first clear out the new collection before inserting the new results, I can't use that here, I need to retain the results of each instance. I don't understand why $out would ever be used.
Is there any way to just add update the existing Valuation document with the aggregation results and then subscribe to that document?
server/methods
Meteor.methods({
valuationAggregate: function(valuationId, valuationSelections) {
//Aggregate and publish average of company valuation data for a user-selected series of companies./
//Selections are saved in Valuations collection for reference and data for aggregation comes from Companies collection.//
check(valuationId, String);
check(valuationSelections, Array);
var pipelineSelections = [
//Match documents in Companies collection where the 'ticker' value was selected by the user.//
{$match: {ticker: {$in: valuationSelections}}},
{
$group: {
_id: null,
avgEvRevenueLtm: {$avg: {$divide: ["$capTable.enterpriseValue", "$financial.ltm.revenue"]}},
avgEvRevenueFy1: {$avg: {$divide: ["$capTable.enterpriseValue", "$financial.fy1.revenue"]}},
avgEvRevenueFy2: {$avg: {$divide: ["$capTable.enterpriseValue", "$financial.fy2.revenue"]}},
avgEvEbitdaLtm: {$avg: {$divide: ["$capTable.enterpriseValue", "$financial.ltm.ebitda"]}},
//more//
}
},
{
$project: {
_id: 0,
valuationId: {$literal: valuationId},
avgEvRevenueLtm: 1,
avgEvRevenueFy1: 1,
avgEvRevenueFy2: 1,
avgEvEbitdaLtm: 1,
//more//
}
}
];
var results = Companies.aggregate(pipelineSelections);
console.log(results);
}
});
The code above works, as far as viewing the results on the server. In my terminal, I see:
I20150926-23:50:27.766(-4)? [ { avgEvRevenueLtm: 3.988137239679733,
I20150926-23:50:27.767(-4)? avgEvRevenueFy1: 3.8159564713187155,
I20150926-23:50:27.768(-4)? avgEvRevenueFy2: 3.50111769838031,
I20150926-23:50:27.768(-4)? avgEvEbitdaLtm: 11.176476895728268,
//more//
I20150926-23:50:27.772(-4)? valuationId: 'Qg4EwpfJ5uPXyxe62' } ]
I was able to resolve this with the following. Needed to add the forEach to unwind the array in the same way as $out.
lib/collections
ValuationResults = new Mongo.Collection('valuationResults');
server/methods
var results = Companies.aggregate(pipelineSelections);
results.forEach(function(valuationResults) {
ValuationResults.update({'result.valuationId': valuationId}, {result:valuationResults}, {upsert: true});
});
console.log(ValuationResults.find({'result.valuationId': valuationId}).fetch());
}
});

Publish all fields in document but just part of an array in the document

I have a mongo collection in which the documents have a field that is an array. I want to be able to publish everything in the documents except for the elements in the array that were created more than a day ago. I suspect the answer will be somewhat similar to this question.
Meteor publication: Hiding certain fields in an array document field?
Instead of limiting fields in the array, I just want to limit the elements in the array being published.
Thanks in advance for any responses!
EDIT
Here is an example document:
{
_id: 123456,
name: "Unit 1",
createdAt: (datetime object),
settings: *some stuff*,
packets: [
{
_id: 32412312,
temperature: 70,
createdAt: *datetime object from today*
},
{
_id: 32412312,
temperature: 70,
createdAt: *datetime from yesterday*
}
]
}
I want to get everything in this document except for the part of the array that was created more than 24 hours ago. I know I can accomplish this by moving the packets into their own collection and tying them together with keys as in a relational database but if what I am asking were possible, this would be simpler with less code.
You could do something like this in your publish method:
Meteor.publish("pubName", function() {
var collection = Collection.find().fetch(); //change this to return your data
_.each(collection, function(collectionItem) {
_.each(collectionItem.packets, function(packet, index) {
var deadline = Date.now() - 86400000 //should equal 24 hrs ago
if (packet.createdAt < deadline) {
collectionItem.packets.splice(index, 1);
}
}
}
return collection;
}
Though you might be better off storing the last 24 hours worth of packets as a separate array in your document. Would probably be less taxing on the server, not sure.
Also, code above is untested. Good luck.
you can use the $elemMatch projection
http://docs.mongodb.org/manual/reference/operator/projection/elemMatch/
So in your case, it would be
var today = new Date();
var yesterday = new Date(today);
yesterday.setDate(today.getDate() - 1);
collection.find({}, //find anything or specifc
{
fields: {
'packets': {
$elemMatch: {$gt : {'createdAt' : yesterday /* or some new Date() */}}
}
}
});
However, $elemMatch only returns the FIRST element matching your condition. To return more than 1 element, you need to use the aggregation framework, which will be more efficient than _.each or forEach, particularly if you have a large array to loop through.
collection.rawCollection().aggregate([
{
$match: {}
},
{
$redact: {
$cond: {
if : {$or: [{$gt: ["$createdAt",yesterday]},"$packets"]},
then: "$$DESCEND",
else: "$$PRUNE"
}
}
}], function (error, result ){
});
You specify the $match in a way similar to find({}). Then all the documents that match your conditions get pipped into the $redact which is specified by the $cond.
$redact scans the document from top level to bottom. At the top level, you have _id, name, createdAt, settings, packets; hence {$or: [***,"$packets"]}
The presence of $packets in the $or allows the $redact to scan the second level which contain the _id, temperature and createdAt; hence {$gt: ["$createdAt",yesterday]}
This is async, you can use Meteor.wrapAsync to wrap around the function.
Hope this help

Mongo: Ensuring latest nested attribute has a value between given arguments

I have a mongo collection 'books'. Here's a typical book:
BOOK
name: 'Test Book'
author: 'Joe Bloggs'
print_runs: [
{publisher: 'OUP', year: 1981},
{publisher: 'Penguin', year: 1987},
{publisher: 'Harper-Collins', year: 1992}
]
I'd like to be able to filter books to return only books whose last print run was after a given date, and/or before a given date...and I've been struggling to find a feasible query. Any suggestions appreciated.
There are a few options, as getting access to the "last" element in the array and only filtering on that is difficult/impossible with the normal find options in MongoDB queries. (Unfortunately, you can't $slice with find).
Store the most recent published publisher and year in the print_runs array and in a special (denormalized/copy) of the data directly on the book object. Book.last_published_by and Book.last_published_date for example. Queries would be simple and super fast.
MapReduce. This would be simple enough to emit the last element in the array and then "reduce" it to just that. You'd need to do incremental updates on the MapReduce to keep it accurate.
Write a relatively complex aggregation framework expression
The aggregation might look like:
db.so.aggregate({ $project :
{ _id: 1, "print_run_year" : "$print_runs.year" }},
{ $unwind: "$print_run_year" },
{ $group : { _id : "$_id", "newest" : { $max : "$print_run_year" }}},
{ $match : { "newest" : { $gt : 1991, $lt: 2000 } }
})
As it may require a bit of explanation:
It projects and unwinds the year of the print runs for each book.
Then, group on the _id (of the book, and create a new computed field called, newest which contains the highest print run year (from the projection).
Then, filter on newest using a $gt and $lt
I'd suggest option #1 above would be the best from an efficiency perspective, followed by the MapReduce, and then a distant third, option #3.

mongodb mapreduce function does not provide skip functionality, is their any solution to this?

Mongodb mapreduce function does not provide any way to skip record from database like find function. It has functionality of query, sort & limit options. But I want to skip some records from the database, and I am not getting any way to it. please provide solutions.
Thanks in advance.
Ideally a well-structured map-reduce query would allow you to skip particular documents in your collection.
Alternatively, as Sergio points out, you can simply not emit particular documents in map(). Using scope to define a global counter variable is one way to restrict emit to a specified range of documents. As an example, to skip the first 20 docs that are sorted by ObjectID (and thus, sorted by insertion time):
db.collection_name.mapReduce(map, reduce, {out: example_output, sort: {id:-1}, scope: "var counter=0")};
Map function:
function(){
counter ++;
if (counter > 20){
emit(key, value);
}
}
I'm not sure since which version this feature is available, but certainly in MongoDB 2.6 the mapReduce() function provides query parameter:
query : document
Optional. Specifies the selection criteria using query operators for determining the documents input to the map
function.
Example
Consider the following map-reduce operations on a collection orders that contains documents of the following prototype:
{
_id: ObjectId("50a8240b927d5d8b5891743c"),
cust_id: "abc123",
ord_date: new Date("Oct 04, 2012"),
status: 'A',
price: 25,
items: [ { sku: "mmm", qty: 5, price: 2.5 },
{ sku: "nnn", qty: 5, price: 2.5 } ]
}
Perform the map-reduce operation on the orders collection using the mapFunction2, reduceFunction2, and finalizeFunction2 functions.
db.orders.mapReduce( mapFunction2,
reduceFunction2,
{
out: { merge: "map_reduce_example" },
query: { ord_date:
{ $gt: new Date('01/01/2012') }
},
finalize: finalizeFunction2
}
)
This operation uses the query field to select only those documents with ord_date greater than new Date(01/01/2012). Then it output the results to a collection map_reduce_example. If the map_reduce_example collection already exists, the operation will merge the existing contents with the results of this map-reduce operation.