I'm trying to query all documents from a mongodb collection of which the criteria is inside a file.
criteria-file.txt:
value1
value2
value3
...
Currently I'm building the query like this
built-test.js.sh:
#!/bin/bash
echo 'db.collection.find({keyfield: {$in:[' > test.js
cat criteria-file.txt| while read i
do
echo "\"$i\"," >> test.js
done
echo ']}})' >> test.js
The query document is way under 16MB in size, but I wonder if there's a better way which is more elegant and efficient, especially, because over time I will most probably be over 16MB for the query document. I'm eager to get your suggestions.
BTW, I was wondering, for those 25K criteria values seeking in a collection with currently 200 million entries, the query time is only a bit over a minute and the CPU load doesn't seem to be too bad.
Thanks!
Read the file into an array using the cat() native shell method. Then, loop over the array of criteria values to find the matching documents and store all the documents in an array; this will be your list of matches.
var criteria_file = cat("criteria-file.txt");
var criteria_array = criteria_file.split("\n");
var result_ids_arr = [ ];
for (let value of criteria_array) {
let id_arr = db.collection.find( { keyfield: value }, { _id: 1} ).toArray();
result_ids_arr = result_ids_arr.concat(id_arr);
}
The result array of the _id values, e.g.: [ { "_id" : 11 }, { "_id" : 34 }, ... ]
All this JavaScript can be run from the command prompt or the mongo shell, using the load().
Split the criteria file into different chunks ensuring that the chunks don't exceed 16MB.
Run the same query on each chunk only now you can run the queries in parallel.
if you want to get extra fancy you can use the aggregation pipeline to do a $match query and send all the output results from each query to a single results collection using $mergeenter link description here.
Related
var product = db.GetCollection<Product>("Product");
var lookup1 = new BsonDocument(
"$lookup",
new BsonDocument {
{ "from", "Variant" },
{ "localField", "Maincode" },
{ "foreignField", "Maincode" },
{ "as", "variants" }
}
);
var pipeline = new[] { lookup1};
var result = product.Aggregate<Product>(pipeline).ToList();
The data of collection a is very large so it takes me 30 seconds to put the data in the list.
What should I do to make a faster lookup?
What that query is doing is retrieving every document from the Product collection, and then for each document found, perform a find query in the Variant collection. If there is no index on the Maincode field in the Variant collection, it will be reading the entire collection for each document.
This means that if there are, say, 1000 total products, with 3000 total variants (3 per product, on average), this query will be reading all 1000 documents from Product, and if that index isn't there, it would read all 3000 documents from Variant 1000 times, i.e. it will be examining 3 million documents.
Some ways to possibly speed this up:
create an index on {Maincode:1} in the Variant collection
This will reduce the number of documents that must be read in order to complete the lookup
change the schema
If the variants are stored in the same document with the product, there is no need for a lookup
filter the products prior to lookup
Again, reducing the documents read during the lookup
use a cursor to retrieve the documents in batches
If you perform any necessary sorting first, and the lookup last, you can return the documents to the application in batches, which would allow the application to display or begin processing the first batch before the second batch is available. This doen't make the query itself faster, but it can reduced the perceived wait in the application.
I have a mongo DB collection that looks something like this:
{
{
_id: objectId('aabbccddeeff'),
objectName: 'MyFirstObject',
objectLength: 0xDEADBEEF,
objectSource: 'Source1',
accessCounter: {
'firstLocationCode' : 283,
'secondLocationCode' : 543,
'ThirdLocationCode' : 564,
'FourthLocationCode' : 12,
}
}
...
}
Now, assuming that this is not the only record in the collection and that most/all of the documents contain the accessCounter subdocument/field how will I go with selecting the x first documents where I have the most access from a specific location.
A sample "query" will be something like:
"Select the first 10 documents From myCollection where the accessCounter.firstLocationCode are the highest"
So a sample result will be X documents where the accessCounter. will be the greatest is the database.
Thank your for taking the time to read my question.
No need for an aggregation, that is a basic query:
db.collection.find().sort({"accessCounter.firstLocation":-1}).limit(10)
In order to speed this up, you should create a subdocument index on accessCounter first:
db.collection.ensureIndex({'accessCounter':-1})
assuming the you want to do the same query for all locations. In case you only want to query firstLocation, create the index on accessCounter.firstLocation.
You can speed this up further in case you only need the accessCounter value by making this a so called covered query, a query of which the values to return come from the index itself. For example, when you have the subdocument indexed and you query for the top secondLocations, you should be able to do a covered query with:
db.collection.find({},{_id:0,"accessCounter.secondLocation":1})
.sort("accessCounter.secondLocation":-1).limit(10)
which translates to "Get all documents ('{}'), don't return the _id field as you do by default ('_id:0'), get only the 'accessCounter.secondLocation' field ('accessCounter.secondLocation:1'). Sort the returned values in descending order and give me the first ten."
I'm currently using MongoDB 2.6 through MongoHQ. I've several mapreduces jobs which crunch raw data from a collection (c1) to produce a new collection (c2).
I've also an aggregation pipeline which parses (c2) to generate a new collection (c3) with the great $out operator.
However, I need to add extra fields to (c3) outside of the aggregation pipeline and keep them even after a new run of the aggregation but it seems that aggregation, based on the _id key just overwrite the content without updating it. So if I've previously add an extra field like foo : 'bar' to (c3) and I re-run the aggregation, I will loose the foo field.
Based on documentation (http://docs.mongodb.org/manual/reference/operator/aggregation/out/#pipe._S_out)
Replace Existing Collection
If the collection specified by the $out operation already exists, then upon completion of the aggregation, the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the pre-existing collection.
Is there a better way or a tricky one :-) to update the $out collection instead of overwriting records with same _id ? I could write a python script or javascript to do the job but I would to avoid doing many database calls and in a smarter way as aggregation. May be it is not possible, so I will look for a different and more 'classical' path.
Thanks for your help
Well, not directly with the $out operator as much with the mapReduce output this is pretty much an "overwrite" operation (though mapReduce does have "merge" and "reduce" modes as well).
But since you have a MongoDB 2.6 version you do actually return a "cursor". So while the "client/server" interaction may not be as optimal as you want but you also have "bulk update" operations so you can do something along the lines of:
var cursor = db.collection.aggregate([
// pipeline here
]);
var batch = [];
while ( cursor.hasNext() ) {
var doc = cursor.next();
var updoc = {
"q": { "_id": doc._id },
"u": {
// only new fields except for
"$setOnInsert": {
// the fields you expect to add from before
},
"upsert": true
}
};
batch.push(updoc);
// try to do sensible under 16MB updates, number may vary
if ( ( batch.length % 500 ) == 0 ) {
db.runCommand({
"update": "newcollection",
"updates": batch
});
batch = []; // reset the content
}
}
db.runCommand({
"update": "newcollection",
"updates": batch
});
And of course, though there will be many naysayers, and not without reason because you really need to weigh up the consequences ( which are very real ), you can always wrap what is essentially a JavaScript call with db.eval() in order to get the full server side execution.
But where possible ( and that is unless you have a completely remote database solution ), then it is generally advised to take the "client/server" option, but keep the process as "close" ( in networking terms ) to the server as possible.
Unlike Map reduce it seems as though the $out operator in the aggregation framework has a very specific set of pre-defined behaviours ( http://docs.mongodb.org/manual/reference/operator/aggregation/out/#behaviors ), however, it does seem that the $out option could change, I did not find a JIRA relating to this specific case however others have posted changes ( https://jira.mongodb.org/browse/SERVER-13201 ).
As for solving your problem now, you either are forced to revert back to Map Reduce (I don't know the scenario from where this is being run) or aggregate in a certain manner that allows you to feed in the new data and the old data you need.
Most common way of achieving this might be to update the original rows with the new data, maybe by aggregating the original row back down to itself.
Thanks for all your messages.
As I do not want to use cursor (requests consuming) I try to get the job by combining 2 map reduces jobs and one aggregation. It is quite 'fat' but it works and could give some idea for others.
Of course, I would be very pleased hearing from you other great alternatives.
So, I have a collection c1 which is the result of a previous mapreduce job as you could see by the value object.
c1 : { id:'xxxx', value:{ language:'...', keyword: '...', params: '...', field1: val1, field2: val2}}
the xxxx unique ID key is the concatenation of the value.language , value.keyword and value.params as follow :
*xxxx = _*
I've got another collection c2 : { _id : ObjectID, language:'...', keyword:'...', field1: val1, field2: val2, labels: 'yyyyy'} which is quite a
projection of the c1 collection but with an extra field labels which is a string with different labels comma separated. This c2 collection is a central repository for all combination of language and keywords with their attached field values.
Target
The target is to group all records from the c1 collection based on the
group key _, make some calculations on
other fields and store the result to the c2 collection but by keeping
the old 'labels' field from c2 with the same key. So fields1 & 2 of
this c2 collection will be recalculated each time we launch the whole
batch but the labels field will stay unchanged.
As described in my first message, by using aggregation or mapreduce jobs you could not reach this target as the 'labels' field will be removed.
As I do not want to use cursors and other foreach loop which are very network and database resquests consuming (I have a big collection and I use a MongoHQ service)
I try to solve the problem by using mapreduce and aggregation jobs.
1st Phase
So, firstly I run a mapreduce job (m1) which is a sort of copy of the c2 collection but clearing the value of field1 & 2 to 0. The result will be store in a c3 collection.
function m1Map(){
language = this['value']['language'];
keyword = this['value']['keyword'];
labels = this['labels'];
key = language + '_' + keyword;
emit(key,{'language':language,'keyword':keyword,'field1': 0, 'field2': 0.0, 'labels' : labels});
}
function m1Reduce(key,values){
language = values[0]['language'];
keyword = values[0]['keyword'];
labels = values[0]['labels'];
return {'language':language,'keyword':keyword,'field1': 0, 'field2': 0.0, 'labels' : labels}};
}
So now, c3 is a copy of c2 collection with field1&2 set to 0. Here is the shape of this collection :
c3 : { id:'', value:{ language:'...', keyword: '...', field1: 0, field2: 0.0, labels: '...'}}
2nd Phase
In a second step I run a mapreduce job (m2) which group the c1 collection value by the key _ and I project an extra field 'labels' with a fixed value 'x' in my example. This 'x' value is never used on the c2 collection, that is a special value. The output of this m2 mapreduce job will be stored in the same previous c3 collection with a 'reduce' option in the out directive. The python script will be described further.
function m2Map(){
language = this['value']['language'];
keyword = this['value']['keyword'];
field1 = this['value']['field1'];
field2 = this['value']['field2'];
key = language + '_' + keyword;
emit(key,{'language':language,'keyword':keyword,'field1': field1, 'field2': field2, 'labels' : 'x'});
}
Then I make some calculations on the Reduce function :
function m2Reduce(key,values){
// Init
language = values[0]['language'];
keyword = values[0]['keyword'];
field1 = 0;
field2 = 0;
bLabel = 0;
for (var i = 0; i < values.length; i++){
if (values[i]['labels'] == 'x') {
// We know these emit values are coming from the map and not from previous value on the c2 collection
// 'x' is never used on the c2 collection
field1 += parseInt(values[i]['field1']);
field2 += parseFloat(values[i]['field2']);
} else {
// these values are from the c2 collection
if (bLabel == 0) {
// we keep the former value for the 'labels' field
labels = values[i]['labels'];
bLabel = 1;
} else {
// we concatenate the 'labels' field if we have 2 records but theorytically it is impossible as c2 has only one record by unique key
// anyway, a good check afterwards :-)
labels += ','+values[i]['labels'];
}
}
}
if (bLabel == 0) {
// if values are only coming from the map emit, we force again the 'x' value for labels, it these values are re-used in another reduce call
labels = 'x';
}
return {'language':language,'keyword':keyword, 'field1': field1, 'field2': field2, 'labels' : labels};
}
The Python mapreduce script which calls the two m1 & m2 mapreduce jobs
(see pymongo for import : http://api.mongodb.org/python/2.7rc0/installation.html)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pymongo import MongoClient
from pymongo import MongoReplicaSetClient
from bson.code import Code
from bson.son import SON
# MongoHQ
uri = 'mongodb://user:passwd#url_node1:port,url_node2:port/mydb'
client = MongoReplicaSetClient(uri,replicaSet='set-xxxxxxx')
db = client.mydb
coll1 = db.c1
coll2 = db.c2
#Load map and reduce functions
m1_map = Code(open('m1Map.js','r').read())
m1_reduce = Code(open('m1Reduce.js','r').read())
m2_map = Code(open('m2Map.js','r').read())
m2_reduce = Code(open('m2Reduce.js','r').read())
#Run the map-reduce queries
results = coll2.map_reduce(m1_map,m1_reduce,"c3",query={})
results = coll1.map_reduce(m2_map,m2_reduce,out=SON([("reduce", "c3")]),query={})
3rd Phase
At this point, we have a c3 collection which is complete with all field 1 & 2 computed values and the labels kept. So now, we have to run a last aggregation pipeline to copy the c3 content (in a mapreduce form with a compound value) to a more classical collection c2 with flatten fields without the value object.
db.c3.aggregate([{$project : { _id: 0, keyword: '$value.keyword', language: '$value.language', field1: '$value.field1', field2 : '$value.field2', labels : '$value.labels'}},{$out:'c2'}])
Et voilĂ ! The target is reached. This solution is quite long with 2 mapreduce jobs and one aggregation pipeline but this is an alternative solution for those who do not want to use consuming cursor or external loop.
Thanks.
I have Document
class Store(Document):
store_id = IntField(required=True)
items = ListField(ReferenceField(Item, required=True))
meta = {
'indexes': [
{
'fields': ['campaign_id'],
'unique': True
},
{
'fields': ['items']
}
]
}
And want to set up indexes in items and store_id, does my configuration right?
Your second index declaration looks like it should do what you want. But to make sure that the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find-query which should use that index followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis Time in milliseconds the query required
indexOnly (self-explaining)
n number of returned documents
nscannedObjects the number of objects which had to be examined without using an index. For an index-only query this should be equal to n. When it is higher, it means that some documents could not be excluded by an index and had to be scanned manually.
I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because it matters. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
# 1. Count the group sizes
res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response = True)
# 2. For each entry from the filter scratch collection having count > max_group_size
deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
key = entry['_id']
group_size = int(entry['value'])
# 2b. query the original collection by the entry key, order it by test_date ascending, limit to the group size minus max_group_size.
for id in mydb.static.find(key, limit = group_size - max_group_size, **deleteFindArgs):
mydb.static.remove(id)
return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
MR the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So, the last two steps are foreach-found-remove and this is the important detail of my question, that changes everything and I had to be more specific about it - sorry.
Now, about the collection remove command. It does accept query, but mine include sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to screw mongo.Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, that the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though, if number of matching documents is high, your database might get less responsive. It is often advised to delete documents in smaller chunks.
Let's say, you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
You can remove it directly using MongoDB scripting language:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m doc collection. Documentation at (https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany)
db.collection.deleteMany(
<filter>,
{
writeConcern: <document>,
collation: <document>
}
)
I would recommend paging if large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query= {"FEILD":"XYZ", 'DATE': {$lt:new ISODate("2019-11-10")}};
db.COL.aggregate([
{$match:query},
{$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query= {"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
{$match:query},
{ $limit : 5 }
])
cursor.forEach(function (doc){
db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query={"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.tags.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in cmd
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using node.js write this code
User.remove({ _id: req.body.id },, function(err){...});