MongoDB: remove map-reduce collections

Due to an error in client code, MongoDB has created many "mr.mapreduce..." collections. How can I remove them all (by mask, maybe)?

I ran this script in the interactive shell:
function f() {
    var names = db.getCollectionNames();
    for (var i = 0; i < names.length; i++) {
        if (names[i].indexOf("mr.") == 0) {
            db[names[i]].drop();
        }
    }
}
f();
It resolved my problem.

Temporary map-reduce collections should be cleaned up when the connection which created them is closed:
map/reduce is invoked via a database command. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. map and reduce functions are written in JavaScript and execute on the server.
-- MongoDB docs
If not, you could delete them using the same method you would use to delete any other collection. It might get a bit repetitive though.
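To avoid leftover temporary collections in the first place, you can give map-reduce a permanent output collection name, as the quoted docs suggest. Here is a rough pymongo sketch of that approach (pymongo 3.x, where Collection.map_reduce still exists; the database, collection, and field names are made up for illustration):
from pymongo import MongoClient
from bson.code import Code

db = MongoClient()["test_db"]  # assumes a local mongod and a made-up db name

# Toy map/reduce that counts documents per "category" field.
mapper = Code("function () { emit(this.category, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Writing to a named collection ("category_counts") instead of a temporary
# one means no tmp.mr.* collections are left behind to clean up.
out_coll = db.things.map_reduce(mapper, reducer, out="category_counts")
for doc in out_coll.find():
    print(doc)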

Another way to delete the leftover collections is this snippet:
db.system.namespaces.find({name: /tmp\.mr/}).forEach(function(z) {
    try {
        db.getMongo().getCollection(z.name).drop();
    } catch (err) {}
});
Pro: It won't try to collect all your namespaces into a JavaScript Array. MongoDB segfaults on too many namespaces.

Temporary map-reduce collections should be cleaned up when the connection which created them is closed. However, sometimes they remain and inflate the database size. You can remove them with the following script:
var database = db.getSiblingDB('yourDatabaseName');
var tmpCollections = database.getCollectionInfos({
    name: {$regex: /^tmp\.mr\./},
    'options.temp': true,
});
tmpCollections.forEach(function (collection) {
    database.getCollection(collection.name).drop();
});
print(`${tmpCollections.length} tmp collection(s) deleted`);
The script, saved as drop.tmp.js, can be executed from the command line as follows:
mongo --quiet mongodb://mongodb:27137 drop.tmp.js
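If you would rather run the cleanup from Python, a rough pymongo equivalent looks like this (the connection string and database name are placeholders; it assumes the leftovers follow the tmp.mr. naming convention):
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["yourDatabaseName"]

# Find collections whose names look like leftover map-reduce
# temporaries and drop them one by one.
tmp_names = db.list_collection_names(filter={"name": {"$regex": r"^tmp\.mr\."}})
for name in tmp_names:
    db.drop_collection(name)
print(f"{len(tmp_names)} tmp collection(s) dropped")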

Related

MongoDB Atlas - Trigger returning empty object

I've been trying to search through the documentation and various Stack Overflow posts, but I can't seem to find out how to do this. I'm trying to set up a Trigger for MongoDB Atlas with the following function:
exports = function() {
    const mongodb = context.services.get("test_cluster");
    const db = mongodb.db("test_database");
    const test_collection = db.collection("test_collection");
    var tasks = test_collection.find();
    console.log(JSON.stringify(tasks));
};
but the logs return "{}" every time I run it. Is there something basic that I'm missing here? The collection does contain data: just some dummy documents with a basic _id and a few values.
I figured out my problem: Atlas does not surface the output of console.log in the browser console. I had set up the variables correctly, but to get the information I wanted I needed to use a return statement instead.

find_one() finds duplicates that are not there

I am trying to copy a remote MongoDB Atlas database to a local one. I do this with a Python script which also checks whether each record is already there. I see that even though the local database is empty, my script finds duplicates, which are not in the remote MongoDB Atlas (at least I cannot find them). I am not very experienced with MongoDB and pymongo, but I cannot see what I am doing wrong. Sometimes find_one() finds exactly the same record as before (all the fields are the same, even the _id)?
I removed the collection completely from my local server and tried again, but still got the same result.
UserscollectionRemote = dbRemote['users']
UserscollectionNew = dbNew['users']
LogcollectionRemote = dbRemote['events']
LogcollectionNew = dbNew['events']

UsersOrg = UserscollectionRemote.find()
for document in UsersOrg:  # loop over all users
    print(document)
    if UserscollectionNew.find_one({'owner_id': document["owner_id"]}) is None:  # check if already there
        UserscollectionNew.insert_one(document)
    UserlogsOrg = LogcollectionRemote.find({'owner_id': document["owner_id"]})  # get all logs from this user
    for doc in UserlogsOrg:
        try:
            if LogcollectionNew.find_one({'date': doc["date"]}) is None:  # no entry with this date yet
                LogcollectionNew.insert_one(doc)
            else:
                print("duplicate")
                print(doc)
        except:
            print("an error occurred finding the document")
            print(doc)
You have the second for loop inside the first; that could be trouble. Note also that the duplicate check on LogcollectionNew matches on date alone, so logs from different users that happen to share a date will be reported as duplicates.
On a separate note, you should investigate mongodump and mongorestore for copying collections; unless you need to be doing it in code, these tools are better suited for your use case.
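For reference, a typical dump-and-restore looks something like this (the URI, database name, and paths are placeholders):
mongodump --uri="mongodb+srv://user:password@cluster0.example.net/mydb" --out=dump/
mongorestore --host=localhost:27017 dump/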

MongoDB aggregation crashes when cursor or explain is used

Since version 3.6, MongoDB requires the use of cursor or explain in aggregate queries. It's a breaking change, so I have to modify my earlier code.
But when I add cursor or explain to my query, the request simply enters an endless loop and MongoDB never responds. It doesn't even seem to time out.
This simple aggregation just hangs the code and never responds:
db.collection('users').aggregate([{ $match: { username: 'admin' } }],
    { cursor: {} },
    (err, docs) => {
        console.log('Aggregation completed.');
    });
I can replace { cursor: {} } with { explain: true } and the result is the same. It works perfectly under older MongoDB versions without this one parameter.
Without cursor or explain I get this error message:
The 'cursor' option is required, except for aggregate with the explain argument
I'm not the only one who ran into this:
https://github.com/nosqlclient/nosqlclient/issues/419
OK, this was a little tricky, but finally it works. It looks like there are some major breaking changes in MongoDB's Node.js driver that nobody bothered to tell me about.
1. The Node.js MongoDB driver has to be upgraded. My current version is 3.0.7.
2. The way MongoDB connects has changed, breaking any old code. The client connection now returns a client object, not merely the db, and it has to be handled differently. There is a SO answer explaining it perfectly:
db.collection is not a function when using MongoClient v3.0
3. Aggregations now return an AggregationCursor object, not an array of data. Instead of passing a callback, you now iterate through the cursor:
var cursor = collection.aggregate([ ... ],
    { cursor: { batchSize: 1 } });
cursor.each((err, docs) => {
    ...
});
So it seems you have to rewrite ALL your db operations after upgrading to MongoDB 3.6. Yay, thanks for all the extra work, MongoDB Team! Guess that's where I'm done with your product.
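For what it's worth, the same iterate-the-cursor pattern applies in other drivers too. A small pymongo sketch of the equivalent aggregation (assuming a local server and made-up database and collection names):
from pymongo import MongoClient

collection = MongoClient()["test_db"]["users"]

# aggregate() returns a CommandCursor, not a list; iterate it
# (or call list(cursor)) to get the documents.
cursor = collection.aggregate([{"$match": {"username": "admin"}}])
for doc in cursor:
    print(doc)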

call custom python function on every document in a collection Mongo DB

I want to call a custom Python function on some existing attribute of every document in the entire collection and store the result as a new key-value pair in that (same) document. Is there any way to do that (since each call is independent of the others)?
I noticed cursor.forEach, but can't it be done efficiently just using Python?
A simple example would be to split the string in text and store the number of words as a new attribute.
def split_count(text):
    # some complex preprocessing...
    return len(text.split())

# Need something like this...
db.collection.update_many({}, {'$set': {'split': split_count('$text')}}, upsert=True)
But it seems that setting a new attribute in a document based on the value of another attribute in the same document is not possible this way yet. This post is old, but the issue still appears to be open.
I found a way to call any custom python function on a collection using parallel_scan in PyMongo.
import threading

def process_text(cursor):
    for row in cursor.batch_size(200):
        # Any complex preprocessing here...
        split_text = row['text'].split()
        db.collection.update_one({'_id': row['_id']},
                                 {'$set': {'split_text': split_text,
                                           'num_words': len(split_text)}},
                                 upsert=True)

def preprocess(num_threads=4):
    # Get up to max 'num_threads' cursors.
    cursors = db.collection.parallel_scan(num_threads)
    threads = [threading.Thread(target=process_text, args=(cursor,))
               for cursor in cursors]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
This is not really faster than cursor.forEach (but not much slower either), and it lets me execute arbitrarily complex Python code and save the results from within Python itself.
Also, if I have an array of ints in one of the attributes, cursor.forEach converts them to floats, which I don't want. So I preferred this way.
But I would be glad to know if there are any better ways than this :)
It is quite unlikely that this kind of thing will ever be efficient in Python, because every document has to make a round trip to the client machine and pass through the Python function there.
In your example code, you are passing the result of a function call to a MongoDB update query; that won't work, because you can't run Python code inside MongoDB queries on the db server.
As the answer to your linked question suggests, this type of action has to be performed in the mongo shell, e.g.:
db.collection.find().snapshot().forEach(
    function (elem) {
        var splitLength = elem.text.split(" ").length;
        db.collection.update(
            { _id: elem._id },
            { $set: { split: splitLength } }
        );
    }
);
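If the preprocessing really does have to be Python, one alternative is to compute the values client-side and batch the writes with bulk_write, which cuts the number of round trips per document. A hedged sketch (note that parallel_scan from the answer above was removed in newer MongoDB/pymongo versions, so this uses a single cursor; the collection and field names follow the question):
from pymongo import MongoClient, UpdateOne

collection = MongoClient()["test_db"]["collection"]

def split_count(text):
    # stand-in for arbitrarily complex Python preprocessing
    return len(text.split())

ops = []
for doc in collection.find({}, {"text": 1}).batch_size(200):
    ops.append(UpdateOne({"_id": doc["_id"]},
                         {"$set": {"split": split_count(doc["text"])}}))
    if len(ops) == 500:  # flush in chunks to bound memory use
        collection.bulk_write(ops, ordered=False)
        ops = []
if ops:
    collection.bulk_write(ops, ordered=False)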

use cursor as iterator with a query

I was reading about MongoDB and came across this part of the tutorial: http://www.mongodb.org/display/DOCS/Tutorial It says:
> var cursor = db.things.find();
> printjson(cursor[4]);
{ "_id" : ObjectId("4c220a42f3924d31102bd858"), "x" : 4, "j" : 3 }
"When using a cursor this way, note that all values up to the highest accessed (cursor[4] above) are loaded into RAM at the same time. This is inappropriate for large result sets, as you will run out of memory. Cursors should be used as an iterator with any query which returns a large number of elements."
How do I use a cursor as an iterator with a query? Thanks for the help.
You've tagged that you're using pymongo, so I'll give you two pymongo examples using the cursor as an iterator:
import pymongo
cursor = pymongo.MongoClient().test_db.test_collection.find()
for item in cursor:
    print(item)  # this will print each item as a dictionary
and
import pymongo
cursor = pymongo.MongoClient().test_db.test_collection.find()
results = [item['some_attribute'] for item in cursor]
# this builds a list containing the value of some_attribute
# for each item in the collection
In addition, you can set the size of batches returned to the pymongo driver by doing this:
import pymongo
cursor = pymongo.MongoClient().test_db.test_collection.find()
cursor.batch_size(20)  # sets the size of the batches of items the cursor returns to 20
It is usually unnecessary to mess with the batch size, but if the machine running the driver is having memory issues and page faulting while you manipulate results from the query, you might have to set this to achieve better performance (this really seems like a painful optimization to me, and I've always left the default).
As for the JavaScript driver (the driver that loads when you launch the shell), that part of the documentation is cautioning you not to use "array mode". From the online manual:
Array Mode in the Shell
Note that in some languages, like JavaScript, the driver supports an "array mode". Please check your driver documentation for specifics. In the db shell, to use the cursor in array mode, use array index [] operations and the length property.
Array mode will load all data into RAM up to the highest index requested. Thus it should not be used for any query which can return very large amounts of data: you will run out of memory on the client.
You may also call toArray() on a cursor. toArray() will load all objects queried into RAM.
Using the MongoDB Java driver it should be something like:
DBCursor cursor = collection.find(query);
while (cursor.hasNext()) {
    DBObject obj = cursor.next();
    // do something with obj
}
In the mongo console you can do something like:
var cursor = db.things.find();
while(cursor.hasNext()) { printjson(cursor.next()); }
MongoDB returns results in batches. To see how many objects are left in the current batch, use objsLeftInBatch(), like this:
var c = db.Schools.find();
var doc = function() { return c.hasNext() ? c.next() : null; };
c.objsLeftInBatch();
To iterate through this batch, call the doc() helper set up in the code block above. More on cursors can be found at https://docs.mongodb.com/