find_one() finds duplicates that are not there - mongodb

I am trying to copy a remote MongoDB Atlas server to a local one. I do this with a Python script which also checks whether each record is already there. I see that even though the local database is empty, my script finds duplicates, which are not in the remote MongoDB Atlas (at least I cannot find them). I am not very experienced with MongoDB and pymongo, but I cannot see what I am doing wrong. Sometimes find_one() finds exactly the same record as before (all the fields are the same, even the _id)?
I removed the collection completely from my local server and tried again, but still the same result.
UserscollectionRemote = dbRemote['users']
UserscollectionNew = dbNew['users']
LogcollectionRemote = dbRemote['events']
LogcollectionNew = dbNew['events']

UsersOrg = UserscollectionRemote.find()
for document in UsersOrg:  # loop over all users
    print(document)
    if UserscollectionNew.find_one({'owner_id': document["owner_id"]}) is None:  # check if already there
        UserscollectionNew.insert_one(document)
    UserlogsOrg = LogcollectionRemote.find({'owner_id': document["owner_id"]})  # get all logs from this user
    for doc in UserlogsOrg:
        try:
            if LogcollectionNew.find_one({'date': doc["date"]}) is None:  # there was no entry yet with this date
                LogcollectionNew.insert_one(doc)
            else:
                print("duplicate")
                print(doc)
        except Exception:
            print("an error occurred finding the document")
            print(doc)

You have the second for loop inside the first; that could be trouble. Note also that your duplicate check on the logs filters on date alone, so a log entry from a different user with the same date will be reported as a duplicate even though it is not.
On a separate note, you should investigate mongodump and mongorestore for copying collections; unless you need to be doing it in code, these tools are better suited to your use case.
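For example, a one-off copy could look something like this (a sketch only; the URI, database name, and paths are placeholders, not values from the question):

# dump the remote Atlas database into a local dump/ directory
mongodump --uri="mongodb+srv://user:pass@cluster0.example.mongodb.net/mydb"

# restore the dump into a locally running mongod
mongorestore --host=localhost:27017 dump/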


Is there something similar to journal for read operations in mongodb?

I am currently developing a program that reads documents from Mongo and writes them to a file, something like this:
for doc in db.col.find({"field": "bla"}):
    file.write(str(doc))
My problem is that something can happen while doing this (it's going to take a week to do all the writes), for example a shutdown or a network problem. My question is: is there something similar to the journal, but for these file writes, so I can recover from a checkpoint and don't need to do all the writing to the file all over again?
This program will do what you want: it writes the _id of each document to a separate log file. If everything runs fine it will just work; if it fails, the log file ensures you restart after the last successful write. It will also work if the dataset is changing, as long as only inserts are done.
It needs Python 3.6 or later (for the f-strings) and pymongo.
import pymongo
import bson.json_util
import pathlib
import os

log_file = "logfile.txt"
output_file = "zip_codes.json"
host = "mongodb+srv://readonly:readonly@demodata.rgl39.mongodb.net/test?retryWrites=true&w=majority"

log_set = frozenset()

# Do we have a log file of previous writes?
if os.path.isfile(log_file):
    with open(log_file, "r") as log_input:
        log_set = frozenset([x.strip() for x in log_input.readlines()])
    print(f"{log_file} contains {len(log_set)} items")
else:  # let's create one that is empty
    pathlib.Path(log_file).touch()
    print(f"creating {log_file}")

# Connect to MongoDB; we are using a read-only dataset for testing.
client = pymongo.MongoClient(host)
db = client["demo"]
zipcodes = db["zipcodes"]
count = 0

# Note we use bson.json_util to dump each document rather than json.dumps.
# This ensures that we can read this file back into MongoDB.
# Both files are opened in append mode so that a restart continues where
# the previous run left off instead of truncating what was already written.
with open(output_file, "a") as data_output:
    with open(log_file, "a") as log_output:
        for doc in zipcodes.find():
            if str(doc["_id"]) not in log_set:  # did we write this record already?
                count = count + 1
                data_output.write(f"{bson.json_util.dumps(doc)}\n")
                log_output.write(f"{doc['_id']}\n")

print(f"inserted {count} docs")

Search for MongoDB records containing a certain string

I have a MongoDB database full of tweets that I've gathered using the tweepy API, and I want to be able to search for a hashtag on a web application and have it show the tweets containing that hashtag.
Currently, I have created a list of the DB records and I iterate through that list to display them all, but now I want to refine the search so the user can choose what they see. I have the user's search saved into a variable and I have tried the following ideas, but none seem to be working.
My first idea was to just pass in the variable and hope for the best
def display():
    input = request.form['input']  # setting variable for user input
    # Set up Mongo client
    client = MongoClient("mongo.user/pass")
    # Accessing database
    db = client.tweets
    # Accessing a collection
    posts = db.posts
    data = list(posts.find(input))
    return render_template('results.html', posts_info=data)
With this, I get a TypeError, which I somewhat expected, as I didn't think it would be this easy.
After some reading online, I tried using a regex.
def display():
    input = request.form['input']  # setting variable for user input
    # Set up Mongo client
    client = MongoClient("mongo.user/pass")
    # Accessing database
    db = client.tweets
    # Accessing a collection
    posts = db.posts
    data = list(posts.find({Tweet: {$regex: input}}))
    return render_template('results.html', posts_info=data)
This also didn't work, so I tried hard-coding the regex to see if it was the user input variable causing the issues:
def display():
    input = request.form['input']  # setting variable for user input
    # Set up Mongo client
    client = MongoClient("mongo.user/pass")
    # Accessing database
    db = client.tweets
    # Accessing a collection
    posts = db.posts
    data = list(posts.find({Tweet: {$regex: "GoT"}}))
    return render_template('results.html', posts_info=data)
With both of these attempts I get a syntax error at the start of the regex expression; Python flags the $ before regex.
The error message I get reads:
File "pathToWebApp/webApp.py", line 71
data = list(posts.find({Tweet: {$regex: input}}))
^
SyntaxError: invalid syntax
I've never worked with MongoDB or used regexes, so I'm at a complete loss here. I've scoured the Mongo docs, but nothing I've tried works, so any help from anyone would be greatly appreciated.
By the looks of it you need to wrap the keys in quotes as well, so they are plain Python strings; neither Tweet nor $regex is a valid bare name in Python:
data = list(posts.find({"Tweet": {"$regex": input}}))

mongodb looping collection + save, objects returned several times

I'm writing a pretty big migration and had this code (coffeescript):
db.users.find().forEach (user)->
  try
    #some code changing the user depending on the old state
    db.users.save(user)
    print "user_ok: #{user._id}"
  catch error
    print "user_error: #{user._id}, error was: #{error}"
Some errors occurred, but they occurred on users that had already been processed:
user_ok: user_1234
#many logs
user_error: user_1234 ...
How come the loop picks up already-processed objects?
I ended up doing:
backup = { users: [] }

db.users.find().forEach (user)->
  try
    #some code changing the user depending on the old state
    backup.users.push user
    print "user_ok: #{user._id}"
  catch error
    print "user_error: #{user._id}, error was #{error}"

#loop backup and save
And it works nicely now, but it seems really weird. What's the reason behind all that, please?
When you modify an object, it might be moved by the database, so the database needs to take additional care to remember which objects have been visited already. This feature is called snapshotting; you can ask for a snapshotted query using
db.collection.find().snapshot()
However, even this doesn't make guarantees about objects that were inserted or deleted during the cursor iteration. A few more caveats are explained in the link to the documentation.
Another option is to perform an $orderby on a unique index whose values are never modified. Ideally, that index is also monotonic, so if you are using ObjectIds as primary keys then the _id field comes in pretty handy, like
db.collection.find().sort({"_id": 1});
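For comparison, the same trick from pymongo (a sketch with made-up database, collection, and field names): iterating in _id order means a document that is updated, and possibly relocated on disk, cannot be served by the cursor a second time:

from pymongo import MongoClient

client = MongoClient()  # assumes a locally running mongod
users = client.mydb.users

# _id never changes, so walking the index in order visits each document once,
# even if an update causes the document itself to be moved.
for user in users.find().sort("_id", 1):
    users.update_one({"_id": user["_id"]}, {"$set": {"migrated": True}})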

Can not delete collection from mongodb

I cannot delete a collection from the shell.
The collection is definitely there, and my PHP script is accessing it (selecting and updating).
But when I use:
db._registration.drop()
it gives me an error:
JS Error: TypeError: db._registration has no properties (shell):1
The problem is not with deleting the collection; the problem is with accessing it, so you would not be able to update, find, or do anything with it from the shell either. As was pointed out in the MongoDB JIRA, this is a bug that appears when a collection name contains characters like _, - or .
Such names are nevertheless acceptable for collections, but they cause problems in the shell.
You can delete it in shell with this command:
db.getCollection("_registration").drop()
or this
db['my-collection'].drop()
but I would rather rename it (of course, only if that is possible and will not end up requiring a lot of changes).
You can also use:
db["_registration"].drop()
This syntax works in plain JavaScript as well.
For some reason the double quotes ("_registration") did not work for me, but single quotes ('_registration') worked.
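The same workaround exists in the drivers; from pymongo, for example (a minimal sketch, assuming a local server and a made-up database name):

from pymongo import MongoClient

db = MongoClient()["mydatabase"]
# Dictionary-style access (or drop_collection) sidesteps the attribute lookup.
db.drop_collection("_registration")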

MongoDB remove mapreduce collection

Due to an error in client code, MongoDB has created many "mr.mapreduce..." collections. How do I remove them all (by mask, maybe)?
I ran this script in the interactive shell:
function f() {
    var names = db.getCollectionNames();
    for (var i = 0; i < names.length; i++) {
        if (names[i].indexOf("mr.") == 0) {
            db[names[i]].drop();
        }
    }
}
f();
It resolved my problem.
Temporary map-reduce collections should be cleaned up when the connection which created them is closed:
map/reduce is invoked via a database command. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. map and reduce functions are written in JavaScript and execute on the server.
-- MongoDB docs
If not, you could delete them using the same method you would use to delete any other collection, though it might get a bit repetitive.
Another way to achieve the same thing is this snippet:
db.system.namespaces.find({name: /tmp.mr/}).forEach(function(z) {
    try {
        db.getMongo().getCollection(z.name).drop();
    } catch (err) {}
});
Pro: It won't try to collect all your namespaces into a JavaScript Array. MongoDB segfaults on too many namespaces.
Temporary map-reduce collections should be cleaned up when the connection which created them is closed. However, sometimes they remain behind and increase the database size. You can remove them using the following script:
var database = db.getSiblingDB('yourDatabaseName');
var tmpCollections = database.getCollectionInfos({
    name: {$regex: /tmp\.mr/},
    'options.temp': true,
});

tmpCollections.forEach(function (collection) {
    database.getCollection(collection.name).drop();
});

print(`There were ${tmpCollections.length} tmp collections deleted`);
If the script above is saved as drop.tmp.js, it can be executed from the command line as follows:
mongo --quiet mongodb://mongodb:27137 drop.tmp.js
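The same cleanup also works from a driver; here is a minimal pymongo sketch (the connection string is a placeholder, while the database name and the tmp.mr prefix match the shell script above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["yourDatabaseName"]

# Drop every collection whose name marks it as a leftover map-reduce temp.
for name in db.list_collection_names():
    if name.startswith("tmp.mr"):
        db.drop_collection(name)
        print(f"dropped {name}")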