I have similar data in 5 collections in MongoDB. The documents look like this:
{
    "_id" : ObjectId("53490030cf3b942d63cfbc7b"),
    "uNr" : "abdc123abcd"
}
I want to iterate through each collection and check whether the same uNr appears in other collections. If it does, I want to write that uNr together with a count to a new collection. For example, if a uNr appears in 3 collections, the result should be {"uNr" : "abcd123", "count": 3}.
If your total number of uNr values is small enough to fit in memory (at most a few million of them), you can total them client-side with a Counter and store them in a MongoDB collection:
from collections import Counter
from pymongo import MongoClient, InsertOne

db = MongoClient().my_database
counts = Counter()

for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    for doc in collection.find():
        counts[doc['uNr']] += 1

# Empty the target collection.
db.counts.delete_many({})
inserts = [InsertOne({'_id': uNr, 'n': cnt}) for uNr, cnt in counts.items()]
db.counts.bulk_write(inserts)
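If you want to move less data over the wire with this first approach, you can also let each collection pre-aggregate its own counts server-side and only merge the per-collection totals in Python. A minimal sketch, using the same hypothetical collection names:

from collections import Counter
from pymongo import MongoClient

db = MongoClient().my_database
counts = Counter()

for collection in [db.collection1, db.collection2, db.collection3]:
    # $group on the server, so only one document per distinct uNr is returned.
    for doc in collection.aggregate([{'$group': {'_id': '$uNr', 'n': {'$sum': 1}}}]):
        counts[doc['_id']] += doc['n']

From here the combined counts can be written to db.counts exactly as above.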
Otherwise, query a thousand uNr values at a time and update counts in a separate collection:
from pymongo import MongoClient, UpdateOne, ASCENDING

db = MongoClient().my_database

# Empty the target collection.
db.counts.delete_many({})
db.counts.create_index([('uNr', ASCENDING)])

for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    cursor = collection.find(no_cursor_timeout=True)
    # "with" statement helps ensure cursor is closed, since the server will
    # never auto-close it.
    with cursor:
        updates = []
        for doc in cursor:
            updates.append(UpdateOne({'_id': doc['uNr']},
                                     {'$inc': {'n': 1}},
                                     upsert=True))
            if len(updates) == 1000:
                db.counts.bulk_write(updates)
                updates = []
        if updates:
            # Last batch.
            db.counts.bulk_write(updates)
var product = db.GetCollection<Product>("Product");

var lookup1 = new BsonDocument(
    "$lookup",
    new BsonDocument {
        { "from", "Variant" },
        { "localField", "Maincode" },
        { "foreignField", "Maincode" },
        { "as", "variants" }
    }
);

var pipeline = new[] { lookup1 };
var result = product.Aggregate<Product>(pipeline).ToList();
The data in the collection is very large, so it takes me 30 seconds to put the data in the list.
What should I do to make the lookup faster?
What that query is doing is retrieving every document from the Product collection and then, for each document found, performing a find query in the Variant collection. If there is no index on the Maincode field in the Variant collection, it will be reading the entire collection for each document.
This means that if there are, say, 1000 total products, with 3000 total variants (3 per product, on average), this query will be reading all 1000 documents from Product, and if that index isn't there, it would read all 3000 documents from Variant 1000 times, i.e. it will be examining 3 million documents.
Some ways to possibly speed this up:
create an index on {Maincode: 1} in the Variant collection (see the sketch after this list)
This will reduce the number of documents that must be read in order to complete the lookup
change the schema
If the variants are stored in the same document as the product, there is no need for a lookup
filter the products prior to the lookup
Again, reducing the documents read during the lookup
use a cursor to retrieve the documents in batches
If you perform any necessary sorting first, and the lookup last, you can return the documents to the application in batches, which would allow the application to display or begin processing the first batch before the second batch is available. This doesn't make the query itself faster, but it can reduce the perceived wait in the application.
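For the first suggestion, a minimal sketch of creating that index from pymongo (the database name is an assumption; the same index can equally be created from the mongo shell or from the C# driver):

from pymongo import MongoClient, ASCENDING

db = MongoClient().my_database  # assumed database name

# With this index in place, each $lookup can probe Variant by Maincode
# through the index instead of scanning the whole collection.
db.Variant.create_index([('Maincode', ASCENDING)])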
I am using pymongo to do a bulk update.
The names list below is a distinct list of names (each name might have multiple documents in the collection).
Code 1:
bulk = db.collection.initialize_unordered_bulk_op()
for name in names:
    bulk.find({"A": {"$exists": False}, 'Name': name}).update({"$set": {'B': b, 'C': c, 'D': d}})
print bulk.execute()
Code 2:
bulk = db.collection.initialize_unordered_bulk_op()
counter = 0
for name in names:
    bulk.find({"A": {"$exists": False}, 'Name': name}).update({"$set": {'B': b, 'C': c, 'D': d}})
    counter = counter + 1
    if (counter % 100 == 0):
        print bulk.execute()
        bulk = db.collection.initialize_unordered_bulk_op()
if (counter % 100 != 0):
    print bulk.execute()
I have 50000 documents in my collection.
If I get rid of the counter and if statement (Code 1), the code gets stuck!
With the if statement (Code 2), I am assuming this operation shouldn't take more than a couple of minutes but it is taking way more than that! Can you please help me make it faster or am I wrong in my assumption?!
You most likely forgot to add indexes to support your queries!
This triggers a full collection scan for each of your operations, which is painfully slow (as you noticed).
The following code tests the update with update_many and with the bulk operations, both without and with indexes on the 'name' and 'A' fields. The numbers you get speak for themselves.
Note: I was not patient enough to run the no-index case for 50000 documents, so I used 10000 documents instead.
Results for 10000 are:
without index and update_many: 38.6 seconds
without index and bulk update: 28.7 seconds
with index and update_many: 3.9 seconds
with index and bulk update: 0.52 seconds
For 50000 documents with the index added it takes 2.67 seconds. I ran the test on a Windows machine, with MongoDB running in Docker on the same host.
For more information about indexes see https://docs.mongodb.com/manual/indexes/#indexes. In short: indexes are kept in RAM and allow for fast querying and lookup of documents. Indexes have to be chosen specifically to match your queries.
from pymongo import MongoClient
import random
from timeit import timeit

col = MongoClient()['test']['test']
col.drop()  # drop the collection 'test' if it already exists

docs = []
# Initialize 10000 documents, using a random number between 0 and 1 converted
# to a string as the name. For documents with a name > 0.5, add the key A.
for i in range(0, 10000):
    number = random.random()
    if number > 0.5:
        doc = {'name': str(number),
               'A': True}
    else:
        doc = {'name': str(number)}
    docs.append(doc)

col.insert_many(docs)  # insert all documents into the collection
names = col.distinct('name')  # get all distinct values for the key name from the collection

def update_with_update_many():
    for name in names:
        # The field is 'name' here, matching the documents inserted above.
        col.update_many({'A': {'$exists': False}, 'name': name},
                        {'$set': {'B': 1, 'C': 2, 'D': 3}})

def update_with_bulk():
    bulk = col.initialize_unordered_bulk_op()
    for name in names:
        bulk.find({'A': {'$exists': False}, 'name': name}).\
            update({'$set': {'B': 1, 'C': 2, 'D': 3}})
    bulk.execute()

print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))

col.create_index('A')  # this adds an index on key A
col.create_index('name')  # this adds an index on key name

print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))
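As a side note, initialize_unordered_bulk_op is deprecated and has been removed in newer PyMongo releases; the replacement is bulk_write. A minimal sketch of the same bulk update under the assumptions of the test above:

from pymongo import MongoClient, UpdateMany

col = MongoClient()['test']['test']
names = col.distinct('name')

# One UpdateMany per name, sent to the server as a single unordered batch.
requests = [
    UpdateMany({'A': {'$exists': False}, 'name': name},
               {'$set': {'B': 1, 'C': 2, 'D': 3}})
    for name in names
]
if requests:
    result = col.bulk_write(requests, ordered=False)
    print(result.modified_count)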
I would like to know how to count the number of documents in a collection. I tried the following:
var value = collection.count();
var value = collection.find().count()
var value = collection.find().dataSize()
I always get "method undefined".
Can you let me know which method to use to find the total number of documents in a collection?
Switch to the database where your collection resides using the command:
use databasename;
Then invoke the count() function on the collection in the database.
var value = db.collection.count();
Then print(value), or simply value, will give you the count of documents in the collection named collection.
Refer: http://docs.mongodb.org/v2.2/tutorial/getting-started-with-the-mongo-shell/
If you want the number of documents in a collection, then use the count method, which returns a Promise. Here's an example:
let coll = db.collection('collection_name');
coll.count().then((count) => {
    console.log(count);
});
This assumes you're using Mongo 3.
Edit: As of version 4.0.3, count is deprecated. Use countDocuments instead.
Since v4.0.3 you can use the following for better performance:
db.collection.countDocuments()
Execute the following Mongo shell commands, each as a single line:
To get the number of collections in the database
var collectionCount = 0; db.getCollectionNames().forEach(function(collection) { collectionCount++; }); print("Available collections count: " + collectionCount);
Output: Available collections count: 9
To get the document count for each collection in the database
db.getCollectionNames().forEach(function(collection) { resultCount = db[collection].count(); print("Results count for " + collection + ": "+ resultCount); });
Output:
Results count for ADDRESS: 250
Results count for APPLICATION_DEVELOPER: 950
Results count for COUNTRY: 10
Results count for DATABASE_DEVELOPER: 1
Results count for EMPLOYEE: 4500
Results count for FULL_STACK_DEVELOPER: 2000
Results count for PHONE_NUMBER: 110
Results count for STATE: 0
Results count for QA_DEVELOPER: 100
To get the dataSize of each collection in the database
db.getCollectionNames().forEach(function(collection) { size = db[collection].dataSize(); print("dataSize for " + collection + ":"+ size); });
Output:
dataSize for ADDRESS: 250
dataSize for APPLICATION_DEVELOPER: 950
dataSize for COUNTRY: 10
dataSize for DATABASE_DEVELOPER: 1
dataSize for EMPLOYEE: 4500
dataSize for FULL_STACK_DEVELOPER: 2000
dataSize for PHONE_NUMBER: 110
dataSize for STATE: 0
dataSize for QA_DEVELOPER: 100
You can simply use
Model.count({email: 'xyz@gmail.com'}, function (err, count) {
    console.log(count);
});
db.collection('collection-name').count() is now deprecated.
Instead of count(), we should now use countDocuments() and estimatedDocumentCount().
Here is an example:
db.collection('collection-name').countDocuments().then((count) => {
    console.log('Number of documents:', count);
});
To know more read the documentation:
https://docs.mongodb.com/manual/reference/method/db.collection.countDocuments/
https://docs.mongodb.com/manual/reference/method/db.collection.estimatedDocumentCount/
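For anyone counting from Python rather than the Node.js driver, the same two methods exist in PyMongo as count_documents and estimated_document_count; a minimal sketch (database and collection names are made up):

from pymongo import MongoClient

coll = MongoClient().my_database['collection_name']

exact = coll.count_documents({})          # exact count, accepts a filter document
approx = coll.estimated_document_count()  # fast estimate based on collection metadata
print(exact, approx)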
Through the MongoDB console you can see the number of documents in a collection.
1. Go to the MongoDB console and issue the command "use databasename". To start the console, go to the bin folder of your MongoDB installation and run mongo.exe. For example, if the database is myDB, the command is "use myDB".
2. Execute the command db.collection.count(). A collection is like a table in an RDBMS. For example, if your collection name is myCollection, the command is db.myCollection.count(); this will print the number of documents in the collection to the console.
db.collection.count(function(err, countData) {
    // countData holds the number of documents in the collection
});
In place of collection, use your MongoDB collection name.
I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because it matters. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. Query the original collection by the entry key, order by test_date ascending,
        #     limit to group_size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
Map-reduce the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So, the last two steps are foreach-found-remove and this is the important detail of my question, that changes everything and I had to be more specific about it - sorry.
Now, about the collection remove command. It does accept a query, but mine includes sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to screw up mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
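Since the question uses pymongo, here is a minimal sketch of that chunked deletion (the filter is a placeholder; delete_many requires PyMongo 3+):

from pymongo import MongoClient

db = MongoClient().my_database
query = {'status': 'expired'}  # placeholder filter

while True:
    # Collect the _ids of the next batch of matching documents.
    ids = [doc['_id'] for doc in db.mycoll.find(query, {'_id': 1}).limit(1000)]
    if not ids:
        break
    db.mycoll.delete_many({'_id': {'$in': ids}})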
You can remove a document directly from the MongoDB shell:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m doc collection. Documentation at (https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany)
db.collection.deleteMany(
    <filter>,
    {
        writeConcern: <document>,
        collation: <document>
    }
)
I would recommend paging if you have a large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query = {"FIELD": "XYZ", 'date': {$lt: new ISODate("2019-11-10")}};
db.COL.aggregate([
    {$match: query},
    {$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query = {"FIELD": "XYZ", 'date': {$lt: new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
    {$match: query},
    {$limit: 5}
])
cursor.forEach(function (doc) {
    db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query = {"FIELD": "XYZ", 'date': {$lt: new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.COL.deleteMany({"_id": {"$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});
We have around 20 million records in our mongodb. In my collection called 'posts' there is a field called 'id' which was supposed to be unique but now it has gotten all messed up. We just want it to be unique and there are many many duplicates now.
We just wanted to do something like iterating over every record and assigning it a unique id in a loop from 1 to 20 million.
What would be the easiest way to do this?
There are not many options here, really.
1. Pick your language and driver of choice.
2. Fetch N documents.
3. Assign unique ids to them (several options here: 1) copy _id; 2) assign a new ObjectId; 3) assign a plain integer).
4. Save those documents.
5. Fetch the next N documents. Go to step 3.
To fetch next N documents, you should note the last processed document's _id and do this:
db.collection.find({_id: {$gt: last_processed_id}}).sort({_id: 1}).limit(N);
Do not use skip here. It will be too slow.
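A sketch of those steps in pymongo, assigning plain integers and paginating on _id as described (the collection name and batch size are assumptions):

from pymongo import MongoClient, UpdateOne

posts = MongoClient().my_database.posts
batch_size = 1000
next_id = 1
last_id = None

while True:
    # Keyset pagination on _id instead of skip().
    filt = {} if last_id is None else {'_id': {'$gt': last_id}}
    batch = list(posts.find(filt, {'_id': 1}).sort('_id', 1).limit(batch_size))
    if not batch:
        break
    updates = [UpdateOne({'_id': doc['_id']}, {'$set': {'id': next_id + i}})
               for i, doc in enumerate(batch)]
    posts.bulk_write(updates)
    next_id += len(batch)
    last_id = batch[-1]['_id']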
And, of course, you can always truncate the collection, create unique index on id and populate it again.
You can use a simple script like this:
db.posts.dropIndex("*id index name here*"); // Drop the unique index

counter = 0;
page = 1;
slice = 1000;
total = db.posts.count();
conditions = {};

while (counter < total) {
    cursor = db.posts.find(conditions, {_id: true}).sort({_id: 1}).limit(slice);
    while (cursor.hasNext()) {
        row = cursor.next();
        db.posts.update({_id: row._id}, {$set: {id: ++counter}});
    }
    conditions['_id'] = {$gt: row._id};
    print("Processed " + counter + " rows");
}

print('Adding id index');
db.posts.ensureIndex({id: 1}, {unique: true, background: false});
print("done");
Save it to assignids.js, and run it as
$ mongo dbname assignids.js
The outer while selects 1000 rows at a time and prevents cursor timeouts; the inner while assigns each row a new incremental id.