I have created a map-reduce function to get all documents along with their counts.
I now need to remove all the duplicates. How should I do it?
res = col.map_reduce(map, reduce, "my_results")
This gives output like:
{u'_id': u'http://www.hardassetsinvestor.com/features/5485-soft-commodity-q4-report-low-inventories-buoy-cocoa-growing-stocks-weigh-on-coffee-cotton-a-sugar.html', u'value': 2.0}
{u'_id': u'http://www.hardassetsinvestor.com/market-monitor-archive/5490-week-in-review-gold-a-silver-kick-off-2014-strongly-oil-a-natgas-stall.html', u'value': 2.0}
Assuming you don't care which duplicate gets removed, an easy approach is to ensure a unique index with dropDups:true.
For example, assuming a field name of url:
db.collection.ensureIndex( { url: 1 }, { unique: true, dropDups: true } )
Important note from the dropDups documentation:
As in all unique indexes, if a document does not have the indexed field, MongoDB will include it in the index with a “null” value.
If subsequent documents do not have the indexed field, and you have set {dropDups: true}, MongoDB will remove these documents from the collection when creating the index. If you combine dropDups with the sparse option, this index will only include documents in the index that have the value, and the documents without the field will remain in the database.
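A rough pymongo equivalent, assuming a field named url and an old enough server (dropDups was removed in MongoDB 3.0, so this is only a sketch for legacy deployments; col is the collection from the question):
from pymongo import ASCENDING

# dropDups is passed through as an index option; it is honored only by pre-3.0 servers.
col.create_index([("url", ASCENDING)], unique=True, dropDups=True)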
You would write a small application to do this, e.g. in the shell:
db.my_results.find().forEach(function(doc) {
    if (doc.value > 1)
        db.realCollection.remove({_id: doc._id}, true);
});
The trailing true makes remove delete at most one matching document (the justOne flag).
Edit
Adding Python since the above code is hard to translate:
for doc in db.my_results.find():
    if doc['value'] > 1:
        for i in range(0, int(doc['value'])):
            db.realCollection.remove({'_id': doc['_id']}, multi=False)
Related
This is executed immediately:
db.mycollection.find({ strField: 'AAA'}).count()
And this takes a long time to finish:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $exists: true }}).count()
This is how I created my index:
db.mycollection.createIndex({strField: 1, dateTimeField: 1}, { sparse: true })
But it doesn't work, even when using hint(indexName).
Why does this happen, and how can I fix it?
The { $exists: true } query predicate is problematic, especially if there are documents in the collection for which that field does not exist.
When MongoDB creates an index entry for a document, it collects all of the field values according to the index spec, and concatenates them.
If a field is not present in the document, the index stores null in that field's position.
If the field is explicitly set to null, it also stores null in that field's position.
This means that these 2 documents will have identical entries in the index:
{ strField: 'AAA', dateTimeField: null}
{ strField: 'AAA'}
Note that even with the index being sparse, both documents will be indexed, since at least one of the indexed fields exists in each document.
When testing {dateTimeField: {$exists: true}}, the first document will match, while the second will not.
When processing a count query using an index, if the query can be satisfied by scanning a single range of the index, the query executor can use a COUNT_SCAN stage and get the correct result without loading a single document from disk.
Because the executor cannot definitively tell from the index whether or not the field exists, it cannot use a COUNT_SCAN; it must instead use an ordinary IXSCAN followed by a FETCH stage, and load all of the matching documents from disk in order to arrive at the correct count.
In the case of the first query, the executor would have been able to use a COUNT_SCAN, while the second would have had to examine all of the documents. You should be able to see this by running explain with the executionStats option on each query.
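From pymongo, one way to run that check is the explain command with executionStats verbosity (a sketch; the collection and field names are the ones from the question):
# Sketch: explain the slow count with executionStats verbosity via pymongo.
# Look for a COUNT_SCAN stage versus IXSCAN + FETCH in the winning plan.
plan = db.command(
    "explain",
    {"count": "mycollection",
     "query": {"strField": "AAA", "dateTimeField": {"$exists": True}}},
    verbosity="executionStats",
)
print(plan["executionStats"]["totalDocsExamined"])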
One way to avoid this pitfall is to take advantage of the fact that MongoDB query operators are type-sensitive. This means that the following query will match any document where dateTimeField is greater than or equal to epoch 0 and holds a date value:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $gte: new ISODate("1970-01-01T00:00:00Z") }}).count()
This will allow the query executor to count all of the documents that have the matching string and contain a date, but will exclude documents that contain a dateTimeField with a numeric or string value.
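For reference, a pymongo version of the same type-bracketed count (a sketch; count_documents requires pymongo 3.7+):
from datetime import datetime, timezone

# Matches only documents whose dateTimeField holds an actual date value;
# missing, null, numeric and string values fall outside the date type bracket.
count = db.mycollection.count_documents({
    "strField": "AAA",
    "dateTimeField": {"$gte": datetime(1970, 1, 1, tzinfo=timezone.utc)},
})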
I have a query like this:
xml_db.find({
    'high_performer': {'$nin': [some_value]},
    'low_performer': {'$nin': [some_value]},
    'expiration_date': {'$gte': datetime.now().strftime('%Y-%m-%d')},
    'source': 'some_value'
})
I have tried to create an index with those fields, but I am getting this error:
pymongo.errors.OperationFailure: cannot index parallel arrays [low_performer] [high_performer]
So, how can I run this query efficiently?
Compound index field ordering should follow the equality --> sort --> range rule. A good description of this can be found in this response.
This means that the first field in the index would be source, followed by the range filters (expiration_date, low_performer and high_performer).
As you noticed, one of the "performer" fields cannot be included in the index since only a single array can be indexed. You should use your knowledge of the data set to determine which filter (low_performer or high_performer) would be more selective and choose that filter to be included in the index.
Assuming that high_performer is more selective, the only remaining step would be to determine the ordering between expiration_date and high_performer. Again, you should use your knowledge of the data set to make this determination based on selectivity.
Assuming expiration_date is more selective, the index to create would then be:
{ "source" : 1, "expiration_date" : 1, "high_performer" : 1 }
I am querying 3 collections in MongoDB and then creating a new document by taking some fields from the documents of the 3 separate collections. For example, I am taking field 'A' from the first collection, field 'B' from the second, and field 'C' from the third.
Using them, I am creating a JSON document like:
var uploadDoc = {
    'A' : <value of A>,
    'B' : <value of B>,
    'C' : <value of C>
};
This uploadDoc is being uploaded to another collection.
Question: I wish to upload only distinct values of uploadDoc. By default, MongoDB gives each uploadDoc a unique id. How do I insert an uploadDoc into the collection only when another document with the same A, B and C values hasn't been inserted before?
I am using javascript to query the collections and create docs.
Two ways are simple:
use "upserts"
db.collection.update(uploadDoc,uploadDoc,{ "upsert": true })
Use a unique index
db.collection.ensureIndex({ "A": 1, "B": 1, "C": 1 },{ "unique": true });
db.collection.insert(uploadDoc); // Same thing fails :(
Both work. Choose one.
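For reference, a rough pymongo sketch of both options (collection name taken from the shell examples above; pymongo 3+ method names assumed):
from pymongo import ASCENDING, errors

# Option 1: upsert - inserts uploadDoc only if an identical document isn't already there.
db.collection.replace_one(uploadDoc, uploadDoc, upsert=True)

# Option 2: unique compound index - a duplicate insert raises DuplicateKeyError.
db.collection.create_index(
    [("A", ASCENDING), ("B", ASCENDING), ("C", ASCENDING)], unique=True)
try:
    db.collection.insert_one(uploadDoc)
except errors.DuplicateKeyError:
    pass  # a document with the same A, B and C values already exists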
You should use Unique Indexes: doc
You shouldn't use upsert without unique indexes:
To avoid inserting the same document more than once, only use upsert: true if the query field is uniquely indexed.
because
Consider when multiple clients issue the following update with an upsert parameter at the same time:
[cut]
If all update() operations complete the query portion before any client successfully inserts data, and there is no unique index on the name field, then each update operation may result in an insert.
from here
I have this Document:
class Store(Document):
    store_id = IntField(required=True)
    items = ListField(ReferenceField(Item, required=True))
    meta = {
        'indexes': [
            {
                'fields': ['campaign_id'],
                'unique': True
            },
            {
                'fields': ['items']
            }
        ]
    }
I want to set up indexes on items and store_id. Is my configuration right?
Your second index declaration looks like it should do what you want. But to make sure that the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find-query which should use that index followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis: the time in milliseconds the query required
indexOnly: whether the query could be answered from the index alone
n: the number of returned documents
nscannedObjects: the number of documents which had to be examined. For an index-only query this should be equal to n. When it is higher, it means that some documents could not be excluded by an index and had to be scanned manually.
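If you would rather check from Python, here is a minimal sketch using pymongo on the underlying collection (assuming mongoengine's default collection name store; note that the fields above come from the legacy pre-3.0 explain format, newer servers report executionStats such as totalDocsExamined instead):
# Sketch: explain the items lookup through pymongo (collection name assumed).
plan = db.store.find({'items': some_item_id}).explain()
print(plan)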
I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because they matter. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. Query the original collection by the entry key, order it by test_date ascending,
        #     limit to the group size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
MR the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So the last two steps are find-then-remove-each, and this is the important detail of my question that changes everything; I should have been more specific about it - sorry.
Now, about the collection remove command: it does accept a query, but mine includes sorting and limiting. Can I do that with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to mess up mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, the find-then-remove-each approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents:
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say, you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
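A minimal pymongo sketch of that chunked approach (the batch size is an arbitrary assumption; mycoll and query are as in the question, and delete_many requires pymongo 3+):
BATCH = 1000

while True:
    # Grab up to BATCH matching _ids, then delete exactly those documents.
    ids = [doc["_id"] for doc in mycoll.find(query, {"_id": 1}).limit(BATCH)]
    if not ids:
        break
    mycoll.delete_many({"_id": {"$in": ids}})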
You can remove the documents directly from the MongoDB shell:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6M documents in a 100M-document collection. Documentation: https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany
db.collection.deleteMany(
    <filter>,
    {
        writeConcern: <document>,
        collation: <document>
    }
)
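From pymongo (3.0+), the equivalent is delete_many, which also reports how many documents were removed, e.g.:
result = mycoll.delete_many(query)
print(result.deleted_count)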
I would recommend paging if there is a large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query= {"FEILD":"XYZ", 'DATE': {$lt:new ISODate("2019-11-10")}};
db.COL.aggregate([
{$match:query},
{$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query= {"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
{$match:query},
{ $limit : 5 }
])
cursor.forEach(function (doc){
db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query={"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.tags.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});