Duplicate Documents in MongoDB

I'm running into an issue where duplicate documents are sporadically inserted into my MongoDB collection. It's only happened a handful of times and, in all cases, the duplicates were created within the same second as the original. My original guess was that I needed to add a unique index on a field, but I wasn't sure that would necessarily prevent duplicates created nearly simultaneously, though maybe I'm overthinking that.
Is there any possible reason I could be seeing duplicate documents in MongoDB other than the lack of a unique index?
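For what it's worth, a unique index does prevent near-simultaneous duplicates: uniqueness is enforced atomically by the storage engine, so the second of two concurrent inserts fails even within the same second. A minimal mongosh sketch, where requestId is just a made-up example of a deduplication field:
db.collection.createIndex({ requestId: 1 }, { unique: true });
db.collection.insertOne({ requestId: "abc-123", payload: "first" });  // succeeds
db.collection.insertOne({ requestId: "abc-123", payload: "second" }); // fails with an E11000 duplicate key error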

Related

MongoDB duplicate key in single document

Relative noob here.
I have a MongoDB collection, where it appears one of the keys has been duplicated within a document. The document count is 13603, but aggregating by the key and counting results in 13604. I have run this 3 times, 30 minutes apart, so I know it's not a timing issue. I am trying to find the document with the duplicate key, but don't understand aggregations enough to find it.
I found a similar thread here but I see no solution for finding the "corrupt" document within a collection.
This is NOT a duplicate key across documents or a duplicate document issue; it is a duplicate key within the same document issue. Any help is appreciated.
[Screenshot comparing the document count to the key-aggregation count]
The result you get from this is most probably incorrect:
db.collection.count()
Check this ticket here
Try to count with this:
db.collection.countDocuments({})
db.collection.count() just reads the count from the collection's metadata. It is fast but inaccurate, especially in a sharded cluster, since there can be orphaned documents that are not reflected in the collection metadata; you need to clean up the orphans and then try again.
From the documentation:
Avoid using the db.collection.count() method without a query predicate since without the query predicate, the method returns results based on the collection's metadata, which may result in an approximate count. In particular,
On a sharded cluster, the resulting count will not correctly filter out orphaned documents.
After an unclean shutdown, the count may be incorrect.
For counts based on collection metadata, see also collStats pipeline stage with the count option.
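For reference, that metadata-based count via the aggregation pipeline looks like this (the collection name is a placeholder):
db.collection.aggregate([
  { $collStats: { count: {} } }
]);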

Purge documents in MongoDB without impacting the working set

We have a collection of documents, and each document has a time window associated with it (for example, fields like 'fromDate' and 'toDate'). Once a document has expired (i.e. its toDate is in the past), it isn't accessed anymore by our clients.
So we wanted to purge these documents to reduce the number of documents in the collection and thus make our queries faster. However, we later realized that this past data could be important for analyzing how the data changes over time, so we decided to archive it instead of purging it completely. This is what I've come up with (see the sketch after this list):
Let's say we have a "collectionA" which has past versions of documents
Query all the past documents in "collectionA". (Queries are made on a secondary server.)
Insert them into a separate collection called "collectionA-archive".
Delete the documents from "collectionA" that were successfully inserted into the archive.
Delete documents in "collectionA-archive" that meet a certain condition. (We do not want to keep a huge archive.)
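Roughly, in mongosh (the toDate field is from our schema above; the 90-day archive retention is just an example cutoff):
// 1. Find expired documents (run with a secondary read preference if desired)
const expired = db.collectionA.find({ toDate: { $lt: new Date() } }).toArray();
if (expired.length > 0) {
  // 2. Copy them into the archive collection
  db.getCollection("collectionA-archive").insertMany(expired);
  // 3. Delete only the documents that were successfully archived
  db.collectionA.deleteMany({ _id: { $in: expired.map(d => d._id) } });
}
// 4. Trim the archive; the 90-day window is an assumed retention policy
const retention = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);
db.getCollection("collectionA-archive").deleteMany({ toDate: { $lt: retention } });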
My question here is: even though I'm making the queries on a secondary server, the insertions happen on the primary, so do the documents inserted into the archive collection make it into the primary's working set? The last thing we need is these past documents sitting in the primary's RAM, which could affect the performance of our live API.
I know one solution could be to insert the past documents into a separate DB server, but acquiring another server is a bit of a hassle, so I would like to know if this is achievable within one server.

MONGODB - Add duplicate field with different value

Is there a way to write a script that updates a document by adding a duplicate field with a different value? I cannot use $set as that replaces the existing value. I cannot use $push as the field is in an object, not an array. I even tried creating the new field under a different name and then renaming it, but that also replaces the existing field.
You cannot have duplicate fields in a Mongo record. A Mongo collection is a collection of documents, otherwise known as objects. You cannot have a duplicate field in an object and Mongo is no different.
MongoDB (and any other database that I have come across so far) is built around the idea that individual fields are identifiable so they can be filtered by, grouped by, sorted by, etc... That also explains why MongoDB does not provide support for the scenario you're facing. That being said, MongoDB can be used as a dumb datastore for arbitrary JSON data. And the JSON specification does not say anything about duplicate field names which is probably why you can actually store such a document in MongoDB in the first place.
Anyway, there is no way to achieve what you want without loading the entire document, changing it (by adding the duplicate field(s)) and then replacing the whole document. That, however, will work.
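To sketch the load-and-replace step in mongosh: note that shell and driver document types deduplicate keys on their own, so actually injecting the duplicate field would have to happen at the raw BSON level, which is not shown here.
// someId is a placeholder for the _id of the document you want to change
const doc = db.collection.findOne({ _id: someId });
doc.extraField = "new value"; // in plain JS this overwrites rather than duplicates an existing key
db.collection.replaceOne({ _id: someId }, doc);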
I personally cannot think of a reasonable scenario where this sort of document could make sense, though. So I would strongly suggest you revisit your document structure.

Get nth item from a collection

I'm in the learning phase of mongodb.
I have a test website project where each step of a story lives at domain.com/step.
For instance, step 14 is accessed through domain.com/14.
In other words, for the above case, I need to access the 14th document in my collection to serve it.
I've been using the find().skip(n).limit(1) method so far to return the nth document, but it becomes extremely slow when there are many documents to skip. So I need a more efficient way to get the nth document in my collection.
Any ideas are appreciated.
Add a field to your documents which tells you which step it is, add an index to that field and query by it.
Document:
{
  step: 14,
  text: "text",
  date: date,
  imageurl: "imageurl"
}
Index:
db.collection.createIndex({ step: 1 });
Query:
db.collection.find({ step: 14 });
Relying on natural order in the collection is not just slow (as you found out), it is also unreliable. When you start a new collection and insert a bunch of documents, you will usually find them in the order you inserted them. But when you change documents after they were inserted, it can happen that the order gets messed up in unpredictable ways. So never rely on insertion order being consistent.
Exception: Capped Collections guarantee that insertion order stays consistent. But there are very few use-cases where these are useful, and I don't think you have such a case here.
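For reference, a capped collection must be created explicitly with a fixed size, e.g.:
db.createCollection("log", { capped: true, size: 1048576 }); // size (in bytes) is required for capped collections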

Mongodb - add if not exists (non id field)

I have a large number of records to iterate over (coming from an external data source) and then insert into a mongo db.
I do not want to allow duplicates. How can this be done in a way that will not affect performance?
The number of records is around 2 million.
I can think of two fairly straightforward ways to do this in mongodb, although a lot depends upon your use case.
One, you can use the upsert:true option to update, using whatever you define as your unique key as the query for the update. If it does not exist it will insert it, otherwise it will update it.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
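A sketch of this in mongosh, where externalId stands in for whatever your unique key is and doc is one incoming record:
db.records.updateOne(
  { externalId: doc.externalId }, // your unique key as the query
  { $set: doc },                  // updates on match, inserts otherwise
  { upsert: true }
);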
Two, you could just create a unique index on that key and then insert, ignoring the errors generated. Exactly how to do this depends somewhat on the language and driver used, along with the version of mongodb. This has the potential to be faster when performing batch inserts, but YMMV.
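A sketch of this second approach, again with externalId standing in for your unique key; passing ordered: false lets a batch insert continue past duplicate key errors instead of stopping at the first one:
db.records.createIndex({ externalId: 1 }, { unique: true });
try {
  db.records.insertMany(batch, { ordered: false }); // batch is a placeholder array of records
} catch (e) {
  // E11000 duplicate key errors are expected for records that already exist; ignore them
}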
2 million is not a huge number and will not affect performance; splitting your record fields into different collections would be good enough.
I suggest creating a unique index on your unique key before inserting into MongoDB.
A unique index will filter out the redundant data; you will lose some (duplicate) records, and you can ignore the errors.