The objective is to find field values that exist in more than one collection in a single MongoDB database. Assume each collection has a similar document model in terms of the type and number of fields.
Note: there is a unique id field in every collection, whose value may or may not differ across the other collections. The aim is to deduce all collections that have these unique id values in common.
One solution is the brute-force technique: traverse each collection one by one and match every unique id value against each of those in the other collections.
Are there any better solutions available?
There is no built-in solution for this in MongoDB. Data is supposed to be embedded, and there's usually no real correlation between collections; even $lookup was introduced with some reluctance. I believe your solution is already the best there is.
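For illustration, a minimal mongo shell sketch of that brute-force idea, assuming the shared identifier field is called uid (a made-up name; substitute your own):

var seen = {};  // uid value -> names of collections it appears in
db.getCollectionNames().forEach(function (name) {
    db.getCollection(name).find({}, { uid: 1 }).forEach(function (doc) {
        seen[doc.uid] = (seen[doc.uid] || []).concat(name);
    });
});
// report uid values shared by more than one collection
Object.keys(seen).forEach(function (uid) {
    if (seen[uid].length > 1) print(uid + " -> " + seen[uid].join(", "));
});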
ObjectId _id <--- index
String UserName
int Points <--- Descending index
Using this document structure as a simple example, we have a collection of users, each with a name and a "points" value. The collection has the usual _id index but also a "descending index" on Points.
Problem
The sample use case would be maintaining a ranking scoreboard (something like the League of Legends/DotA ranking systems or the chess Elo system). Each user's Points field would be constantly changing, but the scoreboard is viewed very frequently and thus needs to be accurately maintained.
My current unoptimized solution
I'm not sure what "ascending/descending sort order" means in the Mongo docs, but apparently it doesn't matter for single-field indexes anyway.
So currently I'm just using the very brute-force solution of sorting the collection each time a user's Points field gets updated. At least it's indexed, so for a smaller userbase this shouldn't be too bad. However, sorting the entire userbase on each update/insertion just seems wrong in general.
Other things I'm considering
There are data structures traditionally used for maintaining order during insert/update, such as search trees, but implementing one without putting the entire collection in memory seems like a huge project in itself.
I tried to search for some built-in functionality of Mongo indexes that automatically maintains order in the collection for you, but I couldn't really find anything like that.
Maybe some logic to re-sort only a chunk of documents directly above and below the insertion/update? This solution seems pretty dependent on the expected spread of Points across the userbase and on the use cases of this system.
You don't need to additionally sort data covered by already-created indexes. When you create an index in MongoDB, you specify the direction it should be sorted in (ascending (1) or descending (-1)), so when you query multiple documents on that field, the result is already sorted according to the index order.
Of course, you can explicitly specify whether you need the result in reverse order or sorted by another field.
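To make that concrete, a small shell sketch using the users collection and Points field from the question ("alice" is a placeholder name):

// build the descending single-field index once; for a single-field index
// the direction barely matters, since MongoDB can walk it either way
db.users.createIndex({ Points: -1 });
// top-10 scoreboard: the sort matches the index, so no in-memory sort happens
db.users.find().sort({ Points: -1 }).limit(10);
// one user's rank can be derived on demand by counting higher scores
var myPoints = db.users.findOne({ UserName: "alice" }).Points;
var rank = db.users.countDocuments({ Points: { $gt: myPoints } }) + 1;  // count() on shells older than 4.0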
I'm building a contest application, which has 4 collections so far:
contest
questions
matches
users
I want to store every user's score for every match he's assigned to. But I really can't find a proper way to achieve this.
All I've come up with is to replace matches in users with an array in which each element contains a reference to the matches collection and a score field. But I think this is not very efficient.
EDIT
I was thinking about another solution: a separate collection called scores that contains three fields: user, match and score.
Here's my schema structure:
Contests:
Questions:
Matches:
Users:
Note: any recommended adjustments to the current design are welcome too.
Since MongoDB is not designed to support relationships between collections, you might end up with some duplicated work. I would suggest you find a way of storing as much data as you can in a single document.
Your scores would go in each match document; the users array would probably have this structure: {'users':[{user_id:'xxx',score:xxx},{user_id:'xxx',score:xxx}]}
The other solution would be what you say: to have in each user document a matches array with a structure like this: {'matches':[{match_id:'xxx',score:xxx},{match_id:'xxx',score:xxx}]}
You can also have both; this might be more efficient depending on the kind of queries you will need to do. You can also have a field in the subdocuments that stores the user/match name/title.
Note: as you can see, you have two solutions; either you optimize for document size (so you can store more) or you optimize for performance (so you can read faster/with fewer resources). The sketch below illustrates the first one.
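As a rough illustration of the first structure, a score update can target the embedded array entry with the positional operator, and a per-match leaderboard can be read straight out of the same document (matchId and userId are placeholder values):

// bump one user's score inside a match document
db.matches.updateOne(
    { _id: matchId, "users.user_id": userId },  // match both the document and the array entry
    { $inc: { "users.$.score": 10 } }           // $ refers to the matched array element
);
// per-match leaderboard from the embedded array
db.matches.aggregate([
    { $match: { _id: matchId } },
    { $unwind: "$users" },
    { $sort: { "users.score": -1 } }
]);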
Hope this is of some help.
According to the MongoDB documentation, the _id field (if not specified) is automatically assigned a 12-byte ObjectId.
It says a unique index is created on this field on the creation of a collection, but what I want to know is how likely it is that two documents in different collections, but in the same database instance, will have the same id, if that can even happen.
I want my application to be able to retrieve a document using just the _id field, without knowing which collection it is in, but if I cannot guarantee uniqueness based on the way MongoDB generates ObjectIds, I may need to look for a different way of generating ids.
Short answer to your question: yes, that's possible.
The post below, on a similar topic, will help you understand this better:
Possibility of duplicate Mongo ObjectId's being generated in two different collections?
You are not required to use a BSON ObjectId for the _id field. You could use a hash of a timestamp and some random number, or a field with extremely high cardinality (a US SSN, for example), in order to make it close to impossible that two objects in the world will share the same id.
The _id index only requires the _id to be unique per collection, much like in an RDBMS, where two objects in two tables may very well have the same primary key when it's an auto-incremented integer.
You can not retrieve a document solely by its _id. Every driver I am aware of requires you to explicitly name the collection.
My 2 cents: the only thing you could do is to manually iterate over the existing collections and query for the _id you are looking for. Which is... inefficient, to put it politely. I'd rather semantically distinguish the documents in question by an additional field than by the collection they belong to. And remember, MongoDB uses dynamic schemas, so there is no reason to separate documents which semantically belong together but have a different set of fields. I'd guess there is something seriously, dramatically wrong with your schema. Please elaborate so that we can help you with that.
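For what it's worth, that manual iteration is only a few lines in the shell; this is purely illustrative:

function findAnywhere(id) {
    var hit = null;
    db.getCollectionNames().forEach(function (name) {
        if (hit) return;  // stop looking once found
        var doc = db.getCollection(name).findOne({ _id: id });
        if (doc) hit = { collection: name, document: doc };
    });
    return hit;  // null when no collection contains the _id
}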
What I mean by global search is searching for documents in specified collections, for example, searching for a name in both the User and Organization collections and returning both user and organization documents that match the criteria.
Is it possible to simply copy the documents in User and Organization into another collection and do a search in it?
No, it is not possible to do a multi-collection search automatically. There's no reason however that you couldn't perform the same query on multiple collections and combine the results.
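A minimal sketch of that, assuming both collections expose the searchable field as name (adjust to your schema):

var criteria = { name: /smith/i };  // example search term
var hits = db.User.find(criteria).toArray()
    .concat(db.Organization.find(criteria).toArray());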
While you could duplicate the data into another collection for query purposes, if you need a guarantee that the source collection's values match the "index" collection identically, you'll need to implement your own multi-phase transaction (example), as MongoDB doesn't have a multi-collection atomic commit. Or, you can accept the fact that the "index" collection may be out of sync; of course, it could be periodically updated through custom code. Further, it means your working set has increased, as you're storing the data twice. Also, if you then need to grab data from the individual collections (to fetch more of the source document), you've likely not gained anything, and made things worse, compared to doing multiple queries in the first place.
You could store related documents in the same collection and take advantage of the built-in indexing offered. This comes with the caveat that if your documents now have mixed types, you may find it more challenging to build efficient MongoDB indexes, and every changed or new document must go through the indexing pipeline, which may introduce significant overhead.
Without knowing your requirements more deeply: if it's only a few collections, I'd just do multiple searches. Second best would be to combine the documents into a single collection. Copying the data would be my last choice.
I'd like to code a web app where most of the sections depend on the user profile (for example, different to-do lists per person, etc.), and I'd love to use MongoDB. I was thinking of creating about 10 embeddable documents for the main profile document and keeping everything related to one user inside his own document.
I don't see a clear way of using foreign keys in MongoDB. The only way would be to create a field like to_do_id of type ObjectId, but the documents would be totally unrelated internally; they would just happen to have the same ids I'd have to query for.
Is there a limit on the number of embedded document types inside a top level document that could degrade performance?
How do you guys solve the issue of having a central profile document that most of the documents have to relate to in presenting a view per person?
Do you use semi-foreign keys inside MongoDB, i.e. fields of type ObjectId that hold some other document's unique id, instead of embedding?
I can't get a feel for which approach should be taken when. Thank you very much!
There is no particular limit with respect to performance. The larger the document, though, the longer it takes to transmit over the wire, and the whole document is always retrieved.
I do it with references. You can choose between simple manual references and the database DBRef as per this page: http://www.mongodb.org/display/DOCS/Database+References
The link above documents how to have references in a document in a semi-foreign key way. The DBRef might be good for what you are trying to do, but the simple manual reference is very efficient.
I am not sure a general rule of thumb exists for which reference approach is best. Since I use Java or Groovy mostly, I like the fact that I get a DBRef object returned. I can check for this datatype and use that to decide how to handle the reference in a generic way.
So I tend to use a simple manual reference for references to different documents in the same collection, and a DBRef for references across collections.
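To make the two styles concrete, here is a hedged shell sketch; the profiles/todos collection names are invented for the example:

// manual reference: store just the ObjectId and resolve it with a second query
var profileId = db.profiles.insertOne({ name: "alice" }).insertedId;
db.todos.insertOne({ profile_id: profileId, text: "buy milk" });
var todo = db.todos.findOne({ text: "buy milk" });
var owner = db.profiles.findOne({ _id: todo.profile_id });

// DBRef: the reference also records which collection it points into
db.todos.insertOne({ profile: new DBRef("profiles", profileId), text: "water plants" });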
I hope that helps.