MongoDB database with multiple unique indexes in sharded config

I have a "users" collection in a database in mongodb. I will be using sharding.
Other than the standard _id, there are 3 other fields that need to be indexed and unique: username, email, and account number. Not all of these fields will necessarily exist on every user document; in some cases none of them will.
I'd like the fields to be indexed because users will be looked up frequently by one of these fields. I'd like the fields to be unique because this is a requirement and I'd rather not handle this logic client-side.
I understand that MongoDB has limitations, just like any other database, but I'm hoping there's a solution, because this is a fairly common setup for web applications.
Is there an elegant solution for this scenario?
Not sure if it matters for this question (because the question pertains to database structure), but I am using the official mongodb C# driver.

The official MongoDB documentation says that a sharded collection can have only one unique index (on the shard key), and requires the unique field(s) to exist. But it also says that you have the option of other unique indexes if and only if the shard key is a prefix of their fields. So you can try this, but be aware that the unique key fields must always exist (see the sketch below).
I don't understand the business logic under which no information about the user would exist. In that case you can shard by _id and perform the uniqueness checks manually.
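A minimal sketch of the shard-key-prefix rule in mongosh (mydb is a placeholder database name, and sharding on username is just one possible choice, not something from the question):

```javascript
// Shard "users" on username with a ranged (not hashed) shard key and make
// the shard key index unique; hashed shard keys cannot back a unique index.
sh.shardCollection("mydb.users", { username: 1 }, true)

// Also allowed: a unique compound index with the shard key as a prefix.
db.users.createIndex({ username: 1, email: 1 }, { unique: true })

// NOT allowed on a sharded collection: a unique index that omits the
// shard key, so this would be rejected:
// db.users.createIndex({ email: 1 }, { unique: true })

// Caveat: a missing field is indexed as null, so two documents that both
// lack username would collide under the unique shard key index.
```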

Related

MongoDB best practices for choosing _id field type

I'm about to write a webapp based on Node.js and MongoDB. This webapp will have users stored in a Users collection. My question is quite simple: since the username will be unique, is it a good idea to change the _id field type from an ObjectId to a String that would contain the username?
I think it would be more convenient and would somehow reduce the number of requests needed to perform several tasks. Typically if I use the username as an _id, references to a User in other tables will directly give me the username and not an ObjectId that I'd need to convert into an intelligible name before displaying it.
Moreover, having two different fields (an ObjectId _id plus a username) would require two indexes to speed up lookups on either ids or usernames, and would take more space in the database.
Last, if I understood correctly the documentation, on a sharded environment, MongoDB will only be able to ensure uniqueness of the _id field.
BUT I may be missing things that I'd realize (and deeply regret) later.
Can you please shed some light on this?
Thanks!
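A minimal sketch of the idea in mongosh (the posts collection and field names are assumptions for illustration):

```javascript
// The username itself is the _id: unique and indexed by default, and
// references elsewhere carry a readable name with no extra lookup.
db.users.insertOne({ _id: "alice", email: "alice@example.com" })
db.posts.insertOne({ title: "Hello", author: "alice" })  // author is a users._id
```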

In MongoDB, how likely is it that two documents in different collections in the same database will have the same Id?

According to the MongoDB documentation, the _id field (if not specified) is automatically assigned a 12-byte ObjectId.
It says a unique index is created on this field when a collection is created, but what I want to know is how likely it is that two documents in different collections, but still in the same database instance, will have the same ID, if that can even happen?
I want my application to be able to retrieve a document using just the _id field without knowing which collection it is in, but if I cannot guarantee uniqueness based on the way MongoDB generates one, I may need to look for a different way of generating Id's.
The short answer to your question is: yes, that's possible.
The post below, on a similar topic, will help you understand this better:
Possibility of duplicate Mongo ObjectId's being generated in two different collections?
You are not required to use a BSON ObjectId for the _id field. You could use a hash of a timestamp and some random number, or a field with extremely high cardinality (a US SSN, for example), in order to make it close to impossible that two objects in the world will share the same id.
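A quick sketch of the "timestamp plus randomness" idea (plain string concatenation rather than a real hash, and makeId is a hypothetical helper, just to illustrate a custom _id with very low collision odds):

```javascript
// Hex timestamp plus a random hex suffix as a custom string _id.
function makeId() {
  return Date.now().toString(16) + "-" + Math.floor(Math.random() * 1e12).toString(16);
}

db.events.insertOne({ _id: makeId(), type: "login" })
```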
The _id index requires the _id to be unique per collection, much like in an RDBMS, where two objects in two tables may very well have the same primary key when it's an auto-incremented integer.
You cannot retrieve a document solely by its _id. Every driver I am aware of requires you to explicitly name the collection.
My 2 cents: the only thing you could do is to manually iterate over the existing collections and query each for the _id you are looking for, which is... inefficient, to put it politely (see the sketch below). I'd rather distinguish the documents in question semantically, by an additional field, than by the collection they belong to. And remember, MongoDB uses dynamic schemas, so there is no reason to separate documents which semantically belong together but have a different set of fields. I'd guess there is something seriously, dramatically wrong with your schema. Please elaborate so that we can help you with that.
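To make the inefficiency concrete, a brute-force lookup might look like this in mongosh (findById is a hypothetical helper, not a driver API):

```javascript
// Scans every collection in the current database for a matching _id.
// Costs one query per collection on every lookup -- avoid in real code.
function findById(id) {
  for (const name of db.getCollectionNames()) {
    const doc = db.getCollection(name).findOne({ _id: id });
    if (doc !== null) {
      return { collection: name, doc: doc };
    }
  }
  return null;
}
```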

mongodb indexing user-defined schemas

We are currently using MongoDB to allow tenants in a SaaS application to define entities that they can use in the application. We do not know upfront how each tenant is going to define the fields for the entities they are creating. Each entity will have a collection dynamically created for it in a separate database that belongs to the tenant.
For example, one tenant might define a Customer as First Name, Last Name, Email. Another tenant might define a Shipment as Shipment Ref, Ship Date, Owner, etc. Each tenant will have many entities/collections in their tenant database.
We have one field (ID) which we will always force the user to include in each entity/collection. We will index this field upfront when creating the collection.
However, how do we handle the case where we want to allow tenants to sort/search/order/query large collections/entities quickly when/if the dataset becomes very large?
That is, since we do not know upfront what fields the user will be sorting/filtering/ordering by, what is the indexing strategy to use in this case with Mongo?
First of all, Mongo requires each document to have an _id, and it indexes that field automatically. You should take advantage of this and not create yet another ID field, in case you require your clients to have an ID field; I'm not sure if that's the case in your application.
What you are asking for can't have a perfect solution, or even a most optimal one, but I can suggest a couple of options:

1. Create a single-field index for each field in the document, and let Mongo's query optimizer decide which index to use for each query (a sketch follows this list). Disadvantages: it takes lots of space on disk and in memory, and it makes inserts slower. Also, Mongo can use only one index per condition clause, so it will not be able to use a compound index. You can easily extract the schema with a tool like this; I wrote this little prototype that analyzes and prints a Mongo schema.

2. Let your application learn which indexes to create. Get slow queries from the Mongo profiler (in the Mongo log), analyze the common parts (automatically?), and create indexes on the most commonly used fields. That's not so easy to implement, and its efficiency might change with time if your client changes queries or data. The application will be slow at the start, until it learns about itself :).
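A minimal sketch of option 1 (the collection name, and sampling a single document to discover its fields, are assumptions):

```javascript
// Index every top-level field of a sample document, one index per field.
const coll = db.getCollection("customers");  // hypothetical tenant collection
const sample = coll.findOne();               // assumes the collection is non-empty
for (const field of Object.keys(sample)) {
  if (field !== "_id") {                     // _id is indexed automatically
    coll.createIndex({ [field]: 1 });
  }
}
```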
I would just like to emphasize, for your design choice: if the ID field you mention (as opposed to _id) is actually some unique entity identifier, then you are better off putting it in _id.
The reason is that maintaining another unique index on top of the required _id is considerable overhead. Think about it: since _id is required, it is the first thing MongoDB looks for when determining which index to use. Otherwise, consider a compound _id containing your entity information and some other source of uniqueness.
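A small sketch of the compound _id idea (the field names are illustrative assumptions):

```javascript
// The tenant and its entity identifier together form _id, so the
// mandatory _id index enforces the uniqueness we need for free.
db.getCollection("shipments").insertOne({
  _id: { tenant: "acme", shipmentRef: "SHIP-0042" },
  shipDate: new Date(),
  owner: "bob"
})
```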
As for the user-defined fields, which are kind of the essence of Mongo documents, for my money I would make it part of the API to set up indexes as required. Depending on the type of searching that happens, you'll probably want compound indexes, and generated queries that make sense for them.
Simply indexing every field will probably be of limited use, as only one index is going to be picked for the find anyhow, and the query optimizer is going to try all of them. As has been mentioned, a longer-term option could be to set indexes according to the usage patterns, but that could take some work.

Is there a NoSQL database that can effectively detect duplicate items?

I'm looking to implement a system that searches for duplicate entries before saving new entries, mostly by IP address. Since NoSQL databases have eventual consistency, this doesn't seem like a natural use case. Is there a way to make it work?
CouchDB enforces uniqueness on the _id field of the document. Here's an excerpt from http://guide.couchdb.org:
Within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
Every relational database, and MongoDB too, supports unique indexes on tables/collections, preventing the insertion of duplicate data... why isn't that good enough?
Creating a unique index in MongoDB is straightforward. Trying to insert a duplicate entry will raise an error (if you have safe mode enabled, or if you check the result of the insert operation).
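For the IP-address case from the question, a sketch in mongosh (the entries collection and ipAddress field names are assumptions):

```javascript
// A unique index makes the database reject duplicates for us.
db.entries.createIndex({ ipAddress: 1 }, { unique: true })

db.entries.insertOne({ ipAddress: "203.0.113.7" })  // succeeds
db.entries.insertOne({ ipAddress: "203.0.113.7" })  // throws an E11000 duplicate key error
```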

MongoDB: ObjectId as short primary key within a collection

How can I make better use of the ObjectId generated by MongoDB? I am not an expert user, but so far I have ended up creating a separate id for my objects (userid, postid, etc.) because the ObjectId is too long and makes the URL ugly if used as the main ID. I keep the _id intact, as it helps with indexing etc. I was wondering about a better strategy, so that one can use the Mongo ObjectId as a more URL-friendly and easier-to-remember key. I read that the key is a combination of a date and other parts, so perhaps one of the parts is unique within a collection and could be used for this purpose.
thanks,
bsr/
If you have an existing ID (say from an existing data set), then it's perfectly OK to override _id with the one you have.
...keep the _id intact, as it helps with indexing etc
MongoDB indexes the _id field by default. If you start putting integers in the _id field, they will be indexed like everything else.
Most RDBMSs provide an "auto-increment" ID. This is nice for small datasets, but really poor in terms of scalability: if you're trying to insert data into 20 servers at once, how do you keep the "auto-increment" intact?
The normal answer is that you don't. Instead, you end up using things like GUIDs for those IDs. In the case of MongoDB, the ObjectId is already provided.
I was wondering about a better strategy so that one can use the Mongo ObjectId as a more URL-friendly and easy-to-remember key
So the problem here is that an "easy to remember" ID doesn't really mesh with a "highly scalable" database. When you have a billion documents, the IDs are not really "easy to remember".
So you have to make a trade-off here. If you have a collection that can get really big, I suggest using the ObjectId. If you have a collection that's relatively small and doesn't get updated often (like a "lookup" table), then you can build your own auto-increment, as sketched below.
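A minimal sketch of such an auto-increment, using an atomic counter document (the counters collection name is a common convention, not something from this answer):

```javascript
// Atomically increment and fetch the next sequence number for a name.
function nextId(name) {
  return db.counters.findOneAndUpdate(
    { _id: name },
    { $inc: { seq: 1 } },
    { upsert: true, returnNewDocument: true }
  ).seq;
}

db.posts.insertOne({ _id: nextId("posts"), title: "hello" })  // _id: 1, then 2, ...
```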
The choice is really up to you.
You can override _id yourself; there is no obligation to use the auto-generated ObjectId. What is the problem with overriding _id inside your app according to your own needs?