Is there a NoSQL database that can effectively detect duplicate items? - mongodb

I'm looking to implement a system that searches for duplicate entries before saving new entries, mostly by IP address. Since many NoSQL databases are only eventually consistent, this doesn't seem like a natural use case. Is there a way to make it work?

CouchDB enforces uniqueness on the _id field of each document. Here's an excerpt from http://guide.couchdb.org:
Within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.

Every relational database and MongoDB supports unique indexes on data tables/collections preventing the insertion of duplicate data...why isn't that good enough?
Creating a unique index in MongoDB is straightforward. Trying to insert duplicate entries will raise an error (provided safe mode is enabled, or you check the result of the insertion operation).
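For example, a minimal sketch in the mongo shell (the collection and field names here are placeholders, not from the question):
// hypothetical collection keyed by IP address
db.entries.ensureIndex({ ip: 1 }, { unique: true })
db.entries.insert({ ip: "203.0.113.7" })   // first insert succeeds
db.entries.insert({ ip: "203.0.113.7" })   // second insert fails with a duplicate key (E11000) error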

Related

Generating shard key field for multi tenant mongodb app

I'm working on a multi-tenant application running on mongodb. Each tenant can create multiple applications. The schema for most of the collections reference other collections via ObjectIDs. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
If you use a hash-based shard key with constantly changing input (an ObjectID can generally be considered unique for each record), then you will get no locality of data on shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade-off with this kind of approach, and the same is true of the built-in hash-based sharding; those trade-offs don't change just because it is a manual hash constructed from two fields.
Basically, because MongoDB uses range-based chunks to split up the data for a given shard key, you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, the data in a single sequential range will essentially be random. Hence, even within a single chunk you will have no data locality, let alone on a shard; it will be completely random (by design).
If you want applications grouped together in ranges, and hence more likely to be on a particular shard, then you would be better off prepending the app_id to make it the leftmost field in a compound shard key. Something like sharding on the following would (based on the limited description) be a good start:
{app_id : 1, _id : 1}
The ObjectID is monotonically increasing over time (more discussion on that here), but if there are a decent number of application IDs and you are going to be doing any range-based or targeted queries on the ObjectID, then it might still work well. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you ideally want the shard key to satisfy it if at all possible. It has to be indexed, and it has to be usable by the mongos to route the query (if not, the query becomes scatter/gather), so if you are going to constantly query on app_id and _id then the above shard key makes a lot of sense.
If you go with the manual hashed key approach not only will you have a random distribution, but unless you are going to be querying on that hash it's not going to be very useful.
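As a rough sketch of setting up that compound shard key (the database and collection names here are assumptions, not from the question):
use mydb
db.records.ensureIndex({ app_id : 1, _id : 1 })             // hypothetical collection; the shard key must be indexed
sh.enableSharding("mydb")
sh.shardCollection("mydb.records", { app_id : 1, _id : 1 })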

MongoDB and composite primary keys

I'm trying to determine the best way to deal with a composite primary key in a mongo db. The main key for interacting with the data in this system is made up of 2 uuids. The combination of uuids is guaranteed to be unique, but neither of the individual uuids is.
I see a couple of ways of managing this:
Use an object for the primary key that is made up of 2 values (as suggested here)
Use a standard auto-generated mongo object id as the primary key, store my key in two separate fields, and then create a composite index on those two fields
Make the primary key a hash of the 2 uuids
Some other awesome solution that I currently am unaware of
What are the performance implications of these approaches?
For option 1, I'm worried about the insert performance due to having non sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
For option 2, it seems a little odd to have a primary key that would never be used by the system. Also, it seems that query performance might not be as good as in option 1. In a traditional RDBMS a clustered index gives the best query results. How relevant is this in MongoDB?
For option 3, this would create one single id field, but again it wouldn't be sequential when inserting. Are there any other pros/cons to this approach?
For option 4, well... what is option 4?
Also, there's some discussion of possibly using CouchDB instead of MongoDB at some point in the future. Would using CouchDB suggest a different solution?
MORE INFO: some background about the problem can be found here
You should go with option 1.
The main reason is that you say you are worried about performance - using the _id index which is always there and already unique will allow you to save having to maintain a second unique index.
For option 1, I'm worried about the insert performance due to having
non sequential keys. I know this can kill traditional RDBMS systems
and I've seen indications that this could be true in MongoDB as well.
Your other options do not avoid this problem; they just shift it from the _id index to the secondary unique index - but now you have two indexes: one that's right-balanced and the other one that's random access.
There is only one reason to question option 1 and that is if you plan to access the documents by just one or just the other UUID value. As long as you are always providing both values and (this part is very important) you always order them the same way in all your queries, then the _id index will be efficiently serving its full purpose.
As an elaboration on why you have to make sure you always order the two UUID values the same way, when comparing subdocuments { a:1, b:2 } is not equal to { b:2, a:1 } - you could have a collection where two documents had those values for _id. So if you store _id with field a first, then you must always keep that order in all of your documents and queries.
The other caution is that the index on _id will be usable for this query:
db.collection.find({_id:{a:1,b:2}})
but it will not be usable for this query:
db.collection.find({"_id.a":1, "_id.b":2})
I have an option 4 for you:
Use the automatic _id field and add 2 single field indexes for both uuid's instead of a single composite index.
The _id index would be sequential (although that's less important in MongoDB), easily shardable, and you can let MongoDB manage it.
The 2 uuid indexes let you make any kind of query you need (with the first one, with the second, or with both in any order), and they take up less space than 1 compound index.
If you use both indexes (and other ones as well) in the same query, MongoDB will intersect them (new in v2.6) as if you were using a compound index.
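A minimal sketch of this option (the field names uuid1/uuid2 are placeholders):
db.collection.ensureIndex({ uuid1 : 1 })
db.collection.ensureIndex({ uuid2 : 1 })
db.collection.find({ uuid1 : "uuid-one", uuid2 : "uuid-two" })   // 2.6+ can intersect the two single-field indexes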
I'd go with option 2, and here is why:
Having two separate fields, instead of one concatenated from both uuids as suggested in option 1, leaves you the flexibility to create other combinations of indexes to support future query requests, or to adjust if it turns out that the cardinality of one key is higher than that of the other.
Having non-sequential keys could help you avoid hotspots while inserting in a sharded environment, so it's not such a bad option. Sharding is, in my opinion, the best way to scale inserts and updates on collections, since the write lock is coarse-grained (global prior to 2.2, per-database as of 2.2 through 2.6).
I would've gone with option 2. You can still make an index that handles both the UUID fields, and performance should be the same as a compound primary key, except it'll be much easier to work with.
Also, in my experience, I've never regretted giving something a unique ID, even if it wasn't strictly required. Perhaps that's an unpopular opinion though.

mongodb indexing user-defined schemas

We are currently using MongoDB to allow tenants in a SaaS application to define entities that they can use in the application. We do not know upfront how each tenant is going to define the fields for the entities that they are creating. Each entity will have a collection dynamically created for it in a separate database that belongs to the tenant.
For example, One tenant might define a Customer as First Name, Last Name, Email. Another tenant might define Shipment as Shipment Ref, Ship Date, Owner etc... Each tenant will have many entities/collections in their tenant database.
We have one field (ID) which we will always force the user to include in each entity/collection. We will index this field upfront when creating the collection.
However, how do we handle the case where we want to allow the tenant to sort/search/order/query large collections/entities quickly when/if the dataset becomes too large?
That is, since we do not know upfront what fields the user will be sorting/filtering/ordering by, what is the indexing strategy to use in this case with Mongo?
First of all, Mongo requires each document to have an _id field and indexes it automatically. You should take advantage of this and not create yet another ID field if you require your clients to have an ID field. I'm not sure whether that's the case in your application.
What you are asking for can't have a perfect, or even truly optimal, solution, but I can suggest a couple of options:
Create a single-field index for each field in the document, and let the Mongo query optimizer decide which index to use depending on the query (a sketch follows after the next option). Disadvantages: it takes lots of space on disk and in memory, and it makes inserts slower. Mongo can use only one index per condition clause, so it will not be able to use a compound index. You can easily extract the schema with a tool like this. I wrote this little prototype that analyzes and prints a Mongo schema.
Let your application learn which indexes to create. Get slow queries from the Mongo profiler (in the Mongo log), analyze the common parts (automatically?), and create indexes on the most commonly used fields. That's not so easy to implement, and its effectiveness might change over time if your client changes queries or data. The application will be slow at the start, until it learns about itself :).
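As referenced above, here is a rough sketch of the first option, deriving the field list from a sample document (the collection name and the sampling approach are assumptions for illustration only):
// hypothetical tenant-defined collection
var sample = db.customers.findOne();
for (var field in sample) {
    if (field !== "_id") {
        var spec = {};
        spec[field] = 1;
        db.customers.ensureIndex(spec);   // one single-field index per user-defined field
    }
}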
Would just like to emphasize, in choosing your design, that if the ID field you mention (as opposed to _id) is actually some unique entity identifier, then you are better off putting it in _id.
The reason is that maintaining another unique index on top of the required _id is a considerable overhead. Also, since _id is required, it is the first thing MongoDB looks for when determining which index to use. Otherwise, consider a compound _id field containing your entity information and some other useful source of uniqueness.
As for the user-defined fields, which are kind of the essence of mongo documents, for my money I would make setting up indexes as required part of the API. Depending on the type of searching that is happening, you'll probably want compound indexes and generated queries that make sense for them.
Simply indexing every field will probably be of limited use, as only one index is going to be picked for the find anyhow, and the query optimizer is going to try all of them. As has been mentioned, a longer-term option could be to set indexes according to the usage patterns, but that could take some work.

how do non-ACID RethinkDB or MongoDB maintain secondary indexes for non-equal queries

This is more of an 'inner workings' understanding question:
How do NoSQL databases that do not support *A*CID (meaning that they cannot update/insert and then roll back data for more than one object in a single transaction) update their secondary indexes?
My understanding is that in order to keep the secondary index in sync (otherwise it will become stale for reads), this has to happen within the same transaction.
Furthermore, if it is possible for the index to reside on a different host than the data, then a distributed lock needs to be present and/or a two-phase commit for such an update to work atomically.
But if these databases do not support multi-object transactions (which means they do not do two-phase commits on data across multiple hosts), what method do they use to guarantee that secondary indices, which reside in B-tree structures separate from the data, are not stale?
This is a great question.
RethinkDB always stores secondary indexes on the same host as the primary index/data for the table. Even in case of joins, RethinkDB brings the query to the data, so the secondary indexes, primary indexes, and data always reside on the same node. As a result, there is no need for distributed locking protocols such as two phase commit.
RethinkDB does support a limited set of transactional functionality -- single document transactions. Changes to a single document are recorded atomically. Relevant secondary index changes are also recorded as part of that transaction, so either the entire change is recorded, or nothing is recorded at all.
It would be easy to extend the limited transactional functionality to support multiple documents in a single shard, but it would be hard to do it across shards (for the distributed locking reasons you brought up), so we decided not to implement transactions for multiple documents yet.
Hope this helps.
This is a MongoDB answer.
I am not quite sure what your logic is here. Updating a secondary index has nothing to do with being able to roll back multi-statement transactions such as a multi-document update.
MongoDB has transactions per single document, and that is what matters for updating indexes. These operations can be reversed using the journal if the need arises.
this has to happen within the same transaction.
Yes, much like a RDBMS would. The more indexes you apply the slower your writes will be, and it seems to me you know why.
As the write occurs MongoDB will update all indexes which apply to that collection with the fields that apply to specific indexes.
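As a small illustration (the collection and field names are placeholders), every index listed by getIndexes() is maintained as part of the same single-document write:
db.orders.ensureIndex({ customer_id : 1 })            // hypothetical collection
db.orders.getIndexes()                                // lists both _id_ and customer_id_1
db.orders.insert({ customer_id : 42, total : 10 })    // entries in both indexes are written with this one document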
Furthermore, if it is possible for the index to reside on a different host than the data
I am unsure whether MongoDB allows that. I believe there is a JIRA ticket for it; however, I cannot find that JIRA currently.
then a distributed lock needs to be present and/or a two-phase commit for such an update to work atomically.
Most likely. Allowing this feature would be...well, let's just say creating a hairball.
Even in a sharded setup the index of each range resides on the shard itself, not on the config servers.
But if these databases do not support multi-object transactions (which means they do not do two-phase commits on data across multiple hosts)
That is not what a two-phase commit means. I believe you need to brush up on what a two-phase commit is: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
I suppose if you are talking about a transaction covering more than one shard then, hmm ok.
what method do they use to guarantee that secondary indices, which reside in B-tree structures separate from the data, are not stale?
Again, I am unsure why a multi-document transaction would affect whether an index would be stale or not; you're not grouping across documents. The exception to that is a unique index, but that works on single-document updates as well; note that its uniqueness gets kinda hairy in sharded setups and cannot be guaranteed.
In an index you are normally creating one entry per document for the indexed key, unless it is a multikey index on the document, in which case you can get more than one index entry. Either way, index updating is done per single object, not by multi-document transactions, and since I am unsure what your logic here is, this is the answer I have placed.
RethinkDB always stores secondary index data on the same machine as the data it's indexing. This allows it to be updated within the same transaction. Rethink promises to be ACIDy with single document operations and considers the indexing of a document to be part of the document itself.

Mongodb database with multiple unique indexes in sharded config

I have a "users" collection in a database in mongodb. I will be using sharding.
Other than the standard _id, there are 3 other fields that need to be indexed and unique: username, email, and account number. All 3 of these fields will not necessarily exist for every user document; in some cases none will exist.
I'd like the fields to be indexed because users will be looked up frequently by one of these fields. I'd like the fields to be unique because this is a requirement and I'd rather not handle this logic client-side.
I understand that mongodb does have limitations, just like any other database, but I'm hoping there's a solution because this is a fairly common setup for web applications.
Is there an elegant solution for this scenario?
Not sure if it matters for this question (because the question pertains to database structure), but I am using the official mongodb C# driver.
MongoDB's official documentation says that a sharded collection can have only one unique index, and requires the unique field(s) to exist. But it also says that you have the option to have other unique indexes if and only if the shard key is a prefix of their attributes. So you can try this, but be aware that the unique key must always exist.
I don't understand your business logic where no information about the user would exist. In this case you can shard by _id and perform uniqueness checks manually.
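As a hedged sketch of the prefix restriction mentioned above (the database name and the choice of email as the shard key are assumptions for illustration, not a recommendation):
use mydb
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", { email : 1 })
db.users.ensureIndex({ email : 1, account_number : 1 }, { unique : true })   // allowed: the shard key is a prefix
db.users.ensureIndex({ username : 1 }, { unique : true })                    // rejected on a sharded collection: not prefixed by the shard key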