Convention for the attribute used for optimistic locking? - nosql

I am doing optimistic locking on some items in DynamoDB. Is there a convention for the attribute name used for the optimistic-locking version number? I'm thinking of:
_v: 1

In the AWS documentation they use a field called version. I also use that field name, because we just copied AWS.
You can use any field name you want. A very short field name has the benefit of saving you RCUs on Scans and Queries. It probably won't matter but keeping field names short can save you some expense, especially if your values are quite short (and therefore the field names make up a lot of the table data).
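As a rough sketch of how such a version attribute is typically used: the update is made conditional on the stored version matching the one that was read, and the version is incremented in the same write. The helper below only builds the keyword arguments one could pass to boto3's Table.update_item. The function name, the status attribute and the key shape are hypothetical; _v is the name proposed in the question.

```python
from typing import Any, Dict


def build_versioned_update(
    key: Dict[str, Any],
    new_status: str,
    expected_version: int,
    version_attr: str = "_v",
) -> Dict[str, Any]:
    """Build update_item arguments that only succeed if the version is unchanged."""
    return {
        "Key": key,
        # Write the new value and bump the version in one request.
        "UpdateExpression": "SET #s = :status, #v = :next",
        # Reject the write if someone else changed the item since we read it.
        "ConditionExpression": "#v = :expected",
        "ExpressionAttributeNames": {"#s": "status", "#v": version_attr},
        "ExpressionAttributeValues": {
            ":status": new_status,
            ":expected": expected_version,
            ":next": expected_version + 1,
        },
    }


kwargs = build_versioned_update({"pk": "order#42"}, "shipped", expected_version=1)
# With boto3 this would be passed as table.update_item(**kwargs); a failed
# condition surfaces as a ConditionalCheckFailedException error from DynamoDB.
```

Because the attribute name is passed through a placeholder (#v), the choice of name does not affect the expression itself, only the stored item size.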

Related

email as _id in a MongoDB user collection

I have a user collection in a MongoDB. The _id is currently the standard MongoDB generated ObjectId. I also have a unique key constraint against a required 'email' field. This seems like a waste.
Is there any reason why I should not ditch the 'email' field and make that data the _id field?
I have read Neil's answer and I partially agree with it (although I am really skeptical about the 'significant performance gains'). One thing I have not found in your question is what you are going to do with this email. Are you going to search by it, or is it just stored there? And one of the most important things, which was not addressed in the previous answer: is it going to change?
It is not uncommon for users of a system to change their email address (it was lost, or is no longer used). If you use the email as _id you will not be able to change it easily (you cannot modify _id in MongoDB). You would need to copy the document, insert the copy under the new _id and remove the old document, and that sequence is not atomic.
So I would count this as one big reason not to do so. But you need to decide whether you will allow people to change email addresses.
Generally speaking, no, there is no real reason not to, and in fact there are significant performance gains to be realized if you use "email" as the primary key, provided most of your lookups are actually on that primary key. Even if you create a unique index on a different field, MongoDB is optimized so that "finding" the _id index is a no-brainer. It's always there.
No additional space is used for an index. So where you are looking up by primary key, there is no need to pull in anything other than the default index, which saves disk space as well as the I/O cost that would otherwise be incurred.
Perhaps the only real relevant consideration would be with sharding. And that would only be if your use case was better suited to some different form of "bucketed" distribution of "high/low" volume users for example. In that case some other form of Primary key would be required in order to facilitate that.
The default ObjectId type that generally occupies the _id field is great, as it maintains a natural insertion order and even makes it possible to do things such as general range-based queries or time-based queries (within reason). So where there is a need for a natural insertion order it is generally the best choice, and it is highly collision safe.
But if you are generally looking for efficient lookup of Primary key values, then anything that serves as a natural primary key is ideally put in the _id field of the collection, as long as it is reasonably guaranteed to be unique.

mongodb indexing user-defined schemas

We are currently using MongoDB to allow tenants in a SaaS application to define entities that they can use in the application. We do not know upfront how each tenant is going to define the fields for the entities that they are creating. Each entity will have a collection dynamically created for it in a separate database that belongs to the tenant.
For example, one tenant might define a Customer as First Name, Last Name, Email. Another tenant might define Shipment as Shipment Ref, Ship Date, Owner etc. Each tenant will have many entities/collections in their tenant database.
We have one field (ID) which we will always force the user to include in each entity/collection. We will index this field upfront when creating the collection.
However, how do we handle the case where we want to allow the tenant to sort/search/order/query large collections/entities quickly when/if the dataset becomes too large?
That is, since we do not know upfront what fields the user will be sorting/filtering/ordering by, what is the indexing strategy to use in this case with Mongo?
First of all Mongo requires you to have _id for each document and it indexes it automatically. You should take advantage of this and not create yet another ID field in case you require your clients to have ID field. I'm not sure if that's the case in your application.
What you are asking for can't have a perfect solution, or even an optimal one, but I can suggest a couple of options:
1) Create a single-field index for each field in the document. Let the Mongo query optimizer decide which index to use depending on the query. Disadvantages: it takes lots of space on disk and in memory, and it makes inserts slower. Mongo generally uses only one index for a query's condition clause, so with single-field indexes alone it gets none of the benefit of a compound index. You can easily extract the schema with a tool like this. I wrote this little prototype that analyzes and prints Mongo schema.
2) Let your application learn what indexes to create. Get slow queries from the Mongo profiler (in the Mongo log), analyze their common parts (automatically?) and create indexes on the most commonly used fields. That's not so easy to implement, and efficiency might change over time if your client changes queries or data. The application will be slow at the start until it learns about itself :).
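The second option, learning indexes from slow queries, could start from something as simple as counting which fields show up in slow query filters. This is only a sketch under stated assumptions: the filters are taken to be already extracted as plain dictionaries (how they are pulled from the profiler is left out), and the function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def suggest_index_fields(
    slow_query_filters: Iterable[Dict[str, Any]], top_n: int = 3
) -> List[str]:
    """Return the fields that appear most often in slow query filters."""
    counts: Counter = Counter()
    for query_filter in slow_query_filters:
        for field in query_filter:
            # Skip operators such as $or and the always-indexed _id field.
            if field.startswith("$") or field == "_id":
                continue
            counts[field] += 1
    return [field for field, _ in counts.most_common(top_n)]


slow_filters = [
    {"email": "a@example.com"},
    {"last_name": "Smith", "email": "b@example.com"},
    {"ship_date": {"$gt": "2013-01-01"}},
    {"email": "c@example.com", "_id": 7},
]
candidates = suggest_index_fields(slow_filters, top_n=2)
```

Here email is the strongest candidate because it appears in three of the four filters. A real implementation would also need to look at sort keys and at which fields are queried together before deciding on compound indexes.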
I would just like to emphasize, in choosing your design, that if the ID (not _id) field you mention is actually a unique entity identifier, then you are better off putting it in _id.
The reason is that maintaining another unique index in addition to the required _id index is a considerable overhead. Since _id is required, it is the first thing that MongoDB looks for when determining which index to use. Otherwise, consider a compound _id field containing your entity information and some other useful unique value.
As for the user-defined fields, which are kind of the essence of Mongo documents, for my money I would make it part of the API to set up indexes as required. Depending on the type of searching that is happening, you'll probably want compound indexes and generated queries that make sense for them.
Simply indexing every field will probably be of limited use, as only one index is going to be picked for the find anyhow, and the query optimizer is going to try all of them. As has been mentioned, a longer-term option could be to set indexes according to usage patterns, but that could take some work.

strategy for creating MongoDB short ids that scale

I want to have friendlier user-facing ids (i.e. YouTube style: /posts/cxB6Ey6) than MongoDB's ObjectID.
I read that for scalability it's best to leave _id as an ObjectID, so I thought about two solutions:
1) add an indexed postid field to each document
2) create a mapping collection between _id and the postid
In both cases I would use something like https://github.com/dylang/shortid to generate the short id, and while generating it make sure that the id is unique by querying the database.
(Can this query-generate-insert sequence be an atomic operation?)
Will those solutions have a noticeable impact on performance?
What's the best strategy for doing this?
The normal method of doing this is to base64 encode a unique id but:
add an indexed postid field to each document
You definitely want to go for this method. Of the two, I would say this method is easily the more scalable and performant: for one, it needs only one round trip to get a short URL's details, whereas the second option takes two. Another consideration is the index overhead of maintaining an extra collection, so this is a bit of a no-brainer.
I would not replace the _id field within the document either, since the default ObjectId could still be useful in the foreseeable future.
So this narrows it down to a separate field and index (unique key) for the short code of a URL.
The next thing is that you don't want an ID that forces you to query the database for uniqueness before every insert. This is where the ObjectId shines: it can be generated within the client application and still be unique in the database, without having to query the database to verify that.
Unique ids that do not require querying the database first are normally time based. In PHP ( http://php.net/manual/en/function.uniqid.php ) and in the MongoDB Drivers ( http://docs.mongodb.org/manual/core/object-id/ ) and even the plug-in you linked on github ( https://github.com/dylang/shortid/blob/master/lib/shortid.js#L50 ) they all use time as a basis for being unique.
Since the plug-in you linked does not query the database to check the uniqueness of its IDs, I would say it is probably quite performant, and if you use it with the first solution you stated you should get good benchmark results out of it.
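As for base64-encoding a unique id, mentioned at the start of this answer, a minimal sketch could look like the following. It assumes the unique id is an ObjectId given as its 24-character hex string; the function names are hypothetical. Note that 12 bytes still encode to 16 characters, so this is shorter than the hex form but not as short as the YouTube-style ids in the question.

```python
import base64


def short_code_from_hex(object_id_hex: str) -> str:
    """Encode a hex id as URL-safe base64 without padding."""
    raw = bytes.fromhex(object_id_hex)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def hex_from_short_code(code: str) -> str:
    """Reverse short_code_from_hex, restoring the stripped padding first."""
    padding = "=" * (-len(code) % 4)
    return base64.urlsafe_b64decode(code + padding).hex()


example_hex = "507f1f77bcf86cd799439011"
code = short_code_from_hex(example_hex)
restored = hex_from_short_code(code)
```

The encoding is reversible, so no extra field or index is needed for it; the trade-off is that the code exposes the same information as the ObjectId itself.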
If you want to replace the built-in ObjectID with custom user-friendly short ids, then do it. You can either use the built-in _id field or add a new unique-indexed field for your custom ids. The benefit of using built-in ObjectIDs is that they won't collide even if your database is extremely large. So by replacing them with short ids you take on the risk of id duplication.
Now about performance. I think the best solution is not to query the DB for ids, because with a properly chosen id length the probability of duplication is extremely small. So the best way to handle id duplication in this model is to check Mongo's responses. If it responds with a "duplicate key error", then you generate a new id and retry.
And now about scaling. To scale your custom ids you can just add a few more symbols to them. A "duplicate key error" should be the trigger for making that change. Normally there should be no such errors, so if they start to appear then it's time to scale.
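The insert-then-retry approach described above can be sketched as follows. To keep the example self-contained, an in-memory dictionary stands in for a collection with a unique index, and DuplicateKeyError is a local stand-in for whatever duplicate key error the driver raises. The function names, alphabet and default length are assumptions made for illustration.

```python
import secrets
import string
from typing import Callable, Dict

ALPHABET = string.ascii_letters + string.digits


class DuplicateKeyError(Exception):
    """Stand-in for the driver's duplicate key error (hypothetical here)."""


def new_short_id(length: int = 7) -> str:
    """Generate a random id without consulting the database."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def insert_with_short_id(
    insert: Callable[[str, dict], None],
    doc: dict,
    make_id: Callable[[], str] = new_short_id,
    max_attempts: int = 5,
) -> str:
    """Insert doc under a fresh short id, retrying only on a duplicate."""
    for _ in range(max_attempts):
        short_id = make_id()
        try:
            insert(short_id, doc)
            return short_id
        except DuplicateKeyError:
            continue  # Rare collision: generate another id and try again.
    raise RuntimeError("too many collisions; consider a longer id")


# In-memory stand-in for a collection with a unique index on the short id.
store: Dict[str, dict] = {"aaaaaaa": {"title": "existing"}}


def fake_insert(short_id: str, doc: dict) -> None:
    if short_id in store:
        raise DuplicateKeyError(short_id)
    store[short_id] = doc


ids = iter(["aaaaaaa", "bbbbbbb"])  # Force one collision for the demo.
result = insert_with_short_id(fake_insert, {"title": "new"}, make_id=lambda: next(ids))
```

Because the unique index rejects the duplicate at insert time, the uniqueness check and the insert are a single operation, and no separate query is needed before each insert.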
I don't think that using a generated ObjectId for the _id field directly affects scalability or performance. How would that happen?
The main difference is that ObjectIds are generated for you, so you don't burden yourself with that responsibility. Otherwise you must determine the optimal size of the id yourself and ensure a unique value for the _id field of every document stored in the collection. This is required because _id is used as the primary key. It can be justified if your collection is not very big and you need a custom identifier value.
But an _id field that stores ObjectId values gives you additional benefits, such as the ability to create ObjectIds from a time and use that to your advantage in queries. You can also get the timestamp of an ObjectId's creation with the getTimestamp() method. And sorting on _id in this case is roughly equivalent to sorting by creation time.
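To illustrate where that timestamp comes from: the first 4 bytes of an ObjectId hold the creation time as seconds since the Unix epoch, so it can be decoded without a driver. A small sketch (the function name is hypothetical, and the id is just a sample value):

```python
from datetime import datetime, timezone


def object_id_created_at(object_id_hex: str) -> datetime:
    """Decode the creation time stored in the first 4 bytes of an ObjectId."""
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(object_id_hex[:8], 16)  # big-endian seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = object_id_created_at("507f1f77bcf86cd799439011")
print(created.isoformat())  # 2012-10-17T21:13:27+00:00
```

This is also why anyone who sees the id can read the creation time from it.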
But if you're going to expose ObjectIds in URLs or HTML, you may want to encrypt them for security reasons, to prevent leaking information such as the object's creation time. That leak may be a security risk.
About your solutions:
1) I suppose this is a very convenient and flexible solution. In this case you can put any value in postId, independent of _id.
A small disadvantage of this solution is that you need an extra field and an extra index, whereas _id is indexed automatically.
2) I don't think this is a good solution from the point of view of performance or the philosophy of the NoSQL approach.

Custom Attributes - No SQL Data Store

We want to develop an application which needs to support custom attributes on different entities (like user, project, folder, document etc.) in our application.
I googled and prima facie it looks like a NoSQL database could suit our requirement. Do you see any limitations? What are the pros/cons of using NoSQL instead of an RDBMS?
There are many NoSQL databases available (http://nosql-database.org/), but we don't have any experience in using a NoSQL database, and I can't find any good article which compares them. Any suggestion as to which NoSQL data store we can use to achieve custom attribute functionality?
One big advantage of a NoSQL database is its free style: you never have to specify columns like "user, project, folder" before you insert your real data. Columns can be added at any time.
In an RDBMS, by contrast, the table schema is strictly defined up front and is not meant to change at run time.
Another advantage is query performance. It is quite efficient to query all the records of a user, say "Michael", since the data is stored following the principles of Bigtable, as described by Google.
There are two ways to solve your problem: a column database such as Cassandra, or name-value pairs (also called attribute-value pairs) in a relational database.
First, Cassandra is a structured key-value store. A key can contain multiple and variable attributes and values. Values or columns are grouped into column families. The column families are fixed when a Cassandra database is created. A family is analogous to an entity in a logical data model or to a table in relational. Columns can be added to a family at any time. Thereby, different instances of the column family can have different columns, which is what you need. Furthermore, columns are assigned to specified keys, so different keys can have different numbers of columns in any given family.
A name-value pair, also called an attribute-value pair, can be created in logical data modeling and in a relational database. This can be done with three related entities or tables:
The base entity (such as customer), which is analogous to a column family.
A "type" entity, which describes the attribute and its characteristics such as Net Worth Amount,
A "value" entity, which assigns the attribute to an instance of a base entity and assigns it a value.
The "type" entity is simply a code table identified by a type code and containing a description and other domain characteristics. Domain refers to data type, length, meaning, and units of measure. It describes the attribute out of context (i.e., unassigned). An example could be Net Worth Amount, which is a number 8 digits with 2 decimal places, right justified, and its description is "a value representing the total financial value of a customer including liquid and non-liquid amounts".
The "value" entity is an associative entity or table that is identified by the customer id and the attribute type code, and has a value attribute. It assigns the Net Worth Amount type to the Customer and gives it a value, such as "$2,000,000."
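The three-table layout described above can be sketched with SQLite. This is a minimal illustration only: the table and column names are hypothetical, the domain characteristics of the "type" entity are reduced to a description, and the sample customer is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Base entity.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- "Type" entity: a code table describing each attribute out of context.
    CREATE TABLE attribute_type (
        type_code TEXT PRIMARY KEY,
        description TEXT NOT NULL
    );
    -- "Value" entity: assigns an attribute type to a customer with a value.
    CREATE TABLE attribute_value (
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        type_code TEXT NOT NULL REFERENCES attribute_type(type_code),
        value TEXT NOT NULL,
        PRIMARY KEY (customer_id, type_code)
    );
    """
)
conn.execute("INSERT INTO customer VALUES (1, 'Michael')")
conn.execute("INSERT INTO attribute_type VALUES ('NET_WORTH', 'Net Worth Amount')")
conn.execute("INSERT INTO attribute_value VALUES (1, 'NET_WORTH', '2000000.00')")

# Reading one attribute back already takes two joins.
rows = conn.execute(
    """
    SELECT c.name, t.description, v.value
    FROM attribute_value AS v
    JOIN customer AS c ON c.customer_id = v.customer_id
    JOIN attribute_type AS t ON t.type_code = v.type_code
    """
).fetchall()
```

Adding a new custom attribute is then an insert into attribute_type rather than a schema change, at the cost of storing every value as text and joining to read it.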
However, in a relational database name-value pairs are somewhat difficult to query in SQL and generally do not perform well. This could be addressed by denormalizing the "type" and "value" entities into one. Instead of having three tables you have two, in a one-to-many relationship. Actually, that is essentially how Cassandra does it. A column family is a fully flattened attribute-value pair.
I hope this helps. If you are going to use NoSQL, I'd use something like Cassandra. If you use relational, I'd denormalize (i.e., collapse into one) the type and value. The advantage of relational is that you already have it. The disadvantage of Cassandra is that you have to learn it, but it is built to do what you want.
Couchbase would be a great answer for you, if you can encapsulate your model into JSON then you are already halfway there. You can have any number of properties for your object:
product::001
{
"name": "Hard Drive",
"brand": "Toshiba",
...
...
}
To learn some simple patterns moving from RDBMS to Couchbase, check out their webinars at http://www.couchbase.com/webinars or some simple design patterns at http://CouchbaseModels.com (examples are in Ruby though)
The real advantages of Couchbase are schema flexibility, horizontal scalability on commodity hardware, and speed. After learning the basics, it fits better into Agile processes, with almost no need for migrations. It is very effective in enterprise organizations, where every column modification in an RDBMS requires business processes and approvals from the DBA. Couchbase's schema flexibility circumvents a lot of these issues.

Mongodb: object id as short primary key within a collection

How can I make better use of the ObjectId generated by MongoDB? I am not an expert user, but so far I have ended up creating a separate id for my objects (userid, postid etc.) because the ObjectId is too long and makes the URL ugly if used as the main ID. I keep the _id intact as it helps indexing etc. I was wondering about any better strategy so that one can use the Mongo ObjectId as a more URL-friendly and easy-to-remember key. I read that the key is a combination of date etc., so can any of its parts be used as a key that is unique within a collection for this purpose?
thanks,
bsr/
If you have an existing ID (say from an existing data set), then it's perfectly OK to override _id with the one you have.
...keep the _id intact as it helps indexing etc
MongoDB indexes the _id field by default. If you start putting integers in the _id field, they will be indexed like everything else.
Most RDBMSs provide an "auto-increment" ID. This is nice for small datasets, but really poor in terms of scalability. If you're trying to insert data on 20 servers at once, how do you keep the "auto-increment" intact?
The normal answer is that you don't. Instead, you end up using things like GUIDs for those IDs. In the case of MongoDB, the ObjectId is already provided.
I was wondering about any better strategy so that one can use mongo objectId as more url friendly and easy to remember key
So the problem here is that "easy to remember" ID doesn't really mesh with "highly scalable database". When you have a billion documents, the IDs are not really "easy to remember".
So you have to make the trade-off here. If you have a table that can get really big, I suggest using the ObjectId. If you have a table that's relatively small and doesn't get updated often, (like a "lookup" table) then you can build your own auto-increment.
The choice is really up to you.
You can overwrite the _id yourself. There is no obligation for using the auto-generated object id. What is the problem with overriding _id inside your app according to your own needs?