Is document-level transaction support enough? (in MongoDB)

The MongoDB documentation and blog describe its transaction capabilities like this:
MongoDB write operations are ACID-compliant at the document level, including the ability to update embedded arrays and sub-documents atomically.
Now I'm wondering: is this "document-level transaction support" enough?
By "enough" I mean: can it be as good as the transaction support in old-fashioned RDBMSs?
About the possible duplicate: what I had in mind was a general question, namely whether this is enough for a developer or not.

I'm going to agree with Joshua on this and add my two cents. In the RDBMS world a transaction is very frequently updating multiple normalized data-bearing structures. A robust level of atomicity is required to ensure that changes are committed to all of those structures as a unit, or rolled back as a unit. In MongoDB you would ideally be designing your schema to keep data that logically belongs together housed together in the same document. This makes document-level atomicity perfectly sufficient for your typical document schema.
I'll also agree that neither RDBMS nor MongoDB transaction handling should be your only line of defense against errors and data corruption. For critical data changes that must be atomic you should always check consistency at the code level post-update.
One final thought: In most RDBMS systems, transaction handling does not always map one-to-one to concurrency. Frequently a large transaction can lock an entire table or tables and cause backlogs in response. In MongoDB, document-level ACID compliance in transaction handling pairs well with document-level concurrency available to those using the WiredTiger storage engine. If designed with both in mind your application can be highly concurrent and completely ACID compliant at the document level, giving you a high level of performance and throughput for transactional workloads.
Cheers,
Bill Finch

Answering this question involves an understanding of schema design in the NoSQL world. If you approach your schema design like you would in an RDBMS, then you will have a very bad time, and not just because of transactions.
If you design your documents properly, however, document level ACID-compliance should be just fine for 99% of use cases. I would even argue that outside that 99% and in that 1% of use cases, you shouldn't be relying on your database for transactions anyways. This would be a really complicated case where you were changing two completely separate things in parallel. Even in an RDBMS if you were doing this, you would always write a verification in code.
One example might be a bulk update for a banking customer that involved them changing their name and doing an address change at the same time. In an RDBMS these are likely to be separate tables. In MongoDB these will both be in the same document. So this fits in the 99%.
A debit to one account and a credit to another would be an example that fits into the 1%. You can wrap that in a transaction in SQL, but if you don't write code to verify the writes afterward, you are going to lose your job. You would never rely on the database for that. Same with MongoDB, where these would be two different documents.
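As a rough illustration of the "verify the writes afterward" advice, here is a minimal in-memory sketch in Python. The `accounts` dict, the `transfer` function and the revert logic are hypothetical stand-ins, not MongoDB API calls. The point is only that the debit and the credit touch two separate documents, so the code checks the invariant itself.

```python
# In-memory stand-in for two account documents (hypothetical data).
accounts = {
    "alice": {"_id": "alice", "balance": 100},
    "bob": {"_id": "bob", "balance": 20},
}


def transfer(store, src, dst, amount):
    """Debit src and credit dst, then verify both writes in code."""
    if store[src]["balance"] < amount:
        raise ValueError("insufficient funds")

    src_before = store[src]["balance"]
    dst_before = store[dst]["balance"]
    total_before = src_before + dst_before

    # Two separate single-document writes: each is atomic on its own,
    # but nothing ties them together.
    store[src]["balance"] -= amount
    store[dst]["balance"] += amount

    # Post-write verification: the combined balance must be unchanged.
    total_after = store[src]["balance"] + store[dst]["balance"]
    if total_after != total_before:
        # Compensate by restoring the values read before the writes.
        store[src]["balance"] = src_before
        store[dst]["balance"] = dst_before
        raise RuntimeError("transfer verification failed; changes reverted")
    return total_after


transfer(accounts, "alice", "bob", 30)
```

In a real system the check would re-read both documents from the database rather than from a local dict, and the revert step would itself have to be safe to retry.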

Document-level transactions are good, but not enough for real-world applications. In general, you have to think a bit differently than in the RDBMS world and use sub-documents; that way you can solve many situations without collection-wide transactions. But there are enough use cases where you definitely need collection-wide transactions.
The debit/credit situation of an account system is one example. Another is a battle game where two players fight each other and the winner gets "resources" from the loser: you have to update the resource state of both players together, or roll both back if something fails. MongoDB does not handle this transactionally the way RDBMS systems do.
Once again, as others have already said: you have to think in object/document structures, where you can handle many situations for which document-level transactions are quite enough.
But collection-wide transactions are on the MongoDB roadmap ;-)

If you are able to include all your logical data in one document, MongoDB is going to be faster and perform better than a relational database. You must be sure that all your data is going to be written, or not, at the same time (ACID compliance at the document level).
If you are not in a hurry, MongoDB is working hard to get transactions across collections!
Regards,
Juan

Starting from version 4.0 MongoDB will add support for multi-document transactions. So you will have the power of the document model with ACID guarantees in MongoDB.
For details visit this link: https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb?jmp=community

Related

How to live without transactions?

Most of the popular NoSQL databases (MongoDB, RethinkDB) do not support ACID transactions. They are very popular today among developers of different systems.
The problem is: how do you guarantee data consistency without transactions?
I thought that data consistency was one of the main concerns in production. Am I wrong?
Maybe there are some techniques to restore data consistency?
I would like to use RethinkDB for my project, but I'm scared about the missing transactions.
I do not know much about RethinkDB, so this answer is primarily based on MongoDB.
While MongoDB cannot provide atomic operations on multiple documents at the same time, it does guarantee atomicity for a single operation which affects one document. That means when one query changes multiple fields of the same document, you can be sure that all these changes will be performed at the same time. Combined with the MongoDB philosophy of keeping a consistent dataset in one document instead of spreading it over many rows of different related tables, this removes many situations where you would need transactions in a relational database.
Not every project needs complex transactions. Sure, there are some domains where it is essential (like most situations where you deal with money), but in other cases it isn't actually that big of a deal when some data is inconsistent for a few milliseconds. You need to consider how important data consistency is for your project. When you come to the conclusion that there are many situations where you do need transactions, then by all means, stick to SQL.
In a pinch, MongoDB can simulate multi-document transactions by using the two-phase commit model. It's not easy to implement, it's not easy to work with, it does not result in a pretty data model, but it is a valid workaround when you have a project which would be perfect for MongoDB in all regards except for that one use-case which just can't do without transactions.
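To give a feel for why the two-phase commit workaround is awkward, here is a minimal in-memory sketch in Python. It loosely follows the state sequence usually described for this pattern (initial, pending, applied, done). All names (`accounts`, `transactions`, `start_transfer`, `apply_to_account`, `run_transfer`) are hypothetical, and plain dicts stand in for collections. No real driver calls are shown.

```python
# Plain dicts stand in for an "accounts" and a "transactions" collection.
accounts = {
    "A": {"_id": "A", "balance": 1000, "pending": []},
    "B": {"_id": "B", "balance": 1000, "pending": []},
}
transactions = {}


def start_transfer(txn_id, src, dst, amount):
    # Record the intent first, so an interrupted transfer can be found later.
    transactions[txn_id] = {
        "_id": txn_id, "src": src, "dst": dst,
        "amount": amount, "state": "initial",
    }


def apply_to_account(account_id, txn_id, delta):
    # Mimics one conditional single-document update: the change is applied
    # only if this transaction has not already touched the document.
    doc = accounts[account_id]
    if txn_id not in doc["pending"]:
        doc["balance"] += delta
        doc["pending"].append(txn_id)


def run_transfer(txn_id):
    txn = transactions[txn_id]
    txn["state"] = "pending"
    apply_to_account(txn["src"], txn_id, -txn["amount"])
    apply_to_account(txn["dst"], txn_id, txn["amount"])
    txn["state"] = "applied"
    # Remove the markers from both accounts, then mark the transfer finished.
    for account_id in (txn["src"], txn["dst"]):
        accounts[account_id]["pending"].remove(txn_id)
    txn["state"] = "done"


start_transfer("t1", "A", "B", 100)
run_transfer("t1")
```

The `pending` marker on each account is what makes a repeated step harmless while the transfer is in progress. It is also the reason the data model ends up less pretty, as noted above.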
A lot of popular NoSQL data stores don't support atomic multi-key updates (transactions) out of the box, but most of them provide primitives which allow you to build ACID transactions on the application level.
If a data store supports per key linearizability and compare-and-set operation (atomic document updates) then it's enough to implement serializable client-side transactions. For example, this approach is used in Google's Percolator and in CockroachDB database.
In my blog I created step-by-step visualization of serializable cross shard client-side transactions, described the major use cases and provided links to the variants of the algorithm. I hope it will help you to understand how to work with transactions with NoSQL data stores.
Among the data stores which support per key linearizability and CAS are:
Cassandra with lightweight transactions
Riak with consistent buckets
RethinkDB
ZooKeeper
etcd
HBase
DynamoDB
MongoDB
By the way, if you're fine with the Read Committed isolation level, then it makes sense to take a look at RAMP transactions by Peter Bailis. They can also be implemented with the same set of primitives.
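The per-key compare-and-set primitive mentioned above can be sketched as follows. This is a toy in-memory store written for illustration only: `VersionedStore`, `compare_and_set` and `increment` are hypothetical names, not the API of any of the listed data stores. It shows just the building block (an optimistic read-modify-write loop), not a full client-side transaction protocol.

```python
import threading


class VersionedStore:
    """Toy key-value store offering a per-key compare-and-set."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def read(self, key):
        # Returns (value, version); a missing key reads as (None, 0).
        with self._lock:
            return self._data.get(key, (None, 0))

    def compare_and_set(self, key, expected_version, new_value):
        # Writes only if nobody else has written since the caller's read.
        with self._lock:
            _, version = self._data.get(key, (None, 0))
            if version != expected_version:
                return False
            self._data[key] = (new_value, version + 1)
            return True


def increment(store, key, retries=10):
    """Read-modify-write loop that retries when the CAS loses a race."""
    for _ in range(retries):
        value, version = store.read(key)
        if store.compare_and_set(key, version, (value or 0) + 1):
            return True
    return False
```

A writer holding a stale version is simply refused and has to re-read, which is what makes it possible to layer stronger guarantees on top of single-key operations.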
In RethinkDB, you have some guarantee of atomicity. According to the documentation at https://rethinkdb.com/docs/architecture/:
Write atomicity is supported on a per-document basis – updates to a
single JSON document are guaranteed to be atomic. RethinkDB is
different from other NoSQL systems in that atomic document updates
aren’t limited to a small subset of possible operations – any
combination of operations that can be performed on a single document
is guaranteed to update the document atomically
When you want to run a non-atomic update, you have to explicitly opt in for it, according to https://www.rethinkdb.com/api/javascript/update/
nonAtomic: if set to true, executes the update and distributes the
result to replicas in a non-atomic fashion. This flag is required to
perform non-deterministic updates, such as those that require reading
data from another table.
There is an issue tracking transaction support for RethinkDB here: https://github.com/rethinkdb/rethinkdb/issues/4598
In short, you don't get full transactions, but you do have some basic guarantees that may be enough for you. Try to design your operations around those basic guarantees.

NoSQL vs. Relational Databases vs. Possible Hybrid

I'm hearing more about NoSQL, but have yet to have someone give me a clear explanation of how it is to be used instead of relational databases.
I've read that it can't do left joins, so I was trying to figure out how you'd be able to use such a data store. From reading "Preserve Joins by code in MongoDB", it seems the suggestion is to just make one large table, as if you had already done the joins on it.
If the above statement is true, then I can see how it can be used. However, I'm curious how you'd handle repeated data, since normalizing helps you remove redundancy and ensure consistency in the data (e.g. slight variations in capitalization, white space, etc.).
Are we simply sacrificing the consistency of the data for scalable speed, or am I missing something?
Edit
I've been doing some more digging and found the answers to the following questions useful for clarifying my understanding:
Why Google's BigTable referred as a NoSQL database?
How do you track record relations in NoSQL?
My understanding of consistency seems to be correct from those answers. And it looks like NoSQL is supposed to be used for specific types of problems, and if you need relations you should use a relational database.
But this raises more questions like:
It makes me wonder about real life examples of when to use NoSQL versus when not to?
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
MongoDB has the ability to have documents which include arrays of other documents. This solves many cases where you would have relations in relational databases.
When an invoice has multiple positions, you wouldn't put these positions into a separate collection. You would embed them as an array.
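As a small illustration of the invoice example, this is roughly what such a document could look like, written as a Python dict. The field names (`positions`, `quantity`, `unit_price`) and the values are made up for the sketch.

```python
# One invoice document with its positions embedded as an array,
# instead of a separate "positions" table joined by invoice id.
invoice = {
    "_id": "INV-1001",
    "customer": "ACME Corp",
    "positions": [
        {"item": "Keyboard", "quantity": 2, "unit_price": 25.0},
        {"item": "Monitor", "quantity": 1, "unit_price": 180.0},
    ],
}


def invoice_total(doc):
    """Sum up all embedded positions; no join is needed."""
    return sum(p["quantity"] * p["unit_price"] for p in doc["positions"])
```

Everything needed to display or total the invoice arrives with a single document read.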
It makes me wonder about real life examples of when to use NoSQL versus when not to?
There are many different NoSQL databases, each one designed with different use-cases in mind. But you tagged this question as MongoDB, so I assume that you mean MongoDB in particular.
MongoDB has two main advantages over relational databases.
First, it scales well.
When the database is too slow or too big, you can easily add more servers by creating a sharded cluster (each shard can itself be a replica set). This doesn't work nearly as well with most relational databases.
Second, it allows heterogeneous data.
Imagine, for example, the product database of a computer hardware store. What properties do products have? All products have a price and a vendor. But CPUs have a clock rate, hard drives and RAM chips have a capacity (and these capacities aren't comparable), monitors have a resolution and so on. How would you design this in a relational database? You would either create a very long productID-property-value table or you would create a very wide and sparse product table with every property you can imagine, but most of them being NULL for most products. Both solutions aren't really elegant. But MongoDB can solve this much better because it allows each document in a collection to have a different set of properties.
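The hardware-store example can be sketched with plain Python dicts standing in for documents. The product names, field names and the two helper functions are hypothetical; they only illustrate that each document carries its own set of properties.

```python
# Every product has a price and a vendor; the remaining fields differ.
products = [
    {"name": "CPU X", "price": 250, "vendor": "V1", "clock_rate_ghz": 3.6},
    {"name": "Disk Y", "price": 90, "vendor": "V2", "capacity_gb": 2000},
    {"name": "Monitor Z", "price": 180, "vendor": "V3", "resolution": "2560x1440"},
]


def find(docs, **criteria):
    """Return the documents whose fields equal all given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def with_field(docs, field):
    """Return the documents that have the given property at all."""
    return [d for d in docs if field in d]
```

There is no wide table full of NULL columns and no property-value side table; a product without a capacity just has no `capacity_gb` field.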
What can't it do?
As a rather new technology, there isn't that much literature about it. The software ecosystem around it isn't that mature either. The tools you can get for relational databases are often much more polished.
There are also some use-cases MongoDB isn't well-suited for.
MongoDB doesn't do JOINs. When your data is very relational and denormalizing it would be counter-productive, it might be a poor choice for your product. But you might want to take a look at graph databases like Neo4j, which focus even more on relations than relational databases. Update 2016: MongoDB 3.2 now has rudimentary JOIN support with the $lookup aggregation stage, but it's still very limited in functionality compared to relational and graph databases.
MongoDB doesn't do transactions. At least not complex transactions. Certain actions which only affect a single document are guaranteed to be atomic, but as soon as you affect more than one document, you can't guarantee that no other query will happen in-between and find an inconsistent state.
MongoDB is bad for ad-hoc reporting. Its options for data mining are severely limited. The rather new aggregation functions help, and MapReduce can also solve some surprisingly complex problems once you learn to use it well, but SQL usually has the better tools for things like that.
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Relational databases have been around for about 40 years. Their theory is a well-researched topic in computer science. There are whole libraries of books written about the theory behind them. There is a by-the-book solution for every imaginable corner case by now.
But NoSQL databases, on the other hand, are a rather new technology. We are still figuring out the best practices. The most frequent advice is: "Use your own head. Think about what queries are performed most often, and optimize your data schema for them."
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
When possible I would advise against using two different database technologies in the same product:
Anyone who maintains and supports the product must be familiar with both technologies
Troubleshooting gets a lot harder
The sysadmins need to keep an additional database running and updated
You have an additional point of failure which can lead to downtime
I would only recommend mixing database technologies when fulfilling your requirements without doing so becomes not just hard but physically impossible. Otherwise, make your pick and stay with it.

What operations are cheap/expensive in mongodb?

I'm reading up on MongoDB, and trying to get a sense of where it's best used. One question that I don't see a clear answer to is which operations are cheap or expensive, and under what conditions.
Can you help clarify?
Thanks.
It is often claimed that MongoDB has insanely fast writes. While they are indeed not slow, this is quite an overstatement. Write throughput in MongoDB is limited by a global write lock. Yes, you heard me right, there can be only ONE* write operation happening on the server at any given moment.
Also, I suggest you take advantage of the schemaless nature of MongoDB and store your data denormalized. Often it is possible to do just one disk seek to fetch all required data (because it is all in the same document). Fewer disk seeks mean faster queries.
If data sits in RAM - no disk seeks are required at all, data is served right from memory. So, make sure you have enough RAM.
Map/Reduce, group, $where queries are slow.
It is not fast to keep writing to one big document (using $push, for example). The document will outgrow its disk boundaries and will have to be copied to another place, which involves more disk operations.
And I agree with @AurelienB, some basic principles are universal across all databases.
Update
* Since 2011, several major versions of MongoDB have been released, improving the situation with locking (from server-wide to database-level to collection-level). A new storage engine, WiredTiger, was introduced, which has document-level locks. All in all, writes should be significantly faster now, in 2018.
From my experience, one thing that should be mentioned is that MongoDB is not a very good fit for reporting, because reports usually need data from different collections (a 'join'), and MongoDB does not provide a good way to aggregate data across multiple collections (and is not supposed to). For some reports map/reduce or incremental map/reduce can work well, but those situations are rare.
For reports, some people suggest migrating the data into a relational database, since those have a lot of reporting tools.
This is not very different from other database systems.
Queries on indexed data are fast. Queries over a lot of data are... slow.
Due to denormalization, writes are fast when there is no index to maintain, which is why logging is the basic use case.
On the other hand, reading data that is on disk (not in RAM) without an index can be very slow when you have billions of documents.
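The difference between an indexed lookup and a full scan can be mimicked in a few lines of Python. This is only an analogy under simple assumptions: a list stands in for a collection and a dict stands in for an index, and the names `scan`, `build_index` and `indexed_lookup` are made up. It says nothing about how MongoDB implements its indexes internally.

```python
# 100,000 toy documents spread over 1,000 distinct users.
docs = [{"_id": i, "user": f"user{i % 1000}"} for i in range(100_000)]


def scan(docs, user):
    # Unindexed query: every document has to be inspected.
    return [d["_id"] for d in docs if d["user"] == user]


def build_index(docs, field):
    # Building (and later maintaining) the index is extra work on writes.
    index = {}
    for d in docs:
        index.setdefault(d[field], []).append(d["_id"])
    return index


user_index = build_index(docs, "user")


def indexed_lookup(index, user):
    # Indexed query: jump straight to the matching ids.
    return index.get(user, [])
```

Both functions return the same ids; the scan touches all 100,000 documents to do so, while the lookup touches only the matching entry. The price is that every insert would also have to update `user_index`.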

Why doesn't MongoDB use fsync()?

So I have done some research and found out that MongoDB doesn't do fsync(), which means that when you tell the database to write something, the database might tell you it's written, although it's not. Isn't this going against CRUD?
If I'm correct, are there any good reasons for this?
The reason is performance. Without having to write to disk on each change, MongoDB can handle updates faster.
MongoDB tells you when updates have been delivered to the server, not when the updates have been written, as you can read in the documentation on Verifying Propagation of Writes with getLastError:
Note: the current implementation returns when the data has been delivered to [the] servers. Future versions will provide more options for delivery vs. say, physical fsync at the server.
This is going against ACID, more specifically against the D, which stands for durability:
Durability [guarantees] that once the user has been notified of a transaction's success the transaction will not be lost, the transaction's data changes will survive system failure, and that all integrity constraints have been satisfied, so the DBMS won't need to reverse the transaction.
ACID properties mostly apply to traditional RDBMS systems. NoSQL systems, which includes MongoDB, give up on one or more of the ACID properties in order to achieve better scalability. In MongoDB's case durability has been sacrificed for better performance when handling large amounts of updates.
MongoDB and ACID
Most ACID properties are guarantees at transaction level. A transaction is usually a group of queries that should be treated as a single unit. MongoDB has no concept of transactions, again for performance reasons. Therefore most ACID properties don't apply to MongoDB.
A — Atomicity states that a transaction should either succeed or fail. It is not allowed to partially succeed; if part of the transaction fails, the entire transaction should be rolled back. MongoDB supports atomic operations on a document level, but not on a 'transaction' level.
C — Consistency partially refers to atomicity, but also includes referential integrity. A relational database is responsible for making sure that all foreign key references are valid. MongoDB has no concept of foreign keys, so this ACID property doesn't apply.
I — Isolation states that two concurrent transactions are not allowed to interfere with each other; if two transactions try to modify the same data, the second transaction has to wait for the first one to complete. To achieve this, the database will lock the data. MongoDB has no concept of locking, so it doesn't support isolation for multiple operations (see note 1 below). Single operations are isolated.
D — Durability is described above. MongoDB doesn't support true durability (yet), in terms of ACID-ic durability.
Now, you may think that MongoDB is useless compared to RDBMS systems because it lacks transactions and most ACID guarantees. However, part of the reason that transactions exist is that relational databases need to treat certain data as a single entity, but this data has been normalized into multiple tables.
MongoDB allows you to store your data as a single entity. This removes the need for foreign keys and referential integrity in most cases. You also don't need multi-query transactions, because you don't need multiple tables to update a single entity. Most of the times you only have to update a single document, and these operations are atomic in MongoDB.
1) According to the first comment on this page, db.eval() provides isolation for multiple operations. However, according to the documentation you usually want to avoid the use of db.eval().
Is this relevant?
durability: added occasional file sync
default: sync every 60 seconds, configurable with syncdelay
http://github.com/mongodb/mongo/commit/c44bff08fd95616302a73e92b48b2853c1fd948d

What are the advantages of using a schema-free database like MongoDB compared to a relational database?

I'm used to using relational databases like MySQL or PostgreSQL, combined with MVC frameworks such as Symfony, RoR or Django, and I think it works great.
But lately I've heard a lot about MongoDB which is a non-relational database, or, to quote the official definition,
a scalable, high-performance, open
source, schema-free, document-oriented
database.
I'm really interested in staying on the cutting edge and want to be aware of all the options I'll have for my next project, so I can choose the best technologies out there.
In which cases is using MongoDB (or similar databases) better than using a "classic" relational database?
And what are the advantages of MongoDB vs MySQL in general?
Or at least, why is it so different?
If you have pointers to documentation and/or examples, it would be of great help too.
Here are some of the advantages of MongoDB for building web applications:
A document-based data model. The basic unit of storage is analogous to JSON, Python dictionaries, Ruby hashes, etc. This is a rich data structure capable of holding arrays and other documents. This means you can often represent in a single entity a construct that would require several tables to properly represent in a relational db. This is especially useful if your data is immutable.
Deep query-ability. MongoDB supports dynamic queries on documents using a document-based query language that's nearly as powerful as SQL.
No schema migrations. Since MongoDB is schema-free, your code defines your schema.
A clear path to horizontal scalability.
You'll need to read more about it and play with it to get a better idea. Here's an online demo:
http://try.mongodb.org/
There are numerous advantages.
For instance, your database schema will be more scalable, you won't have to worry about migrations, and the code will be more pleasant to write. Here's the code of one of my models:
class Setting
  include MongoMapper::Document
  key :news_search, String, :required => true
  key :is_availaible_for_iphone, :required => true, :default => false
  belongs_to :movie
end
Adding a key is just adding a line of code!
There are also other advantages that will appear in the long run, like better scalability and speed.
... But keep in mind that a non-relational database is not better than a relational one. If your database has a lot of relations and normalization, it might make little sense to use something like MongoDB. It's all about finding the right tool for the job.
For more things to read I'd recommend taking a look at "Why I think Mongo is to Databases what Rails was to Frameworks" or this post on the mongodb website. To get excited, and if you speak French, take a look at this article explaining how to set up MongoDB from scratch.
Edit: I almost forgot to tell you about this railscast by Ryan. It's very interesting and makes you want to start right away!
The advantage of schema-free is that you can dump whatever your load is in it, and no one will ever have any ground for complaining about it, or for saying that it was wrong.
It also means that whatever you dump in it, remains totally void of meaning after you have done so.
Some would label that a gross disadvantage, some others won't.
The fact that a relational database has a well-established schema, is a consequence of the fact that it has a well-established set of extensional predicates, which are what allows us to attach meaning to what is recorded in the database, and which are also a necessary prerequisite for us to do so.
Without a well-established schema, there are no extensional predicates, and without extensional predicates, there is no way for the user to make any meaning out of what was stuffed in it.
My experience with Postgres and Mongo, after working with both databases in my projects:
Postgres (RDBMS)
Postgres is recommended if your future applications have a complicated schema that needs lots of joins, if all the data is related, or if you have heavy write loads. Postgres is open source, fast, ACID compliant, uses less disk space, performs well all around for JSON storage too, and offers full serializability of transactions with 3 levels of transaction isolation.
The biggest advantage of staying with Postgres is that we get the best of both worlds. We can store data in JSONB with constraints, consistency and speed, and we can use all SQL features for other types of data. The underlying engine is very stable and copes well with a good range of data volumes. It also runs on your choice of hardware and operating system. Postgres provides NoSQL capabilities along with full transaction support, storing JSON documents with constraints on the field data.
General Constraints for Postgres
Scaling Postgres horizontally is significantly harder, but doable.
Fast read operations cannot be fully achieved with Postgres.
NoSQL Databases
MongoDB (WiredTiger)
MongoDB may beat Postgres in the dimension of “horizontal scale”. Storing JSON is what Mongo is optimized to do. Mongo stores its data in a binary format called BSON, which is (roughly) just a binary representation of a superset of JSON. MongoDB stores objects exactly as they were designed. According to MongoDB, for write-intensive applications the new engine (WiredTiger) gives users up to a 10x increase in write performance (I should try this), with an 80 percent reduction in storage utilization, helping to lower storage costs and achieve greater utilization of hardware.
General Constraints of MongoDB
Using a schemaless storage engine leads to the problem of implicit schemas. These schemas aren’t defined by our storage engine but instead are defined by application behavior and expectations.
Stand-alone NoSQL technologies do not meet ACID standards because they sacrifice critical data protections in favor of high-throughput performance for unstructured applications. It’s not hard to apply ACID to NoSQL databases, but it would make the database slow and inflexible to some extent. “Most of the NoSQL limitations were optimized in the newer versions and releases which have overcome its previous limitations up to a great extent”.
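The "implicit schema" point above can be made concrete with a small sketch: when the storage engine enforces nothing, the expectations end up living in application code. `EXPECTED_FIELDS` and `schema_errors` are hypothetical names invented for this illustration, and the blog-post fields are made up.

```python
# The schema the application silently relies on (hypothetical example).
EXPECTED_FIELDS = {"title": str, "author": str, "tags": list}


def schema_errors(doc, expected=EXPECTED_FIELDS):
    """Return a list of ways in which doc violates the implicit schema."""
    errors = []
    for field, field_type in expected.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], field_type):
            errors.append(f"wrong type for {field}")
    return errors
```

A check like this has to be written and kept up to date by hand for every document type, whereas a relational schema would reject the bad row at write time.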
It's all about trade offs. MongoDB is fast but not ACID, it has no transactions. It is better than MySQL in some use cases and worse in others.
The lines below are from MongoDB: The Definitive Guide.
There are several good reasons:
Keeping different kinds of documents in the same collection can be a
nightmare for developers and admins. Developers need to make sure
that each query is only returning documents of a certain kind or
that the application code performing a query can handle documents of
different shapes. If we’re querying for blog posts, it’s a hassle to
weed out documents containing author data.
It is much faster to get a list of collections than to extract a
list of the types in a collection. For example, if we had a type key
in the collection that said whether each document was a “skim,”
“whole,” or “chunky monkey” document, it would be much slower to
find those three values in a single collection than to have three
separate collections and query for their names
Grouping documents of the same kind together in the same collection
allows for data locality. Getting several blog posts from a
collection containing only posts will likely require fewer disk
seeks than getting the same posts from a collection containing
posts and author data.
We begin to impose some structure on our documents when we create
indexes. (This is especially true in the case of unique indexes.)
These indexes are defined per collection. By putting only documents
of a single type into the same collection, we can index our
collections more efficiently
After a question about databases with textual storage, I glanced at MongoDB and similar systems.
If I understood correctly, they are supposed to be easier to use and set up, and much faster. Perhaps also more secure, as the lack of SQL prevents SQL injection...
Apparently, MongoDB is used mostly for web applications.
Basically, and they state this themselves, these databases aren't suited for complex queries, data mining, etc. But they shine at quickly retrieving lots of flat data.
MongoDB supports searching by field and regular-expression searches. Queries can also include user-defined JavaScript functions.
MongoDB can be used as a file system, taking advantage of load balancing and data replication features over multiple machines for storing files.