Data Synchronization in a Distributed system - rest

We have an REST-based application built on the Restlet framework which supports CRUD operations. It uses a local-file to store the data.
Now the requirement is to deploy this application on multiple VMs and any update operation in one VM needs to be propagated other application instances running on other VMs.
Our idea to solve this was to send multiple POST msgs (to all other applications) when a update operation happens in a given VM.
The assumption here is that each application has a list/URLs of all other applications.
Is there a better way to solve this?

Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
The short-story: get your favorite RDBMS (for example, mysql is popular) and have your app servers connect to in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
The long-story: The three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), but the more-common reads could go to any of the read slaves.
For services with evenly-mixed read/write traffic, you may be better served by dropped some of the conveniences (and accompanying restrictions) that formal SQL provides and instead use of one of the various "nosql" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.

I can see 7 major options for now. You should find out more details and decide whether the facilities / trade-offs are appropriate for your purpose
Perform the CRUD operation on a common RDBMS. Simplest and most consistent
Perform the CRUD operations on a common RDBMS which runs as fast in-memory RDBMS. eg TimesTen from Oracle etc
Perform the CRUD on a distributed cache or your own home cooked distributed hash table which can guarantee synchronization eg Hazelcast/ehcache and others
Use a fast common state server like REDIS/memcached and perform your updates
in a synchronized manner on it and write out the successfull operations to a DB in a lazy manner if required.
Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master. Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (eg postgres) that runs underneath and syncs all of your updates fairly fast.
Target eventual consistency and use a distributed data store like Cassandra which lets you target the consistency you require
Use distributed consensus algorithms like Paxos or RAFT or an implementation of the same(recommended) like zookeeper or etcd respectively and take ownership of the item you want to change from each REST server before you perform the CRUD operation - might be a bit slow though and same stuff is what Cassandra might give you.

Related

How to write data to both NoSQL and RDBMS simultaneously and efficiently

Let’s assume a setup where a mobile application is communicating with its backend via an API, and data resulting from this communication (eg JSON- based transaction writes among others) is written into and read from a MongoDB instance.
Now since I would like to perform some heavy analytics on data stored in mongo, should I rather:
save data directly to RDBMS at the same time as I write to Mongo (so the backend service calls Mongo and after successful write also calls RDBMS)
perform read from Mongo (with some intervals) and load fresh data into RDBMS
I am afraid that both of those solutions require also re-engineering theoretically schema-less Mongo to be in constant agreement with relations and schema in RDBMS. Does it really require more planning for any document structure changes in Mongo? I intuitively say yes, but I look for real world examples. I hope my point is clear enough.
Maybe CQRS pattern will be good for You.
See: https://martinfowler.com/bliki/CQRS.html
You can use RDBMS for Write Model. Mongo - for Read Model.
After every write operation to RDBMS You should update Your ReadModel (MongoDB Document) based on data from Write Model.
There are a few constraints that need to be understood before you embark on a solution here. The most relevant of these is latency. How out-of-date can your data be?
You are almost definitely looking at some kind of write-behind solution here, taking data out of MongoDB, and writing it to your data warehouse. The question is, how far behind your MongoDB can your data warehouse be? Many solutions based on an extract-transform-load model (ETL) work on a nightly basis, so as to minimize impact on the online system. Some can do the same on an hourly basis, but will have more potential impact on the live system.
Transaction-by-transaction support is likely not needed for an analysis system. You really want to avoid this if you can, as it puts far more load on both systems than is usually justified.
To answer your second question, yes, once you start depending on a schema, it needs to be stable. It doesn't have to be synced up with your target schema necessarily, but your ETL process will have to be aware of both, and will have to be modified any time either one materially changes. Being "schema-less" doesn't mean there isn't a schema, it just means that the schema is not enforced by the software, instead it is enforced by the dependencies on the system.
I think the option with least engineering effort is to use a Kafka connector for MongoDB, such that the connector will read the MongoDB changes from the oplog in near-real time and write the event in Kafka. Then from Kafka you can write the data to a relational DB using a stream processing.
Dual write from UI is not a good option as it can introduce latency, complexity and opeeational overhead. What if the write to one DB fails?

CQRS Write database

In our company we are developing a microservice based system and we apply CQRS pattern. In CQRS we separate Commands and Queries, because of that we have to develop 2 microservices. Currently I was assigned to enhance CQRS pattern to save events in a separate database (event sourcing). I understand that having a separate event database is very important but do we really need a separate Write Database? What is the actual use of the Write database?
If you have an event database, it is your Write database. It is the system-of-record and contains the transactionally-consistent state of the application.
If you have a separate Read database, it can be built off of the event log in either a strongly-consistent or eventually-consistent manner.
I understand that having a separate event database is very important but do we really need a separate Write Database? What is the actual use of the Write database?
The purpose of the write database is to stand as your book of record. The write database is the persisted representation that you use to recover on restart. It's the synchronization point for all writes.
It's "current Truth" as your system understands it.
In a sense, it is the "real" data, where the read models are just older/cached representations of what the real data used to look like.
It may help to think in terms of an RDBMS. When traffic is small, we can serve all of the incoming requests from a single database. As traffic increases, we want to start offloading some of that traffic. Since we want the persisted data to be in a consistent state, we can't offload the writes -- not if we want to be resolving conflicts at the point of the write. But we can shed reads onto other instances, provided that we are wiling to admit some finite interval of time between when a write happens, and when the written data is available on all systems.
So we send all writes to the leader, who is responsible for organizing everything into the write ahead log; changes to the log can then be replicated to the other instances, which in turn build out local copies of the data structures used to support low latency queries.
If you look very carefully, you might notice that your "event database" shares a lot in common with the "write ahead log".
No, you don't necessarily need a separate write database. The core of CQRS segregation is at the model (code) level. Going all the way to the DB might be beneficial or detrimental to your project, depending on the context.
As with many orthogonal architectural decisions surrounding the use of CQRS (Event Sourcing, Command Bus, etc.), the pros and cons should be considered carefully prior to adoption. Below some amount of concurrent access, separating read and write DBs might not be worth the effort.

Is there a reliable (single server) MongoDB alternative?

I like the idea of document databases, especially MongoDB. It allows for faster development as we don't have to adjust database schema's. However MongoDB doesn't support multi-document transactions and doesn't guarantee that modifications get written to disk immediately like normal databases (I know that you can make the time between flushes quite small, but it's still no guarantee).
Most of our projects are not that big that they need things like multi-server environments. So keeping that in mind. Are there any single server MongoDB-like document databases that support multi-document transactions and reliable flushing to disk?
It might be worthwhile to look at ArangoDB. It is a multi model database with a flexible data model for documents, graphs, and key-values. With respect to your specific requirements, ArangoDB database has full ACID transactions which can span over multiple documents in the same collection as well as over multiple collections (see Transactions in ArangoDB). That is, you can execute a group of manipulations to your documents together in a transaction and have guaranteed atomicity and isolation. If you additionally set waitForSync: true
(as described further down on said page), you get a guaranteed sync to disk before your transaction reports completion. Note that this happens automatically if your transaction spans multiple collections.
A very short answer to your specific (but brief) requirements:
Are there any single server MongoDB-like document databases that support multi-document transactions and reliable flushing to disk?
RavenDB [1] provides support for multi-doc transactions [2]. Unfortunately I don't know it handles durability.
CouchDB [3] provides durable writes, but no multi-doc transactions
RethinkDB [4] provides durable writes, but no multi-doc transactions.
So you might wonder what's different about these 3 solutions? Most of the time is their querying support (I'd say RethinkDB has the most advanced one covering pretty much all types of queries: sub-queries, JOINs, aggregations, etc.), their history (read: production readiness -- here I'd probably say CouchDB is in the lead), their distribution model (you mentioned that's not interesting for you), their licensing (RavenDB: commercial, CouchDB: Apache License, Rethinkdb: AGPL).
The next step would be for you to briefly look over their feature set and figure out which one comes close to your needs and give it a try.
I have some experience with CouchDB and ArangoDB which I can share:
You can run CouchDB with durability turned on (delayed_commits = false) so it will also sync your data to disk.
However, this is a global setting so it affects all writes. AFAIK you cannot set it on a per-collection level (the CouchDB term for "collection" would be "database").
Regarding multi-document operations: CouchDB has MVCC, so reading multiple documents from the same database provides a consistent result even in the face of parallel writers.
Writing multiple documents to the same database can also be made transactional for special cases, e.g. when using the bulk documents API.
But there is no way to execute cross-database operations in CouchDB. This is just not intended.
On ArangoDB: in ArangoDB you can turn on immediate syncing to disk on a per-collection level: you can turn it on for collections which you cannot tolerate any data loss in. You can turn immediate syncing off for not-so-important collections for performance reasons. It will then still sync modifications to disk frequently, but not immediately. It provides multi-document and multi-collection transactions.
Checkout the following:
arangodb
rethinkdb
I would suggest you look at Couchbase.
Couchbase can be run single server & you can add nodes later if you want.
Couchbase has memcached integrated so you have fast caching of common data, with a reliable method of writing updates to disk.
They also have a new query language (in development but you can use it now) called NQL ("Nickel") that gives you SQL like access, if that's important to you.
With cross-datacenter replication, you can keep two DBs on different machines or data centers in sync, which is good for having an offsite backup. This also allows you to add elastic search if you wish to have a full text search engine for those types of queries.
In short, Couchbase is a pretty complete solution, all open source and has intelligent (in my opinion) architecture for addressing the typical problems with distributed databases (e.g.: every document is "owned" by a given node, so all changes go to that node, and then the updates are replicated, this is better, I think, than say Riak where you can have updates go to two nodes and then have to be reconciled.)
You can use Couchbase on one node to run the database for many projects by separating the projects into different buckets.
there are so many nosql databases and definitely its hard to choose one. You will have to come up with proper requirements and know exactly what you want.
Following link compared almost all the popular nosql databases
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
I hope this helps.
Berkeley DB is one we used. It supports ACID. It does have transactions, but as to your term "multidocument" applies, I'm not entirely sure. I imagine so long as each database (i.e. individual document) shares the same BDB environment (i.e. where transactions are stored) then maybe that gets what you want. BDB does have other tradeoffs though. With fully durability and high concurrency, commits are pretty slow.
Give a try to: http://www.orientdb.org/
"OrientDB has the flexibility of the Document databases and the power of the Graph databases to manage relationships. It can work in schema-less mode, schema-full or a mix of both. Supports advanced features such as ACID Transactions, Fast Indexes, Native and SQL queries. It imports and exports documents in JSON. OrientDB uses a new indexing algorithm called MVRB-Tree, derived from the Red-Black Tree and from the B+Tree with benefits of both: fast insertion and ultra fast lookup".
You do not have to adjust schemas in document data stores, but that does not mean you do not need some sort of schema as you probably want to do something meaningful with your data. It appears you would like an ACID database. If you have relational data, and you need transactions with that data, well it sounds very much like you need a relational database.
With "NoSQL" databases like Mongo, you are giving up ACID for features like many writable replicas, sharding, and quick accessing of document data. Sounds like you do not benefit from that so why take the tradeoff? A lot of people have been doing hybrid approaches lately with PostgreSQL by storing documents in a relational table as blobs of JSON. With this, you can have the advantage of storing your data as not strictly structured columns where it is not needed.
So if you have multiple documents that you need to be transactional on update, you can column out the keys, and have a column "document" or something where it is simply a blob of JSON where you serialize and deserialize it. This is not criticizing Mongo or other document stores as a database but it is just not really a good choice for transactional multidocument data. MarkLogic I believe does ACID over multiple documents too.
I think a lot of people find appeal with mongodb due to the schema-less-ness but I think in the end they get bit by trying to shoehorn a relational model into it. So as always the DB choice depends on how your data is.
If I were you I would take a close look at Solr. The underlying data-layer (Lucene) is by far the most mature of the NoSQL databases, and Solr makes installing, configuring, and integrating a single-host lucene store trivial.
In answer to your question, it supports user-delineated transactions. The read-optimised nature of Lucene can make it unsuitable for many applications, but most of those are well suited to Solr/Lucene+[SQL,Cassandra,CouchDB,RDF] depending on the requirements.
Personally I tend to start with Solr+SQL or Solr+RDF, but I know some people who love the whole NodeJS+CouchDB style, and I am convinced of the value of the flexibility that provides.
The bottom line is that there are enough NoSQL and SQL-extensions out there that care about data integrity to satisfy any requirement you have without you having to compromise you or your users' data.
Personally I believe you really need to check what your requirements are.
Due to the dynamics of how the OS of your server works it is complicated to say that everything "immediately" goes to disk even when you tell it to. certainly I know ACID techs like SQL are vulnerable to partial corruption through unfinished business and losing operations within a specific window when a single server goes down, unfortunately this is one of the problems of using a single server; you have no choice but to accept it.
I should note that a transaction does not ensure that your server will receive the entire data before failure ( http://en.wikipedia.org/wiki/Database_transaction ), I mean what if the server dies part way through a transaction?
You can perform a safe rollback based on constraints with transactions but few databases will provide the ability to continue playing the transaction unless they have already received all necessary data for it (which isn't normally the case), by which time the data might even be stale anyway.
In fact due to the weight of some transactions and the amount of queries performed within them I reckon you might get a greater window of operational loss using transactions than you might from the 60ms write to disk window on MongoDB at times. But of course that depends upon abuse, however, just like stored procedures, this abuse is common place.
Transactions shine on cascading deletes and typical scenarios like transferring money in a bank account, however, cascadable deletes are normally better done (as most sites do) by a cronjob with the application marking the row as deleted (to avoid the rollback of a transaction showing the deleted data back to the user again); this way you can do a lot of stuff to ensure consistency that you cannot in real-time do while the user is using your application.
So you should really question why you need a tech and what it will succeed in doing, atm the brevity of your question tells me your not sure about your requirements completely.

SQL vs NoSQL for an inventory management system

I am developing a JAVA based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
The solutions i am looking at are postgres(our db till now in a synchronous replication mode), NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system(e.g. RabbitMQ)
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
session data should probably be put in a separate database more suitable for the task(e.g. memcached, redis, etc)
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
some DBs store records in memory and then flush memory to disk peridocally, this could lead to data loss in the case of a power failure. (e.g Mongo works this way by default unless you enable journaling. CouchDB always appends to the records(even a delete is a flag appended to the record so data loss is extremely difficult))
Some DBs are designed to be extremely reliable, even if an earthquake, hurricane or other natural disaster strikes, they remain durable. these include Cassandra, Hbase, Riak, Hadoop, etc
Which type of durability are your referring to?
Job Queues should be executed in order and preferably never lost.
Most noSQL solutions prefer to run in parallel. so you have two options here.
1. use a DB that locks the entire table for every query(slower)
2. build your app to be smarter or evented(client side sequential queuing)
Easy/Fast development and future maintenance.
generally, you will find that SQL is faster to develop at first, but changes can be harder to implement
noSQL may require a little more planning, but is easier to do ad hoc queries or schema changes.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"will I need to my change my schema frequently?
"is my data highly relational? in what way?"
"does the vendor behind my chosen DB have enough experience to help me when I need it?"
"will I need special feature such as GeoSpatial indexing, full text search, etc?"
"how close to realtime will I need my data? will it hurt if I don't see the latest records show up in my queries until 1sec later? what level of latency is acceptable?"
"what do I really need in terms of fail-over"
"how big is my data? will it fit in memory? will it fit on one computer? is each individual record large or small?
"how often will my data change? is this an archive?"
If you are going to have multiple customers(channels?) each with their own inventory schemas, a document based DB might have it's advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using mongo, couch, riak or orientdb with the given constraints. But as for which is the best? I would try talking directly DB vendors, and maybe watch the nosql tapes
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. On the other extreme, you can choose to send the command and don't even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if you're data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time are a great way to start introducing a new layer, because there's no big risks there. Also, cached data usually doesn't keep any relations so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.

Is NoSQL 100% ACID 100% of the time?

Quoting: http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/
There have been various attempts to
overcome SQL’s performance and
scalability problems, including the
buzzworthy NoSQL movement that burst
onto the scene a couple of years ago.
However, it was quickly discovered
that while NoSQL might be faster and
scale better, it did so at the expense
of ACID consistency.
Wait - am I reading that wrongly?
Does it mean that if I use NoSQL, we can expect transactions to be corrupted (albeit I daresay at a very low percentage)?
It's actually true and yet also a bit false. It's not about corruption it's about seeing something different during a (limited) period.
The real thing here is the CAP theorem which simply states you can only choose two of the following three:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition
tolerance (the system continues to operate despite arbitrary message loss)
The traditional SQL systems choose to drop "Partition tolerance" where many (not all) of the NoSQL systems choose to drop "Consistency".
More precise: They drop "Strong Consistency" and select a more relaxed Consistency model like "Eventual Consistency".
So the data will be consistent when viewed from various perspectives, just not right away.
NoSQL solutions are usually designed to overcome SQL's scale limitations. Those scale limitations are explained by the CAP theorem. Understanding CAP is key to understanding why NoSQL systems tend to drop support for ACID.
So let me explain CAP in purely intuitive terms. First, what C, A and P mean:
Consistency: From the standpoint of an external observer, each "transaction" either fully completed or is fully rolled back. For example, when making an amazon purchase the purchase confirmation, order status update, inventory reduction etc should all appear 'in sync' regardless of the internal partitioning into sub-systems
Availability: 100% of requests are completed successfully.
Partition Tolerance: Any given request can be completed even if a subset of nodes in the system are unavailable.
What do these imply from a system design standpoint? what is the tension which CAP defines?
To achieve P, we needs replicas. Lots of em! The more replicas we keep, the better the chances are that any piece of data we need will be available even if some nodes are offline. For absolute "P" we should replicate every single data item to every node in the system. (Obviously in real life we compromise on 2, 3, etc)
To achieve A, we need no single point of failure. That means that "primary/secondary" or "master/slave" replication configurations go out the window since the master/primary is a single point of failure. We need to go with multiple master configurations. To achieve absolute "A", any single replica must be able to handle reads and writes independently of the other replicas. (in reality we compromise on async, queue based, quorums, etc)
To achieve C, we need a "single version of truth" in the system. Meaning that if I write to node A and then immediately read back from node B, node B should return the up-to-date value. Obviously this can't happen in a truly distributed multi-master system.
So, what is the "correct" solution to the problem? It details really depend on your requirements, but the general approach is to loosen up some of the constraints, and to compromise on the others.
For example, to achieve a "full write consistency" guarantee in a system with n replicas, the # of reads + the # of writes must be greater or equal to n : r + w >= n. This is easy to explain with an example: if I store each item on 3 replicas, then I have a few options to guarantee consistency:
A) I can write the item to all 3 replicas and then read from any one of the 3 and be confident I'm getting the latest version B) I can write item to one of the replicas, and then read all 3 replicas and choose the last of the 3 results C) I can write to 2 out of the 3 replicas, and read from 2 out of the 3 replicas, and I am guaranteed that I'll have the latest version on one of them.
Of course, the rule above assumes that no nodes have gone down in the meantime. To ensure P + C you will need to be even more paranoid...
There are also a near-infinite number of 'implementation' hacks - for example the storage layer might fail the call if it can't write to a minimal quorum, but might continue to propagate the updates to additional nodes even after returning success. Or, it might loosen the semantic guarantees and push the responsibility of merging versioning conflicts up to the business layer (this is what Amazon's Dynamo did).
Different subsets of data can have different guarantees (ie single point of failure might be OK for critical data, or it might be OK to block on your write request until the minimal # of write replicas have successfully written the new version)
The patterns for solving the 90% case already exist, but each NoSQL solution applies them in different configurations. The patterns are things like partitioning (stable/hash-based or variable/lookup-based), redundancy and replication, in memory-caches, distributed algorithms such as map/reduce.
When you drill down into those patterns, the underlying algorithms are also fairly universal: version vectors, merckle trees, DHTs, gossip protocols, etc.
It does not mean that transactions will be corrupted. In fact, many NoSQL systems do not use transactions at all! Some NoSQL systems may sometimes lose records (e.g. MongoDB when you do "fire and forget" inserts rather than "safe" ones), but often this is a design choice, not something you're stuck with.
If you need true transactional semantics (perhaps you are building a bank accounting application), use a database that supports them.
First, asking if NoSql is 100% ACID 100% of the time is a bit of a meaningless question. It's like asking "Are dogs 100% protective 100% of the time?" There are some dogs that are protective (or can be trained to be) such as German Shepherds or Doberman Pincers. There are other dogs that could care less about protecting anyone.
NoSql is the label of a movement, and not a specific technology. There are several different types of NoSql databases. There are document stores, such as MongoDb. There are graph databases such as Neo4j. There are key-value stores such as cassandra.
Each of these serve a different purpose. I've worked with a proprietary database that could be classified as a NoSql database, it's not 100% ACID, but it doesn't need to be. It's a write once, read many database. I think it gets built once a quarter (or once a month?) and then is read 1000s of time a day.
There is a lot of different NoSQL store types and implementations. Every of them can solve trade-offs between consistency and performance differently. The best you can get is a tunable framework.
Also the sentence "it was quickly discovered" from you citation is plainly stupid, this is no surprising discovery but a proven fact with deep theoretical roots.
In general, it's not that any given update would fail to save or get corrupted -- these are obviously going to be a very big issue for any database.
Where they fail on ACID is in data retrieval.
Consider a NoSQL DB which is replicated across numerous servers to allow high-speed access for a busy site.
And lets say the site owners update an article on the site with some new information.
In a typical NoSQL database in this scenario, the update would immediately only affect one of the nodes. Any queries made to the site on the other nodes would not reflect the change right away. In fact, as the data is replicated across the site, different users may be given different content despite querying at the same time. The data could take some time to propagate across all the nodes.
Conversely, in a transactional ACID compliant SQL database, the DB would have to be sure that all nodes had completed the update before any of them could be allowed to serve the new data.
This allows the site to retain high performance and page caching by sacrificing the guarantee that any given page will be absolutely up to date at an given moment.
In fact, if you consider it like this, the DNS system can be considered to be a specialised NoSQL database. If a domain name is updated in DNS, it can take several days for the new data to propagate throughout the internet (depending on TTL configuration).
All this makes NoSQL a useful tool for data such as web site content, where it doesn't necessarily matter that a page isn't instantly up-to-date and consistent as long as it is reasonably up-to-date.
On the other hand, though, it does mean that it would be a very bad idea to use a NoSQL database for a system which does require consistency and up-to-date accuracy. An order processing system or a banking system would definitely not be a good place for your typical NoSQL database engine.
NOSQL is not about corrupted data. It is about viewing at your data from a different perspective. It provides some interesting leverage points, which enable for much easier scalability story, and often usability too. However, you have to look at your data differently, and program your application accordingly (eg, embrace consequences of BASE instead of ACID). Most NOSQL solutions prevent you from making decisions which could make your database hard to scale.
NOSQL is not for everything, but ACID is not the most important factor from end-user perspective. It is just us developers who cannot imagine world without ACID guarantees.
You are reading that correctly. If you have the AP of CAP, your data will be inconsistent. The more users, the more inconsistent. As having many users is the main reason why you scale, don't expect the inconsistencies to be rare. You've already seen data pop in and out of Facebook. Imagine what that would do to Amazon.com stock inventory figures if you left out ACID. Eventual consistency is merely a nice way to say that you don't have consistency but you should write and application where you don't need it. Some types of games and social network application does not need consistency. There are even line-of-business systems that don't need it, but those are quite rare. When your client calls when the wrong amount of money is on an account or when an angry poker player didn't get his winnings, the answer should not be that this is how your software was designed.
The right tool for the right job. If you have less than a few million transactions per second, you should use a consistent NewSQL or NoSQL database such as VoltDb (non concurrent Java applications) or Starcounter (concurrent .NET applications). There is just no need to give up ACID these days.