How does OrientDB's master-less architecture handle conflicting writes? - orientdb

I read in the documentation here:
http://orientdb.com/docs/last/Distributed-Architecture.html
That OrientDB has a master-less architecture where replicas can handle both reads and writes. In the case of two clients writing concurrently to two replicas, how does the database handle conflict resolution between the two versions?
For instance in Riak KV they use vector clocks (or dotted version vectors now) to detect conflicts which either get punted to the user to handle the merge, or a default policy can be set in place to pick something like last-write-wins.
I'm wondering how OrientDB handles this.

Here is the conflict resolution strategy used by OrientDB.
In case of an even number of servers, or when databases are not aligned, OrientDB uses a Conflict Resolution Strategy chain.
This default chain is defined as a global setting (distributed.conflictResolverRepairerChain):
-Ddistributed.conflictResolverRepairerChain=majority,content,version
The Conflict Resolution Strategy implementations are called in a chain, following the declaration order, until a winner is selected. In the default configuration (above):
first, it checks whether there is a strict majority for the record in terms of record versions. If the majority exists, the winner is selected
if no strict majority was found, the record content is analyzed. If a majority is reached by finding records with different versions but equal content, then that record is the winner, using the higher version between them
if no majority has been found with the content, then the higher version wins (supposing a higher version means the most up-to-date record)
OrientDB Enterprise Edition supports the additional Data Center Conflict Resolution (dc).
At the end of the chain, if no winner is found, the records are untouched and only a manual intervention can decide who is the winner. In this case a WARNING message is displayed in the console with text Auto repair cannot find a winner for record and the following groups of contents:[records].
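The chain described above can be pictured with a small sketch. This is not OrientDB's code: the function names, the (version, content) tuples and the tie handling are assumptions made for illustration. In particular, a copy here counts toward the "majority" step only if both its version and its content match, and the "version" step gives up when the highest version appears with different contents.

```python
from collections import Counter


def majority(copies):
    """Winner if a strict majority of servers hold the identical (version, content)."""
    copy, votes = Counter(copies).most_common(1)[0]
    return copy if votes * 2 > len(copies) else None


def content(copies):
    """Winner if a strict majority share the same content; keep the highest version."""
    body, votes = Counter(body for _, body in copies).most_common(1)[0]
    if votes * 2 > len(copies):
        return (max(v for v, b in copies if b == body), body)
    return None


def version(copies):
    """Highest version wins, unless that version exists with different contents."""
    top = max(v for v, _ in copies)
    bodies = {b for v, b in copies if v == top}
    return (top, bodies.pop()) if len(bodies) == 1 else None


# Hypothetical registry keyed by the names used in the global setting.
RESOLVERS = {"majority": majority, "content": content, "version": version}


def resolve(copies, chain="majority,content,version"):
    """Run each resolver in declaration order until one selects a winner."""
    for name in chain.split(","):
        winner = RESOLVERS[name](copies)
        if winner is not None:
            return winner
    return None  # no winner: the records stay untouched, manual repair needed
```

With three servers reporting `[(3, "a"), (4, "a"), (2, "b")]`, no version has a majority, but the content "a" does, so the sketch picks `(4, "a")`.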

Pretty similar to old-school Dynamo, only they are super lazy about it. Essentially they just use Hazelcast as a DHT with a low hop count and to keep a partial ordering of events. (Hazelcast has something similar to version vectors in each entry of its collections, but I think it uses a timestamp rather than a logical clock for some reason.) Then they slap on a sloppy quorum for read repair: in a query, the successor/coordinator gets the highest clock, fetches the value from that peer, and applies it to itself and everyone else. For writes they usually just need n-1 (n is the replication factor) to do a write; again, it has to do with the logical clock.
Now as an actual DB, apart from the total lack of effort on the replication, it's pretty excellent, and I have to hand it to them that they make sound technical decisions.
Before you say "oh, you couldn't do it": I have done it. I have implemented Dynamo's Chord variant, Riak's Plumtree gossip protocol, DVV, and a shit load of CRDTs, mind you mostly from scratch, only using RPC frameworks, message passing, or an event loop when I am in C.
Don't call me an old man because I am 20.

Related

Atomicity of small PCIE TLP writes

Are there any guarantees about how card to host writes from a PCIe device targeting regular memory are implemented from a software process' perspective, where a single TLP write is fully contained within a single CPU cache-line?
I'm wondering about a case where my device may write some number of words of data followed by a byte to indicate that the structure is now valid (for example an event completion), for example:
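#include <cstdint>

// SYSTEM_CACHE_LINE_SIZE is assumed to be defined elsewhere (e.g. 64 on x86-64).
struct alignas(SYSTEM_CACHE_LINE_SIZE) PCIE_COMPLETION_T {
    uint64_t data_a;
    uint64_t data_b;
    uint64_t data_c;
    uint64_t data_d;
    uint8_t valid;  // highest address; set to 1 by the device once the data is valid
};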
Can I use a single TLP to write this structure, such that when software sees the valid member change to 1 (having been previously cleared to zero by software), the other data members will also reflect the values that I had written and not a previous value?
Currently I'm performing 2 writes, first writing the data and secondly marking it as valid, which doesn't have any apparent race conditions but does of course add unwanted overhead.
The most relevant question I can see on this site seems to be Are writes on the PCIe bus atomic? although this appears to relate to the relative ordering of TLPs.
Perusing the PCIe 3.0 specification, I didn't find anything that seemed to explicitly cover my concerns, and I don't think that I need AtomicOps particularly. Given that I'm only concerned about interactions with x86-64 systems, I also dug through the Intel architecture guide but came away no clearer.
Instinctively it seems that it should be possible for such a write to be perceived atomically -- especially as it is said to be a transaction -- but equally I can't find much in the way of documentation explicitly confirming that view (nor am I quite sure what I'd need to look at, probably the CPU vendor?). I also wonder if such a scheme can be extended over multiple cachelines -- i.e. if the valid member sits on a second cacheline written from the same TLP transaction, can I be assured that the first will be perceived no later than the second?
The write may be broken into smaller units, as small as dwords, but if it is, they must be observed in increasing address order.
PCIe revision 4, section 2.4.3:
If a single write transaction containing multiple DWs and the Relaxed
Ordering bit Clear is accepted by a Completer, the observed ordering
of the updates to locations within the Completer's data buffer must be
in increasing address order. This semantic is required in case a PCI
or PCI-X Bridge along the path combines multiple write transactions
into the single one. However, the observed granularity of the updates
to the Completer's data buffer is outside the scope of this
specification.
While not required by this specification, it is
strongly recommended that host platforms guarantee that when a PCI
Express write updates host memory, the update granularity observed by
a host CPU will not be smaller than a DW.
As an example of update
ordering and granularity, if a Requester writes a QW to host memory,
in some cases a host CPU reading that QW from host memory could
observe the first DW updated and the second DW containing the old
value.
I don't have a copy of revision 3, but I suspect this language is in that revision as well. To help you find it, Section 2.4 is "Transaction Ordering" and section 2.4.3 is "Update Ordering and Granularity Provided by a Write Transaction".

Consistency for read from distributed databases

I have a set of databases distributed across multiple locations in the network and, for example, one client that needs to store some data in those databases.
I need to make sure my data will always be stored.
I can't organize a replica set with sync/async replication, as it would force me to connect to one master, which is a single point of failure, so I send data from the client to all the databases I know. Obviously, one database can fail to store, so I am relying on the writes to the other databases. In the end I get different data sets stored in the DBs, though these sets overlap. (Ex. DB1 -> [1, 2, 3], DB2 -> [1, 3], DB3 -> [2,3,4])
How can I get consistent data when reading from these DBs? What techniques should I apply on the client that writes data and on the client that reads, to be able to merge the data sets successfully (so that the reader gets [1,2,3,4])?
What you're asking is basically an entire branch of computer science. It is very much a non-trivial problem and you will find that a surprising number of things are impossible.
Also note that simply saying "consistent" data is not a sufficient definition. There are all sorts of levels of consistency (read-your-own-writes, reads-follow-writes, monotonic read, linearizable, causal, etc.) I think you likely mean (in a very loose sense): consistency similar to what you get when you use just one database.
To answer your question directly, you want to decide on a read quorum size and a write quorum size. These sizes must be selected such that reads and writes overlap by at least one database instance (with N databases, a read quorum of R and a write quorum of W, that means R + W > N). If you want to optimize for write latency, use a smaller write quorum; do the opposite if you want to optimize for read latency.
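As a rough sketch of the idea, using the data sets from the question: the function names are made up for illustration, and the union merge assumes items are only ever added, never deleted or overwritten.

```python
def quorums_overlap(n, r, w):
    """True when any read of r databases must share one with any write of w."""
    return r + w > n


def merged_read(replica_contents):
    """Union the item sets returned by the databases that were read."""
    merged = set()
    for items in replica_contents:
        merged |= set(items)
    return sorted(merged)


db1, db2, db3 = [1, 2, 3], [1, 3], [2, 3, 4]

everything = merged_read([db1, db2, db3])  # reads all three databases
partial = merged_read([db1, db2])          # misses 4, which reached only DB3
```

Item 4 was stored on a single database, so a read of two databases is only guaranteed to see it if writes are acknowledged by at least two (R = 2, W = 2, N = 3).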
A more detailed exposition of overlapping read/write quorums can be found in Weighted Voting for Replicated Data. This is considered a seminal work in the field of replication.
Also be careful around the behavior of your overlapping quorums when adding or removing a database instance. It sounds like you have a relatively static topology, but if that is not the case, then an entirely different set of choices need to be made.
Lastly - and here's the real kick in the teeth - what I have described doesn't actually give you consistency (by any definition) in some cases (I like Daniel Abadi's explanation of when and why), but for many systems it gives you good enough consistency. It's up to you to decide exactly what level of consistency you need.
There is two-way/three-way replication software that does not require a "master".
You can also use transaction log based replications.
What and how you can use will depend on the database product you use.
HTH

Is NoSQL 100% ACID 100% of the time?

Quoting: http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/
There have been various attempts to
overcome SQL’s performance and
scalability problems, including the
buzzworthy NoSQL movement that burst
onto the scene a couple of years ago.
However, it was quickly discovered
that while NoSQL might be faster and
scale better, it did so at the expense
of ACID consistency.
Wait - am I reading that wrongly?
Does it mean that if I use NoSQL, we can expect transactions to be corrupted (albeit I daresay at a very low percentage)?
It's actually true and yet also a bit false. It's not about corruption; it's about seeing something different during a (limited) period.
The real thing here is the CAP theorem, which simply states that you can only choose two of the following three:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss)
Traditional SQL systems choose to drop "Partition tolerance", whereas many (not all) of the NoSQL systems choose to drop "Consistency".
More precisely: they drop "Strong Consistency" and select a more relaxed consistency model like "Eventual Consistency".
So the data will be consistent when viewed from various perspectives, just not right away.
NoSQL solutions are usually designed to overcome SQL's scale limitations. Those scale limitations are explained by the CAP theorem. Understanding CAP is key to understanding why NoSQL systems tend to drop support for ACID.
So let me explain CAP in purely intuitive terms. First, what C, A and P mean:
Consistency: From the standpoint of an external observer, each "transaction" either fully completed or is fully rolled back. For example, when making an amazon purchase the purchase confirmation, order status update, inventory reduction etc should all appear 'in sync' regardless of the internal partitioning into sub-systems
Availability: 100% of requests are completed successfully.
Partition Tolerance: Any given request can be completed even if a subset of nodes in the system are unavailable.
What do these imply from a system design standpoint? What is the tension which CAP defines?
To achieve P, we need replicas. Lots of 'em! The more replicas we keep, the better the chances are that any piece of data we need will be available even if some nodes are offline. For absolute "P" we should replicate every single data item to every node in the system. (Obviously in real life we compromise on 2, 3, etc.)
To achieve A, we need no single point of failure. That means that "primary/secondary" or "master/slave" replication configurations go out the window since the master/primary is a single point of failure. We need to go with multiple master configurations. To achieve absolute "A", any single replica must be able to handle reads and writes independently of the other replicas. (in reality we compromise on async, queue based, quorums, etc)
To achieve C, we need a "single version of truth" in the system. Meaning that if I write to node A and then immediately read back from node B, node B should return the up-to-date value. Obviously this can't happen in a truly distributed multi-master system.
So, what is the "correct" solution to the problem? The details really depend on your requirements, but the general approach is to loosen up some of the constraints, and to compromise on the others.
For example, to achieve a "full write consistency" guarantee in a system with n replicas, the number of replicas read (r) plus the number of replicas written (w) must be greater than n: r + w > n. This is easy to explain with an example: if I store each item on 3 replicas, then I have a few options to guarantee consistency:
A) I can write the item to all 3 replicas and then read from any one of the 3 and be confident I'm getting the latest version
B) I can write the item to one of the replicas, and then read all 3 replicas and choose the latest of the 3 results
C) I can write to 2 out of the 3 replicas, and read from 2 out of the 3 replicas, and I am guaranteed that I'll have the latest version on one of them.
Of course, the rule above assumes that no nodes have gone down in the meantime. To ensure P + C you will need to be even more paranoid...
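The overlap argument can be checked by brute force. The sketch below (the function name is just for illustration) tries every possible set of r replicas read against every possible set of w replicas written and reports whether they always share at least one replica, which holds exactly when r + w > n.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every choice of r read replicas intersects every choice of w written ones."""
    replicas = range(n)
    return all(
        set(reads) & set(writes)
        for reads in combinations(replicas, r)
        for writes in combinations(replicas, w)
    )
```

For n = 3, the combinations (w=3, r=1), (w=1, r=3) and (w=2, r=2) always overlap, while (w=1, r=2) does not: the single written replica can be the one the read skipped.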
There are also a near-infinite number of 'implementation' hacks - for example the storage layer might fail the call if it can't write to a minimal quorum, but might continue to propagate the updates to additional nodes even after returning success. Or, it might loosen the semantic guarantees and push the responsibility of merging versioning conflicts up to the business layer (this is what Amazon's Dynamo did).
Different subsets of data can have different guarantees (ie single point of failure might be OK for critical data, or it might be OK to block on your write request until the minimal # of write replicas have successfully written the new version)
The patterns for solving the 90% case already exist, but each NoSQL solution applies them in different configurations. The patterns are things like partitioning (stable/hash-based or variable/lookup-based), redundancy and replication, in memory-caches, distributed algorithms such as map/reduce.
When you drill down into those patterns, the underlying algorithms are also fairly universal: version vectors, Merkle trees, DHTs, gossip protocols, etc.
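To give a flavour of one of these building blocks, here is a minimal version vector sketch. The representation (a dict from node id to counter) and the function names are assumptions for illustration, not any particular database's API.

```python
def compare(a, b):
    """Order two version vectors given as dicts of node id -> counter."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither saw the other's update: a conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a, b):
    """Vector that dominates both inputs, used after a conflict is resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Two writes accepted independently by nodes "x" and "y" compare as "concurrent", which is the signal that something (the application, or a policy such as last-write-wins) has to merge them.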
It does not mean that transactions will be corrupted. In fact, many NoSQL systems do not use transactions at all! Some NoSQL systems may sometimes lose records (e.g. MongoDB when you do "fire and forget" inserts rather than "safe" ones), but often this is a design choice, not something you're stuck with.
If you need true transactional semantics (perhaps you are building a bank accounting application), use a database that supports them.
First, asking if NoSQL is 100% ACID 100% of the time is a bit of a meaningless question. It's like asking "Are dogs 100% protective 100% of the time?" There are some dogs that are protective (or can be trained to be), such as German Shepherds or Doberman Pinschers. There are other dogs that couldn't care less about protecting anyone.
NoSQL is the label of a movement, and not a specific technology. There are several different types of NoSQL databases. There are document stores, such as MongoDB. There are graph databases such as Neo4j. There are key-value stores such as Cassandra.
Each of these serves a different purpose. I've worked with a proprietary database that could be classified as a NoSQL database; it's not 100% ACID, but it doesn't need to be. It's a write-once, read-many database. I think it gets built once a quarter (or once a month?) and then is read thousands of times a day.
There are many different NoSQL store types and implementations. Each of them can resolve the trade-offs between consistency and performance differently. The best you can get is a tunable framework.
Also, the sentence "it was quickly discovered" from your citation is plainly stupid: this is no surprising discovery but a proven fact with deep theoretical roots.
In general, it's not that any given update would fail to save or get corrupted -- those would obviously be a very big issue for any database.
Where they fail on ACID is in data retrieval.
Consider a NoSQL DB which is replicated across numerous servers to allow high-speed access for a busy site.
And let's say the site owners update an article on the site with some new information.
In a typical NoSQL database in this scenario, the update would immediately only affect one of the nodes. Any queries made to the site on the other nodes would not reflect the change right away. In fact, as the data is replicated across the site, different users may be given different content despite querying at the same time. The data could take some time to propagate across all the nodes.
Conversely, in a transactional ACID compliant SQL database, the DB would have to be sure that all nodes had completed the update before any of them could be allowed to serve the new data.
Dropping that requirement allows the NoSQL site to retain high performance and page caching, by sacrificing the guarantee that any given page will be absolutely up to date at any given moment.
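A toy simulation of that propagation delay, assuming each node keeps a (timestamp, value) pair per key and that the newest timestamp wins when nodes exchange data (one common policy, not the only one); all names here are illustrative.

```python
def write(replica, key, value, timestamp):
    """Apply a write only if it is newer than what the replica already holds."""
    current = replica.get(key)
    if current is None or timestamp > current[0]:
        replica[key] = (timestamp, value)


def sync(source, target):
    """Push every entry from source to target; the newer timestamp wins."""
    for key, (timestamp, value) in source.items():
        write(target, key, value, timestamp)


def visible(replicas, key):
    """What a reader would see on each replica right now."""
    return [replica[key][1] for replica in replicas]


replicas = [{}, {}, {}]
for replica in replicas:
    write(replica, "article", "old text", timestamp=1)

# The update initially lands on a single node.
write(replicas[0], "article", "new text", timestamp=2)
before = visible(replicas, "article")

# Background replication eventually carries it to the other nodes.
sync(replicas[0], replicas[1])
sync(replicas[1], replicas[2])
after = visible(replicas, "article")
```

Between the write and the last sync, two of the three nodes still serve the old article; afterwards all three agree.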
In fact, if you consider it like this, the DNS system can be considered to be a specialised NoSQL database. If a domain name is updated in DNS, it can take several days for the new data to propagate throughout the internet (depending on TTL configuration).
All this makes NoSQL a useful tool for data such as web site content, where it doesn't necessarily matter that a page isn't instantly up-to-date and consistent as long as it is reasonably up-to-date.
On the other hand, though, it does mean that it would be a very bad idea to use a NoSQL database for a system which does require consistency and up-to-date accuracy. An order processing system or a banking system would definitely not be a good place for your typical NoSQL database engine.
NoSQL is not about corrupted data. It is about looking at your data from a different perspective. It provides some interesting leverage points, which enable a much easier scalability story, and often better usability too. However, you have to look at your data differently, and program your application accordingly (e.g. embrace the consequences of BASE instead of ACID). Most NoSQL solutions prevent you from making decisions which could make your database hard to scale.
NoSQL is not for everything, but ACID is not the most important factor from the end-user's perspective. It is just us developers who cannot imagine a world without ACID guarantees.
You are reading that correctly. If you have the AP of CAP, your data will be inconsistent. The more users, the more inconsistent. As having many users is the main reason why you scale, don't expect the inconsistencies to be rare. You've already seen data pop in and out of Facebook. Imagine what that would do to Amazon.com stock inventory figures if you left out ACID. Eventual consistency is merely a nice way to say that you don't have consistency but should write an application where you don't need it. Some types of games and social network applications do not need consistency. There are even line-of-business systems that don't need it, but those are quite rare. When your client calls because the wrong amount of money is in an account, or an angry poker player didn't get his winnings, the answer should not be that this is how your software was designed.
The right tool for the right job. If you have fewer than a few million transactions per second, you should use a consistent NewSQL or NoSQL database such as VoltDB (non-concurrent Java applications) or Starcounter (concurrent .NET applications). There is just no need to give up ACID these days.

NoSQL and eventual consistency - real world examples [closed]

I'm looking for good examples of NoSQL apps that portray how to work with the lack of transactionality as we know it in relational databases. I'm mostly interested in write-intensive code, as for mostly read-only code this is a much easier task. I've read a number of things about NoSQL in general, about the CAP theorem, eventual consistency etc. However, those things tend to concentrate on the database architecture for its own sake and not on the design patterns to use with it. I do understand that it's impossible to achieve full transactionality within a distributed app. This is exactly why I would like to understand where and how requirements should be lowered in order to make the task feasible.
EDIT:
It's not that eventual consistency is my goal on its own. For the time being I don't really see how to use NoSQL for certain things that are write-intensive. Say I have a simplistic auction system, where there are offers. In theory the first person to accept an offer wins. In practice I would like at least to guarantee that there is only a single winner and that people get their results in the same request. It's probably not feasible. But how to solve it in practice? Maybe some requests could take longer than usual, because something went wrong. Maybe some requests should be automatically refreshed. It's just an example.
Let me explain CAP in purely intuitive terms. First, what C, A and P mean:
Consistency: From the standpoint of an external observer, each "transaction" either fully completed or is fully rolled back. For example, when making an Amazon purchase, the purchase confirmation, order status update, inventory reduction etc. should all appear 'in sync' regardless of the internal partitioning into sub-systems
Availability: 100% of requests are completed successfully.
Partition Tolerance: Any given request can be completed even if a subset of nodes in the system are unavailable.
What do these imply from a system design standpoint? What is the tension which CAP defines?
To achieve P, we need replicas. Lots of 'em! The more replicas we keep, the better the chances are that any piece of data we need will be available even if some nodes are offline. For absolute "P" we should replicate every single data item to every node in the system. (Obviously in real life we compromise on 2, 3, etc.)
To achieve A, we need no single point of failure. That means that "primary/secondary" or "master/slave" replication configurations go out the window since the master/primary is a single point of failure. We need to go with multiple master configurations. To achieve absolute "A", any single replica must be able to handle reads and writes independently of the other replicas. (in reality we compromise on async, queue based, quorums, etc)
To achieve C, we need a "single version of truth" in the system. Meaning that if I write to node A and then immediately read back from node B, node B should return the up-to-date value. Obviously this can't happen in a truly distributed multi-master system.
So, what is the solution to your question? Probably to loosen up some of the constraints, and to compromise on the others.
For example, to achieve a "full write consistency" guarantee in a system with n replicas, the number of replicas read (r) plus the number of replicas written (w) must be greater than n: r + w > n. This is easy to explain with an example: if I store each item on 3 replicas, then I have a few options to guarantee consistency:
A) I can write the item to all 3 replicas and then read from any one of the 3 and be confident I'm getting the latest version
B) I can write item to one of the replicas, and then read all 3 replicas and choose the last of the 3 results
C) I can write to 2 out of the 3 replicas, and read from 2 out of the 3 replicas, and I am guaranteed that I'll have the latest version on one of them.
Of course, the rule above assumes that no nodes have gone down in the meantime. To ensure P + C you will need to be even more paranoid...
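Options A to C can be simulated with versioned values. This is a sketch under simplifying assumptions: every replica starts with version 1, a write stores version 2 on the chosen replicas, a read returns the highest version it sees, and no node fails in between. The helper names are hypothetical.

```python
from itertools import combinations


def write_quorum(replicas, targets, version, value):
    """Store (version, value) on the chosen replicas only."""
    for i in targets:
        replicas[i] = (version, value)


def read_quorum(replicas, targets):
    """Read the chosen replicas and keep the result with the highest version."""
    return max(replicas[i] for i in targets)


def latest_always_read(n, r, w):
    """Try every write set and read set; True if the new value is always seen."""
    for write_set in combinations(range(n), w):
        replicas = [(1, "old")] * n
        write_quorum(replicas, write_set, 2, "new")
        for read_set in combinations(range(n), r):
            if read_quorum(replicas, read_set) != (2, "new"):
                return False
    return True
```

With 3 replicas, writing 3 and reading 1, writing 1 and reading 3, or writing 2 and reading 2 always returns the new value; writing 1 and reading 2 can return the old one.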
There are also a near-infinite number of 'implementation' hacks - for example the storage layer might fail the call if it can't write to a minimal quorum, but might continue to propagate the updates to additional nodes even after returning success. Or, it might loosen the semantic guarantees and push the responsibility of merging versioning conflicts up to the business layer (this is what Amazon's Dynamo did).
Different subsets of data can have different guarantees (ie single point of failure might be OK for critical data, or it might be OK to block on your write request until the minimal # of write replicas have successfully written the new version)
There is more to talk about, but let me know if this was helpful and if you have any followup questions, we can continue from there...
[Continued...]
The patterns for solving the 90% case already exist, but each NoSQL solution applies them in different configurations. The patterns are things like partitioning (stable/hash-based or variable/lookup-based), redundancy and replication, in memory-caches, distributed algorithms such as map/reduce.
When you drill down into those patterns, the underlying algorithms are also fairly universal: version vectors, Merkle trees, DHTs, gossip protocols, etc.
The same can be said for most SQL solutions: they all implement indexes (which use b-trees under the hood), have relatively smart query optimizers which are based on known CS algorithms, all use in-memory caching to reduce disk IO. The differences are mostly in implementation, management experience, toolset support, etc
Unfortunately I can't point to some central repository of wisdom which contains all you will need to know. In general, start by asking yourself what NoSQL characteristics you really need. That will guide you to choosing between a key-value store, a document store or a column store (those are the 3 main categories of NoSQL offerings). From there you can start comparing the various implementations.
[Updated again 4/14/2011]
OK here's the part which actually justifies the bounty..
I just found the following 120 page whitepaper on NoSQL systems. This is very close to being the "NoSQL bible" which I told you earlier doesn't exist. Read it and rejoice :-)
NoSQL Databases, Christof Strauch
There are many applications where eventual consistency is fine. Consider Twitter as a rather famous example. There's no reason that your "tweets" have to go out to all of your "followers" instantaneously. If it takes several seconds (or even minutes?) for your "tweet" to be distributed, who would even notice?
If you want non-web examples, any store-and-forward service (like email and USENET) relies on eventual consistency.
It's not impossible to get transactions or consistency in NoSQL. A lot of people define NoSQL in terms of a lack of transactions or as requiring eventual consistency at best, but this isn't accurate. There are transactional NoSQL products out there - consider tuple spaces, for example - that scale very well even while providing app consistency.

why does memcached not support "multi set"

Can anyone explain why the memcached folks decided to support multi get but not multi set?
By multi I mean an operation involving more than one key (see the protocol at http://code.google.com/p/memcached/wiki/NewCommands).
So you can get multiple keys in one shot (the basic advantage is the standard saving you get from fewer round trips), but why can you not do bulk sets?
My theory is that clients were expected to do fewer sets, and to do them individually (e.g. on a cache read miss). But I still do not see how multi-set really conflicts with the general philosophy of memcached.
I looked at the client features at http://code.google.com/p/memcached/wiki/NewCommonFeatures and it seems that some clients potentially do support "Multi-Set" (why only in binary protocol?). I am using Java spy memcached, btw.
It's not supported in the text protocol because it'd be very, very complicated to express, no clients would support it, and it would provide very little that you can't already do from the text protocol.
It's supported in the binary protocol because it's a trivial use case of binary operations.
spymemcached supports it implicitly -- just do a bunch of sets and magic happens:
http://dustin.github.com/2009/09/23/spymemcached-optimizations.html
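To illustrate why a dedicated multi-set adds little in the text protocol, several ordinary set commands can simply be concatenated and sent in one write. The sketch below only builds the bytes; the helper names are made up, and it is not a description of what spymemcached does internally.

```python
def build_set(key, value, flags=0, exptime=0, noreply=False):
    """Encode one text-protocol set command for a bytes value."""
    header = f"set {key} {flags} {exptime} {len(value)}"
    if noreply:
        header += " noreply"
    return header.encode("ascii") + b"\r\n" + value + b"\r\n"


def build_pipelined_sets(items, **options):
    """Concatenate one set command per key so they can go out in a single send."""
    return b"".join(build_set(key, value, **options) for key, value in items.items())


payload = build_pipelined_sets({"a": b"1", "b": b"22"}, noreply=True)
```

The resulting buffer could be written to a memcached socket in one call; without noreply, the client would then read one STORED (or error) line per command.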
I don't know a lot about memcache internals, but I assume writes have to be blocking, atomic operations. I assume that if multiple set operations could be batched, you could block all reads for a long time (or risk a get occurring while only half of a write had been applied). Forcing writes to be done individually allows them to be interleaved fairly with gets.
I would imagine that the restriction against using multi sets is to avoid collisions when writing cached values to the memcache.
As an object cache, I can't foresee an example of when you would need transactional type writes. This use case seems less suited for a caching layer, but better suited for the underlying database.
If sets come in interleaved from different clients, it is most likely the case that for one key, the last one wins, or is at least close enough, until the cache is invalidated and a newer value is written.
As Gian mentions, there don't seem to be any good reasons to block reads from the cache while several or many writes to the cache happen.