How does OrientDB find the appropriate master node when queries arrive, and how does it know the data is the latest?

When a client issues a query, how does OrientDB decide which master to choose? If it chooses the local master, what if the data in the local master is not the latest, i.e. syncing from another master is not complete when the query arrives? Should the client retry later?
Furthermore, if the same record is written in different places concurrently, how is the final value determined? By an LWW (last write wins) strategy, or something else?

The OrientDB documentation already explains the solutions to this topic; you should look into these two pages (3.0.x):
Concurrency
Replication
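To make the Concurrency page concrete: OrientDB versions every record (MVCC), and a write that races with another writer fails with an OConcurrentModificationException, which the client is expected to catch and retry. Here is a minimal sketch of that retry loop, assuming the legacy Document API with an already-open session db, a record id rid, and a numeric field named counter (the field name is illustrative):

import com.orientechnologies.orient.core.exception.OConcurrentModificationException;
import com.orientechnologies.orient.core.record.impl.ODocument;

// Inside a method with an open database session `db` and a record id `rid`.
final int MAX_RETRIES = 10;
for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
        ODocument doc = db.load(rid);          // read the record at its current version
        doc.field("counter", (Integer) doc.field("counter") + 1);
        doc.save();                            // fails if another writer bumped the version meanwhile
        break;                                 // write accepted
    } catch (OConcurrentModificationException e) {
        // another writer won the version check; loop around, reload and retry
    }
}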

Related

What exactly is preferSlave in Postgres

What does the setting targetServerType with value preferSlave in the PostgreSQL JDBC driver really mean?
The reason why I am asking this question is according to the documentation:
targetServerType = String
Allows opening connections to only servers with the required state, the
allowed values are any, master, slave, secondary, preferSlave and
preferSecondary. The master/slave distinction is currently done by
observing if the server allows writes. The value preferSecondary tries
to connect to a secondary if any are available, otherwise it falls
back to connecting also to the master.
Now I was trying this setup in Cloud Foundry, and when I look at the PostgreSQL metrics on the dashboard, I still see reads being done on the master. Hence my question: shouldn't reads be going to the slaves in this case?
And how does it affect performance in terms of reads/writes, especially in an application where writes are done with the target master and reads are done with the target preferSlave?
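As a concrete illustration of how this setting is typically used (host names, database, and credentials below are placeholders, not from the original setup), one connection is opened against the master for writes and another with preferSlave for reads:

import java.sql.Connection;
import java.sql.DriverManager;

// Writes must go to the node that accepts them.
String writeUrl = "jdbc:postgresql://pg1:5432,pg2:5432,pg3:5432/mydb?targetServerType=master";
// Reads prefer a standby but legitimately fall back to the master when no standby answers,
// which is one reason a dashboard can still show reads hitting the master.
String readUrl  = "jdbc:postgresql://pg1:5432,pg2:5432,pg3:5432/mydb?targetServerType=preferSlave";

Connection write = DriverManager.getConnection(writeUrl, "user", "secret");
Connection read  = DriverManager.getConnection(readUrl,  "user", "secret");

Also note that preferSlave only chooses a server at connection time; a pooled connection that fell back to the master keeps hitting the master until it is re-established.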

Why is MongoDB consistent but not available, and Cassandra available but not consistent?

Mongo
From this resource I understand why Mongo is not A (highly available), based on the statement below:
MongoDB supports a “single master” model. This means you have a master
node and a number of slave nodes. In case the master goes down, one of
the slaves is elected as master. This process happens automatically
but it takes time, usually 10-40 seconds. During this time of new
leader election, your replica set is down and cannot take writes.
Is it for the same reason that Mongo is said to be consistent (the write did not happen, so the system still returns the latest data) but not available (not available for writes)?
While re-election is happening and a write operation is pending, can a slave serve read operations? Also, does the user have to re-initiate the write operation once a new master is selected?
But from another angle, I do not understand why Mongo is called highly consistent.
As said on Where does mongodb stand in the CAP theorem?,
Mongo is consistent when all reads go to the primary by default.
But that is not true. If, under the master/slave model, all reads go to the primary, then what is the use of the slaves? It further says: "If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results." That means Mongo may not be consistent with master/slaves (provided I do not configure writes to go to all nodes before returning). It does not make sense to me to say Mongo is consistent if all reads and writes go to the primary; in that case every other DB (like Cassandra) would be consistent too, wouldn't it?
Cassandra
From this resource I understand why Cassandra is A (highly available), based on the statement below:
Cassandra supports a “multiple master” model. The loss of a single
node does not affect the ability of the cluster to take writes – so
you can achieve 100% uptime for writes
But I do not understand why Cassandra is not consistent. Is it because a node that was not available for a write (because the coordinator node could not reach it) is still available for reads, and can therefore return stale data?
Go through MongoDB, Cassandra, and RDBMS in CAP for a better understanding of the topic.
A brief definition of Consistency and availability.
Consistency simply means that when you write a piece of data into a distributed system, you should get that same data back when you read it from any node of the system.
Availability means the system should always be available for read/write operations.
Note: most systems are not only available or only consistent; they always offer a bit of both.
With the above definition let's see where MongoDB and Cassandra fall in CAP.
MongoDB
As you said, MongoDB is highly consistent when reads and writes go to the same node (the default case). Further, you can choose in MongoDB to read from secondary nodes instead of reading only from the leader/primary.
Now, when you try to read data from a secondary, your consistency will depend entirely on how you want to read the data:
You could ask for data that is at most, say, 5 seconds stale, or
You could require that the data for your read be confirmed by a majority of nodes.
In the same way, when your client writes to the Mongo leader, you can require that a write be considered successful only once the data is replicated to, or stored on, a majority of servers.
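Here is a minimal sketch of those knobs using the MongoDB Java driver; db is an existing MongoDatabase handle and the collection name "users" is illustrative. (Note that the server requires maxStaleness to be at least 90 seconds, so the "5 seconds" above is tighter than what MongoDB actually allows.)

import com.mongodb.ReadConcern;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.concurrent.TimeUnit;

MongoCollection<Document> users = db.getCollection("users")
        // allow reads from secondaries that are at most 90 s behind the primary
        .withReadPreference(ReadPreference.secondaryPreferred(90, TimeUnit.SECONDS))
        // reads reflect only data acknowledged by a majority of the replica set
        .withReadConcern(ReadConcern.MAJORITY)
        // writes succeed only once a majority of members have stored them
        .withWriteConcern(WriteConcern.MAJORITY);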
Clearly, from the above, we can say MongoDB can be highly consistent or eventually consistent based on how you read/write your data.
Now, what about availability? MongoDB is available most of the time, but when the leader goes down, MongoDB can't accept writes until it elects a new leader. Hence, it is not highly available.
So, MongoDB is categorized under CP.
What about Cassandra?
In Cassandra, there is no leader and any node can accept writes, so the Cassandra cluster stays available for writes and reads even if some nodes go down.
What about consistency in Cassandra?
As with MongoDB, Cassandra can be eventually consistent or highly consistent based on how you read/write data.
You can set consistency levels on your read/write operations, for example:
read/write data from one node
read/write data from a majority/quorum of nodes, and more
Let's say you use a consistency level of ONE for your read/write operations. Then your write is successful as soon as the data is written to one replica. Now, if your read request happens to go to another replica where the data is not updated yet (due to high network latency or any other reason), you will end up reading the old data.
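A minimal sketch of those consistency levels with the DataStax Java driver (4.x API assumed; the users table, its columns, and the values are made up):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

try (CqlSession session = CqlSession.builder().build()) {
    // CL = ONE: the write succeeds once a single replica has it -- fast, but a
    // later read at CL = ONE may hit a replica that has not seen the write yet.
    session.execute(SimpleStatement.builder("UPDATE users SET email = ? WHERE id = ?")
            .addPositionalValues("a@example.com", 42)
            .setConsistencyLevel(DefaultConsistencyLevel.ONE)
            .build());

    // CL = QUORUM on both reads and writes (so W + R > N) closes that gap.
    session.execute(SimpleStatement.builder("SELECT email FROM users WHERE id = ?")
            .addPositionalValues(42)
            .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
            .build());
}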
So, Cassandra is highly available but has configurable consistency levels and hence not always consistent.
In conclusion, in their default behavior, MongoDB falls under CP and Cassandra in AP.
Consistency in the CAP paradigm also includes "eventual consistency", which MongoDB supports. In contrast to ACID systems, a read in a CAP system is not guaranteed to return the most recent write.
In simple words, this means that your master could have an updated value, but if you read from a slave, it does not necessarily return the updated value; not having that updated value yet is okay by design.
The concept of eventual consistency is explained in an excellent answer here.
By architecture, Cassandra is supposed to be consistent; it offers a special implementation of eventual consistency called "tunable consistency", which means the client application may choose how to handle this; it even offers multi-data-centre consistency support at low levels!
Most row-wise inconsistency issues in Cassandra come from the fact that Cassandra uses client-side timestamps to determine which value is the most recent, not server-side ones, which may be a tad confusing to understand at first.
I hope this helps!
You only have to understand the "point in time": as you only write to the MongoDB master, even if a slave is not yet updated, it is consistent, because it has all the data generated up until the moment it last synced.
That is not true for Cassandra. As Cassandra uses a masterless model, there is no guarantee that the other nodes have all the data. At a certain time, a node can have some recent data while missing older data from nodes it has not yet synced with. Cassandra will only be consistent if you stop writes to all nodes and let them sync; as soon as the sync finishes, you have consistent data.

Integration of Kafka in Web Application

I have a Java-based web application which uses two backend Microsoft SQL Server databases (one is the live database, as it is transactional, and the other one is the reporting database). The lag between the transactional and reporting databases is around 30 minutes; incremental data is loaded using a SQL job which runs every 30 minutes and takes around 20-25 minutes to execute. This job executes an SSIS package, and using this package, data from the reporting database is further processed and stored in HDFS and HBase, which is eventually used for analytics.
Now, I want to reduce this lag, and to do this, I am thinking of implementing a messaging framework. After doing some research, I learned that Kafka could serve my purpose, since Kafka can also work as an ETL tool apart from being a messaging framework.
How should I proceed? Should I create topics mirroring the table structures in SQL Server and perform operations on those? Should I redirect my application to write any change to Kafka first and then to the transactional database? Please advise on the usage of Kafka considering the mentioned use case.
There are a couple of ways to do this that require minimal code, and then there's always the option to write your own code.
(Some coworkers just got finished looking at this, with SQL Server and Oracle, so I know a little about this here)
If you're using the Enterprise version of SQL Server, you could use Change Data Capture and Confluent Kafka Connect to read all the changes to the data. This (seems to) require both an Enterprise license and may involve some other additional cost (I was fuzzy on the details here; this may have been because we're using an older version of SQL Server or because we have many database servers).
If you're not / can't use the CDC stuff, Kafka Connect's JDBC support also has a mode where it polls the database for changes. This works best if your records have some kind of timestamp column, but usually this is the case.
A poll-only mode without CDC means you won't see every change: if you poll every 30 seconds and the record changes twice in that window, you won't get individual messages about each change, but one message reflecting both, if that makes sense. This is probably acceptable for your business domain, but something to be aware of.
Anyway, Kafka Connect is pretty cool - it will auto create Kafka topics for you based on your table names, including posting the Avro schemas to Schema Registry. (The topic names are knowable, so if you're in an environment with auto topic creation = false, well you can create the topics manually yourself based on the table names). Starting from no Kafka Connect knowledge it took me maybe 2 hours to figure out enough of the configuration to dump a large SQL Server database to Kafka.
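For reference, a Kafka Connect JDBC source configuration in timestamp mode looks roughly like the following (a hedged sketch: the connection details, table names, and the last_modified column are placeholders for whatever your schema actually has):

name=sqlserver-reporting-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver://<host>:1433;databaseName=<db>
connection.user=<user>
connection.password=<password>
# poll for rows whose timestamp column advanced since the last poll
mode=timestamp
timestamp.column.name=last_modified
table.whitelist=orders,line_items
# one topic per table, named <topic.prefix><table>
topic.prefix=sqlserver-
poll.interval.ms=30000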
I found additional documentation in a Github repository of a Confluent employee describing all this, with documentation of the settings, etc.
There's always the option of having your web app be a Kafka producer itself, and ignore the lower level database stuff. This may be a better solution, like if a request creates a number of records across the data store, but really it's one related event (an Order may spawn off some LineItem records in your relational database, but the downstream database only cares that an order was made).
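A minimal sketch of that producer approach (broker address, topic name "orders", and the payload are illustrative, not from the original application):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka1:9092");  // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "all");  // don't consider the event sent until fully acknowledged

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // one "order placed" event for the whole request, rather than one row-change per LineItem
    String orderId = "order-123";
    String orderJson = "{\"id\":\"order-123\",\"lineItems\":2}";
    producer.send(new ProducerRecord<>("orders", orderId, orderJson));
}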
On the consumer end (i.e. "next to" your other database) you could either use Kafka Connect again to pick up the changes, maybe even writing a custom plugin if required, or write your own Kafka consumer microservice to put the changes into the other database.

Is eventual consistency possible in the case of a master-master configuration?

Is eventual consistency possible in the case of a master-master configuration?
I.e. if there is more than one master accepting writes, then with eventual consistency we can always have conflicting writes.
For example: two masters writing two user profiles with the same email id.
In an eventually consistent system, both masters may successfully commit two user profiles with the same email id, which is actually an inconsistent system.
One: locks are taken before writing to the database or a cache.
Two: if the locks are taken at the same time, then there are two further ways to resolve the conflict.
Either an election is made between the two operations: one is elected, the other is rejected back to the client, and the new value is returned with the rejection.
Or the distributed servers allow you to write conflict-resolution code that is deployed on the server and executed when this happens.
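A minimal sketch of what such conflict-resolution code could look like, here a last-write-wins merge that every master can apply deterministically (Versioned and resolve are made-up names, not any particular product's API):

// A value together with the metadata needed to pick a winner.
record Versioned(String value, long timestampMillis, String writerId) {}

static Versioned resolve(Versioned a, Versioned b) {
    if (a.timestampMillis() != b.timestampMillis())
        return a.timestampMillis() > b.timestampMillis() ? a : b;  // last write wins
    // tie-break deterministically so every replica picks the same winner
    return a.writerId().compareTo(b.writerId()) > 0 ? a : b;
}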
Usually topologies don't work that way; they distribute the writes, and there is a master/slave concept within master-master configurations as well. :)
Speaking theoretically, of course.

Is a record replicated in a Postgres cluster?

We use Postgres as the database for our project and found that streaming replication, which works asynchronously, fits our needs. We have one server in write-only mode (MASTER) and others in read-only mode (SLAVES).
In most cases we send data to the MASTER and forget about it. But sometimes we want to be sure that the current chunk is synchronised between master and slave before continuing. For new rows (INSERTs) it is trivial: a script can check whether the new row has appeared with a simple SELECT query. But for UPDATEs it becomes a problem.
So, is there any simple and legitimate way to check whether a slave has caught up with the master or not? I know that every record has its own internal id, but I am not sure it will be the same between servers.
We use Postgres 9.2 with a configuration very similar to the one described in this great article.
So, finally I found the solution myself. Postgres stores all changes in special log files (the WAL) and then replicates those changes to the slave servers. At the same time, Postgres has a rich set of API functions that allow you to read the state of different internal processes. So in the end I use these queries:
-- On the MASTER server: current xlog location
SELECT
pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset;
-- On a SLAVE server: current offsets of received and applied changes
SELECT
pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive,
pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay;
Getting these values from the MASTER and SLAVE, I can decide how big the difference between the servers is. And when I need to be sure that both servers are in the same state, I can wait until the offset on the SLAVE becomes equal to or greater than the one on the MASTER.
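That waiting can be scripted; a minimal sketch in Java/JDBC (connection URLs and credentials are placeholders, and the 100 ms poll interval is arbitrary):

import java.sql.*;

public class WaitForReplay {
    // Run "SELECT pg_xlog_location_diff(<locFn>(), '0/0')" and return the offset in bytes.
    static long xlogOffset(Connection c, String locFn) throws SQLException {
        try (Statement st = c.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT pg_xlog_location_diff(" + locFn + "(), '0/0')")) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection master = DriverManager.getConnection("jdbc:postgresql://master:5432/db");
             Connection slave  = DriverManager.getConnection("jdbc:postgresql://slave:5432/db")) {
            long target = xlogOffset(master, "pg_current_xlog_location");
            // block until the slave has replayed everything the master had at that moment
            while (xlogOffset(slave, "pg_last_xlog_replay_location") < target) {
                Thread.sleep(100);
            }
        }
    }
}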
This query can probably also be handy if you have the same problem:
-- Check whether the current server is in recovery (i.e. is a SLAVE)
SELECT
pg_is_in_recovery() AS recovery;
-- false: the server is the MASTER
-- true:  the server is a SLAVE