Spark and sharded JDBC datasources - postgresql

I have a production sharded cluster of PostgreSQL machines where sharding is handled at the application layer. (Created records are assigned a system generated unique identifier - not a UUID - which includes a 0-255 value indicating the shard # that record lives on.) This cluster is replicated in RDS so large read queries can be executed against it.
I'm trying to figure out the best option for accessing this data within Spark.
I was thinking of creating a small dataset (a text file) that contains only the shard names, i.e., integration-shard-0, integration-shard-1, etc. Then I'd partition this dataset across the Spark cluster so ideally each worker would only have a single shard name (but I'd have to handle cases where a worker has more than one shard). Then when I create a JdbcRDD I'd actually create 1..n such RDDs, one for each shard name residing on that worker, and merge the resulting RDDs together.
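The plan above can be sketched in plain Python, without any Spark dependency, to show how shard names would be spread over workers. The `integration-shard-N` names follow the question; the worker count and the round-robin policy are assumptions made only for illustration.

```python
from collections import defaultdict


def shard_names(count, prefix="integration-shard-"):
    """Build shard names: integration-shard-0 ... integration-shard-(count-1)."""
    return [f"{prefix}{i}" for i in range(count)]


def assign_shards(shards, num_workers):
    """Round-robin shards over workers (assumed policy).

    A worker may end up with more than one shard, which is the case
    the plan says it has to handle.
    """
    assignment = defaultdict(list)
    for index, shard in enumerate(shards):
        assignment[index % num_workers].append(shard)
    return dict(assignment)


if __name__ == "__main__":
    # 8 shards on 3 hypothetical workers: some workers get several shards.
    for worker, names in assign_shards(shard_names(8), 3).items():
        print(worker, names)
```

Each worker would then open one JDBC query per shard name in its list and merge the results.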
This seems like it would work but before I go down this path I wanted to see how other people have solved similar problems.
(I also have a separate Cassandra cluster available as second datacenter for analytic processing which I will be accessing with Spark.)

I ended up writing my own ShardedJdbcRDD for which the preliminary version can be found at the following gist:
When I wrote it, this version supported only Scala, not Java. (I may update it.) It also lacks the sub-partitioning scheme that JdbcRDD has, for which I will eventually add an overloaded constructor. Basically, ShardedJdbcRDD queries your RDBMS shards across the cluster; if you have at least as many Spark slaves as shards, each slave gets one shard for its partition.
A future overloaded constructor will support the same range query that JdbcRDD has so if there are more Spark slaves in the cluster than shards the data can be broken up into smaller sets through range queries.
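A rough sketch of that range-query idea, in Python rather than Scala: the arithmetic is intended to mirror the way JdbcRDD divides an inclusive [lowerBound, upperBound] range among partitions, but `split_range` and `shard_partitions` are hypothetical helper names and are not taken from Spark or from the gist.

```python
def split_range(lower, upper, num_partitions):
    """Split the inclusive range [lower, upper] into contiguous sub-ranges."""
    length = upper - lower + 1
    bounds = []
    for i in range(num_partitions):
        start = lower + (i * length) // num_partitions
        end = lower + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


def shard_partitions(shards, lower, upper, ranges_per_shard):
    """One (shard, start, end) tuple per partition: every shard x every sub-range."""
    return [
        (shard, start, end)
        for shard in shards
        for start, end in split_range(lower, upper, ranges_per_shard)
    ]


if __name__ == "__main__":
    # 2 shards, each split into 4 range queries -> 8 partitions.
    for part in shard_partitions(["shard-0", "shard-1"], 1, 100, 4):
        print(part)
```

Each tuple would become one partition whose query binds `start` and `end` to the two placeholders of a range predicate, executed against the named shard.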


Sharding with replication

I have a multi-tenant database with 3 tables (store, products, purchases) on 5 server nodes. Suppose I have 3 stores in my store table and I am going to shard it by storeId.
I need the data for all shards (1, 2, 3) available on nodes 1 and 2. But node 3 would contain only the shard for store #1, node 4 only the shard for store #2, and node 5 only the shard for store #3. It is like sharding with 3 replicas.
Is this possible at all? Which database engines can be used for this purpose (preferably SQL databases)? Do you have any experience with this?
I have a feeling you have not adequately explained why you are trying this strange topology.
Anyway, I will point out several things relating to MySQL/MariaDB.
A Galera cluster already embodies multiple nodes (minimum of 3), but does not directly support "sharding". You can have multiple Galera clusters, one per "shard".
As with my comment about Galera, other forms of MySQL/MariaDB can have replication between nodes of each shard.
If you are thinking of having a server with all data, but replicating only parts of it to readonly Replicas, there are settings for that: replicate_do_db and replicate_ignore_db. I emphasize "readonly" because changes to these pseudo-shards cannot easily be sent back to the Primary server. (However, see "multi-source replication".)
Sharding is used primarily when there is simply too much traffic to handle on a single server. Are you saying that the 3 tenants cannot coexist because of excessive writes? (Excessive reads can be handled by replication.)
A tentative solution:
Have all data on all servers. Use the same Galera cluster for all nodes.
Advantage: When "most" or all of the network is working all data is quickly replicated bidirectionally.
Potential disadvantage: If half or more of the nodes go down, you have to manually step in to get the cluster going again.
Likely solution for the 'disadvantage': "Weight" the nodes differently. Give a high weight to the 3 in HQ; give a much smaller (but non-zero) weight to each branch node. That way, most of the branches could go offline without losing the system as a whole.
But... I fear that an offline branch node will automatically become readonly.
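To see why weighting helps, here is a simplified sketch of a weighted-majority check. It assumes the surviving group keeps quorum only when it holds strictly more than half of the total weight; Galera's real quorum calculation has more cases (for example, nodes that leave gracefully), and the node names and weights below are made up.

```python
def has_quorum(weights, online):
    """True if the online nodes hold a strict majority of the total weight.

    weights: dict mapping node name -> weight (hypothetical values)
    online:  iterable of node names that can still see each other
    """
    total = sum(weights.values())
    alive = sum(weights[node] for node in online)
    return 2 * alive > total


if __name__ == "__main__":
    hq = {"hq1": 10, "hq2": 10, "hq3": 10}
    branches = {f"branch{i}": 1 for i in range(1, 6)}
    weighted = {**hq, **branches}
    equal = {name: 1 for name in weighted}

    # All five branches offline, only HQ left.
    print(has_quorum(weighted, hq))  # weighted: HQ alone keeps quorum
    print(has_quorum(equal, hq))     # equal weights: 3 of 8 is not a majority
```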
Another plan:
Switch to NDB. The network is allowed to be fragile. Consistency is maintained by "eventual consistency" instead of the "[virtually] synchronous replication" of Galera+InnoDB.
NDB allows you to immediately write on any node. Then the write is sent to the other nodes. If there is a conflict one of the values is declared the "winner". You choose which algorithm for determining the winner. An easy-to-understand one is "whichever write was 'first'".
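The "whichever write was first" rule can be illustrated with a toy resolver. This is not NDB's API; the `Write` record and its `timestamp` field are hypothetical, and ties are broken by node name purely to make the outcome deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    node: str         # node that accepted the write (hypothetical field)
    timestamp: float  # when the write was accepted (hypothetical field)
    value: str


def first_write_wins(conflicting):
    """Pick the earliest write among conflicting writes to the same row."""
    return min(conflicting, key=lambda w: (w.timestamp, w.node))


if __name__ == "__main__":
    writes = [
        Write("branch2", 12.5, "price=11"),
        Write("hq1", 12.1, "price=10"),
    ]
    print(first_write_wins(writes).value)
```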

MongoDB Sharding, Replication and Clustering

Based on my analysis, below is my understanding; correct me if my understanding is wrong.
Sharding - Horizontal scaling: split the records into multiple chunks and store them across multiple machines, with a good shard key for each collection.
Replication - Replicate the data across multiple machines for high availability.
Clustering - As per the Mongo architecture, there will be one master and multiple slave machines. Writes and sensitive reads are performed against the master, and other reads are performed against the slaves.
I am not able to correlate clustering with replication and sharding. Could someone please guide me on how to relate them?
The term "clustering" is not normally used with MongoDB. Instead, its meaning is included in the term "sharding". A shard is a node or replica set holding only a portion of your data, yes. And a cluster is simply a collection of shards (plus supporting nodes, such as config servers and mongos routers).
Replication, by contrast, is done with replica sets, which have one primary node (master) while the other nodes are secondaries (slaves).
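One way to picture how the terms fit together is a small data model: a cluster is a set of shards, each shard is a replica set, and each replica set has exactly one primary. The class names, host names, and hash-based routing below are illustrative assumptions, not MongoDB's implementation (a real cluster routes through mongos using metadata held by the config servers).

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ReplicaSet:
    name: str
    primary: str       # the single node that accepts writes
    secondaries: list  # nodes holding copies of the same data


@dataclass
class ShardedCluster:
    shards: list       # each shard is a ReplicaSet with a portion of the data

    def shard_for(self, shard_key):
        """Pick a shard by hashing the shard key (assumed routing rule)."""
        digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def write_target(self, shard_key):
        """Writes go to the primary of the shard that owns the key."""
        return self.shard_for(shard_key).primary


if __name__ == "__main__":
    cluster = ShardedCluster([
        ReplicaSet("rs0", "host-a", ["host-b", "host-c"]),
        ReplicaSet("rs1", "host-d", ["host-e", "host-f"]),
    ])
    print(cluster.write_target("store-1"))
```

Sharding decides which replica set a document belongs to; replication decides which nodes inside that replica set hold a copy.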

how to write data to multi master instance in mongodb 3.4

How can I write data to multiple MongoDB instances and keep the data synchronized among them, just like in MariaDB?
Currently we use a replica set in MongoDB, but it seems to support writing to only one node, which may become a bottleneck as write requests increase.
A sharded cluster is not appropriate for me.
Please read the docs (Replication in MongoDB)
Of the data bearing nodes, one and only one member is deemed the primary node, while the other nodes are deemed secondary nodes

Cassandra as replacement to PostgreSQL

Is Cassandra with multiple nodes a good choice as a replacement for single-node PostgreSQL? The data being stored is a time series. It is already tens of gigabytes and is expected to grow. The database should be integrated into a pipeline with Apache Spark, as a source and possibly as a destination for results.
What is needed:
1) redundancy: one node failure shouldn't stop the system (all data should be available)
2) speed: more nodes - less time per single insert/select for one client
3) concurrency: more nodes - better speed for simultaneous inserts/selects from different clients
For your points:
1) This is up to you: it depends on the keyspace replication factor (RF) and the consistency levels (CL) of your inserts and selects. To be available and consistent while handling the loss of one node, you need RF=3 and CL.QUORUM for both inserts and selects. (QUORUM needs RF/2+1 nodes online, using integer division: 3/2+1=2. With RF=5 you would need 5/2+1=3 nodes online, so you can handle the loss of 2.)
2) A single request is handled by a single node acting as coordinator in your cluster. You do not gain much performance here with single, synchronous requests. If you issue many requests and use async, your requests are split across more nodes and you gain performance.
3) With more clients you get the same effect: the coordinator is picked at random (OK, there is the TokenAwarePolicy, which picks an appropriate coordinator).
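The quorum arithmetic from point 1 can be written out as a short sketch (the function names are chosen here for illustration):

```python
def quorum(rf):
    """Replicas that must respond for QUORUM: RF/2 + 1 with integer division."""
    return rf // 2 + 1


def tolerated_failures(rf):
    """How many replicas can be down while QUORUM is still reachable."""
    return rf - quorum(rf)


if __name__ == "__main__":
    for rf in (3, 5):
        print(rf, quorum(rf), tolerated_failures(rf))
```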
You've mentioned that you use time series data.
1. Naturally, you can vary the replication factor and consistency level. So yes, Cassandra would be good as a replacement.
2. Inserts would be really fast, as Cassandra writes to memory first. So yes, Cassandra would be good as a replacement.
3. Cassandra has linear horizontal scalability. So yes, Cassandra would be good as a replacement.
The drawback is that Cassandra is essentially a key-value store, so you should model the table structure around your queries. PostgreSQL, as an RDBMS, is more flexible because it supports the whole set of SQL operations.
You can read more about some pros and cons of using Cassandra with time series data here and here.

Mongodb and Cassandra data storing mechanism

I have been reading about MongoDB and Cassandra. MongoDB is master/slave, whereas Cassandra is masterless (all nodes are equal). My question is about how data is stored in each of them.
Let's say a user sends a write request to MongoDB (a cluster with a master and several slaves, each on a separate machine). This means the master (or some application logic) will decide which slave this update should be written to. That is, the same data will not be available on all nodes in MongoDB, and each node's size may vary. Am I right? Also, when queried, will the master know which node the request should be sent to?
In the case of Cassandra, the same data will be written to all nodes, i.e., if one node's size is 10GB, then the other nodes' size is also 10GB. Only if this is the case can the user query another node without losing data when one node fails. Am I right here? If I am right and the same data is available on all nodes, then what is the advantage of using map/reduce in Cassandra? If I am wrong, then how is availability maintained in Cassandra, since the same data will not be available on another node?
I searched Stack Overflow for MongoDB vs Cassandra and read about 10 posts, but the answers there did not clear up my questions. Please clear up my doubts, and correct me if I have assumed wrongly.
Regarding MongoDB, yep you're right, there is only one primary.
Any secondary can become primary as long as everything is in sync as this will mean the secondary has all the data. Each node doesn't have to be the same on-disk size and this can vary depending on when the replication was done, however, they do have the same data (as long as they're in sync).
I don't know much about Cassandra, sorry!
I've written a thesis about NoSQL stores, and therefore I hope that I remember most of the parts correctly for Cassandra:
Cassandra is a mixture of Amazon Dynamo, from which it inherits replication and sharding, and Google's BigTable, from which it got the data model. So Cassandra basically shards your data while keeping copies of it on other nodes. Let's take a five-node cluster, with nodes called A to E. Your keys are hashed onto the key ring through consistent hashing, where contiguous areas of the key ring are stored on a given node. So if we have a value range from 1 to 100, by default each node will get 1/5 of the ring. A will cover [1,20), B [20,40), and so on.
An important concept in Dynamo is the triple (R,W,N), which tells how many nodes have to read, write, and keep a given value.
By default you have 3 (N) copies of your data, stored on the primary node and the two following nodes, which hold backups. If I remember correctly from the Dynamo paper, your writes go by default to the first W nodes of your N copies; the other nodes are eventually updated through a gossip protocol.
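The ring described above can be sketched as follows. The sketch assumes 100 ring positions numbered 0 to 99 (rather than 1 to 100) so that the five nodes get equal ranges of 20, uses MD5 only as a convenient stand-in hash, and places the N=3 copies on the owning node and the next two nodes around the ring. Real Cassandra uses different partitioners and token assignments.

```python
import hashlib

NODES = ["A", "B", "C", "D", "E"]
RING_SIZE = 100  # assumed number of ring positions, 0..99


def ring_position(key):
    """Hash a key onto the ring (MD5 is a stand-in, not Cassandra's partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def replicas_for_position(position, n=3):
    """Owning node for the position, plus the next n-1 nodes around the ring."""
    width = RING_SIZE // len(NODES)  # 20 positions per node
    owner = position // width
    return [NODES[(owner + i) % len(NODES)] for i in range(n)]


def replicas_for_key(key, n=3):
    return replicas_for_position(ring_position(key), n)


if __name__ == "__main__":
    for key in ("user:1", "user:2", "user:3"):
        print(key, ring_position(key), replicas_for_key(key))
```

A position in the last range wraps around: position 95 is owned by E and replicated to A and B.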
As long as everything is going fine you'll get consistent results. If your primary node is down for some time, another node takes your data through a hinted hand-off. Once the primary comes back your data will be merged, or at least a merge will be attempted (this part I can't really remember, but check the vector clocks, which are used to track the update history).
So if not too large a part of your cluster goes down, you'll have a consistent view of your data. If larger parts of your cluster are down, or you read from only a small part of your copies, you may see inconsistencies, which (may) eventually become consistent.
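The link between (R,W,N) and consistency is often summarized by the rule from the Dynamo design that a read is guaranteed to overlap the latest acknowledged write when R + W > N. A minimal check, with a function name chosen here for illustration:

```python
def read_overlaps_latest_write(r, w, n):
    """True if any R replicas read must include at least one of the W written."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


if __name__ == "__main__":
    print(read_overlaps_latest_write(2, 2, 3))  # quorum reads and writes
    print(read_overlaps_latest_write(1, 1, 3))  # reading a small part of the copies
```

With N=3, reading and writing at quorum (R=2, W=2) overlaps, while R=1 and W=1 may return a stale copy until the replicas converge.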
Hope that helped. I can highly recommend reading the original papers about Amazon Dynamo and Google BigTable, but I think you're mostly interested in Amazon Dynamo. Additionally, this post from Werner Vogels may come in handy as well.
As for the shard size, I think it can vary depending on your machines and how hot given areas of your key ring are.
Cassandra does not, typically, keep all data on all nodes. As you suggest, this would defeat some of the advantages offered by its distributed data model (in particular, fast writes would be hampered). The amount of replication desired (how many nodes should keep copies of your data) is configurable: the replication factor is set per keyspace, while the client chooses per request how many replicas must acknowledge a read or write (the consistency level). As such, you can set it up to replicate across all nodes, or keep your data on a single node with no replication. It's up to you. The specific node(s) to which the data gets written is determined by the hash value of the key. Each node is assigned a range of hash values it will store, so when you go to look up a value, the key is hashed again and that indicates on which node to find the data.