Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra - scala

I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:
Machine 1       Machine 2        Machine 3        Machine 4
Spark Master    Spark Worker     Spark Worker     Spark Worker
-               Cassandra node   Cassandra node   Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.
Or does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and then instruct the Executors to read "their" part of the data?
If someone can enlighten me, that would be much appreciated.

The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes creating logical and physical plans and applying optimizations. To be able to do that, it needs access to the data source schema and possibly other information such as statistics. Implementation details vary from source to source, but generally speaking the data source should be accessible from all nodes, including the one that runs the driver program.
At the end of the day your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
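The division of labor described above can be illustrated with a small, self-contained simulation. This is a toy model in plain Python, not the Spark Cassandra Connector's actual code: the ToyCluster class, the driver_plan and executor_fetch functions, the host names and the integer "token" ranges are all made up for illustration. It only shows the shape of the interaction: the driver asks the data store for metadata (which node owns which range) and plans one task per range, and each executor then reads its own rows directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    host: str   # node that owns this range (a locality hint for scheduling)
    start: int  # inclusive
    end: int    # exclusive


class ToyCluster:
    """Hypothetical stand-in for a data store: rows keyed by an integer token."""

    def __init__(self, rows, ownership):
        self.rows = rows              # {token: value}
        self.ownership = ownership    # [(host, start, end), ...]
        self.metadata_calls = []      # records who asked for metadata
        self.data_calls = []          # records who read rows

    def describe_ranges(self, caller):
        self.metadata_calls.append(caller)
        return [Split(h, s, e) for h, s, e in self.ownership]

    def read_range(self, caller, split):
        self.data_calls.append(caller)
        return [v for t, v in sorted(self.rows.items())
                if split.start <= t < split.end]


def driver_plan(cluster):
    # Driver side: fetches metadata only; no rows are read here.
    return cluster.describe_ranges(caller="driver")


def executor_fetch(cluster, split):
    # Executor side: each task reads just its own range.
    return cluster.read_range(caller=f"executor@{split.host}", split=split)


cluster = ToyCluster(
    rows={t: f"row-{t}" for t in range(9)},
    ownership=[("machine-2", 0, 3), ("machine-3", 3, 6), ("machine-4", 6, 9)],
)
splits = driver_plan(cluster)
results = {s.host: executor_fetch(cluster, s) for s in splits}
```

In this model the driver never touches a row, but it still needs a working connection to the store for describe_ranges. With the Spark Cassandra Connector the contact point is set through the spark.cassandra.connection.host property, so a value of 127.0.0.1 would presumably also be used by the driver on Machine 1, where no Cassandra node is listening.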

Related

Apache Spark - how to see how many nodes are being used during a job run?

I am using Scala Spark 2.4 and want to know the usage of the queue.
How can I display how many nodes of a big cluster (100+ nodes) are being utilized by a particular running job?
Thanks.

What is the StreamSets architecture?

I am not very clear about the architecture even after going through the tutorials. How do we scale StreamSets in a distributed environment? Let's say the velocity of our input data from the origin increases: how do we ensure that SDC doesn't run into performance issues? How many daemons will be running? Is it a master-worker architecture or a peer-to-peer architecture?
If there are multiple daemons running on multiple machines (e.g. one SDC along with one NodeManager in YARN), how will it show a centralized view of the data, i.e. total record count etc.?
Also, please let me know the architecture of Dataflow Performance Manager. Which daemons are part of this product?
StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases, this can be done automatically, for example Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application to run as many pipeline instances as there are Kafka partitions.
In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.

MongoDB with the Spark connector

If I have a MongoDB replica set, then the primary server receives all the write/read operations and writes them to the server.
The secondary servers read the operations from the oplog and replicate them.
Now I would like to analyze the data in the MongoDB replica set with the spark-mongodb-connector. I can install a Spark cluster on all three nodes and run analytics on it in memory.
I understand that a Spark cluster has a master node to which I have to submit the Spark job for analytics or Spark Streaming. Both are installed on an application server in Tomcat.
Now I need to choose a master node to submit the job to from my Tomcat app server.
Should the primary server be the Spark Master node, so that the driver of an application can then connect to it to submit jobs?
What would be the Spark master in a sharded cluster?
It doesn't really matter which node is the Spark Master in your cluster.
The Spark master is responsible for allocating executors to applications, and the application's driver then assigns tasks to those executors. The master does not receive all read/write requests.
Each executor will then be responsible for fetching the data it needs to process.
Be careful about data partitioning in Spark: MongoDB might provide only a single partition to start with, so you might want to repartition first.
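To see why a single partition matters: each partition is processed by one task, so the number of tasks that can run at once in a stage is bounded by the smaller of the partition count and the total number of executor cores. A minimal arithmetic sketch in Python, assuming one core per task; the function name and the executor and core counts are illustrative assumptions, not values from the question:

```python
def max_parallel_tasks(num_partitions: int, executors: int, cores_per_executor: int) -> int:
    """Upper bound on tasks that can run at once in a single stage."""
    return min(num_partitions, executors * cores_per_executor)


# One input partition: only one task runs, however many executors exist.
before = max_parallel_tasks(num_partitions=1, executors=3, cores_per_executor=4)

# After repartitioning into 12 partitions, all 12 cores can be kept busy.
after = max_parallel_tasks(num_partitions=12, executors=3, cores_per_executor=4)
```

In Spark itself the repartitioning step would be a call such as repartition(12) on the RDD or DataFrame, which triggers a shuffle.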

Akka Persistence: migrating from jdbc (postgres) to cassandra

I have a running project that uses the akka-persistence-jdbc plugin with PostgreSQL as a backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a migration program by hand that reads from Postgres and writes to Cassandra in the right format?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and Spark Cassandra Connector: The Spark JDBC API (part of Spark SQL and the DataFrame API) allows you to read from any JDBC source. You can read the data in chunks by segmenting it; otherwise you will run out of memory. Segmentation also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient approach I have used in my own tasks.
Java Agents: A Java agent can be written on top of plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs across multiple machines with multiple threads on each, and recovers automatically if something goes wrong. If you write an agent like this by hand, it runs on a single machine only, and the multithreading also has to be coded by you.
Kafka Connectors: Kafka is a message broker. It can be used indirectly to migrate. Kafka has connectors that can read from and write to different databases. You can use a JDBC connector to read from PostgreSQL and a Cassandra connector to write to Cassandra. It's not that easy to set up, but it has the advantage of "no coding involved".
ETL Systems: Some ETL systems support Cassandra, but I haven't personally tried any of them.
I saw several advantages in using the Spark Cassandra Connector and Spark SQL for the migration:
The code was concise, hardly 40 lines
Multi-machine (and multi-threaded on each machine)
Job progress and statistics in the Spark Master UI
Fault tolerance: if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs
If you don't know Spark, then writing an agent is fine for 4 GB of data.
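The segmentation mentioned above for the JDBC read can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions, not Spark's own partitioning code: it assumes the source table has a numeric column with known minimum and maximum values, and the table name "journal" and column name "ordering" are placeholders rather than details taken from the question.

```python
def segment_ranges(lower: int, upper: int, num_segments: int):
    """Split the inclusive range [lower, upper] into contiguous, non-overlapping chunks."""
    if num_segments < 1 or upper < lower:
        raise ValueError("need num_segments >= 1 and upper >= lower")
    total = upper - lower + 1
    num_segments = min(num_segments, total)  # never create empty segments
    base, extra = divmod(total, num_segments)
    ranges, start = [], lower
    for i in range(num_segments):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def segment_queries(table: str, column: str, ranges):
    # Identifiers are interpolated directly, so only pass trusted table/column names.
    return [
        f"SELECT * FROM {table} WHERE {column} BETWEEN {lo} AND {hi}"
        for lo, hi in ranges
    ]


ranges = segment_ranges(1, 10, 3)
queries = segment_queries("journal", "ordering", ranges)
```

Each query can then be read and written independently, which is what keeps memory use bounded and lets the work run in parallel. Spark's JDBC reader exposes the same idea through its partitionColumn, lowerBound, upperBound and numPartitions options, so with Spark the ranges would not normally need to be computed by hand.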

What is the minimum server configuration for HBase?

What is the minimum number of servers needed for HBase?
Fully distributed, using sharding, but without using Hadoop.
It's for a production environment.
I'm looking for an explanation like this:
Server 1: Zookeeper
Server 2: Region server
... and more
Thank you.
The minimum is one; see pseudo-distributed mode. The moving parts involved are:
Assuming that you are running on HDFS (which you should be doing):
1 HDFS NameNode
1 or more HDFS Secondary NameNode(s)
1 or more HDFS DataNode(s)
For MapReduce (if you want it):
1 MapReduce JobTracker
1 or more MapReduce TaskTracker(s) (Usually same machines as datanodes)
For HBase itself
1 or more HBase Master(s) (Hot backups are a good idea)
1 or more HBase RegionServer(s) (Usually same machines as datanodes)
1 or more Thrift Servers (if you need to access HBase from the outside the network it is on)
For ZooKeeper
3 - 5 ZooKeeper node(s)
The number of machines that you need is really dependent on how much reliability you need in the face of hardware failure and for what kind of nodes. The only node of the above that does not (yet) support hot failover or other recovery in the face of hardware failure is the HDFS NameNode, though that is being fixed in the more recent Hadoop releases.
You typically want to set the HDFS replication factor of your RegionServers to 3, so that you can take advantage of rack awareness.
So after that long diatribe, I'd suggest at a minimum (for a production deployment):
1x HDFS NameNode
1x JobTracker / Secondary NameNode
3x ZK Nodes
3x DataNode / RegionServer nodes (And if you want to run MapReduce, TaskTracker)
1x Thrift Server (Only if accessing HBase from outside of the network it is running on)
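As a quick sanity check on the list above, the machine count and the failure tolerance of the ZooKeeper ensemble can be worked out with a few lines of Python. This is only arithmetic on the numbers in the answer; it assumes each listed role runs on its own machine, and the dictionary keys and function name are illustrative.

```python
def zk_tolerated_failures(ensemble_size: int) -> int:
    """A ZooKeeper ensemble stays available while a strict majority of nodes is up."""
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum


# Suggested production minimum from the list above, one role per machine.
suggested = {
    "HDFS NameNode": 1,
    "JobTracker / Secondary NameNode": 1,
    "ZooKeeper": 3,
    "DataNode / RegionServer": 3,
    "Thrift Server (optional)": 1,
}
total_machines = sum(suggested.values())
without_thrift = total_machines - suggested["Thrift Server (optional)"]
```

Under these assumptions the suggestion comes to 9 machines, or 8 without the Thrift server. A 3-node ZooKeeper ensemble tolerates one failed node and a 5-node ensemble tolerates two, while a 4-node ensemble tolerates no more than a 3-node one, which is why odd ensemble sizes are the usual choice.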