Apache Spark - how to see how many nodes are being used during a job run? - scala

I am using Spark 2.4 with Scala and want to understand how heavily the queue is being used.
How can I display how many nodes of a large cluster (100+ nodes) are being utilized by a particular running job?
Thanks.
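One approach, assuming you can add a little code to the job itself: the SparkContext status tracker lists the executors currently allocated to the application, and the number of distinct hosts among them approximates how many nodes the job is using at that moment. A minimal sketch for Spark 2.4; `spark` stands for your existing SparkSession:

```scala
// Minimal sketch: count executors and distinct hosts for the running application.
val sc = spark.sparkContext

// One entry per active executor (in some deploy modes the driver shows up here too).
val executors = sc.statusTracker.getExecutorInfos

// Each executor reports its host; distinct hosts ~= nodes currently in use.
val hosts = executors.map(_.host()).toSet

println(s"Executors: ${executors.length}, distinct nodes: ${hosts.size}")
```

Alternatively, the Executors tab of the Spark UI (or the REST endpoint /api/v1/applications/<app-id>/executors on the driver's UI port) shows the same information without any code changes.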

Related

What is the StreamSets architecture?

I am not very clear about the architecture even after going through the tutorials. How do we scale StreamSets in a distributed environment? Say the input data velocity from the origin increases; how do we ensure that SDC doesn't run into performance issues? How many daemons will be running? Is it a master-worker architecture or a peer-to-peer architecture?
If multiple daemons run on multiple machines (e.g. one SDC alongside one NodeManager in YARN), how does it show a centralized view of the data, i.e. total record count etc.?
Also, please explain the architecture of Dataflow Performance Manager. Which daemons does that product include?
StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases this can be done automatically: Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application, running as many pipeline instances as there are Kafka partitions.
In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.

Partitioning of Apache Spark

I have a cluster consisting of 1 master and 10 worker nodes. When I set the number of partitions to 3, does the master use only 3 of the worker nodes, or all of them? It looks like all of them are used.
The question is not entirely clear, but the following may help.
When you start the job with 10 executors, the Spark application master acquires all of those resources from YARN up front, so all the executors are already associated with the Spark job.
However, if your data has fewer partitions than the number of available executors, the remaining executors will sit idle. Hence it is not a good idea to keep the number of partitions lower than the executor count.
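As a hedged illustration of that last point, you can check how many partitions a dataset has and repartition it so that there is at least one task per executor core; `spark` and the input path below are placeholders:

```scala
// Minimal sketch: inspect the partition count and repartition so all executors get work.
val df = spark.read.parquet("/data/input")               // hypothetical input path
println(s"Partitions before: ${df.rdd.getNumPartitions}")

// defaultParallelism is usually the total number of executor cores,
// so using at least that many partitions keeps all 10 workers busy.
val target = spark.sparkContext.defaultParallelism
val balanced = df.repartition(target)
println(s"Partitions after: ${balanced.rdd.getNumPartitions}")
```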

Parallelism in Spark Job server

We are working on Qubole with Spark version 2.0.2.
We have a multi-step process in which all the intermediate steps write their output to HDFS and later this output is used in the reporting layer.
As per our use case, we want to avoid writing to HDFS and instead keep all the intermediate output as temporary tables in Spark, writing out only the final reporting-layer output.
For this we wanted to use the Job Server provided by Qubole, but when we trigger multiple queries on the Job Server, it runs our jobs sequentially.
I have observed the same behavior on a Databricks cluster as well.
The cluster we are using is a 30-node r4.2xlarge cluster.
Does anyone have experience running multiple jobs using a job server?
The community's help will be greatly appreciated!
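For the temporary-table part of the question, here is a minimal sketch of keeping intermediate results as temp views and persisting only the final output; the table and path names are made up for illustration and `spark` is your existing SparkSession:

```scala
// Step 1: intermediate result kept as a temp view, not written to HDFS.
val step1 = spark.read.parquet("/source/events")          // hypothetical source path
  .filter("event_type = 'purchase'")
step1.createOrReplaceTempView("step1_purchases")

// Step 2: build on the previous view with SQL, again without persisting.
val step2 = spark.sql(
  """SELECT user_id, count(*) AS purchases
    |FROM step1_purchases
    |GROUP BY user_id""".stripMargin)
step2.createOrReplaceTempView("step2_agg")

// Only the final reporting-layer output is written out.
spark.table("step2_agg").write.mode("overwrite").parquet("/reports/purchases")
```

Whether the jobs themselves run concurrently is a separate issue: within a single Spark application, concurrent jobs generally require the actions to be submitted from separate threads (optionally with spark.scheduler.mode=FAIR); whether a given job server does that for you depends on the job server.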

Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

I guess I'm not yet fully understanding how Spark works.
Here is my setup:
I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.
I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:
Machine 1       Machine 2        Machine 3        Machine 4
Spark Master    Spark Worker     Spark Worker     Spark Worker
                Cassandra node   Cassandra node   Cassandra node
The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:
a Driver instance is started on the Spark Master
the Driver starts one Executor on each Spark Worker
the Driver distributes my application to each Executor
my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042
However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).
What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.
Or, does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and instructs the Executors to read "their" part of the data?
If someone can enlighten me, that would be much appreciated.
The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes building logical and physical plans and applying optimizations. To do that, it needs access to the data source schema and possibly other information such as statistics. Implementation details vary from source to source, but generally speaking it means the data source should be accessible from all nodes, including the application master.
At the end of the day your expectations are almost correct: chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
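To make that concrete, here is a minimal sketch using the spark-cassandra-connector DataFrame API (the connector on the classpath is assumed; the keyspace and table names are placeholders). The connection host must be reachable from the driver machine as well, because the driver fetches the table metadata before the executors read their partitions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-read")
  // Must be resolvable and reachable from the driver, not only from the workers.
  .config("spark.cassandra.connection.host", "machine-2,machine-3,machine-4")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "monitoring", "table" -> "testresults"))  // placeholders
  .load()

// The count is computed on the executors; only metadata went through the driver.
println(df.count())
```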

How to run Spark processing in parallel in Eclipse?

I would like to run a Spark application with multiple executors in parallel while trying the application out in my development environment (Eclipse). It seems the Spark engine serializes all the tasks and runs them on a single executor.
Is there an option to run two or more tasks in parallel in Eclipse with spark.master=local?
Use spark.master="local[n]", where n is the number of cores you want to assign to Spark, or "*" for all cores.
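A minimal sketch of what that looks like when launching from the IDE; the job itself is just a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-parallelism-test")
  .master("local[4]")                    // or "local[*]" to use all available cores
  .getOrCreate()

// 8 partitions over 4 local threads: up to 4 tasks run in parallel at a time.
val counts = spark.sparkContext
  .parallelize(1 to 1000000, numSlices = 8)
  .map(_ % 10)
  .countByValue()

println(counts)
spark.stop()
```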
Based on my understanding, RDDs are stored in executor memory, which is why Spark runs each task where its data is available. I don't think there is an option to control how Spark executes the tasks: basically it creates stages from the job and then schedules the tasks within them.