Spark standalone cluster - scala

I have a Spark standalone cluster. The cluster consists of 2 workers and 1 master node. When I run a program on the master node, jobs are only assigned to one worker; the other worker does nothing.
The workers appear in the attached picture. To run my code, I used the following command:
spark-submit --class Main.Main --master spark://172.19.0.2:7077 --deploy-mode cluster Main.jar ReadText.txt

From the image above, we can see that your worker nodes have 1 core each.
You can use the command below:
spark-submit --class Main.Main --total-executor-cores 2 --executor-cores 1 --master spark://172.19.0.2:7077 --deploy-mode cluster Main.jar ReadText.txt
Hope this helps!

Can you please try once with deploy mode client, or just omit that parameter? What is happening here is that when the deploy mode is cluster, one of your workers runs the driver task and the other worker runs the RDD tasks, so only one of your workers executes the tasks. When you run the shell, it uses client mode by default and therefore uses both workers for running tasks. Please try the command below once to deploy the application, and could you also share a code snippet of your application?
spark-submit --class Main.Main --master spark://172.19.0.2:7077 Main.jar ReadText.txt
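In the meantime, for reference, here is a minimal sketch of what such a Main class might look like, assuming it simply word-counts the text file passed as the first argument (the actual application code has not been shared, so this is only an illustration; the number of partitions determines how many tasks can run across the two workers):
package Main

import org.apache.spark.sql.SparkSession

// Hypothetical sketch of Main.Main: a simple word count over the file passed
// as the first argument (e.g. ReadText.txt).
object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Main").getOrCreate()
    val sc = spark.sparkContext

    // The number of partitions bounds how many tasks can run in parallel,
    // and therefore how many workers can be kept busy at once.
    val lines = sc.textFile(args(0), minPartitions = 2)

    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}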

Related

Amazon EMR uses only one core node, but I have two core nodes

I'm trying to use EMR for crawling. The target server recognizes the client IP, so I want to run one executor on each core node. Currently, I have one master node and two core nodes. The core nodes are of type c4.large, which has two vcores each, so I need to change the settings. (The default settings would run both executors on one core node.)
Here is the configuration for my cluster.
[{"classification":"spark", "properties":{"maximizeResourceAllocation":"true"}, "configurations":[]},
{"classification":"yarn-site", "properties":{
"yarn.nodemanager.resource.cpu-vcores":"1",
"yarn.nodemanager.resource.memory-mb":"3584",
"yarn.scheduler.maximum-allocation-vcores":"1",
"yarn.scheduler.maximum-allocation-mb":"3584"}, "configurations":[]},
{"classification":"mapred-site", "properties":{
"mapreduce.map.memory.mb":"3584",
"mapreduce.map.cpu.vcores":"1"}, "configurations":[]}]
And here is the run script.
spark-submit \
--conf spark.hadoop.parquet.enable.dictionary=true \
--conf spark.hadoop.parquet.enable.summary-metadata=false \
--conf spark.sql.hive.metastorePartitionPruning=true \
--conf spark.sql.parquet.filterPushdown=true \
--conf spark.sql.parquet.mergeSchema=true \
--conf spark.worker.cleanup.enabled=true \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH" \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=3200m \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
extract_spark.py news_new data
Lastly, here is the code snippet.
numbers = sc.parallelize(list(range(100)))
contents = numbers.flatMap(lambda n: get_contents(args.id, n)).toDF()
contents.coalesce(2).write.mode('append').parquet(
os.path.join(args.path, args.id))
It only uses one core node: two map tasks are executed in sequence on a single core node. The core node used is selected randomly, so I guess both core nodes are available to be used.
How can I run two tasks on the two core nodes in parallel?
I found out that client mode is not supported on EMR, which means the driver takes up resources on the core nodes. So, to answer my own question, I need to increase the configured number of vcores and decrease the memory required for each task. Or, I can just increase the number of core nodes.
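For illustration, here is a sketch of how the yarn-site classification could be adjusted along those lines; the exact values are assumptions and would need tuning so that the driver and one executor per node both fit (spark.executor.memory in the run script would also have to be lowered accordingly):
{"classification":"yarn-site", "properties":{
"yarn.nodemanager.resource.cpu-vcores":"2",
"yarn.nodemanager.resource.memory-mb":"3584",
"yarn.scheduler.maximum-allocation-vcores":"1",
"yarn.scheduler.maximum-allocation-mb":"1792"}, "configurations":[]}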

How can I submit multiple jobs in a Spark standalone cluster?

I have a machine with Apache Spark. The machine has 64 GB of RAM and 16 cores.
My objective in each Spark job:
1. Download a gz file from a remote server
2. Extract gz to get csv file (1GB max)
3. Process csv file in spark and save some stats.
Currently I am submitting one job for each file received, using the following:
./spark-submit --class ClassName --executor-cores 14 --num-executors 3 --driver-memory 4g --executor-memory 4g jar_path
Then I wait for this job to complete before starting a new job for the next file.
Now I want to utilise the 64 GB of RAM by running multiple jobs in parallel.
I can assign 4 GB of RAM to each job, and I want my jobs to queue when enough jobs are already running.
How can I achieve this?
You should submit multiple jobs from different threads:
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
and configure pool properties (set schedulingMode to FAIR):
https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
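A minimal sketch of that approach, assuming the per-file work is wrapped in a function; the pool name and file paths below are illustrative:
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Run several file-processing jobs concurrently inside one Spark application.
val spark = SparkSession.builder()
  .appName("ParallelCsvJobs")
  .config("spark.scheduler.mode", "FAIR") // FAIR scheduling within the application
  .getOrCreate()

def processFile(path: String): Long = {
  // Jobs launched from this thread are assigned to the named pool.
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", "csvPool")
  spark.read.option("header", "true").csv(path).count()
}

// Each Future submits its jobs from its own thread, so they can run in parallel.
val futures = Seq("/data/file1.csv", "/data/file2.csv").map(p => Future(processFile(p)))
val results = futures.map(f => Await.result(f, Duration.Inf))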
From Spark Doc:
https://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling:
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications. However, to allow multiple concurrent
users, you can control the maximum number of resources each
application will use. By default, it will acquire all cores in the
cluster, which only makes sense if you just run one application at a
time. You can cap the number of cores by setting spark.cores.max ...
By default, Spark allocates all the resources to one single job. We need to cap the resources per job so that there is room to run other jobs as well. Below is a command you can use to submit a Spark job.
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar

GCP Dataproc: Directly working with Spark over Yarn Cluster

I'm trying to minimize changes in my code so I'm wondering if there is a way to submit a spark-streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using GCP SDK.
I also have to specify a directory with configuration files (HADOOP_CONF_DIR), which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is
Submit a Spark job to the Dataproc cluster.
Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
The script has dependencies that mean it cannot run inside of the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, if you can configure your network such that the Spark driver (running within Dataproc) has access to the service/script you need to run, you can invoke it when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
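As a rough sketch of that idea, a StreamingListener registered in the driver could call out to a service on a reachable VM each time a batch completes; the host, port and path below are assumptions:
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
import scala.sys.process._

// Notify an external service from the driver whenever a streaming batch completes.
class BatchFinishListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // Assumed endpoint: a VM on the same VPC/network as the Dataproc cluster.
    val exitCode = Seq("curl", "-s", "-o", "/dev/null", "http://10.128.0.42:8080/batch-finished").!
    if (exitCode != 0)
      println(s"batch-finish notification failed with exit code $exitCode")
  }
}

// Registration, assuming an existing StreamingContext named ssc:
// ssc.addStreamingListener(new BatchFinishListener)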

spark write parquet to HDFS very slow on multi node

My spark-submit runs fine with --master local[*],
but when I run the spark-submit on my multi-node cluster with
--master <master-ip>:<port> --deploy-mode client,
my app runs well until it writes Parquet to HDFS: it doesn't stop, there are no error messages, nothing, it just keeps running.
I tracked the blocking part down in the app; it is:
resultDataFrame.write.parquet(path)
I tried with
resultDataFrame.repartition(1).write.parquet(path)
but it is still the same...
Thank you in advance for the help
I can see you are trying to use the master as local[*], which will run the Spark job in local mode and will not be able to use cluster resources.
If you are running the Spark job on a cluster, you can look at spark-submit options such as setting the master to yarn and the deploy mode to cluster; the command is shown below.
spark-submit \
  --class <main-class> \
  --master yarn \
  --deploy-mode cluster \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Once you run the Spark job with master yarn and deploy mode cluster, it will try to utilize all cluster resources.

Apache Spark in cluster mode: where do the jobs run, on the master or on the worker nodes?

I have installed Spark in cluster mode: 1 master and 2 workers. When I start spark-shell on the master node, it keeps running without ever reaching the Scala prompt.
But when I run spark-shell on a worker node, I get the Scala prompt and I am able to run jobs.
val file = sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
file.count()
And for this I got the output.
So my doubt is: where should the Spark jobs actually be run?
Is it on the worker nodes?
Based on the documentation, you need to connect your spark-shell to the master node with the following command: spark-shell --master spark://IP:PORT. This URL can be retrieved from the master's UI or log file.
You should be able to launch the spark-shell on the master node (machine); make sure to check the UI to see whether the spark-shell is actually running and that the prompt is shown (you might need to press Enter after issuing spark-shell).
Please note that when you use spark-submit in cluster mode, the driver is launched on one of the worker nodes, unlike client mode where it runs as a client process. Refer to the documentation for more details.
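Putting that together with the snippet from the question (the master host and port below are assumptions; use the exact URL shown in your master's UI or logs):
// Start the shell against the cluster, e.g.:
//   spark-shell --master spark://192.168.1.20:7077
// then run the same job; the tasks will execute on the worker nodes.
val file = sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
println(file.count())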