Amazon EMR uses only one core node, but I have two core nodes - pyspark

I'm trying to use EMR for crawling. The target server recognizes the client IP, so I want to run exactly one executor on each core node. Currently, I have one master node and two core nodes. The core nodes are c4.large instances, which have two vCPUs each, so I need to change the settings (the default settings would run two executors on one core node).
Here is the configuration for my cluster.
[{"classification":"spark", "properties":{"maximizeResourceAllocation":"true"}, "configurations":[]},
{"classification":"yarn-site", "properties":{
"yarn.nodemanager.resource.cpu-vcores":"1",
"yarn.nodemanager.resource.memory-mb":"3584",
"yarn.scheduler.maximum-allocation-vcores":"1",
"yarn.scheduler.maximum-allocation-mb":"3584"}, "configurations":[]},
{"classification":"mapred-site", "properties":{
"mapreduce.map.memory.mb":"3584",
"mapreduce.map.cpu.vcores":"1"}, "configurations":[]}]
And here is the run script.
spark-submit \
--conf spark.hadoop.parquet.enable.dictionary=true \
--conf spark.hadoop.parquet.enable.summary-metadata=false \
--conf spark.sql.hive.metastorePartitionPruning=true \
--conf spark.sql.parquet.filterPushdown=true \
--conf spark.sql.parquet.mergeSchema=true \
--conf spark.worker.cleanup.enabled=true \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH" \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=3200m \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
extract_spark.py news_new data
Lastly, here is the code snippet.
numbers = sc.parallelize(list(range(100)))
contents = numbers.flatMap(lambda n: get_contents(args.id, n)).toDF()
contents.coalesce(2).write.mode('append').parquet(
os.path.join(args.path, args.id))
The job uses only one core node: two map tasks are executed in sequence on the same core node. Which core node gets used is selected randomly, so I assume both core nodes are available.
How can I run two tasks on two core nodes in parallel?

I found out that client mode is not supported on EMR, which means the driver takes up resources on the core nodes. So, to answer my own question: I need to increase the configured number of vcores and decrease the memory required for each task. Alternatively, I can simply increase the number of core nodes.
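One way to apply this, as a sketch (the numbers below are illustrative, not tested: exact values depend on YARN's minimum allocation and Spark's executor memory overhead, and YARN does not guarantee that the two executors land on different nodes), is to give both vCPUs per node back to YARN, cap each container at half of a node's memory so the application master and an executor can share one node, and request the executors explicitly instead of relying on maximizeResourceAllocation:
{"classification": "yarn-site",
 "properties": {
   "yarn.nodemanager.resource.cpu-vcores": "2",
   "yarn.nodemanager.resource.memory-mb": "3584",
   "yarn.scheduler.maximum-allocation-vcores": "1",
   "yarn.scheduler.maximum-allocation-mb": "1792"},
 "configurations": []}
and in spark-submit, request the executors explicitly:
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=1408m
Here 1408m leaves room for the default executor memory overhead (at least 384 MB) inside a 1792 MB container.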

Related

Spark (v2) does not generate output if the size is more than 2 GB

My Spark application writes outputs that range from a few KB to several GB. I have been facing a problem generating output in cases where the file size appears to be more than 2 GB: nothing seems to happen, and I see hardly any CPU usage. However, when the output size is less than 2 GB, such as 1.3 GB, the same application works flawlessly. Note that writing the output is the last stage, and all the computations on the data to be written complete correctly (as can be seen from the debug output), so storing the data on the driver is not the issue. The executor memory size is also not the issue, as I increased it to as much as 90 GB, while 30 GB already seems adequate. The following is the code I am using to write the output. Please suggest a way to fix it.
var output = scala.collection.mutable.ListBuffer[String]()
...
output.toDF().coalesce(1).toDF().write.mode("overwrite")
.option("parserLib","univocity").option("ignoreLeadingWhiteSpace","false")
.option("ignoreTrailingWhiteSpace","false").format("csv").save(outputPath)
Other related parameters passed by spark-submit are as follows:
--driver-memory 150g \
--executor-cores 4 \
--executor-memory 30g \
--conf spark.cores.max=252 \
--conf spark.local.dir=/tmp \
--conf spark.rpc.message.maxSize=2047 \
--conf spark.driver.maxResultSize=50g \
The issue was observed on two different systems, one a standalone installation and the other a Spark cluster.
Based on Gabio's idea of repartitioning, I solved the problem as follows:
val tDF = output.toDF()
println("|#tDF partitions = " + tDF.rdd.partitions.size.toString)
tDF.write.mode("overwrite")
.option("parserLib","univocity").option("ignoreLeadingWhiteSpace","false")
.option("ignoreTrailingWhiteSpace","false").format("csv").save(outputPath)
The output ranged between 2.3 GB and 14 GB, so the source of the problem is elsewhere and perhaps not in spark.driver.maxResultSize.
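For anyone doing the same thing in PySpark, a minimal sketch of the equivalent fix (df and output_path are placeholders for the DataFrame and path in question): drop the coalesce(1), which funnels the whole output through a single task, and let Spark write one part-file per partition.
# PySpark sketch of the same fix: write with the existing partitions
# instead of coalescing everything into one.
print("partitions:", df.rdd.getNumPartitions())
(df.write
   .mode("overwrite")
   .option("ignoreLeadingWhiteSpace", "false")
   .option("ignoreTrailingWhiteSpace", "false")
   .format("csv")
   .save(output_path))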
A big thank you to @Gabio!

How can I submit multiple jobs in a Spark standalone cluster?

I have a machine with Apache Spark. The machine has 64 GB of RAM and 16 cores.
My objective in each Spark job is to:
1. Download a gz file from a remote server
2. Extract gz to get csv file (1GB max)
3. Process csv file in spark and save some stats.
Currently I am submitting one job for each file received, as follows:
./spark-submit --class ClassName --executor-cores 14 --num-executors 3 --driver-memory 4g --executor-memory 4g jar_path
Then I wait for this job to complete before starting a new job for the next file.
Now I want to utilise the 64 GB of RAM by running multiple jobs in parallel.
I can assign 4 GB of RAM to each job, and I want new jobs to queue when enough jobs are already running.
How can I achieve this?
You should submit multiple jobs from different threads:
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
and configure pool properties (set schedulingMode to FAIR):
https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
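A minimal PySpark sketch of that approach (file paths and pool names are illustrative; it assumes the application runs with spark.scheduler.mode=FAIR, and note that the mapping of Python threads to scheduler pools has varied between Spark versions):
import threading
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parallel-file-jobs")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())
sc = spark.sparkContext

def process(path, pool):
    # Jobs triggered from this thread are assigned to their own FAIR pool.
    sc.setLocalProperty("spark.scheduler.pool", pool)
    df = spark.read.csv(path, header=True)                      # process the csv
    df.describe().write.mode("overwrite").csv(path + "_stats")  # save some stats

paths = ["/data/a.csv", "/data/b.csv"]  # hypothetical inputs
threads = [threading.Thread(target=process, args=(p, "pool%d" % i))
           for i, p in enumerate(paths)]
for t in threads:
    t.start()
for t in threads:
    t.join()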
From the Spark docs:
https://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling:
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications. However, to allow multiple concurrent
users, you can control the maximum number of resources each
application will use. By default, it will acquire all cores in the
cluster, which only makes sense if you just run one application at a
time. You can cap the number of cores by setting spark.cores.max ...
By default, Spark uses all the resources for one single job. We need to limit the resources so that there will be room to run other jobs as well. Below is a command you can use to submit a Spark job.
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar
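If you keep submitting one application per file as in the question, a small wrapper along these lines (a sketch only; the class name and paths are placeholders modeled on the question) caps each application and limits how many run at once, so remaining files simply wait in the queue:
import subprocess
from concurrent.futures import ThreadPoolExecutor

def submit(file_path):
    # Each application is capped at 4 cores and 4g executors, so about four
    # of them can share the 16-core / 64 GB machine at the same time.
    subprocess.run([
        "./spark-submit",
        "--class", "ClassName",
        "--conf", "spark.cores.max=4",
        "--executor-memory", "4g",
        "--driver-memory", "2g",
        "/path/to/app.jar", file_path,
    ], check=True)

files = ["/data/file1.gz", "/data/file2.gz", "/data/file3.gz"]  # hypothetical
with ThreadPoolExecutor(max_workers=4) as pool:  # at most 4 concurrent submissions
    for f in files:
        pool.submit(submit, f)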

Spark parquet write to HDFS very slow on a multi-node cluster

My spark-submit runs fine with --master local[*], but when I run it on my multi-node cluster with
--master <ip of master>:<port> --deploy-mode client
the app runs well until it writes the parquet output to HDFS. Then it never finishes: no error messages, nothing, it just keeps running. I tracked the blocking part in the app down to:
resultDataFrame.write.parquet(path)
I also tried
resultDataFrame.repartition(1).write.parquet(path)
but the result is the same.
Thanks in advance for the help.
I can see you are trying to use local[*] as the master, which runs the Spark job in local mode and cannot use the cluster's resources.
If you are running the Spark job on a cluster, look at the spark-submit options, such as setting the master to yarn and the deploy mode to cluster; the command template is below.
spark-submit \
  --class <main-class> \
  --master yarn \
  --deploy-mode cluster \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Once you run the Spark job with master yarn and deploy mode cluster, it will try to utilize all of the cluster's resources.
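For example (the class name, jar and paths below are placeholders, not taken from the question), a filled-in version of the template might look like:
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  /path/to/my-app.jar /input/path /output/parquet/path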

Spark standalone cluster

I have a Spark standalone cluster consisting of 2 worker nodes and 1 master node. When I run a program on the master node, jobs are only assigned to one worker; the other worker does nothing.
The workers appear in the attached screenshot. To run my code, I used the following command:
spark-submit --class Main.Main --master spark://172.19.0.2:7077 --deploy-mode cluster Main.jar ReadText.txt
From the above image we can see that each of your worker nodes has 1 core.
You can use the command below:
spark-submit --class Main.Main --total-executor-cores 2 --executor-cores 1 --master spark://172.19.0.2:7077 --deploy-mode cluster Main.jar ReadText.txt
Hope this helps!
Can you please try with deploy mode client, or simply omit that parameter? What is happening here is that with deploy mode cluster, one of your workers runs the driver and the other worker runs the RDD tasks, which is why only one worker executes your tasks. When you run the shell, it uses client mode by default, and both workers run tasks. Please try the command below to deploy the application, and also share a code snippet of your application.
spark-submit --class Main.Main --master spark://172.19.0.2:7077 Main.jar ReadText.txt

Spark repartition() increases the number of tasks per executor; how do I increase the number of executors?

I'm working on an IBM server with 30 GB of RAM and 12 cores. I have given all the cores to Spark, but it still uses only 1 core. I tried setting the parallelism while loading the file and it worked with the command
val name_db_rdd = sc.textFile("input_file.csv",12)
and this provides all 12 cores to the initial jobs, but I also want the intermediate operations to be split across executors, so that all 12 cores stay in use.
[Image: Spark UI showing the repartitioned tasks running on a single executor]
val new_rdd = rdd.repartition(12)
As you can see in the image, only 1 executor is running, and the repartition function splits the data into many tasks on that one executor.
It depends on how you're launching the job, but you probably want to add --num-executors to your spark-submit command line.
Something like
spark-submit \
  --num-executors 10 \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 1
might work well for you.
Have a look at the Running Spark on YARN documentation for more details, though some of the switches it mentions are YARN-specific.