I am using the Spark 1.6 cosine similarity algorithm in Spark MLlib.
Input: 50k documents' text with ids in a dataframe.
Processing:
Tokenized the texts
Removed stop words
Generated vectors (size=300) using Word2Vec
Generated a RowMatrix
Transposed it
Used the columnSimilarities method with threshold 0.1 (also tried higher values)
Output is an n x n matrix.
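A rough back-of-the-envelope estimate shows why an n x n similarity output blows up at this scale. This sketch assumes a fully dense upper triangle of 8-byte doubles; the actual columnSimilarities result is a sparse CoordinateMatrix, so treat this as an upper bound rather than the real footprint:

```python
# Rough estimate of the similarity-matrix size for n documents.
# Assumes 8-byte doubles and a dense upper triangle (no diagonal);
# the real columnSimilarities output is sparse, so this is an upper bound.

def pair_count(n):
    """Number of distinct document pairs (upper triangle, no diagonal)."""
    return n * (n - 1) // 2

def dense_upper_triangle_gb(n, bytes_per_entry=8):
    return pair_count(n) * bytes_per_entry / 1024**3

print(pair_count(7_000))                          # ~24.5 million pairs
print(pair_count(50_000))                         # ~1.25 billion pairs
print(round(dense_upper_triangle_gb(50_000), 1))  # ~9.3 GB of raw doubles
```

Going from 7k to 50k documents multiplies the number of candidate pairs by roughly 50x, which is why a configuration that works at 7k can still run out of memory at 50k.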
I am using this spark submit
spark-submit --master yarn --conf "spark.kryoserializer.buffer.max=256m" --num-executors 60 --driver-memory 10G --executor-memory 15G --executor-cores 5 --conf "spark.shuffle.service.enabled=true" --conf "spark.yarn.executor.memoryOverhead=2048" noname.jar xyzclass
I am also using 400 partitions.
But I am getting out-of-memory issues. I have tried different combinations of partitions and numbers of executors but failed to run it successfully. However, I am able to run it successfully for 7k records with vector size 50 in less than 7 minutes. Any suggestions on how I can make it run on 50k records?
I am executing a Spark/Scala job using the spark-submit command. I have written my code in Spark SQL, where I join 2 tables and load the result into a third Hive table.
The code works fine, but sometimes I hit issues like OutOfMemoryError: Java heap space, or timeout errors.
So I want to control my job manually by passing the number of executors, cores, and memory. When I used 16 executors, 1 core, and 20 GB executor memory, my Spark application got stuck.
Can someone please suggest how I should control my Spark application manually by providing the correct parameters? And are there any other Hive- or Spark-specific parameters I can use for faster execution?
Below is the configuration of my cluster.
Number of Nodes: 5
Number of Cores per Node: 6
RAM per Node: 125 GB
Spark-submit command:
spark-submit --class org.apache.spark.examples.sparksc \
--master yarn-client \
--num-executors 16 \
--executor-memory 20g \
--executor-cores 1 \
examples/jars/spark-examples.jar
It depends on the volume of your data. You can make the parameters dynamic. This link has a very nice explanation:
How to tune spark executor number, cores and executor memory?
You can enable spark.shuffle.service.enabled and set spark.sql.shuffle.partitions=400, hive.exec.compress.intermediate=true, hive.exec.reducers.bytes.per.reducer=536870912, hive.exec.compress.output=true, hive.output.codec=snappy, and mapred.output.compression.type=BLOCK.
If your data is larger than about 700 MB, you can also enable the spark.speculation property.
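Putting those suggestions together, a spark-submit along these lines could serve as a starting point for the 5-node, 6-cores, 125 GB cluster described above. The executor counts and sizes here are illustrative guesses, not values tuned for your data:

```shell
# Illustrative starting point: one 5-core executor per node,
# leaving 1 core per node for the OS/Hadoop daemons.
spark-submit \
  --class org.apache.spark.examples.sparksc \
  --master yarn-client \
  --num-executors 5 \
  --executor-cores 5 \
  --executor-memory 20g \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.speculation=true \
  examples/jars/spark-examples.jar
```

From there, adjust executor memory and the shuffle partition count based on the stage-level metrics you see in the Spark UI.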
Until now, I have only used Spark on a Hadoop cluster with YARN as the resource manager. In that type of cluster, I know exactly how many executors to run and how the resource management works. However, now that I am trying to use a standalone Spark cluster, I have become a little confused. Correct me where I am wrong.
From this article, by default, a worker node uses all the memory of the node minus 1 GB. But I understand that by using SPARK_WORKER_MEMORY, we can make it use less memory. For example, if the total memory of the node is 32 GB, but I specify 16 GB, the Spark worker is not going to use any more than 16 GB on that node?
But what about executors? Let us say I want to run 2 executors per node; can I do that by specifying the executor memory during spark-submit to be half of SPARK_WORKER_MEMORY, and if I want to run 4 executors per node, by specifying the executor memory to be a quarter of SPARK_WORKER_MEMORY?
If so, besides executor memory, I would also have to specify executor cores correctly, I think. For example, if I want to run 4 executors on a worker, I would have to specify executor cores to be the quarter of SPARK_WORKER_CORES? What happens, if I specify a bigger number than that? I mean if I specify executor memory to be the quarter of SPARK_WORKER_MEMORY, but executor cores to be only half of SPARK_WORKER_CORES? Would I get 2 or 4 executors running on that node in that case?
This is the best way to control number of executors, cores and memory in my experience.
Cores: You can set total number of cores across all executors and number of cores per each executor
Memory: Executor memory individually
--total-executor-cores 12 --executor-cores 2 --executor-memory 6G
This would give you 6 executors with 2 cores/6 GB each, so in total you are looking at 12 cores and 36 GB.
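The arithmetic behind that layout can be sketched as a small helper (a hypothetical function, simply mirroring how standalone mode divides total cores by cores per executor):

```python
# Sketch of how the standalone flags determine the executor layout:
# executors = total-executor-cores / executor-cores,
# total memory = executors * executor-memory (hypothetical helper).
def standalone_layout(total_executor_cores, executor_cores, executor_memory_gb):
    executors = total_executor_cores // executor_cores
    return executors, executors * executor_memory_gb

executors, total_mem_gb = standalone_layout(12, 2, 6)
print(executors, total_mem_gb)  # 6 executors, 36 GB in total
```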
You can set driver memory using
--driver-memory 2G
So, I experimented with the Spark Standalone cluster myself a bit, and this is what I noticed.
My intuition that multiple executors can be run inside a worker by tuning executor cores was indeed correct. Let us say your worker has 16 cores. Now if you specify 8 cores per executor, Spark will run 2 executors per worker.
How many executors run inside a worker also depends upon the executor memory you specify. For example, if the worker memory is 24 GB and you want to run 2 executors per worker, you cannot specify the executor memory to be more than 12 GB.
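In other words, the executor count per worker is capped by both resources, whichever runs out first. A minimal sketch of that reasoning (a hypothetical helper, using the numbers from the two paragraphs above):

```python
# Executors that fit on one worker: limited by both the worker's cores
# and its memory; the tighter constraint wins (hypothetical helper).
def executors_per_worker(worker_cores, worker_mem_gb, exec_cores, exec_mem_gb):
    return min(worker_cores // exec_cores, worker_mem_gb // exec_mem_gb)

print(executors_per_worker(16, 24, 8, 12))  # 2 - both limits allow two
print(executors_per_worker(16, 24, 4, 12))  # 2 - cores would allow four, memory caps it at two
```

The second call answers the question above: with executor cores set to a quarter of the worker's cores but executor memory set to half the worker's memory, you still get only 2 executors, because memory is the binding constraint.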
A worker's memory can be limited when starting a slave by specifying a value for the optional --memory parameter or by changing the value of SPARK_WORKER_MEMORY. The same goes for the number of cores (--cores / SPARK_WORKER_CORES).
If you want to be able to run multiple jobs on the standalone Spark cluster, you can use the spark.cores.max configuration property when doing spark-submit, for example like this:
spark-submit <other parameters> --conf="spark.cores.max=16" <other parameters>
So, if your Standalone Spark Cluster allows 64 cores in total, and you give only 16 cores to your program, other Spark jobs could use the remaining 48 cores.
Spark uses parallelism; however, while testing my application and looking at the Spark UI, under the Streaming tab I often notice under "Active Batches" that the status of one batch is "processing" and the rest are "queued". Is there a parameter I can configure to make Spark process multiple batches simultaneously?
Note: I am using spark.streaming.concurrentJobs greater than 1, but that doesn't seem to apply to batch processing (?)
I suppose that you are using YARN to launch your Spark stream.
YARN queues your batches because it doesn't have enough resources to launch your stream/Spark batches simultaneously.
You can try to limit the resources used by YARN with:
--driver-memory -> memory for the driver
--executor-memory -> memory for each worker
--num-executors -> number of distinct YARN containers
--executor-cores -> number of threads you get inside each executor
For example:
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 800m \
--executor-memory 800m \
--num-executors 4 \
--class my.class \
myjar
We have a Spark 2.2 job written in Scala running on a YARN cluster that does the following:
Read several thousand small compressed parquet files (~15kb each) into two dataframes
Join the dataframes on one column
foldLeft over all columns to clean some data
Drop duplicates
Write result dataframe to parquet
The following configuration fails with java.lang.OutOfMemoryError: Java heap space:
--conf spark.yarn.am.memory=4g
--conf spark.executor.memory=20g
--conf spark.yarn.executor.memoryOverhead=1g
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.maxExecutors=5
--conf spark.executor.cores=4
--conf spark.network.timeout=2000
However, this job works reliably if we remove spark.executor.memory entirely, which gives each executor 1 GB of RAM.
This job also fails if we do any of the following:
Increase executors
Increase default parallelism or spark.sql.shuffle.partitions
Can anyone help me understand why more memory and more executors lead to failed jobs due to OutOfMemory errors?
Manually setting these parameters disables dynamic allocation. Try leaving dynamic allocation alone, since it is recommended for beginners. It is also useful for experimentation before you fine-tune cluster sizing in a PROD setting.
Throwing more memory/executors at Spark seems like a good idea, but in your case it probably caused extra shuffles and/or decreased HDFS I/O throughput. This article, while slightly dated and geared towards Cloudera users, explains how to tune parallelism by right-sizing executors.
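The right-sizing heuristic that article describes can be sketched roughly as follows. The function and the example numbers are illustrative assumptions, not values from the question: reserve a core and some memory per node for the OS and Hadoop daemons, keep executors at about 5 cores for healthy HDFS throughput, and leave a slice of each executor's memory for off-heap overhead:

```python
# Sketch of a Cloudera-style executor right-sizing heuristic
# (hypothetical helper; all numbers are illustrative assumptions):
# - leave 1 core and ~1 GB per node for OS / Hadoop daemons
# - cap executor cores at ~5 for good HDFS I/O throughput
# - reserve ~7% of executor memory for memoryOverhead
def size_executors(node_cores, node_mem_gb, target_exec_cores=5):
    usable_cores = node_cores - 1
    usable_mem_gb = node_mem_gb - 1
    execs_per_node = max(1, usable_cores // target_exec_cores)
    exec_mem_gb = int(usable_mem_gb / execs_per_node * 0.93)
    return execs_per_node, target_exec_cores, exec_mem_gb

print(size_executors(16, 64))  # e.g. 3 executors/node, 5 cores and ~19 GB each
```

The point is that executor size trades off against executor count: a few fat executors waste parallelism and stress the GC, while many tiny ones lose HDFS throughput and shuffle efficiency.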
I have a Spark standalone cluster with a master and 2 worker nodes, with 4 CPU cores on each worker: 8 cores in total across the workers.
When running the following via spark-submit (spark.default.parallelism is not set):
val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)
It returns 2 for the partition size.
When using spark-shell connected to the Spark standalone cluster, the same code returns the correct partition size of 8.
What can be the reason?
Thanks.
spark.default.parallelism defaults to the total number of cores across all machines. The parallelize API has no parent RDD to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try submitting your spark-submit with the same startup configs as you use for the spark-shell.
Pulled this from the documentation:
spark.default.parallelism
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine-grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
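The rules quoted above can be condensed into a small sketch (a hypothetical helper that only models the parallelize-with-no-parent-RDD case), which also shows how the two observed partition counts can arise:

```python
# Simplified model of spark.default.parallelism for parallelize()
# with no parent RDD, per the documentation quoted above
# (hypothetical helper, not a real Spark API):
def default_parallelism(manager, total_cores):
    if manager == "local":
        return total_cores          # cores on the local machine
    if manager == "mesos-fine-grained":
        return 8
    return max(total_cores, 2)      # standalone / YARN: total cores, min 2

print(default_parallelism("local", 2))       # 2 - e.g. spark-submit run locally on 2 cores
print(default_parallelism("standalone", 8))  # 8 - spark-shell on the 8-core cluster
```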