How to make spark streaming process multiple batches? - scala

Spark uses parallelism, however while testing my application and looking at the sparkUI, under the streaming tab I often notice under "active batches" that the status of one is "processing" and the rest are "queued". Is there a parameter I can configure to make Spark process multiple batches simultaneously?
Note: I am using spark.streaming.concurrentJobs greater than 1, but that doesn't seem to apply to batch processing (?)

I suppose that you are using YARN to launch your Spark stream.
YARN queues your batches because it does not have enough resources to run your stream/Spark batches simultaneously.
You can try to limit the resources YARN uses with:
--driver-memory -> memory for the driver
--executor-memory -> memory for each worker
--num-executors -> number of distinct YARN containers
--executor-cores -> number of threads inside each executor
For example:
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 800m \
--executor-memory 800m \
--num-executors 4 \
--class my.class \
myjar
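If YARN does have free resources, the other knob is spark.streaming.concurrentJobs itself. It is an undocumented property and only safe when batches are independent of each other, so treat the following as a sketch (the resource values and my.class/myjar placeholders are carried over from the example above):

```shell
# Sketch: ask Spark to run up to 2 streaming batches at once.
# spark.streaming.concurrentJobs is undocumented; use with care.
CONCURRENT_JOBS=2
SUBMIT_ARGS="--master yarn --deploy-mode cluster \
--conf spark.streaming.concurrentJobs=${CONCURRENT_JOBS} \
--num-executors 4 --executor-cores 2"
# Echo the full command so it can be inspected before running:
echo "spark-submit ${SUBMIT_ARGS} --class my.class myjar"
```

Even with this set, batches will still show as "queued" if YARN cannot grant the requested executors.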

Related

Memory Allocation In Spark-Scala Application:

I am executing a Spark/Scala job using the spark-submit command. I have written my code in Spark SQL, where I join 2 tables and load the result into a 3rd Hive table.
The code works fine, but sometimes I run into issues such as OutOfMemoryError (Java heap space) and timeout errors.
So I want to control my job manually by passing the number of executors, cores, and memory. When I used 16 executors, 1 core, and 20 GB of executor memory, my Spark application got stuck.
Can someone please suggest how I should tune my Spark application by providing the correct parameters? Are there any other Hive- or Spark-specific parameters I can use for faster execution?
Below is the configuration of my cluster.
Number of Nodes: 5
Number of Cores per Node: 6
RAM per Node: 125 GB
Spark-submit command:
spark-submit --class org.apache.spark.examples.sparksc \
--master yarn-client \
--num-executors 16 \
--executor-memory 20g \
--executor-cores 1 \
examples/jars/spark-examples.jar
It depends on the volume of your data; you can make the parameters dynamic. This link has a very good explanation:
How to tune spark executor number, cores and executor memory?
You can enable spark.shuffle.service.enabled and set spark.sql.shuffle.partitions=400, hive.exec.compress.intermediate=true, hive.exec.reducers.bytes.per.reducer=536870912, hive.exec.compress.output=true, hive.output.codec=snappy, and mapred.output.compression.type=BLOCK.
If your data is larger than 700 MB you can enable the spark.speculation property.
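The sizing heuristic from the linked question can be worked through for this cluster (5 nodes, 6 cores and 125 GB per node). A rough sketch, assuming one core and 1 GB per node are reserved for the OS and Hadoop daemons and ~7% of executor memory goes to YARN overhead (the numbers are illustrative, not tuned values):

```shell
NODES=5
CORES_PER_NODE=6
MEM_PER_NODE_GB=125

USABLE_CORES=$(( CORES_PER_NODE - 1 ))      # reserve 1 core/node for OS + daemons
CORES_PER_EXECUTOR=5                        # <= 5 cores keeps HDFS throughput healthy
EXECUTORS_PER_NODE=$(( USABLE_CORES / CORES_PER_EXECUTOR ))
NUM_EXECUTORS=$(( EXECUTORS_PER_NODE * NODES - 1 ))   # leave one slot for the AM

MEM_PER_EXECUTOR_GB=$(( (MEM_PER_NODE_GB - 1) / EXECUTORS_PER_NODE ))
EXECUTOR_MEM_GB=$(( MEM_PER_EXECUTOR_GB * 93 / 100 ))  # ~7% goes to memoryOverhead

echo "--num-executors $NUM_EXECUTORS --executor-cores $CORES_PER_EXECUTOR --executor-memory ${EXECUTOR_MEM_GB}g"
```

For this cluster that works out to 4 executors with 5 cores and roughly 115 GB each, which is a very different shape from 16 x 1-core x 20 GB, where each task gets only a single core.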

Accessing spark cluster mode in spark submit

I am trying to run my Spark Scala code using spark-submit. I want to access the Spark cluster for this purpose. So, what should I use for the master in the SparkContext? I have used it like this:
val spark = SparkSession.builder()
.master("spark://prod5:7077")
.appName("MyApp")
.getOrCreate()
But it doesn't seem to work. What should I use as the master for using spark cluster?
If you are trying to submit your job from an IDE, just make sure that "prod5" is the master host, and try changing the port to 6066 (the default standalone REST submission port).
From the official documentation:
spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> ... # other options <application-jar> [application-arguments]
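Filled in with the values from the question, a submission might look like the following sketch (MyApp and myApp.jar are placeholders; 7077 is the standard standalone submission port, 6066 the REST one):

```shell
# Sketch: submit to the standalone master on prod5 (class/jar names are placeholders).
MASTER_URL="spark://prod5:7077"
CMD="spark-submit --class MyApp --master ${MASTER_URL} --deploy-mode client myApp.jar"
echo "$CMD"
```

A common pattern is to leave .master(...) out of the code entirely and pass it via spark-submit, since a master hard-coded in the application takes precedence over the command-line flag and pins the jar to one cluster.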

Spark 2.2 fails with more memory or workers, succeeds with very little memory and few workers

We have a Spark 2.2 job written in Scala running on a YARN cluster that does the following:
Read several thousand small compressed Parquet files (~15 KB each) into two dataframes
Join the dataframes on one column
Fold left over all columns to clean some data
Drop duplicates
Write the result dataframe to Parquet
The following configuration fails with java.lang.OutOfMemoryError: Java heap space:
--conf spark.yarn.am.memory=4g
--conf spark.executor.memory=20g
--conf spark.yarn.executor.memoryOverhead=1g
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.maxExecutors=5
--conf spark.executor.cores=4
--conf spark.network.timeout=2000
However, this job works reliably if we remove spark.executor.memory entirely, which gives each executor 1 GB of RAM.
This job also fails if we do any of the following:
Increase executors
Increase default parallelism or spark.sql.shuffle.partitions
Can anyone help me understand why more memory and more executors lead to failed jobs due to OutOfMemory?
Manually setting these parameters disables dynamic allocation. Try leaving it alone, since it is recommended for beginners; it is also useful for experimentation before you can fine-tune cluster size in a PROD setting.
Throwing more memory/executors at Spark seems like a good idea, but in your case it probably caused extra shuffles and/or decreased HDFS I/O throughput. This article, while slightly dated and geared towards Cloudera users, explains how to tune parallelism by right-sizing executors.
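Based on the observation that the job succeeds with 1 GB executors, one variant worth trying is to keep dynamic allocation on and simply not set spark.executor.memory. A sketch (the cap of 5 executors is carried over from the question; my.class/myjar are placeholders, and all values are illustrative rather than recommendations):

```shell
# Sketch: keep dynamic allocation, drop the explicit 20g executor memory
# so executors fall back to the default 1g.
DYN_CONFS="--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=5 \
--conf spark.executor.cores=4"
echo "spark-submit --master yarn ${DYN_CONFS} --class my.class myjar"
```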

spark-submit gets idle in local mode

I am trying to test a jar using spark-submit (Spark 1.6.0) into a Cloudera cluster, which has Kerberos enabled.
The fact is that if I launch this command:
spark-submit --master local --class myDriver myApp.jar -c myConfig.conf
In local or local[*], the process stops after a couple of stages. However, if I use yarn-client or yarn-cluster master modes the process ends correctly. The process reads and writes some files into HDFS.
Furthermore, these traces appear:
17/07/05 16:12:51 WARN spark.SparkContext: Requesting executors is only supported in coarse-grained mode
17/07/05 16:12:51 WARN spark.ExecutorAllocationManager: Unable to reach the cluster manager to request 1 total executors!
It is surely a matter of configuration, but I don't know what is happening. Any ideas? What configuration options should I change?
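Those two warnings are what local mode prints when dynamic allocation is configured: there is no cluster manager to request executors from, so the request goes nowhere. If spark.dynamicAllocation.enabled=true is set cluster-wide in spark-defaults.conf (a guess, given the Cloudera cluster), overriding it for local runs is one thing to try; a sketch, with the jar and config names from the question:

```shell
# Sketch: explicitly disable dynamic allocation for a local run.
CMD="spark-submit --master local[*] \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--class myDriver myApp.jar -c myConfig.conf"
echo "$CMD"
```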

Issue running Cosine Similarity on 50k records

I am using the Spark 1.6 cosine similarity algorithm from Spark MLlib.
Input: 50k documents' text with ids in a dataframe.
Processing:
Tokenized the texts
Removed stop words
Generated vectors (size=300) using Word2Vec
Generated a RowMatrix
Transposed it
Used the columnSimilarities method with threshold 0.1 (also tried higher values)
The output is an n x n matrix.
I am using this spark-submit command:
spark-submit --master yarn --conf "spark.kryoserializer.buffer.max=256m" --num-executors 60 --driver-memory 10G --executor-memory 15G --executor-cores 5 --conf "spark.shuffle.service.enabled=true" --conf "spark.yarn.executor.memoryOverhead=2048" noname.jar xyzclass
I am also using 400 partitions.
But I am getting out-of-memory issues. I have tried different combinations of partitions and numbers of executors but failed to run it successfully. However, I am able to run it successfully on 7k records with vector size 50 in less than 7 minutes. Any suggestions on how I can make it run on 50k records?
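One way to see why this runs out of memory: columnSimilarities returns a sparse matrix, but with a low threshold the worst case approaches a dense 50k x 50k result. A back-of-the-envelope sketch (8 bytes per double, ignoring indices and sparsity):

```shell
DOCS=50000
BYTES=$(( DOCS * DOCS * 8 ))            # worst-case dense similarity matrix
GIB=$(( BYTES / 1024 / 1024 / 1024 ))
echo "worst-case dense output: ${GIB} GiB"   # ~18 GiB for 50k docs
```

At 7k documents the same bound is under half a GiB, which is consistent with the job succeeding at that scale. Raising the threshold shrinks the output itself, not just the computation, so it is usually the first lever to try.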