Parallelization in Spark Scala

We are using an Azure Databricks cluster (E64s_v3: 432 GB RAM, 64 cores; 12 workers) to run Scala code. We are trying to parallelize the writing of data across worker nodes so that all cores are kept busy, using "new scala.concurrent.forkjoin.ForkJoinPool(8)" to drive the writes concurrently. We have about 2 billion records that can be segregated by 50 Ids, and our goal is to have one worker handle each Id with all of its cores in use, so that 12 writes run in parallel at any given time. Below is the piece of code we have used:
sTpids.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(8))
sTpids.foreach(p => {
  trnMap.write.mode("overwrite").parquet(partitionPath)
})
From our observation, the writes are not running in parallel; instead, all the workers work on writing a single Id at a given time, and only 1 core per worker is being used. Can someone help me understand why this is happening?
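For reference, here is a minimal sketch of the pattern the code above appears to be aiming at; the column name "pid" and the output root basePath are assumptions, not taken from the original snippet. Note that tasksupport only exists on a parallel collection (obtained via .par), and that each DataFrameWriter.parquet call is itself a distributed Spark job whose tasks use one core each by default, so the driver-side pool of 8 only controls how many of those jobs are submitted concurrently.
import org.apache.spark.sql.functions.col
import scala.collection.parallel.ForkJoinTaskSupport

// sTpids: the 50 distinct Ids, turned into a parallel collection so that up to
// 8 write jobs are submitted to the cluster at the same time.
val parIds = sTpids.par
parIds.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(8))

parIds.foreach { id =>
  trnMap
    .filter(col("pid") === id)       // "pid" is an assumed Id column name
    .write
    .mode("overwrite")
    .parquet(s"$basePath/pid=$id")   // basePath is an assumed output root
}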

Related

Unusually long time in writing parquet files to Google Cloud

I am using a PySpark DataFrame on a Dataproc cluster to generate features and write parquet files as output to Google Cloud Storage. There are two problems I am facing:
I have provided 22 executors, 3 cores per executor, and ~13 GB RAM per executor. However, only 10 executors are launched when I submit the jobs. The Dataproc cluster contains 10 worker nodes with 8 cores and 30 GB RAM per node.
When I write the individual feature files and record the total time, it is significantly lower than the time taken to write all the features together in a single file. I have tried changing the partitions, but that doesn't help either.
This is how I write the parquet file:
df.select([feature_lst]).write.parquet(gcs_path+outfile,mode='overwrite')
data size - 20M+ records, 30+ numerical features
Spark UI screenshot (not reproduced here): the current stage, which writes all features together, takes significantly longer than all of the previous stages combined.
If someone can provide any insight into the above two issues I will be grateful.
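One knob that directly affects the write stage is the number of partitions at write time. Below is a rough sketch of that idea in Scala (the question itself is PySpark, but the rest of this page uses Scala); df, featureCols, gcsPath, and outFile are placeholder names, and the partition count is only an example.
import org.apache.spark.sql.functions.col

// Hypothetical example: select the feature columns, then repartition so the write
// stage has roughly (granted executors) x (cores per executor) tasks.
val featureCols: Seq[String] = Seq("f1", "f2")   // placeholder feature list
val out = df.select(featureCols.map(col): _*)
  .repartition(30)                               // e.g. 10 executors x 3 cores each
out.write.mode("overwrite").parquet(gcsPath + outFile)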

How to guarantee effective cluster resource utilization by Futures in spark

I want to run multiple Spark SQL queries in parallel in a Spark cluster, so that I can utilize the complete resources cluster-wide. I'm using sqlContext.sql(query).
I saw some sample code, like the following:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map(query => {
  Future {
    // spark stuff here
    0
  }(ec)
})
val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for results
Await.result(allDone, scala.concurrent.duration.Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit
As I understand it, the ExecutionContext computes the available cores on the machine (using a ForkJoinPool) and sets the parallelism accordingly. But what happens when we consider the Spark cluster rather than a single machine, and how can this guarantee complete utilization of the cluster's resources?
E.g., if I have a 10-node cluster with 4 cores each, how can the above code guarantee that all 40 cores will be utilized?
EDIT:
Let's say there are 2 SQL queries to be executed; we have 2 ways to do this:
Submit the queries sequentially, so that the second query completes only after the execution of the first (because sqlContext.sql(query) is a synchronous call).
Submit both queries in parallel using Futures, so that both queries execute independently and in parallel in the cluster,
assuming there are enough resources (in both cases).
I think the second option is better because it uses the maximum resources available in the cluster, and if the first query has fully utilized the resources, the scheduler will wait for the completion of that job (depending on the policy), which is fair in this case.
But as user9613318 mentioned, 'increasing pool size will saturate the driver'.
So how can I efficiently control the threads for better resource utilization?
Parallelism will have a minimal impact here, and additional cluster resources don't really affect the approach. Futures (or threads) are used not to parallelize execution, but to avoid blocking execution. Increasing the pool size can only saturate the driver.
What you really should be looking at is Spark in-application scheduling pools and tuning of the number of partitions for narrow (How to change partition size in Spark SQL, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?) and wide (What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?) transformations.
If the jobs are completely independent (the code structure suggests that), it might be preferable to submit each one separately, with its own set of allocated resources, and to configure cluster scheduling pools accordingly.
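For illustration, here is a minimal sketch of the in-application fair scheduling mentioned above, using the SparkSession API; the pool name and query are made up, and pool weights would normally be defined in a fairscheduler.xml allocation file.
import org.apache.spark.sql.SparkSession

// Enable the FAIR scheduler for the application, then tag jobs submitted from
// different threads with different pools so they share executors fairly.
val spark = SparkSession.builder()
  .appName("parallel-sql")
  .config("spark.scheduler.mode", "FAIR")
  // .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional pool definitions
  .getOrCreate()

// Inside the thread (or Future) that runs a given query:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "queryPool1") // made-up pool name
spark.sql("SELECT 1 AS x").write.mode("overwrite").parquet("/tmp/out1")
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)         // reset to the default pool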

Why is each Spark Task not utilizing all allocated cores?

Assume I have 36 cores per executor, one executor per node, and 3 nodes each with 48 cores available. The basic gist of what I've noticed is this: when I set each task to use 1 core (the default), CPU utilization across the workers is about 70% and 36 tasks execute simultaneously per executor (as I would have expected). However, when I change my configuration to 6 cores per task (--conf spark.task.cpus=6), the executors drop to 6 tasks at a time (as expected), but CPU utilization also drops below 10% (unexpected). I would have assumed that Spark would know how to parallelize the workload over the 6 cores.
The important implementation detail is that I am running a UDF on a column of a DataFrame and appending the results as a new column on that DataFrame. This UDF uses a @transient object that provides the machine learning algorithm I'm using. The UDF is not part of an aggregation or coalesce operation; it is just a map operation over the column, implemented like so:
def myUdf = udf { ... }
val resultSet = myUdf(dataFrame.col("originalCol"))
val dataFrameWithResults = dataFrame.withColumn("originalColMetric", resultSet)
I would have expected Spark to execute 6 instances of myUdf at once, processing 6 records at a time, one per core, but this doesn't appear to be the case. Is there a way to fix this (without submitting a PR to the Spark project), or at least, can someone explain why this might be happening?
Anticipating the question, I'm experimenting with increasing the number of cores per task in order to reduce the amount of RAM required per executor. Executing too many tasks at once exponentially increases the RAM usage, in this instance.
spark.task.cpus is the number of cores to allocate for each task. It is used to give multiple cores to a single task in the case where the user code is itself multi-threaded. If your UDF doesn't use multiple threads (doesn't spawn multiple threads within a single function call), the extra cores are simply wasted.
To "process 6 records at a time", allocate 6 cores with spark.task.cpus set to 1. If you want to limit the number of tasks per node, then reduce the number of cores offered by each node.
Essentially, Spark determines on its own how to map a UDF over multiple records concurrently by splitting the records among tasks (according to the partitioning) and deciding how many simultaneous tasks each executor can handle. However, Spark cannot automatically split the work across cores within a single task. To utilize multiple cores per task, the UDF code, which is executed over one record at a time (sequentially) within a task, would need to be written to parallelize the computation over a single record itself.
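To make that distinction concrete, here is an illustrative sketch of the kind of UDF that would actually benefit from spark.task.cpus > 1, i.e. one whose body spawns its own threads for a single input record. The chunking, the per-chunk work, and the thread count are made-up stand-ins, not the asker's algorithm.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.functions.udf

// Illustrative only: a UDF whose body is itself multi-threaded, which is the
// situation spark.task.cpus > 1 is meant for. In real code the thread pool
// would be shared (e.g. via a @transient lazy val) rather than created per row.
val multiThreadedUdf = udf { (text: String) =>
  val pool = Executors.newFixedThreadPool(6)                 // matches spark.task.cpus=6
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val chunks = text.grouped(math.max(1, text.length / 6)).toSeq
    val partials = chunks.map(c => Future(c.count(_.isLetter))) // placeholder per-chunk work
    Await.result(Future.sequence(partials), Duration.Inf).sum
  } finally {
    pool.shutdown()
  }
}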

Increase network traffic between worker nodes through apache spark's shuffle process

I am trying to write a (word count) program to simulate a use case where network traffic would be very high because of Spark's shuffle process. I have a 3-node Apache Spark cluster (2 cores and 8 GB RAM each) configured with 1 master and 2 workers. I processed a 5 GB file for the word count and saw the network traffic between the 2 worker nodes rise to about 1 GB in 10-15 minutes. I am looking for a way to increase the traffic between nodes to at least 1 GB within 30-60 seconds. The inefficiency of the program or best practices doesn't matter in my current use case, as I am just trying to simulate traffic.
This is the program I have written:
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LetterCount-GroupBy-Map"))
val x = sc.textFile(args(0)).flatMap(t => t.split(" "))
val y = x.map(w => (w.charAt(0), w))          // key by first letter: only 26 distinct keys
val z = y.groupByKey().mapValues(n => n.size) // shuffle, then count per key
z.collect().foreach(println)
More shuffled data can be generated by performing operations that do not combine data well on each node. For example, in the code you have written, the groupBy will combine the common keys (or group locally). Instead, choose keys with high cardinality (in the above example there are only 26). In addition, the size of the values produced by the map operation can be increased; in your case, it's the text line. You might want to attach a very long string of values to each key.
Apart from this, if you take 2 different files/tables and join them on some column, that will also cause shuffling.
Note: I am assuming the contents do not matter and you are only interested in generating heavily shuffled data.
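As one possible illustration of that advice (the padding length, key choice, and partition count below are arbitrary), keying by the full word and inflating each value pushes far more data through the shuffle:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant that maximizes shuffled bytes: high-cardinality keys
// (the full word) and artificially large values (padding each record).
val sc = new SparkContext(new SparkConf().setAppName("HeavyShuffle"))
val words = sc.textFile(args(0)).flatMap(_.split(" "))
val padded = words.map(w => (w, w + ("x" * 1000)))    // inflate each value to force network traffic
val grouped = padded.groupByKey(numPartitions = 200)  // many partitions -> wide shuffle
println(grouped.count())                              // trigger the job without collecting to the driver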

running a single job across multiple workers in apache spark

I am trying to understand how Spark splits a single job (a Scala file built using sbt package, with the jar run via the spark-submit command) across multiple workers.
For example: I have two workers (512 MB memory each). I submit a job and it gets allocated to one worker only (if the driver memory is less than the worker memory). If the driver memory is more than the worker memory, the job doesn't get allocated to any worker (even though the combined memory of both workers is higher than the driver memory) and goes into the submitted state. The job then goes to the running state only when a worker with the required memory becomes available in the cluster.
I want to know whether one job can be split across multiple workers and run in parallel. If so, can anyone walk me through the specific steps involved?
Note: the Scala program requires a lot of JVM memory since I will be using a large array buffer, hence the attempt to split the job across multiple workers.
Thanks in advance!!
Please check whether the array you are using has been parallelized (e.g., turned into an RDD with sc.parallelize). Then, when you perform an action on it, the work should run in parallel across the nodes.
Check out this page for reference: http://spark.apache.org/docs/0.9.1/scala-programming-guide.html
Make sure your RDD has more than one partition (rdd.partitions.size). Make sure you have more than one executor connected to the driver (http://localhost:4040/executors/).
If both of these are fulfilled, your job should run on multiple executors in parallel. If not, please include code and logs in your question.
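A minimal sketch of those two checks (the buffer contents, partition count, and action below are just placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("parallelism-check"))

// Distribute the large in-memory buffer as an RDD with an explicit partition count,
// so that its elements are spread across executors rather than held in one JVM.
val bigBuffer = (1 to 1000000).toArray            // placeholder for the real array buffer
val rdd = sc.parallelize(bigBuffer, numSlices = 8)

println(rdd.partitions.size)                      // should be greater than 1
println(rdd.map(_.toLong).sum())                  // an action whose tasks run across executors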