I recently discovered that adding parallel computing (e.g. using parallel collections) inside UDFs increases performance considerably, even when running Spark in local[1] mode or using YARN with 1 executor and 1 core.
E.g. in local[1] mode the Spark job consumes as much CPU as possible (i.e. 800% if I have 8 cores, measured using top).
This seems strange, because I thought Spark (or YARN) limits the CPU usage per Spark application?
So I wonder why that is, and whether it's recommended to use parallel processing/multi-threading in Spark, or whether I should stick to Spark's own parallelization pattern.
Here is an example to play with (times measured in YARN client mode with 1 instance and 1 core):
case class MyRow(id:Int,data:Seq[Double])
// create dataFrame
val rows = 10
val points = 10000
import scala.util.Random.nextDouble
val data = (1 to rows).map(i => MyRow(i, Stream.continually(nextDouble()).take(points)))
val df = sc.parallelize(data).toDF().repartition($"id").cache()
df.show() // trigger computation and caching
// some expensive dummy-computation for each array-element
val expensive = (d:Double) => (1 to 10000).foldLeft(0.0){case(a,b) => a*b}*d
val serialUDF = udf((in:Seq[Double]) => in.map{expensive}.sum)
val parallelUDF = udf((in:Seq[Double]) => in.par.map{expensive}.sum)
df.withColumn("sum",serialUDF($"data")).show() // takes ~ 10 seconds
df.withColumn("sum",parallelUDF($"data")).show() // takes ~ 2.5 seconds
Spark does not limit CPU usage directly; instead it limits the number of concurrent threads it creates. So for local[1] it basically runs one task at a time. When you do in.par.map{expensive} you create threads that Spark does not manage, and which are therefore not covered by this limit. In other words, you told Spark to limit itself to a single thread and then created other threads without Spark knowing it.
In general it is not a good idea to spawn your own threads inside a Spark operation. Instead, tell Spark how many threads it may work with and make sure you have enough partitions for parallelism, as sketched below.
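For illustration, here is a minimal sketch of the Spark-managed alternative this answer recommends. It assumes the df, the id/data columns and the expensive function from the question, plus a master with more than one thread (e.g. local[8]): explode the array so each element becomes its own row and let Spark's tasks carry the parallelism.
import org.apache.spark.sql.functions.{explode, sum, udf}
// Per-element UDF reusing the question's expensive dummy computation.
val expensiveUDF = udf((d: Double) => (1 to 10000).foldLeft(0.0) { case (a, b) => a * b } * d)
val result = df
  .select($"id", explode($"data").as("d"))   // one row per array element
  .repartition(8)                            // enough partitions for Spark's own parallelism
  .withColumn("v", expensiveUDF($"d"))
  .groupBy($"id")
  .agg(sum($"v").as("sum"))
result.show()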
Spark's CPU usage is configured through the master setting, for example:
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
Changing the master to local[*] will utilize all of your CPU cores.
Related
I am new to the Scala/Spark world and have recently started working on a task where it reads some data, processes it and saves it to S3. I have read several topics/questions on Stack Overflow regarding repartition/coalesce performance and the optimal number of partitions (like this one). Assuming that I have the right number of partitions, my question is: would it be a good idea to repartition an RDD while converting it to a DataFrame? Here is what my code looks like at the moment:
val dataRdd = dataDf.rdd.repartition(partitions)
.map(x => ThreadedConcurrentContext.executeAsync(myFunction(x)))
.mapPartitions( it => ThreadedConcurrentContext.awaitSliding(it = it, batchSize = asyncThreadsPerTask, timeout = Duration(3600000, "millis")))
val finalDf = dataRdd
.filter(tpl => tpl._3 != "ERROR")
.toDF()
Here is what I'm planning to do (repartition data after filter):
val finalDf = dataRdd
.filter(tpl => tpl._3 != "ERROR")
.repartition(partitions)
.toDF()
My question is: would it be a good idea to do so? Is there a performance gain here?
Note1: Filter usually removes 10% of original data.
Note2: Here is the first part of spark-submit command that I use to run the above code:
spark-submit --master yarn --deploy-mode client --num-executors 4 --executor-cores 4 --executor-memory 2G --driver-cores 4 --driver-memory 2G
The answer to your problem depends on the size of your dataRdd, number of partitions, executor-cores and processing power of your HDFS cluster.
With this in mind, you should run some tests on your cluster with different values of partitions and removing repartition altogether to fine tune it and find an accurate answer.
Example: if you specify partitions=8 and executor-cores=4, then you will be fully utilizing all your cores; however, if the size of your dataRdd is only 1 GB, there is no advantage in repartitioning, because it triggers a shuffle, which incurs a performance cost. In addition, if the processing power of your HDFS cluster is low or it is under heavy load, there is an additional performance overhead due to that.
If you do have sufficient resources available on your HDFS cluster and you have a big (say over 100 GB) dataRdd, then a repartition should help improve performance with config values like the ones in the example above.
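As a concrete variant to benchmark alongside the plan above (a sketch only, reusing dataRdd and partitions from the question): since the filter removes only ~10% of the data, coalesce can reduce the partition count without the full shuffle that repartition triggers.
// Sketch: coalesce merges existing partitions instead of reshuffling every row.
val finalDf = dataRdd
  .filter(tpl => tpl._3 != "ERROR")
  .coalesce(partitions)   // no full shuffle, unlike .repartition(partitions)
  .toDF()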
So the question is in the subject. I think I don't correctly understand how repartition works. In my mind, when I say somedataset.repartition(600), I expect all data to be partitioned into equal-sized parts across the workers (let's say 60 workers).
For example: I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are 2 GB in size and the other 80% are about 1 MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Then I want to convert the raw data to my intermediate object, filter irrelevant records, convert to the final object (with additional attributes), and then partition by some columns and write to Parquet. In my mind it seems reasonable to balance the data (40000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not repartition, everything is fine. What do I misunderstand about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep the following in mind (it appears in several sources):
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you want the job to complete, increase the driver and executor memory (and the YARN memory overhead the error message points at).
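For example, a sketch of the memory-related settings the error message refers to (the values here are placeholders to tune, and spark.yarn.executor.memoryOverhead was renamed spark.executor.memoryOverhead in Spark 2.3+):
import org.apache.spark.SparkConf
// Placeholder values; tune to your cluster.
val conf = new SparkConf()
  .set("spark.executor.memory", "32g")                 // executor heap
  .set("spark.yarn.executor.memoryOverhead", "4096")   // off-heap overhead in MB (the limit from the error)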
I want to run multiple spark SQL parallel in a spark cluster, so that I can utilize the complete resource cluster wide. I'm using sqlContext.sql(query).
I saw some sample code here like follows,
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)

val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map { query =>
  Future {
    // spark stuff here, e.g. sqlContext.sql(query)
    0
  }(ec)
}
val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for results
Await.result(allDone, scala.concurrent.duration.Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit
As I understand it, the ExecutionContext computes the available cores on the machine (using a ForkJoinPool) and parallelizes accordingly. But what happens if we consider a Spark cluster rather than a single machine, and how can this guarantee complete cluster-wide resource utilization?
E.g.: if I have a 10-node cluster with 4 cores each, how can the above code guarantee that all 40 cores will be utilized?
EDIT:
Let's say there are 2 SQL queries to be executed; we have 2 ways to do this:
submit the queries sequentially, so that the second query completes only after the first has finished (because sqlContext.sql(query) is a synchronous call)
submit both queries in parallel using Futures, so that both queries execute independently and in parallel on the cluster,
assuming there are enough resources (in both cases).
I think the second one is better because it uses the maximum resources available in the cluster, and if the first query has fully utilized the resources, the scheduler will wait for the completion of the job (depending on the policy), which is fair in this case.
But as user9613318 mentioned, 'increasing pool size will saturate the driver'.
How, then, can I efficiently control the threads for better resource utilization?
Parallelism will have a minimal impact here, and additional cluster resources don't really affect the approach. Futures (or threads) are used not to parallelize execution, but to avoid blocking execution. Increasing the pool size can only saturate the driver.
What you really should be looking at is Spark in-application scheduling pools and tuning of the number of partitions for narrow (How to change partition size in Spark SQL, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?) and wide (What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?) transformations.
If jobs are completely independent (the code structure suggests that) it could be preferred to submit each one separately, with its own set of allocated resources, and configure cluster scheduling pools accordingly.
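A minimal sketch of that approach, assuming the sc/sqlContext and the Future-based submission from the question (the pool name "sqlPool" and the allocation file path are illustrative):
import org.apache.spark.SparkConf
// Enable FAIR scheduling so concurrently submitted jobs share executors.
val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // optional pool definitions
  .set("spark.sql.shuffle.partitions", "200")                           // tune for wide transformations

// Inside each Future, before running the query:
sc.setLocalProperty("spark.scheduler.pool", "sqlPool")
val df = sqlContext.sql(query)
// ... consume df ...
sc.setLocalProperty("spark.scheduler.pool", null) // reset the pool for this thread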
I'm trying to write to Cassandra through Spark.
I have 6 nodes in my cluster, and on it I created the keyspace into which I want to write data:
CREATE KEYSPACE traffic WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
When I'm trying to write from Spark, I'm getting this kind of error:
16/08/17 16:14:57 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBatchStatement#7409fd2d
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)
This is a snippet of code showing exactly what I am doing:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types.{StructType, StructField, DateType, IntegerType};
object ff {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.output.consistency.level", "ONE") // "ONE" is a consistency level, not a host; presumably the intended setting
      .setMaster("local[4]")
      .setAppName("ff")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // use first line of all files as header
      .option("inferSchema", "true")
      .load("test.csv")
    df.registerTempTable("ff_table")
    //df.printSchema()
    df.count
    time {
      df.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "ff_table", "keyspace" -> "traffic"))
        .save()
    }

    def time[A](f: => A) = {
      val s = System.nanoTime
      val ret = f
      println("time: " + (System.nanoTime - s) / 1e6 + "ms")
      ret
    }
  }
}
Also, if I run nodetool describecluster I get these results:
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
bf6c3ae7-5c8b-3e5d-9794-8e34bee9278f: [127.0.0.1, 127.0.0.2, 127.0.0.3, 127.0.0.4, 127.0.0.5, 127.0.0.6]
I tried to insert one row via the CLI with replication_factor: 2 and it works, so every node can see the others.
Why can't Spark insert anything, then? Why can't the nodes see each other when I try to insert data from Spark? Does anyone have an idea?
It looks like you are running 6 nodes on one machine via loopback. This means there is a rather likely chance that the resources of this machine are being oversubscribed. The various Cassandra instances are most likely taking turns or swapping, which causes them to go missing under heavy load. Increasing the replication factor increases the chance that a valid target is up, but will increase the load even further.
At its core, C* requires several different resources from your system; if any of these becomes a bottleneck, there is a chance that a node will not respond to gossip in sufficient time.
These resources are
RAM - How much memory the JVM is able to acquire; this is affected by OS swap as well. This means that if you allocate a large JVM but the OS swaps it to disk, you are likely to see massive performance issues. With multiple nodes on the same machine you need to make sure there is ample RAM for the JVM of every node you are starting. In addition, if any one instance's JVM gets too close to full, you will enter GC and possibly a GC storm, which will basically lock up that instance. Many of these details will be visible in the system.log.
CPU - Without exclusive access to at least one CPU, you are almost guaranteed to have some important C* threads scheduled with a long delay between them. This can cause gossip threads to be ignored and gossip to fail. This will give some nodes a view of a cluster with failed machines and cause unavailable errors.
DISK - Every Cassandra instance maintains its own commit log and data files. The commit log flushes every 10 seconds, and if you have multiple instances and only 1 hard drive, the flushes of the commit log and the normal memtables can easily block one another. This is further compounded by compaction, which requires another large amount of IO.
NETWORK - Although this isn't an issue with multiple nodes on the same machine.
In summary,
It is important to make sure the resources allocated to your C* instances are small enough that no instance will overrun the disk/RAM/CPU of another. If you don't, you will end up with a cluster whose communication fails under load because one of the above resources is bottlenecked. This doesn't mean it's impossible to run multiple nodes on the same machine, but it does mean you must take care in provisioning. You can also attempt to lessen the load by throttling your write speed, which gives the nodes less of a chance of clobbering one another; one way to do that from the Spark side is sketched below.
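As a sketch of that throttling idea (option names follow the spark-cassandra-connector write-tuning settings; verify them against your connector version and treat the values as examples only):
import org.apache.spark.SparkConf
// Throttle the connector's write path so the co-located C* nodes are not overwhelmed.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.concurrent.writes", "1")       // fewer in-flight batches per task
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")   // cap write throughput
  .set("spark.cassandra.output.batch.size.bytes", "4096")     // smaller batches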
I am trying to write a (wordcount) program to simulate a use case where the network traffic would be very high because of Spark's shuffle process. I have a 3-node Apache Spark cluster (2 cores each, 8 GB RAM each) configured with 1 master and 2 workers. I processed a 5 GB file for wordcount and saw the network traffic between the 2 worker nodes rise to 1 GB over 10-15 minutes. I am looking for a way to raise the traffic between nodes to at least 1 GB within 30-60 seconds. The inefficiency of the program or best practices don't matter in my current use case, as I am just trying to simulate traffic.
This is the program I have written:
val sc = new SparkContext(new SparkConf().setAppName("LetterCount-GroupBy-Map"))
val x = sc.textFile(args(0)).flatMap(t => t.split(" "))
val y = x.map(w => (w.charAt(0),w))
val z = y.groupByKey().mapValues(n => n.size)
z.collect().foreach(println)
More shuffled data can be generated by doing operations that do not combine data well on each node. For example, in the code you have written, the groupBy will combine the common keys (or do the groupBy locally). Instead, choose keys with high cardinality (in the above example there are only 26). In addition, the size of the values after the map operation can be increased; in your case it's the text line. You might want to put a very long string of values on each key.
Apart from this, if you take 2 different files/tables and apply a join on some key, it will also cause shuffling.
Note: I am assuming the content does not matter; you are only interested in generating heavily shuffled data. A sketch along these lines follows below.
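Here is a sketch of those suggestions (the random payload and its length are illustrative), modifying the question's program so the key has high cardinality and each record carries a large value:
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random
// Use the whole word as the key (high cardinality instead of 26 first letters)
// and attach a long random string to each record, so little can be combined
// locally and most bytes cross the network during the shuffle.
val sc = new SparkContext(new SparkConf().setAppName("LetterCount-HeavyShuffle"))
val x = sc.textFile(args(0)).flatMap(_.split(" "))
val y = x.map(w => (w, Random.nextString(1000)))  // ~2 KB payload per record
val z = y.groupByKey().mapValues(_.size)
println(z.count())  // avoid collecting large results to the driver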