How do I determine the optimal number of threads in Spark application? - scala

In my Scala/Spark application, I am trying correctly to use multiprocessing. As you can see from the code below, the number of threads is equal to the number of elements in the storage array. I tested the current code and it works. But as you can see there are only 2 elements in the storage array. It seems to me that if there are a large number of elements in the array, problems will occur. In my case, I don't know how many elements there will be in the array in the future. Perhaps I should limit the number of threads and start new threads only when the previous ones are processed.
Question: How do I determine the optimal number of threads?
Main.app:
import org.apache.spark.sql.DataFrame
import utils.CustomThread
object MainApp {
def main(args: Array[String]): Unit = {
// Create the main DataFrame with all information.
var baseDF: DataFrame = spark.read.option("delimiter", "|").csv("/path_to_the_files/")
// Cache the main DataFrame.
baseDF.persist(StorageLevel.MEMORY_AND_DISK)
// The first time DataFrame is computed in an action, it will be kept in memory on the nodes.
baseDF.count()
// Create arrays with the different identifiers
var array1 = Array("6fefc487-bd57-4fa2-808a-3845703b83d0", "9baba76b-07c2-48ec-a153-6cfb8b138ecf")
var array2 = Array("ab654369-77f5-478c-94e5-ee2755ae8571", "3b43e0a6-deba-4919-a2cc-9d450e28e0fe")
var storage = Array(array1, array2)
// Check if the main DataFrame is empty or not.
if (baseDF.head(1).nonEmpty) {
for (item <- storage) {
val thread = new Thread(new CustomThread(baseDF, item))
thread.start()
}
}
}
}
CustomThread.scala:
package utils
import org.apache.spark.sql.DataFrame
class CustomThread(baseDF: DataFrame, item: Array[String]) extends Runnable {
override def run(): Unit = {
val df = baseDF.filter(col("col1").isin(item:_*))
println("Count: " + df.count())
}
}
I use such configurations:
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max.mb: 1024
spark.executor.memory: 2g
spark.sql.autoBroadcastJoinThreshold: -1
spark.sql.files.ignoreCorruptFiles: true
spark.driver.memory: 30g
spark.driver.maxResultSize: 20g
spark.executor.cores: 1
spark.cores.max: 48
spark.scheduler.mode: FAIR

What do you want to achieve with your multi-threading? In a spark application, you should not worry about the number of threads etc. What your code does is the launch parallel jobs (your multi-threading is only on the driver), for the executors, there is no difference.
In my experience, I only use parallel job launches if I have several jobs which are small or skewed, such as the cluster resources are not fully utilized. If I do that, I use scalas parallel collections.
To answer your question : The optimal number if threads is probably 1
EDIT: I would suggest to rewrite your code completely with the goal to have all results in a new dataframe, this is better than implementing complicated multi-threading:
// testcase
val baseDf = Seq(
"6fefc487-bd57-4fa2-808a-3845703b83d0",
"9baba76b-07c2-48ec-a153-6cfb8b138ecf",
"ab654369-77f5-478c-94e5-ee2755ae8571",
"dummy"
).toDF("col1")
var array1 = Seq("6fefc487-bd57-4fa2-808a-3845703b83d0", "9baba76b-07c2-48ec-a153-6cfb8b138ecf")
var array2 = Seq("ab654369-77f5-478c-94e5-ee2755ae8571", "3b43e0a6-deba-4919-a2cc-9d450e28e0fe")
var storage = Seq(array1, array2)
broadcast(storage.toDF("storage"))
.join(baseDf,array_contains($"storage",$"col1"),"left")
.groupBy($"storage").agg(count($"col1").as("count"))
.show()
gives :
+--------------------+-----+
| storage|count|
+--------------------+-----+
|[ab654369-77f5-47...| 1|
|[6fefc487-bd57-4f...| 2|
+--------------------+-----+

Related

What is the most efficient way to copy a S3 "folder"?

I want to find an efficient way to copy an S3 folder/prefix with lots of objects to another folder/prefix on the same bucket. This is what I have tried.
Test data: around 200 objects, around 100 MB each.
1) aws s3 cp --recursive. It took around 150 secs.
2) s3-dist-cp. It took around 59 secs.
3) spark & aws jdk, 2 threads. It took around 440 secs.
4) spark & aws jdk, 64 threads. It took around 50 secs.
The threads definitely worked, but when it goes to a single thread, the aws java sdk approach seems not as efficient the aws s3 cp approach. Is there a single-threaded programming API that can have performance comparable to that of aws s3 cp? Or if there is a better to copy data?
Ideally I would prefer to use programming API to have more flexibility.
Below are the codes I used.
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
def listAllFiles(rootPath: String): Seq[String] = {
val fileSystem = FileSystem.get(URI.create(rootPath), new Configuration())
val it = fileSystem.listFiles(new Path(rootPath), true)
var files = List[String]()
while (it.hasNext) {
files = it.next().getPath.toString::files
}
files
}
def s3CopyFiles(spark: SparkSession, fromPath: String, toPath: String): Unit = {
val fromFiles = listAllFiles(fromPath)
val toFiles = fromFiles.map(_.replaceFirst(fromPath, toPath))
val fileMap = fromFiles.zip(toFiles)
s3CopyFiles(spark, fileMap)
}
def s3CopyFiles(spark: SparkSession, fileMap: Seq[(String, String)]): Unit = {
val sc = spark.sparkContext
val filePairRdd = sc.parallelize(fileMap.toList, sc.defaultParallelism)
filePairRdd.foreachPartition(it => {
val p = "s3://([^/]*)/(.*)".r
val s3 = AmazonS3ClientBuilder.defaultClient()
while (it.hasNext) {
val (p(fromBucket, fromKey), p(toBucket, toKey)) = it.next()
s3.copyObject(fromBucket, fromKey, toBucket, toKey)
}
})
}
The AWS SDK transfer manager is multithreaded; you tell it the block size you want to split the copy up by and it will do it across threads and coalesce the output at the end. Your code shouldn't have to care about how the thread/http pool is working.
Remember that the COPY call isn't doing IO; each thread issues the HTTP request and then blocks awaiting the answer...you can have many, many of them blocked at the same time
I would recommend async approach, for example reactive-aws-clients. You still will be limited to the S3 throttling bandwidth but you won't need brute force of huge number of threads on client side. For example, you could create a Monix app with structure like:
val future = listS3filesTask.flatMap(key => Task.now(getS3Object(key))).runAsync
Await.result(future, 100.seconds)
Another possible optimization could be using torrent protocol s3 feature if you have multiple consumers, so you can distribute data files across consumers with just one S3 GetObject operation per file.

Bulk Insert Data in HBase using Structured Spark Streaming

I'm reading data coming from a Kafka (100.000 line per second) using Structured Spark Streaming, and i'm trying to insert all the data in HBase.
I'm in Cloudera Hadoop 2.6 and I'm using Spark 2.3
I tried something like I've seen here.
eventhubs.writeStream
.foreach(new MyHBaseWriter[Row])
.option("checkpointLocation", checkpointDir)
.start()
.awaitTermination()
MyHBaseWriter looks like this :
class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
override def toPut(record: Row): Put = {
override val tableName: String = "hbase-table-name"
override def toPut(record: Row): Put = {
// Get Json
val data = JSON.parseFull(record.getString(0)).asInstanceOf[Some[Map[String, Object]]]
val key = data.getOrElse(Map())("key")+ ""
val val = data.getOrElse(Map())("val")+ ""
val p = new Put(Bytes.toBytes(key))
//Add columns ...
p.addColumn(Bytes.toBytes(columnFamaliyName),Bytes.toBytes(columnName), Bytes.toBytes(val))
p
}
}
And the HBaseForeachWriter class looks like this :
trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
val tableName: String
def pool: Option[ExecutorService] = None
def user: Option[User] = None
private var hTable: Table = _
private var connection: Connection = _
override def open(partitionId: Long, version: Long): Boolean = {
connection = createConnection()
hTable = getHTable(connection)
true
}
def createConnection(): Connection = {
// I create HBase Connection Here
}
def getHTable(connection: Connection): Table = {
connection.getTable(TableName.valueOf(Variables.getTableName()))
}
override def process(record: RECORD): Unit = {
val put = toPut(record)
hTable.put(put)
}
override def close(errorOrNull: Throwable): Unit = {
hTable.close()
connection.close()
}
def toPut(record: RECORD): Put
}
So here I'm doing a put line by line, even if I allow 20 executors and 4 cores for each, I don't have the data inserted immediatly in HBase. So what I need to do is a bulk load ut I'm struggled because all what I find in the internet is to realize it with RDDs and Map/Reduce.
What I understand is slow rate of record ingestion in to hbase. I have few suggestions to you.
1) hbase.client.write.buffer .
the below property may help you.
hbase.client.write.buffer
Description Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory — on both the client and
server side since server instantiates the passed write buffer to
process it — but a larger buffer size reduces the number of RPCs made.
For an estimate of server-side memory-used, evaluate
hbase.client.write.buffer * hbase.regionserver.handler.count
Default 2097152 (around 2 mb )
I prefer foreachBatch see spark docs (its kind of foreachPartition in spark core) rather foreach
Also in your hbase writer extends ForeachWriter
open method intialize array list of put
in process add the put to the arraylist of puts
in close table.put(listofputs); and then reset the arraylist once you updated the table...
what it does basically your buffer size mentioned above is filled with 2 mb then it will flush in to hbase table. till then records wont go to hbase table.
you can increase that to 10mb and so....
In this way number of RPCs will be reduced. and huge chunk of data will be flushed and will be in hbase table.
when write buffer is filled up and a flushCommits in to hbase table is triggered.
Example code : in my answer
2) switch off WAL you can switch off WAL(write ahead log - Danger is no recovery) but it will speed up writes... if dont want to recover the data.
Note : if you are using solr or cloudera search on hbase tables you
should not turn it off since Solr will work on WAL. if you switch it
off then, Solr indexing wont work.. this is one common mistake many of
us does.
How to swtich off : https://hbase.apache.org/1.1/apidocs/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
Basic architechture and link for further study :
http://hbase.apache.org/book.html#perf.writing
as I mentioned list of puts is good way... this is the old way (foreachPartition with list of puts) of doing before structured streaming example is like below .. where foreachPartition operates for each partition not every row.
def writeHbase(mydataframe: DataFrame) = {
val columnFamilyName: String = "c"
mydataframe.foreachPartition(rows => {
val puts = new util.ArrayList[ Put ]
rows.foreach(row => {
val key = row.getAs[ String ]("rowKey")
val p = new Put(Bytes.toBytes(key))
val columnV = row.getAs[ Double ]("x")
val columnT = row.getAs[ Long ]("y")
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("x"),
Bytes.toBytes(columnX)
)
p.addColumn(
Bytes.toBytes(columnFamilyName),
Bytes.toBytes("y"),
Bytes.toBytes(columnY)
)
puts.add(p)
})
HBaseUtil.putRows(hbaseZookeeperQuorum, hbaseTableName, puts)
})
}
To sumup :
What I feel is we need to understand the psycology of spark and hbase
to make then effective pair.

How to get Number of executors and Number of tasks per executors programmatically in Spark [duplicate]

I run my spark application in yarn cluster. In my code I use number available cores of queue for creating partitions on my dataset:
Dataset ds = ...
ds.coalesce(config.getNumberOfCores());
My question: how can I get number available cores of queue by programmatically way and not by configuration?
There are ways to get both the number of executors and the number of cores in a cluster from Spark. Here is a bit of Scala utility code that I've used in the past. You should easily be able to adapt it to Java. There are two key ideas:
The number of workers is the number of executors minus one or sc.getExecutorStorageStatus.length - 1.
The number of cores per worker can be obtained by executing java.lang.Runtime.getRuntime.availableProcessors on a worker.
The rest of the code is boilerplate for adding convenience methods to SparkContext using Scala implicits. I wrote the code for 1.x years ago, which is why it is not using SparkSession.
One final point: it is often a good idea to coalesce to a multiple of your cores as this can improve performance in the case of skewed data. In practice, I use anywhere between 1.5x and 4x, depending on the size of data and whether the job is running on a shared cluster or not.
import org.apache.spark.SparkContext
import scala.language.implicitConversions
class RichSparkContext(val sc: SparkContext) {
def executorCount: Int =
sc.getExecutorStorageStatus.length - 1 // one is the driver
def coresPerExecutor: Int =
RichSparkContext.coresPerExecutor(sc)
def coreCount: Int =
executorCount * coresPerExecutor
def coreCount(coresPerExecutor: Int): Int =
executorCount * coresPerExecutor
}
object RichSparkContext {
trait Enrichment {
implicit def enrichMetadata(sc: SparkContext): RichSparkContext =
new RichSparkContext(sc)
}
object implicits extends Enrichment
private var _coresPerExecutor: Int = 0
def coresPerExecutor(sc: SparkContext): Int =
synchronized {
if (_coresPerExecutor == 0)
sc.range(0, 1).map(_ => java.lang.Runtime.getRuntime.availableProcessors).collect.head
else _coresPerExecutor
}
}
Update
Recently, getExecutorStorageStatus has been removed. We have switched to using SparkEnv's blockManager.master.getStorageStatus.length - 1 (the minus one is for the driver again). The normal way to get to it, via env of SparkContext is not accessible outside of the org.apache.spark package. Therefore, we use an encapsulation violation pattern:
package org.apache.spark
object EncapsulationViolator {
def sparkEnv(sc: SparkContext): SparkEnv = sc.env
}
Found this while looking for the answer to pretty much the same question.
I found that:
Dataset ds = ...
ds.coalesce(sc.defaultParallelism());
does exactly what the OP was looking for.
For example, my 5 node x 8 core cluster returns 40 for the defaultParallelism.
According to Databricks if the driver and executors are of the same node type, this is the way to go:
java.lang.Runtime.getRuntime.availableProcessors * (sc.statusTracker.getExecutorInfos.length -1)
You could run jobs on every machine and ask it for the number of cores, but that's not necessarily what's available for Spark (as pointed out by #tribbloid in a comment on another answer):
import spark.implicits._
import scala.collection.JavaConverters._
import sys.process._
val procs = (1 to 1000).toDF.map(_ => "hostname".!!.trim -> java.lang.Runtime.getRuntime.availableProcessors).collectAsList().asScala.toMap
val nCpus = procs.values.sum
Running it in the shell (on a tiny test cluster with two workers) gives:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import spark.implicits._
import scala.collection.JavaConverters._
import sys.process._
val procs = (1 to 1000).toDF.map(_ => "hostname".!!.trim -> java.lang.Runtime.getRuntime.availableProcessors).collectAsList().asScala.toMap
val nCpus = procs.values.sum
// Exiting paste mode, now interpreting.
import spark.implicits._
import scala.collection.JavaConverters._
import sys.process._
procs: scala.collection.immutable.Map[String,Int] = Map(ip-172-31-76-201.ec2.internal -> 2, ip-172-31-74-242.ec2.internal -> 2)
nCpus: Int = 4
Add zeros to your range if you typically have lots of machines in your cluster. Even on my two-machine cluster 10000 completes in a couple seconds.
This is probably only useful if you want more information than sc.defaultParallelism() will give you (as in #SteveC 's answer)
For all of those that aren't using yarn clusters: If you are doing it in Python/Databricks here is a function I wrote that will help solve the opportunity. This will get you both the number of worker nodes as well as the number of CPU's and return the multiplied final CPU count of your worker distribution.
def GetDistCPUCount():
nWorkers = int(spark.sparkContext.getConf().get('spark.databricks.clusterUsageTags.clusterTargetWorkers'))
GetType = spark.sparkContext.getConf().get('spark.databricks.clusterUsageTags.clusterNodeType')
GetSubString = pd.Series(test).str.split(pat = '_', expand = True)
GetNumber = GetSubString[1].str.extract('(\d+)')
ParseOutString = GetNumber.iloc[0,0]
WorkerCPUs = int(ParseOutString)
nCPUs = nWorkers * WorkerCPUs
return nCPUs

How to use COGROUP for large datasets

I have two rdd's namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)] I'm using cogroup for those datasets like:
val tab_c = tab_a.cogroup(tab_b).collect.toArray
val updated = tab_c.map { x =>
{
//somecode
}
}
I'm using tab_c cogrouped values for map function and it works fine for small datasets but in case of huge datasets it throws Out Of Memory exception.
I have tried converting the final value to RDD but no luck same error
val newcos = spark.sparkContext.parallelize(tab_c)
1.How to use Cogroup for large datasets ?
2.Can we persist the cogrouped value ?
Code
val source_primary_key = source.map(rec => (rec.split(",")(0), rec))
source_primary_key.persist(StorageLevel.DISK_ONLY)
val destination_primary_key = destination.map(rec => (rec.split(",")(0), rec))
destination_primary_key.persist(StorageLevel.DISK_ONLY)
val cos = source_primary_key.cogroup(destination_primary_key).repartition(10).collect()
var srcmis: Array[String] = new Array[String](0)
var destmis: Array[String] = new Array[String](0)
var extrainsrc: Array[String] = new Array[String](0)
var extraindest: Array[String] = new Array[String](0)
var srcs: String = Seq("")(0)
var destt: String = Seq("")(0)
val updated = cos.map { x =>
{
val key = x._1
val value = x._2
srcs = value._1.mkString(",")
destt = value._2.mkString(",")
if (srcs.equalsIgnoreCase(destt) == false && destt != "") {
srcmis :+= srcs
destmis :+= destt
}
if (srcs == "") {
extraindest :+= destt.mkString("")
}
if (destt == "") {
extrainsrc :+= srcs.mkString("")
}
}
}
Code Updated:
val tab_c = tab_a.cogroup(tab_b).filter(x => x._2._1 =!= x => x._2._2)
// tab_c = {1,Compactbuffer(1,john,US),Compactbuffer(1,john,UK)}
{2,Compactbuffer(2,john,US),Compactbuffer(2,johnson,UK)}..
ERROR:
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(4,3,ResultTask,FetchFailed(null,0,-1,27,org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
ERROR YarnScheduler: Lost executor 8 on datanode1: Container killed by YARN for exceeding memory limits. 1.0 GB of 1020 MB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Thank you
When you use collect() you are basically telling spark to move all the resulting data back to the master node, which can easily produce a bottleneck. You are no longer using Spark at that point, just a plain array in a single machine.
To trigger computation just use something that requires the data at every node, that's why executors live on top of a distributed file system. For instance saveAsTextFile().
Here are some basic examples.
Remember, the entire objective here (that is, if you have big data) is to move the code to your data and compute there, not to bring all the data to the computation.
TL;DR Don't collect.
To run this code safely, without additional assumptions (on average requirements for worker nodes might be significantly smaller), every node (driver and each executor) would require memory significantly exceeding total memory requirements for all data.
If you were to run it outside Spark you would need only one node. Therefore Spark provides no benefits here.
However if you skip collect.toArray and make some assumptions about data distribution you might run it just fine.

Large task size for simplest program

I am trying to run the simplest program with Spark
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (1 to 10000000).toList
val data = sc.parallelize(dat).cache()
for(i <- 1 to 100){
println(data.reduce(_ + _))
}
}
}
I get the following error message, after each iteration :
WARN TaskSetManager: Stage 0 contains a task of very large size (9767
KB). The maximum recommended task size is 100 KB.
Increasing the data size increases said task size. This suggests to me that the driver is shipping the "dat" object to all executors, but I can't for the life of me see why, as the only operation on my RDD is reduce, which basically has no closure. Any ideas ?
Because you create the very large list locally first, the Spark parallelize method is trying to ship this list to the Spark workers as a single unit, as part of a task. Hence the warning message you receive. As an alternative, you could parallelize a much smaller list, then use flatMap to explode it into the larger list. this also has the benefit of creating the larger set of numbers in parallel. For example:
import org.apache.spark.{SparkContext, SparkConf}
object LargeTaskTest extends App {
val conf = new SparkConf().setAppName("DataTest").setMaster("local[*]")
val sc = new SparkContext(conf)
val dat = (0 to 99).toList
val data = sc.parallelize(dat).cache().flatMap(i => (1 to 1000000).map(j => j * 100 + i))
println(data.count()) //100000000
println(data.reduce(_ + _))
sc.stop()
}
EDIT:
Ultimately the local collection being parallelized has to be pushed to the executors. The parallelize method creates an instance of ParallelCollectionRDD:
def parallelize[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
assertNotStopped()
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L730
ParallelCollectionRDD creates a number of partitions equal to numSlices:
override def getPartitions: Array[Partition] = {
val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L96
numSlices defaults to sc.defaultParallelism which on my machine is 4. So even when split, each partition contains a very large list which needs to be pushed to an executor.
SparkContext.parallelize contains the note #note Parallelize acts lazily and ParallelCollectionRDD contains the comment;
// TODO: Right now, each split sends along its full data, even if
later down the RDD chain it gets // cached. It might be worthwhile
to write the data to a file in the DFS and read it in the split //
instead.
So it appears that the problem happens when you call reduce because this is the point that the partitions are sent to the executors, but the root cause is that you are calling parallelize on a very big list. Generating the large list within the executors is a better approach, IMHO.
Reduce function sends all the data to one single node. When you run sc.parallelize the data is distributed by default to 100 partitions. To make use of the already distributed data use something like this:
data.map(el=> el%100 -> el).reduceByKey(_+_)
or you can do the reduce at partition level.
data.mapPartitions(p => Iterator(p.reduce(_ + _))).reduce(_ + _)
or just use sum :)