Apache Spark running only one task on one executor - Scala

I have a Spark job that reads from a database and performs a filter, a union, 2 joins and finally writes the result back to the database.
However, the last stage runs only one task on just one executor, out of 50 executors. I've tried increasing the number of partitions and using a hash partitioner, but no luck.
After several hours of Googling, it seems my data may be skewed but I don't know how to fix it.
Any suggestions, please?
Specs:
Standalone cluster
120 cores
400 GB memory
Executors:
30 executors (4 cores/executor)
13 GB per executor
4 GB driver memory
Code snippet
...
def main(args: Array[String]) {
  ....
  import sparkSession.implicits._

  val similarityDs = sparkSession.read.format("jdbc").options(opts).load
  similarityDs.createOrReplaceTempView("locator_clusters")

  val ClassifierDs = sparkSession.sql("select * " +
    "from locator_clusters where " +
    "relative_score >= 0.9 and " +
    "((content_hash_id is not NULL or content_hash_id <> '') " +
    "or (ref_hash_id is not NULL or ref_hash_id <> ''))").as[Hash].cache()

  def nnHash(tag: String) = (tag.hashCode & 0x7FFFFF).toLong

  val contentHashes = ClassifierDs
    .map(locator => (nnHash(locator.app_hash_id), Member(locator.app_hash_id, locator.app_hash_id, 0, 0, 0)))
    .toDF("id", "member").dropDuplicates().alias("ch").as[IdMember]

  val similarHashes = ClassifierDs
    .map(locator => (nnHash(locator.content_hash_id), Member(locator.app_hash_id, locator.content_hash_id, 0, 0, 0)))
    .toDF("id", "member").dropDuplicates().alias("sh").as[IdMember]

  val missingContentHashes = similarHashes
    .join(contentHashes, similarHashes("id") === contentHashes("id"), "right_outer")
    .select("ch.*").toDF("id", "member").as[IdMember]

  val locatorHashesRdd = similarHashes.union(missingContentHashes).cache()

  val vertices = locatorHashesRdd.map { case row: IdMember => (row.id, row.member) }.cache()

  val toHashId = udf(nnHash(_: String))

  val edgesDf = ClassifierDs.select(toHashId($"app_hash_id"), toHashId($"content_hash_id"), $"raw_score", $"relative_score").cache()

  val edges = edgesDf.map(e => Edge(e.getLong(0), e.getLong(1), (e.getDouble(2), e.getDouble(2)))).cache()

  val graph = Graph(vertices.rdd, edges.rdd).cache()

  val sc = sparkSession.sparkContext

  val ccVertices = graph.connectedComponents.vertices.cache()

  val ccByClusters = vertices.rdd.join(ccVertices).map({
    case (id, (hash, compId)) => (compId, hash.content_hash_id, hash.raw_score, hash.relative_score, hash.size)
  }).toDF("id", "content_hash_id", "raw_score", "relative_score", "size").alias("cc")

  val verticesDf = vertices
    .map(x => (x._1, x._2.app_hash_id, x._2.content_hash_id, x._2.raw_score, x._2.relative_score, x._2.size))
    .toDF("id", "app_hash_id", "content_hash_id", "raw_score", "relative_score", "size").alias("v")

  val superClusters = verticesDf.join(ccByClusters, "id")
    .select($"v.app_hash_id", $"v.app_hash_id", $"cc.content_hash_id", $"cc.raw_score", $"cc.relative_score", $"cc.size")
    .toDF("ref_hash_id", "app_hash_id", "content_hash_id", "raw_score", "relative_score", "size")

  val prop = new Properties()
  prop.setProperty("user", M_DB_USER)
  prop.setProperty("password", M_DB_PASSWORD)
  prop.setProperty("driver", "org.postgresql.Driver")

  superClusters.write
    .mode(SaveMode.Append)
    .jdbc(s"jdbc:postgresql://$M_DB_HOST:$M_DB_PORT/$M_DATABASE", MERGED_TABLE, prop)

  sparkSession.stop()
Screenshot showing one executor
Stderr from the executor
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Getting 409 non-empty blocks out of 2000 blocks
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Started 59 remote fetches in 5 ms
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Getting 2000 non-empty blocks out of 2000 blocks
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Started 59 remote fetches in 9 ms
16/10/01 18:53:43 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 896.0 MB to disk (1 time so far)
16/10/01 18:53:46 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 896.0 MB to disk (2 times so far)
16/10/01 18:53:48 INFO Executor: Finished task 1906.0 in stage 769.0 (TID 260306). 3119 bytes result sent to driver
16/10/01 18:53:51 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (3 times so far)
16/10/01 18:53:57 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (4 times so far)
16/10/01 18:54:03 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (5 times so far)
16/10/01 18:54:09 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (6 times so far)
16/10/01 18:54:15 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (7 times so far)
16/10/01 18:54:21 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (8 times so far)
16/10/01 18:54:27 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (9 times so far)
16/10/01 18:54:33 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (10 times so far)
16/10/01 18:54:39 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (11 times so far)
16/10/01 18:54:44 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (12 times so far)

If data skew is indeed the problem here and all keys hash to a single partition, then what you can try is either a full Cartesian product or a broadcast join with prefiltered data. Let's consider the following example:
val left = spark.range(1L, 100000L).select(lit(1L), rand(1)).toDF("k", "v")
left.select(countDistinct($"k")).show
// +-----------------+
// |count(DISTINCT k)|
// +-----------------+
// | 1|
// +-----------------+
Any attempt to join with data like this would result in serious data skew. Now let's say we have another table as follows:
val right = spark.range(1L, 100000L).select(
(rand(3) * 1000).cast("bigint"), rand(1)
).toDF("k", "v")
right.select(countDistinct($"k")).show
// +-----------------+
// |count(DISTINCT k)|
// +-----------------+
// | 1000|
// +-----------------+
As mentioned above, there are two methods we can try:
If we expect that the number of records in right corresponding to the keys in left is small, we can use a broadcast join:
type KeyType = Long
val keys = left.select($"k").distinct.as[KeyType].collect
val rightFiltered = broadcast(right.where($"k".isin(keys: _*)))
left.join(rightFiltered, Seq("k"))
Otherwise we can perform a crossJoin followed by a filter:
left.as("left")
.crossJoin(rightFiltered.as("right"))
.where($"left.k" === $"right.k")
or
spark.conf.set("spark.sql.crossJoin.enabled", true)
left.as("left")
.join(rightFiltered.as("right"))
.where($"left.k" === $"right.k")
If there is a mix of rare and common keys, you can separate the computation by performing a standard join on the rare keys and using one of the methods shown above for the common ones, as sketched below.
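For a concrete illustration, here is a minimal sketch of that split, reusing the left and right frames from the example above. The frequency threshold and the udf-based split are assumptions made for the example, not something from the original question:
import org.apache.spark.sql.functions.{broadcast, udf}

// Count how often each key occurs on the skewed side (here: left).
val keyCounts = left.groupBy($"k").count()

// Keys above this purely illustrative threshold are treated as common (hot) keys.
val threshold = 10000L
val commonKeySet = keyCounts
  .where($"count" >= threshold)
  .select($"k")
  .collect()
  .map(_.getLong(0))
  .toSet

// Split both sides and join each part with a suitable strategy.
val isCommon = udf((k: Long) => commonKeySet.contains(k))

// Rare keys: a plain shuffle join is fine.
val joinedRare = left.where(!isCommon($"k")).join(right.where(!isCommon($"k")), Seq("k"))

// Common keys: use one of the techniques above, e.g. a broadcast join on the prefiltered side.
val joinedCommon = left.where(isCommon($"k")).join(broadcast(right.where(isCommon($"k"))), Seq("k"))

// The union of both parts is the complete join result.
val joined = joinedRare.union(joinedCommon)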
Another possible issue is the jdbc format. If you don't provide predicates, or a partitioning column together with lower/upper bounds and a number of partitions, all data is loaded by a single executor.
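For reference, a partitioned jdbc read could look roughly like the sketch below; the partition column and bounds are placeholders and would have to match an indexed numeric column in the actual source table:
val similarityDs = sparkSession.read
  .format("jdbc")
  .options(opts)                      // url, dbtable, user, password, driver, ...
  .option("partitionColumn", "id")    // placeholder: a numeric column in the source table
  .option("lowerBound", "1")          // min value of the partition column
  .option("upperBound", "10000000")   // max value of the partition column
  .option("numPartitions", "50")      // the read is split into 50 parallel JDBC queries
  .load()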

Related

Handling data skew without salting the join key in Spark

I am trying to inner join a million-row dataframe with a 30-row dataframe, and both tables have the same join key. Spark is trying to perform a sort-merge join, due to which all my data ends up in the same executor and the job never finishes. For example:
DF1 (million-row dataframe registered as TempView DF1)
+-------+-----------+
| id | price |
+-------+-----------+
| 1 | 30 |
| 1 | 10 |
| 1 | 12 |
| 1 | 15 |
+-------+-----------+
DF2 (30-row dataframe registered as TempView DF2)
+-------+-----------+
| id | Month |
+-------+-----------+
| 1 | Jan |
| 1 | Feb |
+-------+-----------+
I tried the following:
Broadcasting
spark.sql("Select /*+ BROADCAST(Df2) */ Df1.* from Df1 inner join Df2 on Df1.id=Df2.id").createTempView("temp")
Repartitioned
Df1.repartition(200)
Query Execution Plan
00 Project [.......................]
01 +- SortMergeJoin [.............................],Inner
02 :- Project [.............................]
03 : +-Filter is notnull[JoinKey]
04 : +- FileScan orc[..........................]
05 +-Project [.............................]
06 +-BroadcastHashJoin [..........................], LeftOuter, BuildRight
07 :- BroadCastHashJoin [......................],LeftSemi, BuildRight
Output of the number of partitions
spark.table("temp").withColumn("partition_id",spark_partition_id).groupBy
("partition_id").count
+------------+------------+
|partition_id|       count|
+------------+------------+
|          21| 300,00,000 |
+------------+------------+
Even though I repartition/broadcast the data, Spark brings all the data to one executor while joining, and the data gets skewed at one executor. I also tried setting spark.sql.join.preferSortMergeJoin to false, but I still see my data getting skewed at one executor. Can anyone help me?
Just doing it like this, it works fine. Data is as is, no partitioning as such.
import org.apache.spark.sql.functions.broadcast
// Simulate some data
val df1 = spark.range(1000000).rdd.map(x => (1, "xxx")).toDF("one", "val")
val df2 = spark.range(30).rdd.map(x => (1, "yyy")).toDF("one", "val2")
// Data is as is, has no partitioning applied
val df3 = df1.join(broadcast(df2), "one")
df3.count // An action to kick it all along
// Look at final counts of partitions
val rddcounts = df3.rdd.mapPartitions(iter => Array(iter.size).iterator, true)
rddcounts.collect
returns:
res26: Array[Int] = Array(3750000, 3750000, 3750000, 3750000, 3750000, 3750000, 3750000, 3750000)
This relies on default parallelism, 8 on a CE Databricks cluster.
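If you want to check what parallelism applies in your own environment, a quick sketch (both are standard Spark settings):
// Default parallelism used for RDD operations (what the partition counts above come from).
println(spark.sparkContext.defaultParallelism)

// Number of partitions used for DataFrame/SQL shuffles (joins, aggregations); defaults to 200.
println(spark.conf.get("spark.sql.shuffle.partitions"))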
Broadcast should work in any event as the small table is SMALL.
Even with this:
val df = spark.range(1000000).rdd.map(x => (1, "xxx")).toDF("one", "val")
val df1 = df.repartition(50)
It works in parallel with 50 partitions. This is round-robin partitioning, meaning the cluster will get partitions distributed over N Workers with at least N Executors. It is not hashed; hashing is only invoked by specifying a column, which causes skewness if all values are the same, i.e. the same partition on 1 Worker for all the data.
QED: so not everything runs on only one Executor, unless you have only one Executor for the Spark app or hashing was applied.
I ran it afterwards on my experimental laptop with local[4] and the data was being serviced by 4 cores, thus 4 Executors as it were. No salting, parallelism of 4. So it is odd you cannot get that, unless you hashed.
You can see 4 parallel Tasks, and thus not everything on 1 Executor, if on a real cluster.
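To see the difference between round-robin and hash repartitioning yourself, a small sketch using the same single-valued column one as above (the spark_partition_id grouping is just for inspection):
import org.apache.spark.sql.functions.spark_partition_id

val df = spark.range(1000000).rdd.map(x => (1, "xxx")).toDF("one", "val")

// Round-robin repartitioning: rows are spread evenly over 50 partitions.
df.repartition(50)
  .groupBy(spark_partition_id().as("pid")).count().show()

// Hash repartitioning on a column that always has the same value: everything lands in one partition.
df.repartition(50, $"one")
  .groupBy(spark_partition_id().as("pid")).count().show()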
Why does all the data move to one executor? If you only have the same id (id: 1) in DF1 and use id to join DF2, then according to the HashPartitioner the data with id=1 will always move together.
Did the broadcast join actually occur? Check it in the Spark UI.
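One way to verify whether the broadcast actually happened is to look at the physical plan; with the hint (or a table small enough for spark.sql.autoBroadcastJoinThreshold), a BroadcastHashJoin should appear instead of SortMergeJoin. A sketch, reusing the Df1/Df2 temp views from the question:
val joined = spark.sql(
  "Select /*+ BROADCAST(Df2) */ Df1.* from Df1 inner join Df2 on Df1.id = Df2.id")

// The plan should show BroadcastHashJoin rather than SortMergeJoin if the hint took effect.
joined.explain()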

Spark table joins - Resource allocation issue

I am stuck while working with Hive tables using a Spark cluster (YARN is in place). I have some 7 tables which I need to join, then replace some null values, and write the result back to a final Hive table.
I use Spark SQL (Scala), creating a dataframe for each table first, and then join all the dataframes and write the result back to a Hive table.
After five minutes my code throws the error below, which I know is due to not setting my resource allocation properly.
19/10/13 06:46:53 ERROR client.TransportResponseHandler: Still have 2 requests outstanding when connection from /100.66.0.1:36467 is closed
19/10/13 06:46:53 ERROR cluster.YarnScheduler: Lost executor 401 on aaaa-bd10.pq.internal.myfove.com: Container container_e33_1570683426425_4555_01_000414 exited from explicit termination request.
19/10/13 06:47:02 ERROR cluster.YarnScheduler: Lost executor 391 on aaaa-bd10.pq.internal.myfove.com: Container marked as failed: container_e33_1570683426425_4555_01_000403 on host: aaaa-bd10.pq.internal.myfove.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
My hardware specification
HostName    Memory (GB)    CPU    Memory for YARN (GB)    CPUs for YARN
Node 1      126            32     90                      26
Node 2      126            32     90                      26
Node 3      126            32     90                      26
Node 4      126            32     90                      26
How do I set the variables below properly, so that my code doesn't throw this error (container marked as failed, killed by external signal, exit code 143)?
I am trying different configurations, but nothing has helped yet.
val spark = (SparkSession.builder
  .appName("Final Table")
  .config("spark.driver.memory", "5g")
  .config("spark.executor.memory", "15g")
  .config("spark.dynamicAllocation.maxExecutors", "6")
  .config("spark.executor.cores", "5")
  .enableHiveSupport()
  .getOrCreate())
val df1 = spark.sql("Select * from table_1") // 1.4 million records and 10 vars
val df2 = spark.sql("Select * from table_2") // 1.4 million records and 3000
val df3 = spark.sql("Select * from table_3") // 1.4 million records and 300
val df4 = spark.sql("Select * from table_4") // 1.4 million records and 600
val df5 = spark.sql("Select * from table_5") // 1.4 million records and 150
val df6 = spark.sql("Select * from table_6") // 1.4 million records and 2
val df7 = spark.sql("Select * from table_7") // 1.4 million records and 12
val joinDF1 = df1.join(df2, df1("number") === df2("number"), "left_outer").drop(df2("number"))
val joinDF2 = joinDF1.join(df3,joinDF1("number") === df3("number"), "left_outer").drop(df3("number"))
val joinDF3 = joinDF2.join(df4,joinDF2("number") === df4("number"), "left_outer").drop(df4("number"))
val joinDF4 = joinDF3.join(df5,joinDF3("number") === df5("number"), "left_outer").drop(df5("number"))
val joinDF5 = joinDF4.join(df6,joinDF4("number") === df6("number"), "left_outer").drop(df6("number")).drop("Dt")
val joinDF6 = joinDF5.join(df7,joinDF5("number") === df7("number"), "left_outer").drop(df7("number")).drop("Dt")
joinDF6.createOrReplaceTempView("joinDF6")
spark.sql("create table hive table as select * from joinDF6")
Please check your yarn.nodemanager.log-dirs in Ambari if you are using Ambari. If not, try to find this property anyway, and if it is pointing to a directory where you have very little space, change it to some other directory which has more space.
While running tasks, containers create blocks which get stored in the yarn.nodemanager.log-dirs location; if there is not enough space to store those blocks, containers start to fail.
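If you don't have Ambari at hand, one quick way to see what the property resolves to from the Spark driver is the sketch below, assuming yarn-site.xml is on the driver's classpath:
// Read the resolved YARN setting from the Hadoop configuration visible to the driver.
val logDirs = spark.sparkContext.hadoopConfiguration.get("yarn.nodemanager.log-dirs")
println(s"yarn.nodemanager.log-dirs = $logDirs")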

Spark SQL freeze

I have a problem with Spark SQL. I read some data from CSV files, then I do groupBy and join operations, and the final task is to write the joined data to a file. My problem is the time gap (visible in the log below between 23:39:54 and 00:14:22).
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1069
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1003
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 965
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1073
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1038
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 900
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 903
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 938
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on 10.4.110.24:36423 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on omm104.in.nawras.com.om:43133 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 969
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1036
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 970
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1006
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1039
18/08/07 23:39:47 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
18/08/07 23:39:54 INFO parquet.ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
The dataframes are small: ~5000 records and ~800 columns.
I am using the following code:
val parentDF = ...
val childADF = ...
val childBDF = ...
val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"
val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")
val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)
val childADataFrame = childADF
  .select(nestedAColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")
val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)
val childBDataFrame = childBDF
  .select(nestedBColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedBColName).alias(aggregatedBColName))
val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")
The processing time for 30 files (~85k records in total) is strangely high: ~38 min.
Have you ever seen a similar problem?
Try to avoid the repartition() call, as it causes unnecessary data movement within the nodes.
According to Learning Spark
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
Put simply, coalesce() is only for decreasing the number of partitions; it avoids a full shuffle of the data and just merges existing partitions.
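A small sketch of the difference between the two (the numbers are arbitrary, purely for illustration):
val df = spark.range(0, 1000000)          // starts with the default number of partitions

// coalesce() only merges existing partitions downwards -- no full shuffle.
val narrowed = df.coalesce(4)
println(narrowed.rdd.getNumPartitions)    // 4

// repartition() triggers a full shuffle and can increase or decrease the partition count.
val reshuffled = df.repartition(100)
println(reshuffled.rdd.getNumPartitions)  // 100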

Spark Dataframe: applying groupBy (over 2 columns).sum(another column) takes 50 GB of space and a lot of time

I am running a Scala (2.11.8) based Spark (2.0.0) program on an AWS cluster (1 master with 4 worker nodes) with 15 GB of memory for each node.
I am using the following config to run:
spark-submit --master "spark://ip-xyz:xyz" --deploy-mode cluster
--driver-memory 2g --executor-memory 10g --queue default myProject.jar
The dataframe (df) consists of 16098400000 rows.
| a_id| i_id| category| name| u| p| score|
+--------+-------+-----------+--------------------+--------+-------+---------+
|12000035|8119221| ARTIST|ASHBY, LINDEN: CA...| 0.01251|0.93166|0.0116613|
|12000081|8119221| ARTIST|ASHBY, LINDEN: CA...| 0.2672|0.93166| 0.248998|
|12000009|8111111| ARTIST|ASHBY, LINDEN: CA...| 0.0236|0.93160| 0.022008|
|12000091|8111111| ARTIST|ASHBY, LINDEN: CA...| 0.5|0.93126| 0.46583|
|13200000|8100000| ARTIST|ASHBY, LINDEN: CA...| 0.0944|0.93166| 0.088034|
+--------+-------+-----------+--------------------+--------+-------+---------+
Running the following code:
val df1 = df.groupBy("a_id", "i_id").sum("score")
I set up checkpointing with a capacity of 50 GB.
While monitoring the checkpoint, the used space increases from 500 MB to 45-50 GB over a period of 2 hours.
After that the application fails with the error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: java.io.IOException: No space left on device
I know it's a huge dataset; can I do it efficiently with the same resources?
NOTE: when I tried groupBy on just 1 column, the operation finished successfully in just 5 minutes.

Spark streaming with broadcast joins

I have a Spark Streaming use case where I plan to keep a dataset broadcast and cached on each executor. Every micro-batch in the stream creates a dataframe out of the RDD and joins it with the dataset. My test code, given below, performs the broadcast operation for each batch. Is there a way to broadcast it just once?
val testDF = sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema).load("file:///shared/data/test-data.txt")
val lines = ssc.socketTextStream("DevNode", 9999)
lines.foreachRDD((rdd, timestamp) => {
  val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
  val resultDF = recordDF.join(broadcast(testDF), "Age")
  resultDF.write.format("com.databricks.spark.csv").save("file:///shared/data/output/streaming/" + timestamp)
})
For every batch this file was read and broadcast was performed.
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
Any suggestions on broadcasting the dataset only once?
It looks like for now broadcasted tables are not reused. See: SPARK-3863
Perform the broadcast outside the foreachRDD loop:
val testDF = broadcast(sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema).load(...))
lines.foreachRDD((rdd, timestamp) => {
  val recordDF = ???
  val resultDF = recordDF.join(testDF, "Age")
  resultDF.write.format("com.databricks.spark.csv").save(...)
})