I have a problem with Spark SQL. I read some data from CSV files, then do a groupBy and a join, and the final task writes the joined data to a file. My problem is the time gap visible in the log below (between 23:39:54 and 00:14:22).
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1069
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1003
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 965
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1073
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1038
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 900
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 903
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 938
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on 10.4.110.24:36423 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on omm104.in.nawras.com.om:43133 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 969
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1036
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 970
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1006
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1039
18/08/07 23:39:47 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
18/08/07 23:39:54 INFO parquet.ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
The DataFrames are small: ~5000 records and ~800 columns.
I am using the following code:
val parentDF = ...
val childADF = ...
val childBDF = ...
val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"
val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")
val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)
val childADataFrame = childADF
.select(nestedAColumns: _*)
.repartition(keyColumns.map(col): _*)
.groupBy(keyColumns.map(col): _*)
.agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")
val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)
val childBDataFrame = childBDF
.select(nestedBColumns: _*)
.repartition(keyColumns.map(col): _*)
.groupBy(keyColumns.map(col): _*)
.agg(collect_list(aggregatedBColName).alias(aggregatedBColName))
val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")
The processing time for 30 files (~85k records in total) is strangely high, around 38 minutes.
Have you ever seen a similar problem?
Try to avoid the repartition() call, as it causes unnecessary data movement between the nodes.
According to Learning Spark:
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
Put simply, coalesce() only decreases the number of partitions; it avoids a full shuffle and just merges existing partitions.
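For example, since groupBy already shuffles the data by the grouping keys, the explicit repartition before it can usually be dropped. A minimal sketch for the child-A branch, reusing the names from the question (the coalesce value below is just an illustrative assumption):
// Same aggregation as in the question, without the extra repartition():
// groupBy on the key columns already triggers a shuffle on those keys.
val childADataFrame = childADF
  .select(nestedAColumns: _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")
// If fewer output partitions are needed before writing, coalesce() reduces the
// partition count without a full shuffle (8 here is an arbitrary example value).
val joinedWithACoalesced = joinedWithA.coalesce(8)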
I understand that there are many SO answers related to what I am asking, but since I am very new to Scala, I am not able to understand those answers. I would really appreciate it if someone could help me correct my UDF.
I have this UDF, which is meant to convert timestamps from GMT to MST:
import java.time.{LocalDateTime, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

val Gmt2Mst = (dtm_str: String, inFmt: String, outFmt: String) => {
if ("".equals(dtm_str) || dtm_str == null || dtm_str.length() < inFmt.length()) {
null
}
else {
val gmtZoneId = ZoneId.of("GMT", ZoneId.SHORT_IDS);
val mstZoneId = ZoneId.of("MST", ZoneId.SHORT_IDS);
val inFormatter = DateTimeFormatter.ofPattern(inFmt);
val outFormatter = DateTimeFormatter.ofPattern(outFmt);
val dateTime = LocalDateTime.parse(dtm_str, inFormatter);
val gmt = ZonedDateTime.of(dateTime, gmtZoneId)
val mst = gmt.withZoneSameInstant(mstZoneId)
mst.format(outFormatter)
}
}
spark.udf.register("Gmt2Mst", Gmt2Mst)
But whenever a NULL is encountered, the UDF fails to handle it. I am trying to handle it using dtm_str == null, but it still fails. Can someone please tell me what correction I have to make, instead of dtm_str == null, to achieve my goal?
To give an example, if I run the spark-sql below:
spark.sql("select null as col1, Gmt2Mst(null,'yyyyMMddHHmm', 'yyyyMMddHHmm') as col2").show()
I am getting this error:
22/09/20 14:10:31 INFO TaskSetManager: Starting task 101.1 in stage 27.0 (TID 1809) (10.243.37.204, executor 18, partition 101, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 100.0 in stage 27.0 (TID 1801) on 10.243.37.204, executor 18: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 1]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 100.1 in stage 27.0 (TID 1810) (10.243.37.241, executor 1, partition 100, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 102.0 in stage 27.0 (TID 1803) on 10.243.37.241, executor 1: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 2]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 102.1 in stage 27.0 (TID 1811) (10.243.36.183, executor 22, partition 102, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Finished task 80.0 in stage 27.0 (TID 1781) in 2301 ms on 10.243.36.183 (executor 22) (81/355)
22/09/20 14:10:31 INFO TaskSetManager: Starting task 108.0 in stage 27.0 (TID 1812) (10.243.36.156, executor 4, partition 108, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 103.0 in stage 27.0 (TID 1804) on 10.243.36.156, executor 4: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 3]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 103.1 in stage 27.0 (TID 1813) (10.243.36.180, executor 9, partition 103, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 WARN TaskSetManager: Lost task 105.0 in stage 27.0 (TID 1806) (10.243.36.180 executor 9): org.apache.spark.SparkException: Failed to execute user defined function (anonfun$3: (string, string) => string)
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:136)
at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.commonspirit.sepsis_bio.recovery.SepsisRecoveryBundle$$anonfun$3.apply(SepsisRecoveryBundle.scala:123)
at org.commonspirit.sepsis_bio.recovery.SepsisRecoveryBundle$$anonfun$3.apply(SepsisRecoveryBundle.scala:122)
... 15 more
I did the following test and it seems to work:
Create a dataframe with a null-type column. The schema would be:
root
|-- v0: string (nullable = true)
|-- v1: string (nullable = true)
|-- null: null (nullable = true)
for example:
+----+-----+----+
| v0| v1|null|
+----+-----+----+
|hola|adios|null|
+----+-----+----+
Create the udf:
val udf1 = udf{ v1: Any => { if(v1 != null) s"${v1}_transformed" else null } }
Note that working with Any in Scala is bad practice, but this is Spark SQL, and to handle a value that could be of two different types you need to work with this supertype.
Register the udf:
spark.udf.register("udf1", udf1)
Create the view:
df2.createTempView("df2")
Apply the udf to the view:
spark.sql("select udf1(null) from df").show()
it shows:
+---------+
|UDF(null)|
+---------+
| null|
+---------+
Apply it to a column with a non-null value:
spark.sql("select udf1(v0) from df2").show()
it shows:
+----------------+
| UDF(v0)|
+----------------+
|hola_transformed|
+----------------+
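Applying the same null check to the original function, a minimal sketch (using the hypothetical name Gmt2MstSafe and assuming the same java.time imports as in the question) could guard every argument before parsing:
// Return null for any null or too-short input instead of letting parsing throw.
val Gmt2MstSafe = (dtm_str: String, inFmt: String, outFmt: String) => {
  if (dtm_str == null || inFmt == null || outFmt == null ||
      dtm_str.isEmpty || dtm_str.length < inFmt.length) {
    null
  } else {
    val inFormatter = DateTimeFormatter.ofPattern(inFmt)
    val outFormatter = DateTimeFormatter.ofPattern(outFmt)
    val gmt = ZonedDateTime.of(LocalDateTime.parse(dtm_str, inFormatter),
                               ZoneId.of("GMT", ZoneId.SHORT_IDS))
    gmt.withZoneSameInstant(ZoneId.of("MST", ZoneId.SHORT_IDS)).format(outFormatter)
  }
}
spark.udf.register("Gmt2MstSafe", Gmt2MstSafe)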
I am trying to get some data from my Cassandra database in a Spark program, but the request never completes.
My Spark configuration looks like this:
object ExternalConf {
var cassandraHost : String = "cassandra_cassandra-001,cassandra_cassandra-002,cassandra_cassandra-003,cassandra_cassandra-004"
var masterSpark: String ="local[*]"
}
object Spark {
val session : SparkSession = SparkSession
.builder()
.appName("KStreaming")
// .config("spark.cassandra.connection.host", ExternalConf.cassandraHost) //default value or args
.config("spark.cassandra.connection.host", "cassandra_node") //preprod
.config("spark.cassandra.auth.username", "cassandra")
.config("spark.cassandra.auth.password", "cassandra")
.config("output.batch.grouping.buffer.size", "50")
.config("output.batch.size.bytes", "102400")
.config("spark.driver.maxResultSize", "4g")
.config("spark.sql.broadcastTimeout", "1800")
.master(ExternalConf.masterSpark)
.getOrCreate();
session.sql("CREATE OR REPLACE TEMPORARY VIEW dbv2_product_categories USING org.apache.spark.sql.cassandra OPTIONS (table 'dbv2_product_categories', keyspace 'preprod', pushdown 'true')")
import session.implicits._
}
And the code that makes a SELECT:
def extractData(data: RDD[ConsumerRecord[String, String]]) = {
import Spark.session.implicits._
data
.foreach(message => {
var persistedProductCategory: DataFrame = Spark.session.sql("SELECT * FROM dbv2_product_categories WHERE account_id = '" + accountId + "' AND name = '" + shopifyProduct.product_type + "'")
})
}
The request never ends. Here is my stderr (I stripped the beginning of it to make it shorter):
22/08/29 09:50:04 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [kafka:9092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id = consumer-4
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = spark-executor--183698333
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = none
22/08/29 09:50:04 WARN ConsumerConfig: The configuration schema.registry.url = http://schema-registry:8081 was supplied but isn't a known config.
22/08/29 09:50:04 INFO AppInfoParser: Kafka version : 0.10.0.1
22/08/29 09:50:04 INFO AppInfoParser: Kafka commitId : a7a17cdec9eaa6c5
22/08/29 09:50:04 INFO CachedKafkaConsumer: Initial fetch for spark-executor--183698333 62c44e48be54d0002900bd62_products 0 0
22/08/29 09:50:04 INFO AbstractCoordinator: Discovered coordinator kafka:9092 (id: 2147483646 rack: null) for group spark-executor--183698333.
22/08/29 09:50:04 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
22/08/29 09:50:04 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work/driver-20220829094956-7907/spark-warehouse').
22/08/29 09:50:04 INFO SharedState: Warehouse path is 'file:/opt/spark/work/driver-20220829094956-7907/spark-warehouse'.
22/08/29 09:50:05 INFO JobScheduler: Added jobs for time 1661766605000 ms
22/08/29 09:50:05 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
22/08/29 09:50:05 INFO SparkSqlParser: Parsing command: CREATE OR REPLACE TEMPORARY VIEW dbv2_product_categories USING org.apache.spark.sql.cassandra OPTIONS (table 'dbv2_product_categories', keyspace 'kiliba', pushdown 'true')
22/08/29 09:50:07 INFO ClockFactory: Using native clock to generate timestamps.
22/08/29 09:50:07 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
22/08/29 09:50:07 INFO Cluster: New Cassandra host cassandra_node/10.0.0.177:9042 added
22/08/29 09:50:07 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
22/08/29 09:50:08 INFO SparkSqlParser: Parsing command: SELECT * FROM dbv2_product_categories WHERE account_id = 'shopifytest_62c44e48be54d0002900bd62' AND name = ''
22/08/29 09:50:08 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(account_id), IsNotNull(name), EqualTo(account_id,shopifytest_62c44e48be54d0002900bd62), EqualTo(name,)]
22/08/29 09:50:08 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(account_id), IsNotNull(name), EqualTo(account_id,shopifytest_62c44e48be54d0002900bd62), EqualTo(name,)]
22/08/29 09:50:08 INFO CodeGenerator: Code generated in 175.144922 ms
22/08/29 09:50:08 INFO CodeGenerator: Code generated in 16.62666 ms
22/08/29 09:50:08 INFO SparkContext: Starting job: count at Kstreaming.scala:361
22/08/29 09:50:08 INFO DAGScheduler: Registering RDD 7 (count at Kstreaming.scala:361)
22/08/29 09:50:08 INFO DAGScheduler: Got job 1 (count at Kstreaming.scala:361) with 1 output partitions
22/08/29 09:50:08 INFO DAGScheduler: Final stage: ResultStage 2 (count at Kstreaming.scala:361)
22/08/29 09:50:08 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
22/08/29 09:50:08 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
22/08/29 09:50:09 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[7] at count at Kstreaming.scala:361), which has no missing parents
22/08/29 09:50:09 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.8 KB, free 5.2 GB)
22/08/29 09:50:09 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.8 KB, free 5.2 GB)
22/08/29 09:50:09 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.229:43123 (size: 7.8 KB, free: 5.2 GB)
22/08/29 09:50:09 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
22/08/29 09:50:09 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[7] at count at Kstreaming.scala:361) (first 15 tasks are for partitions Vector(0))
22/08/29 09:50:09 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
22/08/29 09:50:10 INFO JobScheduler: Added jobs for time 1661766610000 ms
22/08/29 09:50:15 INFO JobScheduler: Added jobs for time 1661766615000 ms
22/08/29 09:50:16 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
22/08/29 09:50:20 INFO JobScheduler: Added jobs for time 1661766620000 ms
22/08/29 09:50:25 INFO JobScheduler: Added jobs for time 1661766625000 ms
22/08/29 09:50:30 INFO JobScheduler: Added jobs for time 1661766630000 ms
22/08/29 09:50:35 INFO JobScheduler: Added jobs for time 1661766635000 ms
22/08/29 09:50:40 INFO JobScheduler: Added jobs for time 1661766640000 ms
22/08/29 09:50:45 INFO JobScheduler: Added jobs for time 1661766645000 ms
22/08/29 09:50:50 INFO JobScheduler: Added jobs for time 1661766650000 ms
22/08/29 09:50:55 INFO JobScheduler: Added jobs for time 1661766655000 ms
22/08/29 09:51:00 INFO JobScheduler: Added jobs for time 1661766660000 ms
22/08/29 09:51:05 INFO JobScheduler: Added jobs for time 1661766665000 ms
22/08/29 09:51:10 INFO JobScheduler: Added jobs for time 1661766670000 ms
22/08/29 09:51:15 INFO JobScheduler: Added jobs for time 1661766675000 ms
22/08/29 09:51:20 INFO JobScheduler: Added jobs for time 1661766680000 ms
22/08/29 09:51:25 INFO JobScheduler: Added jobs for time 1661766685000 ms
22/08/29 09:51:30 INFO JobScheduler: Added jobs for time 1661766690000 ms
When I execute the SQL command directly in cqlsh, it works and I get a result almost instantly:
cqlsh:kiliba> SELECT * FROM dbv2_product_categories WHERE account_id = 'shopifytest_62c44e48be54d0002900bd62' AND name = '';
account_id | name | id | breadcrumb | parent_id
------------+------+----+------------+-----------
(0 rows)
How come my program fails to give me a response and seems to hang forever?
I'm currently running the ANALYZE command for a particular table and can see the statistics being printed in the Spark console.
However, when I try to capture the output in a DataFrame, I cannot see the statistics.
Spark version: 1.6.3
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS").collect()
Output in the Spark shell:
Partition sample{company=aaa, market=aab, etdate=2019-01-03, p=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
19/03/22 02:49:33 INFO Task: Partition sample{company=aaa, market=aab, edate=2019-01-03, pdate=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
Output of the DataFrame:
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=runTasks start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=Driver.execute start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO Driver: OK
19/03/22 02:49:40 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
19/03/22 02:49:40 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 940 bytes result sent to driver
19/03/22 02:49:40 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 4 ms on localhost (1/1)
19/03/22 02:49:40 INFO DAGScheduler: ResultStage 2 (show at <console>:47) finished in 0.004 s
19/03/22 02:49:40 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/22 02:49:40 INFO DAGScheduler: Job 2 finished: show at <console>:47, took 0.007774 s
+------+
|result|
+------+
+------+
Could you please let me know how to get the same statistics output into the DataFrame?
Thanks!
If you want to print the result from a DataFrame the way you are doing, you can use:
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS")
a.select("*").show()
I have a Spark job that reads from a database and performs a filter, a union, and two joins, and finally writes the result back to the database.
However, the last stage runs only one task on just one executor, out of 50 executors. I've tried increasing the number of partitions and using hash partitioning, but no luck.
After several hours of Googling, it seems my data may be skewed, but I don't know how to fix it.
Any suggestions?
Specs:
Standalone cluster
120 cores
400G Memory
Executors:
30 executors (4 cores/executor)
13G per executor
4G driver memory
Code snippet
...
def main(args: Array[String]) {
....
import sparkSession.implicits._
val similarityDs = sparkSession.read.format("jdbc").options(opts).load
similarityDs.createOrReplaceTempView("locator_clusters")
val ClassifierDs = sparkSession.sql("select * " +
"from locator_clusters where " +
"relative_score >= 0.9 and " +
"((content_hash_id is not NULL or content_hash_id <> '') " +
"or (ref_hash_id is not NULL or ref_hash_id <> ''))").as[Hash].cache()
def nnHash(tag: String) = (tag.hashCode & 0x7FFFFF).toLong
val contentHashes = ClassifierDs.map(locator => (nnHash(locator.app_hash_id), Member(locator.app_hash_id,locator.app_hash_id, 0, 0, 0))).toDF("id", "member").dropDuplicates().alias("ch").as[IdMember]
val similarHashes = ClassifierDs.map(locator => (nnHash(locator.content_hash_id), Member(locator.app_hash_id, locator.content_hash_id, 0, 0, 0))).toDF("id", "member").dropDuplicates().alias("sh").as[IdMember]
val missingContentHashes = similarHashes.join(contentHashes, similarHashes("id") === contentHashes("id"), "right_outer").select("ch.*").toDF("id", "member").as[IdMember]
val locatorHashesRdd = similarHashes.union(missingContentHashes).cache()
val vertices = locatorHashesRdd.map{ case row: IdMember=> (row.id, row.member) }.cache()
val toHashId = udf(nnHash(_:String))
val edgesDf = ClassifierDs.select(toHashId($"app_hash_id"), toHashId($"content_hash_id"), $"raw_score", $"relative_score").cache()
val edges = edgesDf.map(e => Edge(e.getLong(0), e.getLong(1), (e.getDouble(2), e.getDouble(2)))).cache()
val graph = Graph(vertices.rdd, edges.rdd).cache()
val sc = sparkSession.sparkContext
val ccVertices = graph.connectedComponents.vertices.cache()
val ccByClusters = vertices.rdd.join(ccVertices).map({
case (id, (hash, compId)) => (compId, hash.content_hash_id, hash.raw_score, hash.relative_score, hash.size)
}).toDF("id", "content_hash_id", "raw_score", "relative_score", "size").alias("cc")
val verticesDf = vertices.map(x => (x._1, x._2.app_hash_id, x._2.content_hash_id, x._2.raw_score, x._2.relative_score, x._2.size))
.toDF("id", "app_hash_id", "content_hash_id", "raw_score", "relative_score", "size").alias("v")
val superClusters = verticesDf.join(ccByClusters, "id")
.select($"v.app_hash_id", $"v.app_hash_id", $"cc.content_hash_id", $"cc.raw_score", $"cc.relative_score", $"cc.size")
.toDF("ref_hash_id", "app_hash_id", "content_hash_id", "raw_score", "relative_score", "size")
val prop = new Properties()
prop.setProperty("user", M_DB_USER)
prop.setProperty("password", M_DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
superClusters.write
.mode(SaveMode.Append)
.jdbc(s"jdbc:postgresql://$M_DB_HOST:$M_DB_PORT/$M_DATABASE", MERGED_TABLE, prop)
sparkSession.stop()
Screenshot showing one executor
Stderr from the executor
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Getting 409 non-empty blocks out of 2000 blocks
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Started 59 remote fetches in 5 ms
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Getting 2000 non-empty blocks out of 2000 blocks
16/10/01 18:53:42 INFO ShuffleBlockFetcherIterator: Started 59 remote fetches in 9 ms
16/10/01 18:53:43 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 896.0 MB to disk (1 time so far)
16/10/01 18:53:46 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 896.0 MB to disk (2 times so far)
16/10/01 18:53:48 INFO Executor: Finished task 1906.0 in stage 769.0 (TID 260306). 3119 bytes result sent to driver
16/10/01 18:53:51 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (3 times so far)
16/10/01 18:53:57 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (4 times so far)
16/10/01 18:54:03 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (5 times so far)
16/10/01 18:54:09 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (6 times so far)
16/10/01 18:54:15 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (7 times so far)
16/10/01 18:54:21 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (8 times so far)
16/10/01 18:54:27 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (9 times so far)
16/10/01 18:54:33 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (10 times so far)
16/10/01 18:54:39 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (11 times so far)
16/10/01 18:54:44 INFO UnsafeExternalSorter: Thread 123 spilling sort data of 1792.0 MB to disk (12 times so far)
If data skew is indeed the problem here, and all keys hash to a single partition, then what you can try is either a full Cartesian product or a broadcast join with prefiltered data. Let's consider the following example:
val left = spark.range(1L, 100000L).select(lit(1L), rand(1)).toDF("k", "v")
left.select(countDistinct($"k")).show
// +-----------------+
// |count(DISTINCT k)|
// +-----------------+
// | 1|
// +-----------------+
Any attempt to join data like this would result in serious data skew. Now let's say we have another table, as follows:
val right = spark.range(1L, 100000L).select(
(rand(3) * 1000).cast("bigint"), rand(1)
).toDF("k", "v")
right.select(countDistinct($"k")).show
// +-----------------+
// |count(DISTINCT k)|
// +-----------------+
// | 1000|
// +-----------------+
As mentioned above, there are two methods we can try:
If we expect the number of records in right corresponding to the keys in left to be small, we can use a broadcast join:
type KeyType = Long
val keys = left.select($"k").distinct.as[KeyType].collect
val rightFiltered = broadcast(right.where($"k".isin(keys: _*)))
left.join(broadcast(rightFiltered), Seq("k"))
Otherwise we can perform a crossJoin followed by a filter:
left.as("left")
.crossJoin(rightFiltered.as("right"))
.where($"left.k" === $"right.k")
or
spark.conf.set("spark.sql.crossJoin.enabled", true)
left.as("left")
.join(rightFiltered.as("right"))
.where($"left.k" === $"right.k")
If there is a mix of rare and common keys, you can separate the computation by performing a standard join on the rare keys and using one of the methods shown above for the common ones, as sketched below.
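A minimal sketch of that split, assuming a hypothetical commonKeys sequence of frequent key values collected beforehand (e.g. from a count per key on left); the union of the two partial joins gives the full result:
// Assumption: commonKeys is small enough to collect and broadcast.
val commonKeys: Seq[Long] = Seq(1L)
val leftRare   = left.where(!$"k".isin(commonKeys: _*))
val leftCommon = left.where($"k".isin(commonKeys: _*))
// Standard shuffle join for the well-distributed keys...
val joinedRare = leftRare.join(right, Seq("k"))
// ...and a broadcast join for the skewed ones.
val joinedCommon = leftCommon.join(
  broadcast(right.where($"k".isin(commonKeys: _*))), Seq("k"))
val joined = joinedRare.union(joinedCommon)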
Another possible issue is the jdbc data source: if you don't provide predicates, or a partitioning column together with bounds and a number of partitions, all data is loaded through a single executor.
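For reference, a sketch of a partitioned jdbc read (the column name and bounds are illustrative assumptions; the partitioning column should be numeric and reasonably uniformly distributed):
// Assumption: "id" is a numeric column in the source table.
val similarityDs = sparkSession.read
  .format("jdbc")
  .options(opts)                      // url, dbtable, user, password, driver, ...
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "50")
  .load()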
I have a Spark Streaming use case where I plan to keep a dataset broadcast and cached on each executor. Every micro-batch in the stream creates a DataFrame out of the RDD and joins it with this dataset. My test code below performs the broadcast operation for each batch. Is there a way to broadcast it just once?
val testDF = sqlContext.read.format("com.databricks.spark.csv")
.schema(schema).load("file:///shared/data/test-data.txt")
val lines = ssc.socketTextStream("DevNode", 9999)
lines.foreachRDD((rdd, timestamp) => {
val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
val resultDF = recordDF.join(broadcast(testDF), "Age")
resultDF.write.format("com.databricks.spark.csv").save("file:///shared/data/output/streaming/"+timestamp)
})
For every batch, this file was read and the broadcast was performed:
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:24:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:27+28
16/02/18 12:25:00 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27
Any suggestions on how to broadcast the dataset only once?
It looks like broadcast tables are not reused, at least for now. See SPARK-3863.
Perform the broadcast outside the foreachRDD loop:
val testDF = broadcast(sqlContext.read.format("com.databricks.spark.csv")
.schema(schema).load(...))
lines.foreachRDD((rdd, timestamp) => {
val recordDF = ???
val resultDF = recordDF.join(testDF, "Age")
resultDF.write.format("com.databricks.spark.csv").save(...)
})
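If the per-batch file read is also a concern, a hedged variation is to cache the DataFrame once it is defined outside the loop, since broadcast() on a DataFrame is only a join-side hint and does not by itself prevent re-reading the source:
// Sketch: read once, cache, and hint the broadcast join per batch.
val testDF = sqlContext.read.format("com.databricks.spark.csv")
  .schema(schema).load("file:///shared/data/test-data.txt")
  .cache()
lines.foreachRDD((rdd, timestamp) => {
  val recordDF = rdd.map(_.split(",")).map(l => Record(l(0).toInt, l(1))).toDF()
  val resultDF = recordDF.join(broadcast(testDF), "Age")
  resultDF.write.format("com.databricks.spark.csv")
    .save("file:///shared/data/output/streaming/" + timestamp)
})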