Printing the output of Spark SQL to a DataFrame (Scala)

I'm currently running the ANALYZE command for a particular table and can see the statistics being printed in the Spark console.
However, when I try to write the output to a DataFrame, I cannot see the statistics.
Spark version: 1.6.3
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS").collect()
Output in the Spark shell:
Partition sample{company=aaa, market=aab, etdate=2019-01-03, p=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
19/03/22 02:49:33 INFO Task: Partition sample{company=aaa, market=aab, edate=2019-01-03, pdate=2019-01-10} stats: [numFiles=1, numRows=215, totalSize=7551, rawDataSize=461390]
Output of the DataFrame:
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=runTasks start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO PerfLogger: </PERFLOG method=Driver.execute start=1553237373445 end=1553237373606 duration=161 from=org.apache.hadoop.hive.ql.Driver>
19/03/22 02:49:33 INFO Driver: OK
19/03/22 02:49:40 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
19/03/22 02:49:40 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 940 bytes result sent to driver
19/03/22 02:49:40 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 4 ms on localhost (1/1)
19/03/22 02:49:40 INFO DAGScheduler: ResultStage 2 (show at <console>:47) finished in 0.004 s
19/03/22 02:49:40 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
19/03/22 02:49:40 INFO DAGScheduler: Job 2 finished: show at <console>:47, took 0.007774 s
+------+
|result|
+------+
+------+
Could you please let me know how to get the same statistics output into the DataFrame?
Thanks!

If you want to print from a DataFrame the way you are doing it, drop the collect() call (it returns an Array[Row] rather than a DataFrame) and call show() on the result:
val a : DataFrame = sqlContext.sql("ANALYZE TABLE sample PARTITION (company='aaa', market='aab', edate='2019-01-03', pdate='2019-01-10') COMPUTE STATISTICS")
a.select("*").show()
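Note that the result of ANALYZE TABLE is empty in the question's output; the statistics only show up in the log line. A minimal sketch, assuming that stats line has been captured as a plain string: the helper `PartitionStats.parse` below is hypothetical (it is not part of Spark) and simply turns the bracketed `key=value` list into a map.

```scala
object PartitionStats {
  // Matches the bracketed key=value list, e.g. "stats: [numFiles=1, numRows=215]".
  private val StatsBlock = """stats: \[([^\]]*)\]""".r

  // Returns the statistics as a name -> value map, or an empty map when the
  // line contains no "stats: [...]" block. Entries without "=" are skipped;
  // values are assumed to be integers (a non-numeric value would throw).
  def parse(line: String): Map[String, Long] =
    StatsBlock.findFirstMatchIn(line) match {
      case Some(m) =>
        m.group(1).split(",").toSeq.map(_.trim).flatMap { kv =>
          kv.split("=", 2) match {
            case Array(k, v) => Some(k -> v.toLong)
            case _           => None
          }
        }.toMap
      case None => Map.empty[String, Long]
    }
}
```

With a SQLContext at hand, something like `sqlContext.createDataFrame(stats.toSeq).toDF("stat", "value")` should then give a two-column DataFrame; that step is left out so the snippet runs without Spark.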

How do I handle a null value in a Scala UDF?

I understand that there are many SO answers related to what I am asking, but since I am very new to Scala, I am not able to understand those answers. I would really appreciate it if someone could help me correct my UDF.
I have this UDF which is meant to do the timezone conversion from GMT to MST:
val Gmt2Mst = (dtm_str: String, inFmt: String, outFmt: String) => {
if ("".equals(dtm_str) || dtm_str == null || dtm_str.length() < inFmt.length()) {
null
}
else {
val gmtZoneId = ZoneId.of("GMT", ZoneId.SHORT_IDS);
val mstZoneId = ZoneId.of("MST", ZoneId.SHORT_IDS);
val inFormatter = DateTimeFormatter.ofPattern(inFmt);
val outFormatter = DateTimeFormatter.ofPattern(outFmt);
val dateTime = LocalDateTime.parse(dtm_str, inFormatter);
val gmt = ZonedDateTime.of(dateTime, gmtZoneId)
val mst = gmt.withZoneSameInstant(mstZoneId)
mst.format(outFormatter)
}
}
spark.udf.register("Gmt2Mst", Gmt2Mst)
But whenever a NULL is encountered, it fails to handle it. I am trying to handle it using dtm_str == null, but it still fails. Can someone please tell me what to use instead of dtm_str == null to achieve my goal?
To give an example, if I run the Spark SQL below:
spark.sql("select null as col1, Gmt2Mst(null,'yyyyMMddHHmm', 'yyyyMMddHHmm') as col2").show()
I am getting this error:
22/09/20 14:10:31 INFO TaskSetManager: Starting task 101.1 in stage 27.0 (TID 1809) (10.243.37.204, executor 18, partition 101, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 100.0 in stage 27.0 (TID 1801) on 10.243.37.204, executor 18: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 1]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 100.1 in stage 27.0 (TID 1810) (10.243.37.241, executor 1, partition 100, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 102.0 in stage 27.0 (TID 1803) on 10.243.37.241, executor 1: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 2]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 102.1 in stage 27.0 (TID 1811) (10.243.36.183, executor 22, partition 102, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Finished task 80.0 in stage 27.0 (TID 1781) in 2301 ms on 10.243.36.183 (executor 22) (81/355)
22/09/20 14:10:31 INFO TaskSetManager: Starting task 108.0 in stage 27.0 (TID 1812) (10.243.36.156, executor 4, partition 108, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 INFO TaskSetManager: Lost task 103.0 in stage 27.0 (TID 1804) on 10.243.36.156, executor 4: org.apache.spark.SparkException (Failed to execute user defined function (anonfun$4: (string, string) => string)) [duplicate 3]
22/09/20 14:10:31 INFO TaskSetManager: Starting task 103.1 in stage 27.0 (TID 1813) (10.243.36.180, executor 9, partition 103, PROCESS_LOCAL, 5648 bytes) taskResourceAssignments Map()
22/09/20 14:10:31 WARN TaskSetManager: Lost task 105.0 in stage 27.0 (TID 1806) (10.243.36.180 executor 9): org.apache.spark.SparkException: Failed to execute user defined function (anonfun$3: (string, string) => string)
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:136)
at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
at org.commonspirit.sepsis_bio.recovery.SepsisRecoveryBundle$$anonfun$3.apply(SepsisRecoveryBundle.scala:123)
at org.commonspirit.sepsis_bio.recovery.SepsisRecoveryBundle$$anonfun$3.apply(SepsisRecoveryBundle.scala:122)
... 15 more
I did the following test and it seems to work:
Create a DataFrame with a null-typed column. The schema would be:
root
|-- v0: string (nullable = true)
|-- v1: string (nullable = true)
|-- null: null (nullable = true)
for example:
+----+-----+----+
| v0| v1|null|
+----+-----+----+
|hola|adios|null|
+----+-----+----+
Create the udf:
val udf1 = udf{ v1: Any => { if(v1 != null) s"${v1}_transformed" else null } }
Note that working with Any in Scala is bad practice, but this is Spark SQL, and to handle a value that could be of two different types you need to work with this supertype.
Register the udf:
spark.udf.register("udf1", udf1)
Create the view:
df2.createTempView("df2")
Apply the udf to the view:
spark.sql("select udf1(null) from df2").show()
it shows:
+---------+
|UDF(null)|
+---------+
| null|
+---------+
Apply to a column with not null value:
spark.sql("select udf1(v0) from df2").show()
it shows:
+----------------+
| UDF(v0)|
+----------------+
|hola_transformed|
+----------------+
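The same idea can be applied to the Gmt2Mst function from the question without resorting to Any. This is only a sketch under a few assumptions: the object name `TimeZoneUdf` is made up, and, unlike the original, unparseable input is turned into null instead of throwing. Wrapping the argument in `Option` makes the null case explicit.

```scala
import java.time.{LocalDateTime, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimeZoneUdf {
  private val gmtZone = ZoneId.of("GMT")
  // "MST" is resolved through SHORT_IDS to the fixed offset -07:00.
  private val mstZone = ZoneId.of("MST", ZoneId.SHORT_IDS)

  // Returns null (SQL NULL) for null, too-short, or unparseable input.
  val gmt2Mst: (String, String, String) => String = (dtm, inFmt, outFmt) =>
    Option(dtm)
      .filter(s => inFmt != null && outFmt != null && s.length >= inFmt.length)
      .flatMap { s =>
        Try {
          val local = LocalDateTime.parse(s, DateTimeFormatter.ofPattern(inFmt))
          ZonedDateTime.of(local, gmtZone)
            .withZoneSameInstant(mstZone)
            .format(DateTimeFormatter.ofPattern(outFmt))
        }.toOption
      }
      .orNull
}
```

It can be registered the same way as in the question, e.g. `spark.udf.register("Gmt2Mst", TimeZoneUdf.gmt2Mst)`.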

SQL request never ends

I am trying to get some data from my Cassandra database from within a program, but the request never completes.
My Spark configuration looks like this:
object ExternalConf {
var cassandraHost : String = "cassandra_cassandra-001,cassandra_cassandra-002,cassandra_cassandra-003,cassandra_cassandra-004"
var masterSpark: String ="local[*]"
}
object Spark {
val session : SparkSession = SparkSession
.builder()
.appName("KStreaming")
// .config("spark.cassandra.connection.host", ExternalConf.cassandraHost) //default value or args
.config("spark.cassandra.connection.host", "cassandra_node") //preprod
.config("spark.cassandra.auth.username", "cassandra")
.config("spark.cassandra.auth.password", "cassandra")
.config("output.batch.grouping.buffer.size", "50")
.config("output.batch.size.bytes", "102400")
.config("spark.driver.maxResultSize", "4g")
.config("spark.sql.broadcastTimeout", "1800")
.master(ExternalConf.masterSpark)
.getOrCreate();
session.sql("CREATE OR REPLACE TEMPORARY VIEW dbv2_product_categories USING org.apache.spark.sql.cassandra OPTIONS (table 'dbv2_product_categories', keyspace 'preprod', pushdown 'true')")
import session.implicits._
}
And the code that makes a SELECT:
def extractData(data: RDD[ConsumerRecord[String, String]]) = {
import Spark.session.implicits._
data
.foreach(message => {
var persistedProductCategory: DataFrame = Spark.session.sql("SELECT * FROM dbv2_product_categories WHERE account_id = '" + accountId + "' AND name = '" + shopifyProduct.product_type + "'")
})
}
The request never ends. Here is my stderr (I stripped the beginning of it to make it shorter):
22/08/29 09:50:04 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [kafka:9092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id = consumer-4
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = spark-executor--183698333
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = none
22/08/29 09:50:04 WARN ConsumerConfig: The configuration schema.registry.url = http://schema-registry:8081 was supplied but isn't a known config.
22/08/29 09:50:04 INFO AppInfoParser: Kafka version : 0.10.0.1
22/08/29 09:50:04 INFO AppInfoParser: Kafka commitId : a7a17cdec9eaa6c5
22/08/29 09:50:04 INFO CachedKafkaConsumer: Initial fetch for spark-executor--183698333 62c44e48be54d0002900bd62_products 0 0
22/08/29 09:50:04 INFO AbstractCoordinator: Discovered coordinator kafka:9092 (id: 2147483646 rack: null) for group spark-executor--183698333.
22/08/29 09:50:04 WARN SparkContext: Using an existing SparkContext; some configuration may not take effect.
22/08/29 09:50:04 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work/driver-20220829094956-7907/spark-warehouse').
22/08/29 09:50:04 INFO SharedState: Warehouse path is 'file:/opt/spark/work/driver-20220829094956-7907/spark-warehouse'.
22/08/29 09:50:05 INFO JobScheduler: Added jobs for time 1661766605000 ms
22/08/29 09:50:05 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
22/08/29 09:50:05 INFO SparkSqlParser: Parsing command: CREATE OR REPLACE TEMPORARY VIEW dbv2_product_categories USING org.apache.spark.sql.cassandra OPTIONS (table 'dbv2_product_categories', keyspace 'kiliba', pushdown 'true')
22/08/29 09:50:07 INFO ClockFactory: Using native clock to generate timestamps.
22/08/29 09:50:07 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
22/08/29 09:50:07 INFO Cluster: New Cassandra host cassandra_node/10.0.0.177:9042 added
22/08/29 09:50:07 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
22/08/29 09:50:08 INFO SparkSqlParser: Parsing command: SELECT * FROM dbv2_product_categories WHERE account_id = 'shopifytest_62c44e48be54d0002900bd62' AND name = ''
22/08/29 09:50:08 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(account_id), IsNotNull(name), EqualTo(account_id,shopifytest_62c44e48be54d0002900bd62), EqualTo(name,)]
22/08/29 09:50:08 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(account_id), IsNotNull(name), EqualTo(account_id,shopifytest_62c44e48be54d0002900bd62), EqualTo(name,)]
22/08/29 09:50:08 INFO CodeGenerator: Code generated in 175.144922 ms
22/08/29 09:50:08 INFO CodeGenerator: Code generated in 16.62666 ms
22/08/29 09:50:08 INFO SparkContext: Starting job: count at Kstreaming.scala:361
22/08/29 09:50:08 INFO DAGScheduler: Registering RDD 7 (count at Kstreaming.scala:361)
22/08/29 09:50:08 INFO DAGScheduler: Got job 1 (count at Kstreaming.scala:361) with 1 output partitions
22/08/29 09:50:08 INFO DAGScheduler: Final stage: ResultStage 2 (count at Kstreaming.scala:361)
22/08/29 09:50:08 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
22/08/29 09:50:08 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 1)
22/08/29 09:50:09 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[7] at count at Kstreaming.scala:361), which has no missing parents
22/08/29 09:50:09 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.8 KB, free 5.2 GB)
22/08/29 09:50:09 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.8 KB, free 5.2 GB)
22/08/29 09:50:09 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.229:43123 (size: 7.8 KB, free: 5.2 GB)
22/08/29 09:50:09 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
22/08/29 09:50:09 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[7] at count at Kstreaming.scala:361) (first 15 tasks are for partitions Vector(0))
22/08/29 09:50:09 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
22/08/29 09:50:10 INFO JobScheduler: Added jobs for time 1661766610000 ms
22/08/29 09:50:15 INFO JobScheduler: Added jobs for time 1661766615000 ms
22/08/29 09:50:16 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
22/08/29 09:50:20 INFO JobScheduler: Added jobs for time 1661766620000 ms
22/08/29 09:50:25 INFO JobScheduler: Added jobs for time 1661766625000 ms
22/08/29 09:50:30 INFO JobScheduler: Added jobs for time 1661766630000 ms
22/08/29 09:50:35 INFO JobScheduler: Added jobs for time 1661766635000 ms
22/08/29 09:50:40 INFO JobScheduler: Added jobs for time 1661766640000 ms
22/08/29 09:50:45 INFO JobScheduler: Added jobs for time 1661766645000 ms
22/08/29 09:50:50 INFO JobScheduler: Added jobs for time 1661766650000 ms
22/08/29 09:50:55 INFO JobScheduler: Added jobs for time 1661766655000 ms
22/08/29 09:51:00 INFO JobScheduler: Added jobs for time 1661766660000 ms
22/08/29 09:51:05 INFO JobScheduler: Added jobs for time 1661766665000 ms
22/08/29 09:51:10 INFO JobScheduler: Added jobs for time 1661766670000 ms
22/08/29 09:51:15 INFO JobScheduler: Added jobs for time 1661766675000 ms
22/08/29 09:51:20 INFO JobScheduler: Added jobs for time 1661766680000 ms
22/08/29 09:51:25 INFO JobScheduler: Added jobs for time 1661766685000 ms
22/08/29 09:51:30 INFO JobScheduler: Added jobs for time 1661766690000 ms
When I execute the SQL command directly in cqlsh, it works and I get a result almost instantly:
cqlsh:kiliba> SELECT * FROM dbv2_product_categories WHERE account_id = 'shopifytest_62c44e48be54d0002900bd62' AND name = '';
account_id | name | id | breadcrumb | parent_id
------------+------+----+------------+-----------
(0 rows)
How come my program fails to give me a response and seems to hang forever?

Spark SQL freeze

I have a problem with Spark SQL. I read some data from CSV files, then do groupBy and join operations, and finally write the joined data to a file. My problem is a time gap in the log below (between the entries at 23:39:54 and 00:14:22).
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1069
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1003
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 965
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1073
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1038
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 900
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 903
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 938
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on 10.4.110.24:36423 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO storage.BlockManagerInfo: Removed broadcast_84_piece0 on omm104.in.nawras.com.om:43133 in memory (size: 32.8 KB, free: 4.1 GB)
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 969
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1036
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 970
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1006
18/08/07 23:39:40 INFO spark.ContextCleaner: Cleaned accumulator 1039
18/08/07 23:39:47 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
18/08/07 23:39:54 INFO parquet.ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Post-Scan Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Output Data Schema: struct<_c0: string, _c1: string, _c2: string, _c3: string, _c4: string ... 802 more fields>
18/08/08 00:14:22 INFO execution.FileSourceScanExec: Pushed Filters:
18/08/08 00:14:22 INFO datasources.FileSourceStrategy: Pruning directories with:
The DataFrames are small: ~5000 records and ~800 columns.
I am using the following code:
val parentDF = ...
val childADF = ...
val childBDF = ...
val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"
val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")
val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)
val childADataFrame = childADF
.select(nestedAColumns: _*)
.repartition(keyColumns.map(col): _*)
.groupBy(keyColumns.map(col): _*)
.agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")
val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)
val childBDataFrame = childBDF
.select(nestedBColumns: _*)
.repartition(keyColumns.map(col): _*)
.groupBy(keyColumns.map(col): _*)
.agg(collect_list(aggregatedBColName).alias(aggregatedBColName))
val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")
The processing time for 30 files (~85k records in total) is strangely high: ~38 min.
Have you ever seen a similar problem?
Try to avoid the repartition() call, as it causes unnecessary data movement between the nodes.
According to Learning Spark:
Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
Put simply, coalesce() can only decrease the number of partitions; it avoids a full shuffle by merging existing partitions.
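As a rough sketch of that advice applied to the code in the question (assuming Spark 2.x; the helper name `NestChildren.nest` and the sample data are made up): groupBy followed by agg already shuffles by the grouping keys, so the explicit repartition() can simply be left out.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object NestChildren {
  // Groups child rows by the key columns and collects them into one array
  // column, without an explicit repartition() before the groupBy.
  def nest(child: DataFrame, keys: Seq[String], cols: Seq[String], alias: String): DataFrame = {
    val nested = keys.map(col) :+ struct(cols.map(col): _*).alias(alias)
    child
      .select(nested: _*)
      .groupBy(keys.map(col): _*)
      .agg(collect_list(alias).alias(alias))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("nest-sketch").getOrCreate()
    import spark.implicits._
    // Tiny stand-in for childADF from the question.
    val childA = Seq(("k1", "a"), ("k1", "b"), ("k2", "c")).toDF("key_col_0", "val_0")
    nest(childA, Seq("key_col_0"), Seq("key_col_0", "val_0"), "CHILD_A").show(false)
    spark.stop()
  }
}
```

Whether this removes the long gap in the log is not certain; it only takes the extra repartition step out of the plan.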

dataframe map and hivecontext issue

Env: Spark 1.6 and Scala
Hi,
I have a DataFrame and tried to run:
val configTable= hivecontext.table("mydb.myTable")
configTable.rdd.map(row=>{
val abc =hivecontext.sql("select count(*) as num_rows from mydb2.mytable2")
}).collect()
I am getting an exception:
17/03/28 22:47:04 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
Is it not allowed to use Spark SQL inside rdd.map? Is there any workaround for this?
Thanks

ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data

I'm attempting to use spark-avro with Google Analytics avro data files, from one of our clients. Also I'm new to spark/scala, so my apologies if I've got anything wrong or done anything stupid. I'm using Spark 1.3.1.
I'm experimenting with the data in the spark-shell which I'm kicking off like this:
spark-shell --packages com.databricks:spark-avro_2.10:1.0.0
Then I'm running the following commands:
import com.databricks.spark.avro._
import scala.collection.mutable._
val gadata = sqlContext.avroFile("[client]/data")
gadata: org.apache.spark.sql.DataFrame = [visitorId: bigint, visitNumber: bigint, visitId: bigint, visitStartTime: bigint, date: string, totals: struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint>, trafficSource: struct<referralPath:string,campaign:string,source:string,medium:string,keyword:string,adContent:string>, device: struct<browser:string,browserVersion:string,operatingSystem:string,operatingSystemVersion:string,isMobile:boolean,mobileDeviceBranding:string,flashVersion:string,javaEnabled:boolean,language:string,screenColors:string,screenResolution:string,deviceCategory:string>, geoNetwork: str...
val gaIds = gadata.map(ga => ga.getString(11)).collect()
I get the following error:
[Stage 2:=> (8 + 4) / 430]15/05/14 11:14:04 ERROR Executor: Exception in task 12.0 in stage 2.0 (TID 27)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 WARN TaskSetManager: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:14:04 ERROR TaskSetManager: Task 12 in stage 2.0 failed 1 times; aborting job
15/05/14 11:14:04 WARN TaskSetManager: Lost task 11.0 in stage 2.0 (TID 26, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 25, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 9.0 in stage 2.0 (TID 24, localhost): TaskKilled (killed intentionally)
15/05/14 11:14:04 WARN TaskSetManager: Lost task 13.0 in stage 2.0 (TID 28, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 2.0 failed 1 times, most recent failure: Lost task 12.0 in stage 2.0 (TID 27, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
I thought this might be to do with the index I was using, but the following statement works OK.
scala> gadata.first().getString(11)
res12: String = 29456309767885
So I thought that maybe some of the records might be empty or have a different number of columns, so I ran the following statement to get a list of all the record lengths:
scala> gadata.map(ga => ga.length).collect()
But I get a similar error:
[Stage 4:=> (8 + 4) / 430]15/05/14 11:20:04 ERROR Executor: Exception in task 12.0 in stage 4.0 (TID 42)
java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 WARN TaskSetManager: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
15/05/14 11:20:04 ERROR TaskSetManager: Task 12 in stage 4.0 failed 1 times; aborting job
15/05/14 11:20:04 WARN TaskSetManager: Lost task 11.0 in stage 4.0 (TID 41, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 ERROR Executor: Exception in task 13.0 in stage 4.0 (TID 43)
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 9.0 in stage 4.0 (TID 39, localhost): TaskKilled (killed intentionally)
15/05/14 11:20:04 WARN TaskSetManager: Lost task 10.0 in stage 4.0 (TID 40, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 4.0 failed 1 times, most recent failure: Lost task 12.0 in stage 4.0 (TID 42, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Is this an issue with Spark-Avro or with Spark?
Not sure what the underlying issue was, but I've managed to fix the error by breaking up my data into monthly sets. I had 4 months' worth of GA data in a single folder and was operating on all the data. The data ranged from 70MB to 150MB per day.
After creating 4 folders for January, February, March & April and loading them individually, the map succeeds without any issues. Once loaded, I can join the data sets together (only tried two so far) and work on them without issue.
I'm using Spark on a Pseudo Hadoop distribution, not sure if this makes a difference to the volume of data Spark can handle.
UPDATE:
Found the root issue with the error. I loaded up each month's data and printed out the schema. January and February are identical, but after that a field goes walkabout in the March and April schemas:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- date: string (nullable = true)
|-- totals: struct (nullable = true)
| |-- visits: long (nullable = true)
| |-- hits: long (nullable = true)
| |-- pageviews: long (nullable = true)
| |-- timeOnSite: long (nullable = true)
| |-- bounces: long (nullable = true)
| |-- transactions: long (nullable = true)
| |-- transactionRevenue: long (nullable = true)
| |-- newVisits: long (nullable = true)
| |-- screenviews: long (nullable = true)
| |-- uniqueScreenviews: long (nullable = true)
| |-- timeOnScreen: long (nullable = true)
| |-- totalTransactionRevenue: long (nullable = true)
(snipped)
After February, the totalTransactionRevenue field at the bottom is no longer present. So I assume this is causing the error and is related to this issue
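
A quick way to spot this kind of drift is to compare the field names of two schemas. A small sketch, assuming the names have already been pulled out as plain sequences; the helper `SchemaDiff.missingFields` is hypothetical and not part of Spark or spark-avro.

```scala
object SchemaDiff {
  // Fields present in `reference` but absent from `other`, in reference order.
  // In Spark the top-level names could come from df.schema.fieldNames; for a
  // nested struct such as `totals`, from the StructType of that field.
  def missingFields(reference: Seq[String], other: Seq[String]): Seq[String] = {
    val present = other.toSet
    reference.filterNot(present)
  }
}
```

Called with the January `totals` fields as the reference and the March ones as `other`, this would list totalTransactionRevenue as missing.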