I tried to write a DataFrame to Vertica using the following documentation provided by Vertica, and it worked: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm?tocpath=Integrating%20with%20Apache%20Spark%7CSaving%20an%20Apache%20Spark%20DataFrame%20to%20a%20Vertica%20Table%7C_____1
The DataFrame gets written into the table once the required libraries are loaded.
But when I run the exact same code from IntelliJ, i.e. outside the spark-shell, I get errors.
The code is:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types._

val rows: RDD[Row] = sc.parallelize(Array(
Row(1,"hello", true),
Row(2,"goodbye", false)
))
val schema = StructType(Array(
StructField("id",IntegerType, false),
StructField("sings",StringType,true),
StructField("still_here",BooleanType,true)
))
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.createDataFrame(rows, schema) // Spark 2.0

// View the sample data and schema
df.show
df.schema

// Set up the user options; defaults are shown where applicable for optional values.
// Replace the values in italics with the settings for your Vertica instance.
val opts: Map[String, String] = Map(
"table" -> "signs",
"db" -> "dbadmin",
"user" -> "dbadmin",
"password" -> "password",
"host" -> "localhost",
"hdfs_url" -> "hdfs://localhost:9000/user",
"web_hdfs_url" -> "webhdfs://localhost:9870/user",
// "failed_rows_percent_tolerance"-> "0.00" // OPTIONAL (default val shown)
"dbschema" -> "public" // OPTIONAL (default val shown)
// "port" -> "5433" // OPTIONAL (default val shown)
// "strlen" -> "1024" // OPTIONAL (default val shown)
// "fileformat" -> "orc" // OPTIONAL (default val shown)
)

// SaveMode can be either Overwrite, Append, ErrorIfExists, Ignore
val mode = SaveMode.Append
df
.write
.format("com.vertica.spark.datasource.DefaultSource")
.options(opts)
.mode(mode)
.save()
This is the same code as provided in the documentation, and yet the error shown below comes up.
I have set up my HDFS and Vertica.
The question is: if it works as expected from the spark-shell, why does it not work outside of it?
20/04/27 01:55:50 INFO S2V: Load by name. Column list: ("name","miles_per_gallon","cylinders","displacement","horsepower","weight_in_lbs","acceleration","year","origin")
20/04/27 01:55:50 INFO S2V: Writing intermediate data files to path: hdfs://localhost:9000/user/S2V_job2509086937642333836
20/04/27 01:55:50 ERROR S2VUtils: Unable to delete the HDFS path: hdfs://localhost:9000/user/S2V_job2509086937642333836
20/04/27 01:55:50 ERROR S2V: Failed to save DataFrame to Vertica table: second0.car with SaveMode: Append
20/04/27 01:55:50 ERROR JobScheduler: Error running job streaming job 1587932740000 ms.2
java.lang.Exception: S2V: FATAL ERROR for job S2V_job2509086937642333836. Job status information is available in the Vertica table second0.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table: second0.car with SaveMode: Append. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job2509086937642333836. Unable to save intermediate orc files to HDFS path: hdfs://localhost:9000/user/S2V_job2509086937642333836. Error message: The ORC data source must be used with Hive support enabled;
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:446)
at com.vertica.spark.s2v.S2V.save(S2V.scala:496)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at replica_nimble_spark.SparkVerticaHelper$$anonfun$applyPipeline$1$$anonfun$apply$3.apply(SparkVerticaHelper.scala:85)
at replica_nimble_spark.SparkVerticaHelper$$anonfun$applyPipeline$1$$anonfun$apply$3.apply(SparkVerticaHelper.scala:76)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The question is: if it works as expected from the spark-shell, why does it not work outside of it?
The answer is in your error message:
Error message: The ORC data source must be used with Hive support enabled;
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:446)
This means you have to enable Hive support, as in the following example, to fix the error.
val spark = SparkSession
.builder()
.appName("Mr potterpod wants to test spark hive support")
.master("local[*]")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport() // this is what I was talking about
.getOrCreate()
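Applied to the code in the question, the session creation would become roughly the following (a sketch; conf is the same SparkConf already used there):
val spark = SparkSession
  .builder()
  .config(conf)
  .enableHiveSupport() // needed because the connector stages intermediate ORC files
  .getOrCreate()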
Why does it work from the spark-shell?
Answer: spark-shell enables Hive support by default in Spark 2.0 and later.
Proof:
To test the default behaviour, open spark-shell without any options and run this:
scala> spark.sparkContext.getConf.get("spark.sql.catalogImplementation")
res3: String = hive
If you want to test this by disabling Hive support in spark-shell, use spark.sql.catalogImplementation (valid options for this property are in-memory or hive):
spark-shell --conf spark.sql.catalogImplementation=in-memory
Then you will hit the same error in spark-shell as well.
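You can confirm the override the same way (expected output shown):
scala> spark.sparkContext.getConf.get("spark.sql.catalogImplementation")
res0: String = in-memory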
Further reading: How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?
Related
Getting an error even after setting the configuration
config("spark.sql.catalogImplementation","hive")
override def beforeAll(): Unit = {
super[SharedSparkContext].beforeAll()
SparkSessionProvider._sparkSession = SparkSession.builder()
.master("local[*]")
.config("spark.sql.catalogImplementation","hive")
.getOrCreate()
}
Edited:
This is how I am setting up my local DB and tables for testing.
val stgDb = "test_stagingDB"
val stgTbl_exp ="test_stagingDB_expected"
val stgTbl_result="test_stg_table_result"
val trgtDb = "test_activeDB"
val trgtTbl_exp ="test_activeDB_expected"
val trgtTbl_result ="test_activeDB_results"
def setUpDb = {
println("Set up DB started")
val localPath="file:/C:/Users/vmurthyms/Code-prdb/prdb/com.rxcorp.prdb"
spark.sql(s"CREATE DATABASE IF NOT EXISTS test_stagingDB LOCATION '$localPath/test_stagingDB.db'")
spark.sql(s"CREATE DATABASE IF NOT EXISTS test_activeDB LOCATION '$localPath/test_sctiveDB.db'")
spark.sql(s"CREATE TABLE IF NOT EXISTS $trgtDb.${trgtTbl_exp}_ina (Id String, Name String)")
println("Set up DB done")
}
setUpDb
While running the spark.sql("CREATE TABLE ...") command, I am getting the error below:
Error:
Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:392)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:390)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:390)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:388)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.setUpDb$1(SitecoreAPIExtractTest.scala:127)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.apply$mcV$sp(SitecoreAPIExtractTest.scala:130)
It seems you are almost there (your error message is also giving you the clue): you need to call enableHiveSupport() when you are creating the Spark session. E.g.:
SparkSession.builder()
.master("local[*]")
.config("spark.sql.catalogImplementation","hive")
.enableHiveSupport()
.getOrCreate()
Also, when using enableHiveSupport(), setting config("spark.sql.catalogImplementation", "hive") looks redundant; I think you can safely remove that part.
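Applied to the beforeAll in the question, it would look roughly like this (a sketch keeping your SparkSessionProvider and SharedSparkContext setup):
override def beforeAll(): Unit = {
  super[SharedSparkContext].beforeAll()
  SparkSessionProvider._sparkSession = SparkSession.builder()
    .master("local[*]")
    .enableHiveSupport() // replaces config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
}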
I have a Scala-based Structured Streaming program that needs to execute a Python-based model.
In previous versions of Spark (1.6.x), I did this by converting the DStream to an RDD and then invoking the rdd.pipe method.
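Roughly, the old approach looked like this (just a sketch; the stream source and the output handling are placeholders, not the original code):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10)) // sc: an existing SparkContext
val input = ssc.textFileStream("/Users/user/Desktop/spark_tutorial/") // placeholder source
input.foreachRDD { rdd =>
  // pipe each record through the Python model via stdin/stdout
  val scored = rdd.pipe("/Users/user/Desktop/test.py")
  scored.collect().foreach(println) // placeholder for the real output handling
}
ssc.start()
ssc.awaitTermination()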
However, this approach does not work with Structured Streaming. It gives the following error:
Queries with streaming sources must be executed with writeStream.start()
The snippet of code is as follows:
val sourceDF = spark.readStream.option("header","true").schema(schema).csv("/Users/user/Desktop/spark_tutorial/")
val rdd: RDD[String] = sourceDF.rdd.map(row => row.mkString(","))
val pipedRDD: RDD[String] = rdd.pipe("/Users/user/Desktop/test.py")
import org.apache.spark.sql._
val rowRDD : RDD[Row] = pipedRDD.map(row => Row.fromSeq(row.split(",")))
val newSchema = <code to create new schema>
val newDF = spark.createDataFrame(rowRDD, newSchema)
val query = newDF.writeStream.format("console").outputMode(OutputMode.Append()).start
query.awaitTermination()
The Exception stack trace:
19/01/22 00:10:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[/Users/user/Desktop/spark_tutorial/]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2975)
at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2973)
at Test$.main(Test.scala:20)
at Test.main(Test.scala)
Any suggestions?
I'm trying to write a Spark DataFrame into HBase and have followed several other blogs, one of which is this, but it's not working.
However, I can read the data from HBase successfully as a DataFrame. Also, some posts use the org.apache.hadoop.hbase.spark format and others org.apache.spark.sql.execution.datasources.hbase; I'm not sure which one to use. Spark 2.2.2; HBase 1.4.7; Scala 2.11.12; and Hortonworks SHC 1.1.0-2.1-s_2.11 from here.
The code is as follows:
case class UserMessageRecord(
rowkey: String,
Name: String,
Number: String,
message: String,
lastTS: String
) // this has been defined outside of the object scope
val exmple = List(UserMessageRecord("86325980047644033486","enrique","123455678",msgTemplate,timeStamp))
import spark.sqlContext.implicits._
val userDF = exmple.toDF()
//write to HBase
userDF.write
.options(Map(HBaseTableCatalog.tableCatalog -> catalog))
.format("org.apache.spark.sql.execution.datasources.hbase").save() //exception here
//read from HBase and it's working fine
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df = withCatalog(catalog)
df.show()
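The catalog passed to .options(...) above is defined elsewhere; for context, it is an SHC JSON mapping from DataFrame columns to HBase column families. A hypothetical version matching UserMessageRecord could look like this (the table name and column family are assumptions, not the actual catalog):
val catalog = s"""{
  |"table":{"namespace":"default", "name":"user_messages"},
  |"rowkey":"key",
  |"columns":{
    |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    |"Name":{"cf":"cf1", "col":"Name", "type":"string"},
    |"Number":{"cf":"cf1", "col":"Number", "type":"string"},
    |"message":{"cf":"cf1", "col":"message", "type":"string"},
    |"lastTS":{"cf":"cf1", "col":"lastTS", "type":"string"}
  |}
|}""".stripMargin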
Here's the exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:122)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:76)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.insert(HBaseRelation.scala:218)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at HbaseConnectionTest.HbaseLoadUsingSpark$.main(HbaseLoadUsingSpark.scala:85)
at HbaseConnectionTest.HbaseLoadUsingSpark.main(HbaseLoadUsingSpark.scala)
As discussed over here, I made additional configuration changes to the SparkSession builder and the exception is gone. However, I am not clear on the cause and the fix.
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("HbaseSparkWrite")
.config("spark.hadoop.validateOutputSpecs", false)
.getOrCreate()
This is kind of an offshoot of a post I had a month ago.
I have a Spark Structured Streaming application that reads from Kafka. Here is the basic structure of my code.
I create the Spark session.
val spark = SparkSession
.builder
.appName("app_name")
.getOrCreate()
Then I read from the stream
val data_stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "server_list")
.option("subscribe", "topic")
.load()
From the Kafka record, I cast the "value" field to a string; this converts it from binary to string.
val df = data_stream
.select($"value".cast("string") as "json")
Based on a pre-defined schema, I try to parse the JSON structure into columns. The problem is that if the data is "bad" or in a different format, it doesn't match the defined schema. I need to filter out rows that do not match my schema, whether they are null, numbers, or some random text like "hello". If a record is not valid JSON, it should not proceed to the next DataFrame step.
val df2 = df.select(from_json($"json", schema) as "data")
.select("data.*")
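To make the intent concrete, the kind of filtering I have in mind is roughly this (a sketch reusing the same df and schema; in this Spark version, records that from_json cannot parse come back as a null struct):
val parsed = df
  .select(from_json($"json", schema) as "data")
  .filter($"data".isNotNull) // drop records that did not match the schema
  .select("data.*")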
If I pass in an empty Kafka message through the console producer, the Spark query crashes with:
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.expressions.JsonToStruct.nullSafeEval(jsonExpressions.scala:500)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:325)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:219)
at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:218)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
I am joining two DataFrames, created by reading two very large CSV files, to compute some statistics. The code runs on a web server and is triggered by requests, which is why the Spark session is always kept alive and sparkSession.close() is never called.
Sporadically, the code throws java.lang.IllegalArgumentException: spark.sql.execution.id is already set. I tried to make sure that the code doesn't get executed more than once at a time, but the problem wasn't resolved.
I am using Spark 2.1.0 and I know that there is an issue here which would hopefully be resolved in Spark 2.2.0.
Could you please suggest any workarounds in the meantime to avoid this problem?
A simplified version of the code that throws the exception:
val spark = SparkSession.builder().appName("application").master("local[*]").getOrCreate()
val itemCountry = spark.read.format("csv")
.option("header", "true")
.schema(StructType(Array(
StructField("itemId", IntegerType, false),
StructField("countryId", IntegerType, false))))
.csv("/item_country.csv") // This file matches the schema provided
val itemPerformance = spark.read.format("csv")
.option("header", "true")
.schema(StructType(Array(
StructField("itemId", IntegerType, false),
StructField("date", TimestampType, false),
StructField("performance", IntegerType, false))))
.csv("/item_performance.csv") // This file matches the schema provided
itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
.groupBy("countryId")
.agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")), itemPerformance("performance")).otherwise(0)).alias("performance")).show()
The stack trace for the exception:
java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at .... [Custom caller functions]
Sample CSV files:
item_country.csv
itemId,countryId
1,1
2,1
3,2
4,3
item_performance.csv
itemId,date,performance
1,2017-04-15,10
1,2017-04-16,10
1,2017-04-17,10
2,2017-04-15,15
3,2017-04-20,12
4,2017-04-18,18
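For reference, with these sample files the join/aggregation (when it completes without the exception) should print something like the following; row order may vary:
+---------+-----------+
|countryId|performance|
+---------+-----------+
|        1|         45|
|        2|         12|
|        3|         18|
+---------+-----------+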