Spark write DataFrame to Vertica giving error - Scala

I tried to write a DataFrame to Vertica using the following documentation provided by Vertica, and it worked: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm?tocpath=Integrating%20with%20Apache%20Spark%7CSaving%20an%20Apache%20Spark%20DataFrame%20to%20a%20Vertica%20Table%7C_____1
The DataFrame gets written into the table once the shell is started with the required libraries.
But when I run the exact same code from IntelliJ, i.e. outside the spark-shell, I get errors.
The code is:
// imports required when running outside the spark-shell;
// sc (SparkContext) and conf (SparkConf) are assumed to be already defined
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}

val rows: RDD[Row] = sc.parallelize(Array(
  Row(1, "hello", true),
  Row(2, "goodbye", false)
))
val schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("sings", StringType, true),
  StructField("still_here", BooleanType, true)
))
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.createDataFrame(rows, schema) // Spark 2.0

// View the sample data and schema
df.show
df.schema

// Setup the user options, defaults are shown where applicable for optional values.
// Replace the values with the settings for your Vertica instance.
val opts: Map[String, String] = Map(
  "table" -> "signs",
  "db" -> "dbadmin",
  "user" -> "dbadmin",
  "password" -> "password",
  "host" -> "localhost",
  "hdfs_url" -> "hdfs://localhost:9000/user",
  "web_hdfs_url" -> "webhdfs://localhost:9870/user",
  // "failed_rows_percent_tolerance"-> "0.00" // OPTIONAL (default val shown)
  "dbschema" -> "public" // OPTIONAL (default val shown)
  // "port" -> "5433" // OPTIONAL (default val shown)
  // "strlen" -> "1024" // OPTIONAL (default val shown)
  // "fileformat" -> "orc" // OPTIONAL (default val shown)
)

// SaveMode can be either Overwrite, Append, ErrorIfExists, Ignore
val mode = SaveMode.Append
df
  .write
  .format("com.vertica.spark.datasource.DefaultSource")
  .options(opts)
  .mode(mode)
  .save()
This is the same code as provided in the documentation, and yet the error below comes up.
I have set up both HDFS and Vertica.
The question is: if it works as expected from the spark-shell, why does it not work outside of it?
20/04/27 01:55:50 INFO S2V: Load by name. Column list: ("name","miles_per_gallon","cylinders","displacement","horsepower","weight_in_lbs","acceleration","year","origin")
20/04/27 01:55:50 INFO S2V: Writing intermediate data files to path: hdfs://localhost:9000/user/S2V_job2509086937642333836
20/04/27 01:55:50 ERROR S2VUtils: Unable to delete the HDFS path: hdfs://localhost:9000/user/S2V_job2509086937642333836
20/04/27 01:55:50 ERROR S2V: Failed to save DataFrame to Vertica table: second0.car with SaveMode: Append
20/04/27 01:55:50 ERROR JobScheduler: Error running job streaming job 1587932740000 ms.2
java.lang.Exception: S2V: FATAL ERROR for job S2V_job2509086937642333836. Job status information is available in the Vertica table second0.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table: second0.car with SaveMode: Append. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job2509086937642333836. Unable to save intermediate orc files to HDFS path: hdfs://localhost:9000/user/S2V_job2509086937642333836. Error message: The ORC data source must be used with Hive support enabled;
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:446)
at com.vertica.spark.s2v.S2V.save(S2V.scala:496)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at replica_nimble_spark.SparkVerticaHelper$$anonfun$applyPipeline$1$$anonfun$apply$3.apply(SparkVerticaHelper.scala:85)
at replica_nimble_spark.SparkVerticaHelper$$anonfun$applyPipeline$1$$anonfun$apply$3.apply(SparkVerticaHelper.scala:76)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The question is: if it works as expected from the spark-shell, why does it not work outside of it?
The answer is in your error message:
Error message: The ORC data source must be used with Hive support enabled;
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:446)
This means you have to enable Hive support, as in the following example, to fix the error.
val spark = SparkSession
  .builder()
  .appName("Mr potterpod wants to test spark hive support")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // this is what I was talking about
  .getOrCreate()
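Applied to the builder from your snippet, it would look roughly like this (keeping your existing conf object):
val spark = SparkSession.builder()
  .config(conf)
  .enableHiveSupport() // enables the ORC data source the Vertica connector uses for its intermediate files
  .getOrCreate()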
Why does it work from the spark-shell?
Answer: spark-shell enables Hive support by default in Spark 2.0 and later.
Proof:
To check the default behaviour, open spark-shell without any options and run this:
scala> spark.sparkContext.getConf.get("spark.sql.catalogImplementation")
res3: String = hive
If you want to verify this by disabling Hive support in spark-shell, use the spark.sql.catalogImplementation property (its valid values are in-memory and hive):
spark-shell --conf spark.sql.catalogImplementation=in-memory
Then you will hit the same error in spark-shell as well.
Further reading: How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

Related

Not able to create a table locally, getting "Hive support is required"

I am getting the error even after setting config("spark.sql.catalogImplementation","hive"):
override def beforeAll(): Unit = {
  super[SharedSparkContext].beforeAll()
  SparkSessionProvider._sparkSession = SparkSession.builder()
    .master("local[*]")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
}
Edited:
This is how I am setting up my local DB and tables for testing:
val stgDb = "test_stagingDB"
val stgTbl_exp = "test_stagingDB_expected"
val stgTbl_result = "test_stg_table_result"
val trgtDb = "test_activeDB"
val trgtTbl_exp = "test_activeDB_expected"
val trgtTbl_result = "test_activeDB_results"

def setUpDb = {
  println("Set up DB started")
  val localPath = "file:/C:/Users/vmurthyms/Code-prdb/prdb/com.rxcorp.prdb"
  spark.sql(s"CREATE DATABASE IF NOT EXISTS test_stagingDB LOCATION '$localPath/test_stagingDB.db'")
  spark.sql(s"CREATE DATABASE IF NOT EXISTS test_activeDB LOCATION '$localPath/test_activeDB.db'")
  spark.sql(s"CREATE TABLE IF NOT EXISTS $trgtDb.${trgtTbl_exp}_ina (Id String, Name String)")
  println("Set up DB done")
}
setUpDb
While running the spark.sql("CREATE TABLE ...") command, I am getting the error below:
Error:
Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:392)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:390)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:390)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:388)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.setUpDb$1(SitecoreAPIExtractTest.scala:127)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.apply$mcV$sp(SitecoreAPIExtractTest.scala:130)
It seems you are almost there (your error message is also giving you the clue): you need to call enableHiveSupport() when you are creating the Spark session, e.g.
SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalogImplementation", "hive")
  .enableHiveSupport()
  .getOrCreate()
Also, when you use enableHiveSupport(), setting config("spark.sql.catalogImplementation","hive") looks redundant; I think you can safely drop that part.
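Applied to the beforeAll from your question, that might look roughly like this (assuming the same SharedSparkContext / SparkSessionProvider test scaffolding):
override def beforeAll(): Unit = {
  super[SharedSparkContext].beforeAll()
  SparkSessionProvider._sparkSession = SparkSession.builder()
    .master("local[*]")
    .enableHiveSupport() // replaces the catalogImplementation config
    .getOrCreate()
}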

Execute a Python-based model in a Scala-based Spark Structured Streaming program

I have a Scala-based Structured Streaming program that needs to execute a Python-based model.
In a previous version of Spark (1.6.x), I used to do that by converting the DStream to an RDD and then invoking the rdd.pipe method.
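Roughly, the old DStream version looked like this (a from-memory sketch with illustrative paths and names, not my actual job):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// hypothetical source; the real job used a receiver-based stream
val lines = ssc.textFileStream("/Users/user/Desktop/spark_tutorial/")
// pipe each micro-batch RDD through the external Python model
val scored = lines.transform(rdd => rdd.pipe("/Users/user/Desktop/test.py"))
scored.print()
ssc.start()
ssc.awaitTermination()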
However, this approach does not work with Structured Streaming. It gives the following error:
Queries with streaming sources must be executed with writeStream.start()
The snippet of code is as follows:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.streaming.OutputMode

val sourceDF = spark.readStream.option("header", "true").schema(schema).csv("/Users/user/Desktop/spark_tutorial/")
val rdd: RDD[String] = sourceDF.rdd.map(row => row.mkString(","))
val pipedRDD: RDD[String] = rdd.pipe("/Users/user/Desktop/test.py")
val rowRDD: RDD[Row] = pipedRDD.map(row => Row.fromSeq(row.split(",")))
val newSchema = <code to create new schema>
val newDF = spark.createDataFrame(rowRDD, newSchema)
val query = newDF.writeStream.format("console").outputMode(OutputMode.Append()).start()
query.awaitTermination()
The Exception stack trace:
19/01/22 00:10:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[/Users/user/Desktop/spark_tutorial/]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2975)
at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2973)
at Test$.main(Test.scala:20)
at Test.main(Test.scala)
Any suggestions?

How to write a Spark DataFrame into HBase?

I'm trying to write a Spark DataFrame into HBase and have followed several blogs, one of which is this, but it's not working.
However, I can successfully read data from HBase as a DataFrame. Also, some posts use the org.apache.hadoop.hbase.spark format and others org.apache.spark.sql.execution.datasources.hbase; I'm not sure which one to use. Versions: Spark 2.2.2, HBase 1.4.7, Scala 2.11.12, and Hortonworks SHC 1.1.0-2.1-s_2.11 from here.
The code is as follows:
// this has been defined outside of the object scope
case class UserMessageRecord(
  rowkey: String,
  Name: String,
  Number: String,
  message: String,
  lastTS: String
)

val exmple = List(UserMessageRecord("86325980047644033486", "enrique", "123455678", msgTemplate, timeStamp))

import spark.sqlContext.implicits._
val userDF = exmple.toDF()

// write to HBase
userDF.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save() // exception here

// read from HBase and it's working fine
def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}
val df = withCatalog(catalog)
df.show()
Here's the exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:122)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:76)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.insert(HBaseRelation.scala:218)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at HbaseConnectionTest.HbaseLoadUsingSpark$.main(HbaseLoadUsingSpark.scala:85)
at HbaseConnectionTest.HbaseLoadUsingSpark.main(HbaseLoadUsingSpark.scala)
As discussed over here, I made an additional configuration change to the SparkSession builder and the exception is gone. However, I am still not clear on the cause and the fix.
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("HbaseSparkWrite")
  // disabling output-spec validation skips TableOutputFormat.checkOutputSpecs,
  // the frame where the NullPointerException was being thrown
  .config("spark.hadoop.validateOutputSpecs", false)
  .getOrCreate()

Spark Structured Streaming + Kafka: Stop query from crashing when Kafka message doesn't match JSON schema

Kind of an offshoot of a post I made a month ago.
I have a Spark Structured Streaming application that reads from Kafka. Here is the basic structure of my code.
I create the Spark session:
val spark = SparkSession
  .builder
  .appName("app_name")
  .getOrCreate()
Then I read from the stream:
val data_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server_list")
  .option("subscribe", "topic")
  .load()
From the Kafka record, I cast the "value" field to a string, which converts it from binary:
val df = data_stream
  .select($"value".cast("string") as "json")
Based on a pre-defined schema, I try to parse the JSON structure out into columns. The problem is that if the data is "bad" or in a different format, it doesn't match the defined schema. I need to filter out rows that do not match my schema, whether they are null, numbers, or some random text like "hello". If a value is not JSON, it should not proceed to the next DataFrame step.
val df2 = df.select(from_json($"json", schema) as "data")
  .select("data.*")
If I pass an empty Kafka message through the console producer, the Spark query crashes with:
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.expressions.JsonToStruct.nullSafeEval(jsonExpressions.scala:500)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:325)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
Source)
at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:219)
at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:218)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
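For reference, the kind of guard I have in mind would look something like this (just a sketch; I'm not sure whether from_json reliably returns null for malformed JSON on my Spark version):
import org.apache.spark.sql.functions.{from_json, length, trim}

// drop empty/whitespace-only payloads before parsing, then drop anything
// from_json could not map onto the schema (null struct)
val df2 = df
  .filter(length(trim($"json")) > 0)
  .select(from_json($"json", schema) as "data")
  .filter($"data".isNotNull)
  .select("data.*")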

Joining two DataFrames sporadically throws "java.lang.IllegalArgumentException: spark.sql.execution.id is already set"

I am joining two DataFrames, created by reading two very large CSV files, to compute some statistics. The code runs on a web server and is triggered by a request, which is why the Spark session is always kept alive without calling sparkSession.close().
Sporadically, the code throws java.lang.IllegalArgumentException: spark.sql.execution.id is already set. I tried to make sure that the code doesn't get executed more than once at a time, but the problem wasn't resolved.
I am using Spark 2.1.0 and I know there is an issue here which will hopefully be resolved in Spark 2.2.0.
Could you please suggest any workarounds in the meantime to avoid this problem?
A simplified version of the code that throws the exception:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum, to_date, when}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType, TimestampType}

val spark = SparkSession.builder().appName("application").master("local[*]").getOrCreate()

val itemCountry = spark.read.format("csv")
  .option("header", "true")
  .schema(StructType(Array(
    StructField("itemId", IntegerType, false),
    StructField("countryId", IntegerType, false))))
  .csv("/item_country.csv") // This file matches the schema provided

val itemPerformance = spark.read.format("csv")
  .option("header", "true")
  .schema(StructType(Array(
    StructField("itemId", IntegerType, false),
    StructField("date", TimestampType, false),
    StructField("performance", IntegerType, false))))
  .csv("/item_performance.csv") // This file matches the schema provided

itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
  .groupBy("countryId")
  .agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")), itemPerformance("performance")).otherwise(0)).alias("performance"))
  .show()
The stack trace for the exception:
java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
at .... [Custom caller functions]
Sample CSV files:
item_country.csv
itemId,countryId
1,1
2,1
3,2
4,3
item_performance.csv
itemId,date,performance
1,2017-04-15,10
1,2017-04-16,10
1,2017-04-17,10
2,2017-04-15,15
3,2017-04-20,12
4,2017-04-18,18
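One workaround I have been considering, in case it helps frame an answer (just a sketch, not something I have verified): funnel every Spark action through a single dedicated thread, so that the spark.sql.execution.id local property set while serving one request can never leak into another.
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// all request handlers submit their Spark actions here instead of
// calling .show()/.collect() on their own (pooled) threads
implicit val sparkActionEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

def computeStats(): Future[Unit] = Future {
  itemCountry.join(itemPerformance, itemCountry("itemId") === itemPerformance("itemId"))
    .groupBy("countryId")
    .agg(sum(when(to_date(itemPerformance("date")) > to_date(lit("2017-01-01")),
      itemPerformance("performance")).otherwise(0)).alias("performance"))
    .show()
}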