Execute a Python-based model in a Scala-based Spark Structured Streaming program

I have a Scala-based Structured Streaming program that needs to execute a Python-based model.
In a previous version of Spark (1.6.x), I used to do that by converting the DStream to an RDD and then invoking the rdd.pipe method.
However, this approach does not work with Structured Streaming. It gives the following error:
Queries with streaming sources must be executed with writeStream.start()
The snippet of code is as follows:
val sourceDF = spark.readStream.option("header","true").schema(schema).csv("/Users/user/Desktop/spark_tutorial/")
val rdd: RDD[String] = sourceDF.rdd.map(row => row.mkString(","))
val pipedRDD: RDD[String] = rdd.pipe("/Users/user/Desktop/test.py")
import org.apache.spark.sql._
val rowRDD : RDD[Row] = pipedRDD.map(row => Row.fromSeq(row.split(",")))
val newSchema = <code to create new schema>
val newDF = spark.createDataFrame(rowRDD, newSchema)
val query = newDF.writeStream.format("console").outputMode(OutputMode.Append()).start
query.awaitTermination()
The Exception stack trace:
19/01/22 00:10:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[/Users/user/Desktop/spark_tutorial/]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:51)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:62)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:60)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:2975)
at org.apache.spark.sql.Dataset.rdd(Dataset.scala:2973)
at Test$.main(Test.scala:20)
at Test.main(Test.scala)
Any suggestions?
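For what it's worth, a minimal sketch of one possible workaround (not from the original post, and assuming Spark 2.4+ where foreachBatch is available): each micro-batch is handed to the callback as a static DataFrame, so rdd.pipe can be applied inside it without hitting the streaming-source restriction.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.streaming.OutputMode

// sourceDF, spark and newSchema are the values from the snippet above.
val query = sourceDF.writeStream
  .outputMode(OutputMode.Append())
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Inside foreachBatch the DataFrame is static, so .rdd is allowed.
    val piped = batchDF.rdd
      .map(_.mkString(","))
      .pipe("/Users/user/Desktop/test.py")
    val rowRDD = piped.map(line => Row.fromSeq(line.split(",")))
    spark.createDataFrame(rowRDD, newSchema).show() // replace with the real sink
  }
  .start()

query.awaitTermination()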

Related

Extracting data from spark streaming dataframe

I'm new to Spark Streaming.
I'm getting an event similar to below from Kafka. I have to extract the path from the dataframe, read the data from the path and write it to a destination.
{"path":["/tmp/file_path/file.parquet"],"format":"parquet","entries":null}
Any idea how to extract the path and format from the Spark streaming dataframe?
Here's what I'm trying to achieve:
val df: DataFrame = spark.readStream.format("kafka").
option("kafka.bootstrap.servers", "localhost:9092").
option("subscribe", "kafka-test-event").
option("startingOffsets", "earliest").load()
df.printSchema()
val valDf = df.selectExpr("CAST(value AS STRING)")
val path = valDf.collect()(0).getString(0)
println("path - "+ path)
val newDf = spark.read.parquet(path)
newDf.selectExpr("CAST(value AS STRING)").writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
When I try to do a collect on the dataframe it throws an Unsupported operation exception.
You are getting that error because you are trying to do static operations on a streaming dataframe.
You can try something like the following after you read the streaming dataframe from Kafka.
Create a schema for parsing the incoming data:
import org.apache.spark.sql.types.{StringType, StructType}

val incomingSchema = new StructType()
  .add("path", StringType)
  .add("format", StringType)
  .add("entries", StringType)
Apply that schema to the incoming data; you can then select the required fields and run transformations on top of them:
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val valDf = df.selectExpr("CAST(value AS STRING) AS jsonEntry")
  .select(from_json($"jsonEntry", incomingSchema).as("entry"))
  .select("entry.path")
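From there, a hedged sketch (not part of the original answer) of one way to read the referenced parquet files and forward them, assuming Spark 2.4+ where foreachBatch is available and using valDf from above; the destination path is hypothetical. Note that in the sample event path is actually a JSON array, so the schema above may need an ArrayType and an explode before this step.
import org.apache.spark.sql.DataFrame

valDf.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each micro-batch is a static DataFrame, so collect() and spark.read are allowed here.
    batch.collect().foreach { row =>
      val path = row.getString(0)
      spark.read.parquet(path).write.mode("append").parquet("/tmp/output") // hypothetical destination
    }
  }
  .start()
  .awaitTermination()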

Spark write dataframe to vertica giving error

I tried to write a dataframe to Vertica using the following documentation provided by Vertica, and it worked: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SparkConnector/WritingtoVerticaUsingDefaultSource.htm?tocpath=Integrating%20with%20Apache%20Spark%7CSaving%20an%20Apache%20Spark%20DataFrame%20to%20a%20Vertica%20Table%7C_____1
The dataframe gets written into the table after loading with the desired libraries.
Now when I run the exact same code from IntelliJ, i.e. not directly from the spark shell, I get errors.
The code is :
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types._

val rows: RDD[Row] = sc.parallelize(Array(
Row(1,"hello", true),
Row(2,"goodbye", false)
))
val schema = StructType(Array(
StructField("id",IntegerType, false),
StructField("sings",StringType,true),
StructField("still_here",BooleanType,true)
))
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.createDataFrame(rows, schema) // Spark 2.0
// View the sample data and schema
df.show
df.schema
// Setup the user options, defaults are shown where applicable for optional values.
// Replace the values in italics with the settings for your Vertica instance.
val opts: Map[String, String] = Map(
"table" -> "signs",
"db" -> "dbadmin",
"user" -> "dbadmin",
"password" -> "password",
"host" -> "localhost",
"hdfs_url" -> "hdfs://localhost:9000/user",
"web_hdfs_url" -> "webhdfs://localhost:9870/user",
// "failed_rows_percent_tolerance"-> "0.00" // OPTIONAL (default val shown)
"dbschema" -> "public" // OPTIONAL (default val shown)
// "port" -> "5433" // OPTIONAL (default val shown)
// "strlen" -> "1024" // OPTIONAL (default val shown)
// "fileformat" -> "orc" // OPTIONAL (default val shown)
)
// SaveMode can be either Overwrite, Append, ErrorIfExists, Ignore
val mode = SaveMode.Append
df
.write
.format("com.vertica.spark.datasource.DefaultSource")
.options(opts)
.mode(mode)
.save()
This is the same as provided in the documentation, and this error comes.
I have set up my HDFS and Vertica.
The question is: if it works as expected from the spark shell, why does it not work outside of it?
20/04/27 01:55:50 INFO S2V: Load by name. Column list: ("name","miles_per_gallon","cylinders","displacement","horsepower","weight_in_lbs","acceleration","year","origin")
20/04/27 01:55:50 INFO S2V: Writing intermediate data files to path: hdfs://localhost:9000/user/S2V_job2509086937642333836
20/04/27 01:55:50 ERROR S2VUtils: Unable to delete the HDFS path: hdfs://localhost:9000/user/S2V_job2509086937642333836
20/04/27 01:55:50 ERROR S2V: Failed to save DataFrame to Vertica table: second0.car with SaveMode: Append
20/04/27 01:55:50 ERROR JobScheduler: Error running job streaming job 1587932740000 ms.2
java.lang.Exception: S2V: FATAL ERROR for job S2V_job2509086937642333836. Job status information is available in the Vertica table second0.S2V_JOB_STATUS_USER_DBADMIN. Unable to create/insert into target table: second0.car with SaveMode: Append. ERROR MESSAGE: ERROR: java.lang.Exception: S2V: FATAL ERROR for job S2V_job2509086937642333836. Unable to save intermediate orc files to HDFS path: hdfs://localhost:9000/user/S2V_job2509086937642333836. Error message: The ORC data source must be used with Hive support enabled;
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:446)
at com.vertica.spark.s2v.S2V.save(S2V.scala:496)
at com.vertica.spark.datasource.DefaultSource.createRelation(VerticaSource.scala:100)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at replica_nimble_spark.SparkVerticaHelper$$anonfun$applyPipeline$1$$anonfun$apply$3.apply(SparkVerticaHelper.scala:85)
at replica_nimble_spark.SparkVerticaHelper$$anonfun$applyPipeline$1$$anonfun$apply$3.apply(SparkVerticaHelper.scala:76)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The question is: if it works as expected from the spark shell, why does it not work outside of it?
The answer is in your error message:
Error message: The ORC data source must be used with Hive support enabled;
at com.vertica.spark.s2v.S2V.do2Stage(S2V.scala:446)
This means you have to enable Hive support, as in the following example, to fix the error.
val spark = SparkSession
.builder()
.appName("Mr potterpod wants to test spark hive support")
.master("local[*]")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport() // this is what I was talking about
.getOrCreate()
Why does it work from spark-shell?
Answer: spark-shell enables Hive support by default in Spark 2.0 and later.
Proof: to check the default behavior, open spark-shell without any options and run:
scala> spark.sparkContext.getConf.get("spark.sql.catalogImplementation")
res3: String = hive
If you want to test this by disabling Hive support in spark-shell, use spark.sql.catalogImplementation (valid options are in-memory or hive):
spark-shell --conf spark.sql.catalogImplementation=in-memory
Then you will hit the same error in spark-shell as well.
Further reading: How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

How to write Spark Dataframe into HBase?

I'm trying to write a Spark DataFrame into HBase and have followed several blogs; one of them is this, but it's not working.
However, I can read the data from HBase successfully as a DataFrame. Also, some posts use the org.apache.hadoop.hbase.spark format and others org.apache.spark.sql.execution.datasources.hbase; I'm not sure which one to use. Spark - 2.2.2; HBase - 1.4.7; Scala - 2.11.12 and Hortonworks SHC 1.1.0-2.1-s_2.11 from here.
The code is as follows:
case class UserMessageRecord(
rowkey: String,
Name: String,
Number: String,
message: String,
lastTS: String
) // this has been defined outside of the object scope
val exmple = List(UserMessageRecord("86325980047644033486","enrique","123455678",msgTemplate,timeStamp))
import spark.sqlContext.implicits._
val userDF = exmple.toDF()
//write to HBase
userDF.write
.options(Map(HBaseTableCatalog.tableCatalog -> catalog))
.format("org.apache.spark.sql.execution.datasources.hbase").save() //exception here
//read from HBase and it's working fine
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df = withCatalog(catalog)
df.show()
Here's the exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:122)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:76)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.insert(HBaseRelation.scala:218)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at HbaseConnectionTest.HbaseLoadUsingSpark$.main(HbaseLoadUsingSpark.scala:85)
at HbaseConnectionTest.HbaseLoadUsingSpark.main(HbaseLoadUsingSpark.scala)
As discussed over here, I made additional configuration changes to the SparkSession builder and the exception is gone. However, I am not clear on the cause and the fix.
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("HbaseSparkWrite")
.config("spark.hadoop.validateOutputSpecs", false)
.getOrCreate()
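For reference, the snippets above use a catalog string that is not shown. With SHC it is typically a JSON document mapping the case class fields to HBase column families; the example below is a hypothetical catalog for UserMessageRecord (table and column-family names are assumptions, not taken from the original post).
// Hypothetical SHC catalog for the UserMessageRecord case class above.
val catalog: String =
  """{
    |  "table": {"namespace": "default", "name": "user_messages"},
    |  "rowkey": "rowkey",
    |  "columns": {
    |    "rowkey":  {"cf": "rowkey", "col": "rowkey",  "type": "string"},
    |    "Name":    {"cf": "cf1",    "col": "Name",    "type": "string"},
    |    "Number":  {"cf": "cf1",    "col": "Number",  "type": "string"},
    |    "message": {"cf": "cf1",    "col": "message", "type": "string"},
    |    "lastTS":  {"cf": "cf1",    "col": "lastTS",  "type": "string"}
    |  }
    |}""".stripMargin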

How to query Spark StreamingContext with spark sql in zeppelin?

I am trying to use Spark SQL in Zeppelin to query data coming from Kafka for real-time trend analysis, but without success.
Here are the simple code snippets that I am running in Zeppelin:
//Load Dependency
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://repo1.maven.org/maven2/")
z.load("org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1")
z.load("org.apache.spark:spark-core_2.11:2.0.1")
z.load("org.apache.spark:spark-sql_2.11:2.0.1")
z.load("org.apache.spark:spark-streaming_2.11:2.0.1"
//simple streaming
%spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
val conf = new SparkConf()
.setAppName("clickstream")
.setMaster("local[*]")
.set("spark.streaming.stopGracefullyOnShutdown", "true")
.set("spark.driver.allowMultipleContexts","true")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config(conf)
.getOrCreate()
val ssc = new StreamingContext(conf, Seconds(1))
val topicsSet = Set("timer")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.25.1:9091,192.168.25.1:9092,192.168.25.1:9093")
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet).map(_._2)
lines.window(Seconds(60)).foreachRDD{ rdd =>
val clickDF = spark.read.json(rdd) //doesn't have to be json
clickDF.createOrReplaceTempView("testjson1")
//olderway
//clickDF.registerTempTable("testjson2")
clickDF.show
}
lines.print()
ssc.start()
ssc.awaitTermination()
I am able to print each Kafka message, but when I run a simple SQL query, %sql select * from testjson1 (or testjson2), I get the following error:
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
In this post, streaming data is queried (with a Twitter example), so I am thinking it should be possible with Kafka streaming as well. So I guess maybe I am doing something wrong or missing some point?
Any ideas, suggestions, or recommendations are welcome.
The error message does not say that the temp view is missing; it says that the type None does not provide an element with the name 'get'.
With Spark, the calculations based on RDDs are performed when an action is called. So up to the point where you create the temporary table, no calculation is performed; all the calculations are performed when you execute your query on the table. If the table did not exist, you would get a different error message.
Maybe the Kafka messages can be printed, but your exception says that the None instance does not know 'get'. So I believe your source JSON data contains items without data; those items are represented by None and therefore cause the exception while Spark performs the calculations.
I would suggest that you verify whether your solution works in general, by testing it with sample data that does not contain empty JSON elements.
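A hedged sketch of such a sanity check (not part of the original answer; the field names are made up): feed a small in-memory JSON sample through the same read/temp-view/SQL path, bypassing Kafka entirely.
%spark
// Verify the JSON -> temp view -> SQL path with clean sample data, independent of Kafka.
val sampleRDD = spark.sparkContext.parallelize(Seq(
  """{"user":"a","clicks":3}""",
  """{"user":"b","clicks":5}"""))
val sampleDF = spark.read.json(sampleRDD) // same call the foreachRDD block uses on each RDD
sampleDF.createOrReplaceTempView("testjson1")
spark.sql("select * from testjson1").show()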

structured streaming with Spark 2.0.2, Kafka source and scalapb

I am using Structured Streaming (Spark 2.0.2) to consume Kafka messages that are encoded in protobuf, using scalapb. I am getting the following error. Please help.
Exception in thread "main" scala.ScalaReflectionException: is
not a term at
scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
at
scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:811)
at
org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:800)
at
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:582)
at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:460)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381) at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344) at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
at
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:274) at
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
at PersonConsumer$.main(PersonConsumer.scala:33) at
PersonConsumer.main(PersonConsumer.scala) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The following is my code ...
object PersonConsumer {
import org.apache.spark.rdd.RDD
import com.trueaccord.scalapb.spark._
import org.apache.spark.sql.{SQLContext, SparkSession}
import com.example.protos.demo._
def main(args : Array[String]) {
def parseLine(s: String): Person =
Person.parseFrom(
org.apache.commons.codec.binary.Base64.decodeBase64(s))
val spark = SparkSession.builder.
master("local")
.appName("spark session example")
.getOrCreate()
import spark.implicits._
val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","person").load()
val ds2 = ds1.selectExpr("CAST(value AS STRING)").as[String]
val ds3 = ds2.map(str => parseLine(str)).createOrReplaceTempView("persons")
val ds4 = spark.sqlContext.sql("select name from persons")
val query = ds4.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
}
}
The line with val ds3 should be:
val ds3 = ds2.map(str => parseLine(str))
sqlContext.protoToDataFrame(ds3).registerTempTable("persons")
The RDD needs to be converted to a DataFrame before it is saved as a temp table.
In the Person class, gender is an enum, and this was the cause of the problem. After removing this field, it works fine.
The following is the answer I got from Shixiong (Ryan) of Databricks:
The problem is "optional Gender gender = 3;". The generated class "Gender" is a trait, and Spark cannot know how to create a trait, so it's not supported. You can define your own class supported by the SQL Encoder, and convert this generated class to the new class in parseLine.