Spark: How to create streaming Dataset with RowEncoder? - scala

I have a streaming DataFrame created using Spark Structured Streaming, like this:
val dataStream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("subscribe", topic)
  .load()
Now, when I try to create a Dataset from dataStream with an additional column named newKey, I get the following error:
[error] (run-main-0) java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
[error] - field (class: "org.apache.spark.sql.Row", name: "_2")
[error] - root class: "scala.Tuple2"
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_2")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:642)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:444)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:820)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:444)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$8.apply(ScalaReflection.scala:636)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1$$anonfun$8.apply(ScalaReflection.scala:624)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:624)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:444)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:820)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:444)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:433)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
The code that I am using is as follows:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.functions._

implicit val rowEncoder: ExpressionEncoder[Row] = RowEncoder(dataStream.schema)

val dsStream = dataStream
  .select(lit("a").as("newKey"), col("*"))
  .as[(String, Row)]
  .writeStream
  .format("console")
  .start()
Can anyone help me resolve it?

Either of the following will work:

map:

import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1L, "a", 4.0)).toDF("x", "y", "z")
val encoder = Encoders.tuple(Encoders.STRING, RowEncoder(df.schema))

df.map(row => ("a", row))(encoder)

select with struct and as:

df.select(lit("a"), struct(df.columns map col: _*)).as[(String, Row)](encoder)

Related

How to create a PolygonRDD from H3 boundary?

I'm using Apache Spark with Apache Sedona (previously called GeoSpark), and I'm trying to do the following:
1. Take a DataFrame containing latitude and longitude in each row (it comes from an arbitrary source; it is neither a PointRDD nor does it come from a specific file format) and transform it into a DataFrame with the H3 index of each point.
2. Take that DataFrame and create a PolygonRDD containing the H3 cell boundaries of each distinct H3 index.
This is what I have so far:
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.sedona.core.spatialRDD.PolygonRDD
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.locationtech.jts.geom.{Polygon, GeometryFactory, Coordinate}
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
object Main {
  def main(args: Array[String]) {
    val sparkSession: SparkSession = SparkSession
      .builder()
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
      .master("local[*]")
      .appName("Sedona-Analysis")
      .getOrCreate()

    import sparkSession.implicits._

    SedonaSQLRegistrator.registerAll(sparkSession)
    SedonaVizRegistrator.registerAll(sparkSession)

    val df = Seq(
      (-8.01681, -34.92618),
      (-25.59306, -49.39895),
      (-7.17897, -34.86518),
      (-20.24521, -42.14273),
      (-20.24628, -42.14785),
      (-27.01641, -50.94109),
      (-19.72987, -47.94319)
    ).toDF("latitude", "longitude")

    val core: H3Core = H3Core.newInstance()
    val geoFactory = new GeometryFactory()

    val geoToH3 = udf((lat: Double, lng: Double, res: Int) => core.geoToH3(lat, lng, res))

    val trdd = df
      .select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
      .distinct()
      .rdd
      .map(row => {
        val h3 = row.getAs[Long](0)
        val lboundary = core.h3ToGeoBoundary(h3)
        val aboundary = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
        val poly = geoFactory.createPolygon(
          aboundary.map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
        )
        poly.setUserData(h3)
        poly
      })

    val polyRDD = new PolygonRDD(trdd)
    polyRDD.rawSpatialRDD.foreach(println)

    sparkSession.stop()
  }
}
However, after running sbt assembly and submitting the output jar to spark-submit, I get this error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2362)
at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.map(RDD.scala:395)
at Main$.main(Main.scala:44)
at Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: com.uber.h3core.H3Core
Serialization stack:
- object not serializable (class: com.uber.h3core.H3Core, value: com.uber.h3core.H3Core@3407ded1)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class Main$, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic Main$.$anonfun$main$2:(Lcom/uber/h3core/H3Core;Lorg/locationtech/jts/geom/GeometryFactory;Lorg/apache/spark/sql/Row;)Lorg/locationtech/jts/geom/Polygon;, instantiatedMethodType=(Lorg/apache/spark/sql/Row;)Lorg/locationtech/jts/geom/Polygon;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class Main$$$Lambda$1710/0x0000000840d7f040, Main$$$Lambda$1710/0x0000000840d7f040@4853f592)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
... 22 more
What is the proper way to achieve what I'm trying to do?
So, basically, moving the H3Core (and the GeometryFactory) into a separate object that extends Serializable was enough. I also had to adjust the Coordinate array to begin and end with the same point, since JTS requires a closed ring.
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.sedona.core.spatialRDD.PolygonRDD
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.locationtech.jts.geom.{Polygon, GeometryFactory, Coordinate}
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
object H3 extends Serializable {
  val core = H3Core.newInstance()
  val geoFactory = new GeometryFactory()
}

object Main {
  def main(args: Array[String]) {
    val sparkSession: SparkSession = SparkSession
      .builder()
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
      .master("local[*]")
      .appName("Sedona-Analysis")
      .getOrCreate()

    import sparkSession.implicits._

    SedonaSQLRegistrator.registerAll(sparkSession)
    SedonaVizRegistrator.registerAll(sparkSession)

    val df = Seq(
      (-8.01681, -34.92618),
      (-25.59306, -49.39895),
      (-7.17897, -34.86518),
      (-20.24521, -42.14273),
      (-20.24628, -42.14785),
      (-27.01641, -50.94109),
      (-19.72987, -47.94319)
    ).toDF("latitude", "longitude")

    val geoToH3 = udf((lat: Double, lng: Double, res: Int) => H3.core.geoToH3(lat, lng, res))

    val trdd = df
      .select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
      .distinct()
      .rdd
      .map(row => {
        val h3 = row.getAs[Long](0)
        val lboundary = H3.core.h3ToGeoBoundary(h3)
        val aboundary = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
        val poly = H3.geoFactory.createPolygon({
          val ps = aboundary.map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
          ps :+ ps(0)
        })
        poly.setUserData(h3)
        poly
      })

    val polyRDD = new PolygonRDD(trdd)
    polyRDD.rawSpatialRDD.foreach(println)

    sparkSession.stop()
  }
}

InvalidJobConfException. Output directory not set

I'm trying to write some data into Bigtable using a SparkSession:
val spark = SparkSession
  .builder
  .config(conf)
  .appName("my-job")
  .getOrCreate()

val hadoopConf = spark.sparkContext.hadoopConfiguration

import spark.implicits._

case class BestSellerRecord(skuNbr: String, slsQty: String, slsDollar: String, dmaNbr: String, productId: String)

val seq: DataFrame = Seq(("foo", "1", "foo1"), ("bar", "2", "bar1")).toDF("key", "value1", "value2")

val bigtablePuts = seq.toDF.rdd.map((row: Row) => {
  val put = new Put(Bytes.toBytes(row.getString(0)))
  put.addColumn(Bytes.toBytes(columnFamily), Bytes.toBytes("nbr"), Bytes.toBytes(row.getString(0)))
  (new ImmutableBytesWritable(), put)
})

bigtablePuts.saveAsNewAPIHadoopDataset(hadoopConf)
But this gives me the following exception:
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:138)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:391)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
which comes from this line:

bigtablePuts.saveAsNewAPIHadoopDataset(hadoopConf)

I also tried setting different configurations using hadoopConf.set, such as conf.set("spark.hadoop.validateOutputSpecs", "false"), but this gives me a NullPointerException.
How can I fix this issue?
Can you try upgrading to the mapreduce API, since mapred is deprecated?
This question shows an example of rewriting this code segment: Output directory not set exception when save RDD to hbase with spark
Hope this is helpful.
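For illustration, a minimal sketch of the mapreduce-API pattern from that question, adapted to the snippet above (the table name is a placeholder, and HBase's TableOutputFormat is assumed as the output format; adapt it to your Bigtable client setup):

import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

// Wrap the existing hadoopConf in a Job so an output format and target table are set;
// with no output format configured, Spark falls back to a file-based one whose
// checkOutputSpecs demands an output directory, which is the error above.
val job = Job.getInstance(hadoopConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "my-table") // placeholder table name

bigtablePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)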

How to write Spark Dataframe into HBase?

I'm trying to write a Spark DataFrame into HBase. I've followed several blogs, one of which is this, but it's not working.
However, I can read data from HBase successfully as a DataFrame. Also, some posts use the org.apache.hadoop.hbase.spark format and others org.apache.spark.sql.execution.datasources.hbase, and I'm not sure which one to use. Spark 2.2.2; HBase 1.4.7; Scala 2.11.12; Hortonworks SHC 1.1.0-2.1-s_2.11 from here.
The code is as follows:
case class UserMessageRecord(
  rowkey: String,
  Name: String,
  Number: String,
  message: String,
  lastTS: String
) // this has been defined outside of the object scope

val exmple = List(UserMessageRecord("86325980047644033486", "enrique", "123455678", msgTemplate, timeStamp))

import spark.sqlContext.implicits._
val userDF = exmple.toDF()

// write to HBase
userDF.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save() // exception here

// read from HBase and it's working fine
def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}

val df = withCatalog(catalog)
df.show()
Here's the exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:122)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:76)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.insert(HBaseRelation.scala:218)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at HbaseConnectionTest.HbaseLoadUsingSpark$.main(HbaseLoadUsingSpark.scala:85)
at HbaseConnectionTest.HbaseLoadUsingSpark.main(HbaseLoadUsingSpark.scala)
As discussed over here, I made additional configuration changes to the SparkSession builder and the exception is gone. However, I am not clear on the cause or the fix.
val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("HbaseSparkWrite")
  .config("spark.hadoop.validateOutputSpecs", false)
  .getOrCreate()

structured streaming with Spark 2.0.2, Kafka source and scalapb

I am using Structured Streaming (Spark 2.0.2) to consume Kafka messages encoded as protobuf with ScalaPB. I am getting the following error. Please help.
Exception in thread "main" scala.ScalaReflectionException: is
not a term at
scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
at
scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:811)
at
org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:800)
at
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:582)
at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:460)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381) at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344) at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
at
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:274) at
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
at PersonConsumer$.main(PersonConsumer.scala:33) at
PersonConsumer.main(PersonConsumer.scala) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The following is my code ...
object PersonConsumer {
  import org.apache.spark.rdd.RDD
  import com.trueaccord.scalapb.spark._
  import org.apache.spark.sql.{SQLContext, SparkSession}
  import com.example.protos.demo._

  def main(args: Array[String]) {
    def parseLine(s: String): Person =
      Person.parseFrom(
        org.apache.commons.codec.binary.Base64.decodeBase64(s))

    val spark = SparkSession.builder
      .master("local")
      .appName("spark session example")
      .getOrCreate()

    import spark.implicits._

    val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "person").load()
    val ds2 = ds1.selectExpr("CAST(value AS STRING)").as[String]
    val ds3 = ds2.map(str => parseLine(str)).createOrReplaceTempView("persons")
    val ds4 = spark.sqlContext.sql("select name from persons")

    val query = ds4.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
The line with val ds3 should be:

val ds3 = ds2.map(str => parseLine(str))
sqlContext.protoToDataFrame(ds3).registerTempTable("persons")

The RDD needs to be converted to a DataFrame before it can be saved as a temp table.
In the Person class, gender is an enum, and this was the cause of the problem. After removing this field, it works fine.
The following is the answer I got from Shixiong (Ryan) of Databricks:
The problem is "optional Gender gender = 3;". The generated class "Gender" is a trait, and Spark doesn't know how to create a trait, so it's not supported. You can define a class of your own that is supported by the SQL Encoder, and convert the generated class to the new class in parseLine.

How to query data stored in Hive table using SparkSession of Spark2?

I am trying to query data stored in a Hive table from Spark2. Environment: 1. cloudera-quickstart-vm-5.7.0-0-vmware 2. Eclipse with Scala 2.11.8 plugin 3. Spark2 and Maven.
I did not change the Spark default configuration. Do I need to configure anything in Spark or Hive?
Code
import org.apache.spark._
import org.apache.spark.sql.SparkSession

object hiveTest {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("HiveSQL")
      .enableHiveSupport()
      .getOrCreate()

    val data = sparkSession2.sql("select * from test.mark")
  }
}
I'm getting this error:
16/08/29 00:18:10 INFO SparkSqlParser: Parsing command: select * from test.mark
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:48)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:47)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at hiveTest$.main(hiveTest.scala:34)
at hiveTest.main(hiveTest.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Duplicate SQLConfigEntry. spark.sql.hive.convertCTAS has been registered
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.internal.SQLConf$.org$apache$spark$sql$internal$SQLConf$$register(SQLConf.scala:44)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:122)
at org.apache.spark.sql.hive.HiveUtils$.<init>(HiveUtils.scala:103)
at org.apache.spark.sql.hive.HiveUtils$.<clinit>(HiveUtils.scala)
... 14 more
Any suggestions are appreciated.
Thanks,
Robin
This is what I am using:
import org.apache.spark.sql.SparkSession

object LoadCortexDataLake extends App {
  val spark = SparkSession.builder().appName("Cortex-Batch").enableHiveSupport().getOrCreate()

  spark.read.parquet(file).createOrReplaceTempView("temp")
  spark.sql(s"insert overwrite table $table_nm partition(year='$yr',month='$mth',day='$dt') select * from temp")
}
I think you should use 'sparkSession.sql' instead of 'sparkSession2.sql'
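For reference, the question's snippet with only that change applied (a sketch; everything else untouched):

object hiveTest {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("HiveSQL")
      .enableHiveSupport()
      .getOrCreate()

    // Refer to the val defined above rather than the undefined sparkSession2.
    val data = sparkSession.sql("select * from test.mark")
  }
}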
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession
  .builder()
  .appName("Connect to Hive")
  .config("hive.metastore.uris", "thrift://cdh-hadoop-master:Port")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT * FROM table_name")