structured streaming with Spark 2.0.2, Kafka source and scalapb - scala

I am using structured streaming (Spark 2.0.2) to consume kafka messages. Using scalapb, messages in protobuf. I am getting the following error. Please help..
Exception in thread "main" scala.ScalaReflectionException: is
not a term at
scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199)
at
scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:84)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.constructParams(ScalaReflection.scala:811)
at
org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:39)
at
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:800)
at
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:39)
at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:582)
at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:460)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381) at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344) at
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
at
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:274) at
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
at PersonConsumer$.main(PersonConsumer.scala:33) at
PersonConsumer.main(PersonConsumer.scala) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
The following is my code ...
object PersonConsumer {
import org.apache.spark.rdd.RDD
import com.trueaccord.scalapb.spark._
import org.apache.spark.sql.{SQLContext, SparkSession}
import com.example.protos.demo._
def main(args : Array[String]) {
def parseLine(s: String): Person =
Person.parseFrom(
org.apache.commons.codec.binary.Base64.decodeBase64(s))
val spark = SparkSession.builder.
master("local")
.appName("spark session example")
.getOrCreate()
import spark.implicits._
val ds1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","person").load()
val ds2 = ds1.selectExpr("CAST(value AS STRING)").as[String]
val ds3 = ds2.map(str => parseLine(str)).createOrReplaceTempView("persons")
val ds4 = spark.sqlContext.sql("select name from persons")
val query = ds4.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
}
}

The line with val ds3 should be:
val ds3 = ds2.map(str => parseLine(str))
sqlContext.protoToDataFrame(ds3).registerTempTable("persons")
The RDD needs to be converted to a data frame before it is saved as temp table.

In Person class, gender is a enum and this was the cause for this problem. After removing this field, it works fine.
The following is the answer I got from Shixiong(Ryan) of DataBricks.
The problem is "optional Gender gender = 3;". The generated class "Gender" is a trait, and Spark cannot know how to create a trait so it's not supported. You can define your class which is supported by SQL Encoder, and convert this generated class to the new class in parseLine.

Related

Flink Cassandra No support for the type of the given DataStream: GenericType<scala.Tuple2>

I'm trying to write to Cassandra a list of tuple 2.
Scala 2.11.12
Flink 1.12.1
import com.datastax.driver.core.Cluster
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.cassandra.{CassandraSink, ClusterBuilder, MapperOptions}
import java.util.Properties
object Kafka {
def main(args: Array[String]) {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "broker:9092")
properties.setProperty("group.id", "FlinkTest")
implicit val typeInfo = TypeInformation.of(classOf[(String)])
implicit val typeInfo_2 = TypeInformation.of(classOf[(String, Int)])
val list = List(new Tuple2("a", 1), new Tuple2("b", 2), new Tuple2("c", 3))
val sink: CassandraSink[(String, Int)] = CassandraSink.addSink(env.fromCollection(list))
.setQuery("INSERT INTO flink_test.example (id, counter) values (?, ?);")
.setClusterBuilder(new ClusterBuilder {
override def buildCluster(builder: Cluster.Builder): Cluster = {
builder.addContactPoint("cluster1").addContactPoint("cluster2").addContactPoint("cluster3").withCredentials("cassandra", "ABCXYZ123").build()
}
})
.build()
sink.name("sample").disableChaining.setParallelism(1).uid("hello")
env.execute("Flink Kafka Example refresh")
}
}
I got an error about type input
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: No support for the type of the given DataStream: GenericType<scala.Tuple2>
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:360)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:213)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:816)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:248)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1058)
at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1136)
at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1136)
Caused by: java.lang.IllegalArgumentException: No support for the type of the given DataStream: GenericType<scala.Tuple2>
at org.apache.flink.streaming.connectors.cassandra.CassandraSink.addSink(CassandraSink.java:228)
at org.apache.flink.streaming.connectors.cassandra.CassandraSink.addSink(CassandraSink.java:201)
at Kafka$.main(Kafka.scala:18)
at Kafka.main(Kafka.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:343)
... 8 more
I'm tried convert Datastream to javaStream by "env.fromCollection(list).javaStream" but this error still happen.
Why addSink function accept Datastream input type but this error happen?

scala - Cannot create SparkContext and SparkSession [duplicate]

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I am new to scala and Spark. I am trying to read in a csv file therefore I create a SparkSession to read the csv. Also I create a SparkContext to work later with rdd. I am using scala-ide.
The appearing error is maybe a common error in java, but I am not able to solve them.
code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
object Solution1 {
def main(args: Array[String]){
println("Create contex for rdd ")
val conf = new SparkConf().setAppName("Problem1")
val cont = new SparkContext(conf)
println("create SparkSession and read csv")
val spark = SparkSession.builder().appName("Problem1").getOrCreate()
val data = spark.read.option("header", false).csv("file.csv")
// further processing
cont.stop()
}
}
The error:
Create contex for rdd
Exception in thread "main" java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheLoader
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at Solution1$.main(Solution1.scala:13)
at Solution1.main(Solution1.scala)
Caused by: java.lang.ClassNotFoundException: org.spark_project.guava.cache.CacheLoader
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
Please create Spark Context like below
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("someName").setMaster("local[*]")
val sparkContext = new SparkContext(conf)
}
To read data
val rdd = sparkContext.textFile("path.csv")
and Spark Session like below
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("Creating spark session")
.master("local[*]")
.getOrCreate()
}
To read data call
val df = spark.read.format("json").load("path.json")
Also if you have spark session create then you do not need to create Spark context separately, you can call Spark session like this way to take advantage of Spark context as well:
val data = spark.sparkContext.textFile("path")

Spark NaiveBayesTextClassification

i'm trying to create a text classifier spark(1.6.2) app, but I don't know what am I doing wrong. This is my code:
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
/**
* Created by kebodev on 2016.11.29..
*/
object PredTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("IktatoSparkRunner")
.set("spark.executor.memory", "2gb")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainData = sqlContext.read.json("src/main/resources/tst.json")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val model = NaiveBayes.train(featurizedData)
}
}
The NaiveBayes object doesn't have train method, what should I import?
If i try to use this way:
val naBa = new NaiveBayes()
naBa.fit(featurizedData)
I get this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:40)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
at PredTest$.main(PredTest.scala:37)
at PredTest.main(PredTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
This is how my json file looks like:
{"text":"any text","label":"6.0"}
I'm really noob in this topic. Can anyone help me how to create a model, and then how to predict a new value.
Thank you!
Labels and Feature Vectors only contain Doubles. Your label column contains a String.
See your stacktrace:
Column label must be of type DoubleType but was actually StringType.
You can use the StringIndexer or CountVectorizer to convert it appropriately. See http://spark.apache.org/docs/latest/ml-features.html#stringindexer for further details.

How to query data stored in Hive table using SparkSession of Spark2?

I am trying to query data stored in Hive table from Spark2. Environment: 1.cloudera-quickstart-vm-5.7.0-0-vmware 2. Eclipse with Scala2.11.8 plugin 3. Spark2 and Maven under
I did not change spark default configuration. Do I need configure anything in Spark or Hive?
Code
import org.apache.spark._
import org.apache.spark.sql.SparkSession
object hiveTest {
def main (args: Array[String]){
val sparkSession = SparkSession.builder.
master("local")
.appName("HiveSQL")
.enableHiveSupport()
.getOrCreate()
val data= sparkSession2.sql("select * from test.mark")
}
}
Getting error
16/08/29 00:18:10 INFO SparkSqlParser: Parsing command: select * from test.mark
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:48)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:47)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:54)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at hiveTest$.main(hiveTest.scala:34)
at hiveTest.main(hiveTest.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Duplicate SQLConfigEntry. spark.sql.hive.convertCTAS has been registered
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.internal.SQLConf$.org$apache$spark$sql$internal$SQLConf$$register(SQLConf.scala:44)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.sql.internal.SQLConf$SQLConfigBuilder$$anonfun$apply$1.apply(SQLConf.scala:51)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$createWithDefault$1.apply(ConfigBuilder.scala:122)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:122)
at org.apache.spark.sql.hive.HiveUtils$.<init>(HiveUtils.scala:103)
at org.apache.spark.sql.hive.HiveUtils$.<clinit>(HiveUtils.scala)
... 14 more
Any suggestion is appreciated
Thanks
Robin
This is what I am using:
import org.apache.spark.sql.SparkSession
object LoadCortexDataLake extends App {
val spark = SparkSession.builder().appName("Cortex-Batch").enableHiveSupport().getOrCreate()
spark.read.parquet(file).createOrReplaceTempView("temp")
spark.sql(s"insert overwrite table $table_nm partition(year='$yr',month='$mth',day='$dt') select * from temp")
I think you should use 'sparkSession.sql' instead of 'sparkSession2.sql'
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
val spark = SparkSession.
builder().
appName("Connect to Hive").
config("hive.metastore.warehouse.uris","thrift://cdh-hadoop-master:Port").
enableHiveSupport().
getOrCreate()
val df = spark.sql("SELECT * FROM table_name")

Spark kryo encoder ArrayIndexOutOfBoundsException

I'm trying to create a dataset with some geo data using spark and esri. If Foo only have Point field, it'll work but if I add some other fields beyond a Point, I get ArrayIndexOutOfBoundsException.
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
object Main {
case class Foo(position: Point, name: String)
object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
}
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
}
}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at
org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) at
org.apache.spark.sql.Dataset.showString(Dataset.scala:263) at
org.apache.spark.sql.Dataset.show(Dataset.scala:230) at
org.apache.spark.sql.Dataset.show(Dataset.scala:193) at
org.apache.spark.sql.Dataset.show(Dataset.scala:201) at
Main$.main(Main.scala:24) at Main.main(Main.scala)
Kryo create encoder for complex data types based on Spark SQL Data Types. So check the result of schema that kryo create:
val enc: Encoder[Foo] = Encoders.kryo[Foo]
println(enc.schema) // StructType(StructField(value,BinaryType,true))
val numCols = schema.fieldNames.length // 1
So you have one column data in Dataset and it's in Binary format. But It's strange that why Spark attempting to show Dataset in more than one column (and that error occurs). To fix this, upgrade Spark version to 2.0.0.
By using Spark 2.0.0, you still have problem with columns data types. I hope writing manual schema works if you can write StructType for esri Point class:
val schema = StructType(
Seq(
StructField("point", StructType(...), true),
StructField("name", StringType, true)
)
)
val rdd = sc.parallelize(Seq(Row(new Point(0,0), "bar")))
sqlContext.createDataFrame(rdd, schema).toDS