How to query Spark StreamingContext with spark sql in zeppelin? - scala

I am trying to use Spark SQL to query data coming from Kafka through Zeppelin for real-time trend analysis, but without success.
Here are the simple code snippets that I am running in Zeppelin:
//Load Dependency
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://repo1.maven.org/maven2/")
z.load("org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1")
z.load("org.apache.spark:spark-core_2.11:2.0.1")
z.load("org.apache.spark:spark-sql_2.11:2.0.1")
z.load("org.apache.spark:spark-streaming_2.11:2.0.1"
//simple streaming
%spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
val conf = new SparkConf()
  .setAppName("clickstream")
  .setMaster("local[*]")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
  .set("spark.driver.allowMultipleContexts", "true")
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config(conf)
  .getOrCreate()
val ssc = new StreamingContext(conf, Seconds(1))
val topicsSet = Set("timer")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.25.1:9091,192.168.25.1:9092,192.168.25.1:9093")
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet).map(_._2)
lines.window(Seconds(60)).foreachRDD { rdd =>
  val clickDF = spark.read.json(rdd) // doesn't have to be JSON
  clickDF.createOrReplaceTempView("testjson1")
  // older way
  // clickDF.registerTempTable("testjson2")
  clickDF.show
}
lines.print()
ssc.start()
ssc.awaitTermination()
I am able to print each Kafka message, but when I run a simple SQL query in a %sql paragraph (select * from testjson1, or testjson2), I get the following error:
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
In this post streaming data is being queried (with a Twitter example), so I am thinking it should be possible with Kafka streaming as well. So I guess maybe I am doing something wrong, or missing some point?
Any ideas, suggestions, or recommendations are welcome.

The error message does not say that the temp view is missing. It says that get was called on a None, which has no value to return.
With Spark, the calculations based on the RDDs are only performed when an action is called. So up to the point where you create the temporary table, no calculation is performed; all the calculations happen when you execute your query on the table. If the table did not exist, you would get a different error message.
The Kafka messages may well print, but your exception says that get was called on a None instance. So I believe your source JSON data contains items without data; those items are represented by None and therefore cause the exception while Spark performs the calculations.
I would suggest you verify that your solution works in general by testing it with sample data that does not contain empty JSON elements.
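A quick way to do that check is to feed the same view a few clean, hand-made JSON strings and query it without Kafka involved (just a sketch; the sample records and field names are made up):
// Run in a %spark paragraph: build a small RDD of clean JSON strings,
// register it under the same view name, and query it directly.
val sampleJson = Seq(
  """{"page": "home", "ts": 1}""",
  """{"page": "cart", "ts": 2}"""
)
val sampleRdd = spark.sparkContext.parallelize(sampleJson)
val sampleDF = spark.read.json(sampleRdd)   // same call as in the foreachRDD block
sampleDF.createOrReplaceTempView("testjson1")
spark.sql("select * from testjson1").show() // should print both rows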

Related

No Encoder found for org.locationtech.jts.geom.Point

While using GeoMesa and Scala, I have been attempting to encode two columns of a Spark DataFrame using the snippet below, but I keep running into an issue where Scala apparently cannot serialize the returned objects into a DataFrame. When using Postgres and PostGIS, life is easy. Is this an easy issue to fix, or is there a better library that can handle geospatial querying from a Spark DataFrame that holds latitude and longitude as Doubles?
The versions that I am using in my SBT are:
spark: 2.3.0
scala: 2.11.12
geomesa: 2.2.1
jts-*: 1.17.0-SNAPSHOT
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.locationtech.geomesa.spark.jts._
object GetRandomData {
  def main(sysArgs: Array[String]) {
    @transient val spark: SparkSession = {
      SparkSession
        .builder()
        .config("spark.ui.enabled", "false")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.mb", "24")
        .appName("GetRandomData")
        .master("local[*]")
        .getOrCreate()
    }
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    import spark.sqlContext.implicits._
    var coordinates = sc.parallelize(
      List(
        (35.40466, -80.905458),
        (35.344079, -80.872267),
        (35.139606, -80.840845),
        (35.537786, -80.780051),
        (35.525361, -83.031932),
        (34.928323, -80.766732),
        (35.533865, -82.72344),
        (35.50997, -80.588572),
        (35.286251, -83.150514),
        (35.558519, -81.067069),
        (35.569311, -80.916993),
        (35.835867, -81.067904),
        (35.221695, -82.662141)
      )
    ).toDS().toDF("geo_lat", "geo_lng")
    coordinates = coordinates.select(coordinates.columns.map(c => col(c).cast(DoubleType)): _*)
    coordinates.show()
    val testing = coordinates.map(r => new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))
    val coordinatesPointDf = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))
  }
}
The exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
- root class: "org.locationtech.jts.geom.Point"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.locationtech.geomesa.spark.jts.encoders.SpatialEncoders$class.jtsPointEncoder(SpatialEncoders.scala:21)
at org.locationtech.geomesa.spark.jts.package$.jtsPointEncoder(package.scala:17)
at GetRandomData$.main(Main.scala:50)
at GetRandomData.main(Main.scala)
If you aren't using an underlying GeoMesa store to load data into a spark session you'll need to explicitly register the JTS types with:
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)
This will register the ST_ operations as well as the JTS encoders.
In plain English, the exception is saying:
I don't know how to convert a Point to a Spark type.
If you keep the latitude and longitude as doubles in your Dataset then you should be fine, but as soon as you use an object like Point, you need to tell Spark how to convert it. In Spark terms these are called Encoders, and you can create custom ones.
Alternatively, you can switch to an RDD, where no conversion is necessary, as long as you don't mind losing the Spark SQL functionality.
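Putting the registration suggestion together with the question's snippet, a minimal sketch might look like the following. It assumes the GeoMesa Spark modules that provide SQLTypes and st_makePoint are on the classpath, and uses two made-up coordinates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.locationtech.geomesa.spark.jts._

object JtsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("jts-sketch").getOrCreate()

    // Register the JTS geometry types and ST_ functions when no GeoMesa store backs the session
    org.apache.spark.sql.SQLTypes.init(spark.sqlContext)

    import spark.implicits._
    val coordinates = Seq((35.40466, -80.905458), (35.344079, -80.872267)).toDF("geo_lat", "geo_lng")

    // With the types registered, st_makePoint produces a column Spark can handle
    val withPoint = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))
    withPoint.show()
  }
}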

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application and I'm just trying to get a simple example of a Kafka direct stream working:
package com.username
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))
  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()
  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
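For reference, the "write something to the topic" placeholder could be any plain Kafka producer call, roughly like the sketch below; the serializer settings are assumptions, not part of the original code:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical helper: push one test message into the topic
val props = new Properties()
props.put("bootstrap.servers", brokers)
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String](topic, s"test-${System.currentTimeMillis()}"))
producer.close()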
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs leads me to believe this is not its intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.

How to suppress the "Stage 2===>" from the output console in spark?

I have a DataFrame and I am trying to get a distinct count. I can get the distinct count successfully, but whenever the Scala program executes I get this message: [Stage 2:=============================> (1 + 1) / 2]. How can I suppress this particular message in the console?
val countID = dataDF.select(substring(col("dataDF"), 5, 7)).distinct().count()
You need to set spark.ui.showConsoleProgress to false
I found this in the comments of the ticket for the addition of the progress bar.
https://issues.apache.org/jira/browse/SPARK-4017
I haven't seen it in any of the documentation, though; it really should be added.
If you want to do it in code, add the following when you are creating the SparkContext:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
Logger.getRootLogger.setLevel(Level.ERROR) // Disable "INFO"-level logs (these lines must come before the SparkContext is created)
val conf = new SparkConf().set("spark.ui.showConsoleProgress", "false").setAppName("myApp")
val sc = new SparkContext(conf)
UPDATE FOR SPARK 2+:
Using SparkSession, you can suppress those messages by adding the following line (.config("spark.ui.showConsoleProgress", "false")) to the declaration:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myApp")
  .config("spark.ui.showConsoleProgress", "false")
  .getOrCreate()

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the below exception when I join two DataFrames in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the DataFrame column does not contain a period. Please help me out.
Here is the code that I am using:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)
  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)
  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)
  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, period is used to access members of a struct field.
The spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions so if you upgrade it might just solve the issue.
That said, you can simply use withColumnRenamed to rename the column to something which does not have a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3= dfTmp.join(df2,dfTmp.col("JOIN_COL").equalTo(df2.col("col3")),"outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the expression without .show, and then call df3.show separately.
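As an aside (not part of the original answer, and untested on 1.5): in later Spark versions you can also reference such a column directly by escaping it with backticks, which avoids the rename:
// Backtick-escape the dotted name so it is not parsed as struct-field access
val df3 = df1.join(df2, df1.col("`col.1`").equalTo(df2.col("col3")), "outer")
df3.show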

where is the declaration of SchemaRDD in spark 1.3.0's API

This code reports an error in IDEA. Why?
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val people = sc.textFile("c3/test.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
Is there another way to transform an RDD into a SchemaRDD with sqlContext, other than importing sqlContext.createSchemaRDD?
Also, I can't find the SchemaRDD class in the Spark API documentation. Why?
SchemaRDD has been renamed to DataFrame in Apache Spark 1.3.0. See the migration guide.
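For reference, the Spark 1.3 equivalent of that snippet looks roughly like this (a sketch that assumes the same file path and a Person case class, here with made-up name/age fields):
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // replaces the old createSchemaRDD import

val people = sc.textFile("c3/test.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                       // RDD[Person] -> DataFrame (formerly SchemaRDD)

people.registerTempTable("people")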