No Encoder found for org.locationtech.jts.geom.Point - scala

While using GeoMesa and Scala, I have been attempting to encode two columns of a Spark DataFrame as point geometry using the snippet below, but I keep hitting an error where it appears that Spark cannot serialize the returned JTS objects into a DataFrame. With Postgres and PostGIS, life is easy; is this an easy fix, or is there a better library that can handle geospatial querying of a Spark DataFrame containing latitude and longitude as Doubles?
The versions that I am using in my SBT are:
spark: 2.3.0
scala: 2.11.12
geomesa: 2.2.1
jts-*: 1.17.0-SNAPSHOT
The exception I am getting is:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point

and here is my code:
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.locationtech.geomesa.spark.jts._

object GetRandomData {
  def main(sysArgs: Array[String]) {
    @transient val spark: SparkSession = {
      SparkSession
        .builder()
        .config("spark.ui.enabled", "false")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.mb", "24")
        .appName("GetRandomData")
        .master("local[*]")
        .getOrCreate()
    }
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    import spark.sqlContext.implicits._

    var coordinates = sc.parallelize(
      List(
        (35.40466, -80.905458),
        (35.344079, -80.872267),
        (35.139606, -80.840845),
        (35.537786, -80.780051),
        (35.525361, -83.031932),
        (34.928323, -80.766732),
        (35.533865, -82.72344),
        (35.50997, -80.588572),
        (35.286251, -83.150514),
        (35.558519, -81.067069),
        (35.569311, -80.916993),
        (35.835867, -81.067904),
        (35.221695, -82.662141)
      )
    ).toDS().toDF("geo_lat", "geo_lng")

    coordinates = coordinates.select(coordinates.columns.map(c => col(c).cast(DoubleType)): _*)
    coordinates.show()

    val testing = coordinates.map(r => new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))
    val coordinatesPointDf = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))
  }
}
The exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
- root class: "org.locationtech.jts.geom.Point"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.locationtech.geomesa.spark.jts.encoders.SpatialEncoders$class.jtsPointEncoder(SpatialEncoders.scala:21)
at org.locationtech.geomesa.spark.jts.package$.jtsPointEncoder(package.scala:17)
at GetRandomData$.main(Main.scala:50)
at GetRandomData.main(Main.scala)

If you aren't using an underlying GeoMesa store to load data into a Spark session, you'll need to explicitly register the JTS types with:
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)
This will register the ST_ operations as well as the JTS encoders.
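In the asker's code, that call would go right after the SparkSession is created and before st_makePoint or the map over Point is used. A minimal sketch, reusing the session setup from the question (the init call is taken from this answer, not verified against every GeoMesa release):

import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._

val spark: SparkSession = SparkSession.builder()
  .appName("GetRandomData")
  .master("local[*]")
  .getOrCreate()

// Registers the JTS geometry types (Point, Geometry, ...) and the st_* functions
// on this session before any DataFrame column of type Point is created.
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)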

In plain English, the exception is saying:
I don't know how to convert a Point to a Spark SQL type.
If you keep the latitude and longitude as doubles in your Dataset then you should be fine, but as soon as you use an object like Point you'll need to tell Spark how to convert it. In Spark terms, that converter is called an Encoder, and you can create custom ones.
Alternatively, you can switch to an RDD, where no conversion is necessary, as long as you don't mind losing the Spark SQL machinery.
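As an illustration of the custom-encoder route (a minimal sketch using Spark's built-in Kryo support, not the GeoMesa-specific encoders; coordinates is the DataFrame from the question):

import org.apache.spark.sql.{Encoder, Encoders}
import org.locationtech.jts.geom.{Coordinate, GeometryFactory, Point}

// Kryo-backed encoder: each Point is stored as a single BinaryType column,
// so you lose column-level access, but the map over the DataFrame compiles and runs.
implicit val pointEncoder: Encoder[Point] = Encoders.kryo[Point]

val testing = coordinates.map(r =>
  new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))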

Related

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data into a collection without the RDD map API (doing it the pure DataFrame way):
object CommonObject {
  def doSomething(...) {
    .......
    val releaseDate = tableDF
      .where(tableDF("item") <=> "releaseDate")
      .select("value")
      .map(r => r.getString(0))
      .collect
      .toList
      .head
  }
}
This is all good, except that Spark 2.3 complains:
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand that adding
import spark.implicits._
before the call fixes it, but it requires a SparkSession instance, which is annoying when there is no SparkSession in scope inside the method. As a Spark newbie, how do I resolve the implicit encoder parameter nicely in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already-existing SparkSession and won't create a new one, so there is no performance impact. Then you can import its implicits, which cover all case classes and common primitives. This is the easiest way to provide the encoder. Alternatively, an explicit encoder can be passed using the Encoders class.
val spark = SparkSession.builder
  .appName("name")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._
The other way is to get the SparkSession from the DataFrame itself via dataframe.sparkSession:
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._
}
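To illustrate the explicit-encoder alternative mentioned above, here is a small sketch using Spark's built-in Encoders; tableDF is the asker's DataFrame, and the helper name releaseDateOf is made up for illustration:

import org.apache.spark.sql.{DataFrame, Encoders}

def releaseDateOf(tableDF: DataFrame): String =
  tableDF
    .where(tableDF("item") <=> "releaseDate")
    .select("value")
    // Pass the encoder explicitly instead of importing spark.implicits._
    .map(r => r.getString(0))(Encoders.STRING)
    .collect
    .head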

How to make Scala script infer custom column types in a csv/json schema?

I am trying to infer the schema when I load a CSV file into my SQLContext using SparkSession. Please note that I do not want to use a case class here, because I am trying to infer the schema as soon as the file is loaded; I have no information about the data types or column names of the file before loading it.
Here is what I am trying out in Scala:
package example

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import java.io.File
import org.apache.spark.sql.SparkSession
//import sqlContext.implicits._

object SimpleScalaSpark {
  def main(args: Array[String]) {
    //val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", "local")
      .getOrCreate()

    //val etl1Rdd = spark.sparkContext.wholeTextFiles("etl1.json").map(x => x._2)
    val jsonTbl = spark.sqlContext.read.format("org.apache.spark.csv")
      .option("header", true)
      .option("inferSchema", true)
      .option("dateFormat", "MM/dd/yyyy HH:mm")
      .csv("s1.csv")

    // print the inferred schema
    jsonTbl.printSchema
  }
}
I am able to get DateTime, Integer, Double, and String as data types for my file. But I want to implement custom data types based on my own regex patterns, for fields like SSN, VIN-ID, PhoneNumber, etc., which all have a fixed pattern that can be detected with a regex. This would make the schema-extraction process more accurate and precise for me. For example, suppose I have a column whose values consist of 5 or more letters and 2 or more digits; I could then say that this column is of type ID.
Any ideas on whether this is possible with Scala/Spark? If so, please point me to an implementation approach or to relevant technical documentation.
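Spark's inferSchema only produces the built-in primitive types, so custom types would have to be layered on top of the inferred schema. One possible direction (a rough sketch, not a built-in Spark feature; the pattern names, regexes, and the logicalTypes helper are made up for illustration) is to post-process the columns that were inferred as strings by testing a sample of values against your own regexes:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

// Hypothetical logical-type patterns, keyed by a label of your choosing.
val patterns = Map(
  "SSN" -> """\d{3}-\d{2}-\d{4}""".r,
  "ID"  -> """[A-Za-z]{5,}\d{2,}""".r
)

// For every string column, sample a few non-null values and label the column
// with the first pattern that all sampled values match, falling back to "String".
def logicalTypes(df: DataFrame, sampleSize: Int = 100): Map[String, String] =
  df.schema.fields
    .filter(_.dataType == StringType)
    .map { f =>
      val sample = df.select(f.name).na.drop().limit(sampleSize).collect().map(_.getString(0))
      val label = patterns
        .find { case (_, re) => sample.nonEmpty && sample.forall(v => re.pattern.matcher(v).matches()) }
        .map(_._1)
        .getOrElse("String")
      f.name -> label
    }
    .toMap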

Why does from_json fail with "not found : value from_json"?

I am reading a Kafka topic using Spark 2.1.1 (Kafka 0.10+) and the payload is a JSON string. I'd like to parse the string against a schema and move forward with the business logic.
Everyone seems to suggest that I should use from_json to parse the JSON strings; however, it doesn't compile in my situation. The error is:
not found : value from_json
.select(from_json($"json", txnSchema) as "data")
When I try the following lines in the spark-shell, they work just fine:
val df = stream
  .select($"value" cast "string" as "json")
  .select(from_json($"json", txnSchema) as "data")
  .select("data.*")
Any idea what I could be doing wrong in the code, such that this works in the shell but not in the IDE / at compile time?
Here's the code:
import org.apache.spark.sql._

object Kafka10Cons3 extends App {
  val spark = SparkSession
    .builder
    .appName(Util.getProperty("AppName"))
    .master(Util.getProperty("spark.master"))
    .getOrCreate

  import spark.implicits._

  val stream = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", Util.getProperty("kafka10.broker"))
    .option("subscribe", src_topic)
    .load

  val txnSchema = Util.getTxnStructure

  val df = stream
    .select($"value" cast "string" as "json")
    .select(from_json($"json", txnSchema) as "data")
    .select("data.*")
}
You're probably just missing the relevant import: import org.apache.spark.sql.functions._.
You have imported spark.implicits._ and org.apache.spark.sql._, but neither of those brings in the individual functions defined on the functions object.
I was also importing com.wizzardo.tools.json, which looks like it also has a from_json function; that must have been the one the compiler chose (since it was imported first?), and it was apparently incompatible with my version of Spark.
Make sure you are not importing the from_json function from some other JSON library, as that library may be incompatible with the version of Spark you are using.
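For completeness, here is how the relevant lines look once the import is in place (stream, txnSchema, and spark as defined in the question):

import org.apache.spark.sql.functions.from_json // the missing import
import spark.implicits._                        // provides the $"..." column syntax

val df = stream
  .select($"value" cast "string" as "json")
  .select(from_json($"json", txnSchema) as "data")
  .select("data.*")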

How to query Spark StreamingContext with spark sql in zeppelin?

I am trying to use Spark SQL to query data coming from Kafka using Zeppelin for real-time trend analysis, but without success.
Here are the simple code snippets that I am running in Zeppelin:
//Load Dependency
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://repo1.maven.org/maven2/")
z.load("org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1")
z.load("org.apache.spark:spark-core_2.11:2.0.1")
z.load("org.apache.spark:spark-sql_2.11:2.0.1")
z.load("org.apache.spark:spark-streaming_2.11:2.0.1"
//simple streaming
%spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("clickstream")
  .setMaster("local[*]")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
  .set("spark.driver.allowMultipleContexts", "true")

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config(conf)
  .getOrCreate()

val ssc = new StreamingContext(conf, Seconds(1))

val topicsSet = Set("timer")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.25.1:9091,192.168.25.1:9092,192.168.25.1:9093")

val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet).map(_._2)

lines.window(Seconds(60)).foreachRDD { rdd =>
  val clickDF = spark.read.json(rdd) //doesn't have to be json
  clickDF.createOrReplaceTempView("testjson1")
  //older way
  //clickDF.registerTempTable("testjson2")
  clickDF.show
}

lines.print()
ssc.start()
ssc.awaitTermination()
I am able to print each Kafka message, but when I run a simple SQL query, %sql select * from testjson1 (or testjson2), I get the following error:
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
In this post, streaming data is queried (with a Twitter example), so I am thinking it should also be possible with Kafka streaming. So I guess I am either doing something wrong or missing some point?
Any ideas, suggestions, or recommendations are welcome.
The error message does not say that the temp view is missing. It says that get was called on a None, which has no value to return.
With Spark, the calculations based on RDDs are only performed when an action is called, so up to the point where you create the temporary table no calculation has happened. All the work is done when you execute your query against the table. If the table did not exist, you would get a different error message.
The Kafka messages may well print, but the exception says that get was called on a None. So I believe your source JSON data contains items without data; those items end up represented as None and cause the exception once Spark actually performs the calculation.
I would suggest verifying that your solution works in general by testing it with sample data that contains no empty JSON elements.
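One quick way to rule out empty micro-batches as the source of the None (a sketch of a diagnostic step, not a guaranteed fix for this particular stack trace) is to guard the foreachRDD body so spark.read.json never sees an empty RDD:

lines.window(Seconds(60)).foreachRDD { rdd =>
  // Skip empty micro-batches entirely before building a DataFrame from them.
  if (!rdd.isEmpty()) {
    val clickDF = spark.read.json(rdd)
    clickDF.createOrReplaceTempView("testjson1")
    clickDF.show()
  }
}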

Spark kryo encoder ArrayIndexOutOfBoundsException

I'm trying to create a Dataset with some geo data using Spark and Esri. If Foo only has the Point field, it works, but if I add other fields beyond the Point, I get an ArrayIndexOutOfBoundsException.
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  case class Foo(position: Point, name: String)

  object MyEncoders {
    implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
    implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import MyEncoders.{FooEncoder, PointEncoder}
    import sqlContext.implicits._
    Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
  at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
  at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
  at Main$.main(Main.scala:24)
  at Main.main(Main.scala)
Kryo creates encoders for complex data types based on Spark SQL data types, so check the schema that the Kryo encoder produces:
val enc: Encoder[Foo] = Encoders.kryo[Foo]
println(enc.schema)                         // StructType(StructField(value,BinaryType,true))
val numCols = enc.schema.fieldNames.length  // 1
So your Dataset has a single column, and it is in binary format. It is strange, though, that Spark attempts to show the Dataset with more than one column (which is where the error occurs). To fix this, upgrade your Spark version to 2.0.0.
With Spark 2.0.0 you will still have a problem with the column data types. Writing a manual schema should work, if you can write a StructType for the Esri Point class:
val schema = StructType(
  Seq(
    StructField("point", StructType(...), true),
    StructField("name", StringType, true)
  )
)
val rdd = sc.parallelize(Seq(Row(new Point(0, 0), "bar")))
sqlContext.createDataFrame(rdd, schema).toDS
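If writing out a StructType for the Esri Point class turns out to be impractical, another direction (a sketch, assuming you can live with the geometry being stored as an opaque binary column) is to Kryo-encode only the Point field and keep name as a regular string column, using Spark's built-in tuple encoder:

import com.esri.core.geometry.Point
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Only the Point is Kryo-serialized; the name stays a readable string column.
val fooTupleEncoder: Encoder[(Point, String)] =
  Encoders.tuple(Encoders.kryo[Point], Encoders.STRING)

val ds: Dataset[(Point, String)] =
  sqlContext.createDataset(Seq((new Point(0, 0), "bar")))(fooTupleEncoder)

Whether show() renders the binary geometry column in a useful way still depends on the Spark version, but the tuple encoder at least avoids collapsing the whole case class into a single binary value.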