Why does from_json fail with "not found : value from_json"? - scala

I am reading a Kafka topic using Spark 2.1.1 (kafka 0.10+) and the payload is a JSON string. I'd like to parse the string with a schema and move forward with business logic.
Everyone seems to suggest that I should use from_json to parse the JSON strings, however, it doesn't seem to compile for my situation. The error being
not found : value from_json
.select(from_json($"json", txnSchema) as "data")
When i tried the following lines into spark shell, it works just fine -
val df = stream
.select($"value" cast "string" as "json")
.select(from_json($"json", txnSchema) as "data")
.select("data.*")
Any idea, what could I be doing wrong in the code to see this piece working in shell but not in IDE/compile time?
Here's the code:
import org.apache.spark.sql._
object Kafka10Cons3 extends App {
val spark = SparkSession
.builder
.appName(Util.getProperty("AppName"))
.master(Util.getProperty("spark.master"))
.getOrCreate
val stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", Util.getProperty("kafka10.broker"))
.option("subscribe", src_topic)
.load
val txnSchema = Util.getTxnStructure
val df = stream
.select($"value" cast "string" as "json")
.select(from_json($"json", txnSchema) as "data")
.select("data.*")
}

you're probably just missing the relevant import - import org.apache.spark.sql.functions._.
You have imported spark.implicits._ and org.apache.spark.sql._, but none of these would import the individual function in functions.
I was also importing com.wizzardo.tools.json which looks like it also has a from_json function, which must have been the one the compiler chose (since it was imported first?) and which was apparently incompatible with my version of spark
Make sure you are not importing the from_json function from some other json library, as this library may be incompatible with the version of spark you are using.

Related

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data in a collection without RDD map api (doing the pure dataframe way)
object CommonObject{
def doSomething(...){
.......
val releaseDate = tableDF.where(tableDF("item") <=> "releaseDate").select("value").map(r => r.getString(0)).collect.toList.head
}
}
this is all good except Spark 2.3 suggests
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand to add
import spark.implicits._
before the process however it requires a spark session instance
it's pretty annoying especially when there is no spark session instance in a method. As a Spark newbie how to nicely resolve the implicit encoding parameter in the context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import explicits which will work for all case classes. This is easiest way to add encoding. Alternatively an explicit encoder can be added using Encoders class.
val spark = SparkSession.builder
.appName("name")
.master("local[2]")
.getOrCreate()
import spark.implicits._
The other way is to get SparkSession from the dataframe dataframe.sparkSession
def dummy (df : DataFrame) = {
val spark = df.sparkSession
import spark.implicits._
}

No Encoder found for org.locationtech.jts.geom.Point

While using Geomesa and Scala, I have been attempting to encode 2 columns in a Spark Dataframe using the below snippets, but I am continually receiving an issue where it appears that Scala cannot serialize the returned objects into a Dataframe. When using Postgres and PostGIS, life is easy - is this an easy issue, or is there a better library which can handle Geospatial querying coming from a Spark Dataframe that contains latitute and longitude in Double format?
The versions that I am using in my SBT are:
spark: 2.3.0
scala: 2.11.12
geomesa: 2.2.1
jst-*: 1.17.0-SNAPSHOT
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.locationtech.geomesa.spark.jts._
object GetRandomData {
def main(sysArgs: Array[String]) {
#transient val spark: SparkSession = {
SparkSession
.builder()
.config("spark.ui.enabled", "false")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.kryoserializer.buffer.mb","24")
.appName("GetRandomData")
.master("local[*]")
.getOrCreate()
}
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
import spark.sqlContext.implicits._
var coordinates = sc.parallelize(
List(
(35.40466, -80.905458),
(35.344079, -80.872267),
(35.139606, -80.840845),
(35.537786, -80.780051),
(35.525361, -83.031932),
(34.928323, -80.766732),
(35.533865, -82.72344),
(35.50997, -80.588572),
(35.286251, -83.150514),
(35.558519, -81.067069),
(35.569311, -80.916993),
(35.835867, -81.067904),
(35.221695, -82.662141)
)
).
toDS().
toDF("geo_lat", "geo_lng")
coordinates = coordinates.select(coordinates.columns.map(c => col(c).cast(DoubleType)) : _*)
coordinates.show()
val testing = coordinates.map(r => new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))
val coordinatesPointDf = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))
}
}
The exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
- root class: "org.locationtech.jts.geom.Point"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.locationtech.geomesa.spark.jts.encoders.SpatialEncoders$class.jtsPointEncoder(SpatialEncoders.scala:21)
at org.locationtech.geomesa.spark.jts.package$.jtsPointEncoder(package.scala:17)
at GetRandomData$.main(Main.scala:50)
at GetRandomData.main(Main.scala)
If you aren't using an underlying GeoMesa store to load data into a spark session you'll need to explicitly register the JTS types with:
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)
This will register the ST_ operations as well as the JTS encoders.
In plain english, the exception is saying:
I don't known how to convert a Point to a Spark type.
If you keep the latitude and longitude as doubles in your Dataset then you should be fine but as soon as you use an object like Point then you'll need to tell Spark how to convert it. In Spark terms, these are called Encoders and you can create custom ones.
Or you switch to an RDD where no conversion is necessary as long as you don't mind losing Spark SQL stuff.

How to make Scala script infer custom column types in a csv/json schema?

I am trying to infer schema when I load a csv file in my SQLContext using SparkSession. Please note that I do not want to use class here as I am trying to infer the data file schema as soon it is loaded as I do not have any info about the data types or column names of the file before loading it.
Here is what I am trying out in Scala:
package example
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import java.io.File
import org.apache.spark.sql.SparkSession
//import sqlContext.implicits._
object SimpleScalaSpark {
def main(args: Array[String]) {
//val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", "local")
.getOrCreate()
//val etl1Rdd = spark.sparkContext.wholeTextFiles("etl1.json").map(x => x._2)
val jsonTbl = spark.sqlContext.read.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("dateFormat","MM/dd/yyyy HH:mm")
.csv("s1.csv")
// print the inferred schema
jsonTbl.printSchema
}
}
I am able to get DateTime, Integer, Double, String as data types for my file. But I want to implement custom data types based on my own regex patterns such as fields like SSN, VIN-ID, PhoneNumber etc. which all have a fixed pattern which can be detected using regex. This would make schema extraction process for me more accurate and precise. For example, suppose I have a column which contains data formed of 5 or more alphabets and 2 or more numbers, I can say that this column is of type ID.
Any ideas on if it is possible to do this using Scala/Spark? Please let me know the implementation part as well if possible or a source to technical documentation.

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from [Databricks][1] and apply it to the new connector to Kafka and spark structured streaming however I cannot parse the JSON correctly using the out-of-the-box methods in Spark...
note: the topic is written into Kafka in JSON format.
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", IP + ":9092")
.option("zookeeper.connect", IP + ":2181")
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
The following code won't work, I believe that's because the column json is a string and does not match the method from_json signature...
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example
val schema = new StructType()
.add($"id".string)
.add($"name".string)
Now you can use this schema in from_json method like below.
val df = ds1.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")

Reading TSV into Spark Dataframe with Scala API

I have been trying to get the databricks library for reading CSVs to work. I am trying to read a TSV created by hive into a spark data frame using the scala api.
Here is an example that you can run in the spark shell (I made the sample data public so it can work for you)
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")
The documentation says you can specify the delimiter but I am unclear about how to specify that option.
All of the option parameters are passed in the option() function as below:
val segments = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load("s3n://michaeldiscenza/data/test_segments")
With Spark 2.0+ use the built-in CSV connector to avoid third party dependancy and better performance:
val spark = SparkSession.builder.getOrCreate()
val segments = spark.read.option("sep", "\t").csv("/path/to/file")
You May also try to inferSchema and check for schema.
val df = spark.read.format("csv")
.option("inferSchema", "true")
.option("sep","\t")
.option("header", "true")
.load(tmp_loc)
df.printSchema()