Passing sparkSession to function in Scala - Spark 2.1 - scala

I'm migrating from Spark 1.6 to 2.1 and, looking at the function below in Scala, I'm trying to figure out how to pass a SparkSession variable to the function instead of an sqlContext:
private def readHiveTable(sqlContext: HiveContext, hiveTableNm: String,
hiveWorkDb: String): DataFrame
I mean, would sqlContext change to sparkSession? What is the right way to pass this variable in Spark 2.1? Maybe something like:
private def readHiveTable(spark: SparkSession, hiveTableNm: String,
hiveWorkDb: String): DataFrame
UPDATE
Resolved it by passing -
spark: SparkSession instead of sqlContext: HiveContext
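For reference, a minimal sketch of what the updated function can look like (the body and the table lookup are assumptions, since the original body isn't shown; the session also needs enableHiveSupport() when it is built in order to read Hive tables):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical body: read the Hive table through the session's catalog.
private def readHiveTable(spark: SparkSession, hiveTableNm: String,
                          hiveWorkDb: String): DataFrame =
  spark.table(s"$hiveWorkDb.$hiveTableNm")

// The session itself would be built elsewhere in the driver, e.g.
// val spark = SparkSession.builder().appName("app").enableHiveSupport().getOrCreate()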

Related

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data into a collection without the RDD map API (doing it the pure DataFrame way):
object CommonObject {
  def doSomething(...) {
    .......
    val releaseDate = tableDF.where(tableDF("item") <=> "releaseDate").select("value").map(r => r.getString(0)).collect.toList.head
  }
}
This all works, except that Spark 2.3 complains with
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand I can add
import spark.implicits._
before the call, but that requires a SparkSession instance, which is pretty annoying when there is no SparkSession instance available inside the method. As a Spark newbie, what is a clean way to resolve the implicit encoding parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import its implicits, which work for all case classes. This is the easiest way to add encoding. Alternatively, an explicit encoder can be supplied using the Encoders class.
val spark = SparkSession.builder
.appName("name")
.master("local[2]")
.getOrCreate()
import spark.implicits._
The other way is to get the SparkSession from the DataFrame itself via dataframe.sparkSession:
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._
}
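Alternatively, as mentioned above, the encoder can be supplied explicitly through the Encoders class, so no implicits import is needed at all. A sketch against the same DataFrame as in the question (the helper name is made up):

import org.apache.spark.sql.{DataFrame, Encoders}

def releaseDateOf(tableDF: DataFrame): String =
  tableDF.where(tableDF("item") <=> "releaseDate")
    .select("value")
    .map(r => r.getString(0))(Encoders.STRING) // explicit Encoder[String], no spark.implicits._ required
    .collect()
    .head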

No Encoder found for org.locationtech.jts.geom.Point

While using GeoMesa and Scala, I have been attempting to encode 2 columns in a Spark DataFrame using the snippets below, but I keep hitting an issue where it appears that Spark cannot serialize the returned objects into a DataFrame. When using Postgres and PostGIS, life is easy - is this an easy issue to fix, or is there a better library that can handle geospatial querying from a Spark DataFrame that contains latitude and longitude in Double format?
The versions that I am using in my SBT are:
spark: 2.3.0
scala: 2.11.12
geomesa: 2.2.1
jts-*: 1.17.0-SNAPSHOT
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.locationtech.geomesa.spark.jts._
object GetRandomData {
  def main(sysArgs: Array[String]) {
    @transient val spark: SparkSession = {
      SparkSession
        .builder()
        .config("spark.ui.enabled", "false")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.mb", "24")
        .appName("GetRandomData")
        .master("local[*]")
        .getOrCreate()
    }
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    import spark.sqlContext.implicits._

    var coordinates = sc.parallelize(
      List(
        (35.40466, -80.905458),
        (35.344079, -80.872267),
        (35.139606, -80.840845),
        (35.537786, -80.780051),
        (35.525361, -83.031932),
        (34.928323, -80.766732),
        (35.533865, -82.72344),
        (35.50997, -80.588572),
        (35.286251, -83.150514),
        (35.558519, -81.067069),
        (35.569311, -80.916993),
        (35.835867, -81.067904),
        (35.221695, -82.662141)
      )
    ).toDS().toDF("geo_lat", "geo_lng")

    coordinates = coordinates.select(coordinates.columns.map(c => col(c).cast(DoubleType)): _*)
    coordinates.show()

    val testing = coordinates.map(r => new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))
    val coordinatesPointDf = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))
  }
}
The exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
- root class: "org.locationtech.jts.geom.Point"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.locationtech.geomesa.spark.jts.encoders.SpatialEncoders$class.jtsPointEncoder(SpatialEncoders.scala:21)
at org.locationtech.geomesa.spark.jts.package$.jtsPointEncoder(package.scala:17)
at GetRandomData$.main(Main.scala:50)
at GetRandomData.main(Main.scala)
If you aren't using an underlying GeoMesa store to load data into a Spark session, you'll need to explicitly register the JTS types with:
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)
This will register the ST_ operations as well as the JTS encoders.
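In the snippet from the question, that call would go right after the session is created and before any geometry columns are built; a sketch:

val spark: SparkSession = SparkSession.builder()
  // ... same config as in the question ...
  .getOrCreate()

// Register the JTS geometry types and the ST_ functions on this session.
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)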
In plain English, the exception is saying:
I don't know how to convert a Point to a Spark SQL type.
If you keep the latitude and longitude as doubles in your Dataset then you should be fine, but as soon as you use an object like Point you'll need to tell Spark how to convert it. In Spark terms, these are called Encoders, and you can create custom ones.
Or you can switch to an RDD, where no conversion is necessary, as long as you don't mind losing the Spark SQL features.
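As a sketch of the custom-encoder route, the simplest option is a Kryo-based encoder; note that it stores each Point as an opaque binary blob, so the column is not queryable with the ST_ functions:

import org.apache.spark.sql.{Encoder, Encoders}
import org.locationtech.jts.geom.{Coordinate, GeometryFactory, Point}

// Kryo-serialized encoder for JTS Point values.
implicit val pointEncoder: Encoder[Point] = Encoders.kryo[Point]

// With the encoder in scope, the map from the question compiles.
val testing = coordinates.map { r =>
  new GeometryFactory().createPoint(new Coordinate(r.getDouble(0), r.getDouble(1)))
}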

Spark 2 to Spark 1.6

I am trying to convert the following code to run on Spark 1.6, but I am facing certain issues while converting the SparkSession usage to a SparkContext and SQLContext.
object TestData {
  def makeIntegerDf(spark: SparkSession, numbers: Seq[Int]): DataFrame =
    spark.createDataFrame(
      spark.sparkContext.makeRDD(numbers.map(Row(_))),
      StructType(List(StructField("column", IntegerType, nullable = false)))
    )
}
How do I convert it so that it runs on Spark 1.6?
SparkSession is available only from Spark 2.0 onwards, so if you want to use Spark 1.6 you need to create a SparkContext and SQLContext in the driver class and pass them to the function.
So you can create
val conf = new SparkConf().setAppName("simple")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
and then call the function as
val callFunction = makeIntegerDf(sparkContext, sqlContext, numbers)
And your function should be:
def makeIntegerDf(sparkContext: SparkContext, sqlContext: SQLContext, numbers: Seq[Int]): DataFrame =
  sqlContext.createDataFrame(
    sparkContext.makeRDD(numbers.map(Row(_))),
    StructType(List(StructField("column", IntegerType, nullable = false)))
  )
The only major difference here is the use of spark, which is a SparkSession, as opposed to a SparkContext plus SQLContext.
So you would do something like this:
object TestData {
  def makeIntegerDf(sc: SparkContext, sqlContext: SQLContext, numbers: Seq[Int]): DataFrame =
    sqlContext.createDataFrame(
      sc.makeRDD(numbers.map(Row(_))),
      StructType(List(StructField("column", IntegerType, nullable = false)))
    )
}
Of course, you would need to create a SparkContext and an SQLContext instead of a SparkSession in order to provide them to the function.

Importing Spark libraries using Intellij IDEA

I would like to use Spark SQL in an IntelliJ IDEA SBT project.
Even though I have added the library, the code does not seem to import it.
Spark Core seems to be working, however.
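On the dependency side, Spark SQL ships as a separate artifact from Spark Core, so the build.sbt needs both; a sketch of the relevant lines (versions are assumptions, match them to your installation):

// build.sbt
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0"
)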
You can't create a DataFrame from a Scala List[A]. You first need to create an RDD[A] and then transform that into a DataFrame. You also need an SQLContext:
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val test = sc.parallelize(List(1,2,3,4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL should look like:
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
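With that boilerplate in place, the parallelize example from the first snippet works the same way against the session's context; a sketch:

// Inside main, after import spark.sqlContext.implicits._
val test = spark.sparkContext.parallelize(List(1, 2, 3, 4)).toDF("value")
test.show()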

How to create a Spark Dataset from an RDD

I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.
Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset using the desired object type - in this case a LabeledPoint:
val sqlContext = new SQLContext(sc)
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]
Update: Ever heard of a SparkSession? (Neither had I until now...)
So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:
Spark 2.0.0+ approaches
Notice that in both of the approaches below (the simpler of which is credited to #zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val pointsTrainDs = sparkSession.createDataset(training)
val model = new LogisticRegression()
  .fit(pointsTrainDs)
Second way for Spark 2.0.0+ Credit to #zero323
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val trainDs = training.toDS()
Traditional Spark 1.X and earlier approach
val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()
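For context, splits is not defined in the snippet above; presumably it comes from a typical random split of the labeled data, e.g. (an assumption, not part of the original answer):

// Hypothetical origin of `splits`: a 70/30 random split of an RDD[LabeledPoint] called `data`.
val splits = data.randomSplit(Array(0.7, 0.3), seed = 11L)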
See also: How to store custom objects in Dataset? by the esteemed #zero323 .