Spark convert RDD to DataFrame - Enumeration is not supported - scala

I have a case class which contains an enumeration field "PersonType". I would like to insert this record into a Hive table.
object PersonType extends Enumeration {
  type PersonType = Value
  val BOSS = Value
  val REGULAR = Value
}

case class Person(firstname: String, lastname: String)
case class Holder(personType: PersonType.Value, person: Person)
And:
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
val item = new Holder(PersonType.REGULAR, new Person("tom", "smith"))
val content: Seq[Holder] = Seq(item)
val data : RDD[Holder] = sc.parallelize(content)
val df = data.toDF()
...
When I try to convert the corresponding RDD to DataFrame, I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type com.test.PersonType.Value is not supported
...
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
I'd like to convert PersonType to String before inserting to Hive.
Is it possible to extend the implicit conversion to handle PersonType as well?
I tried something like this, but it didn't work:
object PersonTypeConversions {
  implicit def toString(personType: PersonType.Value): String = personType.toString()
}
import PersonTypeConversions._
Spark: 1.6.0
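One possible workaround (a minimal sketch; HolderRow is a hypothetical intermediate case class that is not part of the original code) is to carry the enumeration as its String name before calling toDF(), since Catalyst in 1.6 cannot derive a schema for Enumeration values:

// Hypothetical intermediate shape: the enum becomes a String, Person stays a nested case class.
case class HolderRow(personType: String, person: Person)

val rows: RDD[HolderRow] = data.map(h => HolderRow(h.personType.toString, h.person))
val dfFromRows = rows.toDF()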

Related

Map different value to the case class property during serialization and deserialization using Jackson

I am trying to deserialize this JSON using the Jackson library:
{
  "name": "abc",
  "ageInInt": 30
}
To the case class Person
case class Person(name: String, @JsonProperty(value = "ageInInt") @JsonAlias(Array("ageInInt")) age: Int)
but I am getting -
No usable value for age
Did not find value which can be converted into int
org.json4s.package$MappingException: No usable value for age
Did not find value which can be converted into int
Basically, I want to deserialize the JSON, mapping the key ageInInt to the field age.
Here is the complete code:
val json =
  """{
    |"name": "Tausif",
    |"ageInInt": 30
    |}""".stripMargin

implicit val format = DefaultFormats
println(Serialization.read[Person](json))
You need to register the DefaultScalaModule with your JsonMapper.
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.core.`type`.TypeReference
import com.fasterxml.jackson.annotation.JsonProperty

val mapper = JsonMapper.builder()
  .addModule(DefaultScalaModule)
  .build()

case class Person(name: String, @JsonProperty(value = "ageInInt") age: Int)

val json =
  """{
    |"name": "Tausif",
    |"ageInInt": 30
    |}""".stripMargin

val person: Person = mapper.readValue(json, new TypeReference[Person]{})
println(person) // Prints Person(Tausif,30)
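As a side note, the MappingException in the question actually comes from json4s (Serialization.read), not from Jackson's mapper. If you prefer to stay with json4s, a hedged sketch is to rename the key on the parsed JSON before extraction (transformField rewrites fields on the parsed tree; the import path assumes json4s's Jackson backend):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

case class Person(name: String, age: Int)

implicit val formats: Formats = DefaultFormats

// Rename the incoming "ageInInt" key to "age" so the default extractor can fill the case class.
val person = (parse(json) transformField {
  case ("ageInInt", v) => ("age", v)
}).extract[Person]
println(person) // Person(Tausif,30)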

accessing spark from another class

I've created a class containing a function that processes a spark dataframe.
class IsbnEncoder(df: DataFrame) extends Serializable {
  def explodeIsbn(): DataFrame = {
    val name = df.first().get(0).toString
    val year = df.first().get(1).toString
    val isbn = df.first().get(2).toString
    val isbn_ean = "ISBN-EAN: " + isbn.substring(6, 9)
    val isbn_group = "ISBN-GROUP: " + isbn.substring(10, 12)
    val isbn_publisher = "ISBN-PUBLISHER: " + isbn.substring(12, 16)
    val isbn_title = "ISBN-TITLE: " + isbn.substring(16, 19)
    val data = Seq((name, year, isbn_ean),
      (name, year, isbn_group),
      (name, year, isbn_publisher),
      (name, year, isbn_title))
    df.union(spark.createDataFrame(data))
  }
}
The problem is I don't know how to create a DataFrame within the class without creating a new instance of spark = SparkSession.builder().appName("isbnencoder").master("local").getOrCreate(). The session is defined in another class, in a separate file, that includes this file and uses this class (the one I've included). Obviously, my code is getting errors because the compiler doesn't know what spark is.
You can create a trait that extends Serializable and defines the SparkSession as a lazy val. Then, throughout your project, every object that extends that trait gets the SparkSession instance.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

trait SparkSessionWrapper extends Serializable {
  lazy val spark: SparkSession = {
    SparkSession.builder().appName("TestApp").getOrCreate()
  }
}

// object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    val readdf = ReadFileProcessor.ReadFile("testpath")
    readdf.createOrReplaceTempView("TestTable")
    val viewdf = spark.sql("Select * from TestTable")
  }
}

object ReadFileProcessor extends SparkSessionWrapper {
  def ReadFile(path: String): DataFrame = {
    val df = spark.read.format("csv").load(path)
    df
  }
}
As both objects extend SparkSessionWrapper, the Spark session is initialized the first time the spark variable is encountered in the code; after that you can refer to it from any object that extends the trait without passing it as a method parameter. It gives you an experience similar to a notebook.
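Applied back to the IsbnEncoder class from the question, the same pattern might look like the sketch below (the explodeIsbn body is abbreviated; only the session wiring changes):

// Hypothetical: the question's class extends the trait instead of building its own session.
class IsbnEncoder(df: DataFrame) extends SparkSessionWrapper {
  def explodeIsbn(): DataFrame = {
    val name = df.first().get(0).toString
    val year = df.first().get(1).toString
    val isbn = df.first().get(2).toString
    val data = Seq((name, year, "ISBN-EAN: " + isbn.substring(6, 9)))
    // `spark` comes from the lazy val in SparkSessionWrapper.
    df.union(spark.createDataFrame(data))
  }
}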

Scala Broadcast + UDF

I am trying to broadcast a List and pass the broadcast variable to a UDF (the Scala code is in a separate file), but I am facing issues.
val Lookup_BroadCast = SC.broadcast(lookup_data)
UDF creation with 3 arguments
val Call_Sub_Pgm = udf(foo(_: String, Lookup_BroadCast: org.apache.spark.broadcast.Broadcast[List[String]], Trace: String))
Calling the UDF using "withColumn"
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),Lookup_BroadCast,lit(Trace)))
I am getting a compilation error for the above code: "found broadcast variable, required Sql Column".
If I remove the "Lookup_BroadCast" variable from the call,
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),lit(Trace)))
then I get the below error:
java.lang.ClassCastException: org.spark.masking.ExtractData$$anonfun$7 cannot be cast to scala.Function0
A Serializable wrapper class can be created for the function, with the Broadcast in its constructor:
class Wrapper(Lookup_BroadCast: Broadcast[List[String]]) extends Serializable {
  def foo(v: String, s: String): String = {
    // usage example
    Lookup_BroadCast.value.head
  }
}
And used like:
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))
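For completeness, a hedged usage sketch with the question's identifiers (SC, lookup_data, Out_DF, and Trace are placeholders taken from the question):

import org.apache.spark.sql.functions.{col, lit, udf}

val Lookup_BroadCast = SC.broadcast(lookup_data)
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))

// Only Column arguments are passed to the UDF; the broadcast travels inside the wrapper instance.
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col("Col-1"), lit(Trace)))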

Scala Spark Dataset change class type

I have a DataFrame which I created with the schema of MyData1, and then I added a column so that the new DataFrame follows the schema of MyData2. Now I want to return the new DataFrame as a Dataset, but I am getting the following error:
[info] org.apache.spark.sql.AnalysisException: cannot resolve '`hashed`' given input columns: [id, description];
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
Here is my code:
import org.apache.spark.sql.{DataFrame, Dataset}

case class MyData1(id: String, description: String)
case class MyData2(id: String, description: String, hashed: String)

object MyObject {
  def read(arg1: String, arg2: String): Dataset[MyData2] = {
    var df: DataFrame = null
    val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
    val obj2 = new Matcher("cbutrer383", "g567g4rwew")
    val obj3 = new Matcher("cbutrer383", "567yr45e45")
    df = Seq(obj1, obj2, obj3).toDF("id", "description")
    df.withColumn("hashed", lit("hash"))
    val ds: Dataset[MyData2] = df.as[MyData2]
    ds
  }
}
I know that there is probably something wrong in the following line, but I can't figure out what:
val ds: Dataset[MyData2] = df.as[MyData2]
I am a newbie, so I am probably making a basic mistake. Can anyone help? TIA
You forgot to assign the newly created DataFrame to df:
df = df.withColumn("hashed", lit("hash"))
The Spark docs for withColumn say:
Returns a new Dataset by adding a column or replacing the existing
column that has the same name.
A better version of your read function is shown below. Try to avoid null assignments and var; an explicit return statement is not really required either:
def read(arg1: String, arg2: String): Dataset[MyData2] = {
  val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
  val obj2 = new Matcher("cbutrer383", "g567g4rwew")
  val obj3 = new Matcher("cbutrer383", "567yr45e45")
  Seq(obj1, obj2, obj3).toDF("id", "description")
    .withColumn("hashed", lit("hash"))
    .as[MyData2]
}
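Note that this version assumes lit and an implicit Encoder[MyData2] are in scope; a hedged sketch of the imports it relies on (assuming a SparkSession named spark):

import org.apache.spark.sql.functions.lit
import spark.implicits._ // supplies the Encoder[MyData2] required by .as[MyData2] and toDF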

Add data type from PostgreSQL extension in Slick

I'm using the PostGIS extension for PostgreSQL and I'm trying to retrieve a PGgeometry object from a table.
This version is working fine:
import java.sql.DriverManager
import java.sql.Connection
import org.postgis.PGgeometry

object PostgersqlTest extends App {
  val driver = "org.postgresql.Driver"
  val url = "jdbc:postgresql://localhost:5432/gis"
  var connection: Connection = null
  try {
    Class.forName(driver)
    connection = DriverManager.getConnection(url)
    val statement = connection.createStatement()
    val resultSet = statement.executeQuery("SELECT geom FROM table;")
    while (resultSet.next()) {
      val geom = resultSet.getObject("geom").asInstanceOf[PGgeometry]
      println(geom)
    }
  } catch {
    case e: Exception => e.printStackTrace()
  }
  connection.close()
}
I need to be able to do the same thing using a Slick custom query, but this version doesn't work:
Q.queryNA[PGgeometry]("SELECT geom FROM table;")
and gives me this compilation error
Error:(50, 40) could not find implicit value for parameter rconv: scala.slick.jdbc.GetResult[org.postgis.PGgeometry]
val query = Q.queryNA[PGgeometry](
^
Is there a simple way to add the PGgeometry data type in Slick without having to convert the returned object to a String and parse it?
To use it successfully, you need to define a GetResult, and maybe a SetParameter if you want to insert/update it in the db.
Here's some code extracted from the Slick tests (p.s. I assume you're using Slick 2.1.0):
implicit val getUserResult = GetResult(r => new User(r.<<, r.<<))
case class User(id:Int, name:String)
val userForID = Q[Int, User] + "select id, name from USERS where id = ?"
But, if your Java/Scala type is jts.Geometry instead of PGgeometry, you can try slick-pg, which has built-in support for jts.Geometry and PostGIS for Slick Lifted and Plain SQL.
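If you want to keep PGgeometry itself, a minimal sketch (an assumption on my part, relying on the JDBC driver handing the column back as a PGgeometry object, just like in your plain-JDBC version) would define the GetResult via nextObject:

import scala.slick.jdbc.{GetResult, StaticQuery => Q}
import org.postgis.PGgeometry

// Assumption: the result set column arrives as the driver-level PGgeometry instance.
implicit val getPGgeometry: GetResult[PGgeometry] =
  GetResult(r => r.nextObject().asInstanceOf[PGgeometry])

val geomQuery = Q.queryNA[PGgeometry]("SELECT geom FROM table;")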
To overcome the same issue, I used slick-pg (0.8.2) and JTS's Geometry classes, as tminglei mentioned in the previous answer. There are two steps to use slick-pg to handle PostGIS's geometry types: (i) extend Slick's PostgresDriver with PgPostGISSupport and (ii) define an implicit converter for your plain query, as shown below.
As shown in this page, you should first extend the PostgresDriver with PgPostGISSupport:
object MyPostgresDriver extends PostgresDriver with PgPostGISSupport {
  override lazy val Implicit = new Implicits with PostGISImplicits
  override val simple = new Implicits with SimpleQL with PostGISImplicits with PostGISAssistants

  val plainImplicits = new Implicits with PostGISPlainImplicits
}
Using the implicit conversions defined in plainImplicits in the extended driver, you can write your query as:
import com.vividsolutions.jts.geom.LineString // Or any other JTS geometry type.
import MyPostgresDriver.plainImplicits._
import scala.slick.jdbc.GetResult

case class Row(id: Int, geom: LineString)

implicit val geomConverter = GetResult[Row](r => {
  Row(r.nextInt, r.nextGeometry[LineString])
})

val query = Q.queryNA[Row](
  """SELECT id, geom FROM table;"""
)
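And a hedged execution sketch for the plain query above, assuming a Slick 2.1 Database configured like the JDBC version in the question:

import scala.slick.jdbc.JdbcBackend.Database

val db = Database.forURL("jdbc:postgresql://localhost:5432/gis", driver = "org.postgresql.Driver")

db.withSession { implicit session =>
  // StaticQuery results are materialized with .list, which needs an implicit session.
  query.list.foreach(row => println(s"${row.id}: ${row.geom}"))
}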