Implicits in a Spark Scala program not working

I am not able to perform an implicit conversion from an RDD to a DataFrame in a Scala program, even though I am importing spark.implicits._.
Any help would be appreciated.
Main program with the implicits:
object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = spark.sparkContext
    val data = sc.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    case class CC1(LAT: Double, LONG: Double)
    val allData = allSplit.map( p => CC1( p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
Error is as follows:
Error:(40, 25) value toDF is not a member of org.apache.spark.rdd.RDD[CC1]
val allDF = allData.toDF()

When you define the case class CC1 inside the main method, you hit https://issues.scala-lang.org/browse/SI-6649; toDF() then fails to locate the appropriate implicit TypeTag for that class at compile time.
You can see this in this simple example:
import scala.reflect.runtime.universe.TypeTag

case class Out()

object TestImplicits {
  def main(args: Array[String]) {
    case class In()
    val typeTagOut = implicitly[TypeTag[Out]] // compiles
    val typeTagIn = implicitly[TypeTag[In]]   // does not compile: Error:(23, 31) No TypeTag available for In
  }
}
Spark's relevant implicit conversion has this type parameter: [T <: Product : TypeTag] (see newProductEncoder in SQLImplicits), which means an implicit TypeTag[CC1] is required.
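For reference, the declaration in SQLImplicits looks roughly like this (paraphrased; check the Spark source for the exact definition):

implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T]

Without a TypeTag[CC1] in scope, no Encoder[CC1] can be derived, so the toDF() extension cannot be resolved.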
To fix this, simply move the definition of CC1 out of the method, or out of the object entirely:
case class CC1(LAT: Double, LONG: Double)

object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val data = spark.sparkContext.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    val allData = allSplit.map( p => CC1( p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}

I thought toDF comes from sqlContext.implicits._, so you need to import that rather than spark.implicits._. At least that is the case in Spark 1.6.
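For completeness, the Spark 1.6 equivalent would look roughly like this (a sketch assuming an existing SparkContext sc); in Spark 2.x, spark.implicits._ on the SparkSession exposes the same conversions:

// Spark 1.6 style: the toDF conversion lives in sqlContext.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val allDF = allData.toDF()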


Select case class based on String in Scala

How can I select a case class based on a String value?
My code is
val spark = SparkSession.builder()...
val rddOfJsonStrings: RDD[String] = // some json strings as RDD
val classSelector: String = ??? // could be "Foo" or "Bar", or any other String value

case class Foo(foo: String)
case class Bar(bar: String)

if (classSelector == "Foo") {
  val df: DataFrame = spark.read.json(rddOfJsonStrings)
  df.as[Foo]
} else if (classSelector == "Bar") {
  val df: DataFrame = spark.read.json(rddOfJsonStrings)
  df.as[Bar]
} else {
  throw ClassUnknownException // custom Exception
}
The variable classSelector is a simple String that should be used to point to the case class of the same name.
Imagine I don't only have Foo and Bar as case classes but more than those two. How is it possible to call the df.as[] statement based on the String (if possible at all)?
Or is there a completely different approach available in Scala?
Check the code below:

classSelector match {
  case c if Foo.getClass.getSimpleName.replace("$", "").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Foo]
  case c if Bar.getClass.getSimpleName.replace("$", "").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Bar]
  case _ => throw ClassUnknownException // custom Exception
}
How is it possible to call the df.as[] statement based on the String (if possible at all)?
It isn't (or based on any runtime value). You may note that all answers still need to:
1. have a separate branch for Foo and Bar (and one more branch for each class you'll want to add);
2. repeat the class name twice in the branch.
You can avoid the second:
import scala.reflect.{classTag, ClassTag}

val df: DataFrame = spark.read.json(rddOfJsonStrings)

// local function defined where df and classSelector are visible
// (Option.when requires Scala 2.13+)
def dfAsOption[T : Encoder : ClassTag] =
  Option.when(classSelector == classTag[T].runtimeClass.getSimpleName)(df.as[T])

dfAsOption[Foo].orElse(dfAsOption[Bar]).getOrElse(throw ClassUnknownException)
But for the first you'd need a macro if it's possible at all. I would guess it isn't.
Define a generic method and invoke it:

getDs[Foo](spark, rddOfJsonStrings)
getDs[Bar](spark, rddOfJsonStrings)

// a ClassTag is needed to recover the runtime Class[T] for the bean encoder
def getDs[T](spark: SparkSession, rddOfJsonStrings: RDD[String])(implicit ct: ClassTag[T]): Dataset[T] =
  spark.read.json(rddOfJsonStrings).as[T](Encoders.bean(ct.runtimeClass.asInstanceOf[Class[T]]))
Alternative approach. Highlights:
1. Use the simpleName of the case class and not of the companion object.
2. If classSelector is null, the solution won't fail.
case class Foo(foo: String)
case class Bar(bar: String)
Testcase:

val rddOfJsonStrings: RDD[String] = spark.sparkContext.parallelize(Seq("""{"foo":1}"""))
val classSelector: String = "Foo" // could be "Foo" or "Bar", or any other String value

val ds = classSelector match {
  case foo if classOf[Foo].getSimpleName == foo =>
    val df: DataFrame = spark.read.json(rddOfJsonStrings)
    df.as[Foo]
  case bar if classOf[Bar].getSimpleName == bar =>
    val df: DataFrame = spark.read.json(rddOfJsonStrings)
    df.as[Bar]
  case _ => throw new UnsupportedOperationException
}

ds.show(false)
/**
* +---+
* |foo|
* +---+
* |1 |
* +---+
*/
You can use reflective toolbox
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.reflect.runtime
import scala.tools.reflect.ToolBox

object Main extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark SQL basic example")
    .getOrCreate()

  import spark.implicits._

  val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"foo":"aaa"}"""))
  // val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"bar":"bbb"}"""))

  val classSelector: String = "Foo"
  // val classSelector: String = "Bar"

  case class Foo(foo: String)
  case class Bar(bar: String)

  val runtimeMirror = runtime.currentMirror
  val toolbox = runtimeMirror.mkToolBox()

  val res = toolbox.eval(toolbox.parse(s"""
    import org.apache.spark.sql.DataFrame
    import Main._
    import spark.implicits._
    val df: DataFrame = spark.read.json(rddOfJsonStrings)
    df.as[$classSelector]
  """)).asInstanceOf[Dataset[_]]

  println(res) // [foo: string]
}
Notice that statically you will have a Dataset[_], not Dataset[Foo] or Dataset[Bar].
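If a caller does happen to know the concrete type at some point, the only way to get it back statically is an explicit cast, for example (illustrative only):

val typedFoo = res.asInstanceOf[Dataset[Foo]]

which of course defeats the purpose of selecting the class by a runtime String.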

How can one create a method in scala that generates the sqlcontext implicits encoder based only on type?

I am trying to build a framework on top of Spark that can automatically create datasets from data stored on disk. An example of the sort of thing I would like to do is:
var sparkSession: SparkSession = _ // initialized elsewhere

def generateDataset[T <: Product : TypeTag](path: Path): Dataset[T] = {
  val df: DataFrame = generateDataFrameFromPath(path)
  import sparkSession.implicits._
  df.as[T]
}
which works just fine. The problem I have is trying to extend that to DataFrames and other classes for which implicit encoders can be generated (like String or Int). I have tried to do something like this:
var sparkSession: SparkSession = _ // initialized elsewhere

def generateDataset[T : TypeTag](path: Path): Dataset[T] = {
  val df: DataFrame = generateDataFrameFromPath(path)
  typeOf[T] match {
    case t if t =:= typeOf[Row] => df
    case t if t <:< typeOf[Product] =>
      import sparkSession.implicits._
      df.as[T]
  }
}
But the compiler doesn't like this even though we know T is a subclass of Product when we call .as[T].
I know that the standard approach would be to use an Encoder context bound/implicit; however, my calling code has no knowledge of the sparkSession until it gets the data back.
Is there a way to get this to work without having the encoder generated by the caller?
Try generating the encoder in place:

def generateDataset[T : TypeTag](path: Path) = {
  val df: DataFrame = generateDataFrameFromPath(path)
  typeOf[T] match {
    case t if t =:= typeOf[Row] => df
    case t if t <:< typeOf[Product] =>
      implicit val enc: org.apache.spark.sql.Encoder[T] =
        org.apache.spark.sql.catalyst.encoders.ExpressionEncoder()
      df.as[T]
  }
}
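A hypothetical call site, assuming a Path value path, the generateDataFrameFromPath helper, and a case class Person are all defined elsewhere:

val rows   = generateDataset[Row](path)    // Row branch: the DataFrame is returned as-is
val people = generateDataset[Person](path) // Product branch: encoder built via ExpressionEncoder

Because the method has no declared result type, the inferred result is the least upper bound of the two branches, so callers may still need a cast to get a Dataset of the concrete type.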

Class import error in Scala/Spark

I am new to Spark and I'm using it with Scala. I wrote a simple object that loads fine in the spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer

object Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
Now I want to put it in a class so that I can pass parameters. I use the same code, just with class instead of object.
import org.apache.spark.ml.feature.StringIndexer

class Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
This returns import errors.
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this; it seems to work fine. In the spark-shell, the SparkContext and the session implicits are already in scope, which is why the :load version worked; inside your own class you need to create the SparkSession and import spark.implicits._ yourself:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

def trainModel() = {
  val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
  import spark.implicits._
  val data = spark.read.textFile("/user/PT/data/newfav.csv")
  val df = data.map(_.split(",") match {
    case Array(user, food, fav) => (user, food, fav.toDouble)
  }).toDF("userID", "foodID", "favorite")
  val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
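To actually pass parameters, here is a sketch of the same logic inside a class that takes the SparkSession as a constructor parameter (the class shape, the path parameter, and the fit/transform call are illustrative additions, not from the original answer):

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

class Collaborative(spark: SparkSession) {
  import spark.implicits._

  def trainModel(path: String) = {
    val df = spark.read.textFile(path)
      .map(_.split(",") match { case Array(user, food, fav) => (user, food, fav.toDouble) })
      .toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
    userIndexer.fit(df).transform(df)
  }
}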

Streaming CSV with akka-http in scala

I am very new to akka-http, and I would like to stream a CSV with an arbitrary number of lines.
For instance, I would like to return:
a,1
b,2
c,3
with the following code:
implicit val actorSystem = ActorSystem("system")
implicit val actorMaterializer = ActorMaterializer()

val map = new mutable.HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)

val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)

val route =
  path("test") {
    complete {
      HttpEntity(`text/csv`, ??? using map)
    }
  }

Http().bindAndHandle(route, "localhost", 8080)
Thanks for your help
EDIT: Thanks to Ramon J Romero y Vigil
package test

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpCharsets.`UTF-8`
import akka.http.scaladsl.model._
import akka.http.scaladsl.server.Directives._
import akka.stream._
import akka.util.ByteString

import scala.collection.mutable

object Test {
  def main(args: Array[String]) {
    implicit val actorSystem = ActorSystem("system")
    implicit val actorMaterializer = ActorMaterializer()

    val map = new mutable.HashMap[String, Int]()
    map.put("a", 1)
    map.put("b", 2)
    map.put("c", 3)

    val mapStream = Stream.fromIterator(() => map.toIterator)
      .map((k: String, v: Int) => s"$k,$v")
      .map(ByteString.apply)

    val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)

    val route =
      path("test") {
        complete {
          HttpEntity(`text/csv`, mapStream)
        }
      }

    Http().bindAndHandle(route, "localhost", 8080)
  }
}
With this code I have two compile errors:
Error:(29, 28) value fromIterator is not a member of object scala.collection.immutable.Stream
val mapStream = Stream.fromIterator(() => map.toIterator)
Error:(38, 11) overloaded method value apply with alternatives:
(contentType: akka.http.scaladsl.model.ContentType,file: java.io.File,chunkSize: Int)akka.http.scaladsl.model.UniversalEntity <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.stream.scaladsl.Source[akka.util.ByteString,Any])akka.http.scaladsl.model.HttpEntity.Chunked <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.util.ByteString)akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType,bytes: Array[Byte])akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType.NonBinary,string: String)akka.http.scaladsl.model.HttpEntity.Strict
cannot be applied to (akka.http.scaladsl.model.ContentType.WithCharset, List[akka.util.ByteString])
HttpEntity(`text/csv`, mapStream)
I used a List of tuples to get around the first issue (however, I do not know how to stream a map in Scala).
No idea for the second.
Thanks for your help.
(I am using Scala 2.11.8)
Use the apply function in HttpEntity that takes in a Source[ByteString,Any]. The apply creates a Chunked entity. You can read your file using code based on the documentation for streaming file IO using an akka stream Source:
import java.nio.file.Paths
import akka.stream.scaladsl._

val file = Paths.get("yourFile.csv")
val entity = HttpEntity(`text/csv`, FileIO.fromPath(file))
The stream will break your file up into chunks; the default chunk size is currently 8192 bytes.
To stream the map that you've created you can use a similar trick:
val mapStream = Source.fromIterator(() => map.toIterator)
  .map { case (k, v) => s"$k,$v\n" } // each map entry becomes one CSV line
  .map(ByteString.apply)

val mapEntity = HttpEntity(`text/csv`, mapStream)
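Wiring that entity into the route from the question would then look like this (sketch):

val route =
  path("test") {
    complete(mapEntity)
  }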

How to pass hiveContext as argument to functions spark scala

I have created a hiveContext in the main() function in Scala, and I need to pass this hiveContext as a parameter to other functions. This is the structure:
object Project {
  def main(name: String): Int = {
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    ...
  }
  def read(streamId: Int, hc: hiveContext): Array[Byte] = {
    ...
  }
  def close(): Unit = {
    ...
  }
}
but it doesn't work. Function read() is called inside main().
any idea?
I declare the hiveContext as implicit; this works for me:
implicit val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)
Defined in MyJob:
override def run(config: Config)(implicit sqlContext: SQLContext): Unit = ...
But if you don't want it implicit, this should be the same
val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)(sqlContext)
override def run(config: Config)(sqlContext: SQLContext): Unit = ...
Also, your function read should declare HiveContext (the type) rather than hiveContext (the value) as the type of the parameter hc:
def read (streamId: Int, hc:HiveContext): Array[Byte] =
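Putting it together, the question's structure with the context passed explicitly would look roughly like this (a sketch; it assumes Spark 1.x, where HiveContext exists, an already-initialized SparkContext, and the hc.sql call is just a placeholder):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object Project {
  def main(name: String): Int = {
    val sc: SparkContext = ??? // initialized elsewhere, as in the question
    val hiveContext = new HiveContext(sc)
    val bytes = read(1, hiveContext) // pass the context down explicitly
    0
  }

  def read(streamId: Int, hc: HiveContext): Array[Byte] = {
    hc.sql("SELECT 1") // placeholder: use the passed-in context here
    Array.empty[Byte]
  }

  def close(): Unit = ()
}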
I tried several options; this is what eventually worked for me:
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    // val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    .
    .
    .
  }
}
Note: In the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to the other; they always gave me a NullPointerException or a Task not serializable exception.
The good thing is that it eventually worked this way, and I could retrieve parameters from a row of the first DataFrame and use those values when loading the second DataFrame.