How to pass a HiveContext as an argument to functions in Spark Scala

I have created a HiveContext in the main() function in Scala, and I need to pass this HiveContext as a parameter to other functions. This is the structure:
object Project {
  def main(name: String): Int = {
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    ...
  }
  def read(streamId: Int, hc: hiveContext): Array[Byte] = {
    ...
  }
  def close(): Unit = {
    ...
  }
}
but it doesn't work. The read() function is called inside main().
Any ideas?

I declare the HiveContext as implicit; this is working for me:
implicit val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)
Defined in MyJob:
override def run(config: Config)(implicit sqlContext: SQLContext): Unit = ...
But if you don't want it implicit, this works the same way:
val sqlContext: HiveContext = new HiveContext(sc)
MyJob.run(conf)(sqlContext)
override def run(config: Config)(sqlContext: SQLContext): Unit = ...
Also, your read function should declare HiveContext (the class name) as the type of the parameter hc, not hiveContext:
def read (streamId: Int, hc:HiveContext): Array[Byte] =
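Putting that together with the original structure, a minimal sketch could look like the following (the SparkConf setup, the streamId value, and the body of read are placeholders, not part of the original question):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Project {
  def main(name: String): Int = {
    val sc = new SparkContext(new SparkConf().setAppName("Project"))
    val hiveContext = new HiveContext(sc)
    val bytes = read(1, hiveContext) // pass the context explicitly
    // ...
    0
  }

  // The parameter type is the class HiveContext, not the value name hiveContext
  def read(streamId: Int, hc: HiveContext): Array[Byte] = {
    // use hc here, e.g. hc.sql("SELECT ...")
    Array.emptyByteArray
  }

  def close(): Unit = {
    // ...
  }
}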

I tried several options; this is what eventually worked for me.
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    // iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    // val sqlcc = sqlCont // Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    // ...
  }
}
Note: in the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to the other; it always gave me a NullPointerException or a Task not serializable exception.
The good thing is that it eventually worked this way, and I could retrieve values from a row of the first DataFrame and use them when loading the second DataFrame.
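As a side note (not part of the answer above, just a sketch under the assumption that df1 is small enough to collect): bringing the driving rows back to the driver first means the SQLContext is only ever used on the driver, which sidesteps the serialization issues mentioned above.
// Hypothetical alternative to the df1.foreach above: collect first, then query on the driver.
val drivingRows = df1.collect()
drivingRows.foreach { x =>
  getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
}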

Related

Convert Scala Reflection MethodMirror to Scala Function

I am trying to create a Seq of methods that will operate on a Spark DataFrame. Currently I am explicitly creating this Seq at runtime:
val allFuncs: Seq[DataFrame => DataFrame] = Seq(func1, func2, func3)
def func1(df: DataFrame): DataFrame = {}
def func2(df: DataFrame): DataFrame = {}
def func3(df: DataFrame): DataFrame = {}
I added functionality that allows developers to add an annotation and I'm creating a Seq of MethodMirrors from it like so, but I'd like getMyFuncs to return a Seq[(DataFrame => DataFrame)]:
def getMyFuncs(): Seq[(DataFrame => DataFrame)] = {
  // Gets anything with the @MyFunc annotation
  val listOfAnnotations = typeOf[T].members.flatMap(f => f.annotations.find(_.tree.tpe =:= typeOf[MyFunc]).map((f, _))).toList
  val rm = runtimeMirror(this.getClass.getClassLoader)
  val instanceMirror = rm.reflect(this)
  listOfAnnotations.map(annotation => instanceMirror.reflectMethod(annotation._1.asMethod)).toSeq
}
@MyFunc
def func1(df: DataFrame): DataFrame = {}
@MyFunc
def func2(df: DataFrame): DataFrame = {}
@MyFunc
def func3(df: DataFrame): DataFrame = {}
However, the Seq returned by getMyFuncs is a Seq[reflect.runtime.universe.MethodMirror], not a Seq[(DataFrame => DataFrame)]. That is expected, but it is not the output I need. Is there any way to convert the MethodMirrors into Scala functions?
Try mapping:
val getMyFuncs: Seq[reflect.runtime.universe.MethodMirror] = ???
val getMyFuncs1: Seq[DataFrame => DataFrame] =
getMyFuncs.map(mirror => (dataFrame: DataFrame) => mirror(dataFrame).asInstanceOf[DataFrame])
i.e. creating lambdas manually using reflect.runtime.universe.MethodMirror#apply(..).
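Once the mirrors are wrapped as plain functions, they can be chained like any other DataFrame => DataFrame values, for example (a sketch; the starting DataFrame df is assumed to exist):
// Apply each converted function in turn to a starting DataFrame.
val result: DataFrame = getMyFuncs1.foldLeft(df)((acc, f) => f(acc))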

Implicits in a Spark Scala program not working

I am not able to perform an implicit conversion from an RDD to a DataFrame in a Scala program, although I am importing spark.implicits._.
Any help would be appreciated.
Main Program with the implicits:
object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = spark.sparkContext
    val data = sc.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    case class CC1(LAT: Double, LONG: Double)
    val allData = allSplit.map( p => CC1( p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
Error is as follows:
Error:(40, 25) value toDF is not a member of org.apache.spark.rdd.RDD[CC1]
val allDF = allData.toDF()
When you define the case class CC1 inside the main method, you hit https://issues.scala-lang.org/browse/SI-6649; toDF() then fails to locate the appropriate implicit TypeTag for that class at compile time.
You can see this in this simple example:
import scala.reflect.runtime.universe.TypeTag

case class Out()

object TestImplicits {
  def main(args: Array[String]) {
    case class In()
    val typeTagOut = implicitly[TypeTag[Out]] // compiles
    val typeTagIn = implicitly[TypeTag[In]]   // does not compile: Error:(23, 31) No TypeTag available for In
  }
}
Spark's relevant implicit conversion has this type parameter: [T <: Product : TypeTag] (see newProductEncoder in SQLImplicits), which means an implicit TypeTag[CC1] is required.
To fix this, simply move the definition of CC1 out of the method, or out of the object entirely:
case class CC1(LAT: Double, LONG: Double)

object spark1 {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
    import spark.implicits._
    val data = spark.sparkContext.textFile("/TestDataB.txt")
    val allSplit = data.map(line => line.split(","))
    val allData = allSplit.map( p => CC1( p(0).trim.toDouble, p(1).trim.toDouble))
    val allDF = allData.toDF()
    // ... other code
  }
}
I thought toDF is in sqlContext.implicits._, so you need to import that rather than spark.implicits._. At least that is the case in Spark 1.6.
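For completeness, the Spark 1.6 style that answer refers to would look roughly like this (a sketch, assuming an existing SparkContext named sc):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // brings toDF() into scope in Spark 1.x
// The case class still has to be defined outside the method for the TypeTag to resolve.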

How can one create a method in Scala that generates the SQLContext implicits encoder based only on type?

I am trying to build a framework on top of Spark that can automatically create Datasets from data stored on disk. An example of the sort of thing I would like to do is:
var sparkSession: SparkSession = _ // initialized elsewhere

def generateDataset[T <: Product : TypeTag](path: Path): Dataset[T] = {
  val df: DataFrame = generateDataFrameFromPath(path)
  import sparkSession.implicits._
  df.as[T]
}
which works just fine. The problem I have is trying to extend that to DataFrames and other classes for which implicit encoders can be generated (like String or Int). I have tried to do something like this:
var sparkSession: SparkSession = _ // initialized elsewhere

def generateDataset[T : TypeTag](path: Path): Dataset[T] = {
  val df: DataFrame = generateDataFrameFromPath(path)
  typeOf[T] match {
    case t if t =:= typeOf[Row] => df
    case t if t <:< typeOf[Product] =>
      import sparkSession.implicits._
      df.as[T]
  }
}
But the compiler doesn't like this even though we know T is a subclass of Product when we call .as[T].
I know that the standard approach would be to use the Encoder context bound/implicit however my calling code has no knowledge of the sparkSession until it gets the data back.
Is there a way to get this to work without having the encoder generated by the caller?
Try generating the encoder in place:
def generateDataset[T : TypeTag](path: Path) = {
  val df: DataFrame = generateDataFrameFromPath(path)
  typeOf[T] match {
    case t if t =:= typeOf[Row] => df
    case t if t <:< typeOf[Product] =>
      implicit val enc: org.apache.spark.sql.Encoder[T] =
        org.apache.spark.sql.catalyst.encoders.ExpressionEncoder()
      df.as[T]
  }
}
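Used from calling code, it would look roughly like this (a sketch; MyRecord and the paths are made up for illustration):
// Hypothetical calling code for the sketch above.
case class MyRecord(id: Long, name: String)

val records = generateDataset[MyRecord](new Path("/some/input"))
val rows = generateDataset[Row](new Path("/some/other/input"))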

Class import error in Scala/Spark

I am new to Spark and I'm using it with Scala. I wrote a simple object that loads fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer

object Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
Now I want to put it in a class so I can pass parameters. I use the same code, just with a class instead of an object.
import org.apache.spark.ml.feature.StringIndexer

class Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
This produces the following errors:
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this one; it seems to work fine:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

def trainModel() = {
  val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
  import spark.implicits._
  val data = spark.read.textFile("/user/PT/data/newfav.csv")
  val df = data.map(_.split(",") match {
    case Array(user, food, fav) => (user, food, fav.toDouble)
  }).toDF("userID", "foodID", "favorite")
  val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
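If the goal is still to have a class that takes parameters, one option (a sketch, not taken from the answer above; the inputPath constructor parameter is made up) is to hand the SparkSession in through the constructor so its implicits can be imported inside the class:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

class Collaborative(spark: SparkSession, inputPath: String) {
  import spark.implicits._ // makes toDF and the tuple encoders available inside the class

  def trainModel() = {
    val data = spark.read.textFile(inputPath)
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}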

Transform data using Scala in Spark

I am trying to transform an input text file into a key/value RDD, but the code below doesn't work. (The text file is a tab-separated file.) I am really new to Scala and Spark, so I would really appreciate your help.
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object shortTwitter {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile(args(1).txt).getLines()) {
      val newLine = line.map(line =>
        val p = line.split("\t")
        (p(0).toString, p(1).toInt)
      )
    }
    val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text = sc.textFile(args(0))
    val counts = text.flatMap(line => line.split("\t"))
  }
}
I'm assuming you want the resulting RDD to have the type RDD[(String, Int)], so:
- You should use map (which transforms each record into a single new record) and not flatMap (which transforms each record into multiple records)
- You should map the result of the split into a tuple
Altogether:
val counts = text
  .map(line => line.split("\t"))
  .map(arr => (arr(0), arr(1).toInt))
EDIT, per clarification in the comments: if you're also interested in fixing the non-Spark part (which reads the file sequentially), there are some errors in the for-comprehension syntax. Here's the entire thing:
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  // read the file without Spark (not necessary when using Spark):
  val countsWithoutSpark: Iterator[(String, Int)] = for {
    line <- Source.fromFile(args(1)).getLines()
  } yield {
    val p = line.split("\t")
    (p(0), p(1).toInt)
  }

  // equivalent code using Spark:
  val sparkConf = new SparkConf().setAppName("ShortTwitterAnalysis").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val counts: RDD[(String, Int)] = sc.textFile(args(0))
    .map(line => line.split("\t"))
    .map(arr => (arr(0), arr(1).toInt))
}
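If the variable name counts hints at the actual goal, i.e. counting or aggregating per key, a reduceByKey could follow the mapping step (this is an assumption about intent, not something stated in the question):
// Hypothetical follow-up: sum the parsed values per key.
val totals: RDD[(String, Int)] = counts.reduceByKey(_ + _)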