How to deal with contexts in Spark/Scala when using map() - scala

I'm not very familiar with Scala or Spark, and I'm trying to build a basic test to understand how DataFrames actually work. My goal is to update myDF based on the values of some records in another table.
On the one hand, I have my App:
object TestApp {
def main(args: Array[String]) {
val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(conf)
implicit val hiveContext : SQLContext = new HiveContext(sc)
val test: Test = new Test()
test.test
}
}
On the other hand, I have my Test class :
class Test(implicit sqlContext: SQLContext) extends Serializable {
val hiveContext: SQLContext = sqlContext
import hiveContext.implicits._
def test(): Unit = {
val myDF = hiveContext.read.table("myDB.Customers").sort($"cod_a", $"start_date".desc)
myDF.map(myMap).take(1)
}
def myMap(row: Row): Row = {
def _myMap: (String, String) = {
val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment")
var target: (String, String) = casoX(investmentDF, row.getAs[String]("cod_a"), row.getAs[String]("cod_p"))
target
}
def casoX(df: DataFrame, codA: String, codP: String)(implicit hiveContext: SQLContext): (String, String) = {
var rows: Array[Row] = null
if (codP != null) {
println(df)
rows = df.filter($"cod_a" === codA && $"cod_p" === codP).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
} else {
rows = df.filter($"cod_a" === codA).orderBy($"sales".desc).select($"cod_t", $"nom_t").collect
}
if (rows.length > 0) (row(0).asInstanceOf[String], row(1).asInstanceOf[String]) else null
}
val target: (String, String) = _myMap
Row(row(0), row(1), row(2), row(3), row(4), row(5), row(6), target._1, target._2, row(9))
}
}
When I run it, I get a NullPointerException at the statement val investmentDF: DataFrame = hiveContext.read.table("myDB.Investment"), more precisely at hiveContext.read.
If I inspect hiveContext inside the test function, I can access its SparkContext and load my DataFrame without any problem.
However, if I inspect hiveContext just before the NullPointerException, its sparkContext is null. I suppose this is because SparkContext is not Serializable, and since I am inside a map function I'm losing part of my hiveContext object. Am I right?
What exactly is wrong with my code, and how should I change it to get investmentDF without a NullPointerException?
Thanks!
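The null sparkContext described above is consistent with how Java serialization treats transient fields: they are skipped when the object is written and come back as null when it is read. As far as I know, SQLContext marks its sparkContext as @transient, so a copy that reaches an executor inside a map() closure has no SparkContext to work with. Below is a minimal sketch of that effect in plain Scala, with no Spark involved; FakeEngine, FakeContext and roundTrip are hypothetical stand-ins, not Spark classes.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a driver-only resource such as SparkContext (not Serializable).
class FakeEngine(val name: String)

// Hypothetical stand-in for a context object: Serializable, but its engine field is transient.
class FakeContext(@transient val engine: FakeEngine, val label: String) extends Serializable

object TransientDemo {
  // Serialize and deserialize a value, roughly what happens to state captured by a closure
  // when it is shipped from the driver to an executor.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val onDriver = new FakeContext(new FakeEngine("local"), "ctx")
    val onExecutor = roundTrip(onDriver)
    println(onDriver.engine != null)   // the original still has its engine
    println(onExecutor.engine == null) // the transient field was not serialized
    println(onExecutor.label)          // ordinary fields survive the round trip
  }
}
```

If this is indeed the cause, the usual way out is to avoid using the context or any DataFrame inside map() at all, for example by expressing the lookup as a join between the Customers and Investment DataFrames on the driver side.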

Related

Performance issue with UDF: is there a better way to do the transformation? Database write is getting stuck

Input table:
userid | data
123234 | {"type":1,"actionData":{"actionType":"Category","id":"1233232","title":"BOSS","config":{"type":"X"}}}
I need an output table like this:
userid | action
123234 | {"type":"Category","template":{"data":"{\"title\":\"BOSS\"}"},"additionalInfo":{"type":1,"config":{"type":"X"}}}
I'm using Scala Spark. With the UDF, the job gets stuck during the database write.
I run it with:
bin/spark-shell --master local[*] --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0 --driver-memory 200g
I need a better way to solve this.
object testDataMigration extends Serializable {
def main(cassandra: String): Unit = {
implicit val spark: SparkSession =
SparkSession
.builder()
.appName("UserLookupMigration")
.config("spark.master", "local[*]")
.config("spark.cassandra.connection.host",cassandra)
.config("spark.cassandra.output.batch.size.rows", "10000")
.config("spark.cassandra.read.timeoutMS","60000")
.getOrCreate()
val res = time(migrateData());
Console.println("Time taken to execute script", res._1);
spark.stop();
}
def migrateData()(implicit spark: SparkSession): Unit = {
val file = new File("validation_count.txt" )
val print_Writer = new PrintWriter(file)
//Reading data from user_feed table
val userFeedData = spark.read.format("org.apache.spark.sql.cassandra")
.option("keyspace", "sunbird").option("table", "TABLE1").load();
print_Writer.write("User Feed Table records:"+ userFeedData.count() );
//Persisting user feed data into memory
userFeedData.persist()
val userFeedWithNewUUID = userFeedData
.withColumn("newId",expr("uuid()"))
.withColumn("action", myColUpdate(userFeedData("data"),
userFeedData("createdby"), userFeedData("category")))
userFeedWithNewUUID.persist()
val userFeedV2Format = userFeedWithNewUUID.select(
col("newId"),col("category"),col("createdby"),
col("createdon"),col("action"),col("expireon"),
col("priority"),col("status"),col("updatedby"),
col("updatedon"),col("userid"))
.withColumnRenamed("newId","id")
.withColumn("version",lit("v2").cast(StringType))
//Persist v2 format data to memory
userFeedV2Format.persist()
print_Writer.write("User Feed V2 Format records:"+ userFeedV2Format.count() );
userFeedV2Format.write.format("org.apache.spark.sql.cassandra")
.option("keyspace", "sunbird_notifications")
.option("table", "TABLE2")
.mode(SaveMode.Append).save();
//Remove from memory
userFeedV2Format.unpersist()
userFeedData.unpersist()
print_Writer.close()
}
def myColUpdate= udf((data: String, createdby: String, category: String)=> {
val jsonMap = parse(data).values.asInstanceOf[Map[String, Object]]
val actionDataMap = new HashMap[String, Object]
val additionalInfo = new HashMap[String,Object]
val dataTemplate = new HashMap[String,String]
val templateMap = new HashMap[String,Object]
val createdByMap = new HashMap[String,Object]
createdByMap("type")="System"
createdByMap("id")=createdby
var actionType: String = null
for((k,v)<-jsonMap){
if(k == "actionData"){
val actionMap = v.asInstanceOf[Map[String,Object]]
if(actionMap.contains("actionType")){
actionType = actionMap("actionType").asInstanceOf[String]
}
for((k1,v1)<-actionMap){
if(k1 == "title" || k1 == "description"){
dataTemplate(k1)=v1.asInstanceOf[String]
}else{
additionalInfo(k1)=v1
}
}
}else{
additionalInfo(k)=v
}
}
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
templateMap("data")=mapper.writeValueAsString(dataTemplate)
templateMap("ver")="4.4.0"
templateMap("type")="JSON"
actionDataMap("type")=actionType
actionDataMap("category")=category.asInstanceOf[String]
actionDataMap("createdBy")=createdByMap;
actionDataMap("template") =templateMap;
actionDataMap("additionalInfo")=additionalInfo
mapper.writeValueAsString(actionDataMap)
})
}
The job gets stuck. TABLE1 has 40 million rows.

How to persist a list built dynamically from a DataFrame in Scala Spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basically calling this function twice to get the list for different purposes. Is there a way to keep the list in memory, so that it is generated only once and the same function does not have to be called again and again?
Try something like the code below. You can also check the performance using the time function.
The code is explained in the inline comments.
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}
object HandleCachedDF {
var cachedAnimalDF : rdd.RDD[String] = _
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
val df1 = time[rdd.RDD[String]] {
getAnimalName(df)
}
val resultList = df1.collect().toList
val df2 = time{
getAnimalName(df)
}
val resultList1 = df2.collect().toList
println(resultList.equals(resultList1))
}
def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
if (cachedAnimalDF == null) { // Check if this the first initialization of your dataframe
cachedAnimalDF = dataFrame.select("animal").
filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().cache() // Cache your dataframe
}
cachedAnimalDF // Return your cached dataframe
}
def time[R](block: => R): R = { // Measure the time taken by the block to execute
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
}
You would have to persist or cache the RDD at this point:
val animals = dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().persist()
and then call the function as follows:
def getAnimalName(animals: RDD[String]): List[String] = {
animals.collect.toList
}
as many times as you need, without repeating the whole computation.
I hope it helps.
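Since the goal in the question is to keep the final list itself, another option is to compute it once on the driver and reuse the stored value. Here is a minimal sketch in plain Scala, assuming the list fits in driver memory; CachedList is a hypothetical helper, and its compute argument stands in for the select / filter / distinct / collect job from the question.

```scala
// Hypothetical holder: `compute` stands in for the expensive Spark job
// (select / filter / distinct / collect) from the question.
class CachedList[A](compute: () => List[A]) {
  // A lazy val runs `compute` on first access only and keeps the result afterwards.
  lazy val values: List[A] = compute()
}

object CachedListDemo {
  def main(args: Array[String]): Unit = {
    var runs = 0
    val animals = new CachedList(() => { runs += 1; List("cat", "dog", "cat").distinct })
    println(animals.values) // first access triggers the computation
    println(animals.values) // second access reuses the stored list
    println(runs)           // number of times the computation actually ran
  }
}
```

In the Spark version, the function passed in would wrap the original getAnimalName(dataFrame) call, and every later use would read values instead of calling getAnimalName again.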

type mismatch; found : Unit required: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]

Why does the following code give a compilation error at the return statement?
def getData(queries: Array[String]): Dataset[Row] = {
val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props).registerTempTable("")
return res
}
Error,
type mismatch; found : Unit required: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
Scala version 2.11.11
Spark version 2.0.0
EDIT:
Actual case
def getDataFrames(queries: Array[String]) = {
val jdbcResult = queries.map(query => {
val tablename = extractTableName(query)
if (tablename.contains("1")) {
spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl1, query, props)
} else {
spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl2, query, props)
}
})
}
Here I want to return the combined output of the iteration as an Array[Dataset[Row]] or Array[DataFrame] (but DataFrame is not available in 2.0.0 as a dependency). Does the code above do that? If not, how can I do it?
You can return a list of DataFrames as List[DataFrame]:
def getData(queries: Array[String]): List[DataFrame] = {
val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props)
//create multiple dataframe from your queries
val df1 = ???
val df2 = ???
val list = List(df1, df2)
//You can create the list dynamically from the list of queries
list
}
registerTempTable returns Unit, so remove the registerTempTable call and return the DataFrame (or a list of DataFrames) instead.
UPDATE:
Here is how you can return an array of DataFrames from a list of queries:
def getDataFrames(queries: Array[String]): Array[DataFrame] = {
val jdbcResult = queries.map(query => {
val tablename = extractTableName(query)
val dataframe = if (tablename.contains("1")) {
spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl1, query, props)
} else {
spark.sqlContext.read.format("jdbc").jdbc(jdbcUrl2, query, props)
}
dataframe
})
jdbcResult
}
I hope this helps!
It's clear from the error message that there is a type mismatch in your function.
registerTempTable() creates an in-memory table scoped to the current session, which stays accessible while the SparkSession is active.
registerTempTable() returns Unit, so res is Unit rather than a Dataset[Row].
Change your code to the following to remove the error message:
def getData(queries: Array[String]): Unit = {
val res = spark.read.format("jdbc").jdbc(jdbcUrl, "", props).registerTempTable("")
}
An even better way would be to write the code as follows:
val tempName: String = "Name_Of_Temp_View"
spark.read.format("jdbc").jdbc(jdbcUrl, "", props).createOrReplaceTempView(tempName)
Use createOrReplaceTempView(), as registerTempTable() is deprecated since Spark 2.0.0.
An alternative solution for your requirement:
def getData(queries: Array[String], spark: SparkSession): Array[DataFrame] = {
spark.read.format("jdbc").jdbc(jdbcUrl, "", props).createOrReplaceTempView("Name_Of_Temp_Table")
val result: Array[DataFrame] = queries.map(query => spark.sql(query))
result
}
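For the "Actual case" in the question there is a second source of Unit besides registerTempTable: a method body whose last statement is a val definition evaluates to Unit, so the mapped array never leaves the method. The sketch below shows the difference in plain Scala; the method names are made up for illustration and string lengths stand in for the DataFrames.

```scala
object BlockValueDemo {
  // The body ends with a `val` definition, so the inferred result type is Unit
  // (the same shape as getDataFrames in the question).
  def lengthsAsUnit(queries: Array[String]) = {
    val result = queries.map(_.length)
  }

  // The body ends with the expression itself, so the mapped array is returned.
  def lengths(queries: Array[String]): Array[Int] = {
    queries.map(_.length)
  }

  def main(args: Array[String]): Unit = {
    val queries = Array("select 1", "select 22")
    val nothing: Unit = lengthsAsUnit(queries) // compiles only because the result type is Unit
    println(nothing)
    println(lengths(queries).mkString(","))
  }
}
```

Declaring the return type explicitly, as lengths does, makes the compiler report this kind of mistake at the method definition instead of at the call site.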

How can I dynamically invoke the same Scala function in a cascading manner, with the output of each call passed as input to the next call

I am new to Spark and Scala and I am stuck on the following requirement. I would be really thankful for any help.
We have to apply different rules to different columns of a given table. The list of column names and rules is passed as an argument to the program.
The result of the first rule should be the input to the next rule.
Question: how can I execute the exec() function in a cascading manner, dynamically filling in the arguments for as many rules as are specified in the arguments?
I have written the following code.
object Rules {
def main(args: Array[String]) = {
if (args.length != 3) {
println("Need exactly 3 arguments in format : <sourceTableName> <destTableName> <[<colName>=<Rule> <colName>=<Rule>,...")
println("E.g : INPUT_TABLE OUTPUT_TABLE [NAME=RULE1,ID=RULE2,TRAIT=RULE3]");
System.exit(-1)
}
val conf = new SparkConf().setAppName("My-Rules").setMaster("local");
val sc = new SparkContext(conf);
val srcTableName = args(0).trim();
val destTableName = args(1).trim();
val ruleArguments = StringUtils.substringBetween(args(2).trim(), "[", "]");
val businessRuleMappings = ruleArguments.split(",").map(_.split("=")).map(arr => arr(0) -> arr(1)).toMap;
val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc) ;
val hiveContext : HiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfSourceTbl = hiveContext.table("TEST.INPUT_TABLE");
def exec(dfSource: DataFrame,columnName :String ,funName: String): DataFrame = {
funName match {
case "RULE1" => TransformDF(columnName,dfSource,RULE1);
case "RULE2" => TransformDF(columnName,dfSource,RULE2);
case "RULE3" => TransformDF(columnName,dfSource,RULE3);
case _ =>dfSource;
}
}
def TransformDF(x:String, df:DataFrame, f:(String,DataFrame)=>DataFrame) : DataFrame = {
f(x,df);
}
def RULE1(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE2(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE3(column : String,sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
// How can I call this exec() function with cascading output and arguments for a variable number of rules?
val finalResultDF = exec(exec(exec(dfSourceTbl,"NAME","RULE1"),"ID","RULE2"),"TRAIT","RULE3");
finalResultDF.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("DB.destTableName")
}
}
I would write all the rules as functions transforming one dataframe to another:
val rules: Seq[(DataFrame) => DataFrame] = Seq(
RULE1("NAME",_:DataFrame),
RULE2("ID",_:DataFrame),
RULE3("TRAIT",_:DataFrame)
)
Now you can apply them using a fold:
val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)
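The same fold also works when the rule list comes from the program argument rather than being hard-coded. The following is a minimal sketch in plain Scala, assuming the "[COL=RULE,...]" argument format from the question. Table is a stand-in for DataFrame, and the rule bodies (upper, trim) and the parse and applyAll helpers are made up for illustration.

```scala
object RuleChainDemo {
  // Stand-in for a DataFrame: each row maps a column name to its value.
  type Table = Vector[Map[String, String]]
  // Same shape as RULE1..RULE3 in the question: (column, source) => result.
  type Rule = (String, Table) => Table

  // Hypothetical rule bodies; rows without the column are left unchanged.
  val upper: Rule = (col, t) => t.map(r => r.get(col).fold(r)(v => r.updated(col, v.toUpperCase)))
  val trim: Rule  = (col, t) => t.map(r => r.get(col).fold(r)(v => r.updated(col, v.trim)))

  val rulesByName: Map[String, Rule] = Map("RULE1" -> upper, "RULE2" -> trim)

  // Parse "[NAME=RULE1,ID=RULE2]" into (column, ruleName) pairs, keeping their order.
  def parse(arg: String): Seq[(String, String)] =
    arg.trim.stripPrefix("[").stripSuffix("]").split(",").toSeq.map { pair =>
      val parts = pair.trim.split("=")
      (parts(0), parts(1))
    }

  // Thread the table through every rule in order; unknown rule names leave it unchanged,
  // like the `case _` branch of exec() in the question.
  def applyAll(source: Table, mappings: Seq[(String, String)]): Table =
    mappings.foldLeft(source) { case (t, (col, ruleName)) =>
      rulesByName.get(ruleName).map(rule => rule(col, t)).getOrElse(t)
    }

  def main(args: Array[String]): Unit = {
    val source: Table = Vector(Map("NAME" -> " ann ", "ID" -> " 7 "))
    println(applyAll(source, parse("[NAME=RULE1,ID=RULE2]")))
  }
}
```

The mappings are kept as a Seq rather than converted with toMap, because a Map does not guarantee that the rules are applied in the order they were given.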

I want to store each RDD in a database in Twitter streaming using Apache Spark, but I get a "Task not serializable" error in Scala

I wrote code in which Twitter streaming produces an RDD of the Tweet class and stores each RDD in a database, but it fails with a "Task not serializable" error. The code is below.
sparkstreaming.scala
case class Tweet(id: Long, source: String, content: String, retweet: Boolean, authName: String, username: String, url: String, authId: Long, language: String)
trait SparkStreaming extends Connector {
def startStream(appName: String, master: String): StreamingContext = {
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = new DBCrud(db, "table1")
val sparkConf: SparkConf = new SparkConf().setAppName(appName).setMaster(master).set(" spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// .set("spark.kryo.registrator", "HelloKryoRegistrator")
// sparkConf.registerKryoClasses(Array(classOf[DBCrud]))
val sc: SparkContext = new SparkContext(sparkConf)
val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
ssc
}
}
object SparkStreaming extends SparkStreaming
I use this streaming context in a Play controller to store tweets in the database, but it throws an exception. I'm using MongoDB to store them.
def streamstart = Action {
val stream = SparkStreaming
val a = stream.startStream("ss", "local[2]")
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = DBCrud
val twitterauth = new TwitterClient().tweetCredantials()
val tweetDstream = TwitterUtils.createStream(a, Option(twitterauth.getAuthorization))
val tweets = tweetDstream.filter { x => x.getUser.getLang == "en" }.map { x => Tweet(x.getId, x.getSource, x.getText, x.isRetweet(), x.getUser.getName, x.getUser.getScreenName, x.getUser.getURL, x.getUser.getId, x.getUser.getLang) }
// tweets.foreachRDD { x => x.foreach { x => dbcrud.insert(x) } }
tweets.saveAsTextFiles("/home/knoldus/sentiment project/spark services/tweets/tweets")
// val s=new BirdTweet()
// s.hastag(a.sparkContext)
a.start()
Ok("start streaming")
}
When I put everything in a single streaming class that takes the tweets and uses foreachRDD to store each tweet, it works, but if I use it from outside it doesn't work.
Please help me.
Try creating the MongoDB connection inside the foreachRDD block, as mentioned in the Spark documentation:
tweets.foreachRDD { x =>
x.foreach { x =>
val db = connector("localhost", "rmongo", "rmongo", "pass")
val dbcrud = new DBCrud(db, "table1")
dbcrud.insert(x)
}
}
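The reason this helps is that anything referenced from inside the foreach closure has to be serialized and shipped to the executors, and a database handle created on the driver usually cannot be. Here is a minimal sketch of the difference in plain Scala, without Spark or MongoDB; FakeDb, CapturingTask, LocalTask and canSerialize are hypothetical names used only for illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a database handle such as DBCrud: deliberately not Serializable.
class FakeDb(val host: String) {
  def insert(record: String): String = s"$host <- $record"
}

// Holds a driver-side handle as a field, so serializing the task must serialize the handle too.
class CapturingTask(db: FakeDb) extends Serializable {
  def run(record: String): String = db.insert(record)
}

// Carries only the connection settings and opens the handle where the work runs.
class LocalTask(host: String) extends Serializable {
  def run(record: String): String = new FakeDb(host).insert(record)
}

object ClosureDemo {
  // True if Java serialization accepts the object, false if it hits a non-serializable field.
  def canSerialize(task: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(task)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(canSerialize(new CapturingTask(new FakeDb("localhost")))) // handle captured from outside
    println(canSerialize(new LocalTask("localhost")))                 // handle created inside run
  }
}
```

Opening a connection for every record, as in the answer above, can be expensive. The Spark Streaming guide suggests creating it once per partition with rdd.foreachPartition and reusing it for all records in that partition.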