Writing udf function in scala and using them in pyspark job - scala

We are trying to write a scala udf function and call it from a map function in pyspark. The dateframe schema is quite complex the columns we want to pass to this function are array of StructType.
trip_force_speeds = trip_details.groupby("vehicle_id","driver_id", "StartDtLocal", "EndDtLocal")\
.agg(collect_list(struct(col("event_start_dt_local"),
col("force"),
col("speed"),
col("sec_from_start"),
col("sec_from_end"),
col("StartDtLocal"),
col("EndDtLocal"),
col("verisk_vehicle_id"),
col("trip_duration_sec")))\
.alias("trip_details"))
In our map function we need to do some computation.
def calculateVariables(rec: Row):HashMap[String,Float] = {
val trips = rec.getAs[List]("trips")
val base_variables = new HashMap[String, Float]()
val entropy_variables = new HashMap[String, Float]()
val week_day_list = List("monday", "tuesday", "wednesday", "thursday", "friday")
for (trip <- trips)
{
if (trip("start_dt_local") >= trip("StartDtLocal") && trip("start_dt_local") <= trip("EndDtLocal"))
{
base_variables("trip_summary_count") += 1
if (trip("duration_sec").toFloat >= 300 && trip("duration_sec").toFloat <= 1800) {
base_variables ("bounded_trip") += 1
base_variables("bounded_trip_duration") = trip("duration_sec") + base_variables("bounded_trip_duration")
base_variables("total_bin_1") += 30
base_variables("total_bin_2") += 30
base_variables("total_bin_3") += 60
base_variables("total_bin_5") += 60
base_variables("total_bin_6") += 30
base_variables("total_bin_7") += 30
}
if (trip("duration_sec") > 120 && trip("duration_sec") < 21600 )
{
base_variables("trip_count") += 1
}
base_variables("trip_distance") += trip("distance_km")
base_variables("trip_duration") = trip("duration_sec") + base_variables("trip_duration")
base_variables("speed_event_distance") = trip("speed_event_distance_km") + base_variables("speed_event_distance")
base_variables("speed_event_duration") = trip("speed_event_duration_sec") + base_variables("speed_event_duration")
base_variables("speed_event_distance_ratio") = trip("speed_distance_ratio") + base_variables("speed_event_distance_ratio")
base_variables("speed_event_duration_ratio") = trip("speed_duration_ratio") + base_variables("speed_event_duration_ratio")
}
}
return base_variables
}
when we tried to compile the scala code we got thie error
i tried using Row but got this error
"error: kinds of the type arguments (List) do not conform to the expected kinds of the type parameters (type T). List's type parameters do not match type T's expected parameters: type List has one type parameter, but type T has none – "
in my case the trip is a list of rows. this is the schema
StructType(List(StructField(verisk_vehicle_id,StringType,true),StructField(verisk_driver_id,StringType,false),StructField(StartDtLocal,TimestampType,true),StructField(EndDtLocal,TimestampType,true),StructField(trips,ArrayType(StructType(List(StructField(week_start_dt_local,TimestampType,true),StructField(week_end_dt_local,TimestampType,true),StructField(start_dt_local,TimestampType,true),StructField(end_dt_local,TimestampType,true),StructField(StartDtLocal,TimestampType,true),StructField(EndDtLocal,TimestampType,true),StructField(verisk_vehicle_id,StringType,true),StructField(duration_sec,FloatType,true),StructField(distance_km,FloatType,true),StructField(speed_distance_ratio,FloatType,true),StructField(speed_duration_ratio,FloatType,true),StructField(speed_event_distance_km,FloatType,true),StructField(speed_event_duration_sec,FloatType,true))),true),true),StructField(trip_details,ArrayType(StructType(List(StructField(event_start_dt_local,TimestampType,true),StructField(force,FloatType,true),StructField(speed,FloatType,true),StructField(sec_from_start,FloatType,true),StructField(sec_from_end,FloatType,true),StructField(StartDtLocal,TimestampType,true),StructField(EndDtLocal,TimestampType,true),StructField(verisk_vehicle_id,StringType,true),StructField(trip_duration_sec,FloatType,true))),true),true)))
is there something wrong in the way we defined the function signature we tried overriding the spark structtype but that didn't work for me.
i am from a python background and am facing some performance issues in python job, that's why i decided to write this map function in Scala.

You must work with the Row type instead of the StructType in your udf. StructType represents the schema itself not the data. A little example in Scala that you can use:
object test{
import org.apache.spark.sql.functions.{udf, collect_list, struct}
val hash = HashMap[String, Float]("start_dt_local" -> 0)
// This simple type to store you results
val sampleDataset = Seq(Row(Instant.now().toEpochMilli, Instant.now().toEpochMilli))
implicit val spark: SparkSession =
SparkSession
.builder()
.appName("Test")
.master("local[*]")
.getOrCreate()
def calculateVariablesUdf = udf { trip: Row =>
if(trip.getAs[Long]("start_dt_local") >= trip.getAs[Long]("StartDtLocal")) {
// crate a new instance with your results
hash("start_dt_local") + 1
} else {
hash("start_dt_local") + 0
}
}
def main(args: Array[String]) : Unit = {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val rdd = spark.sparkContext.parallelize(sampleDataset)
val df = spark.createDataFrame(rdd, StructType(List(StructField("start_dt_local", LongType, false), StructField("StartDtLocal", LongType, false))))
df.agg(collect_list(calculateVariablesUdf(struct(col("start_dt_local"), col("StartDtLocal")))).as("result")).show(false)
}
}
Edit. For a better understanding:
You are wrong when you consider a schema description: StructType(List(StructField)) as the type of your field. There is not List type in your DataFrame.
If you treat your calculateVariables as an udf you don´t need the for loop. I mean:
def calculateVariables = udf { trip: Row =>
trip("start_dt_local").getAs[Long]
// your logic ....
}
As I put in the example you can return your updated Hash directly in the udf

Related

How to persist the list which we made dynamically from dataFrame in scala spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basicaly Calling this function 2 times For getting the list for different purposes . I just want to know is there a way to retain the list in memory and we dont have to call the same function again and again to generate the list and only have to generate the list only one time in scala spark.
Try something as below and you can also check the performance using time func.
Also find the code explanation inline
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}
object HandleCachedDF {
var cachedAnimalDF : rdd.RDD[String] = _
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
val df1 = time[rdd.RDD[String]] {
getAnimalName(df)
}
val resultList = df1.collect().toList
val df2 = time{
getAnimalName(df)
}
val resultList1 = df2.collect().toList
println(resultList.equals(resultList1))
}
def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
if (cachedAnimalDF == null) { // Check if this the first initialization of your dataframe
cachedAnimalDF = dataFrame.select("animal").
filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().cache() // Cache your dataframe
}
cachedAnimalDF // Return your cached dataframe
}
def time[R](block: => R): R = { // COmpute the time taken by function to execute
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
}
You would have to persist or cache at this point
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().persist
and then call the function as follow
def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.collect.toList
}
as many times as you need it without repeat the process.
I hope it helps.

How can I dynamically invoke the same scala function in cascading manner with output of previous call goes as input to the next call

I am new to Spark-Scala and trying following thing but I am stuck up and not getting on how to achieve this requirement. I shall be really thankful if someone can really help in this regards.
We have to invoke different rules on different columns of given table. The list of column names and rules is being passed as argument to the program
The resultant of first rule should go as input to the next rule input.
question : How can I execute exec() function in cascading manner with dynamically filling the arguments for as many rules as specified in arguments.
I have developed a code as follows.
object Rules {
def main(args: Array[String]) = {
if (args.length != 3) {
println("Need exactly 3 arguments in format : <sourceTableName> <destTableName> <[<colName>=<Rule> <colName>=<Rule>,...")
println("E.g : INPUT_TABLE OUTPUT_TABLE [NAME=RULE1,ID=RULE2,TRAIT=RULE3]");
System.exit(-1)
}
val conf = new SparkConf().setAppName("My-Rules").setMaster("local");
val sc = new SparkContext(conf);
val srcTableName = args(0).trim();
val destTableName = args(1).trim();
val ruleArguments = StringUtils.substringBetween(args(2).trim(), "[", "]");
val businessRuleMappings = ruleArguments.split(",").map(_.split("=")).map(arr => arr(0) -> arr(1)).toMap;
val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc) ;
val hiveContext : HiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfSourceTbl = hiveContext.table("TEST.INPUT_TABLE");
def exec(dfSource: DataFrame,columnName :String ,funName: String): DataFrame = {
funName match {
case "RULE1" => TransformDF(columnName,dfSource,RULE1);
case "RULE2" => TransformDF(columnName,dfSource,RULE2);
case "RULE3" => TransformDF(columnName,dfSource,RULE3);
case _ =>dfSource;
}
}
def TransformDF(x:String, df:DataFrame, f:(String,DataFrame)=>DataFrame) : DataFrame = {
f(x,df);
}
def RULE1(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE2(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE3(column : String,sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
// How can I call this exec() function with output casacing and arguments for variable number of rules.
val finalResultDF = exec(exec(exec(dfSourceTbl,"NAME","RULE1"),"ID","RULE2"),"TRAIT","RULE3);
finalResultDF.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("DB.destTableName")
}
}
I would write all the rules as functions transforming one dataframe to another:
val rules: Seq[(DataFrame) => DataFrame] = Seq(
RULE1("NAME",_:DataFrame),
RULE2("ID",_:DataFrame),
RULE3("TRAIT",_:DataFrame)
)
Not you can apply them using folding
val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)

Scala/Spark: Converting zero inflated data in dataframe to libsvm

I am very new to scala (typically I do this in R)
I have imported a large dataframe (2000+ columns, 100000+ rows) that is zero-inflated.
Task
To convert the data to libsvm format
Steps
As I understand the steps are as follows
Ensure feature columns are set to DoubleType and Target is an Int
Iterate through each row, retaining each value >0 in one array and index of its column in another array
Convert to RDD[LabeledPoint]
Save RDD in libsvm format
I am stuck on 3 (but maybe) because I am doing step 2 wrong.
Here is my code:
Main Function:
#Test
def testSpark(): Unit =
{
try
{
var mDF: DataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("src/test/resources/knimeMergedTRimmedVariables.csv")
val mDFTyped = castAllTypedColumnsTo(mDF, IntegerType, DoubleType)
val indexer = new StringIndexer()
.setInputCol("Majors_Final")
.setOutputCol("Majors_Final_Indexed")
val mDFTypedIndexed = indexer.fit(mDFTyped).transform(mDFTyped)
val mDFFinal = castColumnTo(mDFTypedIndexed,"Majors_Final_Indexed", IntegerType)
//only doubles accepted by sparse vector, so that's what we filter for
val fieldSeq: scala.collection.Seq[StructField] = schema.fields.toSeq.filter(f => f.dataType == DoubleType)
val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)
val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
assertTrue(true)
}
catch
{
case ex: Exception =>
{
println(s"There has been an Exception. Message is ${ex.getMessage} and ${ex}")
fail()
}
}
}
Convert each row to LabeledPoint:
#throws(classOf[Exception])
private def convertRowToLabeledPoint(rowIn: Row, fieldNameSeq: Seq[String], label:Int): LabeledPoint =
{
try
{
val values: Map[String, Double] = rowIn.getValuesMap(fieldNameSeq)
val sortedValuesMap = ListMap(values.toSeq.sortBy(_._1): _*)
val rowValuesItr: Iterable[Double] = sortedValuesMap.values
var positionsArray: ArrayBuffer[Int] = ArrayBuffer[Int]()
var valuesArray: ArrayBuffer[Double] = ArrayBuffer[Double]()
var currentPosition: Int = 0
rowValuesItr.foreach
{
kv =>
if (kv > 0)
{
valuesArray += kv;
positionsArray += currentPosition;
}
currentPosition = currentPosition + 1;
}
val lp:LabeledPoint = new LabeledPoint(label, org.apache.spark.mllib.linalg.Vectors.sparse(positionsArray.size,positionsArray.toArray, valuesArray.toArray))
return lp
}
catch
{
case ex: Exception =>
{
throw new Exception(ex)
}
}
}
Problem
So then I try to create a dataframe of labeledpoints which can easily be converted to an RDD.
val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
But I get the following error:
SparkTest.scala:285: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for seri
alizing other types will be added in future releases.
[INFO] val labeled:DataFrame = mDFFinal.map(row => convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"))).toDF()
OK, so I skipped the DataFrame and created an Array of LabeledPoints whish is easily converted to an RDD. The rest is easy.
I stress, that while this works, I am new to scala and there may be more efficient ways to do this.
Main Function is now as follows:
val mDF: DataFrame = spark.read.option("header", "true").option("inferSchema", "true").csv("src/test/resources/knimeMergedTRimmedVariables.csv")
val mDFTyped = castAllTypedColumnsTo(mDF, IntegerType, DoubleType)
val indexer = new StringIndexer()
.setInputCol("Majors_Final")
.setOutputCol("Majors_Final_Indexed")
val mDFTypedIndexed = indexer.fit(mDFTyped).transform(mDFTyped)
val mDFFinal = castColumnTo(mDFTypedIndexed,"Majors_Final_Indexed", IntegerType)
mDFFinal.show()
//only doubles accepted by sparse vector, so that's what we filter for
val fieldSeq: scala.collection.Seq[StructField] = mDFFinal.schema.fields.toSeq.filter(f => f.dataType == DoubleType)
val fieldNameSeq: Seq[String] = fieldSeq.map(f => f.name)
var positionsArray: ArrayBuffer[LabeledPoint] = ArrayBuffer[LabeledPoint]()
mDFFinal.collect().foreach
{
row => positionsArray+=convertRowToLabeledPoint(row,fieldNameSeq,row.getAs("Majors_Final_Indexed"));
}
val mRdd:RDD[LabeledPoint]= spark.sparkContext.parallelize(positionsArray.toSeq)
MLUtils.saveAsLibSVMFile(mRdd, "./output/libsvm")

Scala : map Dataset[Row] to Dataset[Row]

I am trying to use scala to transform a dataset with array to a dataset with label and vectors, before putting it into some machine learning algo.
So far, I succeeded to add a double label, but i block on the vectors part. Below, the code to create the vectors :
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DataTypes, StructField}
import org.apache.spark.sql.{Dataset, Row, _}
import spark.implicits._
def toVectors(withLabelDs: Dataset[Row]) = {
val allLabel = withLabelDs.count()
var countLabel = 0
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
println("schema line {}", line.schema)
//StructType(
// StructField(label,DoubleType,false),
// StructField(code,ArrayType(IntegerType,true),true),
// StructField(score,ArrayType(IntegerType,true),true))
val label = line.getDouble(0)
val indicesList = line.getList(1)
val indicesSize = indicesList.size
val indices = new Array[Int](indicesSize)
val valuesList = line.getList(2)
val values = new Array[Double](indicesSize)
var i = 0
while ( {
i < indicesSize
}) {
indices(i) = indicesList.get(i).asInstanceOf[Int] - 1
values(i) = valuesList.get(i).asInstanceOf[Int].toDouble
i += 1
}
var r: Row = null
try {
r = Row(label, Vectors.sparse(195, indices, values))
countLabel += 1
}
catch {
case e: IllegalArgumentException =>
println("something went wrong with label {} / indices {} / values {}", label, indices, values)
println("", e)
}
println("Still {} labels to process", allLabel - countLabel)
r
})
newDataset
}
With this code, I got this error :
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
So naturally, I changed my code
def toVectors(withLabelDs: Dataset[Row]) = {
...
}, Encoders.bean(Row.getClass))
newDataset
}
But I got this error :
error: overloaded method value map with alternatives:
[U](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,U],
encoder: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
<and>
[U](func: org.apache.spark.sql.Row => U)
(implicit evidence$6: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.Row, org.apache.spark.sql.Encoder[?0])
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
How can I make this work ? Aka, having a dataset[Row] returned with Vectors ?
Two things:
.map is of type (T => U)(implicit Encoder[U]) => Dataset[U] but looks like you are calling it like it is (T => U, implicit Encoder[U]) => Dataset[U] which are slightly different. Instead of .map(f, encoder), try .map(f)(encoder).
Also, I doubt Encoders.bean(Row.getClass) will work since Row is not a bean. Some quick googling turned up RowEncoder which looks like it should work but I couldn't find much documentation about it.
The error message is unfortunately quite poor. import spark.implicits._ is only correct in the spark-shell. What it actually means is to import <Spark Session object>.implicits._, spark just happens to be the variable name used for the SparkSession object in the spark-shell.
You can access the SparkSession from a Dataset
At the top of your method you can add the import
def toVectors(withLabelDs: Dataset[Row]) = {
val sparkSession = withLabelIDs.sparkSession
import sparkSession.implicits._
//rest of method code

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.