How to call FileSystem from udf - scala

What I expected
The goal is to add a column with modification time to each DataFrame row.
Given
val data = spark.read.parquet("path").withColumn("input_file_name", input_file_name())
+----+------------------------+
| id | input_file_name |
+----+------------------------+
| 1 | hdfs://path/part-00001 |
| 2 | hdfs://path/part-00001 |
| 3 | hdfs://path/part-00002 |
+----+------------------------+
Expected
+----+------------------------+
| id | modification_time |
+----+------------------------+
| 1 | 2000-01-01Z00:00+00:00 |
| 2 | 2000-01-01Z00:00+00:00 |
| 3 | 2000-01-02Z00:00+00:00 |
+----+------------------------+
What I tried
I wrote a function to get the modification time
import org.apache.hadoop.fs.FileSystem

def getModificationTime(path: String): Long = {
  FileSystem.get(spark.sparkContext.hadoopConfiguration)
    .getFileStatus(new org.apache.hadoop.fs.Path(path))
    .getModificationTime()
}
val modificationTime = getModificationTime("hdfs://srsdev/projects/khajiit/data/OfdCheques2/date=2020.02.01/part-00002-04b9e4c8-5916-4bb2-b9ff-757f843a0142.c000.snappy.parquet")
modificationTime: Long = 1580708401253
... but it does not work in a query:
def input_file_modification_time = udf((path: String) => getModificationTime(path))
data.select(input_file_modification_time($"input_file_name") as "modification_time").show(20, false)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 54.0 failed 4 times, most recent failure: Lost task 0.3 in stage 54.0 (TID 408, srs-hdp-s1.dev.kontur.ru, executor 3): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$input_file_modification_time$1: (string) => bigint)

The problem is that spark is null inside the UDF, because the SparkSession only exists on the driver. Another problem is that Hadoop's Configuration is not serializable, so you cannot easily enclose it in the UDF. But there is a workaround using org.apache.spark.SerializableWritable:
import org.apache.spark.SerializableWritable
import org.apache.hadoop.conf.Configuration
val conf = new SerializableWritable(spark.sparkContext.hadoopConfiguration)
def getModificationTime(path: String, conf: SerializableWritable[Configuration]): Long = {
  org.apache.hadoop.fs.FileSystem.get(conf.value)
    .getFileStatus(new org.apache.hadoop.fs.Path(path))
    .getModificationTime()
}

def input_file_modification_time(conf: SerializableWritable[Configuration]) = udf((path: String) => getModificationTime(path, conf))
data.select(input_file_modification_time(conf)($"input_file_name") as "modification_time").show(20, false)
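One possible refinement, offered here as an untested sketch rather than part of the answer above: since the UDF hits the NameNode for every row, you can memoize lookups inside the closure so each distinct file is resolved at most once per task copy of the UDF (input_file_modification_time_cached is a name introduced here for illustration):

import scala.collection.concurrent.TrieMap

def input_file_modification_time_cached(conf: SerializableWritable[Configuration]) = {
  // each task deserializes its own copy of this cache, so the NameNode is queried
  // once per distinct file per task instead of once per row
  val cache = TrieMap.empty[String, Long]
  udf((path: String) => cache.getOrElseUpdate(path, getModificationTime(path, conf)))
}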

Note: invoking getModificationTime for each row of your DataFrame will have a performance impact.
I modified your code to fetch the file metadata once and store it in files: Map[String,Long], and created a UDF input_file_modification_time that looks values up in that map.
Please check the code below.
scala> val df = spark.read.format("parquet").load("/tmp/par")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> :paste
// Entering paste mode (ctrl-D to finish)
def getModificationTime(path: String): Long = {
FileSystem.get(spark.sparkContext.hadoopConfiguration)
.getFileStatus(new org.apache.hadoop.fs.Path(path))
.getModificationTime()
}
// Exiting paste mode, now interpreting.
getModificationTime: (path: String)Long
scala> implicit val files = df.inputFiles.flatMap(name => Map(name -> getModificationTime(name))).toMap
files: scala.collection.immutable.Map[String,Long] = Map(file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000_2.snappy.parquet -> 1588080295000, file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000_3.snappy.parquet -> 1588080299000, file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000_4.snappy.parquet -> 1588080302000, file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000.snappy.parquet -> 1588071322000)
scala> :paste
// Entering paste mode (ctrl-D to finish)
def getTime(fileName:String)(implicit files: Map[String,Long]): Long = {
files.getOrElse(fileName,0L)
}
// Exiting paste mode, now interpreting.
getTime: (fileName: String)(implicit files: Map[String,Long])Long
scala> val input_file_modification_time = udf(getTime _)
input_file_modification_time: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> df.withColumn("createdDate",input_file_modification_time(input_file_name)).show
+---+-------------+
| id| createdDate|
+---+-------------+
| 1|1588080295000|
| 2|1588080295000|
| 3|1588080295000|
| 4|1588080295000|
| 5|1588080295000|
| 6|1588080295000|
| 7|1588080295000|
| 8|1588080295000|
| 9|1588080295000|
| 10|1588080295000|
| 11|1588080295000|
| 12|1588080295000|
| 13|1588080295000|
| 14|1588080295000|
| 15|1588080295000|
| 16|1588080295000|
| 17|1588080295000|
| 18|1588080295000|
| 19|1588080295000|
| 20|1588080295000|
+---+-------------+
only showing top 20 rows
scala>
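Since the original question expects a timestamp column rather than epoch milliseconds, a small follow-up sketch (not part of the transcript above): Spark's cast to timestamp interprets numeric values as seconds, so divide by 1000 first.

import org.apache.spark.sql.functions.input_file_name

df.withColumn("modification_time",
    (input_file_modification_time(input_file_name()) / 1000).cast("timestamp"))
  .show(false)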

Related

Failed to execute UDF

There is a table in which field "A" contains a SQL query. I need to add an additional field "B" containing the time spent executing the query from field "A". I wrote a UDF and everything works well, but when caching the resulting table or trying to write the final dataframe to a physical table, I get the error:
"Failed to execute user defined function ($anonfun$1: (string) =>
string)"
What could be the problem?
Example:
val set_time = udf((query: String) => {
  val start = new Timestamp(new Date().getDate)
  val count = spark.sql(s"${query}").count
  val time_query = (new Timestamp(new Date().getTime)).getTime() - start.getTime()
  time_query.toString
})
Source table "source":
+--------------------+
| A |
+--------------------+
|"Select * From ..." |
|"Select * From ..." |
|"Select * From ..." |
|"Select * From ..." |
|"Select * From ..." |
+--------------------+
val result = spark.sql("from source").
withColumn("B", set_time(col("A")))
result.show
+--------------------+------+
| A | B |
+--------------------+------+
|"Select * From ..." | 356 |
|"Select * From ..." | 642 |
|"Select * From ..." | 2745 |
|"Select * From ..." | 1324 |
|"Select * From ..." | 635 |
+--------------------+------+
But:
//ERROR
result.write.mode("overwrite").saveAsTable("dbName.result")
//ERROR
val result_cache = result.persist
result_cache.show
The issue here is that the UDF runs on executors, where the Spark session isn't available, so I guess you get a NullPointerException on the "val count = spark.sql..." line.
You should do this on the driver, using a plain function (Function1) rather than a UDF. I also use collect(), assuming the main table isn't big and will fit into driver memory:
Example:
import java.sql.Timestamp
import java.util.Date

val set_time = (query: String) => {
  val start = new Timestamp(new Date().getTime)
  val count = spark.sql(s"${query}").count
  val time_query = (new Timestamp(new Date().getTime)).getTime() - start.getTime()
  time_query.toString
}
val result = spark.sql("select 'select 1' as A union all select 'select 2' as A")
val s = result.collect().map(x =>(x(0).asInstanceOf[String],set_time(x(0).asInstanceOf[String]))).toList.toDF("A","B")
s.show
s.cache().show
+--------+---+
| A| B|
+--------+---+
|select 1|171|
|select 2|135|
+--------+---+
PS: also val start = new Timestamp(new Date().getDate) in your example should be val start = new Timestamp(new Date().getTime)

how to parallelize this in spark using spark dataset api

I am using spark-sql-2.4.1v with Java 8.
I have data columns like below
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979),
("Indus_2","Indus_2_Name","Country1", "State2",21789933),
("Indus_3","Indus_3_Name","Country1", "State3",21789978),
("Indus_4","Indus_4_Name","Country2", "State1",41789978),
("Indus_5","Indus_5_Name","Country3", "State3",27789978),
("Indus_6","Indus_6_Name","Country1", "State1",27899790),
("Indus_7","Indus_7_Name","Country3", "State1",27899790),
("Indus_8","Indus_8_Name","Country1", "State2",27899790),
("Indus_9","Indus_9_Name","Country4", "State1",27899790)
).toDF("industry_id","industry_name","country","state","revenue");
Given the input lists below:
val countryList = Seq("Country1","Country2");
val stateMap = Map("Country1" -> {"State1","State2"}, "Country2" -> {"State2","State3"});
In the Spark job, for each country and each state, I need to calculate the total revenue of a few industries.
In other languages we would do this with for loops, i.e.:
for (country <- countryList) {
  for (state <- stateMap.get(country)) {
    // do some calculation for each state's industries
  }
}
In Spark, as I understand it, we should not do it like this, because not all executors are utilized.
So what is the correct way to handle this?
I have added a few extra rows to your sample data to make the aggregations distinguishable. I used a Scala parallel collection: for each country it looks up its states, uses those values to filter the given dataframe, performs the aggregation, and at the end unions all the results back together.
scala> val df = Seq(
| ("Indus_1","Indus_1_Name","Country1", "State1",12789979),
| ("Indus_2","Indus_2_Name","Country1", "State2",21789933),
| ("Indus_2","Indus_2_Name","Country1", "State2",31789933),
| ("Indus_3","Indus_3_Name","Country1", "State3",21789978),
| ("Indus_4","Indus_4_Name","Country2", "State1",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State2",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State2",81789978),
| ("Indus_4","Indus_4_Name","Country2", "State3",41789978),
| ("Indus_4","Indus_4_Name","Country2", "State3",51789978),
| ("Indus_5","Indus_5_Name","Country3", "State3",27789978),
| ("Indus_6","Indus_6_Name","Country1", "State1",27899790),
| ("Indus_7","Indus_7_Name","Country3", "State1",27899790),
| ("Indus_8","Indus_8_Name","Country1", "State2",27899790),
| ("Indus_9","Indus_9_Name","Country4", "State1",27899790)
| ).toDF("industry_id","industry_name","country","state","revenue")
df: org.apache.spark.sql.DataFrame = [industry_id: string, industry_name: string ... 3 more fields]
scala> val countryList = Seq("Country1","Country2","Country4","Country5");
countryList: Seq[String] = List(Country1, Country2, Country4, Country5)
scala> val stateMap = Map("Country1" -> ("State1","State2"), "Country2" -> ("State2","State3"),"Country3" -> ("State31","State32"));
stateMap: scala.collection.immutable.Map[String,(String, String)] = Map(Country1 -> (State1,State2), Country2 -> (State2,State3), Country3 -> (State31,State32))
scala>
scala> :paste
// Entering paste mode (ctrl-D to finish)
countryList
.par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map{data =>
df.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
}.reduce(_ union _).show(false)
// Exiting paste mode, now interpreting.
+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790 |
|Country1|State1|Indus_6_Name |27899790 |
|Country1|State2|Indus_2_Name |53579866 |
|Country1|State1|Indus_1_Name |12789979 |
|Country2|State3|Indus_4_Name |93579956 |
|Country2|State2|Indus_4_Name |123579956 |
+--------+------+-------------+-------------+
scala>
Edit 1: Moved the aggregation code into a separate function.
scala> def processDF(data:(String,(String,String)),adf:DataFrame) = adf.filter($"country" === data._1 && ($"state" === data._2._1 || $"state" === data._2._2)).groupBy("country","state","industry_name").agg(sum("revenue").as("total_revenue"))
processDF: (data: (String, (String, String)), adf: org.apache.spark.sql.DataFrame)org.apache.spark.sql.DataFrame
scala> :paste
// Entering paste mode (ctrl-D to finish)
countryList.
par
.filter(cn => stateMap.exists(_._1 == cn))
.map(country => (country,stateMap(country)))
.map(data => processDF(data,df))
.reduce(_ union _)
.show(false)
// Exiting paste mode, now interpreting.
+--------+------+-------------+-------------+
|country |state |industry_name|total_revenue|
+--------+------+-------------+-------------+
|Country1|State2|Indus_8_Name |27899790 |
|Country1|State1|Indus_6_Name |27899790 |
|Country1|State2|Indus_2_Name |53579866 |
|Country1|State1|Indus_1_Name |12789979 |
|Country2|State3|Indus_4_Name |93579956 |
|Country2|State2|Indus_4_Name |123579956 |
+--------+------+-------------+-------------+
scala>
It really depends on what you want to do. If you don't need to share state between the (country, state) pairs, you should build a dataset in which each row is a (country, state) pair; then you can control how many rows are processed in parallel (via the number of partitions and cores).
You can use flatMapValues to create the key-value pairs and then do your calculations in a .map step.
scala> val data = Seq(("country1",Seq("state1","state2","state3")),("country2",Seq("state1","state2")))
scala> val rdd = sc.parallelize(data)
scala> val rdd2 = rdd.flatMapValues(s=>s)
scala> rdd2.foreach(println(_))
(country1,state1)
(country2,state1)
(country1,state2)
(country2,state2)
(country1,state3)
Here you can perform your operations; as an example, I've appended # to each state:
scala> rdd2.map(s=>(s._1,s._2+"#")).foreach(println(_))
(country1,state1#)
(country1,state2#)
(country1,state3#)
(country2,state1#)
(country2,state2#)
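Building on that idea, here is a hedged sketch of the aggregation step itself, assuming df is the industries DataFrame from the question and that the (country, state) pairs use the same spelling as the data (pairsDF is a name introduced here for illustration). Turning the pairs into a DataFrame and joining lets Spark parallelize the groupBy itself instead of looping on the driver:

import org.apache.spark.sql.functions.sum
import spark.implicits._   // already in scope in spark-shell

val pairsDF = Seq(
  ("Country1", "State1"), ("Country1", "State2"),
  ("Country2", "State2"), ("Country2", "State3")
).toDF("country", "state")

df.join(pairsDF, Seq("country", "state"))           // keep only the requested pairs
  .groupBy("country", "state", "industry_name")
  .agg(sum("revenue").as("total_revenue"))
  .show(false)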

How to apply customizable Aggregator on Spark Dataset?

I have a Spark Dataset of student records with the following schema:
id | name | subject | score
1 | Tom | Math | 99
1 | Tom | Math | 88
1 | Tom | Physics | 77
2 | Amy | Math | 66
My goal is to transform this dataset into another one that shows, for each student, a list of the highest score per subject:
id | name | subject_score_list
1 | Tom | [(Math, 99), (Physics, 77)]
2 | Amy | [(Math, 66)]
I've decided to use an Aggregator to do the transformation after mapping this dataset into ((id, name), (subject, score)) key-value pairs.
For the buffer I tried to use a mutable Map[String, Integer] so I can update the score if the subject already exists and the new score is higher. Here's what the aggregator looks like:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
type StudentSubjectPair = ((String, String), (String, Integer))
type SubjectMap = collection.mutable.Map[String, Integer]
type SubjectList = List[(String, Integer)]
val StudentSubjectAggregator = new Aggregator[StudentSubjectPair, SubjectMap, SubjectList] {
  def zero: SubjectMap = collection.mutable.Map[String, Integer]()

  def reduce(buf: SubjectMap, input: StudentSubjectPair): SubjectMap = {
    if (buf.contains(input._2._1))
      buf.map{ case (input._2._1, score) => input._2._1 -> math.max(score, input._2._2) }
    else
      buf(input._2._1) = input._2._2
    buf
  }

  def merge(b1: SubjectMap, b2: SubjectMap): SubjectMap = {
    for ((subject, score) <- b2) {
      if (b1.contains(subject))
        b1(subject) = math.max(score, b2(subject))
      else
        b1(subject) = score
    }
    b1
  }

  def finish(buf: SubjectMap): SubjectList = buf.toList

  override def bufferEncoder: Encoder[SubjectMap] = ExpressionEncoder[SubjectMap]
  override def outputEncoder: Encoder[SubjectList] = ExpressionEncoder[SubjectList]
}.toColumn.name("subject_score_list")
I use an Aggregator because I find it customizable: if I want the mean score per subject instead, I can simply change the reduce and merge functions.
I'm expecting two answers for this post:
1. Is using an Aggregator a good way to get this job done? Is there any other simple way to get the same output?
2. What is the correct encoder for collection.mutable.Map[String, Integer] and List[(String, Integer)]? I always get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 231, localhost, executor driver):
java.lang.ClassCastException: scala.collection.immutable.HashMap$HashTrieMap cannot be cast to scala.collection.mutable.Map
at $anon$1.merge(<console>:54)
I appreciate any input and help, thanks!
I think you can achieve your desired result with the DataFrame API.
val df= Seq((1 ,"Tom" ,"Math",99),
(1 ,"Tom" ,"Math" ,88),
(1 ,"Tom" ,"Physics" ,77),
(2 ,"Amy" ,"Math" ,66)).toDF("id", "name", "subject","score")
Group by id, name, and subject for the max score, followed by a groupBy on id, name with a collect_list over a map of subject, score:
df.groupBy("id", "name", "subject").agg(max("score").as("score"))
  .groupBy("id", "name")
  .agg(collect_list(map($"subject", $"score")).as("subject_score_list"))
+---+----+--------------------+
| id|name| subject_score_list|
+---+----+--------------------+
| 1| Tom|[[Physics -> 77],...|
| 2| Amy| [[Math -> 66]]|
+---+----+--------------------+
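On the second part of the question (the encoder error): the ClassCastException happens because ExpressionEncoder deserializes map buffers as scala.collection.immutable maps, which cannot be cast back to collection.mutable.Map in merge. A minimal sketch of a rework, assuming an immutable Map[String, Int] buffer is acceptable (Int swapped in for Integer to keep the encoders simple; this is my adaptation, not code from the answer above):

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

type StudentSubjectPair = ((String, String), (String, Int))
type SubjectMap = Map[String, Int]          // immutable buffer
type SubjectList = List[(String, Int)]

val StudentSubjectAggregator = new Aggregator[StudentSubjectPair, SubjectMap, SubjectList] {
  def zero: SubjectMap = Map.empty[String, Int]

  // keep the higher score per subject
  def reduce(buf: SubjectMap, input: StudentSubjectPair): SubjectMap = {
    val (subject, score) = input._2
    buf + (subject -> math.max(score, buf.getOrElse(subject, score)))
  }

  def merge(b1: SubjectMap, b2: SubjectMap): SubjectMap =
    b2.foldLeft(b1) { case (acc, (subject, score)) =>
      acc + (subject -> math.max(score, acc.getOrElse(subject, score)))
    }

  def finish(buf: SubjectMap): SubjectList = buf.toList

  override def bufferEncoder: Encoder[SubjectMap] = ExpressionEncoder[SubjectMap]
  override def outputEncoder: Encoder[SubjectList] = ExpressionEncoder[SubjectList]
}.toColumn.name("subject_score_list")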

Spark creating a new column based on a mapped value of an existing column

I am trying to map the values of one column in my dataframe to a new value and put it into a new column using a UDF, but I am unable to get the UDF to accept a parameter that isn't also a column. For example, I have a dataframe dfOriginal like this:
+-----------+-----+
|high_scores|count|
+-----------+-----+
| 9| 1|
| 21| 2|
| 23| 3|
| 7| 6|
+-----------+-----+
And I'm trying to get a sense of the bin the numeric value falls into, so I may construct a list of bins like this:
case class Bin(binMax: BigDecimal, binWidth: BigDecimal) {
  val binMin = binMax - binWidth

  // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
  def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax

  def rangeAsString(): String = {
    val sb = new StringBuilder()
    sb.append(trimDecimal(binMin)).append(" - ").append(trimDecimal(binMax))
    sb.toString()
  }
}
And then I want to transform my old dataframe like this to make dfBin:
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
| 9| 1| 0 - 10 |
| 21| 2| 20 - 30 |
| 23| 3| 20 - 30 |
| 7| 6| 0 - 10 |
+-----------+-----+---------+
So that I can ultimately get a count of the instances of the bins by calling .groupBy("bin_range").count().
I am trying to generate dfBin by using the withColumn function with a UDF.
Here's the code for the UDF I am attempting to use:
val convertValueToBinRangeUDF = udf((value: String, binList: List[Bin]) => {
  val number = BigDecimal(value)
  val bin = binList.find(bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
  bin.rangeAsString()
})
val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
val dfBin = dfOriginal.withColumn("bin_range", convertValueToBinRangeUDF(col("high_scores"), binList))
But it's giving me a type mismatch:
Error:type mismatch;
found : List[Bin]
required: org.apache.spark.sql.Column
val valueCountsWithBin = valuesCounts.withColumn(binRangeCol, convertValueToBinRangeUDF(col(columnName), binList))
Seeing the definition of a UDF makes me think it should handle the conversion fine, but it clearly doesn't. Any ideas?
The problem is that parameters to a UDF must all be of column type. One solution would be to convert binList into a column and pass it to the UDF, similar to the current code.
However, it is simpler to adjust the UDF slightly by turning it into a def. This way you can easily pass in other non-column data:
def convertValueToBinRangeUDF(binList: List[Bin]) = udf((value: String) => {
  val number = BigDecimal(value)
  val bin = binList.find(bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
  bin.rangeAsString()
})
Usage:
val dfBin = valuesCounts.withColumn("bin_range", convertValueToBinRangeUDF(binList)($"columnName"))
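A note on this design: binList is captured in the UDF's closure and serialized to the executors with each task, so it needs to be serializable (case classes are) and should stay reasonably small.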
Try this -
scala> case class Bin(binMax:BigDecimal, binWidth:BigDecimal) {
| val binMin = binMax - binWidth
|
| // only one of the two evaluations can include an "or=", otherwise a value could fit in 2 bins
| def fitsInBin(value: BigDecimal): Boolean = value > binMin && value <= binMax
|
| def rangeAsString(): String = {
| val sb = new StringBuilder()
| sb.append(binMin).append(" - ").append(binMax)
| sb.toString()
| }
| }
defined class Bin
scala> val binList = List(Bin(10, 10), Bin(20, 10), Bin(30, 10), Bin(40, 10), Bin(50, 10))
binList: List[Bin] = List(Bin(10,10), Bin(20,10), Bin(30,10), Bin(40,10), Bin(50,10))
scala> spark.udf.register("convertValueToBinRangeUDF", (value: String) => {
| val number = BigDecimal(value)
| val bin = binList.find( bin => bin.fitsInBin(number)).getOrElse(Bin(BigDecimal(0), BigDecimal(0)))
| bin.rangeAsString()
| })
res13: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
//-- Testing with one record
scala> val dfOriginal = spark.sql(s""" select "9" as `high_scores`, "1" as count """)
dfOriginal: org.apache.spark.sql.DataFrame = [high_scores: string, count: string]
scala> dfOriginal.createOrReplaceTempView("dfOriginal")
scala> val dfBin = spark.sql(s""" select high_scores, count, convertValueToBinRangeUDF(high_scores) as bin_range from dfOriginal """)
dfBin: org.apache.spark.sql.DataFrame = [high_scores: string, count: string ... 1 more field]
scala> dfBin.show(false)
+-----------+-----+---------+
|high_scores|count|bin_range|
+-----------+-----+---------+
|9 |1 |0 - 10 |
+-----------+-----+---------+
Hope this will help.

Spark sorting of delimited data

I am new to Spark. Can you give me any idea what the problem is with the code below?
val rawData="""USA | E001 | ABC DE | 19850607 | IT | $100
UK | E005 | CHAN CL | 19870512 | OP | $200
USA | E003 | XYZ AB | 19890101 | IT | $250
USA | E002 | XYZ AB | 19890705 | IT | $200"""
val sc = ...
val data= rawData.split("\n")
val rdd= sc.parallelize(data)
val data1=rdd.flatMap(line=> line.split(" | "))
val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
data2.saveAsTextFile("./sample_data1_output")
Here, .sortByKey(false) is not working and the compiler gives me this error:
[error] /home/admin/scala/spark-poc/src/main/scala/SparkApp.scala:26: value sortByKey is not a member of org.apache.spark.rdd.RDD[(String, String)]
[error] val data2 = data1.map(arr => (arr(2), arr.mkString(""))).sortByKey(false)
The question is: how do I get a MappedRDD? Or on what object should I call sortByKey()?
Spark provides additional operations, like sortByKey(), on RDDs of pairs. These operations are available through wrapper classes such as PairRDDFunctions and OrderedRDDFunctions (which provides sortByKey), and Spark uses implicit conversions to perform the RDD -> wrapper conversion automatically.
To import the implicit conversions, add the following lines to the top of your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
This is discussed in the Spark programming guide's section on Working with key-value pairs.
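As a hedged sketch of the corrected snippet (assuming the rdd from the question; note that String.split takes a regex, so " | " actually splits on single spaces and the pipe should be escaped, and map rather than flatMap keeps each record's fields together, which appears to be the intent):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings the RDD -> pair-RDD wrappers into scope

val data1 = rdd.map(line => line.split("\\|").map(_.trim))   // one Array[String] per record
val data2 = data1.map(arr => (arr(2), arr.mkString(" "))).sortByKey(ascending = false)
data2.saveAsTextFile("./sample_data1_output")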