Filtering a DataFrame using a HashMap - Scala

I have a HashMap in which I store these values:
Map(862304021470656 -> List(0.0, 0.0, 0.0, 0.0, 1.540980096E9, 74.365111, 22.302669, 0.0), 866561010400483 -> List(0.0, 1.0, 1.0, 2.0, 1.543622306E9, 78.0204, 10.005262, 56.0))
This is the DataFrame:
| id| lt| ln| evt| lstevt| s| d|agl|chg| d1| d2| d3| d4|ebt|ibt|port| a1| a2| a3| a4|nos|dfrmd|
+---------------+---------+---------+----------+----------+---+---+---+---+---+---+---+---+---+---+----+---+---+---+---+---+-----+
|862304021470656|25.284158|82.435973|1540980095|1540980095| 0| 39|298| 0| 0| 1| 1| 2| 0| 5| 97| 12| -1| -1| 22| 0| 0|
|862304021470656|25.284158|82.435973|1540980105|1540980105| 0| 0|298| 0| 0| 1| 1| 2| 0| 5| 97| 12| -1| -1| 22| 0| 0|
|862304021470656|25.284724|82.434222|1540980155|1540980155| 14| 47|289| 0| 0| 1| 1| 2| 0| 5| 97| 11| -1| -1| 22| 0| 0|
|866561010400483|25.284858|82.433831|1544980165|1540980165| 12| 42|295| 0| 0| 1| 1| 2| 0| 5| 97| 12| -1| -1| 22| 0| 0|
I want to filter the DataFrame by comparing the evt column against the element at index 4 of the list, keeping only the rows whose evt value is greater than that element. The key of the map is the id column of the DataFrame.

Here's one way using a UDF to look up, for each id, the value to compare evt against:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (862304021470656L, 25.284158, 82.435973, 1540980095),
  (862304021470656L, 25.284158, 82.435973, 1540980105),
  (862304021470656L, 25.284724, 82.434222, 1540980155),
  (866561010400483L, 25.284858, 82.433831, 1544980165)
).toDF("id", "lt", "ln", "evt")

val listMap = Map(
  862304021470656L -> List(0.0, 0.0, 0.0, 0.0, 1.540980096E9, 74.365111, 22.302669, 0.0),
  866561010400483L -> List(0.0, 1.0, 1.0, 2.0, 1.543622306E9, 78.0204, 10.005262, 56.0)
)

def evtLimit(m: Map[Long, List[Double]], evtIdx: Int) = udf(
  (id: Long) => m.get(id) match {
    case Some(ls) => if (evtIdx < ls.size) ls(evtIdx) else Double.MaxValue
    case None => Double.MaxValue
  }
)
df.where($"evt" > evtLimit(listMap, 4)($"id")).show
// +---------------+---------+---------+----------+
// | id| lt| ln| evt|
// +---------------+---------+---------+----------+
// |862304021470656|25.284158|82.435973|1540980105|
// |862304021470656|25.284724|82.434222|1540980155|
// |866561010400483|25.284858|82.433831|1544980165|
// +---------------+---------+---------+----------+
Note that the UDF returns Double.MaxValue when the id has no matching key in the provided Map or the list is too short, so such rows are filtered out. That can certainly be revised for a specific business requirement.
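If you are on Spark 2.4+, a UDF-free variant is also possible by embedding the thresholds as a literal map column and looking them up with element_at. This is only a sketch of the idea, reusing df and listMap from above:
import org.apache.spark.sql.functions.{typedLit, element_at, coalesce, lit}

// Keep only the value at index 4 of each list as the per-id threshold.
val thresholds = listMap.map { case (id, ls) => id -> ls(4) }

// element_at returns null for ids missing from the map; coalesce to
// Double.MaxValue so such rows are filtered out, matching the UDF above.
df.where($"evt" > coalesce(element_at(typedLit(thresholds), $"id"), lit(Double.MaxValue)))
  .show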

You can get this with a simple join:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = ... //your main DataFrame
val mapDF = Map(..your data here..).toSeq.toDF("id", "list")
val joined = df.join(mapDF, "id").filter(size($"list") >= 5 /* <-- just in case */)
val res = joined.filter($"evt" > $"list"(4))
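For a concrete run, here is the same idea spelled out with the sample data from the previous answer (a sketch; it assumes df and listMap are already defined there):
// Turn the Map into a two-column DataFrame (id, list) and join on id.
val listDF = listMap.toSeq.toDF("id", "list")

df.join(listDF, "id")
  .filter(size($"list") >= 5 && $"evt" > $"list"(4))
  .select(df.columns.map(col): _*)   // drop the helper list column again
  .show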

Collect statistics from the DataFrame

I'm collecting statistics on a DataFrame:
the maximum, minimum, and average value of each column; the number of zeros in each column; and the number of null values in each column.
Conditions:
Number of columns n < 2000
Number of dataframe entries r < 10^9
The stack() function is used for the solution:
https://www.hadoopinrealworld.com/understanding-stack-function-in-spark/
What worries me:
the number of rows in the intermediate dataframe resultDF: columnsNames.size * r, which is a lot.
Input:
val columnsNames = List("col_name1", "col_name2")
+---------+---------+-----------+
|col_name1|col_name2|period_date|
+---------+---------+-----------+
| 11| 21| 2022-01-31|
| 12| 22| 2022-01-31|
| 13| 23| 2022-03-31|
+---------+---------+-----------+
Output:
+-----------+---------+----------+----------+---------+---------+---------+
|period_date| columns|count_null|count_zero|avg_value|min_value|max_value|
+-----------+---------+----------+----------+---------+---------+---------+
| 2022-01-31|col_name2| 0| 0| 21.5| 21| 22|
| 2022-03-31|col_name1| 0| 0| 13.0| 13| 13|
| 2022-03-31|col_name2| 0| 0| 23.0| 23| 23|
| 2022-01-31|col_name1| 0| 0| 11.5| 11| 12|
+-----------+---------+----------+----------+---------+---------+---------+
My solution:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("spark test5").getOrCreate()
import spark.implicits._

case class RealStructure(col_name1: Int, col_name2: Int, period_date: String)

val userTableDf = List(
  RealStructure(11, 21, "2022-01-31"),
  RealStructure(12, 22, "2022-01-31"),
  RealStructure(13, 23, "2022-03-31")
).toDF()
//userTableDf.show()
//Start
new StatisticCollector(userTableDf)
class StatisticCollector(userTableDf: DataFrame) {

  val columnsNames = List("col_name1", "col_name2")
  val stack = s"stack(${columnsNames.length}, ${columnsNames.map(name => s"'$name', $name").mkString(",")})"

  val resultDF = userTableDf.select(
    col("period_date"),
    expr(s"$stack as (columns, values)")
  )

  //resultDF.show()
  println(stack)
/**
+-----------+---------+------+
|period_date| columns|values|
+-----------+---------+------+
| 2022-01-31|col_name1| 11|
| 2022-01-31|col_name2| 21|
| 2022-01-31|col_name1| 12|
| 2022-01-31|col_name2| 22|
| 2022-03-31|col_name1| 13|
| 2022-03-31|col_name2| 23|
+-----------+---------+------+
stack(2, 'col_name1', col_name1,'col_name2', col_name2)
**/
  val superResultDF = resultDF.groupBy(col("period_date"), col("columns")).agg(
    sum(when(col("values").isNull, 1).otherwise(0)).alias("count_null"),
    sum(when(col("values") === 0, 1).otherwise(0)).alias("count_zero"),
    avg("values").cast("double").alias("avg_value"),
    min(col("values")).alias("min_value"),
    max(col("values")).alias("max_value")
  )

  superResultDF.show()
}
Please give your assessment: if you see a way to solve this more efficiently, please describe how you would do it.
Calculation speed is important; it needs to be as fast as possible.
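One possible direction (a sketch only, not benchmarked, reusing userTableDf and columnsNames from the code above): compute every aggregate per period_date in a single pass, which produces one row per period_date, and only then reshape that small result with stack. This avoids materialising the columnsNames.size * r intermediate rows; aggExprs, wide, and statsDF below are illustrative names.
val aggExprs = columnsNames.flatMap { c =>
  Seq(
    sum(when(col(c).isNull, 1).otherwise(0)).alias(s"${c}_count_null"),
    sum(when(col(c) === 0, 1).otherwise(0)).alias(s"${c}_count_zero"),
    avg(col(c)).alias(s"${c}_avg_value"),
    min(col(c)).alias(s"${c}_min_value"),
    max(col(c)).alias(s"${c}_max_value")
  )
}

// One row per period_date, 5 * columnsNames.size aggregate columns.
val wide = userTableDf.groupBy(col("period_date")).agg(aggExprs.head, aggExprs.tail: _*)

// Reshape the small aggregated result back to the long format with stack.
val stackStats = s"stack(${columnsNames.length}, " +
  columnsNames.map { c =>
    s"'$c', ${c}_count_null, ${c}_count_zero, ${c}_avg_value, ${c}_min_value, ${c}_max_value"
  }.mkString(", ") +
  ") as (columns, count_null, count_zero, avg_value, min_value, max_value)"

val statsDF = wide.select(col("period_date"), expr(stackStats))
statsDF.show()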

Selecting specific rows from different dataframes within a map scope

Hello, I am new to Spark and Scala, and I have three similar dataframes like the following:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
|    Chad|      1|      0|      5|
|Paraguay|      4|      6|      3|
|  Russia|      0|      0|      1|
+--------+-------+-------+-------+
df2 and df3 have exactly the same structure, just with different values.
I would like to apply a function to each row of df1, but I also need to select the corresponding row (using Country as the key) from the other two dataframes, because those rows are input arguments for the function I want to apply.
I thought of using
df1.map{ r =>
  val selectedRowDf2 = selectRow using r at column "Country" ...
  val selectedRowDf3 = selectRow using r at column "Country" ...
  r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3))
}
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
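As a side note on that compile error: the message mentions Encoder[Unit], which means the map body was inferred to return Unit, and Spark provides no encoder for Unit. Returning a concrete, encodable value (with spark.implicits._ in scope) makes the map itself compile; a minimal sketch:
import spark.implicits._

// Returns a Dataset[String]; Encoder[String] comes from spark.implicits._
val countries = df1.map(r => r.getAs[String]("Country"))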
A possible approach could be to prefix each dataframe's columns with a key to uniquely identify them, and then merge all the dataframes into a single dataframe using the country column. The desired operation can then be performed on each row of the merged dataframe.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def appendColWithKey(df: DataFrame, key: String) = {
  var newdf = df
  df.schema.foreach(s => {
    newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
  })
  newdf
}

val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")

val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))

val finaldf = tempdf
  .drop("key2_country")
  .drop("key3_country")
  .withColumnRenamed("key1_country", "country")

finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
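Once merged, the per-row computation only needs this single dataframe. As an illustrative stand-in for the asker's functionToApply, here is a sketch that combines the corresponding 1/22/20 values from all three source frames per country:
// Purely illustrative: replace the sum with whatever per-row logic is needed.
val combined = finaldf.withColumn(
  "sum_1/22/20",
  col("key1_1/22/20") + col("key2_1/22/20") + col("key3_1/22/20")
)
combined.show()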

spark scala transform a dataframe/rdd

I have a CSV file like below.
PK,key,Value
100,col1,val11
100,col2,val12
100,idx,1
100,icol1,ival11
100,icol3,ival13
100,idx,2
100,icol1,ival21
100,icol2,ival22
101,col1,val21
101,col2,val22
101,idx,1
101,icol1,ival11
101,icol3,ival13
101,idx,3
101,icol1,ival31
101,icol2,ival32
I want to transform this into the following.
PK,idx,key,Value
100,,col1,val11
100,,col2,val12
100,1,idx,1
100,1,icol1,ival11
100,1,icol3,ival13
100,2,idx,2
100,2,icol1,ival21
100,2,icol2,ival22
101,,col1,val21
101,,col2,val22
101,1,idx,1
101,1,icol1,ival11
101,1,icol3,ival13
101,3,idx,3
101,3,icol1,ival31
101,3,icol2,ival32
Basically, I want to create a new column called idx in the output dataframe, populated for each row with the value "n" of the most recent preceding row that has key=idx, Value="n".
Here is one way using the last window function (Spark >= 2.0.0):
import org.apache.spark.sql.functions.{last, when, lit}
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val w = Window.partitionBy("PK").rowsBetween(Window.unboundedPreceding, 0)

df.withColumn("idx", when($"key" === lit("idx"), $"Value"))
  .withColumn("idx", last($"idx", true).over(w))
  .orderBy($"PK")
  .show
Output:
+---+-----+------+----+
| PK| key| Value| idx|
+---+-----+------+----+
|100| col1| val11|null|
|100| col2| val12|null|
|100| idx| 1| 1|
|100|icol1|ival11| 1|
|100|icol3|ival13| 1|
|100| idx| 2| 2|
|100|icol1|ival21| 2|
|100|icol2|ival22| 2|
|101| col1| val21|null|
|101| col2| val22|null|
|101| idx| 1| 1|
|101|icol1|ival11| 1|
|101|icol3|ival13| 1|
|101| idx| 3| 3|
|101|icol1|ival31| 3|
|101|icol2|ival32| 3|
+---+-----+------+----+
The code first creates a new column called idx which contains the value of Value when key == idx, and null otherwise. Then, within each PK partition, it takes the last non-null idx observed up to the current row (last with ignoreNulls = true over the defined window).
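For completeness, the df used above can be obtained by reading the CSV from the question; a minimal sketch (the path is hypothetical, and a SparkSession named spark is assumed):
val df = spark.read
  .option("header", "true")        // PK,key,Value header row
  .option("inferSchema", "true")
  .csv("/path/to/input.csv")       // hypothetical location of the CSV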

Rank per row over multiple columns in Spark Dataframe

I am using Spark with Scala to transform a DataFrame, and I would like to compute new variables that hold, for each row, the rank of each column's value among the values of several columns in that row.
Example -
Input DF-
+---+---+---+
|c_0|c_1|c_2|
+---+---+---+
| 11| 11| 35|
| 22| 12| 66|
| 44| 22| 12|
+---+---+---+
Expected DF-
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 11| 35| 2| 3| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
This has already been answered using R - Rank per row over multiple columns in R,
but I need to do the same in spark-sql using Scala. Thanks for the help!
Edit (4/1): Encountered a scenario where, if the values are the same, the ranks should still be different. Edited the first row to replicate the situation.
If I understand correctly, you want to have the rank of each column, within each row.
Let's first define the data, and the columns to "rank".
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((11, 21, 35), (22, 12, 66), (44, 22, 12))
  .toDF("c_0", "c_1", "c_2")
val cols = df.columns
Then we define a UDF that finds the index of an element in an array.
val pos = udf((a : Seq[Int], elt : Int) => a.indexOf(elt)+1)
We finally create a sorted array (in descending order) and use the UDF to find the rank of each column.
val ranks = cols.map(c => pos(col("array"), col(c)).as(c + "_rank"))

df.withColumn("array", sort_array(array(cols.map(col): _*), false))
  .select(cols.map(col) ++ ranks: _*)
  .show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 11| 21| 35| 3| 2| 1|
| 22| 12| 66| 2| 3| 1|
| 44| 22| 12| 1| 2| 3|
+---+---+---+--------+--------+--------+
EDIT:
As of Spark 2.4, the pos UDF that I defined can be replaced by the built-in function array_position(column: Column, value: Any), which works exactly the same way (the first index is 1). This avoids a UDF, which can be slightly less efficient.
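For reference, a sketch of that Spark 2.4+ variant, reusing df and cols from above:
import org.apache.spark.sql.functions.{array, array_position, col, sort_array}

val ranks24 = cols.map(c => array_position(col("array"), col(c)).as(c + "_rank"))

df.withColumn("array", sort_array(array(cols.map(col): _*), false))
  .select(cols.map(col) ++ ranks24: _*)
  .show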
EDIT2:
The code above will generate duplicated ranks in case you have duplicated values. If you want to avoid that, you can create the array, zip it to remember which column is which, sort it, and zip it again to get the final rank. It would look like this:
val colMap = df.columns.zipWithIndex.map(_.swap).toMap

val zip = udf((s: Seq[Int]) => s
  .zipWithIndex
  .sortBy(-_._1)
  .map(_._2)
  .zipWithIndex
  .toMap
  .mapValues(_ + 1))

val ranks = (0 until cols.size)
  .map(i => 'zip.getItem(i) as colMap(i) + "_rank")

val result = df
  .withColumn("zip", zip(array(cols.map(col): _*)))
  .select(cols.map(col) ++ ranks: _*)
One way to go about this would be to use windows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank}
import spark.implicits._

val df = Seq((11, 21, 35), (22, 12, 66), (44, 22, 12))
  .toDF("c_0", "c_1", "c_2")

(0 to 2)
  .map("c_" + _)
  .foldLeft(df)((d, column) =>
    d.withColumn(column + "_rank", rank() over Window.orderBy(desc(column))))
  .show
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 22| 12| 66| 2| 3| 1|
| 11| 21| 35| 3| 2| 2|
| 44| 22| 12| 1| 1| 3|
+---+---+---+--------+--------+--------+
But this is not a good idea. All the data will end up in one partition which will cause an OOM error if all the data does not fit inside one executor.
Another way would require sorting the dataframe three times, but at least it would scale to any amount of data.
Let's define a function that zips a dataframe with consecutive indices (it exists for RDDs but not for dataframes)
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

def zipWithIndex(df: DataFrame, name: String): DataFrame = {
  // Assign consecutive 1-based indices in the dataframe's current sort order.
  val rdd = df.rdd.zipWithIndex
    .map { case (row, i) => Row.fromSeq(row.toSeq :+ (i + 1)) }
  val newSchema = df.schema.add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
And let's use it on the same dataframe df:
(0 to 2)
  .map("c_" + _)
  .foldLeft(df)((d, column) =>
    zipWithIndex(d.orderBy(desc(column)), column + "_rank"))
  .show
which provides the exact same result as above.
You could also use window functions in plain SQL. Do note that this is susceptible to OOM if you have too much data, but I just wanted to introduce the concept of window functions here.
inputDF.createOrReplaceTempView("my_df")

val expectedDF = spark.sql("""
  select
    c_0
    , c_1
    , c_2
    , rank() over (order by c_0 desc) c_0_rank
    , rank() over (order by c_1 desc) c_1_rank
    , rank() over (order by c_2 desc) c_2_rank
  from my_df""")

expectedDF.show()
+---+---+---+--------+--------+--------+
|c_0|c_1|c_2|c_0_rank|c_1_rank|c_2_rank|
+---+---+---+--------+--------+--------+
| 44| 22| 12| 1| 1| 3|
| 11| 21| 35| 3| 2| 2|
| 22| 12| 66| 2| 3| 1|
+---+---+---+--------+--------+--------+

Understanding pivot and agg

I have the following columns in DataFrame df:
c_id p_id type values
278230 57371100 11 1
278230 57371100 12 1
...
I execute the following code and expect to see columns 11_total and 12_total:
df
  .groupBy($"c_id", $"p_id")
  .pivot("type")
  .agg(sum("values") as "total")
  .na.fill(0)
  .show()
Instead, I get columns 11 and 12:
+-----------+----------+---+---+
| c_id| p_id| 11| 12|
+-----------+----------+---+---+
| 278230| 57371100| 0| 1|
| 337790| 72031970| 3| 0|
| 320710| 71904400| 0| 1|
Why?
That's because Spark appends the aggregation alias to the pivoted column names only when there are multiple aggregations, for clarity:
val df = Seq(
  (278230, 57371100, 11, 1),
  (278230, 57371100, 12, 2),
  (337790, 72031970, 11, 1),
  (337790, 72031970, 11, 2),
  (337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")

df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total")).
  show
// +------+--------+---+---+
// | c_id| p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970| 3| 3|
// |278230|57371100| 1| 2|
// +------+--------+---+---+
df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total"), max("values").as("max")).
  show
// +------+--------+--------+------+--------+------+
// | c_id| p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970| 3| 2| 3| 3|
// |278230|57371100| 1| 1| 2| 2|
// +------+--------+--------+------+--------+------+
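If the _total suffix is wanted even with a single aggregation, one option is to rename the pivoted columns afterwards; a minimal sketch reusing the df defined above (keys and renamed are illustrative names):
val pivoted = df.groupBy($"c_id", $"p_id").pivot("type").
  agg(sum("values").as("total"))

// Rename every pivoted column, leaving the grouping keys untouched.
val keys = Seq("c_id", "p_id")
val renamed = pivoted.columns.foldLeft(pivoted) { (acc, c) =>
  if (keys.contains(c)) acc else acc.withColumnRenamed(c, s"${c}_total")
}

renamed.na.fill(0).show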