I have the below dataframe:
val df1 = Seq(
  ("1_2_3", "5_10"),
  ("4_5_6", "15_20")
).toDF("c1", "c2")
+-----+-----+
| c1| c2|
+-----+-----+
|1_2_3| 5_10|
|4_5_6|15_20|
+-----+-----+
How can I get the sum in a separate column, based on these conditions:
- omit the third '_'-separated value in the first column, i.e. drop '_3' from 1_2_3 and '_6' from 4_5_6;
- add the remaining values position by position, i.e. 1+5 and 2+10 for the first row, and 4+15 and 5+20 for the second.
Expected output -
+-----+-----+-----+
| c1| c2| res|
+-----+-----+-----+
|1_2_3| 5_10| 6_12|
|4_5_6|15_20|19_25|
+-----+-----+-----+
Try this-
zip_with + split
val df1 = Seq(
  ("1_2_3", "5_10"),
  ("4_5_6", "15_20")
).toDF("c1", "c2")
df1.show(false)
df1.withColumn("res",
expr("concat_ws('_', zip_with(split(c1, '_'), split(c2, '_'), (x, y) -> cast(x+y as int)))"))
.show(false)
/**
* +-----+-----+-----+
* |c1 |c2 |res |
* +-----+-----+-----+
* |1_2_3|5_10 |6_12 |
* |4_5_6|15_20|19_25|
* +-----+-----+-----+
*/
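A note on why the third value '_3' is silently dropped (an observation of mine, not part of the original answer): zip_with pads the shorter array with null, '3' + null evaluates to null, and concat_ws skips nulls when joining. A minimal sketch with literals, assuming Spark 2.4+ where zip_with is available:
// zip_with pairs '3' with null, the sum becomes null, and concat_ws drops it
spark.sql("select concat_ws('_', zip_with(split('1_2_3', '_'), split('5_10', '_'), (x, y) -> cast(x + y as int))) as res").show(false)
/**
 * +----+
 * |res |
 * +----+
 * |6_12|
 * +----+
 */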
Update: handle 50 columns dynamically
val end = 51 // 50 cols
val df = spark.sql("select '1_2_3' as c1")
val new_df = Range(2, end).foldLeft(df){(df, i) => df.withColumn(s"c$i", $"c1")}
new_df.show(false)
/**
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
* |c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |c10 |c11 |c12 |c13 |c14 |c15 |c16 |c17 |c18 |c19 |c20 |c21 |c22 |c23 |c24 |c25 |c26 |c27 |c28 |c29 |c30 |c31 |c32 |c33 |c34 |c35 |c36 |c37 |c38 |c39 |c40 |c41 |c42 |c43 |c44 |c45 |c46 |c47 |c48 |c49 |c50 |
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
* |1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
*/
val res = new_df.withColumn("res", $"c1")
Range(2, end).foldLeft(res){ (df4, i) =>
  df4.withColumn("res",
    expr(s"concat_ws('_', zip_with(split(res, '_'), split(c$i, '_'), (x, y) -> cast(x+y as int)))"))
}
.show(false)
/**
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
* |c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |c10 |c11 |c12 |c13 |c14 |c15 |c16 |c17 |c18 |c19 |c20 |c21 |c22 |c23 |c24 |c25 |c26 |c27 |c28 |c29 |c30 |c31 |c32 |c33 |c34 |c35 |c36 |c37 |c38 |c39 |c40 |c41 |c42 |c43 |c44 |c45 |c46 |c47 |c48 |c49 |c50 |res |
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
* |1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|1_2_3|50_100_150|
* +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----------+
*/
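A hedged variant (my own sketch, not from the answer above): instead of rewriting res once per column with withColumn, which grows the logical plan by one projection per column, the whole sum can be built as a single nested zip_with expression and added in one step, assuming the same new_df as above.
// build one nested zip_with expression over all 50 columns, then format it once
val arraySum = new_df.columns
  .map(c => s"split($c, '_')")
  .reduce((a, b) => s"zip_with($a, $b, (x, y) -> x + y)")
new_df.withColumn("res", expr(s"concat_ws('_', cast($arraySum as array<int>))"))
  .show(false)
// res is 50_100_150 for every row, the same as the foldLeft version above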
Related
I have a dataframe that gets generated automatically, and the names and number of columns will never be known in advance. I would like to know how I can count the occurrences of each of the values in each of the columns.
For example,
Col1 Col2 Col3
Row1 True False False
Row2 True True True
Row3 False False True
Row4 False False False
The result should be something like:
Col1 Count Col2 Count Col3 Count
True 2 True 1 True 2
False 2 False 3 False 2
I have tried applying GroupBy kind of like this:
df.groupBy(record => (record.Col1, record.Col2, record.Col3)).count().show
But this wouldn't work for me since I wouldn't know the number or names of the columns.
Try this-
Load the test data provided
val data =
"""
|Col1 Col2 Col3
|True False False
|True True True
|False False True
|False False False
""".stripMargin
val stringDS2 = data.split(System.lineSeparator())
.map(_.split("\\s+").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df2 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS2)
df2.show(false)
df2.printSchema()
/**
* +-----+-----+-----+
* |Col1 |Col2 |Col3 |
* +-----+-----+-----+
* |true |false|false|
* |true |true |true |
* |false|false|true |
* |false|false|false|
* +-----+-----+-----+
*
* root
* |-- Col1: boolean (nullable = true)
* |-- Col2: boolean (nullable = true)
* |-- Col3: boolean (nullable = true)
*/
A simple way to compute the count of each distinct value in each column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val findCounts = df2.columns.flatMap(c =>
  Seq(col(c), count(c).over(Window.partitionBy(c)).as(s"count_$c")))
df2.select(findCounts: _*).distinct()
.show(false)
/**
* +-----+----------+-----+----------+-----+----------+
* |Col1 |count_Col1|Col2 |count_Col2|Col3 |count_Col3|
* +-----+----------+-----+----------+-----+----------+
* |false|2 |false|3 |false|2 |
* |false|2 |false|3 |true |2 |
* |true |2 |false|3 |false|2 |
* |true |2 |true |1 |true |2 |
* +-----+----------+-----+----------+-----+----------+
*/
If you need the output in the same format as mentioned, try this (assuming all the columns in the dataframe have the same distinct values):
val columns = df2.columns
val head = columns.head
val zeroDF = df2.groupBy(head).agg(count(head).as(s"${head}_count"))
columns.tail.foldLeft(zeroDF){
(df, c) => df.join(df2.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
}.show(false)
/**
* +-----+----------+-----+----------+-----+----------+
* |Col1 |Col1_count|Col2 |Col2_count|Col3 |Col3_count|
* +-----+----------+-----+----------+-----+----------+
* |false|2 |false|3 |false|2 |
* |true |2 |true |1 |true |2 |
* +-----+----------+-----+----------+-----+----------+
*/
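A hedged alternative sketch (my own, not from the answer above), for the case where the columns do not share the same set of distinct values: collect the counts in long format, one row per (column, value) pair.
// one groupBy per column, stacked with unionByName
val longCounts = df2.columns
  .map { c =>
    df2.groupBy(col(c).cast("string").as("value"))
      .agg(count(lit(1)).as("count"))
      .withColumn("column", lit(c))
  }
  .reduce(_ unionByName _)
longCounts.select("column", "value", "count").show(false)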
I have the below pyspark dataframe:
df
id date key1
A1 2020-01-06 K1
A1 2020-01-06 K2
A1 2020-01-07 K3
A1 2020-01-07 K3
A1 2020-01-20 K3
A2 ..
I need to add a column last_date holding, for each row, the most recent earlier date for the same id, ignoring the current row's date.
id date key1 last_date
A1 2020-01-06 K1
A1 2020-01-06 K2
A1 2020-01-07 K3 2020-01-06
A1 2020-01-07 K3 2020-01-06
A1 2020-01-20 K3 2020-01-07
I am using the code below, but it returns the same date as the current row. How do I exclude the current row's date?
unbounded_window = (
Window.partitionBy("id")
.orderBy("date")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
prepared_df = df.withColumn("last_date", F.max("date").over(unbounded_window))
You need to find when the date changes and forward fill it. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst = sqlContext.createDataFrame([('A1','2020-01-06','K1' ),('A1','2020-01-06','K2'),\
('A1','2020-01-07','K3' ),('A1','2020-01-07','K3'),('A1','2020-01-20','K3')],schema=['id','date','key'])
w=Window.partitionBy('id').orderBy('date')
tst_stp = tst.withColumn("date_lag",F.lag('date').over(w))
tst_dt = tst_stp.withColumn("date_chk",F.when((F.col('date')!=F.col('date_lag')),F.col('date_lag')))
#%%Forward fill
tst_res = tst_dt.withColumn('last_date',F.last('date_chk',ignorenulls=True).over(w))
Results:
tst_res.show()
+---+----------+---+----------+----------+----------+
| id| date|key| date_lag| date_chk| last_date|
+---+----------+---+----------+----------+----------+
| A1|2020-01-06| K1| null| null| null|
| A1|2020-01-06| K2|2020-01-06| null| null|
| A1|2020-01-07| K3|2020-01-06|2020-01-06|2020-01-06|
| A1|2020-01-07| K3|2020-01-07| null|2020-01-06|
| A1|2020-01-20| K3|2020-01-07|2020-01-07|2020-01-07|
+---+----------+---+----------+----------+----------+
Try this-
df1.show(false)
df1.printSchema()
/**
* +---+-------------------+----+
* |id |date |key1|
* +---+-------------------+----+
* |A1 |2020-01-06 00:00:00|K1 |
* |A1 |2020-01-06 00:00:00|K2 |
* |A1 |2020-01-07 00:00:00|K3 |
* |A1 |2020-01-07 00:00:00|K3 |
* |A1 |2020-01-20 00:00:00|K3 |
* +---+-------------------+----+
*
* root
* |-- id: string (nullable = true)
* |-- date: timestamp (nullable = true)
* |-- key1: string (nullable = true)
*/
val w = Window.partitionBy("id").orderBy("date")
val w1 = Window.partitionBy("id", "date")
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df1.withColumn("last_date", lag(col("date"), 1).over(w))
.withColumn("last_date", min(col("last_date")).over(w1))
.withColumn("last_date", when($"date" =!= $"last_date", $"last_date"))
.show(false)
/**
* +---+-------------------+----+-------------------+
* |id |date |key1|last_date |
* +---+-------------------+----+-------------------+
* |A1 |2020-01-06 00:00:00|K1 |null |
* |A1 |2020-01-06 00:00:00|K2 |null |
* |A1 |2020-01-07 00:00:00|K3 |2020-01-06 00:00:00|
* |A1 |2020-01-07 00:00:00|K3 |2020-01-06 00:00:00|
* |A1 |2020-01-20 00:00:00|K3 |2020-01-07 00:00:00|
* +---+-------------------+----+-------------------+
*/
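A hedged alternative sketch (mine, not part of the answer above): take the lag over the distinct (id, date) pairs and join it back, so every row sharing a date gets the previous distinct date without the extra min/when steps.
// lag over distinct dates per id, then join back on (id, date)
val wd = Window.partitionBy("id").orderBy("date")
val prevDates = df1.select($"id", $"date").distinct()
  .withColumn("last_date", lag($"date", 1).over(wd))
df1.join(prevDates, Seq("id", "date"), "left").show(false)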
In the example below, the code produces a computation that is applied every time to the same set of original records.
Instead, the code should use the previously computed value to produce the subsequent quantity.
package playground
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{KeyValueGroupedDataset, SparkSession}
object basic2 extends App {
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
val spark = SparkSession
.builder()
.appName("Sample app")
.master("local")
.getOrCreate()
import spark.implicits._
final case class Owner(car: String, pcode: String, qtty: Double)
final case class Invoice(car: String, pcode: String, qtty: Double)
val data = Seq(
Owner("A", "666", 80),
Owner("B", "555", 20),
Owner("A", "444", 50),
Owner("A", "222", 20),
Owner("C", "444", 20),
Owner("C", "666", 80),
Owner("C", "555", 120),
Owner("A", "888", 100)
)
val fleet = Seq(Invoice("A", "666", 15), Invoice("A", "888", 12))
val owners = spark.createDataset(data)
val invoices = spark.createDataset(fleet)
val gb: KeyValueGroupedDataset[Invoice, (Owner, Invoice)] = owners
.joinWith(invoices, invoices("car") === owners("car"), "inner")
.groupByKey(_._2)
gb.flatMapGroups {
case (fleet, group) ⇒
val subOwner: Vector[Owner] = group.toVector.map(_._1)
val calculatedRes = subOwner.filter(_.car == fleet.car)
calculatedRes.map(c => c.copy(qtty = .3 * c.qtty + fleet.qtty))
}
.show()
}
/**
* +---+-----+----+
* |car|pcode|qtty|
* +---+-----+----+
* | A| 666|39.0|
* | A| 444|30.0|
* | A| 222|21.0|
* | A| 888|45.0|
* | A| 666|36.0|
* | A| 444|27.0|
* | A| 222|18.0|
* | A| 888|42.0|
* +---+-----+----+
*
* +---+-----+----+
* |car|pcode|qtty|
* +---+-----+----+
* | A| 666|0.3 * 39.0 + 12|
* | A| 444|0.3 * 30.0 + 12|
* | A| 222|0.3 * 21.0 + 12|
* | A| 888|0.3 * 45.0 + 12|
* +---+-----+----+
*/
The second table above shows the expected output; the first table is what the code in this question produces.
How can I produce the expected output in an iterative way?
Note that the order of computation doesn't matter; the results will differ, but either order is a valid answer.
Check the code below.
// Computes the chained quantity: start from 0.3 * owner qtty + first invoice qtty,
// then fold the remaining invoice quantities as 0.3 * runningTotal + nextInvoiceQtty.
val getQtty = udf((invoicesQtty: Seq[Double], ownersQtty: Double) => {
  invoicesQtty.tail.foldLeft(0.3 * ownersQtty + invoicesQtty.head)(
    (totalIQ, nextInvoiceQtty) => 0.3 * totalIQ + nextInvoiceQtty
  )
})

// Builds a human-readable string of the same chained computation, for verification.
val getQttyStr = udf((invoicesQtty: Seq[Double], ownersQtty: Double) => {
  val totalIQ = 0.3 * ownersQtty + invoicesQtty.head
  invoicesQtty.tail.foldLeft("")(
    (data, nextInvoiceQtty) =>
      s"0.3 * ${if (data.isEmpty) totalIQ else s"($data)"} + $nextInvoiceQtty"
  )
})
owners
.join(invoices, invoices("car") === owners("car"), "inner")
.orderBy(invoices("qtty").desc)
.groupBy(owners("car"),owners("pcode"))
.agg(
collect_list(invoices("qtty")).as("invoices_qtty"),
first(owners("qtty")).as("owners_qtty")
)
.withColumn("qtty",getQtty($"invoices_qtty",$"owners_qtty"))
.withColumn("qtty_str",getQttyStr($"invoices_qtty",$"owners_qtty"))
.show(false)
Result
+---+-----+-------------+-----------+----+-----------------+
|car|pcode|invoices_qtty|owners_qtty|qtty|qtty_str |
+---+-----+-------------+-----------+----+-----------------+
|A |666 |[15.0, 12.0] |80.0 |23.7|0.3 * 39.0 + 12.0|
|A |888 |[15.0, 12.0] |100.0 |25.5|0.3 * 45.0 + 12.0|
|A |444 |[15.0, 12.0] |50.0 |21.0|0.3 * 30.0 + 12.0|
|A |222 |[15.0, 12.0] |20.0 |18.3|0.3 * 21.0 + 12.0|
+---+-----+-------------+-----------+----+-----------------+
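A hedged refinement (my own sketch), reusing the owners and invoices datasets and the getQtty udf above: Spark does not guarantee that the orderBy before the groupBy survives into collect_list, so sorting the collected quantities inside each row is more robust.
// sort the collected invoice quantities descending instead of relying on orderBy
owners
  .join(invoices, invoices("car") === owners("car"), "inner")
  .groupBy(owners("car"), owners("pcode"))
  .agg(
    sort_array(collect_list(invoices("qtty")), asc = false).as("invoices_qtty"),
    first(owners("qtty")).as("owners_qtty")
  )
  .withColumn("qtty", getQtty($"invoices_qtty", $"owners_qtty"))
  .show(false)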
I would like to know how we can apply a filter after applying the MAX function on a dataframe using PySpark.
Example: Display the name of the employee who earns the highest salary.
In sql,
select ename from emp where sal=(select max(sal) from emp) ;
I want to apply the same logic on a dataframe in PySpark.
The same query should work (tried in Spark 2.4.5):
val df = Seq(("emp1", 100), ("emp2", 200)).toDF("ename", "sal")
df.createOrReplaceTempView("emp")
spark.sql(
"""
| select ename from emp where sal=(select max(sal) from emp)
""".stripMargin)
.show(false)
/**
* +-----+
* |ename|
* +-----+
* |emp2 |
* +-----+
*/
Or, using the DataFrame API (with a different sample dataframe):
val sourceDF = Seq( (1, 10, "05-10-2019"),
(2, 20, "07-22-2020"),
(3, 30, "11-03-2017"))
.toDF("id", "metric", "transaction_date")
sourceDF.show(false)
// +---+------+----------------+
// |id |metric|transaction_date|
// +---+------+----------------+
// |1 |10 |05-10-2019 |
// |2 |20 |07-22-2020 |
// |3 |30 |11-03-2017 |
// +---+------+----------------+
val resDF = sourceDF.filter('metric.equalTo(sourceDF.agg(max('metric)).head().getInt(0)))
resDF.show(false)
// +---+------+----------------+
// |id |metric|transaction_date|
// +---+------+----------------+
// |3 |30 |11-03-2017 |
// +---+------+----------------+
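Another hedged option (my own sketch): compute the max over an empty window so the filtering happens in one pass, at the cost of shuffling all rows into a single partition.
// max over an empty window puts the global max on every row
import org.apache.spark.sql.expressions.Window
sourceDF.withColumn("max_metric", max($"metric").over(Window.partitionBy()))
  .filter($"metric" === $"max_metric")
  .drop("max_metric")
  .show(false)
// +---+------+----------------+
// |id |metric|transaction_date|
// +---+------+----------------+
// |3  |30    |11-03-2017      |
// +---+------+----------------+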
I have been told that EXCEPT is a very costly operation and that one should always try to avoid it.
My Use Case -
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table(<table_name>)
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.except(myFilteredDataframe)
But I am confused: in such a use case, what are my alternatives?
Use a left anti join, as below:
val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |0 |foo |
* |1 |foo |
* +---+----+
*
* root
* |-- id: long (nullable = false)
* |-- name: string (nullable = false)
*/
val df2 = df.filter("id=0")
df.join(df2, df.columns.toSeq, "leftanti")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |foo |
* +---+----+
*/
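For the asker's exact case, a hedged shortcut (not from the answer above): since myFilteredDataframe was derived from rawDataDf by a where clause, the complement can be taken by simply negating the filter. Note, however, that rows where the predicate evaluates to null (e.g. rollNo is null) are dropped by both the positive and the negated filter, unlike except, so the left anti join above is the safer choice when nulls matter.
// negate the original filter string instead of diffing the dataframes
val allOthersViaFilter = rawDataDf.where(s"NOT ($myFilter)")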