Spark DataFrame group by with a new indicator column - Scala

I need to group by the "KEY" column and check whether the "TYPE_CODE" column has both "PL" and "JL" values; if so, I need to add an Indicator column with "Y", else "N".
Example:
// Input values
val values = List(
  List("66", "PL"),
  List("67", "JL"), List("67", "PL"), List("67", "PO"),
  List("68", "JL"), List("68", "PO")
).map(x => (x(0), x(1)))

import spark.implicits._

// Create a DataFrame
val cmc = values.toDF("KEY", "TYPE_CODE")
cmc.show(false)
+---+---------+
|KEY|TYPE_CODE|
+---+---------+
|66 |PL       |
|67 |JL       |
|67 |PL       |
|67 |PO       |
|68 |JL       |
|68 |PO       |
+---+---------+
Expected output:
For each "KEY", if its "TYPE_CODE" values include both PL and JL then Y, else N.
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
|66 |PL       |N        |
|67 |JL       |Y        |
|67 |PL       |Y        |
|67 |PO       |Y        |
|68 |JL       |N        |
|68 |PO       |N        |
+---+---------+---------+
For example:
67 has both PL and JL, so "Y"
66 has only PL, so "N"
68 has JL but not PL, so "N"

One option:
1) collect TYPE_CODE as a list;
2) check whether it contains the specific strings;
3) then flatten the list back out with explode:
(cmc.groupBy("KEY")
.agg(collect_list("TYPE_CODE").as("TYPE_CODE"))
.withColumn("Indicator",
when(array_contains($"TYPE_CODE", "PL") && array_contains($"TYPE_CODE", "JL"), "Y").otherwise("N"))
.withColumn("TYPE_CODE", explode($"TYPE_CODE"))).show
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
| 68| JL| N|
| 68| PO| N|
| 67| JL| Y|
| 67| PL| Y|
| 67| PO| Y|
| 66| PL| N|
+---+---------+---------+

Another option:
Group by KEY and use agg to create two separate indicator columns (one for JL and one for PL), then calculate the combined indicator.
Join with the original DataFrame.
Altogether:
val indicators = cmc.groupBy("KEY").agg(
sum(when($"TYPE_CODE" === "PL", 1).otherwise(0)) as "pls",
sum(when($"TYPE_CODE" === "JL", 1).otherwise(0)) as "jls"
).withColumn("Indicator", when($"pls" > 0 && $"jls" > 0, "Y").otherwise("N"))
val result = cmc.join(indicators, "KEY")
.select("KEY", "TYPE_CODE", "Indicator")
This might be slower than Psidom's answer above (the collect_list option), but might be safer: collect_list can be problematic if you have a huge number of matches for a specific key (that list would have to be stored in a single worker's memory).
EDIT:
In case the input is known to be unique (i.e. JL / PL would only appear once per key, at most), indicators could be created using simple count aggregation, which is (arguably) easier to read:
val indicators = cmc
  .where($"TYPE_CODE".isin("PL", "JL"))
  .groupBy("KEY").count()
  .withColumn("Indicator", when($"count" === 2, "Y").otherwise("N"))

Related

How to transform the below table to the required format?

I have the following table loaded as a dataframe:
Id | Name   | customCount | Custom1   | Custom1value | custom2   | custom2Value | custom3 | custom3Value
1  | qwerty | 2           | Height    | 171          | Age       | 76           | Null    | Null
2  | asdfg  | 2           | Weight    | 78           | Height    | 166          | Null    | Null
3  | zxcvb  | 3           | Age       | 28           | SkinColor | white        | Height  | 67
4  | tyuio  | 1           | Height    | 177          | Null      | Null         | Null    | Null
5  | asdfgh | 2           | SkinColor | brown        | Age       | 34           | Null    | Null
I need to change this table into the below format:
Id | Name   | customCount | Height | Weight | Age  | SkinColor
1  | qwerty | 2           | 171    | Null   | 76   | Null
2  | asdfg  | 2           | 166    | 78     | Null | Null
3  | zxcvb  | 3           | 67     | Null   | 28   | white
4  | tyuio  | 1           | 177    | Null   | Null | Null
5  | asdfgh | 2           | Null   | Null   | 34   | brown
I tried it for two custom field columns:
val rawDf = spark.read.option("Header", false).options(Map("sep" -> "|")).csv("/sample/data.csv")
rawDf.createOrReplaceTempView("Table")
val dataframe = spark.sql("select distinct * from (select `_c3` from Table union select `_c5` from Table)")
val dfWithDistinctColumns = dataframe.toDF("colNames")
val list = dfWithDistinctColumns.select("colNames").map(x => x.getString(0)).collect().toList
val rawDfWithSchema = rawDf.toDF("Id", "Name", "customCount", "h1", "v1", "h2", "v2")
val expectedDf = list.foldLeft(rawDfWithSchema)((df1, c) =>
  df1.withColumn(c,
    when(col("h1") === c, col("v1"))
      .when(col("h2") === c, col("v2"))
      .otherwise(null))
).drop("h1", "h2", "v1", "v2")
But I am not able to do a union on multiple columns when I try it with 3 custom field columns.
Can you please give any idea/solution for this?
You can do a pivot, but you also need to clean up the format of the dataframe first:
import org.apache.spark.sql.functions._

val df2 = df.select(
    $"Id", $"Name", $"customCount",
    explode(array(
      array($"Custom1", $"Custom1value"),
      array($"custom2", $"custom2Value"),
      array($"custom3", $"custom3Value")
    )).alias("custom")
  ).select(
    $"Id", $"Name", $"customCount",
    $"custom"(0).alias("key"),
    $"custom"(1).alias("value")
  ).groupBy(
    "Id", "Name", "customCount"
  ).pivot("key").agg(first("value")).drop("null").orderBy("Id")
df2.show
+---+------+-----------+----+------+---------+------+
| Id| Name|customCount| Age|Height|SkinColor|Weight|
+---+------+-----------+----+------+---------+------+
| 1|qwerty| 2| 76| 171| null| null|
| 2| asdfg| 2|null| 166| null| 78|
| 3| zxcvb| 3| 28| 67| white| null|
| 4| tyuio| 1|null| 177| null| null|
| 5|asdfgh| 2| 34| null| brown| null|
+---+------+-----------+----+------+---------+------+

Complex logic on a PySpark dataframe involving a previous row's existing value as well as a previous row's value generated on the fly

I have to apply logic on a Spark dataframe or RDD (preferably a dataframe) which requires generating two extra columns. The first generated column is dependent on other columns of the same row, and the second generated column is dependent on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. Columns A and B are available in the dataframe; columns C and D are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an RDD, and then apply a Python function (having all the calculation logic) with the mapPartitions API.
However, this approach may create load on one executor.
Mathematically speaking, C2 = D1 - C1 where D1 = C1 - B1, so D1 - C1 becomes C1 - B1 - C1 => -B1.
In PySpark, the lag window function has a parameter called default; this should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window

df = spark.createDataFrame([(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)], ['a', 'b'])
w = Window.orderBy('a')
df_lag = df.withColumn('c', F.lag(F.col('b') * -1, default=1000).over(w))
df_final = df_lag.withColumn('d', F.col('c') - F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy("A")

# seed C with the default value and compute D in a first pass
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    # for rows after the first, replace C with lag(D) - lag(C) from the first pass
    .withColumn("C",
        F.when(F.lag("C").over(w).isNotNull(),
               F.lag("D").over(w) - F.lag("C").over(w))
         .otherwise(F.col("C")))
    # recompute D with the updated C
    .withColumn("D", F.col("C") - F.col("B"))
)

Spark: delete the previous row when some conditions match with the next row

I have a dataframe like below:
type f1 f2 value
1 a xy 11
2 b ab 13
3 c na 16
3 c dir 18
3 c ls 23
I have to delete a previous row when some conditions match with the next row.
For example, from the above table, when type == type(row-1) && f1 == f1(row-1) && abs(value - value(row-1)) < 2, I want to delete the previous row.
So my table should look like below:
type f1 f2 value
1 a xy 11
2 b ab 13
3 c dir 18
3 c ls 23
I am thinking that we can make use of the lag or lead features, but I am not getting the exact logic.
Yes, it can be done using lead():
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// define the window specification
val windowSpec = Window.partitionBy($"type", $"f1").orderBy($"type")

val inputDF = sc.parallelize(List((1, "a", "xy", 11), (2, "b", "ab", 13), (3, "c", "na", 16), (3, "c", "dir", 18), (3, "c", "ls", 23)))
  .toDF("type", "f1", "f2", "value")

inputDF.withColumn("leadValue", lead($"value", 1).over(windowSpec))
  .withColumn("result", when(abs($"leadValue" - $"value") <= 2, 1).otherwise(0)) // check the condition against the next row
  .filter($"result" === 0)     // keep only rows that are not flagged for deletion
  .drop("leadValue", "result") // remove the helper columns
  .orderBy($"type")
  .show
Output:
+----+---+---+-----+
|type| f1| f2|value|
+----+---+---+-----+
| 1| a| xy| 11|
| 2| b| ab| 13|
| 3| c|dir| 18|
| 3| c| ls| 23|
+----+---+---+-----+
Here, as we are already partitioning by type & f1, we need not check for their equality condition.
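One caveat: ordering the window by type inside a partition that is already keyed on type does not pin down which row counts as "previous"/"next". If the comparison has to follow the original input order, a surrogate ordering column could be attached first; a minimal sketch (the rowId column is an assumption, not part of the original answer):

import org.apache.spark.sql.functions.monotonically_increasing_id

// attach a surrogate ordering column so that "previous"/"next" is well defined
val orderedDF = inputDF.withColumn("rowId", monotonically_increasing_id())
val orderedWindowSpec = Window.partitionBy($"type", $"f1").orderBy($"rowId")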

Iterate through rows in DataFrame and transform one to many

As an example in Scala, I have a list, and I want every item that matches a condition to appear twice (this may not be the best option for this use case, but it's the idea that counts):
l.flatMap {
  case n if n % 2 == 0 => List(n, n)
  case n => List(n)
}
I would like to do something similar in Spark - iterate over rows in a DataFrame and if a row matches a certain condition then I need to duplicate the row with some modifications in the copy. How can this be done?
For example, if my input is the table below:
| name | age |
|-------|-----|
| Peter | 50 |
| Paul | 60 |
| Mary | 70 |
I want to iterate through the table and test each row against multiple conditions, and for each condition that matches, an entry should be created with the name of the matched condition.
E.g. condition #1 is "age > 60" and condition #2 is "name.length <=4". This should result in the following output:
| name | age |condition|
|-------|-----|---------|
| Paul | 60 | 2 |
| Mary | 70 | 1 |
| Mary | 70 | 2 |
You can build one filtered dataframe per matching condition and then finally union all of them.
import org.apache.spark.sql.functions._
val condition1DF = df.filter($"age" > 60).withColumn("condition", lit(1))
val condition2DF = df.filter(length($"name") <= 4).withColumn("condition", lit(2))
val finalDF = condition1DF.union(condition2DF)
You should have your desired output as:
+----+---+---------+
|name|age|condition|
+----+---+---------+
|Mary|70 |1 |
|Paul|60 |2 |
|Mary|70 |2 |
+----+---+---------+
I hope the answer is helpful
You can also use a combination of a UDF and explode(), like in the following example:
// set up example data
case class Pers1(name: String, age: Int)
val d = Seq(Pers1("Peter", 50), Pers1("Paul", 60), Pers1("Mary", 70))
val df = spark.createDataFrame(d)

// conditions logic - as complex as you'd like
// probably should use a Set instead of a Seq, but I digress..
val conditions: (String, Int) => Seq[Int] = { (name, age) =>
  (if (age > 60) Seq(1) else Seq.empty) ++
  (if (name.length <= 4) Seq(2) else Seq.empty)
}

// define the UDF for Spark
import org.apache.spark.sql.functions.{col, explode, udf}
val conditionsUdf = udf(conditions)

// explode() works just like flatMap
val result = df.withColumn("condition",
  explode(conditionsUdf(col("name"), col("age"))))
result.show
+----+---+---------+
|name|age|condition|
+----+---+---------+
|Paul| 60| 2|
|Mary| 70| 1|
|Mary| 70| 2|
+----+---+---------+
Here is one way to flatten it with rdd.flatMap:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val new_rdd = df.rdd.flatMap(r => {
  val conditions = Seq((1, r.getAs[Int](1) > 60), (2, r.getAs[String](0).length <= 4))
  conditions.collect { case (i, c) if c => Row.fromSeq(r.toSeq :+ i) }
})

val new_schema = StructType(df.schema :+ StructField("condition", IntegerType, true))
spark.createDataFrame(new_rdd, new_schema).show
+----+---+---------+
|name|age|condition|
+----+---+---------+
|Paul| 60| 2|
|Mary| 70| 1|
|Mary| 70| 2|
+----+---+---------+

Scala Spark - How to reduce a dataframe with many pairs of columns into a single pair of columns?

I have a dataframe with many pairs of (score, count) columns.
This situation is not a pivot, but is similar to an unpivot.
Example:
|house_score|house_count|mobile_score|mobile_count|sport_score|sport_count| ...<other column pairs>... |
|         20|          2|          48|           6|          6|         78| ...                        |
|         40|         78|          47|          74|         69|          6| ...                        |
I want a new dataframe with only two columns, score and count. The new dataframe should reduce all the column pairs into a single pair of columns.
+-------+-------+
| score | count |
+-------+-------+
|    20 |     2 |
|    40 |    78 |
|    48 |     6 |
|    47 |    74 |
|     6 |    78 |
|    69 |     6 |
+-------+-------+
What's the best solution (elegant code/performance)?
You can achieve this using a foldLeft over the column names (excluding the part after the _). This is reasonably efficient since all intensive operations are distributed, and the code is fairly clean and concise.
// df from example
val df = sc.parallelize(List((20, 2, 48, 6, 6, 78), (40, 78, 47, 74, 69, 6)))
  .toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")

// grab the column-name prefixes (part before the _)
val cols = df.columns.map(col => col.split("_")(0)).distinct

// fold left over all prefixes
val result = cols.tail.foldLeft(
  // init with the cols.head columns
  df.select(col(s"${cols.head}_score").as("score"), col(s"${cols.head}_count").as("count"))
) { case (acc, c) =>
  // union the current prefix's score/count columns
  acc.unionAll(df.select(col(s"${c}_score").as("score"), col(s"${c}_count").as("count")))
}

result.show
result.show
Using unionAlls as suggested in the other answer will require you to scan the data multiple times and, on each scan, project the df to only 2 columns. From a performance perspective, scanning the data multiple times should be avoided if you can do the work in 1 pass, especially if you have large datasets that are not cacheable or if you need to do many scans.
You can do it in 1 pass, by generating all the tuples (score, count) and then flat mapping them. I let you decide how elegant it is:
scala> :paste
// Entering paste mode (ctrl-D to finish)
val df = List((20,2,48,6,6,78), (40,78,47,74,69,6))
.toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")
df.show
val result = df
.flatMap(r => Range(0, 5, 2).map(i => (r.getInt(i), r.getInt(i + 1))))
.toDF("score", "count")
result.show
// Exiting paste mode, now interpreting.
+-----------+-----------+------------+------------+-----------+-----------+
|house_score|house_count|mobile_score|mobile_count|sport_score|sport_count|
+-----------+-----------+------------+------------+-----------+-----------+
| 20| 2| 48| 6| 6| 78|
| 40| 78| 47| 74| 69| 6|
+-----------+-----------+------------+------------+-----------+-----------+
+-----+-----+
|score|count|
+-----+-----+
| 20| 2|
| 48| 6|
| 6| 78|
| 40| 78|
| 47| 74|
| 69| 6|
+-----+-----+
df: org.apache.spark.sql.DataFrame = [house_score: int, house_count: int ... 4 more fields]
result: org.apache.spark.sql.DataFrame = [score: int, count: int]
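The Range(0, 5, 2) above hard-codes the three column pairs. If the number of pairs is not fixed, the indices could instead be derived from the schema; a sketch, assuming every *_score column is immediately followed by its matching *_count column (as in the example df above):

// indices of the *_score columns; each is assumed to be followed by its *_count column
val scoreIdx: Seq[Int] = df.columns.toSeq.zipWithIndex
  .collect { case (name, i) if name.endsWith("_score") => i }

val generalResult = df
  .flatMap(r => scoreIdx.map(i => (r.getInt(i), r.getInt(i + 1))))
  .toDF("score", "count")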