I have a dataframe like:
+----------+-----+------+------+-----+---+
| product|china|france|german|india|usa|
+----------+-----+------+------+-----+---+
| beans| 496| 200| 210| 234|119|
| banana| null| 345| 234| 123|122|
|starwberry| 340| 430| 246| 111|321|
| mango| null| 345| 456| 110|223|
| chiku| 765| 455| 666| 122|222|
| apple| 109| 766| 544| 444|333|
+----------+-----+------+------+-----+---+
I want to unpivot it while keeping multiple columns fixed, like:
import spark.implicits._
val unPivotDF = testData.select($"product",$"german", expr("stack(4, 'china', china, 'usa', usa, 'france', france,'india',india) " +
"as (Country,Total)"))
unPivotDF.show()
which gives the output below:
+----------+------+-------+-----+
| product|german|Country|Total|
+----------+------+-------+-----+
| beans| 210| china| 496|
| beans| 210| usa| 119|
| beans| 210| france| 200|
| beans| 210| india| 234|
| banana| 234| china| null|
| banana| 234| usa| 122|
| banana| 234| france| 345|
| banana| 234| india| 123|
|starwberry| 246| china| 340|
|starwberry| 246| usa| 321|
|starwberry| 246| france| 430|
|starwberry| 246| india| 111|
This is perfect, but the fixed columns (product and german here) are only known at runtime, so I cannot hard-code the column names in the select statement.
So this is what I was doing:
var fixedCol = List[String]()
fixedCol = "german" :: fixedCol
fixedCol = "product" :: fixedCol
val col = df.select(fixedCol: _*, expr("stack(.......)")) // it gives an error, as the first argument of select is fixed and the second argument is varargs
I know it can be done by using SQL, but I cannot use SQL:
df.createOrReplaceTempView("df")
spark.sql("select.......")
Is there any other way to make it dynamic?
Convert all the column names and the expression into a single List[Column]:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, expr}
val fixedCol: List[Column] = List(col("german"), col("product"), expr("stack(.......)"))
df.select(fixedCol: _*)
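For completeness, here is a minimal sketch of how the whole select could be assembled at runtime, assuming fixedCols holds the runtime list of fixed column names and every remaining column should be unpivoted (the names below are only illustrative):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, expr}

val fixedCols = List("product", "german")                           // assumed runtime input
val unpivotCols = df.columns.filterNot(c => fixedCols.contains(c))  // e.g. china, france, india, usa

// Rebuild the stack expression from the remaining column names.
val stackExpr = expr(
  s"stack(${unpivotCols.length}, " +
    unpivotCols.map(c => s"'$c', $c").mkString(", ") +
    ") as (Country, Total)")

val selectCols: List[Column] = fixedCols.map(col) :+ stackExpr
val unPivotDF = df.select(selectCols: _*)
unPivotDF.show()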
I need some help. I have two dataframes, one has a few dates and the other has my significant data, catalogued by date.
It goes something like this:
First df, with the relevant data
+------+----------+---------------+
| id| test_date| score|
+------+----------+---------------+
| 1|2021-03-31| 94|
| 1|2021-01-31| 93|
| 1|2020-12-31| 100|
| 1|2020-06-30| 95|
| 1|2019-10-31| 58|
| 1|2017-10-31| 78|
| 2|2020-01-31| 79|
| 2|2018-03-31| 66|
| 2|2016-05-31| 77|
| 3|2021-05-31| 97|
| 3|2020-07-31| 100|
| 3|2019-07-31| 99|
| 3|2019-06-30| 98|
| 3|2018-07-31| 91|
| 3|2018-02-28| 86|
| 3|2017-11-30| 82|
+------+----------+---------------+
Second df, with the dates
+--------------+--------------+--------------+
| eval_date_1| eval_date_2| eval_date_3|
+--------------+--------------+--------------+
| 2021-01-31| 2020-10-31| 2019-06-30|
+--------------+--------------+--------------+
Needed DF
+------+--------------+---------+--------------+---------+--------------+---------+
| id| eval_date_1| score_1 | eval_date_2| score_2 | eval_date_3| score_3 |
+------+--------------+---------+--------------+---------+--------------+---------+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
+------+--------------+---------+--------------+---------+--------------+---------+
So, for instance, for the first id the needed df takes the scores from the second, fourth and sixth rows of the first df. Those are the most recent test_dates that are equal to or earlier than the corresponding eval_date in the second df.
Assuming df is your main dataframe and df_date is the one which contains only dates.
from functools import reduce
from pyspark.sql import functions as F, Window as W

df_final = reduce(
    lambda a, b: a.join(b, on="id"),
    (
        df.join(
            F.broadcast(df_date.select(f"eval_date_{i}")),
            on=F.col(f"eval_date_{i}") >= F.col("test_date"),
        )
        .withColumn(
            "rnk",
            F.row_number().over(W.partitionBy("id").orderBy(F.col("test_date").desc())),
        )
        .where("rnk=1")
        .select("id", f"eval_date_{i}", "score")
        for i in range(1, 4)
    ),
)
df_final.show()
+---+-----------+-----+-----------+-----+-----------+-----+
| id|eval_date_1|score|eval_date_2|score|eval_date_3|score|
+---+-----------+-----+-----------+-----+-----------+-----+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
+---+-----------+-----+-----------+-----+-----------+-----+
I have the following Dataset :
+----+-----+--------+-----+--------+
| id|date1|address1|date2|address2|
+----+-----+--------+-----+--------+
| 1| 2019| Paris| 2018| Madrid|
| 2| 2020|New York| 2002| Geneva|
| 3| 1998| London| 2005| Tokyo|
| 4| 2005| Sydney| 2013| Berlin|
+----+-----+--------+-----+--------+
I am trying to obtain the most recent date and the corresponding address for each id, in two additional columns. The desired result is:
+----+-----+--------+-----+--------+--------+-----------+
| id|date1|address1|date2|address2|date_max|address_max|
+----+-----+--------+-----+--------+--------+-----------+
| 1| 2019| Paris| 2018| Madrid| 2019| Paris|
| 2| 2020|New York| 2002| Geneva| 2020| New York|
| 3| 1998| London| 2005| Tokyo| 2005| Tokyo|
| 4| 2005| Sydney| 2013| Berlin| 2013| Berlin|
+----+-----+--------+-----+--------+--------+-----------+
Any ideas on how to do this efficiently?
You can do a CASE WHEN to pick the more recent date/address:
import org.apache.spark.sql.functions._
val date_max = when(col("date1") > col("date2"), col("date1")).otherwise(col("date2")).alias("date_max")
val address_max = when(col("date1") > col("date2"), col("address1")).otherwise(col("address2")).alias("address_max")
val result = df.select(col("*"), date_max, address_max)
If you want a more scalable option with many columns:
import spark.implicits._   // needed for the $"..." column syntax below

val df2 = df.withColumn(
  "all_date",
  array(df.columns.filter(_.contains("date")).map(col): _*)
).withColumn(
  "all_address",
  array(df.columns.filter(_.contains("address")).map(col): _*)
).withColumn(
  "date_max",
  array_max($"all_date")
).withColumn(
  "address_max",
  element_at(
    $"all_address",
    array_position($"all_date", array_max($"all_date")).cast("int")
  )
).drop("all_date", "all_address")
df2.show
+---+-----+--------+-----+--------+--------+-----------+
| id|date1|address1|date2|address2|date_max|address_max|
+---+-----+--------+-----+--------+--------+-----------+
|  1| 2019|   Paris| 2018|  Madrid|    2019|      Paris|
|  2| 2020|New York| 2002|  Geneva|    2020|   New York|
|  3| 1998|  London| 2005|   Tokyo|    2005|      Tokyo|
|  4| 2005|  Sydney| 2013|  Berlin|    2013|     Berlin|
+---+-----+--------+-----+--------+--------+-----------+
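As a side note, since Spark can order struct values field by field, the date and its address can also be kept together without building the two arrays. This is only a sketch of that alternative, not the answer above, and it assumes greatest accepts struct inputs in your Spark version:
import org.apache.spark.sql.functions.{col, greatest, struct}

// Pair each date with its address, pick the greatest pair (compared on the date first),
// then unpack the winning pair into the two result columns.
val best = greatest(
  struct(col("date1").as("d"), col("address1").as("a")),
  struct(col("date2").as("d"), col("address2").as("a"))
)
val df3 = df
  .withColumn("best", best)
  .withColumn("date_max", col("best.d"))
  .withColumn("address_max", col("best.a"))
  .drop("best")
df3.show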
val data = Seq(
  ("India", "Pakistan", "India"),
  ("Australia", "India", "India"),
  ("New Zealand", "Zimbabwe", "New Zealand"),
  ("West Indies", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh"),
  ("Sri Lanka", "Bangladesh", "Bangladesh")
)
val df = data.toDF("Team_1", "Team_2", "Winner")
I have this dataframe. I want to count how many matches each team has played.
Three approaches are discussed in the answers above. I tried to evaluate them (just for educational purposes/awareness) in terms of elapsed time, to get a rough idea of their relative performance:
import org.apache.log4j.Level
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object Katu_37 extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  import spark.implicits._

  val data = Seq(
    ("India", "Pakistan", "India"),
    ("Australia", "India", "India"),
    ("New Zealand", "Zimbabwe", "New Zealand"),
    ("West Indies", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh"),
    ("Sri Lanka", "Bangladesh", "Bangladesh")
  )
  val df = data.toDF("Team_1", "Team_2", "Winner")
  df.show

  exec {
    println("METHOD 1 ")
    df.select("Team_1").union(df.select("Team_2")).groupBy("Team_1").agg(count("Team_1")).show()
  }

  exec {
    println("METHOD 2 ")
    df.select(array($"Team_1", $"Team_2").as("Team")).select("Team").withColumn("Team", explode($"Team")).groupBy("Team").agg(count("Team")).show()
  }

  exec {
    println("METHOD 3 ")
    val matchesCount = df.selectExpr("Team_1 as Teams").union(df.selectExpr("Team_2 as Teams"))
    matchesCount.groupBy("Teams").count().withColumnRenamed("count", "MatchesPlayed").show()
  }

  /**
   * Executes the given block, printing its result and the elapsed time in nanoseconds.
   *
   * @param f the block to evaluate
   * @tparam T the result type of the block
   */
  def exec[T](f: => T) = {
    val starttime = System.nanoTime()
    println("t = " + f)
    val endtime = System.nanoTime()
    val elapsedTime = endtime - starttime
    // import java.util.concurrent.TimeUnit
    // val convertToSeconds = TimeUnit.MINUTES.convert(elapsedTime, TimeUnit.NANOSECONDS)
    println("time Elapsed " + elapsedTime)
  }
}
Result:
+-----------+----------+-----------+
| Team_1| Team_2| Winner|
+-----------+----------+-----------+
| India| Pakistan| India|
| Australia| India| India|
|New Zealand| Zimbabwe|New Zealand|
|West Indies|Bangladesh| Bangladesh|
| Sri Lanka|Bangladesh| Bangladesh|
| Sri Lanka|Bangladesh| Bangladesh|
| Sri Lanka|Bangladesh| Bangladesh|
+-----------+----------+-----------+
METHOD 1
+-----------+-------------+
| Team_1|count(Team_1)|
+-----------+-------------+
| Sri Lanka| 3|
| India| 2|
|West Indies| 1|
| Bangladesh| 4|
| Zimbabwe| 1|
|New Zealand| 1|
| Australia| 1|
| Pakistan| 1|
+-----------+-------------+
t = ()
time Elapsed 2729302088
METHOD 2
+-----------+-----------+
| Team|count(Team)|
+-----------+-----------+
| Sri Lanka| 3|
| India| 2|
|West Indies| 1|
| Bangladesh| 4|
| Zimbabwe| 1|
|New Zealand| 1|
| Australia| 1|
| Pakistan| 1|
+-----------+-----------+
t = ()
time Elapsed 646513918
METHOD 3
+-----------+-------------+
| Teams|MatchesPlayed|
+-----------+-------------+
| Sri Lanka| 3|
| India| 2|
|West Indies| 1|
| Bangladesh| 4|
| Zimbabwe| 1|
|New Zealand| 1|
| Australia| 1|
| Pakistan| 1|
+-----------+-------------+
t = ()
time Elapsed 988510662
I observed that the org.apache.spark.sql.functions.array approach takes less time (646513918 nanoseconds) than the union approach, at least in this small local test.
val matchesCount = df.selectExpr("Team_1 as Teams").union(df.selectExpr("Team_2 as Teams"))
matchesCount.groupBy("Teams").count().withColumnRenamed("count","MatchesPlayed").show()
+-----------+--------------+
| Teams|MatchesPlayed|
+-----------+--------------+
| Sri Lanka| 3|
| India| 2|
|West Indies| 1|
| Bangladesh| 4|
| Zimbabwe| 1|
|New Zealand| 1|
| Australia| 1|
| Pakistan| 1|
+-----------+--------------+
You can either use union with select statements, or use array from org.apache.spark.sql.functions:
// METHOD 1
df.select("Team_1").union(df.select("Team_2")).groupBy("Team_1").agg(count("Team_1")).show()
// METHOD 2
df.select(array($"Team_1", $"Team_2").as("Team")).select("Team").withColumn("Team",explode($"Team")).groupBy("Team").agg(count("Team")).show()
Using a select statement and union:
+-----------+-------------+
| Team_1|count(Team_1)|
+-----------+-------------+
| Sri Lanka| 3|
| India| 2|
|West Indies| 1|
| Bangladesh| 4|
| Zimbabwe| 1|
|New Zealand| 1|
| Australia| 1|
| Pakistan| 1|
+-----------+-------------+
Time Elapsed : 1588835600
Using array:
+-----------+-----------+
| Team|count(Team)|
+-----------+-----------+
| Sri Lanka| 3|
| India| 2|
|West Indies| 1|
| Bangladesh| 4|
| Zimbabwe| 1|
|New Zealand| 1|
| Australia| 1|
| Pakistan| 1|
+-----------+-----------+
Time Elapsed : 342103600
Performance-wise, using org.apache.spark.sql.functions.array is better in this test.
I have two dataframes, df and df2, as below:
+------+---+----+
| name|age|city|
+------+---+----+
| John| 25| LA|
| Jane| 26| LA|
|Joseph| 28| SA|
+------+---+----+
+---+----+------+
|age|city|salary|
+---+----+------+
| 25| LA| 40000|
| 26| | 50000|
| | SF| 60000|
+---+----+------+
I want my result dataframe to look as below:
+------+---+----+------+
| name|age|city|salary|
+------+---+----+------+
| John| 25| LA| 40000|
| Jane| 26| LA| 50000|
|Joseph| 28| SF| 60000|
+------+---+----+------+
Basically, I need to join using age and city as the join columns, but if either of these columns is empty in df2 then I need to join only on the other, non-null column. The solution I am looking for should still work when there are around 5 join columns: for each row, only the non-null columns should participate in the join.
You could add more conditions when you join those dataframes; a select (and possibly a groupBy) would still be needed afterwards to shape the result. Here df2's columns are assumed to have been renamed to age2, city2 and salary2 so they can be told apart after the join:
df1.join(df2,
($"age" === $"age2" || $"age2".isNull) &&
($"city" === $"city2" || $"city2".isNull), "left")
.show
The result will be:
+------+---+----+----+-----+-------+
| name|age|city|age2|city2|salary2|
+------+---+----+----+-----+-------+
| John| 25| LA| 25| LA| 40000|
| Jane| 26| LA| 26| null| 50000|
|Joseph| 28| SF|null| SF| 60000|
+------+---+----+----+-----+-------+
But when you have more columns or the second dataframe has more null values, the result will be more complex.
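If the join columns are only known at runtime (the question mentions around 5 of them), one way to generalize the same idea is to fold the per-column conditions together. A minimal sketch, assuming as above that df2's columns carry a 2 suffix (age2, city2, salary2) and that joinCols holds the shared column names; both the renaming and the names below are only for illustration:
import org.apache.spark.sql.functions.col

val joinCols = Seq("age", "city")                                   // assumed runtime input
val joinCond = joinCols
  .map(c => col(c) === col(s"${c}2") || col(s"${c}2").isNull)       // match, or ignore df2's column when it is null
  .reduce(_ && _)

df1.join(df2, joinCond, "left")
  .select("name", "age", "city", "salary2")
  .show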
df1.join(df2, df1.col("age") === df2.col("age") || df1.col("city") === df2.col("city"))
  .select(df1.col("name"), df1.col("age"), df1.col("city"), df2.col("salary"))
  .show
+----+---+----+------+
|name|age|city|salary|
+----+---+----+------+
|john| 25| LA| 40000|
|Jane| 26| LA| 40000|
|Jane| 26| LA| 50000|
+----+---+----+------+
I'm comparing 2 dataframes.
I chose to compare them column by column.
I created 2 smaller dataframes from the parent dataframes, based on the join columns and the comparison columns.
Created 1st dataframe:
val df1_subset = df1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 66|
| blake | lively| 66|
| eva| green| 44|
| brad| pitt| 99|
| jason| momoa| 34|
| george | clooney| 67|
| ed| sheeran| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| null| null| |
+----------+---------+-------------+
Created 2nd Dataframe:
val df1_1_subset = df1_1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 34|
| brad| pitt| 78|
| eva| green| 56|
| tom | cruise| 99|
| jason| momoa| 34|
| george | clooney| 67|
| george | clooney| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| kyle| jenner| 56|
| celena| gomez| 2|
+----------+---------+-------------+
Then I joined the 2 subsets with a full outer join to get the following:
val df_subset_joined = df1_subset.join(df1_1_subset, joinColsArray, "full_outer")
Joined Subset
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| 45|
+----------+---------+-------------+-------------+
Then I tried to filter out the rows whose values are the same in both comparison columns (loyalty_score in this example) by using the column positions:
df_subset_joined.filter(_c2 != _c3).show
But that didn't work. I'm getting the following error:
Error:(174, 33) not found: value _c2
df_subset_joined.filter(_c2 != _c3).show
What is the most efficient way for me to get a joined dataframe where I only see the rows that do not match in the comparison columns?
I would like to keep this dynamic, so hard-coding column names is not an option.
Thank you for helping me understand this.
You need to work with aliases and make use of the null-safe equality operator <=>, which treats two nulls as equal (https://spark.apache.org/docs/latest/api/sql/index.html#_9); see also https://stackoverflow.com/a/54067477/1138523
val df_subset_joined = df1_subset.as("a").join(df1_1_subset.as("b"), joinColsArray, "full_outer")
df_subset_joined.filter(!($"a.loyalty_score" <=> $"b.loyalty_score")).show
EDIT: for dynamic column names, you can use string interpolation
import org.apache.spark.sql.functions.col
val xxx : String = ???
df_subset_joined.filter(!(col(s"a.$xxx") <=> col(s"b.$xxx"))).show
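If several comparison columns need to be checked at once, the same pattern can be folded over a list of names. A rough sketch, where compareCols stands for the runtime list of comparison column names (the name is hypothetical):
import org.apache.spark.sql.functions.col

val compareCols: Seq[String] = Seq("loyalty_score")     // hypothetical runtime input
val anyMismatch = compareCols
  .map(c => !(col(s"a.$c") <=> col(s"b.$c")))           // null-safe "differs" check per column
  .reduce(_ || _)                                       // keep the row if any compared column differs

df_subset_joined.filter(anyMismatch).show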