I have two tables.
table_1 consists of a, b (the primary key identifying the email sent), and s.
table_2 consists of a, b (the primary key identifying the email clicked), and c.
For each table_1.a, I want the count of table_2.c for each b of table_1 that is found in table_2. If a b from table_1 is not found in table_2, the count of c for that b should be 0. Then, within each table_1.a, I want to order the b values by s.
Example:
table_1 =
+----------+----+----------+
| a| b| s|
+----------+----+----------+
| 1433| 42a|2021-02-01|
+----------+----+----------+
| 1523| 41c|2021-02-23|
+----------+----+----------+
| 1523| 42c|2021-02-24|
+----------+----+----------+
| 1523| 38a|2021-01-03|
+----------+----+----------+
| 1421| 39a|2021-02-28|
+----------+----+----------+
table_2 =
+----------+----+----------+
| a| b| c|
+----------+----+----------+
| 1523| 41c|2021-02-24|
+----------+----+----------+
| 1523| 41c|2021-02-25|
+----------+----+----------+
| 1523| 42c|2021-02-27|
+----------+----+----------+
| 1421| 39a|2021-02-28|
+----------+----+----------+
I would like my final output to be:
+----+----+-----+
| a1| b| c|
+----+----+-----+
|1433| 42a| 0|
+----+----+-----+
|1523| 38a| 0|
+----+----+-----+
|1523| 41c| 2|
+----+----+-----+
|1523| 42c| 1|
+----+----+-----+
This is my code:
from pyspark.sql import Window
from pyspark.sql.functions import count, row_number

df = table_1.join(table_2, table_1.a == table_2.a, "left") \
    .select(table_1.a.alias('a1'), table_1.s.alias('s1'),
            table_1.b.alias('b1'), table_2.c.alias('c2')) \
    .filter(table_1.s.between('2021-01-31', '2021-03-01'))
df1 = df.groupBy(df['a1'], df['b1'], df['s1']) \
    .agg(count(df['c2']).alias('df_cd'))
w = Window.partitionBy(df1['a1'], df1['b1']) \
    .orderBy(df1['s1'])
df2 = df1.withColumn('row', row_number().over(w))
Join, conditionally attribute c, then group by and sum:
import pyspark.sql.functions as F

table_1 = (table_1.join(table_2, how='outer', on=['a', 'b'])
    .withColumn("c1", F.when(F.col("c").isNotNull(), 1).otherwise(0))
    .groupby('a', 'b').agg(F.sum(F.col('c1')).alias('c')))
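If you also need each b ordered by s within each a, a fuller sketch along the same lines (assuming the table_1/table_2 columns described above; count() skips nulls, so unmatched b values come out as 0):

counts = (table_1.join(table_2.select('a', 'b', 'c'), on=['a', 'b'], how='left')
    .groupBy('a', 'b', 's')
    .agg(F.count('c').alias('c'))   # nulls are not counted, so a b with no clicks gives 0
    .orderBy('a', 's')
    .select('a', 'b', 'c'))
counts.show()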
Related
I have two dataframes, and I want every name from the second one to appear under each id of the first one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
val df = df1.select("id").distinct().crossJoin(df2).join(
df1,
Seq("name", "id"),
"left"
).orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+
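If you are working in PySpark rather than Scala, the same idea translates roughly to this (a sketch; here df1 is assumed to be the id/name/rate DataFrame and df2 the single-column name DataFrame):

result = (df1.select("id").distinct()
    .crossJoin(df2)                            # every (id, name) combination
    .join(df1, on=["name", "id"], how="left")  # attach rate where it exists
    .orderBy("id", "name"))
result.show()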
I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to find which zone had the most rows, and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two consecutive window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
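For reference, a PySpark sketch of the same two-window approach (assuming df is the timestamp/zone DataFrame above):

from pyspark.sql import Window
import pyspark.sql.functions as F

w_cnt = Window.partitionBy("hour", "zone")
w_rnb = Window.partitionBy("hour").orderBy(F.col("cnt").desc())

result = (df.withColumn("hour", F.hour("timestamp"))
    .withColumn("cnt", F.count("*").over(w_cnt))      # rows per (hour, zone)
    .withColumn("rnb", F.row_number().over(w_rnb))    # rank zones within each hour
    .where(F.col("rnb") == 1)
    .select("hour", "zone", F.col("cnt").alias("max")))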
You can use window functions together with groupBy on DataFrames.
In your case you could use the rank() over (partition by) window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// first group by hour and zone
val df_group = data_tms.
select(hour(col("timestamp")).as("hour"), col("zone"))
.groupBy(col("hour"), col("zone"))
.agg(count("zone").as("max"))
// second rank by hour order by max in descending order
val df_rank = df_group.
select(col("hour"),
col("zone"),
col("max"),
rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))
// filter by col rank = 1
df_rank
.select(col("hour"),
col("zone"),
col("max"))
.where(col("rank") === 1)
.orderBy(col("hour"))
.show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/
I have a dataframe like below
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2999-12-31|
| b| 3|2011-12-14|2999-12-31|
| a| 4|2013-12-17|2999-12-31|
| b| 8|2011-12-19|2999-12-31|
| a| 6|2013-12-23|2999-12-31|
+----+----+----------+----------+
I need to group the records by colA, rank them by colC (the most recent date gets the highest rank), and then update the dates in colD by subtracting a day from the colC value of the adjacent (next-ranked) record.
The final dataframe should look like below:
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
+----+----+----------+----------+
You can get it using window functions:
scala> val df = Seq(("a",2,"2013-12-12","2999-12-31"),("b",3,"2011-12-14","2999-12-31"),("a",4,"2013-12-17","2999-12-31"),("b",8,"2011-12-19","2999-12-31"),("a",6,"2013-12-23","2999-12-31")).toDF("colA","colB","colC","colD")
df: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> val df2 = df.withColumn("colc",'colc.cast("date")).withColumn("cold",'cold.cast("date"))
df2: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> df2.createOrReplaceTempView("yash")
scala> spark.sql(""" select cola,colb,colc,cold, rank() over(partition by cola order by colc) c1, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold2 from yash """).show
+----+----+----------+----------+---+----------+
|cola|colb| colc| cold| c1| cold2|
+----+----+----------+----------+---+----------+
| b| 3|2011-12-14|2999-12-31| 1|2011-12-18|
| b| 8|2011-12-19|2999-12-31| 2|2999-12-31|
| a| 2|2013-12-12|2999-12-31| 1|2013-12-16|
| a| 4|2013-12-17|2999-12-31| 2|2013-12-22|
| a| 6|2013-12-23|2999-12-31| 3|2999-12-31|
+----+----+----------+----------+---+----------+
scala>
Removing the unnecessary columns
scala> spark.sql(""" select cola,colb,colc, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold from yash """).show
+----+----+----------+----------+
|cola|colb| colc| cold|
+----+----+----------+----------+
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
+----+----+----------+----------+
scala>
You can create a row_number over a window partitioned by colA and ordered by colC, then do a self join on the dataframe. The code should look like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val rnkDF = df.withColumn("rnk", row_number().over(Window.partitionBy("colA").orderBy($"colC".asc)))
  .withColumn("rnkminusone", $"rnk" - lit(1))

val joinDF = rnkDF.alias('A).join(rnkDF.alias('B), ($"A.colA" === $"B.colA").and($"A.rnk" === $"B.rnkminusone"), "left")
  .select($"A.colA".as("colA")
    , $"A.colB".as("colB")
    , $"A.colC".as("colC")
    , when($"B.colC".isNull, $"A.colD").otherwise(date_sub($"B.colC", 1)).as("colD"))
The results are below. I hope this helps.
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
+----+----+----------+----------+
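For anyone doing this in PySpark instead, the lead-based approach from the first answer above translates roughly to this (a sketch, assuming df has colA..colD with colC and colD already cast to dates):

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("colA").orderBy("colC")
result = df.withColumn(
    "colD",
    # next row's colC minus one day; keep the existing colD for the last row
    F.coalesce(F.date_sub(F.lead("colC").over(w), 1), F.col("colD"))
)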
I want to replace existing columns with new values based on a condition: if the value of another column (col_ff) is "abc", the column stays the same; otherwise it should become null or blank.
The logic works, but only for the last column the loop happens to process.
This is the query I'm using:
import pyspark.sql.functions as F

for i in df.columns:
    if i[4:] != 'ff':
        new_df = df.withColumn(i, F.when(df.col_ff == "abc", df[i])
                                   .otherwise(None))
df:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| a | b | c | def |
| b | c | b | abc |
| c | d | a | def |
+------+----+-----+-------+
required output:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| null |null|null | def |
| b | c | b | abc |
| null |null|null | def |
+------+----+-----+-------+
The problem in your code is that you're overwriting new_df with the original DataFrame df in each iteration of the loop. You can fix it by first setting new_df = df outside of the loop, and then performing the withColumn operations on new_df inside the loop.
For example, if df were the following:
df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#| a| b| c| def|
#| b| c| b| abc|
#| c| d| a| def|
#+----+----+----+------+
Change your code to:
import pyspark.sql.functions as F

new_df = df
for i in df.columns:
    if i[4:] != 'ff':
        new_df = new_df.withColumn(i, F.when(F.col("col_ff") == "abc", F.col(i)))
Notice here that I removed the .otherwise(None) part because when will return null by default if the condition is not met.
You could also do the same using functools.reduce:
from functools import reduce # for python3
new_df = reduce(
    lambda df, i: df.withColumn(i, F.when(F.col("col_ff") == "abc", F.col(i))),
    [i for i in df.columns if i[4:] != "ff"],
    df
)
In both cases the result is the same:
new_df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#|null|null|null| def|
#| b| c| b| abc|
#|null|null|null| def|
#+----+----+----+------+
I have the following data, which captures a user-supervisor relationship.
user |supervisor |id
-----|-----------|----
a | b | 1
b | c | 2
c | d | 3
e | b | 4
I want to explode the relationship hierarchy between the user and supervisor as below.
user |supervisor |id
-----|-----------|----
a | b | 1
a | c | 1
a | d | 1
b | c | 2
b | d | 2
c | d | 3
e | b | 4
e | c | 4
e | d | 4
As you can see, for user 'a' the immediate supervisor is 'b', but 'b' in turn has 'c' as its supervisor, so indirectly 'c' is also a supervisor of 'a', and so on. My aim is to explode the hierarchy to any level for a given user. What is the best way to implement this in Spark with Scala?
Here is an approach using DataFrames. I show one level of the hierarchy, but deeper levels can be handled by repeating the step below (see the iterative sketch after the output at the end):
val df = sc.parallelize(Array(("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("e", "b", 4))).toDF("user", "supervisor", "id")
scala> df.show
+----+----------+---+
|user|supervisor| id|
+----+----------+---+
| a| b| 1|
| b| c| 2|
| c| d| 3|
| e| b| 4|
+----+----------+---+
Let's enable cross joins:
spark.conf.set("spark.sql.crossJoin.enabled", true)
Then join the same table:
val dfjoin = df.as("df1").join(df.as("df2"), $"df1.supervisor" === $"df2.user", "left").select($"df1.user", $"df1.supervisor".as("s1"), $"df1.id", $"df2.supervisor".as("s2"))
I use a UDF to combine the two supervisor columns into an array:
import org.apache.spark.sql.functions.{explode, udf}
val combineUdf = udf((x: String, y: String) => Seq(x, y))
val dfcombined = dfjoin.withColumn("combined", combineUdf($"s1", $"s2")).select($"user", $"combined", $"id")
Then the last step is to flatten the array to separate rows and filter rows that did not join:
val dfexplode = dfcombined.withColumn("supervisor", explode($"combined")).select($"user", $"id", $"supervisor").filter($"supervisor".isNotNull)
The first level hierarchy looks like this:
scala> dfexplode.show
+----+---+----------+
|user| id|supervisor|
+----+---+----------+
| c| 3| d|
| b| 2| c|
| b| 2| d|
| a| 1| b|
| a| 1| c|
| e| 4| b|
| e| 4| c|
+----+---+----------+
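The answer above adds one extra level per join; if you need the hierarchy expanded to arbitrary depth, one option (not part of the answer, and assuming the hierarchy has no cycles) is to iterate the self-join until no new pairs appear. A rough PySpark sketch of that idea, where df is assumed to be a DataFrame with the user/supervisor/id columns from the question:

import pyspark.sql.functions as F

result = df.select("user", "supervisor", "id")   # direct relationships
frontier = result
while True:
    # my supervisor's supervisor is also my supervisor
    step = (frontier.alias("f")
        .join(df.alias("d"), F.col("f.supervisor") == F.col("d.user"))
        .select(F.col("f.user").alias("user"),
                F.col("d.supervisor").alias("supervisor"),
                F.col("f.id").alias("id")))
    new_rows = step.join(result, on=["user", "supervisor"], how="left_anti")
    if new_rows.count() == 0:
        break
    result = result.unionByName(new_rows)
    frontier = new_rows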