I have a dataframe like the following:
+-----------+-----------+---------------+------+---------------------+
|best_col |A |B | C |<many more columns> |
+-----------+-----------+---------------+------+---------------------+
| A | 14 | 26 | 32 | ... |
| C | 13 | 17 | 96 | ... |
| B | 23 | 19 | 42 | ... |
+-----------+-----------+---------------+------+---------------------+
I want to end up with a DataFrame like this:
+-----------+-----------+---------------+------+---------------------+----------+
|best_col |A |B | C |<many more columns> | result |
+-----------+-----------+---------------+------+---------------------+----------+
| A | 14 | 26 | 32 | ... | 14 |
| C | 13 | 17 | 96 | ... | 96 |
| B | 23 | 19 | 42 | ... | 19 |
+-----------+-----------+---------------+------+---------------------+----------+
Essentially, I want to add a result column that picks, for each row, the value from the column named in best_col. best_col only contains names of columns that are present in the DataFrame. Since I have dozens of columns, I want to avoid a long chain of when statements checking col("best_col") === "A" and so on. I tried col(col("best_col").toString()), but this didn't work. Is there an easy way to do this?
Using map_filter, introduced in Spark 3.0: build a map from each column's value to a flag marking whether that column is the one named in best_col, keep only the entry whose flag is true, and take its key.
val df = Seq(
  ("A", 14, 26, 32),
  ("C", 13, 17, 96),
  ("B", 23, 19, 42)
).toDF("best_col", "A", "B", "C")

df.withColumn("result",
    // map of <cell value> -> <does this column's name match best_col?>
    map(df.columns.tail.flatMap(c => Seq(col(c), col("best_col") === lit(c))): _*))
  .withColumn("result", map_filter(col("result"), (cellValue, isBest) => isBest)) // keep the matching entry
  .withColumn("result", map_keys(col("result"))(0))                               // its key is the wanted value
  .show()
+--------+---+---+---+------+
|best_col| A| B| C|result|
+--------+---+---+---+------+
| A| 14| 26| 32| 14|
| C| 13| 17| 96| 96|
| B| 23| 19| 42| 19|
+--------+---+---+---+------+
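For what it's worth, a simpler sketch (not part of the original answer) builds the map the other way around, from column name to cell value, and looks it up with element_at, available since Spark 2.4. Keying by names also sidesteps any trouble with two columns holding the same value in a row:
import org.apache.spark.sql.functions._

// Sketch: map each column *name* to its value, then look up the name stored in best_col.
df.withColumn(
    "result",
    element_at(
      map(df.columns.tail.flatMap(c => Seq(lit(c), col(c))): _*),
      col("best_col")))
  .show()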
I have a problem where I need to relate rows to each other. I have tried many things, partitioning, lag, groupBys, but I am now completely stuck.
Each row below an ID 26 row should relate to the MPAN of the most recent 26 row.
ID | MPAN | Value
---------------------
26 | 12345678 | Hello
27 | 99900234 | Bye
30 | 77563820 | Help
33 | 89898937 | Stuck
26 | 54877273 | Need a genius
29 | 54645643 | So close
30 | 22222222 | Thanks
e.g.
ID | MPAN | Value | Relation
----------------------------------------
26 | 12345678 | Hello | NULL
27 | 99900234 | Bye | 12345678
30 | 77563820 | Help | 12345678
33 | 89898937 | Stuck | 12345678
26 | 54877273 | Genius | NULL
29 | 54645643 | So close | 54877273
30 | 22222222 | Thanks | 54877273
The code below only picks up the immediately previous row; it does not lag back to the most recent ID 26 record:
df = spark.read.load('abfss://Files/', format='parquet')
df = df.withColumn("identity", F.monotonically_increasing_id())
win = Window.orderBy("identity")
condition = F.col("Prop_0") != '026'
df = df.withColumn("FlagY", F.when(condition, mpanlookup))
df.show()
As I said in my comment, you need a column that maintains the order. In your example you used monotonically_increasing_id to create that "ordering" column, but that is unreliable because, as the documentation notes:
The function is non-deterministic because its result depends on partition IDs.
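If the source really has no ordering column, one possible workaround, shown only as a sketch, is to pin an index onto the rows with zipWithIndex. Note that this merely freezes whatever order the rows happen to be read in, which Spark does not guarantee to be meaningful for an arbitrary parquet read:
from pyspark.sql import types as T

# Sketch only: derive a gap-free "idx" column from the current read order.
schema_with_idx = T.StructType(df.schema.fields + [T.StructField("idx", T.LongType())])
indexed = spark.createDataFrame(
    df.rdd.zipWithIndex().map(lambda pair: (*pair[0], pair[1])),  # append the index to each row
    schema_with_idx,
)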
Assuming you have a proper "ordering" column:
df.show()
+---+---+--------+-------------+
|idx| ID| MPAN| Value|
+---+---+--------+-------------+
| 1| 26|12345678|Hello |
| 2| 27|99900234|Bye |
| 3| 30|77563820|Help |
| 4| 33|89898937|Stuck |
| 5| 26|54877273|Need a genius|
| 6| 29|54645643|So close |
| 7| 30|22222222|Thanks |
+---+---+--------+-------------+
you can simply do it with the last function, ignoring nulls:
from pyspark.sql import functions as F, Window
df.withColumn(
"Relation",
F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
Window.orderBy("idx")
),
).show()
+---+---+--------+-------------+--------+
|idx| ID| MPAN| Value|Relation|
+---+---+--------+-------------+--------+
| 1| 26|12345678|Hello |12345678|
| 2| 27|99900234|Bye |12345678|
| 3| 30|77563820|Help |12345678|
| 4| 33|89898937|Stuck |12345678|
| 5| 26|54877273|Need a genius|54877273|
| 6| 29|54645643|So close |54877273|
| 7| 30|22222222|Thanks |54877273|
+---+---+--------+-------------+--------+
I have a dataset that has a group, ID and target column. I am attempting to eliminate null target values by the Group column, ignoring the ID column. I'd like to do this in PySpark.
| Group | ID | Target |
| ----- | --- | -------- |
| A | B | 10 |
| A | B | 10 |
| A | B | 10 |
| A | C | null |
| A | C | null |
| A | C | null |
| B | D | null |
| B | D | null |
| B | D | null |
This is the resulting dataset I'm looking for:
| Group | ID | Target |
| ----- | --- | -------- |
| A | B | 10 |
| A | B | 10 |
| A | B | 10 |
| B | D | null |
| B | D | null |
| B | D | null |
In other words, if a group already has a non-null target value, I don't need that group's rows with a null target, regardless of their ID. However, a group whose targets are all null must be kept as-is; those rows cannot be dropped.
You can compute max(Target) per group over a window and assign it to every row of that group. Then keep a row either when that group maximum is null (the group has no non-null target at all) or when both the maximum and the row's own Target are not null.
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [("A", "B", 10),
("A", "B", 10),
("A", "B", 10),
("A", "C", None),
("A", "C", None),
("A", "C", None),
("B", "D", None),
("B", "D", None),
("B", "D", None),]
df = spark.createDataFrame(data, ("Group", "ID", "Target",))
window_spec = Window.partitionBy("Group")
df.withColumn("max_target", F.max("Target").over(window_spec))\
.filter((F.col("max_target").isNull()) |
(F.col("Target").isNotNull() & F.col("max_target").isNotNull()))\
.drop("max_target")\
.show()
Output
+-----+---+------+
|Group| ID|Target|
+-----+---+------+
| A| B| 10|
| A| B| 10|
| A| B| 10|
| B| D| null|
| B| D| null|
| B| D| null|
+-----+---+------+
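As a side note, the second isNotNull check is redundant: whenever Target is not null, max_target cannot be null either. So an equivalent, slightly shorter filter (a sketch reusing df and window_spec from above) would be:
df.withColumn("max_target", F.max("Target").over(window_spec))\
    .filter(F.col("max_target").isNull() | F.col("Target").isNotNull())\
    .drop("max_target")\
    .show()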
I have a dataframe like:
+--------+------+-------+
|  date  |  ID  | count |
+--------+------+-------+
|20170101| 258 | 1003 |
|20170102| 258 | 13 |
|20170103| 258 | 1 |
|20170104| 258 | 108 |
|20170109| 258 | 25 |
| ... | ... | ... |
|20170101| 2813 | 503 |
|20170102| 2813 | 139 |
| ... | ... | ... |
|20170101| 4963 | 821 |
|20170102| 4963 | 450 |
| ... | ... | ... |
+--------+------+-------+
Some data is missing from my dataframe. For example, the dates 20170105 ~ 20170108 for ID 258 do not appear; a missing date simply means the ID did not show up that day (i.e. count == 0). I'd like to add those zero-count rows as well, like this:
+--------+------+-------+
|  date  |  ID  | count |
+--------+------+-------+
|20170101| 258 | 1003 |
|20170102| 258 | 13 |
|20170103| 258 | 1 |
|20170104| 258 | 108 |
|20170105| 258 | 0 |
|20170106| 258 | 0 |
|20170107| 258 | 0 |
|20170108| 258 | 0 |
|20170109| 258 | 25 |
| ... | ... | ... |
|20170101| 2813 | 503 |
|20170102| 2813 | 139 |
| ... | ... | ... |
|20170101| 4963 | 821 |
|20170102| 4963 | 450 |
| ... | ... | ... |
+--------+------+-------+
A dataframe is immutable, so if I want to add the zero-count data I have to build a new dataframe. But even though I have the date range (20170101 ~ 20171231) and the list of IDs, I cannot just run a for loop over the dataframe.
How can I build the new dataframe?
P.S. What I already tried: build a "complete" dataframe, compare the two dataframes to produce another dataframe holding only the zero-count rows, and finally union the "original dataframe" with the "zero-count dataframe". This feels like a long and inefficient process; please recommend a more efficient solution.
from pyspark.sql.functions import unix_timestamp, from_unixtime, struct, datediff, lead, col, explode, lit, udf
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, DateType
from datetime import timedelta
#sample data
df = sc.parallelize([
['20170101', 258, 1003],
['20170102', 258, 13],
['20170103', 258, 1],
['20170104', 258, 108],
['20170109', 258, 25],
['20170101', 2813, 503],
['20170102', 2813, 139],
['20170101', 4963, 821],
['20170102', 4963, 450]]).\
toDF(('date', 'ID', 'count')).\
withColumn("date", from_unixtime(unix_timestamp('date', 'yyyyMMdd')).cast('date'))
df.show()
def date_list_fn(d):
return [d[0] + timedelta(days=x) for x in range(1, d[1])]
date_list_udf = udf(date_list_fn, ArrayType(DateType()))
w = Window.partitionBy('ID').orderBy('date')
#dataframe having missing date
df_missing = df.withColumn("diff", datediff(lead('date').over(w), 'date')).\
filter(col("diff") > 1).\
withColumn("date_list", date_list_udf(struct("date", "diff"))).\
withColumn("date_list", explode(col("date_list"))).\
select(col("date_list").alias("date"), "ID", lit(0).alias("count"))
#final dataframe by combining sample data with missing date dataframe
final_df = df.union(df_missing).sort(col("ID"), col("date"))
final_df.show()
Sample data:
+----------+----+-----+
| date| ID|count|
+----------+----+-----+
|2017-01-01| 258| 1003|
|2017-01-02| 258| 13|
|2017-01-03| 258| 1|
|2017-01-04| 258| 108|
|2017-01-09| 258| 25|
|2017-01-01|2813| 503|
|2017-01-02|2813| 139|
|2017-01-01|4963| 821|
|2017-01-02|4963| 450|
+----------+----+-----+
Output is:
+----------+----+-----+
| date| ID|count|
+----------+----+-----+
|2017-01-01| 258| 1003|
|2017-01-02| 258| 13|
|2017-01-03| 258| 1|
|2017-01-04| 258| 108|
|2017-01-05| 258| 0|
|2017-01-06| 258| 0|
|2017-01-07| 258| 0|
|2017-01-08| 258| 0|
|2017-01-09| 258| 25|
|2017-01-01|2813| 503|
|2017-01-02|2813| 139|
|2017-01-01|4963| 821|
|2017-01-02|4963| 450|
+----------+----+-----+
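For what it's worth, on Spark 2.4+ the gaps can also be filled without a Python UDF by generating each ID's date range with sequence, exploding it, and left-joining the original counts back in. This is only a sketch, assuming the same df as above (with date already cast to DateType); like the answer above, it only fills gaps inside each ID's observed date range:
from pyspark.sql import functions as F

# min/max date per ID, then one row per day in that range
ranges = df.groupBy("ID").agg(F.min("date").alias("start"), F.max("date").alias("end"))
all_days = ranges.select(
    "ID",
    F.explode(F.sequence("start", "end", F.expr("interval 1 day"))).alias("date"))

# bring the known counts back and fill the missing days with 0
filled = (all_days.join(df, ["ID", "date"], "left")
          .withColumn("count", F.coalesce("count", F.lit(0)))
          .sort("ID", "date"))
filled.show()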
I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to go through each store's dates in consecutive pairs and get the day difference between them. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference per store:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is first converted to yyyy-MM-dd format so that plain String ordering works within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function over a Window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp so that sorting works correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
then you can apply the lag function over the window and compute the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn(
  "Day_diff",
  when(lag("Date_d_id", 1).over(windowSpec).isNull, null)
    .otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Note that avg already ignores the null Day_diff on each store's first row. If you instead want those rows included in the denominator (effectively treating the missing first difference as zero), divide the sum by the total row count:
laggeddf.groupBy("Store_id")
  .agg((sum("Day_diff") / count("*")).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful
I am new to Spark. I have a dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
When subtracting the two columns, a null in either column makes the resulting column null as well.
df.withColumn("Sub", col("Column1") - col("Column2"))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace Column2's nulls with 0 in the data itself; it should stay null. Can someone help me with this?
You can use the when function to substitute 0 for the nulls only inside the subtraction:
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull(), lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull(), lit(0)).otherwise(col("Column2")))
you should have final result as
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then take the absolute difference (matching the expected output above):
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(4), None),
  (Some(5), None),
  (Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+
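If you would rather keep the sign of the difference (so 1 - 2 gives -1 instead of 1), the same idea works without abs; a minimal variation on the snippet above:
df.withColumn("Sub", coalesce($"A", lit(0)) - coalesce($"B", lit(0))).show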