I have a pyspark Dataframe:
Dataframe example:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | ["""] | ["67"]
--------------------------------------------
2 | ["""] | ["78"] | ["90"]
--------------------------------------------
3 | ["""] | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
I want to convert all the ["""] values in the columns column_1, column_2 and column_3 to null. The type of these 3 columns is array.
Expected result:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | null | ["67"]
--------------------------------------------
2 | null | ["78"] | ["90"]
--------------------------------------------
3 | null | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
I tried the solution below:
df = df.withColumn(
    "column_1",
    F.when((F.size(F.col("column_1")) == ""),
           F.lit(None)).otherwise(F.col("column_1"))
).withColumn(
    "column_2",
    F.when((F.size(F.col("column_2")) == ""),
           F.lit(None)).otherwise(F.col("column_2"))
).withColumn(
    "column_3",
    F.when((F.size(F.col("column_3")) == ""),
           F.lit(None)).otherwise(F.col("column_3"))
)
But it converts everything to null.
How can I test for an array that contains an empty string, i.e. [""], not []?
Thank you
You can test with a when() and replace the values:
df.withColumn(
    "column_1",
    F.when(
        F.col("column_1") != F.array(F.lit('"')),  # or '"""' ?
        F.col("column_1")
    )  # no otherwise(): rows holding the placeholder array become null
)
Do that for each of your columns.
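For completeness, a minimal sketch that applies the same when() to all three columns in a loop (assuming pyspark.sql.functions is imported as F, and that the placeholder really is an array holding a single '"' string; adjust the literal if your data stores something else):

from pyspark.sql import functions as F

for c in ["column_1", "column_2", "column_3"]:
    df = df.withColumn(
        c,
        # keep the value when it is not the placeholder array; otherwise it becomes null
        F.when(F.col(c) != F.array(F.lit('"')), F.col(c))
    )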
I have a dataset like this below:
+------+------+-------+------------+------------+-----------------+
| key1 | key2 | price | date_start | date_end | sequence_number |
+------+------+-------+------------+------------+-----------------+
| a | b | 10 | 2022-01-03 | 2022-01-05 | 1 |
| a | b | 10 | 2022-01-02 | 2050-05-15 | 2 |
| a | b | 10 | 2022-02-02 | 2022-05-10 | 3 |
| a | b | 20 | 2024-02-01 | 2050-10-10 | 4 |
| a | b | 20 | 2024-04-01 | 2025-09-10 | 5 |
| a | b | 10 | 2022-04-02 | 2024-09-10 | 6 |
| a | b | 20 | 2024-09-11 | 2050-10-10 | 7 |
+------+------+-------+------------+------------+-----------------+
What I want to achieve is this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
The sequence number is the order in which the data rows were received.
The resultant dataset should fix the overlapping dates for each price, but also handle the fact that when a new price arrives for the same key columns, the older record's date_end is updated to the new record's date_start - 1.
After the first three sequence numbers, the output looked like this:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2050-05-15 |
+------+------+-------+------------+------------+
This covers the max range for the price.
After the 4th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-01-31 |
| a | b | 20 | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
After the 5th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-01-31 |
| a | b | 20 | 2024-02-01 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes, as the date range overlaps the existing one.
After the 6th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
The date_start and date_end both are updated
After the 7th sequence number:
+------+------+-------+------------+------------+
| key1 | key2 | price | date_start | date_end |
+------+------+-------+------------+------------+
| a | b | 10 | 2022-01-02 | 2024-09-10 |
| a | b | 20 | 2024-09-11 | 2050-10-10 |
+------+------+-------+------------+------------+
No changes.
So here's an answer. First I expand the dates into rows. Then I use a group by with struct and collect_list to capture the price and sequence_number together. I take the first item of the (sorted, then reversed) array returned from collect_list, which effectively gives me the max sequence number, and pull the price back out of it. From there I group the dates by min/max.
from pyspark.sql.functions import (
    min, max, explode, col, expr, reverse, array_sort, collect_list, struct
)

data = [
    ("a", "b", 10, "2022-01-03", "2022-01-05", 1),
    ("a", "b", 10, "2022-01-02", "2050-05-15", 2),
    ("a", "b", 10, "2022-02-02", "2022-05-10", 3),
    ("a", "b", 20, "2024-02-01", "2050-10-10", 4),
    ("a", "b", 20, "2024-04-01", "2025-09-10", 5),
    ("a", "b", 10, "2022-04-02", "2024-09-10", 6),
    ("a", "b", 20, "2024-09-11", "2050-10-10", 7),
]
columns = ["key1", "key2", "price", "date_start", "date_end", "sequence_number"]
df = spark.createDataFrame(data).toDF(*columns)

# create all the dates as rows (the exploded column is named "col")
df_sequence = df.select(
    "*",
    explode(expr("sequence(to_date(date_start), to_date(date_end), interval 1 day)"))
)

df_date_range = df_sequence.groupby(
    col("key1"),
    col("key2"),
    col("col")
).agg(                         # here's the magic
    reverse(                   # descending sort, effectively on sequence number
        array_sort(            # ascending sort of the array by first struct field, then second
            collect_list(      # collect all elements of the group
                struct(        # get all the data we need
                    col("sequence_number").alias("sequence"),
                    col("price").alias("price")
                )
            )
        )
    )[0].alias("mystruct")     # pull the first element to get the "max"
).select(
    col("key1"),
    col("key2"),
    col("col"),
    col("mystruct.price")      # pull out price from the struct
)

# without comments:
# df_date_range = df_sequence.groupby(col("key1"), col("key2"), col("col")).agg(reverse(array_sort(collect_list(struct(col("sequence_number").alias("sequence"), col("price").alias("price")))))[0].alias("mystruct")).select(col("key1"), col("key2"), col("col"), col("mystruct.price"))

df_date_range.groupby(col("key1"), col("key2"), col("price")).agg(
    min("col").alias("date_start"),
    max("col").alias("date_end")
).show()
+----+----+-----+----------+----------+
|key1|key2|price|date_start|  date_end|
+----+----+-----+----------+----------+
|   a|   b|   10|2022-01-02|2024-09-10|
|   a|   b|   20|2024-09-11|2050-10-10|
+----+----+-----+----------+----------+
This does assume you will only ever use a price once before changing it. If you needed to go back to a previous price, you'd have to use window logic to identify the different price ranges and add that to your collect_list as an extra factor to sort on.
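For what it's worth, here's a rough sketch of that extra logic (my own variant: instead of adding a factor inside the collect_list, it assigns a run id after the per-day price has been chosen, so a price that comes back later forms its own range):

from pyspark.sql import functions as F
from pyspark.sql import Window

# df_date_range comes from the snippet above: one chosen price per key1/key2/day ("col")
w = Window.partitionBy("key1", "key2").orderBy("col")

df_runs = (
    df_date_range
    # flag the first row of each key and every day whose price differs from the previous day
    .withColumn(
        "price_changed",
        F.when(F.lag("price", 1).over(w).isNull(), 1)
         .when(F.col("price") != F.lag("price", 1).over(w), 1)
         .otherwise(0)
    )
    # running sum of the flags = run id; the same price in disjoint periods gets different ids
    .withColumn(
        "run_id",
        F.sum("price_changed").over(w.rowsBetween(Window.unboundedPreceding, 0))
    )
)

df_runs.groupby("key1", "key2", "price", "run_id").agg(
    F.min("col").alias("date_start"),
    F.max("col").alias("date_end")
).show()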
Basically I have a table called cities which looks like this:
+------+-----------+---------+----------+----------------+
| id | name | lat | lng | submitted_by |
|------+-----------+---------+----------+----------------|
| 1 | Pyongyang | 39.0392 | 125.7625 | 15 |
| 2 | Oslo | 59.9139 | 10.7522 | 8 |
| 3 | Hebron | 31.5326 | 35.0998 | 8 |
| 4 | Hebron | 31.5326 | 35.0998 | 10 |
| 5 | Paris | 48.8566 | 2.3522 | 12 |
| 6 | Hebron | 31.5326 | 35.0998 | 7 |
+------+-----------+---------+----------+----------------+
Desired result:
+-----------+---------+
| name | count |
|-----------+---------|
| Hebron | 3 |
| Pyongyang | 1 |
| Oslo | 1 |
| Paris | 1 |
| Total | 6 | <-- The tricky part
+-----------+---------+
In other words, what I need to do is SELECT the SUM of the COUNT in the query I'm currently using:
SELECT name, count(name)::int FROM cities GROUP BY name;
But apparently nested aggregate functions are not allowed in PostgreSQL. I'm guessing I need to use ROLLUP in some way, but I can't seem to get it right.
Thanks for the help.
You need to UNION ALL the total sum.
ROLLUP works by summing up the total for every group separately and can't be used here.
CREATE TABLE cities (
"id" INTEGER,
"name" VARCHAR(9),
"lat" FLOAT,
"lng" FLOAT,
"submitted_by" INTEGER
);
INSERT INTO cities
("id", "name", "lat", "lng", "submitted_by")
VALUES
('1', 'Pyongyang', '39.0392', '125.7625', '15'),
('2', 'Oslo', '59.9139', '10.7522', '8'),
('3', 'Hebron', '31.5326', '35.0998', '8'),
('4', 'Hebron', '31.5326', '35.0998', '10'),
('5', 'Paris', '48.8566', '2.3522', '12'),
('6', 'Hebron', '31.5326', '35.0998', '7');
SELECT name, COUNT(name)::int FROM cities GROUP BY name
UNION ALL
SELECT 'Total', COUNT(*) FROM cities
name | count
:-------- | ----:
Hebron | 3
Pyongyang | 1
Oslo | 1
Paris | 1
Total | 6
I have a PySpark DataFrame:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | null | ["67"]
--------------------------------------------
2 | null | ["78"] | ["90"]
--------------------------------------------
3 | ["""] | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
And I need to convert all null values in column_1 to an empty array []:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | null | ["67"]
--------------------------------------------
2 | [] | ["78"] | ["90"]
--------------------------------------------
3 | ["""] | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
I used this code, but it's not working for me.
df.withColumn("column_1", coalesce(column_1, array().cast("array<string>")))
Appreciate your help!
The code works just fine for me, except that you need to wrap column_1 in quotes: "column_1". Plus, you don't need the cast; array() alone is enough.
df.withColumn("column_1", coalesce('column_1', array()))
Use fillna() with subset.
Reference https://stackoverflow.com/a/45070181
How can I check the dates from the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.
I have the following data after sorting on key, dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-08 |
| 10 | BAC | 2018-01-03 | 2018-01-15 |
| 10 | CAS | 2018-01-03 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-03 |
| 20 | DAS | 2018-01-01 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the dates of these rows form a range (i.e. the current row's begin_dt falls between the begin and end dates of the previous row), I need all such rows to get the lowest begin date and the highest end date.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-21 |
| 10 | BAC | 2018-01-01 | 2018-01-21 |
| 10 | CAS | 2018-01-01 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-12 |
| 20 | DAS | 2017-11-12 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
Appreciate any ideas to achieve this. Thanks in advance!
Here's one approach:
1. Create a new column group_id: null if begin_dt is within the date range of the previous row, otherwise a unique integer
2. Backfill the nulls in group_id with the last non-null value
3. Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(10, "ABC", "2018-01-01", "2018-01-08"),
(10, "BAC", "2018-01-03", "2018-01-15"),
(10, "CAS", "2018-01-03", "2018-01-21"),
(20, "AAA", "2017-11-12", "2018-01-03"),
(20, "DAS", "2018-01-01", "2018-01-12"),
(20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")
val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")
df.
withColumn("group_id", when(
$"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
).otherwise(monotonically_increasing_id)
).
withColumn("group_id", last($"group_id", ignoreNulls=true).
over(win1.rowsBetween(Window.unboundedPreceding, 0))
).
withColumn("begin_dt2", min($"begin_dt").over(win2)).
withColumn("end_dt2", max($"end_dt").over(win2)).
orderBy("key", "begin_dt", "end_dt").
show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code| begin_dt| end_dt| group_id| begin_dt2| end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+
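If you're working in PySpark rather than Scala, a rough equivalent of the same three steps would look something like this (my translation of the snippet above; it overwrites begin_dt/end_dt directly instead of adding begin_dt2/end_dt2):

from pyspark.sql import functions as F
from pyspark.sql import Window

win1 = Window.partitionBy("key").orderBy("begin_dt", "end_dt")
win2 = Window.partitionBy("key", "group_id")

final_Df = (
    source_Df
    # step 1: null group_id when the row overlaps the previous one, otherwise a unique id
    .withColumn(
        "group_id",
        F.when(
            F.col("begin_dt").between(
                F.lag("begin_dt", 1).over(win1),
                F.lag("end_dt", 1).over(win1)
            ),
            F.lit(None)
        ).otherwise(F.monotonically_increasing_id())
    )
    # step 2: backfill nulls with the last non-null group_id
    .withColumn(
        "group_id",
        F.last("group_id", ignorenulls=True).over(
            win1.rowsBetween(Window.unboundedPreceding, 0)
        )
    )
    # step 3: min/max within each (key, group_id) partition
    .withColumn("begin_dt", F.min("begin_dt").over(win2))
    .withColumn("end_dt", F.max("end_dt").over(win2))
    .drop("group_id")
)
final_Df.orderBy("key", "begin_dt", "end_dt").show()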
I have a DataFrame like the one below:
+--------------------+-------------+----------+--------------------+------------------+
| cond | val | val1 | val2 | val3 |
+--------------------+-------------+----------+--------------------+------------------+
|cond1 | 1 | null | null | null |
|cond1 | null | 2 | null | null |
|cond1 | null | null | 3 | null |
|cond1 | null | null | null | 4 |
|cond2 | null | null | null | 44 |
|cond2 | null | 22 | null | null |
|cond2 | null | null | 33 | null |
|cond2 | 11 | null | null | null |
|cond3 | null | null | null | 444 |
|cond3 | 111 | 222 | null | null |
|cond3 | 1111 | null | null | null |
|cond3 | null | null | 333 | null |
I want to reduce the rows based on the value of the cond column, so the result looks like below:
+--------------------+-------------+----------+--------------------+------------------+
| cond | val | val1 | val2 | val3 |
+--------------------+-------------+----------+--------------------+------------------+
|cond1 | 1 | 2 | 3 | 4 |
|cond2 | 11 | 22 | 33 | 44 |
|cond3 | 111,1111 | 222 | 333 | 444 |
Try using .groupBy() and .agg(), e.g.:
import org.apache.spark.sql.functions.collect_list

// collect_list skips nulls, so each group keeps only the non-null values
val output = input.groupBy("cond")
  .agg(
    collect_list("val").name("val"),
    collect_list("val1").name("val1"),
    collect_list("val2").name("val2"),
    collect_list("val3").name("val3")
  )