I have a dataframe like this:
order_type | customer_id | order_id | related_order_id
-------------------------------------------------------
purchase   | 123         | abc      | null
return     | 123         | bcd      | null
purchase   | 234         | xyz      | null
return     | 234         | zzz      | null
I want to fill in the related_order_id column with the order_id of the related purchase, but only for rows where order_type is return. A return and a purchase row are related by their customer_id.
I've tried to use withColumn(), but I haven't figured out a way that would allow me to also look at other rows and their column data.
The end result should look something like
order_type | customer_id | order_id | related_order_id
-------------------------------------------------------
purchase   | 123         | abc      | null
return     | 123         | bcd      | abc
purchase   | 234         | xyz      | null
return     | 234         | zzz      | xyz
You can use the lag() function to use data from the previous row.
Assuming a return is always preceded by a purchase, you can do:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

# within each customer, "purchase" sorts before "return"
w = Window.partitionBy("customer_id").orderBy("order_type")

df.withColumn(
    "related_order_id",
    F.when(col("order_type") == "return", F.lag(col("order_id")).over(w))
     .otherwise(col("related_order_id")),
).show()
Output:
+----------+-----------+--------+----------------+
|order_type|customer_id|order_id|related_order_id|
+----------+-----------+--------+----------------+
| purchase| 123| abc| null|
| return| 123| bcd| abc|
| purchase| 234| xyz| null|
| return| 234| zzz| xyz|
+----------+-----------+--------+----------------+
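For reference, a minimal sketch that builds the example input dataframe above (assuming an active SparkSession named spark and string columns throughout):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# explicit schema, since related_order_id is entirely null and its type cannot be inferred
schema = StructType([
    StructField("order_type", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("order_id", StringType(), True),
    StructField("related_order_id", StringType(), True),
])

df = spark.createDataFrame(
    [("purchase", "123", "abc", None),
     ("return", "123", "bcd", None),
     ("purchase", "234", "xyz", None),
     ("return", "234", "zzz", None)],
    schema,
)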
I have the following dataframe. There are several IDs which have either a numeric or a string value. If an ID has a string_value, like "B", the numeric_value is "NULL" as a string, and vice versa for IDs with a numeric value, e.g. ID "D".
ID string_value numeric_value timestamp
0 B On NULL 1632733508
1 B Off NULL 1632733508
2 A Inactive NULL 1632733511
3 A Active NULL 1632733512
4 D NULL 450 1632733513
5 D NULL 431 1632733515
6 C NULL 20 1632733518
7 C NULL 30 1632733521
Now I want to separate the dataframe into a new one for each ID, using an existing list containing all the unique IDs. Afterwards each new dataframe, like "B" in this example, should drop the column with the "NULL" values. So if B has a string_value, the numeric_value column should be dropped.
ID string_value timestamp
0 B On 1632733508
1 B Off 1632733508
After that, the column with the value should be renamed with the ID "B" and the ID column should be dropped.
B timestamp
0 On 1632733508
1 Off 1632733508
As described, the same procedure should be applied for the numeric values, in this case ID "D":
ID numeric_value timestamp
0 D 450 1632733513
1 D 431 1632733515
D timestamp
0 450 1632733513
1 431 1632733515
It is important to preserve the original data types within the value column.
Assuming your dataframe is called df and your list of IDs is ids, you can write a function which does what you need and call it for every id.
The function applies the required filter and then selects the needed columns, with the id as an alias.
from pyspark.sql import functions as f

ids = ["B", "A", "D", "C"]

def split_df(df, id):
    # keep only the rows for this id and collapse the two value columns into one
    return df.filter(f.col("ID") == id).select(
        f.coalesce(f.col("string_value"), f.col("numeric_value")).alias(id),
        f.col("timestamp"),
    )

dfs = [split_df(df, id) for id in ids]
for split in dfs:
    split.show()
Output:
+---+----------+
| B| timestamp|
+---+----------+
| On|1632733508|
|Off|1632733508|
+---+----------+
+--------+----------+
| A| timestamp|
+--------+----------+
|Inactive|1632733511|
| Active|1632733512|
+--------+----------+
+---+----------+
| D| timestamp|
+---+----------+
|450|1632733513|
|431|1632733515|
+---+----------+
+---+----------+
| C| timestamp|
+---+----------+
| 20|1632733518|
| 30|1632733521|
+---+----------+
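Note that coalesce resolves its arguments to a common type, so the numeric values for IDs like "D" come back as strings. If preserving the original data type matters, here is a hedged sketch that instead follows the steps described in the question (filter by ID, keep only the value column that is actually populated, rename it to the ID). split_df_keep_dtype is a hypothetical name and, like the answer above, it assumes the "NULL" entries are genuine nulls:

from pyspark.sql import functions as f

def split_df_keep_dtype(df, id):
    # keep only the rows for this id
    subset = df.filter(f.col("ID") == id)
    # decide which value column this id actually uses by counting non-null entries
    keep = "string_value" if subset.filter(f.col("string_value").isNotNull()).count() > 0 else "numeric_value"
    # rename the kept value column to the id and drop everything else
    return subset.select(f.col(keep).alias(id), f.col("timestamp"))

dfs = [split_df_keep_dtype(df, id) for id in ids]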
I am not sure if I am asking this correctly, and maybe that is the reason I haven't found the correct answer so far. Anyway, if it turns out to be a duplicate, I will delete this question.
I have the following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, get the max value from "last_updated", and for the "count" column keep the value from the row where "last_updated" has its max value. So in that case the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look something like this:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using Spark 2.4.0.
Thanks for any help.
You have two options; the first is the better one, as far as I understand.
OPTION 1
Perform a window function partitioned by the ID and create a column with the max value of last_updated over that window. Then keep the rows where last_updated equals the max value, and finally drop the old column and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping:
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated")
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
I have data like below and want to take the data for the same ID from one column and put each value into a different new column.
Actual:
ID Brandid
1 234
1 122
1 134
2 122
3 234
3 122
Expected:
ID BRANDID_1 BRANDID_2 BRANDID_3
1 234 122 134
2 122 - -
3 234 122 -
You can use pivot after a groupBy, but first you need to create a column with the future column names, using row_number to get a monotonically increasing number per ID over a Window. Here is one way:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# create the window over ID; an orderBy is required afterwards,
# so order by a constant F.lit(1) to keep the original order
w = Window.partitionBy('ID').orderBy(F.lit(1))
# create the column with the future column names to pivot on
pv_df = (df.withColumn('pv', F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')))
         # group by the ID and pivot on the created column
         .groupBy('ID').pivot('pv')
         # in aggregation you need a function, so we use first
         .agg(F.first('Brandid')))
and you get
pv_df.show()
+---+---------+---------+---------+
| ID|Brandid_1|Brandid_2|Brandid_3|
+---+---------+---------+---------+
| 1| 234| 122| 134|
| 3| 234| 122| null|
| 2| 122| null| null|
+---+---------+---------+---------+
EDIT: to get the columns in order as the OP requested, you can use lpad. First define the padded length you want:
nb_pad = 3
and in the method above, replace F.concat(F.lit('Brandid_'), F.row_number().over(w).cast('string')) with
F.concat(F.lit('Brandid_'), F.lpad(F.row_number().over(w).cast('string'), nb_pad, "0"))
and if you don't know how many zeros you need to pad to (here the overall length was 3), you can compute this value with
nb_pad = len(str(df.groupBy('ID').count().select(F.max('count')).collect()[0][0]))
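Putting the pieces of the EDIT together, a minimal sketch of the padded variant (reusing df, w, and F from above; pv_df_padded is just an illustrative name):

nb_pad = 3

pv_df_padded = (df.withColumn('pv', F.concat(F.lit('Brandid_'),
                                             F.lpad(F.row_number().over(w).cast('string'), nb_pad, '0')))
                # columns now come out as Brandid_001, Brandid_002, ... and therefore sort correctly
                .groupBy('ID').pivot('pv')
                .agg(F.first('Brandid')))
pv_df_padded.show()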
I'm using pyspark and hivecontext.sql and I want to filter out all null and empty values from my data.
So I used simple sql commands to first filter out the null values, but it doesn't work.
My code:
hiveContext.sql("select column1 from table where column2 is not null")
but it works without the expression "where column2 is not null".
Error:
Py4JJavaError: An error occurred while calling o577.showString
I think it is because my select is wrong.
Data example:
column 1 | column 2
null | 1
null | 2
1 | 3
2 | 4
null | 2
3 | 8
Objective:
column 1 | column 2
1 | 3
2 | 4
3 | 8
Thanks
We cannot pass the bare Hive table name directly to the Hive context's sql method, since it doesn't resolve the table that way. One way to read a Hive table is using the pyspark shell.
We need to register the data frame we get from reading the Hive table; then we can run the SQL query.
You have to use database_name.table and run the same query, and it will work. Please let me know if that helps.
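A minimal sketch of what that looks like in the pyspark shell (dbname.table_name is a placeholder for your database-qualified table; sc is the SparkContext the shell provides):

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)

# query the table using its database-qualified name and filter out the nulls
hiveContext.sql(
    "select column1 from dbname.table_name where column2 is not null"
).show()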
It works for me:
df.na.drop(subset=["column1"])
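For example, on the sample data above (assuming it is already loaded in a dataframe df), this keeps only the rows where column1 is not null, matching the Objective table:

# drop every row whose column1 is null
df.na.drop(subset=["column1"]).show()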
Have you entered the null values manually?
If yes, then Hive will treat those as normal strings.
I tried the following two use cases with a
dbname.person table in Hive:
name age
aaa null // this null was entered manually - case 1
Andy 30
Justin 19
okay NULL // this NULL appeared because the field was left blank - case 2
---------------------------------
hiveContext.sql("select * from dbname.person").show();
+------+----+
| name| age|
+------+----+
| aaa |null|
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
-----------------------------
case 2
hiveContext.sql("select * from dbname.person where age is not null").show();
+------+----+
| name|age |
+------+----+
| aaa |null|
| Andy| 30 |
|Justin| 19 |
+------+----+
------------------------------------
case 1
hiveContext.sql("select * from dbname.person where age!= 'null'").show();
+------+----+
| name| age|
+------+----+
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
------------------------------------
I hope the above use cases clear your doubts about filtering out null values.
And if you are querying a table registered in Spark, then use sqlContext.
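A quick sketch of that last point (df and the view name person are placeholders; sqlContext is the one provided by the pyspark shell):

# register the dataframe under a name so it can be queried through SQL
df.createOrReplaceTempView("person")  # on older Spark versions: df.registerTempTable("person")

sqlContext.sql("select * from person where age is not null").show()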
I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case it won't work. My initial idea was to use a udf, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use when/otherwise syntax to fill the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
val df2_date = df2.withColumn("date", to_date(df2("start_date_time")))
  .withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0))
  .drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or, as another option, first filter df2 and then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11