Logic with dates in PySpark

I am using PySpark and I have data like this in the dataframe,
and I want the output like this.
The logic goes like this: from table 1 above,
the first date of category B for id=1 is 08/06/2022 and the first date for category A is 13/06/2022. So any date on or after 13/06/2022 should have both categories A and B.
So for 08/06/2022 there is category B only, and for 13/06/2022 there are categories A and B. For 24/06/2022 there is just category A in table 1, but the output should have category B too, since category B first appears on 08/06/2022. Likewise, for 26/07/2022 there is just category B in table 1, but the output should have both category A and category B.
How do I achieve this in PySpark?

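The answer below assumes the usual imports and a data_ls input list, which are not shown in this excerpt; a minimal sketch consistent with the output further down (the exact tuples are an assumption, reconstructed from the description) would be:
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# hypothetical id=1 subset, inferred from the expected output below
data_ls = [
    (1, 'B', '2022-06-08'),
    (1, 'A', '2022-06-13'),
    (1, 'A', '2022-06-24'),
]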
# input dataframe creation
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'cat', 'dt']). \
    withColumn('dt', func.col('dt').cast('date'))

# required solution
data_sdf. \
    withColumn('min_dt', func.min('dt').over(wd.partitionBy('id'))). \
    withColumn('all_cats', func.collect_set('cat').over(wd.partitionBy('id'))). \
    withColumn('cat_arr',
               func.when(func.col('min_dt') == func.col('dt'), func.array(func.col('cat'))).
               otherwise(func.col('all_cats'))
               ). \
    drop('cat', 'min_dt', 'all_cats'). \
    dropDuplicates(). \
    withColumn('cat', func.explode('cat_arr')). \
    drop('cat_arr'). \
    orderBy('id', 'dt', 'cat'). \
    show()
# +---+----------+---+
# |id |dt |cat|
# +---+----------+---+
# |1 |2022-06-08|B |
# |1 |2022-06-13|A |
# |1 |2022-06-13|B |
# |1 |2022-06-24|A |
# |1 |2022-06-24|B |
# +---+----------+---+
I've used a subset of the posted data. The idea of the approach is that you create an array of distinct categories and apply that to all dates except the minimum date. The minimum date will only have that row's category (not all categories). The array can then be exploded to get the desired result for all dates.
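Note that the window approach above gives every non-minimum date all distinct categories. If a category should only appear on dates on or after its own first date (which is how the question words the rule), a hedged alternative sketch using a join (same data_sdf and id/cat/dt columns as above):
# each category's first date per id
first_dates = data_sdf.groupBy('id', 'cat').agg(func.min('dt').alias('first_dt'))

# pair every distinct (id, dt) with every category already seen by that date
data_sdf.select('id', 'dt').distinct(). \
    join(first_dates, on='id'). \
    where(func.col('dt') >= func.col('first_dt')). \
    select('id', 'dt', 'cat'). \
    orderBy('id', 'dt', 'cat'). \
    show()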

Related

How to match column values of one table with column names of another table in PySpark

I have two dataframes.
dataframe_1
+---+------+---------------+
|Id |qname |qval           |
+---+------+---------------+
|01 |Mango |[100, 200]     |
|01 |Banana|[500, 400, 800]|
+---+------+---------------+
dataframe_2
+-----+-----+------+------+-----+
|reqId|Mango|Banana|Orange|Apple|
+-----+-----+------+------+-----+
|1000 |100  |500   |NULL  |NULL |
|1001 |200  |500   |NULL  |NULL |
|1002 |200  |800   |NULL  |NULL |
|1003 |900  |1100  |NULL  |NULL |
+-----+-----+------+------+-----+
Expected Result
+---+-----+
|Id |ReqId|
+---+-----+
|01 |1000 |
|01 |1001 |
|01 |1002 |
+---+-----+
Please give me some idea. I need to match every qname and qval of dataframe_1 to the columns of dataframe_2, ignoring the NULL columns of dataframe_2, and get all the matching reqId values from dataframe_2.
Note: all qname/qval pairs of a particular Id in dataframe_1 should match the corresponding columns of dataframe_2, ignoring nulls. For example, Id 01 has two qname/qval pairs; both should match the corresponding column names of dataframe_2.
The logic is:
1. In df2, pair "reqId" with each of the other columns.
2. In df2, introduce a dummy column with a constant value and group by it so all rows end up in one group.
3. Unpivot df2.
4. Join df1 with the processed df2.
5. For each element in the "qval" list, filter the corresponding "reqId" from the joined df2 column.
6. Group by "Id" and explode "reqId".
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

df1 = spark.createDataFrame(data=[["01","Mango",[100,200]],["01","Banana",[500,400,800]],["02","Banana",[800,1100]]], schema=["Id","qname","qval"])
df2 = spark.createDataFrame(data=[[1000,100,500,None,None],[1001,200,500,None,None],[1002,200,800,None,None],[1003,900,1100,None,None]], schema="reqId int,Mango int,Banana int,Orange int,Apple int")

# pair each value column with its reqId
for c in df2.columns:
    if c != "reqId":
        df2 = df2.withColumn(c, F.array(c, "reqId"))

# collapse df2 into a single row of (value, reqId) pair lists
df2 = df2.withColumn("dummy", F.lit(0)) \
    .groupBy("dummy") \
    .agg(*[F.collect_list(c).alias(c) for c in df2.columns]) \
    .drop("dummy", "reqId")

# unpivot: one row per original column name with its list of pairs
stack_cols = ", ".join([f"{c}, '{c}'" for c in df2.columns])
df2 = df2.selectExpr(f"stack({len(df2.columns)},{stack_cols}) as (qval2, qname2)")

@F.udf(returnType=ArrayType(IntegerType()))
def compare_qvals(qval, qval2):
    # keep the reqIds whose value appears in the qval list
    return [x[1] for x in qval2 if x[0] in qval]

df_result = df1.join(df2, on=(df1.qname == df2.qname2)) \
    .withColumn("reqId", compare_qvals("qval", "qval2")) \
    .groupBy("Id") \
    .agg(F.flatten(F.array_distinct(F.collect_list("reqId"))).alias("reqId")) \
    .withColumn("reqId", F.explode("reqId"))

df_result.show(truncate=False)
Output:
+---+-----+
|Id |reqId|
+---+-----+
|01 |1000 |
|01 |1001 |
|01 |1002 |
|02 |1002 |
|02 |1003 |
+---+-----+
PS - To cover case with multiple "Id"s, I have added some extra data to the sample dataset, hence output has some extra rows.
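If the requirement is strictly that every qname of an Id must match (not just any of them), a hedged alternative sketch is to unpivot df2 directly and count matches per (Id, reqId). It assumes the original df1 and df2 as created at the top of the snippet above, before they are transformed:
# unpivot df2 into one row per (reqId, qname2, qval2), dropping the NULL cells
value_cols = [c for c in df2.columns if c != "reqId"]
stack_expr = ", ".join([f"'{c}', {c}" for c in value_cols])
df2_long = df2.selectExpr("reqId", f"stack({len(value_cols)}, {stack_expr}) as (qname2, qval2)") \
    .where(F.col("qval2").isNotNull())

# how many qnames each Id has to match
qname_counts = df1.groupBy("Id").agg(F.countDistinct("qname").alias("n_qnames"))

# keep an (Id, reqId) pair only if every qname of that Id matched
df1.join(df2_long, (df1.qname == df2_long.qname2) & F.expr("array_contains(qval, qval2)")) \
    .groupBy("Id", "reqId").agg(F.countDistinct("qname").alias("n_matched")) \
    .join(qname_counts, "Id") \
    .where(F.col("n_matched") == F.col("n_qnames")) \
    .select("Id", "reqId") \
    .show(truncate=False)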

PySpark - Count zero-value columns between each pair of non-zero-value columns

Let's say we have this table (first row is the title)
How can I count the number of zero-value cells between two non-zero-value cells?
For example, the output for the above table should be a list of (3, 2).
Pandas inter-row iteration and cell counting may work, but it is clearly not efficient for a big dataset. Please help. Thanks!
You can use the following series of transformations.
I have added some more data to test boundary cases.
To see the output up to a particular step, just comment out the steps that follow it:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        [0,0,0,0,0,0,0,0,0,0,0],
        [1,1,1,1,1,1,1,1,1,1,1],
        [1,0,1,0,1,0,1,0,1,0,1],
        [0,1,0,1,0,1,0,1,0,1,0],
        [0,1,0,0,0,1,0,0,1,0,0]
    ],
    [str(i) for i in range(0, 11)]
)

df.select(F.array(df.columns).alias("c")) \
    .select(F.array_join("c", delimiter="").alias("c")) \
    .select(F.regexp_replace("c", r"^0*", "").alias("c")) \
    .select(F.regexp_replace("c", r"0*$", "").alias("c")) \
    .select(F.split("c", "1").alias("c")) \
    .select(F.transform("c", lambda s: F.length(s)).alias("c")) \
    .select(F.filter("c", lambda i: i > 0).alias("c")) \
    .show(truncate=False)
[Out]:
+---------------+
|c |
+---------------+
|[] |
|[] |
|[1, 1, 1, 1, 1]|
|[1, 1, 1, 1] |
|[3, 2] |
+---------------+
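To make the string trick concrete, here is the last row traced through each step by hand (my own walkthrough of the transformations above, not program output):
# row: [0,1,0,0,0,1,0,0,1,0,0]
# array_join              -> "01000100100"
# regexp_replace ^0*      -> "1000100100"
# regexp_replace 0*$      -> "10001001"
# split on "1"            -> ["", "000", "00", ""]
# transform to lengths    -> [0, 3, 2, 0]
# filter > 0              -> [3, 2]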

How can I group by a column and use it to group by another column?

I am classifying a column into different groups based on the first letters of its values: if two values have the same first 4 characters, they are in the same class. I use the following code to do that:
# this code extracts the first 4 characters of each title
df1_us2 = df1_us2.withColumn("first_2_char", df1_us2.clean_company_name.substr(1, 4))
# this code groups them into a list
group_user = df1_us2.groupBy('first_2_char').agg(collect_set('col1').alias('cal11'))
Each title has a description, and I want this classification to happen for the description as well:
Example:
+-------------+--------------+
|col1         |description   |
+-------------+--------------+
|summer       |a season      |
|summary      |it is a brief |
|common       |having similar|
|communication|null          |
|house        |living place  |
+-------------+--------------+
Output:
+---------------------------+-----------------------------+
|col11                      |description1                 |
+---------------------------+-----------------------------+
|['summer', 'summary']      |['a season', 'it is a brief']|
|['common', 'communication']|['having similar', null]     |
|['house']                  |['living place']             |
+---------------------------+-----------------------------+
How can I modify the above code to get description1?
Note: if a description is null, the null should still be in the list, because I am going to use the index of the elements in col1 to get their descriptions, so both columns should have lists of the same size in each row.
collect_list should work as the aggregation function:
from pyspark.sql import functions as F
df = ...
df.withColumn('f2c', df.col1.substr(1, 2)) \
    .fillna('null') \
    .groupby('f2c') \
    .agg(F.collect_list('col1').alias('col11'),
         F.collect_list('description').alias('description1')) \
    .drop('f2c') \
    .show(truncate=False)
To include the null values in the arrays, they are replaced with the string 'null' first.
Output:
+-----------------------+-------------------------+
|col11 |description1 |
+-----------------------+-------------------------+
|[house] |[living place] |
|[common, communication]|[having similar, null] |
|[summer, summary] |[a season, it is a brief]|
+-----------------------+-------------------------+
For further processing the two arrays can be combined into a map using map_from_arrays:
[...]
.withColumn('map', F.map_from_arrays('col11', 'description1')) \
.show(truncate=False)
Output:
+-----------------------+-------------------------+-------------------------------------------------+
|col11 |description1 |map |
+-----------------------+-------------------------+-------------------------------------------------+
|[house] |[living place] |{house -> living place} |
|[common, communication]|[having similar, null] |{common -> having similar, communication -> null}|
|[summer, summary] |[a season, it is a brief]|{summer -> a season, summary -> it is a brief} |
+-----------------------+-------------------------+-------------------------------------------------+
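If the literal string 'null' is not acceptable in the description arrays, a hedged alternative sketch is to collect structs instead, which keeps real nulls inside the arrays (same df and column names as above):
from pyspark.sql import functions as F

# collect (col1, description) pairs as structs so genuine nulls survive,
# then pull the two fields back out as parallel arrays
df.withColumn('f2c', df.col1.substr(1, 2)) \
    .groupby('f2c') \
    .agg(F.collect_list(F.struct('col1', 'description')).alias('pairs')) \
    .select(F.col('pairs.col1').alias('col11'),
            F.col('pairs.description').alias('description1')) \
    .show(truncate=False)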

Calculating difference between two dates in PySpark

Currently I'm working with a dataframe and need to calculate the number of days (as an integer) between two dates formatted as timestamps.
I've opted for this solution:
from pyspark.sql.functions import lit, when, col, datediff
df1 = df1.withColumn("LD", datediff("MD", "TD"))
But when calculating a sum from a list of columns I get the error "Column is not iterable", which makes it impossible for me to calculate the sum of the rows based on column names:
col_list = ["a", "b", "c"]
df2 = df1.withColumn("My_Sum", sum([F.col(c) for c in col_list]))
How can I deal with it in order to calculate the difference between dates and then calculate the sum of the rows given the names of certain columns?
datediff has nothing to do with the sum of a column. The pyspark SQL sum function takes in one column and calculates the sum of the rows in that column.
Here are a couple of ways to get the sum of a column from a list of columns using list comprehension.
Single-row output with the sum of each column:
from pyspark.sql import functions as func

data_sdf. \
    select(*[func.sum(c).alias(c + '_sum') for c in col_list]). \
    show()
# +-----+-----+-----+
# |a_sum|b_sum|c_sum|
# +-----+-----+-----+
# | 1337| 3778| 6270|
# +-----+-----+-----+
The sum of all rows of the column, repeated on each row:
from pyspark.sql.window import Window as wd

data_sdf. \
    select('*',
           *[func.sum(c).over(wd.partitionBy()).alias(c + '_sum') for c in col_list]
           ). \
    show(5)
# +---+---+---+-----+-----+-----+
# | a| b| c|a_sum|b_sum|c_sum|
# +---+---+---+-----+-----+-----+
# | 45| 58|125| 1337| 3778| 6270|
# | 9| 99|143| 1337| 3778| 6270|
# | 33| 91|146| 1337| 3778| 6270|
# | 21| 85|118| 1337| 3778| 6270|
# | 30| 55|101| 1337| 3778| 6270|
# +---+---+---+-----+-----+-----+
# only showing top 5 rows
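If what you actually need is the row-wise sum across the named columns (what the original sum([F.col(c) ...]) attempts), a hedged sketch using functools.reduce sidesteps any clash between Python's built-in sum and pyspark's sum:
from functools import reduce
from operator import add

# builds the expression a + b + c and adds it as a new column (col_list as in the question)
data_sdf.withColumn('My_Sum', reduce(add, [func.col(c) for c in col_list])). \
    show(5)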

export many files from a table

I have a SQL query that generates a table in the below format:
|sex |country|popularity|
|null |null | x |
|null |value | x |
|value|null | x |
|value|null | x |
|null |value | x |
|value|value | x |
The value for the sex column could be woman or man, the value for country could be Italy, England, US, etc., and x is an int.
Now I would like to save four files based on the (value, null) combination: file1 consists of the (value, value) rows for the columns sex and country, file2 consists of the (value, null) rows, file3 consists of the (null, value) rows, and file4 consists of the (null, null) rows.
I have searched a lot but couldn't find any useful info. I have also tried the below:
val df1 = data.withColumn("combination",concat(col("sex") ,lit(","), col("country")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
but I receive more files, because this command generates files based on all possible values of (sex, country).
Same with the below:
val df1 = data.withColumn("combination",concat(col("sex")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("text.csv")
Is there any command similar to partitionBy that partitions by the (value, null) pattern of the pair rather than by the actual column values?
You can convert the columns into Booleans depending on whether they are null or not, and concatenate them into a string, which will look like "true_true", "true_false", etc.
from pyspark.sql.functions import col, concat, lit

df = df.withColumn("coltype", concat(col("sex").isNull(), lit("_"), col("country").isNull()))

df.coalesce(1) \
    .write \
    .partitionBy("coltype") \
    .format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .save("output")
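If you need exactly four outputs under names you choose (rather than partition directories named by coltype), a hedged sketch in PySpark writes one filtered output per (sex, country) null pattern; the output paths are placeholders:
from pyspark.sql import functions as F

# one explicit write per null pattern of (sex, country)
patterns = {
    "value_value": F.col("sex").isNotNull() & F.col("country").isNotNull(),
    "value_null":  F.col("sex").isNotNull() & F.col("country").isNull(),
    "null_value":  F.col("sex").isNull() & F.col("country").isNotNull(),
    "null_null":   F.col("sex").isNull() & F.col("country").isNull(),
}

for name, cond in patterns.items():
    data.filter(cond) \
        .coalesce(1) \
        .write \
        .format("csv") \
        .option("header", "true") \
        .mode("overwrite") \
        .save(f"output/{name}")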