How can I run this task in pyspark?

I have this Df in pyspark.
cod, date , value_1, value_2, value_3
1 , 2021-03 , 0, 3, 3
2 , 2021-04 , 2, 0, 0
3 , 2021-05 , 3, 3, 3
4 , 2021-03 , 0, 0, 0
I need to add a column that counts the zeros in the value columns for each cod, so that the result looks like this.
cod, date , value_1, value_2, value_3, new_column
1 , 2021-03 , 0, 3, 3, 1
2 , 2021-04 , 2, 0, 0, 2
3 , 2021-05 , 3, 3, 3, 0
4 , 2021-03 , 0, 0, 0, 3
I use pyspark SQL.

You can check for 0s in the value columns and count them per row.
from pyspark.sql import functions as F
value_columns = ['value_1', 'value_2', 'value_3']
# Build a 0/1 indicator per value column and add them up row-wise
# (Python's built-in sum combines the Spark Columns with +).
df.withColumn(
    "new_column",
    sum(F.when(df[col] == 0, 1).otherwise(0) for col in value_columns)
).show()
+---+-------+-------+-------+-------+----------+
|cod| date|value_1|value_2|value_3|new_column|
+---+-------+-------+-------+-------+----------+
| 1|2021-03| 0| 3| 3| 1|
| 2|2021-04| 2| 0| 0| 2|
| 3|2021-05| 3| 3| 3| 0|
| 4|2021-03| 0| 0| 0| 3|
+---+-------+-------+-------+-------+----------+
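Since the question mentions Spark SQL, here is a minimal sketch of the same logic as a plain SQL query, assuming the DataFrame is registered as a temporary view named t (the view name is only for illustration):
df.createOrReplaceTempView("t")
spark.sql("""
    SELECT *,
           IF(value_1 = 0, 1, 0)
         + IF(value_2 = 0, 1, 0)
         + IF(value_3 = 0, 1, 0) AS new_column
    FROM t
""").show()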

Related

Custom order pyspark dataframe using a column

I have a pyspark dataframe df:
I want to prioritize items based on the Type column, in this order: AAIC > AAFC > TBIC > TBFC, and within those groups by the RANK column, i.e. items with a lower rank are prioritized first inside each of the above order groups.
Any values in the Type column other than AAIC, AAFC, TBIC or TBFC I want to relabel as NON.
ITEM  Type  RANK
1     AAIC  11
2     AAFC  8
3     TBIC  2
4     TBFC  1
5     XYZ   5
6     AAIC  7
7     JHK   10
8     SWE   3
9     TBIC  4
10    AAFC  9
11    AAFC  6
Desired pyspark dataframe df:

ITEM  Type  RANK  NEW_RANK
6     AAIC  7     1
1     AAIC  11    2
11    AAFC  6     3
2     AAFC  8     4
10    AAFC  9     5
3     TBIC  2     6
9     TBIC  4     7
4     TBFC  1     8
8     NON   3     9
5     NON   5     10
7     NON   10    11
You may check this code:
import pyspark.sql.functions as F
from pyspark.sql import Window
inputData = [
    (1, "AAIC", 11),
    (2, "AAFC", 8),
    (3, "TBIC", 2),
    (4, "TBFC", 1),
    (5, "XYZ", 5),
    (6, "AAIC", 7),
    (7, "JHK", 10),
    (8, "SWE", 3),
    (9, "TBIC", 4),
    (10, "AAFC", 9),
    (11, "AAFC", 6),
]
inputDf = spark.createDataFrame(inputData, schema=["item", "type", "rank"])

# Relabel any type outside the known four as NON, then map each type to its priority.
preprocessedDf = inputDf.withColumn(
    "type",
    F.when(
        F.col("type").isin(["AAIC", "AAFC", "TBIC", "TBFC"]), F.col("type")
    ).otherwise(F.lit("NON")),
).withColumn(
    "priority",
    F.when(F.col("type") == F.lit("AAIC"), 1).otherwise(
        F.when(F.col("type") == F.lit("AAFC"), 2).otherwise(
            F.when(F.col("type") == F.lit("TBIC"), 3).otherwise(
                F.when(F.col("type") == F.lit("TBFC"), 4).otherwise(F.lit(5))
            )
        )
    ),
)

# Number all rows by (priority, rank); an empty partitionBy puts everything in one partition.
windowSpec = Window.partitionBy().orderBy("priority", "rank")
preprocessedDf.withColumn("NEW_RANK", F.row_number().over(windowSpec)).drop(
    "priority"
).show()
Priorities for the codes are hardcoded, which may be hard to maintain if more values appear; you may want to adjust that part if it needs to be more flexible (see the sketch after the output below).
All records are moved to a single partition to compute a correct row order. This is a common problem: it is hard to generate consistent, ordered ids in a distributed way. If your dataset is big, you may need a different, probably more complicated, approach.
output:
+----+----+----+--------+
|item|type|rank|NEW_RANK|
+----+----+----+--------+
| 6|AAIC| 7| 1|
| 1|AAIC| 11| 2|
| 11|AAFC| 6| 3|
| 2|AAFC| 8| 4|
| 10|AAFC| 9| 5|
| 3|TBIC| 2| 6|
| 9|TBIC| 4| 7|
| 4|TBFC| 1| 8|
| 8| NON| 3| 9|
| 5| NON| 5| 10|
| 7| NON| 10| 11|
+----+----+----+--------+
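As a sketch of the more flexible variant mentioned above (my addition, not part of the original answer), the nested when chain can be generated from a plain Python list, so supporting a new code only means extending the list:
order = ["AAIC", "AAFC", "TBIC", "TBFC"]

# Build the priority column from the list instead of hardcoding nested whens.
priority = F.lit(len(order) + 1)  # default priority for NON / unknown types
for p, t in reversed(list(enumerate(order, start=1))):
    priority = F.when(F.col("type") == t, p).otherwise(priority)

flexibleDf = (
    inputDf
    .withColumn("type", F.when(F.col("type").isin(order), F.col("type")).otherwise(F.lit("NON")))
    .withColumn("priority", priority)
)
flexibleDf.withColumn(
    "NEW_RANK",
    F.row_number().over(Window.partitionBy().orderBy("priority", "rank")),
).drop("priority").show()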

Window function with PySpark

I have a PySpark Dataframe and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know whether any of the first three months has an Amount greater than 0; if so, the value of the Flag column will be 1 for the whole group, otherwise it will be 0.
I will include an example to clarify a bit better.
Initial PySpark Dataframe:
Group  Month  Amount
A      1      0
A      2      0
A      3      35
A      4      0
A      5      0
B      1      0
B      2      0
C      1      0
C      2      0
C      3      0
C      4      13
D      1      0
D      2      24
D      3      0
Final PySpark Dataframe:
Group  Month  Amount  Flag
A      1      0       1
A      2      0       1
A      3      35      1
A      4      0       1
A      5      0       1
B      1      0       0
B      2      0       0
C      1      0       0
C      2      0       0
C      3      0       0
C      4      13      0
D      1      0       1
D      2      24      1
D      3      0       1
Basically, what I want is, for each group, to sum the Amount of the first 3 months; if that sum is greater than 0, the Flag is 1 for all elements of the group, otherwise it is 0.
You can create the flag column by applying a Window function. Create a pseudo-column that becomes 1 if the criterion is met, then sum over the pseudo-column: if that sum is greater than 0, at least one row met the criterion, and the flag is set to 1.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("A", 1, 0),
        ("A", 2, 0),
        ("A", 3, 35),
        ("A", 4, 0),
        ("A", 5, 0),
        ("B", 1, 0),
        ("B", 2, 0),
        ("C", 1, 0),
        ("C", 2, 0),
        ("C", 3, 0),
        ("C", 4, 13),
        ("D", 1, 0),
        ("D", 2, 24),
        ("D", 3, 0)]
df = spark.createDataFrame(data, ("Group", "Month", "Amount", ))
ws = W.partitionBy("Group").orderBy("Month").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
criteria = F.when((F.col("Month") < 4) & (F.col("Amount") > 0), F.lit(1)).otherwise(F.lit(0))
(df.withColumn("flag", F.when(F.sum(criteria).over(ws) > 0, F.lit(1)).otherwise(F.lit(0)))
).show()
"""
+-----+-----+------+----+
|Group|Month|Amount|flag|
+-----+-----+------+----+
| A| 1| 0| 1|
| A| 2| 0| 1|
| A| 3| 35| 1|
| A| 4| 0| 1|
| A| 5| 0| 1|
| B| 1| 0| 0|
| B| 2| 0| 0|
| C| 1| 0| 0|
| C| 2| 0| 0|
| C| 3| 0| 0|
| C| 4| 13| 0|
| D| 1| 0| 1|
| D| 2| 24| 1|
| D| 3| 0| 1|
+-----+-----+------+----+
"""
You can use a Window function with count and when.
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Group')
df = (df.withColumn('Flag', F.count(
          F.when((F.col('Month') < 4) & (F.col('Amount') > 0), True)).over(w))
        .withColumn('Flag', F.when(F.col('Flag') > 0, 1).otherwise(0)))
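A slightly more compact variant of the same idea (a sketch, not from the original answers) takes the max of a 0/1 indicator over the group window, so the second withColumn is not needed:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Group')
df = df.withColumn(
    'Flag',
    # the max over the whole group is 1 as soon as one row in months 1-3 has Amount > 0
    F.max(F.when((F.col('Month') < 4) & (F.col('Amount') > 0), 1).otherwise(0)).over(w)
)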

Calculate the sum on the 24 hours time frame in spark dataframe

I want to calculate the sum over date and date+1 (a 24-hour window) by filtering the rows based on the hour.
1, 2018-05-01 02:12:00,1
1, 2018-05-01 03:16:10,2
1, 2018-05-01 09:12:00,4
1, 2018-05-01 14:18:00,3
1, 2018-05-01 18:32:00,1
1, 2018-05-01 20:12:00,1
1, 2018-05-02 01:22:00,1
1, 2018-05-02 02:12:00,1
1, 2018-05-02 08:30:00,1
1, 2018-05-02 10:12:00,1
1, 2018-05-02 11:32:00,1
1, 2018-05-02 18:12:00,1
1, 2018-05-03 03:12:00,1
1, 2018-05-03 08:22:00,1
Here, for example, I have filtered the rows from 9 AM to 9 AM (next date).
Output
1, 2018-05-01,12
1, 2018-05-02,5
First define df for reproducibility:
import pandas as pd
import io
data=\
"""
1, 2018-05-01 02:12:00,1
1, 2018-05-01 03:16:10,2
1, 2018-05-01 09:12:00,4
1, 2018-05-01 14:18:00,3
1, 2018-05-01 18:32:00,1
1, 2018-05-01 20:12:00,1
1, 2018-05-02 01:22:00,1
1, 2018-05-02 02:12:00,1
1, 2018-05-02 08:30:00,1
1, 2018-05-02 10:12:00,1
1, 2018-05-02 11:32:00,1
1, 2018-05-02 18:12:00,1
1, 2018-05-03 03:12:00,1
1, 2018-05-03 08:22:00,1
"""
df = pd.read_csv(io.StringIO(data), sep = ',', names = ['id','t', 'n'], parse_dates =['t'])
Then use pd.Grouper with the frequency set to 24h and the base parameter set to 9, which indicates that each period begins at 9 a.m.:
df.groupby(pd.Grouper(key='t', freq='24h', base=9)).n.sum()
result:
t
2018-04-30 09:00:00 3
2018-05-01 09:00:00 12
2018-05-02 09:00:00 5
Freq: 24H, Name: n, dtype: int64
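Note that on newer pandas versions (1.1+) the base argument is deprecated; a hedged equivalent uses offset to express the same 9 a.m. anchor:
# pandas >= 1.1: offset replaces the deprecated base= argument of pd.Grouper
df.groupby(pd.Grouper(key='t', freq='24h', offset='9h')).n.sum()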
Just shift the time of your timestamp column by 9 hours and then groupby the date of the adjusted column:
from pyspark.sql.functions import expr, sum as fsum
df
# DataFrame[id: int, dtime: timestamp, cnt: int]
df.groupby("id", expr("date(dtime - interval 9 hours) as ddate")) \
.agg(fsum("cnt").alias("cnt")) \
.show()
+---+----------+---+
| id| ddate|cnt|
+---+----------+---+
| 1|2018-05-01| 12|
| 1|2018-05-02| 5|
| 1|2018-04-30| 3|
+---+----------+---+
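For reproducibility, a small sketch (my addition, not part of the original answer) that builds the DataFrame assumed above from the sample rows, parsing dtime as a timestamp:
from pyspark.sql import functions as F

rows = [
    (1, "2018-05-01 02:12:00", 1), (1, "2018-05-01 03:16:10", 2),
    (1, "2018-05-01 09:12:00", 4), (1, "2018-05-01 14:18:00", 3),
    (1, "2018-05-01 18:32:00", 1), (1, "2018-05-01 20:12:00", 1),
    (1, "2018-05-02 01:22:00", 1), (1, "2018-05-02 02:12:00", 1),
    (1, "2018-05-02 08:30:00", 1), (1, "2018-05-02 10:12:00", 1),
    (1, "2018-05-02 11:32:00", 1), (1, "2018-05-02 18:12:00", 1),
    (1, "2018-05-03 03:12:00", 1), (1, "2018-05-03 08:22:00", 1),
]
df = (spark.createDataFrame(rows, ["id", "dtime", "cnt"])
      .withColumn("dtime", F.to_timestamp("dtime")))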
Use the date_format(), date_add() and to_date() Spark built-in functions, then groupBy and aggregate.
Example:
Spark-Scala:
df.show()
//+---+-------------------+---+
//| id| date|cnt|
//+---+-------------------+---+
//| 1|2018-05-01 02:12:00| 1|
//| 1|2018-05-01 03:16:10| 2|
//| 1|2018-05-01 09:12:00| 4|
//| 1|2018-05-01 14:18:00| 3|
//| 1|2018-05-01 18:32:00| 1|
//| 1|2018-05-01 20:12:00| 1|
//| 1|2018-05-02 01:22:00| 1|
//| 1|2018-05-02 02:12:00| 1|
//| 1|2018-05-02 08:30:00| 1|
//| 1|2018-05-02 10:12:00| 1|
//| 1|2018-05-02 11:32:00| 1|
//| 1|2018-05-02 18:12:00| 1|
//| 1|2018-05-03 03:12:00| 1|
//| 1|2018-05-03 08:22:00| 1|
//+---+-------------------+---+
df.withColumn("hour",when(date_format(col("date"),"HH").cast("int") >= 9,to_date(col("date"))).otherwise(date_add(to_date(col("date")),-1))).
groupBy("id","hour").
agg(sum("cnt").cast("int").alias("sum")).
show()
//+---+----------+---+
//| id| hour|sum|
//+---+----------+---+
//| 1|2018-05-01| 12|
//| 1|2018-05-02| 5|
//| 1|2018-04-30| 3|
//+---+----------+---+
Pyspark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
df.withColumn("hour",when(date_format(col("date"),"HH").cast("int") >= 9,to_date(col("date"))).otherwise(date_add(to_date(col("date")),-1))).\
groupBy("id","hour").\
agg(sum("cnt").cast("int").alias("sum")).\
show()
#+---+----------+---+
#| id| hour|sum|
#+---+----------+---+
#| 1|2018-05-01| 12|
#| 1|2018-05-02| 5|
#| 1|2018-04-30| 3|
#+---+----------+---+

Pyspark - Count non zero columns in a spark data frame for each row

I have a dataframe, and I need to count the number of non-zero columns per row in Pyspark.
ID COL1 COL2 COL3
1 0 1 -1
2 0 0 0
3 -17 20 15
4 23 1 0
Expected Output:
ID COL1 COL2 COL3 Count
1 0 1 -1 2
2 0 0 0 0
3 -17 20 15 3
4 23 1 0 2
There are various approaches to achieve this; below is one of the simpler ones.
df = sqlContext.createDataFrame([
    [1, 0, 1, -1],
    [2, 0, 0, 0],
    [3, -17, 20, 15],
    [4, 23, 1, 0]],
    ["ID", "COL1", "COL2", "COL3"]
)

# Value columns, i.e. everything except the ID column
df.columns[1:]
# ['COL1', 'COL2', 'COL3']

# import functions
from pyspark.sql import functions as F

# Add a "count" column: the sum of (1 if column != 0 else 0) over the value columns
df.withColumn(
    "count",
    sum([
        F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]
    ])
).show()
+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
| 1| 0| 1| -1| 2|
| 2| 0| 0| 0| 0|
| 3| -17| 20| 15| 3|
| 4| 23| 1| 0| 2|
+---+----+----+----+-----+
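If there are many value columns, an alternative sketch (assuming Spark 2.4+ for higher-order SQL functions; not part of the original answer) packs them into an array and counts the non-zero entries:
from pyspark.sql import functions as F

value_cols = df.columns[1:]  # ['COL1', 'COL2', 'COL3']
df.withColumn(
    "count",
    # filter() keeps the non-zero entries of the array, size() counts them
    F.expr(f"size(filter(array({', '.join(value_cols)}), x -> x != 0))")
).show()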

scala - Spark : how to get the resultSet with condition in a groupedData

Is there a way to group a Dataframe using its own schema?
This produces data of the format:
Country | Class | Name | age
US, 1,'aaa',21
US, 1,'bbb',20
BR, 2,'ccc',30
AU, 3,'ddd',20
....
I would want to do something like:
Country | Class 1 Students | Class 2 Students
US , 2, 0
BR , 0, 1
....
Condition 1: group by Country.
Condition 2: get only class values 1 or 2.
This is the source code:
val df = Seq(("US", 1, "AAA",19),("US", 1, "BBB",20),("KR", 2, "CCC",29),
("AU", 3, "DDD",18)).toDF("country", "class", "name","age")
df.groupBy("country").agg(count($"name") as "Cnt")
You should use the pivot function.
val df = Seq(("US", 1, "AAA",19),("US", 1, "BBB",20),("KR", 2, "CCC",29),
("AU", 3, "DDD",18)).toDF("country", "class", "name","age")
df.groupBy("country").pivot("class").agg(count($"name") as "Cnt").show
+-------+---+---+---+
|country| 1| 2| 3|
+-------+---+---+---+
| AU| 0| 0| 1|
| US| 2| 0| 0|
| KR| 0| 1| 0|
+-------+---+---+---+
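Since condition 2 asks only for classes 1 and 2, pivot can also be given an explicit value list. A PySpark sketch (my addition, assuming an equivalent PySpark DataFrame df) that matches the two desired columns and skips the extra pass Spark otherwise needs to discover the distinct pivot values:
from pyspark.sql import functions as F

# Only pivot on the classes of interest; missing combinations simply get no rows counted.
df.groupBy("country").pivot("class", [1, 2]).agg(F.count("name").alias("Cnt")).show()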