compute the rate in pyspark dataframe - pyspark

I have a Spark dataframe like this:
date    isF
190502  1
190502  0
190503  1
190504  1
190504  0
190505  1
I would like to compute, for each date, the rate of rows where isF = 1.
The expected result is :
date    rate
190502  0.5
190503  1
190504  0.5
190505  1
I tried something like this, but it computes the sum. How can I compute the rate instead?
stats_daily_df = (tx_wd_df
    .groupBy("date", "isF")
    .agg(  # select
        when(col("isF") == 1, (sum("isF")).alias("sum"))
        .otherwise(0))  # else 0.00
)

IIUC, the below can help:
df.groupBy('date').agg((f.sum('isF')/f.count('isF')).alias('rate')).show()
+------+----+
| date|rate|
+------+----+
|190505| 1.0|
|190502| 0.5|
|190504| 0.5|
|190503| 1.0|
+------+----+
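Since isF only takes the values 0 and 1, the per-date mean gives the same rate; a minimal equivalent sketch (same df as above):
from pyspark.sql import functions as f

# for a 0/1 column, the group average equals the fraction of rows with isF = 1
df.groupBy('date').agg(f.avg('isF').alias('rate')).show()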

Related

PySpark: Pivot only one row to column

I have a dataframe like so:
df = sc.parallelize([("num1", "1"), ("num2", "5"), ("total", "10")]).toDF(("key", "val"))
key    val
num1   1
num2   5
total  10
I want to pivot only the total row to a new column and keep its value for each row:
key    val  total
num1   1    10
num2   5    10
I've tried pivoting and aggregating but cannot get the one column with the same value.
You could join a dataframe with only the total to a dataframe without the total.
Another option would be to collect the total and add it as a literal.
from pyspark.sql import functions as f
# option 1: cross join the single-row "total" dataframe onto the other rows
df1 = df.filter("key <> 'total'")
df2 = df.filter("key = 'total'").select(f.col('val').alias('total'))
df1.join(df2).show()
+----+---+-----+
| key|val|total|
+----+---+-----+
|num1| 1| 10|
|num2| 5| 10|
+----+---+-----+
# option 2: collect the total as a Python value and add it as a literal column
total = df.filter("key = 'total'").select('val').collect()[0][0]
df.filter("key <> 'total'").withColumn('total', f.lit(total)).show()
+----+---+-----+
| key|val|total|
+----+---+-----+
|num1| 1| 10|
|num2| 5| 10|
+----+---+-----+
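A third option (a sketch, not part of the original answer) broadcasts the total with an unpartitioned window, avoiding both the collect() and the join; it assumes the same df as above:
from pyspark.sql import Window
from pyspark.sql import functions as f

# option 3 (sketch): pull the total value onto every row via a global window
w = Window.partitionBy()  # a single partition is fine for small data like this
(df
 .withColumn('total', f.max(f.when(f.col('key') == 'total', f.col('val'))).over(w))
 .filter("key <> 'total'")
 .show())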

Pyspark - Count non zero columns in a spark data frame for each row

I have a dataframe and I need to count the number of non-zero columns per row in PySpark.
ID  COL1  COL2  COL3
1   0     1     -1
2   0     0     0
3   -17   20    15
4   23    1     0
Expected Output:
ID  COL1  COL2  COL3  Count
1   0     1     -1    2
2   0     0     0     0
3   -17   20    15    3
4   23    1     0     2
There are various approaches to achieve this; below is one of the simpler ones:
df = sqlContext.createDataFrame([
    [1, 0, 1, -1],
    [2, 0, 0, 0],
    [3, -17, 20, 15],
    [4, 23, 1, 0]],
    ["ID", "COL1", "COL2", "COL3"]
)

# check the column list, leaving out the ID column
df.columns[1:]
['COL1', 'COL2', 'COL3']

# import functions
from pyspark.sql import functions as F

# add a "count" column: the sum over the columns of (1 if the value != 0 else 0)
df.withColumn(
    "count",
    sum([
        F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]
    ])
).show()
+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
| 1| 0| 1| -1| 2|
| 2| 0| 0| 0| 0|
| 3| -17| 20| 15| 3|
| 4| 23| 1| 0| 2|
+---+----+----+----+-----+
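If you are on Spark 3.1 or later, an alternative sketch (not part of the original answer) packs the value columns into an array and counts the non-zero entries with the higher-order filter function:
from pyspark.sql import functions as F

# sketch, Spark 3.1+: count non-zero entries per row via array + higher-order filter
df.withColumn(
    "count",
    F.size(F.filter(F.array(*[F.col(c) for c in df.columns[1:]]), lambda x: x != 0))
).show()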

Need to write a spark scala code for the following sample data( Columns to Rows)

I need help writing a Spark program to turn the columns below into rows.
Text file:
{Summary
Report Id : 001
Type of Report : Medical
Start Date & Time: 15/07/2015 10:10:11
End Date & Time: 28/07/2018 15:12:05
Coordinates : 18° 52’ 01’’ N, 78° 12’ 15’’ E
No.
Freq.
Type
Angle
Power
P
PI
P Type
M Type
S Type
R Type
Time
File name
1
1000
Vis_typ_001
45.5
5
100
1000
PRI_7
M_15
S_2
R1
27/07/2018 10:12:05
Ac13.avi
2
408.55
Vis_typ_002
12
0
0
0
 
M_12
S_3
R5
27/07/2018 12:18:05
070.mp4
Total no of records received : 3
No. of reports passed :2
No. of reports failed :1
Comment: Good Result
}
This is just a modified version of #terminally-chill's answer in the same post referenced in the question. If this is run on a Windows machine, the delimiters below should be used for the split (Windows line endings are \r\n). I used this post on Windows delimiter usage as a reference.
val rdd2 = rdd.flatMap {
  case (file, data) =>
    data.split("\r\n\r\n").map(strBlock => {
      strBlock.split("\r\n")
    })
}

// create a DataFrame that contains an array per record block
val df = rdd2.toDF

// convert the array items to columns
val finalDf = df.select((0 until 13).map(i => col("value")(i)): _*)
finalDf.show
Output:
+--------+--------+-----------+--------+--------+--------+--------+------------+--------+--------+---------+--------------------+---------+
|value[0]|value[1]| value[2]|value[3]|value[4]|value[5]|value[6]| value[7]|value[8]|value[9]|value[10]| value[11]|value[12]|
+--------+--------+-----------+--------+--------+--------+--------+------------+--------+--------+---------+--------------------+---------+
| No| Freq| Type| Angle| Power| P| PI| P Type| M Type| S Type| R Type| Time|File name|
| 1| 1000|Vis_typ_001| 45.5| 5| 100| 1000| PRI_7| M_15| S_2| R1|27/07/2018 10:12...| Ac13.avi|
| 2| 408.55|Vis_typ_002| 12| 0| 0| 0|JUNKKKKKKKKK| M_12| S_3| R5|27/07/2018 12:18...| 070.mp4|
+--------+--------+-----------+--------+--------+--------+--------+------------+--------+--------+---------+--------------------+---------+

Spark Dataframes : CASE statement while using Window PARTITION function Syntax

I need to check a condition: if ReasonCode is "YES", then use ProcessDate as one of the PARTITION BY columns, otherwise do not.
The equivalent SQL query is below:
SELECT PNum,
       SUM(SIAmt) OVER (PARTITION BY PNum,
                                     ReasonCode,
                                     CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END
                        ORDER BY ProcessDate
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) SumAmt
FROM TABLE1
So far I have tried the query below, but I am unable to incorporate the condition "CASE WHEN ReasonCode = 'YES' THEN ProcessDate ELSE NULL END" in Spark DataFrames:
val df = inputDF.select("PNum")
  .withColumn("SumAmt", sum("SIAmt").over(
    Window.partitionBy("PNum", "ReasonCode").orderBy("ProcessDate")))
Input Data:
---------------------------------------
Pnum ReasonCode ProcessDate SIAmt
---------------------------------------
1 No 1/01/2016 200
1 No 2/01/2016 300
1 Yes 3/01/2016 -200
1 Yes 4/01/2016 200
---------------------------------------
Expected Output:
---------------------------------------------
Pnum ReasonCode ProcessDate SIAmt SumAmt
---------------------------------------------
1 No 1/01/2016 200 200
1 No 2/01/2016 300 500
1 Yes 3/01/2016 -200 -200
1 Yes 4/01/2016 200 200
---------------------------------------------
Any suggestion/help on Spark DataFrames instead of a spark-sql query?
You can express the exact same SQL in API form as:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = inputDF
  .withColumn("SumAmt", sum("SIAmt").over(
    Window.partitionBy(col("PNum"), col("ReasonCode"),
      when(col("ReasonCode") === "Yes", col("ProcessDate")).otherwise(null)).orderBy("ProcessDate")))
You can add the .rowsBetween(Long.MinValue, 0) part too, which should give you
+----+----------+-----------+-----+------+
|Pnum|ReasonCode|ProcessDate|SIAmt|SumAmt|
+----+----------+-----------+-----+------+
| 1| Yes| 4/01/2016| 200| 200|
| 1| No| 1/01/2016| 200| 200|
| 1| No| 2/01/2016| 300| 500|
| 1| Yes| 3/01/2016| -200| -200|
+----+----------+-----------+-----+------+
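Since the rest of this thread is PySpark, here is a hedged sketch of the same window in the Python API (assuming the same column names as the Scala example):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# include ProcessDate in the partition key only when ReasonCode is "Yes";
# when() without otherwise() yields null for the remaining rows
w = (Window
     .partitionBy(F.col("PNum"), F.col("ReasonCode"),
                  F.when(F.col("ReasonCode") == "Yes", F.col("ProcessDate")))
     .orderBy("ProcessDate")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = inputDF.withColumn("SumAmt", F.sum("SIAmt").over(w))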

Deciles or other quantile rank for Pyspark column

I have a PySpark DataFrame with multiple numeric columns, and for each column I want to calculate the decile (or another quantile) rank of each row based on that variable.
This is simple in pandas: we can create a new column for each variable with the qcut function, which assigns a value from 0 to n-1 for q buckets, as in pd.qcut(x, q=n).
How can this be done in PySpark? I have tried the following, but clearly the break points are not unique between these thirds. I want the lower 1/3 of the data assigned 1, the next 1/3 assigned 2, and the top 1/3 assigned 3, and I want to be able to change this and perhaps use 1/10, 1/32, etc.
w = Window.partitionBy(data.var1).orderBy(data.var1)
d2 = df.select(
    "var1",
    ntile(3).over(w).alias("ntile3")
)
agged = d2.groupby('ntile3').agg(
    F.min("var1").alias("min_var1"),
    F.max("var1").alias("max_var1"),
    F.count('*'))
agged.show()
+------+--------+--------+--------+
|ntile3|min_var1|max_var1|count(1)|
+------+--------+--------+--------+
| 1| 0.0| 210.0| 517037|
| 3| 0.0| 206.0| 516917|
| 2| 0.0| 210.0| 516962|
+------+--------+--------+--------+
QuantileDiscretizer from 'pyspark.ml.feature' can be used.
from pyspark.ml.feature import QuantileDiscretizer

values = [(0.1,), (0.4,), (1.2,), (1.5,)]
df = spark.createDataFrame(values, ["values"])
qds = QuantileDiscretizer(numBuckets=2, inputCol="values", outputCol="buckets",
                          relativeError=0.01, handleInvalid="error")
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
+------+-------+
|values|buckets|
+------+-------+
| 0.1| 0.0|
| 0.4| 1.0|
| 1.2| 1.0|
| 1.5| 1.0|
+------+-------+
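Since the question mentions multiple numeric columns, a hedged sketch: on Spark 3.0+ QuantileDiscretizer also accepts inputCols/outputCols, so several columns can be bucketed in one pass (the column names var1 and var2 here are assumptions):
from pyspark.ml.feature import QuantileDiscretizer

# sketch, Spark 3.0+: one discretizer over several columns (column names assumed)
qds = QuantileDiscretizer(numBuckets=10,
                          inputCols=["var1", "var2"],
                          outputCols=["var1_decile", "var2_decile"],
                          handleInvalid="skip")
qds.fit(df).transform(df).show()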
You can use percent_rank from pyspark.sql.functions with a window function. For instance, for computing deciles you can do:
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil, percent_rank
w = Window.orderBy(data.var1)
data.select('*', ceil(10 * percent_rank().over(w)).alias("decile"))
By doing so you first compute the percent_rank, and then you multiply this by 10 and take the upper integer. Consequently, all values with a percent_rank between 0 and 0.1 will be added to decile 1, all values with a percent_rank between 0.1 and 0.2 will be added to decile 2, etc.
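If you prefer the ntile approach from the question, a short sketch (with the partitionBy(data.var1) removed, since partitioning by the variable itself is what made the break points overlap) that works for any number of buckets:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

n = 10  # e.g. 3 for thirds, 10 for deciles, 32 for 1/32 buckets
w = Window.orderBy("var1")  # no partitionBy: ranking is global (moves all rows to one partition)
d2 = df.withColumn("bucket", F.ntile(n).over(w))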
In the accepted answer, fit is called twice. Change
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
to
qds.setHandleInvalid("skip").fit(df).transform(df).show()