How to create rows and increment them in a given DataFrame in PySpark

What I want is to create a new row based on the dataframe I have, which looks like the following:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_date

TEST_schema = StructType([StructField("date", StringType(), True),
                          StructField("col1", IntegerType(), True),
                          StructField("col2", IntegerType(), True)])
TEST_data = [('2020-08-17',0,0),('2020-08-18',2,1),('2020-08-19',0,2),('2020-08-20',3,0),('2020-08-21',4,2),
             ('2020-08-22',1,3),('2020-08-23',2,2),('2020-08-24',1,2),('2020-08-25',3,1)]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df = TEST_df.withColumn("date", to_date("date", 'yyyy-MM-dd'))
TEST_df.show()
+----------+----+----+
| date|col1|col2|
+----------+----+----+
|2020-08-17| 0| 0|
|2020-08-18| 2| 1|
|2020-08-19| 0| 2|
|2020-08-20| 3| 0|
|2020-08-21| 4| 2|
|2020-08-22| 1| 3|
|2020-08-23| 2| 2|
|2020-08-24| 1| 2|
|2020-08-25| 3| 1|
+----------+----+----+
Let's say I want to calculate a row for today's date, which is current_date(), and I want to calculate col1 as follows: if yesterday's col1 > 0, return col1 + col2, otherwise 0, where yesterday's date is current_date() - 1.
col2 is calculated as coalesce(lag(col2), 0).
So my result dataframe would be something like this (for example, the new 2020-08-26 row gets col1 = 3 + 1 = 4 because yesterday's col1 = 3 > 0, and col2 = lag(col2) = 1):
+----------+----+----+
| date|col1|want|
+----------+----+----+
|2020-08-17| 0| 0|
|2020-08-18| 2| 0|
|2020-08-19| 0| 1|
|2020-08-20| 3| 2|
|2020-08-21| 4| 0|
|2020-08-22| 1| 2|
|2020-08-23| 2| 3|
|2020-08-24| 1| 2|
|2020-08-25| 3| 2|
|2020-08-26| 4| 1|
+----------+----+----+
This would be easy with the withColumn (column-based) approach, but I want to know how to do it row-wise. My initial idea was to calculate column by column and then transpose the result to make it row-based.

IIUC, you can try the following:
Step-1: create a new dataframe with a single row that has current_date() as date and nulls for col1 and col2, then union it back to TEST_df (Note: change all 2020-08-26 to current_date() in your final code):
df_new = TEST_df.union(spark.sql("select '2020-08-26', null, null"))
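If the union complains about, or silently widens, the column types (TEST_df.date is a date and col1/col2 are ints, while the literal row is a string and untyped nulls), a minimal sketch with explicit casts so the new row matches TEST_df's schema:
# same single-row union, but with explicit types matching TEST_df's schema
df_new = TEST_df.union(
    spark.sql("select to_date('2020-08-26') as date, cast(null as int) as col1, cast(null as int) as col2")
)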
Edit: in practice, if the data are partitioned and each partition should get one new row, you can do something like the following:
from pyspark.sql.functions import current_date, col, lit

# columns used for Window partitionBy
cols_part = ['pcol1', 'pcol2']

df_today = TEST_df.select([
    (current_date() if c == 'date' else col(c) if c in cols_part else lit(None)).alias(c)
    for c in TEST_df.columns
]).distinct()

df_new = TEST_df.union(df_today)
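A quick way to sanity-check that exactly one row per partition was appended (pcol1/pcol2 are the placeholder partition columns from the snippet above, not columns of the toy TEST_df):
# the number of appended rows should equal the number of distinct partitions
added_rows = df_new.count() - TEST_df.count()
num_partitions = TEST_df.select(*cols_part).distinct().count()
print(added_rows, num_partitions)  # the two numbers should match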
Step-2: do calculations to fill the above null values:
df_new.selectExpr(
    "date",
    "IF(date < '2020-08-26', col1, lag(IF(col1 > 0, col1 + col2, 0)) over (order by date)) as col1",
    "lag(col2, 1, 0) over (order by date) as col2"
).show()
+----------+----+----+
| date|col1|col2|
+----------+----+----+
|2020-08-17| 0| 0|
|2020-08-18| 2| 0|
|2020-08-19| 0| 1|
|2020-08-20| 3| 2|
|2020-08-21| 4| 0|
|2020-08-22| 1| 2|
|2020-08-23| 2| 3|
|2020-08-24| 1| 2|
|2020-08-25| 3| 2|
|2020-08-26| 4| 1|
+----------+----+----+
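For reference, a sketch of the same Step-2 logic written with the column-based API instead of selectExpr (again, swap the literal '2020-08-26' for current_date() in the final code):
from pyspark.sql import functions as F, Window

w = Window.orderBy('date')
result = df_new.select(
    'date',
    F.when(F.col('date') < F.lit('2020-08-26'), F.col('col1'))
     .otherwise(F.lag(F.when(F.col('col1') > 0, F.col('col1') + F.col('col2')).otherwise(0)).over(w))
     .alias('col1'),
    F.lag('col2', 1, 0).over(w).alias('col2')
)
result.show()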

Related

How to do a groupBy by a given column but still keep all the rows of the original DataFrame?

I want to do a groupBy and aggregate by a given column in PySpark but I still want to keep all the rows from the original DataFrame.
For example, let's say we have the following DataFrame and we want to take the max of the "value" column; then we would get the result below.
Original DataFrame
+--+-----+
|id|value|
+--+-----+
| 1| 1|
| 1| 2|
| 2| 3|
| 2| 4|
+--+-----+
Result
+--+-----+---+
|id|value|max|
+--+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+--+-----+---+
You can do it simply by joining the aggregated dataframe back to the original dataframe:
from pyspark.sql import functions as F

aggregated_df = (
    df
    .groupby('id')
    .agg(F.max('value').alias('max'))
)
max_value_df = (
    df
    .join(aggregated_df, 'id')
)
Alternatively, use a window function (note that this needs Spark's F.max, not Python's built-in max, and Window must be imported):
from pyspark.sql import functions as F, Window

df.withColumn('max', F.max('value').over(Window.partitionBy('id'))).show()
+---+-----+---+
| id|value|max|
+---+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+---+-----+---+
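For completeness, a minimal runnable sketch that builds the example DataFrame above and applies the window approach (it assumes nothing beyond a running SparkSession named spark):
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1), (1, 2), (2, 3), (2, 4)], ['id', 'value'])

# the max per id is computed over a window, so every original row is kept
df.withColumn('max', F.max('value').over(Window.partitionBy('id'))).show()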

How to compare the value of one row with all the other rows in PySpark on grouped values

Problem statement
Consider the following data (see code generation at the bottom)
+-----+-----+-------+--------+
|index|group|low_num|high_num|
+-----+-----+-------+--------+
| 0| 1| 1| 1|
| 1| 1| 2| 2|
| 2| 1| 3| 3|
| 3| 2| 1| 3|
+-----+-----+-------+--------+
Then, for a given index, I want to count how many of the low_num values in its group that index's high_num is greater than.
For instance, consider the second row, with index 1. Index 1 is in group 1 and its high_num is 2. That high_num is greater than the low_num on index 0, equal to the low_num on index 1, and smaller than the low_num on index 2. So the high_num of index 1 is greater than a low_num in the group exactly once, and I want the value in the answer column to be 1.
Dataset with desired output
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
Dataset generation code
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .getOrCreate()
)

## Example df
## Note the inclusion of "desired", which is the desired output.
df = spark.createDataFrame(
    [
        (0, 1, 1, 1, 0),
        (1, 1, 2, 2, 1),
        (2, 1, 3, 3, 2),
        (3, 2, 1, 3, 1)
    ],
    schema=["index", "group", "low_num", "high_num", "desired"]
)
Pseudocode that might have solved the problem
The pseudocode might look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w_spec = Window.partitionBy("group").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)

## F.collect_list_when does not exist
## F.current_col does not exist
## Probably wouldn't work like this anyways
ddf = df.withColumn("Counts",
    F.size(F.collect_list_when(
        F.current_col("high_number") > F.col("low_number"), 1
    ).otherwise(None).over(w_spec))
)
You can do a filter on the collect_list, and check its size:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'desired',
    F.expr('size(filter(collect_list(low_num) over (partition by group), x -> x < high_num))')
)
df2.show()
df2.show()
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
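A roughly equivalent sketch with the steps split out in the DataFrame API (the filter/size higher-order functions in the expression need Spark 2.4+):
from pyspark.sql import functions as F, Window

# collect every low_num of the group onto each row, then count how many
# are strictly smaller than that row's high_num
w = Window.partitionBy('group')
df2 = (
    df.withColumn('group_low_nums', F.collect_list('low_num').over(w))
      .withColumn('desired', F.expr('size(filter(group_low_nums, x -> x < high_num))'))
      .drop('group_low_nums')
)
df2.show()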

How to apply functions in PySpark?

I have a function that returns the data for a specific date; it looks like this:
def specific_date(date_input):
    specificdate = """select *
                      from vw
                      where date = {date_1}
                   """.format(date_1=date_input)
    day_result = sqlContext.sql(specificdate)
    return day_result
and I have a dataframe that looks like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_date

df1_schema = StructType([StructField("Date", StringType(), True),
                         StructField("col1", IntegerType(), True),
                         StructField("id", StringType(), True),
                         StructField("col2", IntegerType(), True),
                         StructField("col3", IntegerType(), True),
                         StructField("col4", IntegerType(), True),
                         StructField("coln", IntegerType(), True)])
df_data = [('2020-08-01',0,'M1',3,3,2,2),('2020-08-02',0,'M1',2,3,0,1),
           ('2020-08-03',0,'M1',3,3,2,3),('2020-08-04',0,'M1',3,3,2,1),
           ('2020-08-01',0,'M2',1,3,3,1),('2020-08-02',0,'M2',-1,3,1,2)]
rdd = sc.parallelize(df_data)
df1 = sqlContext.createDataFrame(df_data, df1_schema)
df1 = df1.withColumn("Date", to_date("Date", 'yyyy-MM-dd'))
df1.show()
+----------+----+---+----+----+----+----+
| Date|col1| id|col2|col3|col4|coln|
+----------+----+---+----+----+----+----+
|2020-08-01| 0| M1| 3| 3| 2| 2|
|2020-08-02| 0| M1| 2| 3| 0| 1|
|2020-08-03| 0| M1| 3| 3| 2| 3|
|2020-08-04| 0| M1| 3| 3| 2| 1|
|2020-08-01| 0| M2| 1| 3| 3| 1|
|2020-08-02| 0| M2| -1| 3| 1| 2|
+----------+----+---+----+----+----+----+
df1.createOrReplaceTempView("vw")
Then if I call the function as specific_date(F.date_add('2020-08-01', 1)),
this should give me the dataframe where the date is '2020-08-02':
+----------+----+---+----+----+----+----+
| Date|col1| id|col2|col3|col4|coln|
+----------+----+---+----+----+----+----+
|2020-08-02| 0| M1| 2| 3| 0| 1|
|2020-08-02| 0| M2| -1| 3| 1| 2|
+----------+----+---+----+----+----+----+
I tried many methods to do this, but none of them seemed to work; any help would be appreciated.
If you really want to use a function that adds days to a given date and still runs the SQL query:
import datetime

def specific_date(date_input, days_to_add):
    start_date = datetime.datetime.strptime(date_input, "%Y-%m-%d")
    end_date = start_date + datetime.timedelta(days=days_to_add)
    specificdate = "SELECT * FROM vw WHERE Date = date_format('{date_1}', 'yyyy-MM-dd')".format(date_1=end_date)
    day_result = sqlContext.sql(specificdate)
    return day_result
and just call it with the date_input and days_to_add you want:
specific_date('2020-08-01', 1)
which will give you the dataframe
+----------+----+---+----+----+----+----+
| Date|col1| id|col2|col3|col4|coln|
+----------+----+---+----+----+----+----+
|2020-08-02| 0| M1| 2| 3| 0| 1|
|2020-08-02| 0| M2| -1| 3| 1| 2|
+----------+----+---+----+----+----+----+
But it would be far better to just use
day_result = df1.filter(df1.Date == '2020-08-02')
If you do not need a function that uses a temp view, you could easily do something like this:
import datetime
from pyspark.sql.functions import col

d = datetime.datetime.strptime("2020-08-01", "%Y-%m-%d")
d += datetime.timedelta(days=+1)
df1.where(col('Date') == d).show()
+----------+----+---+----+----+----+----+
| Date|col1| id|col2|col3|col4|coln|
+----------+----+---+----+----+----+----+
|2020-08-02| 0| M1| 2| 3| 0| 1|
|2020-08-02| 0| M2| -1| 3| 1| 2|
+----------+----+---+----+----+----+----+
One issue with the code you provided is that the Spark function F.date_add returns a Column object, so it cannot be formatted into the SQL string's WHERE clause the way a plain date value can.
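If you do want to keep date_add, a small sketch that wraps the literal in lit() so the whole comparison stays column-based (no temp view or SQL string needed):
from pyspark.sql import functions as F

# F.date_add expects a Column (or a column name), so wrap the literal date in lit()
df1.where(F.col('Date') == F.date_add(F.lit('2020-08-01'), 1)).show()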

Spark Dataframe: Group and rank rows on a certain column value

I am trying to assign a group rank based on the "ID" column, whose numbering runs from 1 up to some max and then resets to 1.
So the first three rows have continuous numbering on "ID"; hence they should be grouped together with group rank = 1. Rows four and five are in another group, with group rank = 2.
The rows are sorted by the "rownum" column. I am aware of the row_number window function, but I don't think I can apply it to this use case as there is no constant window. I can only think of looping through each row in the dataframe, but I am not sure how I can update a column when the numbering resets to 1.
val df = Seq(
  (1, 1),
  (2, 2),
  (3, 3),
  (4, 1),
  (5, 2),
  (6, 1),
  (7, 1),
  (8, 2)
).toDF("rownum", "ID")

df.show()
The expected result is the group_rank_of_ID column shown in the output below.
You can do it with two window functions: the first flags whether the ID increased compared to the previous row, the second calculates a running sum that increments whenever it did not:
df
  .withColumn("increase", $"ID" > lag($"ID", 1).over(Window.orderBy($"rownum")))
  .withColumn("group_rank_of_ID", sum(when($"increase", lit(0)).otherwise(lit(1))).over(Window.orderBy($"rownum")))
  .drop($"increase")
  .show()
gives:
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
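Since the rest of this page is PySpark, here is a sketch of the same two-window idea in Python (my translation, not the answerer's original Scala):
from pyspark.sql import functions as F, Window

w = Window.orderBy('rownum')
result = (
    df.withColumn('increase', F.col('ID') > F.lag('ID', 1).over(w))
      .withColumn('group_rank_of_ID',
                  F.sum(F.when(F.col('increase'), F.lit(0)).otherwise(F.lit(1))).over(w))
      .drop('increase')
)
result.show()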
As @Prithvi noted, we can use lag here.
The tricky part is that in order to use a window function such as lag, we need to at least provide an ordering.
Consider
val nextID = lag('ID, 1, -1) over Window.orderBy('rownum)
val isNewGroup = 'ID <= nextID cast "integer"
val group_rank_of_ID = sum(isNewGroup) over Window.orderBy('rownum)
/* you can try
df.withColumn("intermediate", nextID).show
// ^^^^^^^-- can be `isNewGroup`, or other vals
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 0|
| 2| 2| 0|
| 3| 3| 0|
| 4| 1| 1|
| 5| 2| 1|
| 6| 1| 2|
| 7| 1| 3|
| 8| 2| 3|
+------+---+----------------+
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID + 1).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
*/

How to select all columns in a Spark SQL query with an aggregation function

Hi I am new to spark sql.
I have a query like this.
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This prints only 3 columns.
tagShortID,Timestamp,maxAvgValue
But I want to display all the columns along with this one. Any help or suggestion would be appreciated.
One alternative, usually a good fit for your specific case, is to use window functions, because it avoids the need to join with the original data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy("tagShortID", "Timestamp")
val result = averageDF.withColumn("maxAvgValue", max($"RSSI_Weight_avg").over(windowSpec))
You can find here a good article explaining the Window Functions functionality in Spark.
Please note that it requires either Spark 2+ or a HiveContext in Spark versions 1.4 ~ 1.6.
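Since most of this page is PySpark, the same window approach in Python would look roughly like this (a sketch; column names are taken from the question):
from pyspark.sql import functions as F, Window

window_spec = Window.partitionBy('tagShortID', 'Timestamp')
result = averageDF.withColumn('maxAvgValue', F.max('RSSI_Weight_avg').over(window_spec))
result.show()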
Here is a simple example with the column names you have.
This is your averageDF dataframe with dummy data:
+----------+---------+---------------+---------+--------+---------------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|
+----------+---------+---------------+---------+--------+---------------+
| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2|
| 1| 1| 1| 1| 1| 1|
| 1| 1| 1| 1| 1| 1|
+----------+---------+---------------+---------+--------+---------------+
After you do the groupBy and aggregation:
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This does not return all the columns you selected, because after a groupBy and aggregation only the grouping columns and the aggregated column are returned, as below:
+----------+---------+-----------+
|tagShortID|Timestamp|maxAvgValue|
+----------+---------+-----------+
| 2| 2| 2|
| 1| 1| 1|
+----------+---------+-----------+
To get all the columns, you need to join these two dataframes:
averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp"))
and the final result will be
+----------+---------+---------------+---------+--------+---------------+-----------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|maxAvgValue|
+----------+---------+---------------+---------+--------+---------------+-----------+
| 2| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2| 2|
| 1| 1| 1| 1| 1| 1| 1|
| 1| 1| 1| 1| 1| 1| 1|
+----------+---------+---------------+---------+--------+---------------+-----------+
I hope this clears your confusion.