How to Divide Time Intervals into Time Periods - pyspark

I have a dataframe created with Spark SQL containing IDs with their corresponding checkin_datetime and checkout_datetime values, as shown below.
I would like to divide each check-in/check-out interval into one-hour time periods.
Code to create the Spark dataframe:
import pandas as pd

data = {'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']}

df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = spark.createDataFrame(df)

To compute the hourly intervals:
First, explode hourly intervals between checkin_datetime and checkout_datetime. We do this by computing the number of hours between checkin_datetime and checkout_datetime and iterating over that range to generate the interval boundaries.
Once the intervals are exploded into a next_hour column, we can use it to measure the gap between checkin_datetime and next_hour, or between checkout_datetime and next_hour.
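For intuition, the heart of the approach is F.sequence plus F.explode: build an array with one entry per hour boundary covered by the stay, then explode it into rows. A minimal sketch of just that step (not the exact expression used in the full solution, and it assumes checkin_datetime and checkout_datetime have already been converted to timestamps as done below):
from pyspark.sql import functions as F

# Illustration only: one row per hour boundary covered by each check-in/check-out interval.
hourly = df1.withColumn(
    "hour_start",
    F.explode(
        F.sequence(
            F.date_trunc("hour", "checkin_datetime"),
            F.date_trunc("hour", "checkout_datetime"),
            F.expr("INTERVAL 1 HOUR"),
        )
    ),
)
hourly.show(truncate=False)

The full solution: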
from pyspark.sql import functions as F
import pandas as pd
data = {'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']}

df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = (spark.createDataFrame(df)
       .withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm"))
       .withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm")))
# Epoch seconds for the check-in/check-out times and the check-in hour floor.
unix_checkin = F.unix_timestamp("checkin_datetime")
unix_checkout = F.unix_timestamp("checkout_datetime")
start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")
# Number of hour boundaries covered by the interval.
diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)
# One row per covered hour, holding the end of each hourly bucket (F.transform with a Python lambda needs Spark 3.1+).
next_hour = F.explode(
    F.transform(F.sequence(F.lit(0), diff_hours),
                lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))
# Minutes of the interval that fall into each hourly bucket.
minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"), (unix_checkout - unix_checkin) / 60)
          .when(checkout_next_hour == F.col("next_hour"), (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
          .otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
          ).cast("int")
(df1.withColumn("next_hour", next_hour)
.withColumn("minutes", minute)
.withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
.withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
.select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
).show()
"""
+---+-------------------+-------------------+----------+---+-------+
| ID| checkin_datetime| checkout_datetime| day| hr|minutes|
+---+-------------------+-------------------+----------+---+-------+
| 4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13| 2|
| 4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13| 3|
| 4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14| 1|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14| 5|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15| 6|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20| 37|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21| 32|
| 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21| 4|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23| 38|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02| 0| 23|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 1| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 2| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 3| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 4| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 5| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 6| 15|
+---+-------------------+-------------------+----------+---+-------+
"""

Related

Collect statistics from the DataFrame

I'm collecting dataframe statistics:
the maximum, minimum, and average value of each column, the number of zeros in each column, and the number of null values in each column.
Conditions:
Number of columns n < 2000
Number of dataframe entries r < 10^9
The stack() function is used for the solution:
https://www.hadoopinrealworld.com/understanding-stack-function-in-spark/
What worries me:
the number of rows in the intermediate dataframe resultDF: col("period_date").dropDuplicates * columnsNames.size * r = a lot.
Input:
val columnsNames = List("col_name1", "col_name2")
+---------+---------+-----------+
|col_name1|col_name2|period_date|
+---------+---------+-----------+
| 11| 21| 2022-01-31|
| 12| 22| 2022-01-31|
| 13| 23| 2022-03-31|
+---------+---------+-----------+
Output:
+-----------+---------+----------+----------+---------+---------+---------+
|period_date| columns|count_null|count_zero|avg_value|min_value|max_value|
+-----------+---------+----------+----------+---------+---------+---------+
| 2022-01-31|col_name2| 0| 0| 21.5| 21| 22|
| 2022-03-31|col_name1| 0| 0| 13.0| 13| 13|
| 2022-03-31|col_name2| 0| 0| 23.0| 23| 23|
| 2022-01-31|col_name1| 0| 0| 11.5| 11| 12|
+-----------+---------+----------+----------+---------+---------+---------+
My solution:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local").appName("spark test5").getOrCreate()
import spark.implicits._
case class RealStructure(col_name1: Int, col_name2: Int, period_date: String)
val userTableDf = List(
  RealStructure(11, 21, "2022-01-31"),
  RealStructure(12, 22, "2022-01-31"),
  RealStructure(13, 23, "2022-03-31")
).toDF()
//userTableDf.show()
//Start
new StatisticCollector(userTableDf)
class StatisticCollector(userTableDf: DataFrame) {

  val columnsNames = List("col_name1", "col_name2")
  val stack = s"stack(${columnsNames.length}, ${columnsNames.map(name => s"'$name', $name").mkString(",")})"

  val resultDF = userTableDf.select(col("period_date"),
    expr(s"$stack as (columns, values)")
  )

  //resultDF.show()
  println(stack)
  /**
  +-----------+---------+------+
  |period_date|  columns|values|
  +-----------+---------+------+
  | 2022-01-31|col_name1|    11|
  | 2022-01-31|col_name2|    21|
  | 2022-01-31|col_name1|    12|
  | 2022-01-31|col_name2|    22|
  | 2022-03-31|col_name1|    13|
  | 2022-03-31|col_name2|    23|
  +-----------+---------+------+
  stack(2, 'col_name1', col_name1,'col_name2', col_name2)
  **/

  val superResultDF = resultDF.groupBy(col("period_date"), col("columns")).agg(
    sum(when(col("values").isNull, 1).otherwise(0)).alias("count_null"),
    sum(when(col("values") === 0, 1).otherwise(0)).alias("count_zero"),
    avg("values").cast("double").alias("avg_value"),
    min(col("values")).alias("min_value"),
    max(col("values")).alias("max_value")
  )

  superResultDF.show()
}
Please give your assessment: if you see a way to solve this more efficiently, please describe how you would do it.
Calculation speed is important; it needs to be as fast as possible.
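Not an authoritative answer, but one way to avoid materializing the stacked dataframe with r * columnsNames.size rows is to aggregate first and unpivot only the aggregated result, which has just one row per period_date. A PySpark sketch of that idea (the Scala translation is mechanical; userTableDf and the column list are placeholders matching the code above):
from pyspark.sql import functions as F

columns = ["col_name1", "col_name2"]

# One aggregation pass over the raw data: five metrics per column, grouped by period_date.
aggs = []
for c in columns:
    aggs += [
        F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(f"{c}__count_null"),
        F.sum(F.when(F.col(c) == 0, 1).otherwise(0)).alias(f"{c}__count_zero"),
        F.avg(c).alias(f"{c}__avg_value"),
        F.min(c).alias(f"{c}__min_value"),
        F.max(c).alias(f"{c}__max_value"),
    ]
wide = userTableDf.groupBy("period_date").agg(*aggs)

# The aggregated dataframe is tiny (one row per period_date), so stacking it afterwards is cheap.
stack_expr = "stack({n}, {args}) as (columns, count_null, count_zero, avg_value, min_value, max_value)".format(
    n=len(columns),
    args=",".join(
        f"'{c}', {c}__count_null, {c}__count_zero, {c}__avg_value, {c}__min_value, {c}__max_value"
        for c in columns
    ),
)
wide.select("period_date", F.expr(stack_expr)).show()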

six months record of every account number in pyspark

I have tried the rank function, but it ranks individual transactions, so filtering on rank goes beyond 180 days.
This is the result I am getting, but it is not what I want: it is wrong because it includes transactions beyond 180 days.
window = Window.partitionBy(df3['acctno']).orderBy(df3['trans_date'])
df3.select('*', rank().over(window).alias('rank')) \
.filter(col('rank') <= 180) \
.show(500)
+----+-----+---+----------+----------+--------------+------+-------+----+
|year|month|day|      date|  txnrefid|           acc|branch|channel|rank|
+----+-----+---+----------+----------+--------------+------+-------+----+
|2020|    2|  6|2020-02-06| 1234abcd6| 2074-556-1111|  6666|    CBS|   1|
|2020|    2|  7|2020-02-07| 1234abcd7| 2074-556-1111|  6666|    CBS|   2|
|2020|    2|  8|2020-02-08| 1234abcd8| 2074-556-1111|  6666|    CBS|   3|
|2020|    2|  9|2020-02-09| 1234abcd9| 2074-556-1111|  6666|    CBS|   4|
+----+-----+---+----------+----------+--------------+------+-------+----+
But I want it like this:
+----+-----+---+----------+----------+--------------+------+-------+----+
|year|month|day|      date|  txnrefid|           acc|branch|channel|rank|
+----+-----+---+----------+----------+--------------+------+-------+----+
|2020|    2|  6|2020-02-06| 1234abcd6| 2074-556-1111|  6666|    CBS|   1|
|2020|    2|  7|2020-02-07| 1234abcd7| 2074-556-1111|  6666|    CBS|   2|
|2020|    2|  8|2020-02-08| 1234abcd8| 2074-556-1111|  6666|    CBS|   3|
|2020|    2|  9|2020-02-09| 1234abcd9| 2074-556-1111|  6666|    CBS|   4|
+----+-----+---+----------+----------+--------------+------+-------+----+
As you edited your question, here is a new answer that uses a different approach.
The idea is to get the min date for each account number, compute the limit date (min date + 180 days), then remove all the lines that fall after it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.count()  # I used your sample data, so 60 lines
# > 60

w = Window.partitionBy(df["acctno"])
df = df.withColumn("min_date", F.min(F.col("trans_date").cast("date")).over(w))
df = df.where(
    F.col("trans_date").cast("date")
    <= F.date_add(  # Use F.date_add to add days or F.add_months to add months.
        F.col("min_date"), 180
    )
).drop("min_date")

df.count()  # Final dataframe limited to 180 days, nothing later than 2020-08-04
# > 54
If you want the first 6 months, then you should use the fields "year" and "month", not "trans_date".
Something like window = Window.partitionBy(df3['acctno']).orderBy(df3['year'], df3['month']) should give you better results.
Then you filter on rank <= 6:
df3.select("*", dense_rank().over(window).alias("rank")).filter(col("rank") <= 6).show(500)
EDIT: You need to use dense_rank so that all rows from the same (year, month) share the same rank value.

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe, which is ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need the cumulative sum of C1 to C100, partitioned by ID and ordered by date.
The target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1-C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
Window.partitionBy("ID")
.orderBy(col("date").asc)))
I found a similar question here, but it does this manually for just two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.range(1000)
.withColumn("c1", 'id + 3)
.withColumn("c2", 'id % 2 + 1)
.withColumn("date", monotonically_increasing_id)
.withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
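For reference, a rough PySpark equivalent of the select-expression approach (a sketch, assuming df holds the ID, Date and C1..C100 columns from the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running-sum window per ID, ordered by date.
w = Window.partitionBy("ID").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Every column except the keys is replaced by its cumulative sum under the same name.
columns_to_sum = [c for c in df.columns if c not in ("ID", "Date")]
df.select("ID", "Date", *[F.sum(c).over(w).alias(c) for c in columns_to_sum]).show()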

Pyspark groupBy Pivot Transformation

I'm having a hard time framing the following Pyspark dataframe manipulation.
Essentially I am trying to group by category and then pivot/unmelt the subcategories and add new columns.
I've tried a number of ways, but they are very slow and do not leverage Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit

df = sqlContext.table('Table')

#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
    #loop over subcategory
    listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
    dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
    num = 1
    for b in dfArraySub:
        #renames all columns to append a number
        for c in b.columns:
            if c not in ['category', 'sub_category', 'date']:
                column_name = str(c)+'_'+str(num)
                b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
        b = b.drop('sub_category')
        num += 1
        #if no df exists, create one and continually join new columns
        try:
            all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry', 'date'], how='left')
        except:
            all_subs = b
    #Fixes missing columns on union
    try:
        try:
            diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
            for d in diff_columns:
                all_subs = all_subs.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
        except:
            diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
            for d in diff_columns:
                all_cats = all_cats.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
    except Exception as e:
        print e
        all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!
Your output is not really logical, but we can achieve this result using the pivot function. You need to specify your rules more precisely; otherwise, I can see a lot of cases where it may fail.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
(df.withColumn("fg", F.row_number().over(Window.partitionBy("date", "category").orderBy("sub_category")))
   .groupBy("date", "category")
   .pivot("fg")
   .sum("metric_sales", "metric_trans")
   .show())
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
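If the auto-generated column names above are unwieldy, one variation (a sketch using the same fg column) is to pass aliased aggregations to agg, so the pivot produces columns like 1_sales, 1_trans, 2_sales, and so on:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same pivot as above, but with named aggregations for readable column names.
(df.withColumn("fg", F.row_number().over(Window.partitionBy("date", "category").orderBy("sub_category")))
   .groupBy("date", "category")
   .pivot("fg")
   .agg(F.sum("metric_sales").alias("sales"), F.sum("metric_trans").alias("trans"))
   .show())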

drop all columns with a special condition on a column spark

I have a dataset and I need to drop the columns which have a standard deviation equal to 0. I've tried:
val df = spark.read.option("header",true)
.option("inferSchema", "false").csv("C:/gg.csv")
val finalresult = df
.agg(df.columns.map(stddev(_)).head, df.columns.map(stddev(_)).tail: _*)
I want to compute the standard deviation of each column and drop the column if it is equal to zero.
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe,
0,87,72,160,5,0.6993,2.9421,2.3745,3,4,
1,54,70,163,5,0.6301,2.7273,2.2205,3,4,
2,72,51,164,5,0.6551,2.9834,2.3993,3,4,
3,75,74,170,5,0.6966,2.9654,2.3699,3,4,
4,108,62,165,5,0.6087,2.7093,2.1619,3,4,
5,84,61,159,5,0.6876,2.938,2.3601,3,4,
6,89,64,168,5,0.6757,2.9547,2.3676,3,4,
7,75,72,160,5,0.7432,2.9331,2.3339,3,4,
8,64,62,153,5,0.6505,2.7676,2.2255,3,4,
9,82,58,159,5,0.6748,2.992,2.4043,3,4,
10,67,49,160,5,0.6633,2.9367,2.333,3,4,
11,85,53,160,5,0.6821,2.981,2.3822,3,4,
You can try this: use getValuesMap and filter to get the column names which you want to drop, and then drop them:
//Extract the standard deviation from the data frame summary:
val stddev = df.describe().filter($"summary" === "stddev").drop("summary").first()
// Use `getValuesMap` and `filter` to get the columns names where stddev is equal to 0:
val to_drop = stddev.getValuesMap[String](df.columns).filter{ case (k, v) => v.toDouble == 0 }.keys
//Drop 0 stddev columns
df.drop(to_drop.toSeq: _*).show
+---------+-----+---+------+------+---------+--------+
|RowNumber|Poids|Age|Taille| Hmean|CoocParam|LdpParam|
+---------+-----+---+------+------+---------+--------+
| 0| 87| 72| 160|0.6993| 2.9421| 2.3745|
| 1| 54| 70| 163|0.6301| 2.7273| 2.2205|
| 2| 72| 51| 164|0.6551| 2.9834| 2.3993|
| 3| 75| 74| 170|0.6966| 2.9654| 2.3699|
| 4| 108| 62| 165|0.6087| 2.7093| 2.1619|
| 5| 84| 61| 159|0.6876| 2.938| 2.3601|
| 6| 89| 64| 168|0.6757| 2.9547| 2.3676|
| 7| 75| 72| 160|0.7432| 2.9331| 2.3339|
| 8| 64| 62| 153|0.6505| 2.7676| 2.2255|
| 9| 82| 58| 159|0.6748| 2.992| 2.4043|
| 10| 67| 49| 160|0.6633| 2.9367| 2.333|
| 11| 85| 53| 160|0.6821| 2.981| 2.3822|
+---------+-----+---+------+------+---------+--------+
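For completeness, a rough PySpark version of the same idea (a sketch; it casts each column to double before computing the standard deviation, since the CSV was read without schema inference):
from pyspark.sql import functions as F

# Compute the stddev of every column in a single pass, then drop the zero-stddev ones.
stddev_row = df.select([F.stddev(F.col(c).cast("double")).alias(c) for c in df.columns]).first()
to_drop = [c for c in df.columns if stddev_row[c] is not None and stddev_row[c] == 0.0]
df.drop(*to_drop).show()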
OK, I have written a solution that is independent of your dataset. Required imports and example data:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, stddev, col}
val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
df.show(5)
// +---+---+
// | X1| X2|
// +---+---+
// | 1| 0|
// | 2| 0|
// | 3| 0|
// | 4| 0|
// | 5| 0|
First compute standard deviation by column:
// no need to rename, but I did it to make the result more
// human-readable when you show stddevs
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)
stddevs.show // stddevs contains the stddev of each column
// +-----------------+---+
// | X1| X2|
// +-----------------+---+
// |288.5307609250702|0.0|
// +-----------------+---+
Collect the first row and filter columns to keep:
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(df.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
Select and check the results:
df.select(columnsToKeep: _*).show
// +---+
// | X1|
// +---+
// | 1|
// | 2|
// | 3|
// | 4|
// | 5|