Count the number of duplicate values during the preceding time period - pyspark

I am looking to count, for each row, how many duplicate values occurred in the preceding time period (24 hours, to keep it simple).
For example, given the data set below, I want to come up with a per-row count of duplicates within the previous 24 hours.
Here is some code to create the dataframe:
data = [("5812", "2020-12-27T17:28:32.000+0000"),("5812", "2020-12-25T17:35:32.000+0000"), ("5812", "2020-12-25T13:04:05.000+0000"), ("7999", "2020-12-25T09:23:01.000+0000"),("5999","2020-12-25T07:29:52.000+0000"), ("5814", "2020-12-25T12:23:05.000+0000"), ("5814", "2020-12-25T11:52:57.000+0000"), ("5814", "2020-12-24T11:00:57.000+0000") ,("5999", "2020-12-24T07:29:52.000+0000")]
df = spark.createDataFrame(data).toDF(*columns)
I have been using window functions to get things such as distinct counts, sums, etc., but I can't quite figure out how to get a count of duplicate values over the same time period.

Let's try this step by step:
Cast the timestamp column to TimestampType.
Create a column with the collect_list of mcc over the last 24 hours (say mcc_list), using a window with a frame of range between interval 24 hours preceding and current row.
Create a column with the unique values of mcc_list (say mcc_set) using the array_distinct function. This column could also be created with collect_set over the same window as in step 2.
For each value of mcc_set, get its count in mcc_list. A duplicated mcc value will have a count greater than 1, so we can filter on that. After filtering, the array only contains the duplicated mcc values; use size to count how many mcc are duplicated in the last 24 hours.
These steps put into code could look like this:
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = (df
      .withColumn('ts', F.col('timestamp').cast(TimestampType()))
      .withColumn('mcc_list', F.expr("collect_list(mcc) over (order by ts range between interval 24 hours preceding and current row)"))
      .withColumn('mcc_set', F.array_distinct('mcc_list'))
      .withColumn('dups', F.expr("size(filter(transform(mcc_set, a -> size(filter(mcc_list, b -> b = a))), c -> c > 1))"))
      # .drop('ts', 'mcc_list', 'mcc_set')
     )
df.show(truncate=False)
# +----+----------------------------+-------------------+------------------------------------+------------------------+----+
# |mcc |timestamp |ts |mcc_list |mcc_set |dups|
# +----+----------------------------+-------------------+------------------------------------+------------------------+----+
# |5812|2020-12-27T17:28:32.000+0000|2020-12-27 17:28:32|[5812] |[5812] |0 |
# |5812|2020-12-25T17:35:32.000+0000|2020-12-25 17:35:32|[5999, 7999, 5814, 5814, 5812, 5812]|[5999, 7999, 5814, 5812]|2 |
# |5812|2020-12-25T13:04:05.000+0000|2020-12-25 13:04:05|[5999, 7999, 5814, 5814, 5812] |[5999, 7999, 5814, 5812]|1 |
# |5814|2020-12-25T12:23:05.000+0000|2020-12-25 12:23:05|[5999, 7999, 5814, 5814] |[5999, 7999, 5814] |1 |
# |5814|2020-12-25T11:52:57.000+0000|2020-12-25 11:52:57|[5999, 7999, 5814] |[5999, 7999, 5814] |0 |
# |7999|2020-12-25T09:23:01.000+0000|2020-12-25 09:23:01|[5814, 5999, 7999] |[5814, 5999, 7999] |0 |
# |5999|2020-12-25T07:29:52.000+0000|2020-12-25 07:29:52|[5999, 5814, 5999] |[5999, 5814] |1 |
# |5814|2020-12-24T11:00:57.000+0000|2020-12-24 11:00:57|[5999, 5814] |[5999, 5814] |0 |
# |5999|2020-12-24T07:29:52.000+0000|2020-12-24 07:29:52|[5999] |[5999] |0 |
# +----+----------------------------+-------------------+------------------------------------+------------------------+----+
You can drop unwanted columns afterwards.
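For example, a minimal sketch of that final cleanup, keeping only the original columns plus the new count:
# drop the helper columns created above, leaving mcc, timestamp and dups
result = df.drop('ts', 'mcc_list', 'mcc_set')
result.show(truncate=False)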

Related

Calculating difference between two dates in PySpark

Currently I'm working with a dataframe and need to calculate the number of days (as an integer) between two dates stored as timestamps.
I've opted for this solution:
from pyspark.sql.functions import lit, when, col, datediff
df1 = df1.withColumn("LD", datediff("MD", "TD"))
But when calculating a sum from a list of columns I get the error "Column is not iterable", which makes it impossible for me to calculate the sum of the rows based on column names:
col_list = ["a", "b", "c"]
df2 = df1.withColumn("My_Sum", sum([F.col(c) for c in col_list]))
How can I deal with it in order to calculate the difference between dates and then calculate the sum of the rows given the names of certain columns?
The datediff has nothing to do with the sum of a column. The pyspark SQL sum function takes in one column and calculates the sum of the rows in that column.
Here are a couple of ways to get the sum of a column from a list of columns using list comprehension.
Single-row output with the sum of each column:
from pyspark.sql import functions as func

data_sdf. \
    select(*[func.sum(c).alias(c+'_sum') for c in col_list]). \
    show()
# +-----+-----+-----+
# |a_sum|b_sum|c_sum|
# +-----+-----+-----+
# | 1337| 3778| 6270|
# +-----+-----+-----+
The sum of all rows of each column, repeated in every row:
from pyspark.sql.window import Window as wd

data_sdf. \
    select('*',
           *[func.sum(c).over(wd.partitionBy()).alias(c+'_sum') for c in col_list]
           ). \
    show(5)
# +---+---+---+-----+-----+-----+
# | a| b| c|a_sum|b_sum|c_sum|
# +---+---+---+-----+-----+-----+
# | 45| 58|125| 1337| 3778| 6270|
# | 9| 99|143| 1337| 3778| 6270|
# | 33| 91|146| 1337| 3778| 6270|
# | 21| 85|118| 1337| 3778| 6270|
# | 30| 55|101| 1337| 3778| 6270|
# +---+---+---+-----+-----+-----+
# only showing top 5 rows
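For the row-wise sum the question actually asks about (adding a, b and c within each row), a minimal sketch is to fold the column expressions together with Python's reduce. A common cause of the "Column is not iterable" error is pyspark's sum shadowing the Python builtin (e.g. after a star import from pyspark.sql.functions), and the reduce form sidesteps that ambiguity:
from functools import reduce
import operator
import pyspark.sql.functions as F

col_list = ['a', 'b', 'c']
# build col('a') + col('b') + col('c') without relying on the builtin sum
row_sum = reduce(operator.add, [F.col(c) for c in col_list])
df2 = df1.withColumn('My_Sum', row_sum)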

How to plot a graph with data gaps in Zeppelin?

The dataframe was registered as a temp table to plot the data density per time unit (1 day):
val dailySummariesDf =
  getDFFromJdbcSource(SparkSession.builder().appName("test").master("local").getOrCreate(),
      s"SELECT * FROM values WHERE time > '2020-06-06' and devicename='Voltage' limit 100000000")
    .persist(StorageLevel.MEMORY_ONLY_SER)
    .groupBy($"digital_twin_id", window($"time", "1 day")).count().as("count")
    .withColumn("windowstart", col("window.start"))
    .withColumn("windowstartlong", unix_timestamp(col("window.start")))
    .orderBy("windowstart")

dailySummariesDf.registerTempTable("bank")
Then I plot it with the %sql interpreter:
%sql
select windowstart, count
from bank
and
%sql
select windowstartlong, count
from bank
What I get is a densely plotted graph: October days are drawn right after August, with no gap shown for September. My expectation is to have gaps in the graph, as there were days with no data at all.
How can I force those graphs to display gaps and respect the real X-axis values?
Indeed, grouping a dataset by window column won't produce any rows for the intervals that did not contain any original rows within those intervals.
One way to deal with that is to add a bunch of fake rows ("manually fill in the gaps" in the raw dataset), and only then apply a groupBy with window. For your case, that can be done by creating a trivial one-column dataset containing all the dates within the range you're interested in, and then joining it to your original dataset.
Here is my quick attempt:
import spark.implicits._
import org.apache.spark.sql.types._
// Define sample data
val df = Seq(("a","2021-12-01"),
("b","2021-12-01"),
("c","2021-12-01"),
("a","2021-12-02"),
("b","2021-12-17")
).toDF("c","d").withColumn("d",to_timestamp($"d"))
// Define a dummy dataframe for the range 12/01/2021 - 12/30/2021
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
val start = DateTime.parse("2021-12-01",DateTimeFormat.forPattern("yyyy-MM-dd")).getMillis/1000
val end = start + 30*24*60*60
val temp = spark.range(start,end,24*60*60).toDF().withColumn("tc",to_timestamp($"id".cast(TimestampType))).drop($"id")
// Fill the gaps in original dataframe
val nogaps = temp.join(df, temp.col("tc") === df.col("d"), "left")
// Aggregate counts by a tumbling 1-day window
val result = nogaps.groupBy(window($"tc","1 day","1 day","5 hours")).agg(sum(when($"c".isNotNull,1).otherwise(0)).as("count"))
result.withColumn("windowstart",to_date(col("window.start"))).select("windowstart","count").orderBy("windowstart").show(false)
+-----------+-----+
|windowstart|count|
+-----------+-----+
|2021-12-01 |3 |
|2021-12-02 |1 |
|2021-12-03 |0 |
|2021-12-04 |0 |
|2021-12-05 |0 |
|2021-12-06 |0 |
|2021-12-07 |0 |
|2021-12-08 |0 |
|2021-12-09 |0 |
|2021-12-10 |0 |
|2021-12-11 |0 |
|2021-12-12 |0 |
|2021-12-13 |0 |
|2021-12-14 |0 |
|2021-12-15 |0 |
|2021-12-16 |0 |
|2021-12-17 |1 |
|2021-12-18 |0 |
|2021-12-19 |0 |
|2021-12-20 |0 |
+-----------+-----+
For illustration purposes only :)
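For reference, a minimal pyspark sketch of the same gap-filling idea (illustrative names; it uses the SQL sequence function to generate one row per day and left-joins the daily counts onto it):
import pyspark.sql.functions as F

# hypothetical input: one row per event with a timestamp column `d`
events = spark.createDataFrame(
    [("a", "2021-12-01"), ("b", "2021-12-01"), ("c", "2021-12-01"),
     ("a", "2021-12-02"), ("b", "2021-12-17")],
    ["c", "d"]).withColumn("d", F.to_timestamp("d"))

# one row per calendar day in the range of interest
days = spark.sql(
    "SELECT explode(sequence(to_date('2021-12-01'), to_date('2021-12-30'), interval 1 day)) AS day")

# daily counts, with 0 for days that had no events at all
daily = events.groupBy(F.to_date("d").alias("day")).count()
filled = (days.join(daily, "day", "left")
              .withColumn("count", F.coalesce(F.col("count"), F.lit(0)))
              .orderBy("day"))
filled.show()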

Why aggregation function pyspark.sql.functions.collect_list() adds local timezone offset on display?

I run the following code in a pyspark shell session. Running collect_list() after a groupBy changes how timestamps are displayed (a UTC+02:00 offset is added, probably because this is the local offset in Greece, where the code is run). Although the display is problematic, the timestamp under the hood remains unchanged. This can be observed either by adding a column with the actual unix timestamps or by reverting the dataframe to its initial shape using pyspark.sql.functions.explode(). Is this a bug?
import datetime
import os
import time

from pyspark.sql import functions, types

# configure UTC timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
os.environ['TZ'] = 'UTC'
time.tzset()
# create DataFrame
date_time = datetime.datetime(year = 2019, month=1, day=1, hour=12)
data = [(1, date_time), (1, date_time)]
schema = types.StructType([types.StructField("id", types.IntegerType(), False), types.StructField("time", types.TimestampType(), False)])
df_test = spark.createDataFrame(data, schema)
df_test.show()
+---+-------------------+
| id| time|
+---+-------------------+
| 1|2019-01-01 12:00:00|
| 1|2019-01-01 12:00:00|
+---+-------------------+
# GroupBy and collect_list
df_test1 = df_test.groupBy("id").agg(functions.collect_list("time"))
df_test1.show(1, False)
+---+----------------------------------------------+
|id |collect_list(time) |
+---+----------------------------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|
+---+----------------------------------------------+
# add column with unix timestamps
to_timestamp = functions.udf(lambda x : [value.timestamp() for value in x], types.ArrayType(types.FloatType()))
df_test1.withColumn("unix_timestamp",to_timestamp(functions.col("collect_list(time)")))
df_test1.show(1, False)
+---+----------------------------------------------+----------------------------+
|id |collect_list(time) |unix_timestamp |
+---+----------------------------------------------+----------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|[1.54634394E9, 1.54634394E9]|
+---+----------------------------------------------+----------------------------+
# explode list to distinct rows
df_test1.groupBy("id").agg(functions.collect_list("time")).withColumn("test", functions.explode(functions.col("collect_list(time)"))).show(2, False)
+---+----------------------------------------------+-------------------+
|id |collect_list(time) |test |
+---+----------------------------------------------+-------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
+---+----------------------------------------------+-------------------+
ps. 1.54634394E9 == 2019-01-01 12:00:00, which is the correct UTC timestamp
For me the code above works, but does not convert the time as in your case.
Maybe check what your session time zone is (and, optionally, set it to a specific tz):
spark.conf.get('spark.sql.session.timeZone')
In general, TimestampType in pyspark is not tz-aware like in Pandas; rather, it stores long ints and displays them according to your machine's local time zone (by default).
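A minimal sketch of that session time zone check, plus a way to confirm that the stored epoch value is unchanged regardless of how it is rendered:
import pyspark.sql.functions as F

# which time zone is used to render timestamps in this session?
print(spark.conf.get("spark.sql.session.timeZone"))

# the underlying epoch seconds are unaffected by how the values are displayed
df_test.select("time", F.col("time").cast("long").alias("epoch_seconds")).show()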

Spark Dataframes: Add Conditional column to dataframe

I want to add a conditional column Flag to dataframe A. When the following two conditions are satisfied, set Flag to 1, otherwise 0:
num from dataframe A is between numStart and numEnd from dataframe B.
If the above condition is satisfied, check whether include is 1.
DataFrame A (it's a very big dataframe, containing millions of rows):
+----+------+-----+------------------------+
|num |food |price|timestamp |
+----+------+-----+------------------------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|
+----+------+-----+------------------------+
DataFrame B (it's a very small DF, containing only 100 rows):
+----------+-----------+-------+
|numStart |numEnd |include|
+----------+-----------+-------+
|0 |200 |1 |
|250 |1050 |0 |
|2000 |3000 |1 |
|10001 |15001 |1 |
+----------+-----------+-------+
Expected output:
+----+------+-----+------------------------+----------+
|num |food |price|timestamp |Flag |
+----+------+-----+------------------------+----------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|0 |
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|1 |
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|1 |
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|0 |
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|0 |
+----+------+-----+------------------------+----------+
You can left-join dfB to dfA based on the first condition you described, then build a Flag column using withColumn and the coalesce function to "default" to 0:
Records for which a match was found would use the include value of the matching dfB record.
Records for which there was no match would have include=null, and per your requirement such records should get Flag=0, so we use coalesce, which in the case of null falls back to the literal lit(0).
Lastly, get rid of the dfB columns which are of no interest to you:
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession
dfA.join(dfB, $"num".between($"numStart", $"numEnd"), "left")
.withColumn("Flag", coalesce($"include", lit(0)))
.drop(dfB.columns: _*)
.show()
// +----+------+-----+--------------------+----+
// | num| food|price| timestamp|Flag|
// +----+------+-----+--------------------+----+
// |1275|tomato| 1.99|2018-07-21T00:00:...| 0|
// | 145|carrot| 0.45|2018-07-21T00:00:...| 1|
// |2678| apple| 0.99|2018-07-21T01:00:...| 1|
// |6578|banana| 1.29|2018-07-20T01:11:...| 0|
// |1001| taco| 2.59|2018-07-21T01:00:...| 0|
// +----+------+-----+--------------------+----+
Join the two dataframes together on the first condition while keeping all rows in dataframe A (i.e. with a left join, see code below). After the join, the include column can be renamed to Flag and any null values inside it are set to 0. The two extra columns, numStart and numEnd, are dropped.
The code can thus be written as follows:
A.join(B, $"num" >= $"numStart" && $"num" <= $"numEnd", "left")
.withColumnRenamed("include", "Flag")
.drop("numStart", "numEnd")
.na.fill(Map("Flag" -> 0))

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

Trying to populate the grouper column like below. In the table below, X signifies the start of a new record, so each X, Y, Z run needs to be grouped together. In MySQL, I would accomplish this with:
select @x:=1;
update table set grouper=if(column_1='X',@x:=@x+1,@x);
I am trying to see if there is a way to do this without using a loop, using .withColumn or something similar.
What I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
A simple window function with the row_number() built-in function should get you your desired output:
val df = Seq(
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z"),
Tuple1("X"),
Tuple1("Y"),
Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the last orderBy is just to match the expected output, for visualization only. On a real cluster, ordering the whole output like that doesn't make much sense during actual processing.
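As an aside, a minimal pyspark sketch of the running-counter idea from the MySQL snippet (increment the grouper on every 'X'), assuming an explicit ordering column exists; monotonically_increasing_id() is only a stand-in here for whatever defines the row order:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

letters = spark.createDataFrame([("X",), ("Y",), ("Z",), ("X",), ("Y",), ("Z",)], ["column_1"])
letters = letters.withColumn("seq", F.monotonically_increasing_id())

# running count of 'X' rows seen so far, in row order
w = Window.orderBy("seq").rowsBetween(Window.unboundedPreceding, Window.currentRow)
letters.withColumn("grouper",
                   F.sum(F.when(F.col("column_1") == "X", 1).otherwise(0)).over(w)).show()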