I have a Spark SQL DataFrame with date column, and what I'm trying to get is all the rows preceding current row in a given date range. So for example I want to have all the rows from 7 days back preceding given row. I figured out, I need to use a Window Function like:
Window \
.partitionBy('id') \
.orderBy('start')
I want to have a rangeBetween 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects using SQL API, but the DataFrame API support is still work in progress.
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, mean(some_value) OVER (
PARTITION BY id
ORDER BY CAST(start AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean FROM df""").show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Spark < 2.3
As far as I know it is not possible directly neither in Spark nor Hive. Both require ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming start column contains date type:
from pyspark.sql import Row
row = Row("id", "start", "some_value")
df = sc.parallelize([
row(1, "2015-01-01", 20.0),
row(1, "2015-01-06", 10.0),
row(1, "2015-01-07", 25.0),
row(1, "2015-01-12", 30.0),
row(2, "2015-01-01", 5.0),
row(2, "2015-01-03", 30.0),
row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
Finally query:
w = (Window()
.partitionBy(col("id"))
.orderBy(col("start").cast("timestamp").cast("long"))
.rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty but works.
* Hive Language Manual, Types
Spark 3.3 is released, but...
The answer may be as old as Spark 1.5.0:
datediff.
datediff(col_name, '1000') will return an integer difference of days from 1000-01-01 to col_name.
As the first argument, it accepts dates, timestamps and even strings.
As the second, it even accepts 1000.
The answer
Date difference in days - depending on the data type of the order column:
date
Spark 3.1+
.orderBy(F.expr("unix_date(col_name)")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
timestamp
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
long - UNIX time in microseconds (e.g. 1672534861000000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000000).rangeBetween(-7, 0)
long - UNIX time in milliseconds (e.g. 1672534861000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000).rangeBetween(-7, 0)
long - UNIX time in seconds (e.g. 1672534861)
Spark 2.1+
.orderBy(F.col("col_name") / 86400).rangeBetween(-7, 0)
long in format yyyyMMdd
Spark 3.3+
.orderBy(F.expr("unix_date(to_date(col_name, 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(cast(col_name as string), 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(cast(col_name as string), 'yyyyMMdd'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp(F.col("col_name").cast('string'), 'yyyyMMdd') / 86400).rangeBetween(-7, 0)
string in date format of 'yyyy-MM-dd'
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name))")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other date format (e.g. 'MM-dd-yyyy')
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name, 'MM-dd-yyyy'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp("col_name", 'MM-dd-yyyy') / 86400).rangeBetween(-7, 0)
string in timestamp format of 'yyyy-MM-dd HH:mm:ss'
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other timestamp format (e.g. 'MM-dd-yyyy HH:mm:ss')
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy HH:mm:ss'), '1000')")).rangeBetween(-7, 0)
Fantastic solution #zero323, if you want to operate with minutes instead of days as I have to, and you don't need to partition with id, so you only have to modify a simply part of the code as I show:
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, sum(total) OVER (
ORDER BY CAST(reading_date AS timestamp)
RANGE BETWEEN INTERVAL 45 minutes PRECEDING AND CURRENT ROW
) AS sum_total FROM df""").show()
Related
I have a parquet file partitioned by a date field (YYYY-MM-DD).
How to read the (current date-1 day) records from the file efficiently in Pyspark - please suggest.
PS: I would not like to read the entire file and then filter the records as the data volume is huge.
There are multiple ways to go about this:
Suppose this is the input data and you write out the dataframe partitioned on "date" column:
data = [(datetime.date(2022, 6, 12), "Hello"), (datetime.date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()),StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
df.write.mode('overwrite').partitionBy('date').parquet('./test')
You can read the parquet files associated to a given date with this syntax:
spark.read.parquet('./test/date=2022-06-19').show()
# The catch is that the date column is gonna be omitted from your dataframe
+-------+
|message|
+-------+
| World|
+-------+
# You could try adding the date column with lit syntax.
(spark.read.parquet('./test/date=2022-06-19')
.withColumn('date', f.lit('2022-06-19').cast(DateType()))
.show()
)
# Output
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
The more efficient solution is using the delta tables:
df.write.mode('overwrite').partitionBy('date').format('delta').save('/test')
spark.read.format('delta').load('./test').where(f.col('date') == '2022-06-19').show()
The spark engine uses the _delta_log to optimize your query and only reads the parquet files that are applicable to your query. Also, the output will have all the columns:
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
you can read it by passing date variable while reading.
This is dynamic code, you nor need to hardcode date, just append it with path
>>> df.show()
+-----+-----------------+-----------+----------+
|Sr_No| User_Id|Transaction| dt|
+-----+-----------------+-----------+----------+
| 1|paytm 111002203#p| 100D|2022-06-29|
| 2|paytm 111002203#p| 50C|2022-06-27|
| 3|paytm 111002203#p| 20C|2022-06-26|
| 4|paytm 111002203#p| 10C|2022-06-25|
| 5| null| 1C|2022-06-24|
+-----+-----------------+-----------+----------+
>>> df.write.partitionBy("dt").mode("append").parquet("/dir1/dir2/sample.parquet")
>>> from datetime import date
>>> from datetime import timedelta
>>> today = date.today()
#Here i am taking two days back date, for one day back you can do (days=1)
>>> yesterday = today - timedelta(days = 2)
>>> two_days_back=yesterday.strftime('%Y-%m-%d')
>>> path="/di1/dir2/sample.parquet/dt="+two_days_back
>>> spark.read.parquet(path).show()
+-----+-----------------+-----------+
|Sr_No| User_Id|Transaction|
+-----+-----------------+-----------+
| 2|paytm 111002203#p| 50C|
+-----+-----------------+-----------+
I have a Spark DataFrame with a timestamp column in milliseconds since the epoche. The column is a string. I now want to transform the column to a readable human time but keep the milliseconds.
For example:
1614088453671 -> 23-2-2021 13:54:13.671
Every example i found transforms the timestamp to a normal human readable time without milliseconds.
What i have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671 |
+------------------+
What i want to reach:
+------------------+------------------------+
|epoch_time_seconds|human_date |
+------------------+------------------------+
|1614088453671 |23-02-2021 13:54:13.671 |
+------------------+------------------------+
The time before the milliseconds can be obtained using date_format from_unixtime, while the milliseconds can be obtained using a modulo. Combine them using format_string.
val df2 = df.withColumn(
"human_date",
format_string(
"%s.%s",
date_format(
from_unixtime(col("epoch_time_seconds")/1000),
"dd-MM-yyyy HH:mm:ss"
),
col("epoch_time_seconds") % 1000
)
)
df2.show(false)
+------------------+-----------------------+
|epoch_time_seconds|human_date |
+------------------+-----------------------+
|1614088453671 |23-02-2021 13:54:13.671|
+------------------+-----------------------+
I am trying to calculate the number of days between current_timestamp() and max(timestamp_field) from a table.
maxModifiedDate = spark.sql("select date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,date_format(current_timestamp(),'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp, datediff(current_timestamp(), date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss')) as daysDiff from db.tbl")
but I get null for daysDiff. Why is that and how can I fix it?
------------------+-------------------+--------+
| maxModifiedDate| CurrentTimeStamp|daysDiff|
+-------------------+-------------------+--------+
|01/29/2020 05:07:51|06/29/2020 08:36:28| null|
+-------------------+-------------------+--------+
Check this out: I used to_timestamp to convert into dateformat and used datediff function to calculate the time difference.
from pyspark.sql import functions as F
# InputDF
# +-------------------+-------------------+
# | maxModifiedDate| CurrentTimeStamp|
# +-------------------+-------------------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28|
# +-------------------+-------------------+
df.select("maxModifiedDate","CurrentTimeStamp",F.datediff( F.to_timestamp("CurrentTimeStamp", format= 'MM/dd/yyyy'), F.to_timestamp("maxModifiedDate", format= 'MM/dd/yyyy')).alias("datediff")).show()
# +-------------------+-------------------+--------+
# | maxModifiedDate| CurrentTimeStamp|datediff|
# +-------------------+-------------------+--------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28| 152|
# +-------------------+-------------------+--------+
Using sparksql
spark.sql("select maxModifiedDate,CurrentTimeStamp, datediff(to_timestamp(CurrentTimeStamp, 'MM/dd/yyyy'), to_timestamp(maxModifiedDate, 'MM/dd/yyyy')) as datediff from table ").show()
date_format is used to change timestamp formats instead use to_date(col,'fmt'), unix_timestamp+from_unixtime,to_timestamp functions with datediff.
df.show()
#+-------------------+-------------------+
#| maxModifiedDate| CurrentTimeStamp|
#+-------------------+-------------------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28|
#+-------------------+-------------------+
spark.sql("select maxModifiedDate,CurrentTimeStamp,datediff(to_date(maxModifiedDate, 'MM/dd/yyyy'),to_date(CurrentTimeStamp,'MM/dd/yyyy')) as daysDiff from tmp").show()
#+-------------------+-------------------+--------+
#| maxModifiedDate| CurrentTimeStamp|daysDiff|
#+-------------------+-------------------+--------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28| -152|
#+-------------------+-------------------+--------+
I think you could try to define your own function to solve your problem, since datediff() is only able to compute difference between dates and not datetimes.
I suggest you something like this, casting your datetime to long:
diff_datetime = col("end_time").cast("long") - col("start_time").cast("long")
df = df.withColumn("diff", diff/60)
Or casting your result to timestamp using SQL
SELECT datediff(F.to_timestamp(end_date), F.to_timestamp(start_date))
In this case, I'm going to get the difference in seconds between two datetimes, but you can edit this result changing the scale factor (60 for seconds, 60*60 for minutes...)
Alternatively, if you want to use that function, you have to cast your datetime column to a date column (without hours, minutes and seconds) using to_date() and then apply datediff().
In the DataFrame df I have a column datetime that contains timestamp values. The problem is that in some rows these are unix timestamps, while in other rows these are yyyyMMddHHmm format.
How can I verify that each given value is unix timestamp and if it's not to convert it into timestamp?
df.withColumn("timestamp", unix_timestamp(col("datetime")))
I assume that when...otherwise should be used, but how to check that a value is the unix timestamp?
You can use when/otherwise along with the date parsing methods. Here is some example code. I differentiated using just the length of the string, but you could also check the result of parsing them.
from pyspark.sql.functions import *
data = [
('201001021011',),
('201101021011',),
('1539721852',),
('1539721853',)
]
df = sc.parallelize(data).toDF(['date'])
df2 = df.withColumn('date',
when(length('date') != 12, from_unixtime('date', 'yyyyMMddHHmm')) \
.otherwise(col('date'))
)
df3 = df2.withColumn('date', to_timestamp('date', 'yyyyMMddHHmm'))
df3.show()
Outputs this:
+-------------------+
| date|
+-------------------+
|2010-01-02 10:11:00|
|2011-01-02 10:11:00|
|2018-10-16 16:30:00|
|2018-10-16 16:30:00|
+-------------------+
If column datetime consists of only Unix-timestamp strings or "yyyyMMddHHmm"-formatted strings, you can differentiate the two string formats based on their length, since the former has 10 digits or less whereas the latter is a fixed 12:
val df = Seq(
(1, "1538384400"),
(2, "1538481600"),
(3, "201809281800"),
(4, "1538548200"),
(5, "201809291530")
).toDF("id", "datetime")
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise($"datetime")
)
// +---+------------+----------+
// | id| datetime| timestamp|
// +---+------------+----------+
// | 1| 1538384400|1538384400|
// | 2| 1538481600|1538481600|
// | 3|201809281800|1538182800|
// | 4| 1538548200|1538548200|
// | 5|201809291530|1538260200|
// +---+------------+----------+
In case there are other string formats in column datetime, you can narrow down the conditions for Unix timestamp to a range corresponding to the range of date-time in your dataset. For example, Unix timestamp should be a 10-digit number post 2001-09-09 (and for the next 250+ years) and would start with 10 to 15 up till now:
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise(when(regexp_extract($"datetime", "^(1[0-5]\\d{8})$", 1) === $"datetime", $"datetime").
otherwise(null) // Or, additional conditions for other cases
))
I am new to Spark API. I am trying to extract weekday number from a column say col_date (having datetime stamp e.g '13AUG15:09:40:15') which is string and add another column as weekday(integer). I am not able to do successfully.
the approach below worked for me, using a 'one line' udf - similar but different to above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
Well, this is quite simple.
This simple function make all the job and returns weekdays as number (monday = 1):
from time import time
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
# v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see whats in RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = v.map(lambda x: (toWeekDay(x))) # apply functon toWeekDay on each element of RDD
result.take(2) # lets see results
> ['4', '3']
Please see Python documentation for further details on datetime processing.