I have data from 1st Jan 2017 to 7th Jan 2017, which is one week, and I want a weekly aggregate. I used the window function in the following manner:
val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
.agg(sum("Value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
The data in my dataframe looks like this:
DateTime,value
2017-01-01T00:00:00.000+05:30,1.2
2017-01-01T00:15:00.000+05:30,1.30
...
2017-01-07T23:30:00.000+05:30,1.43
2017-01-07T23:45:00.000+05:30,1.4
I am getting the output as:
2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74
This shows my window starting from 29th Dec 2016, but the actual data starts from 1st Jan 2017. Why is this offset occurring?
For tumbling windows like this it is possible to set an offset to the starting time; more information can be found in the blog here. A sliding window is used instead, but by setting both the window duration and the sliding duration to the same value, it behaves like a tumbling window with a starting offset.
The syntax is as follows:
window(column, window duration, sliding duration, starting offset)
With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00.
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(("2017-01-01 00:00:00", 1.0),
               ("2017-01-01 00:15:00", 2.0),
               ("2017-01-08 23:30:00", 1.43))

val df = data.toDF("DateTime", "value")
  .withColumn("DateTime", to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss"))

val df2 = df
  .groupBy(window(col("DateTime"), "1 week", "1 week", "64 hours"))
  .agg(sum("value") as "aggregate_sum")
  .select("window.start", "window.end", "aggregate_sum")
This will give the following resulting dataframe:
+-------------------+-------------------+-------------+
| start| end|aggregate_sum|
+-------------------+-------------------+-------------+
|2017-01-01 00:00:00|2017-01-08 00:00:00| 3.0|
|2017-01-08 00:00:00|2017-01-15 00:00:00| 1.43|
+-------------------+-------------------+-------------+
The solution with the Python API looks a bit more intuitive, since there the window function takes the following options:
window(timeColumn, windowDuration, slideDuration=None, startTime=None)
see:
https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/functions.html
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC
with which to start window intervals. For example, in order to have
hourly tumbling windows that start 15 minutes past the hour, e.g.
12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
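To illustrate the quoted behaviour, here is a minimal sketch (not from the original answer; the event times, column names and the SparkSession named spark are all made up) of hourly tumbling windows that start 15 minutes past the hour:
from datetime import datetime
from pyspark.sql import functions as F

# Two hypothetical events at 12:20 and 13:40
events = spark.createDataFrame(
    [(datetime(2017, 1, 1, 12, 20), 1.0),
     (datetime(2017, 1, 1, 13, 40), 2.0)],
    ["ts", "value"])

# With startTime="15 minutes", the hourly windows are aligned 15 minutes past the hour
# (relative to 1970-01-01 00:00:00 UTC), i.e. 12:15-13:15, 13:15-14:15, ...
hourly = (events
          .groupBy(F.window("ts", "1 hour", startTime="15 minutes"))
          .agg(F.sum("value").alias("aggregate_sum")))
hourly.show(truncate=False)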
There is no need for a workaround with the sliding duration; I used a 3-day "delay" as startTime to match the desired tumbling window:
from datetime import datetime
from pyspark.sql.functions import sum, window
df_ex = spark.createDataFrame([(datetime(2017,1,1, 0,0) , 1.), \
(datetime(2017,1,1,0,15) , 2.), \
(datetime(2017,1,8,23,30) , 1.43)], \
["Datetime", "value"])
weekly_ex = df_ex \
.groupBy(window("Datetime", "1 week", startTime="3 day" )) \
.agg(sum("value").alias('aggregate_sum'))
weekly_ex.show(truncate=False)
For the same result:
+------------------------------------------+-------------+
|window |aggregate_sum|
+------------------------------------------+-------------+
|[2017-01-01 00:00:00, 2017-01-08 00:00:00]|3.0 |
|[2017-01-08 00:00:00, 2017-01-15 00:00:00]|1.43 |
+------------------------------------------+-------------+
Why does only the Jan date work when I try to convert using the code below?
from pyspark.sql.functions import col, to_date

df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "D-MMM-yyyy"))
display(df2)
Result:
Date
------------
undefined
2021-01-02
D is the day of the year.
The second one works because day-of-year 02 does fall in January, but day-of-year 05 does not fall in November.
If you try:
data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
It will work for both.
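For instance, a quick hedged check of that claim (assuming a SparkSession named spark; df3 is just a throwaway name):
from pyspark.sql.functions import col, to_date

data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
df3 = spark.createDataFrame(data)
# With the D (day-of-year) pattern both rows should parse, because
# day-of-year 05 and day-of-year 02 really do fall in January.
df3.withColumn("parsed", to_date(col("date"), "D-MMM-yyyy")).show()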
However, what you actually need is d, which is the day of the month, so use d-MMM-yyyy.
For further information please see: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
D is day-of-the-year.
What you're looking for is d - day of the month.
PySpark supports the Java DateTimeFormatter patterns: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "dd-MMM-yyyy"))
df2.show()
+----------+
| date|
+----------+
|2000-11-05|
|2021-01-02|
+----------+
I have a dataframe that looks like this:
+--------+-------------------------------------+-----------+
| Worker | Schedule | Overtime |
+--------+-------------------------------------+-----------+
| 1 | 23344--23344--23344--23344--23344-- | 3 |
+--------+-------------------------------------+-----------+
| 2 | 34455--34455--34455--34455--34455-- | 2 |
+--------+-------------------------------------+-----------+
| 3 | 466554-466554-466554-466554-466554- | 1 |
+--------+-------------------------------------+-----------+
Each digit in the 35-character Schedule string is a worker's work hours on one day of a 35-day window (a - marks a day off).
Here is how to read each row:
Worker #1 works 2hr on Monday, 3hr on Tuesday, 3hr on Wednesday, 4hr on Thursday, 4hr on Friday, then off on Saturday and Sunday... (same for the following weeks in that 35-day window)
Worker #3 works 4hr on Monday, 6hr on Tuesday, 6hr on Wednesday, 5hr on Thursday, 5hr on Friday, 4hr on Saturday, then off on Sunday... (same for the following weeks in that 35-day window)
I would like to implement the following operation:
- For each day of a worker's schedule, if the hours worked that day plus Overtime are <= 6, add the overtime hours to that day. No change is applied to days off (marked with -).
For example:
Worker #1's updated schedule would look like:
56644--56644--56644--56644--56644--
2+3 <= 6 -> add 3 hrs
3+3 <= 6 -> add 3 hrs
4+3 > 6 -> no change
-- -> days off, no change
Using the same logic, Worker #2's updated schedule would look like:
56655--56655--56655--56655--56655--
Worker #3's updated schedule would look like:
566665-566665-566665-566665-566665-
I am wondering how to perform this operation in PySpark.
Thanks a lot for your help!
The shortest way (and probably the best performing) is to use Spark SQL's transform, which loops through the schedule split into an array and performs the comparison element-wise. The code does look a bit cryptic, though.
from pyspark.sql import functions as F
(df
.withColumn('recalculated_schedule', F.expr('concat_ws("", transform(split(schedule, ""), x -> case when x = "-" then "-" when x + overtime <= 6 then cast(x + overtime as int) else x end))'))
.show(10, False)
)
+------+-----------------------------------+--------+-----------------------------------+
|worker|schedule |overtime|recalculated_schedule |
+------+-----------------------------------+--------+-----------------------------------+
|1 |23344--23344--23344--23344--23344--|3 |56644--56644--56644--56644--56644--|
|2 |34455--34455--34455--34455--34455--|2 |56655--56655--56655--56655--56655--|
|3 |466554-466554-466554-466554-466554-|1 |566665-566665-566665-566665-566665-|
+------+-----------------------------------+--------+-----------------------------------+
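If the SQL expression feels too cryptic, the same rule can be written as a plain Python UDF, which is usually slower but easier to read. A minimal sketch, assuming the same lowercase column names as above:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def add_overtime(schedule, overtime):
    # Apply the rule character by character; '-' marks a day off and is kept as-is
    out = []
    for ch in schedule:
        if ch != "-" and int(ch) + overtime <= 6:
            out.append(str(int(ch) + overtime))
        else:
            out.append(ch)
    return "".join(out)

add_overtime_udf = F.udf(add_overtime, StringType())

df.withColumn("recalculated_schedule",
              add_overtime_udf(F.col("schedule"), F.col("overtime"))).show(10, False)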
I have very little experience with PySpark and I am trying, with no success, to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for Monday, 1 for Tuesday, ...), the number of the month and the year.
What is the most effective way to create these 3 additional columns and append them to the PySpark dataframe? Thanks in advance!
Spark 1.5 and higher has many date-processing functions (dayofweek itself was added in Spark 2.3). Here are some that may be useful for you:
from pyspark.sql.functions import col, dayofweek, month, year

df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
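Since the column in the question is a string and the asker wants Monday = 0, two extra steps are needed: parse the string first, and remap dayofweek(), which returns 1 for Sunday through 7 for Saturday. A minimal sketch of that, assuming the column is named timestamp as in the question's output; the legacy parser setting is an assumption, usually required on Spark 3+ because the EEE token is not supported for parsing by the default parser:
from pyspark.sql.functions import col, dayofweek, month, to_timestamp, year

# Assumption: allow the EEE (day-of-week text) token to be parsed on Spark 3+
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = df.withColumn("ts", to_timestamp(col("timestamp"), "EEE MMM dd HH:mm:ss Z yyyy"))

df = (df
      .withColumn("dayOfWeek", (dayofweek(col("ts")) + 5) % 7)  # shift 1=Sunday..7=Saturday to 0=Monday..6=Sunday
      .withColumn("month", month(col("ts")))
      .withColumn("year", year(col("ts"))))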
Looking for Scala code to replicate https://www.epochconverter.com/seconds-days-since-y0
I have a Spark streaming job reading Avro messages. The message has a column of type int that holds the number of days since 1970-01-01, and I want to convert that to a date.
dataFrame.select(from_avro(col("Value"), valueRegistryConfig) as 'value)
.select("value.*")
.withColumn("start_date",'start_date)
start_date holds an integer value like 18022, i.e. the number of days since 1970-01-01. I want to convert this value to a date:
18022 -> Sun May 05 2019
Use 1970-01-01 as the base date and pass the number of days to the date_add function.
This gives a date that is one day ahead (the converter counts 1970-01-01 as day 1), so you subtract 1.
Something like this:
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._

val dataDF = Seq(("1970-01-01", 18091), ("1970-01-01", 18021), ("1970-01-01", 18022)).toDF("date", "num")

dataDF.select(
  col("date"),
  expr("date_add(date, num - 1)").as("date_add")).show(10, false)
+----------+----------+
|date |date_add |
+----------+----------+
|1970-01-01|2019-07-13|
|1970-01-01|2019-05-04|
|1970-01-01|2019-05-05|
+----------+----------+
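For reference, the same idea in PySpark; a minimal sketch assuming a DataFrame df with an integer column named start_date:
from pyspark.sql import functions as F

# date_add counts from the base date, and the converter treats 1970-01-01 as day 1,
# hence the "- 1"
df = df.withColumn("start_date_as_date",
                   F.expr("date_add(to_date('1970-01-01'), start_date - 1)"))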
I have some columns with dates from a source file that look like 4/23/19.
The 4 is the month, the 23 is the day and the 19 is 2019.
How do I convert this to a timestamp in PySpark?
So far I have:
from pyspark.sql.functions import coalesce, to_timestamp

def ParseDateFromFormats(col, formats):
    return coalesce(*[to_timestamp(col, f) for f in formats])

df2 = df2.withColumn("_" + field.columnName, ParseDateFromFormats(df2[field.columnName], ["dd/MM/yyyy hh:mm", "dd/MM/yyyy", "dd-MMM-yy"]).cast(field.simpleTypeName))
There doesn't seem to be a date format that would work
The reason why your code didn't work might be that you reversed the days and months.
This works:
from pyspark.sql.functions import to_date
time_df = spark.createDataFrame([('4/23/19',)], ['dt'])
time_df.withColumn('proper_date', to_date('dt', 'MM/dd/yy')).show()
+-------+-----------+
| dt|proper_date|
+-------+-----------+
|4/23/19| 2019-04-23|
+-------+-----------+
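If you also need an actual timestamp and want to keep the multi-format helper from the question, the corrected month-first pattern should slot into the formats list. A sketch reusing that helper with made-up column names ("date" and "_date" are assumptions):
from pyspark.sql.functions import coalesce, to_timestamp

def ParseDateFromFormats(col, formats):
    # Try each format in order and keep the first one that parses successfully
    return coalesce(*[to_timestamp(col, f) for f in formats])

df2 = df2.withColumn("_date",
                     ParseDateFromFormats(df2["date"],
                                          ["M/d/yy H:mm", "M/d/yy", "dd-MMM-yy"]))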