pyspark - generating a date sequence

I am trying to generate a date sequence between two date columns:
from pyspark.sql import functions as F
df1 = df.withColumn("start_dt", F.to_date(F.col("start_date"), "yyyy-mm-dd")) \
.withColumn("end_dt", F.to_date(F.col("end_date"), "yyyy-mm-dd"))
df1.select("start_dt", "end_dt").show()
print("type(start_dt)", type("start_dt"))
print("type(end_dt)", type("end_dt"))
df2 = df1.withColumn("lineoffdate", F.expr("""sequence(start_dt,end_dt,1)"""))
Below is the output
+---------------+----------+
| start_date | end_date|
+---------------+----------+
| 2020-02-01|2020-03-21|
+---------------+----------+
type(start_dt) <class 'str'>
type(end_dt) <class 'str'>
cannot resolve 'sequence(start_dt, end_dt, 1)' due to data type mismatch: sequence only supports integral, timestamp or date types; line 1 pos 0;
Even after converting start_dt and end_dt to date or timestamp, I see that the type of the column is still str, and I get the above error while generating the date sequence.

You are correct that it should work with date or timestamp (calendar) types. The only mistake you were making is that you passed the step of sequence as an integer, when it should be a calendar interval (like interval 1 day). As an aside, type("start_dt") prints <class 'str'> because it is the type of the Python string literal "start_dt", not of the column; after to_date the columns really are dates:
df.withColumn("start_date",F.to_date("start_date")) \
.withColumn("end_date", F.to_date("end_date")) \
.withColumn(
"lineofdate",
F.expr("""sequence(start_date,end_date,interval 1 day)""") \
) \
.show()
# output:
# +----------+----------+--------------------+
# |start_date| end_date| lineofdate|
# +----------+----------+--------------------+
# |2020-02-01|2020-03-21|[2020-02-01, 2020...|
# +----------+----------+--------------------+
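If you need one row per date rather than an array column, you can explode the generated sequence. A minimal sketch, assuming the same df and column names as above (df_days is just an illustrative name):
from pyspark.sql import functions as F

df_days = df.withColumn("start_date", F.to_date("start_date")) \
    .withColumn("end_date", F.to_date("end_date")) \
    .withColumn("day", F.explode(F.expr("sequence(start_date, end_date, interval 1 day)")))

# one row per day between start_date and end_date (inclusive)
df_days.select("start_date", "end_date", "day").show(3)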

Related

How can I operate with an interval type variable?

I have two timestamp columns ('tpep_pickup_datetime' and 'tpep_dropoff_datetime'), and when I calculate the difference between them, I get an interval variable.
from pyspark.sql.functions import to_timestamp, col

yellowcab = yellowcab \
    .withColumn('tpep_pickup_datetime', to_timestamp('tpep_pickup_datetime', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('tpep_dropoff_datetime', to_timestamp('tpep_dropoff_datetime', 'yyyy-MM-dd HH:mm:ss'))

yellowcab = yellowcab \
    .withColumn('total_time', col('tpep_dropoff_datetime') - col('tpep_pickup_datetime'))
The resulting total_time column is of interval type.
I want to transform the total_time column into an int with the time expressed in seconds.
I have tried extracting the hours and minutes from the interval and multiplying them to get seconds, but I have not been able to make it work.
Cast the interval into int.
from pyspark.sql import functions as f

data = [['2020-08-01 00:02:53', '2020-08-01 00:28:54']]
spark.createDataFrame(data, ['t1', 't2']) \
    .withColumn('t1', f.to_timestamp('t1', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('t2', f.to_timestamp('t2', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('interval', (f.col('t2') - f.col('t1')).cast('int')) \
    .show()
+-------------------+-------------------+--------+
| t1| t2|interval|
+-------------------+-------------------+--------+
|2020-08-01 00:02:53|2020-08-01 00:28:54| 1561|
+-------------------+-------------------+--------+
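Depending on your Spark version, the result of subtracting two timestamps may be an interval type that does not cast cleanly to int. A more portable sketch (assuming the same t1/t2 columns as above) is to cast the timestamps to epoch seconds and subtract:
from pyspark.sql import functions as f

df = spark.createDataFrame([['2020-08-01 00:02:53', '2020-08-01 00:28:54']], ['t1', 't2']) \
    .withColumn('t1', f.to_timestamp('t1', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('t2', f.to_timestamp('t2', 'yyyy-MM-dd HH:mm:ss'))

# casting a timestamp to long yields epoch seconds, so the difference is already in seconds
df.withColumn('total_seconds', f.col('t2').cast('long') - f.col('t1').cast('long')).show()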

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code; can anyone help me convert it without changing the values?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# parse the timestamp string into TimestampType
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))

# cast the TimestampType column back to string for display
df.withColumn('timestamp_string', to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how can I prevent it?
I believe to_timestamp is converting the timestamp value to your local time zone, since you have +00:00 in your data.
Try to pass the format to to_timestamp() function.
Example:
from pyspark.sql.functions import to_timestamp, col

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
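Another option, a minimal sketch assuming you simply want Spark to interpret and display timestamps in UTC, is to set the session time zone before parsing:
# make Spark parse and display timestamps in UTC instead of the local time zone
spark.conf.set("spark.sql.session.timeZone", "UTC")

df.withColumn("timestamp", to_timestamp("input_timestamp")).show(truncate=False)
# with the session time zone set to UTC, the hour stays 15 instead of shifting to 08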

pyspark: change string to timestamp

I have a column in string format, and some rows are null.
I appended a fixed time so the values take the following form before converting them to timestamp.
Before:
date
null
22-04-2020

After:
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat (df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"),"MM-dd-yyyy HH mm ss SSS")).show(2)
But after this, the date column becomes null.
Can anyone help me solve this, or show how to convert the string to a timestamp directly?
Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
| date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
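As an alternative sketch (assuming the same dd-MM-yyyy input and the same 1990-01-01 default as above), you can skip the pre-fill step and handle the nulls with coalesce after parsing:
from pyspark.sql import functions as F

parsed = F.to_timestamp(
    F.concat(F.col("date"), F.lit(" 23:59:59.000")),   # append an end-of-day time
    "dd-MM-yyyy HH:mm:ss.SSS"                           # note dd-MM and the colons/dots
)

# concat with a null date stays null, so coalesce falls back to the default timestamp
df.withColumn("date", F.coalesce(parsed, F.to_timestamp(F.lit("1990-01-01 23:59:59")))).show()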

spark sql datediff in days

I am trying to calculate the number of days between current_timestamp() and max(timestamp_field) from a table.
maxModifiedDate = spark.sql("select date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,date_format(current_timestamp(),'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp, datediff(current_timestamp(), date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss')) as daysDiff from db.tbl")
but I get null for daysDiff. Why is that and how can I fix it?
+-------------------+-------------------+--------+
| maxModifiedDate| CurrentTimeStamp|daysDiff|
+-------------------+-------------------+--------+
|01/29/2020 05:07:51|06/29/2020 08:36:28| null|
+-------------------+-------------------+--------+
Check this out: I used to_timestamp to convert the strings into timestamps and then the datediff function to calculate the difference in days.
from pyspark.sql import functions as F
# InputDF
# +-------------------+-------------------+
# | maxModifiedDate| CurrentTimeStamp|
# +-------------------+-------------------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28|
# +-------------------+-------------------+
df.select("maxModifiedDate","CurrentTimeStamp",F.datediff( F.to_timestamp("CurrentTimeStamp", format= 'MM/dd/yyyy'), F.to_timestamp("maxModifiedDate", format= 'MM/dd/yyyy')).alias("datediff")).show()
# +-------------------+-------------------+--------+
# | maxModifiedDate| CurrentTimeStamp|datediff|
# +-------------------+-------------------+--------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28| 152|
# +-------------------+-------------------+--------+
Using sparksql
spark.sql("select maxModifiedDate,CurrentTimeStamp, datediff(to_timestamp(CurrentTimeStamp, 'MM/dd/yyyy'), to_timestamp(maxModifiedDate, 'MM/dd/yyyy')) as datediff from table ").show()
date_format is for changing how a timestamp is displayed and returns a string; use to_date(col, 'fmt'), unix_timestamp + from_unixtime, or to_timestamp together with datediff instead.
df.show()
#+-------------------+-------------------+
#| maxModifiedDate| CurrentTimeStamp|
#+-------------------+-------------------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28|
#+-------------------+-------------------+
spark.sql("select maxModifiedDate,CurrentTimeStamp,datediff(to_date(maxModifiedDate, 'MM/dd/yyyy'),to_date(CurrentTimeStamp,'MM/dd/yyyy')) as daysDiff from tmp").show()
#+-------------------+-------------------+--------+
#| maxModifiedDate| CurrentTimeStamp|daysDiff|
#+-------------------+-------------------+--------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28| -152|
#+-------------------+-------------------+--------+
I think you could try defining your own expression to solve the problem, since datediff() only counts whole days and ignores the time-of-day part.
I suggest something like this, casting your datetimes to long (epoch seconds):
from pyspark.sql.functions import col

diff_datetime = col("end_time").cast("long") - col("start_time").cast("long")
df = df.withColumn("diff", diff_datetime / 60)
Or doing the same cast in SQL:
SELECT (cast(end_time as long) - cast(start_time as long)) / 60 AS diff
This gives the difference in minutes; adjust the scale factor for other units (no division for seconds, 3600 for hours, 86400 for days).
Alternatively, if you want to use that function, you have to cast your datetime column to a date column (without hours, minutes and seconds) using to_date() and then apply datediff().
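For the original query, the null comes from passing the date_format output (an 'MM/dd/yyyy ...' string) to datediff, which cannot be implicitly cast back to a date. A minimal sketch of the fix, assuming lastmodifieddate is stored as a timestamp or date column, is to hand datediff the raw column and keep date_format only for display:
maxModifiedDate = spark.sql("""
    select
        date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,
        date_format(current_timestamp(), 'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp,
        datediff(current_timestamp(), max(lastmodifieddate)) as daysDiff
    from db.tbl
""")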

How to check that a value is the unix timestamp in Scala?

In the DataFrame df I have a column datetime that contains timestamp values. The problem is that in some rows these are Unix timestamps, while in other rows they are in yyyyMMddHHmm format.
How can I check whether a given value is a Unix timestamp and, if it is not, convert it into one?
df.withColumn("timestamp", unix_timestamp(col("datetime")))
I assume that when...otherwise should be used, but how do I check whether a value is a Unix timestamp?
You can use when/otherwise along with the date parsing methods. Here is some example code. I differentiated using just the length of the string, but you could also check the result of parsing them.
from pyspark.sql.functions import *

data = [
    ('201001021011',),
    ('201101021011',),
    ('1539721852',),
    ('1539721853',)
]
df = sc.parallelize(data).toDF(['date'])

df2 = df.withColumn('date',
    when(length('date') != 12, from_unixtime('date', 'yyyyMMddHHmm'))
    .otherwise(col('date'))
)
df3 = df2.withColumn('date', to_timestamp('date', 'yyyyMMddHHmm'))
df3.show()
Outputs this:
+-------------------+
| date|
+-------------------+
|2010-01-02 10:11:00|
|2011-01-02 10:11:00|
|2018-10-16 16:30:00|
|2018-10-16 16:30:00|
+-------------------+
If column datetime consists of only Unix-timestamp strings or "yyyyMMddHHmm"-formatted strings, you can differentiate the two string formats based on their length, since the former has 10 digits or less whereas the latter is a fixed 12:
val df = Seq(
  (1, "1538384400"),
  (2, "1538481600"),
  (3, "201809281800"),
  (4, "1538548200"),
  (5, "201809291530")
).toDF("id", "datetime")

df.withColumn("timestamp",
  when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
    otherwise($"datetime")
)
// +---+------------+----------+
// | id| datetime| timestamp|
// +---+------------+----------+
// | 1| 1538384400|1538384400|
// | 2| 1538481600|1538481600|
// | 3|201809281800|1538182800|
// | 4| 1538548200|1538548200|
// | 5|201809291530|1538260200|
// +---+------------+----------+
In case there are other string formats in column datetime, you can narrow down the condition for a Unix timestamp to a range matching the date-time range of your dataset. For example, a Unix timestamp has 10 digits for any time after 2001-09-09 (and for the next 250+ years), and as of now it starts with 10 through 15:
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise(when(regexp_extract($"datetime", "^(1[0-5]\\d{8})$", 1) === $"datetime", $"datetime").
otherwise(null) // Or, additional conditions for other cases
))
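If you then want a proper TimestampType column rather than epoch-second strings, a minimal PySpark sketch (reusing the same length-based check on an assumed string column named datetime):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "1538384400"), (3, "201809281800")], ["id", "datetime"])

unified = df.withColumn(
    "epoch",
    F.when(F.length("datetime") == 12, F.unix_timestamp("datetime", "yyyyMMddHHmm"))
     .otherwise(F.col("datetime").cast("long"))
)
# casting epoch seconds (long) to timestamp yields a TimestampType column
unified.withColumn("ts", F.col("epoch").cast("timestamp")).show()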