from_unixtime gave me an awkward value - pyspark

Hi, I used from_unixtime to convert the value 1632837232439 and got 53712-07-21 01:53:59. Is this right? I can't make sense of it. I used:
df = df.select(from_unixtime(df_sixty60['createdOn']).alias("date_key"))
Thanks for your help, even if you can only suggest other ways of representing this.

Try the to_timestamp() function and divide by 1000: your epoch value is in milliseconds, while from_unixtime expects seconds.
Example:
df.show()
#+-------------+
#| createdOn|
#+-------------+
#|1632837232439|
#+-------------+
df.select(to_timestamp(df['createdOn']/1000).alias("date_key")).show(10,False)
#+-----------------------+
#|date_key |
#+-----------------------+
#|2021-09-28 13:53:52.439|
#+-----------------------+
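The arithmetic can be checked outside Spark. The following is a minimal plain-Python sketch (standard library only, not the Spark API; the variable names are illustrative) showing that the same number lands in the year 53712 when read as seconds and on 2021-09-28 when read as milliseconds, assuming UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
created_on = 1632837232439  # value from the question, in milliseconds

# Read as seconds, the value is about 51,742 years past 1970.
# 31_556_952 is the average number of seconds in a Gregorian year.
year_if_seconds = 1970 + created_on // 31_556_952

# Read as milliseconds, it is an ordinary 2021 timestamp.
as_millis = EPOCH + timedelta(milliseconds=created_on)

print(year_if_seconds)        # 53712
print(as_millis.isoformat())  # 2021-09-28T13:53:52.439000+00:00
```

Spark renders timestamps in the session time zone, so the hour shown by show() can differ from this UTC value.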

Related

Spark DataFrame convert milliseconds timestamp column in string format to human readable time with milliseconds

I have a Spark DataFrame with a timestamp column in milliseconds since the epoch. The column is a string. I now want to transform the column to a human-readable time but keep the milliseconds.
For example:
1614088453671 -> 23-2-2021 13:54:13.671
Every example I found transforms the timestamp to a normal human-readable time without milliseconds.
What I have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671 |
+------------------+
What I want to reach:
+------------------+------------------------+
|epoch_time_seconds|human_date |
+------------------+------------------------+
|1614088453671 |23-02-2021 13:54:13.671 |
+------------------+------------------------+
The part before the milliseconds can be obtained with date_format and from_unixtime, while the milliseconds can be obtained with a modulo. Combine them using format_string.
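val df2 = df.withColumn(
  "human_date",
  format_string(
    "%s.%03d", // %03d zero-pads the milliseconds, e.g. 5 becomes 005
    date_format(
      from_unixtime(col("epoch_time_seconds") / 1000),
      "dd-MM-yyyy HH:mm:ss"
    ),
    // cast to int so the remainder of a string column is not rendered as 671.0
    (col("epoch_time_seconds") % 1000).cast("int")
  )
)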
df2.show(false)
+------------------+-----------------------+
|epoch_time_seconds|human_date |
+------------------+-----------------------+
|1614088453671 |23-02-2021 13:54:13.671|
+------------------+-----------------------+
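The split into whole seconds and a millisecond remainder can be sketched in plain Python as well. This is only an illustration of the arithmetic, not Spark code: the helper name format_epoch_millis is made up here, and UTC is assumed, whereas Spark would use the session time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_epoch_millis(value: str) -> str:
    """Format a string of epoch milliseconds as dd-MM-yyyy HH:mm:ss.SSS (UTC)."""
    seconds, millis = divmod(int(value), 1000)
    whole = EPOCH + timedelta(seconds=seconds)
    stamp = whole.strftime("%d-%m-%Y %H:%M:%S")
    # :03d keeps leading zeros, so 5 ms prints as .005 rather than .5
    return f"{stamp}.{millis:03d}"


print(format_epoch_millis("1614088453671"))  # 23-02-2021 13:54:13.671
print(format_epoch_millis("1614088453005"))  # 23-02-2021 13:54:13.005
```

The second call shows why the remainder should be zero-padded: without padding, 5 ms would be printed as ".5".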

spark sql datediff in days

I am trying to calculate the number of days between current_timestamp() and max(timestamp_field) from a table.
maxModifiedDate = spark.sql("select date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,date_format(current_timestamp(),'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp, datediff(current_timestamp(), date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss')) as daysDiff from db.tbl")
but I get null for daysDiff. Why is that and how can I fix it?
+-------------------+-------------------+--------+
| maxModifiedDate| CurrentTimeStamp|daysDiff|
+-------------------+-------------------+--------+
|01/29/2020 05:07:51|06/29/2020 08:36:28| null|
+-------------------+-------------------+--------+
Check this out: I used to_timestamp to convert the strings into timestamps and then used the datediff function to calculate the difference in days.
from pyspark.sql import functions as F
# InputDF
# +-------------------+-------------------+
# | maxModifiedDate| CurrentTimeStamp|
# +-------------------+-------------------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28|
# +-------------------+-------------------+
df.select("maxModifiedDate","CurrentTimeStamp",F.datediff( F.to_timestamp("CurrentTimeStamp", format= 'MM/dd/yyyy HH:mm:ss'), F.to_timestamp("maxModifiedDate", format= 'MM/dd/yyyy HH:mm:ss')).alias("datediff")).show()
# +-------------------+-------------------+--------+
# | maxModifiedDate| CurrentTimeStamp|datediff|
# +-------------------+-------------------+--------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28| 152|
# +-------------------+-------------------+--------+
Using Spark SQL:
spark.sql("select maxModifiedDate,CurrentTimeStamp, datediff(to_timestamp(CurrentTimeStamp, 'MM/dd/yyyy HH:mm:ss'), to_timestamp(maxModifiedDate, 'MM/dd/yyyy HH:mm:ss')) as datediff from table ").show()
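date_format only changes how a timestamp is rendered and returns a string; datediff cannot read a string in MM/dd/yyyy form as a date, hence the null. Instead, use to_date(col, 'fmt'), unix_timestamp + from_unixtime, or to_timestamp together with datediff. Note that datediff takes the end date first, so the call below, which passes maxModifiedDate first, returns a negative number.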
df.show()
#+-------------------+-------------------+
#| maxModifiedDate| CurrentTimeStamp|
#+-------------------+-------------------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28|
#+-------------------+-------------------+
spark.sql("select maxModifiedDate,CurrentTimeStamp,datediff(to_date(maxModifiedDate, 'MM/dd/yyyy'),to_date(CurrentTimeStamp,'MM/dd/yyyy')) as daysDiff from tmp").show()
#+-------------------+-------------------+--------+
#| maxModifiedDate| CurrentTimeStamp|daysDiff|
#+-------------------+-------------------+--------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28| -152|
#+-------------------+-------------------+--------+
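As a cross-check of the 152 and -152 results, here is a small plain-Python sketch (standard library only, not Spark; variable names are illustrative). It parses the two strings with an explicit pattern and subtracts the calendar dates in both orders.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"  # Python's equivalent of MM/dd/yyyy HH:mm:ss

max_modified = datetime.strptime("01/29/2020 05:07:51", FMT)
current = datetime.strptime("06/29/2020 08:36:28", FMT)

# Compare calendar dates only, which is what a day-level diff does.
days_end_first = (current.date() - max_modified.date()).days
days_start_first = (max_modified.date() - current.date()).days

print(days_end_first)    # 152
print(days_start_first)  # -152
```

The sign depends only on the order of the operands, which matches the difference between the two answers' outputs.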
I think you could define your own function to solve your problem, since datediff() only computes the difference between dates, not datetimes.
I suggest something like this, casting your datetimes to long:
diff_datetime = col("end_time").cast("long") - col("start_time").cast("long")
df = df.withColumn("diff", diff_datetime / 60)
Or convert your columns to timestamps in SQL:
SELECT datediff(to_timestamp(end_date), to_timestamp(start_date))
In this case, the subtraction gives the difference in seconds between the two datetimes, and you can change the unit with the scale factor (divide by 60 for minutes, 60*60 for hours...)
Alternatively, if you want to use that function, you have to cast your datetime column to a date column (without hours, minutes and seconds) using to_date() and then apply datediff().
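Casting a timestamp to long yields whole seconds since the epoch, so subtracting two such values gives seconds. The scaling can be sketched in plain Python (standard library only, not Spark; the sample values are borrowed from the question above and the variable names are illustrative):

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"
start_time = datetime.strptime("01/29/2020 05:07:51", FMT)
end_time = datetime.strptime("06/29/2020 08:36:28", FMT)

# Difference in whole seconds, then rescaled to other units.
diff_seconds = int((end_time - start_time).total_seconds())
diff_minutes = diff_seconds / 60
diff_hours = diff_seconds / (60 * 60)

print(diff_seconds)  # 13145317
print(round(diff_minutes, 2), round(diff_hours, 2))
```

Unlike a day-level diff, this keeps the time-of-day part of both values.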

PySpark custom TimestampType column conversion

I'm trying to convert a String column, which is essentially a date, into a TimestampType column, however I'm having problems with how to split the value.
-RECORD 0-------------------------------------
year | 2016
month | 4
arrival_date | 2016-04-30
date_added | 20160430
allowed_date | 10292016
I have 3 date columns, all in different formats, so I'm trying to find a way to split the string in a custom way, since the date_added column is yyyymmdd and the allowed_date is mmddyyyy.
I've tried something along the lines of:
df_imigration.withColumn('cc', F.date_format(df_imigration.allowed_date.cast(dataType=t.TimestampType()), "yyyy-mm-dd"))
But with no success, and I'm kind of stuck trying to find the right or best way to solve this.
The t and F aliases are for the following imports:
from pyspark.sql import functions as F
from pyspark.sql import types as t
The problem with your code is that you are casting the string without specifying the date format.
To specify the format, use the to_timestamp() function.
Here I created a dataframe with three different formats, and it worked.
# Uppercase MM is the month; lowercase mm means minutes in Spark's patterns.
df1 = spark.createDataFrame([("20201231","12312020","31122020"), ("20201231","12312020","31122020" )], ["ID","Start_date","End_date"])
df1=df1.withColumn('cc',F.date_format(F.to_timestamp(df1.ID,'yyyyMMdd'), "yyyy-MM-dd"))
df1=df1.withColumn('dd',F.date_format(F.to_timestamp(df1.Start_date,'MMddyyyy'), "yyyy-MM-dd"))
df1.withColumn('ee',F.date_format(F.to_timestamp(df1.End_date,'ddMMyyyy'), "yyyy-MM-dd")).show()
Output:
+--------+----------+--------+----------+----------+----------+
| ID|Start_date|End_date| cc| dd| ee|
+--------+----------+--------+----------+----------+----------+
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
+--------+----------+--------+----------+----------+----------+
Hope it helps!
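The same idea, one explicit pattern per column layout, can be tried out in plain Python on the values from the question's sample record. This is a sketch with the standard library, not Spark code, and the variable names are illustrative.

```python
from datetime import datetime

# Values from the question's sample record.
date_added = "20160430"    # year, month, day
allowed_date = "10292016"  # month, day, year

# Note: Python's strptime uses %m for month and %M for minutes,
# the reverse of the MM/mm convention in Spark's patterns.
parsed_added = datetime.strptime(date_added, "%Y%m%d")
parsed_allowed = datetime.strptime(allowed_date, "%m%d%Y")

print(parsed_added.date())    # 2016-04-30
print(parsed_allowed.date())  # 2016-10-29
```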

Spark 2.0 How to convert DF Date/timestamp column to another date format in scala?

For learning purposes, I have been using the sample dataset below.
+-------------------+-----+-----+-----+-----+-------+
| MyDate| Open| High| Low|Close| Volume|
+-------------------+-----+-----+-----+-----+-------+
|2006-01-03 00:00:00|983.8|493.8|481.1|492.9|1537660|
|2006-01-04 00:00:00|979.6|491.0|483.5|483.8|1871020|
|2006-01-05 00:00:00|972.2|487.8|484.0|486.2|1143160|
|2006-01-06 00:00:00|977.8|489.0|482.0|486.2|1370250|
|2006-01-09 00:00:00|973.4|487.4|483.0|483.9|1680740|
+-------------------+-----+-----+-----+-----+-------+
I tried to change the "MyDate" column values to a different format such as "YYYY-MON" and wrote this:
citiDataDF.withColumn("New-Mydate",to_timestamp($"MyDate", "yyyy-MON")).show(5)
After executing the code, I see the new column "New-Mydate", but not in the desired output format. Can you please help?
You need date_format instead of to_timestamp:
val citiDataDF = List("2006-01-03 00:00:00").toDF("MyDate")
citiDataDF.withColumn("New-Mydate", date_format($"MyDate", "yyyy-MMM")).show(5)
Result:
+-------------------+----------+
| MyDate|New-Mydate|
+-------------------+----------+
|2006-01-03 00:00:00| 2006-Jan|
+-------------------+----------+
Note: three "M"s give the month as an abbreviated name; if you want the month as a number, use only two "M"s.
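For comparison, the distinction between a month name and a month number looks like this in plain Python (standard library only, not Spark; the variable names are illustrative). Python's %b plays the role of "MMM" and %m the role of "MM".

```python
from datetime import datetime

my_date = datetime.strptime("2006-01-03 00:00:00", "%Y-%m-%d %H:%M:%S")

# %b is the abbreviated month name, %m the zero-padded month number.
month_name_form = my_date.strftime("%Y-%b")
month_number_form = my_date.strftime("%Y-%m")

print(month_name_form)    # 2006-Jan (in the default C locale)
print(month_number_form)  # 2006-01
```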

convert date to integer scala spark

I have a dataframe that contains two date columns, start_date and finish_date, and I created a new column, moyen_date, holding the difference between the two dates.
+--------------------+--------------------+----------+
|          start_date|         finish_date|moyen_date|
+--------------------+--------------------+----------+
|2010-11-03 15:56:...|2010-11-03 17:43:...|         0|
|2010-11-03 17:43:...|2010-11-05 13:21:...|         2|
|2010-11-05 13:21:...|2010-11-05 14:08:...|         0|
|2010-11-05 14:08:...|2010-11-05 14:08:...|         0|
+--------------------+--------------------+----------+
I calculated the difference between the 2 dates:
var result = sqlDF.withColumn("moyen_date",datediff(col("finish_date"), col("start_date")))
But I want to convert start_date and finish_date to integers, knowing that each column contains a date plus a time.
Can someone help me, please?
Thank you
Considering this as part of your dataframe:
df.show(false)
+---------------------+
|ts |
+---------------------+
|2010-11-03 15:56:34.0|
+---------------------+
unix_timestamp returns the number of seconds since the epoch. The input column should be of type timestamp. The output column is of type long.
df.withColumn("unix_ts", unix_timestamp($"ts")).show(false)
+---------------------+----------+
|ts |unix_ts |
+---------------------+----------+
|2010-11-03 15:56:34.0|1288817794|
+---------------------+----------+
To convert it back to timestamp format of your choice, you can use from_unixtime which also takes an optional timestamp format as a parameter. You are using to_date, that's why you're only getting the date and not the time.
df.withColumn("unix_ts" , unix_timestamp($"ts") )
.withColumn("from_utime" , from_unixtime($"unix_ts" , "yyyy-MM-dd HH:mm:ss.S"))
.show(false)
+---------------------+----------+---------------------+
|ts |unix_ts |from_utime |
+---------------------+----------+---------------------+
|2010-11-03 15:56:34.0|1288817794|2010-11-03 15:56:34.0|
+---------------------+----------+---------------------+
The column from_utime here will be of type string though. To convert it to timestamp, you can simply use:
df.withColumn("from_utime" , $"from_utime".cast("timestamp") )
Since it's already in ISO date format, no specific conversion is needed. For any other format, you will need to use a combination of unix_timestamp and from_unixtime.
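The round trip between a timestamp and whole epoch seconds can be verified with a short plain-Python sketch (standard library only, not Spark; the variable names are illustrative). One assumption is made explicit: 1288817794 corresponds to 20:56:34 UTC, so the 15:56:34 shown above presumably comes from a session running five hours behind UTC.

```python
from datetime import datetime, timezone

# 20:56:34 UTC is 15:56:34 in a zone five hours behind UTC (assumed here).
ts = datetime(2010, 11, 3, 20, 56, 34, tzinfo=timezone.utc)

unix_ts = int(ts.timestamp())  # whole seconds since 1970-01-01 UTC
round_trip = datetime.fromtimestamp(unix_ts, tz=timezone.utc)

print(unix_ts)           # 1288817794
print(round_trip == ts)  # True
```

The ten-digit result is a quick sanity check that the value is in seconds; present-day epoch values in milliseconds have thirteen digits.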