from_unixtime gave me an awkward value - pyspark

Hi, I used from_unixtime to convert the value 1632837232439 and got 53712-07-21 01:53:59. Is this right? I can't make sense of it. I used:
df = df.select(from_unixtime(df_sixty60['createdOn']).alias("date_key"))
Thanks for your help, even if you can only suggest other ways of representing this value.

Use the to_timestamp() function and divide the value by 1000: your epoch timestamp is in milliseconds, whereas from_unixtime interprets its input as seconds.
Example:
df.show()
#+-------------+
#| createdOn|
#+-------------+
#|1632837232439|
#+-------------+
from pyspark.sql.functions import to_timestamp
df.select(to_timestamp(df['createdOn']/1000).alias("date_key")).show(10,False)
#+-----------------------+
#|date_key |
#+-----------------------+
#|2021-09-28 13:53:52.439|
#+-----------------------+
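
To see why the original call produced a year beyond 50000, it can help to redo the arithmetic outside Spark. The following is a minimal plain-Python sketch (standard library only, not PySpark); the variable names are illustrative and not part of the original answer. It reads 1632837232439 first as seconds, then as milliseconds.

```python
from datetime import datetime, timedelta, timezone

created_on = 1632837232439  # value from the question, in milliseconds

# Misread as seconds: roughly which calendar year would that land in?
SECONDS_PER_YEAR = 31_556_952  # average Gregorian year (365.2425 days)
year_if_seconds = 1970 + created_on // SECONDS_PER_YEAR

# Read correctly as milliseconds: add to the epoch using exact integer arithmetic.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_utc = epoch + timedelta(milliseconds=created_on)

print(year_if_seconds)
print(as_utc.isoformat(timespec="milliseconds"))
```

The second value is in UTC. Spark renders timestamps in the session time zone, so the wall-clock time it displays may differ.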

Related

Spark DataFrame convert milliseconds timestamp column in string format to human readable time with milliseconds

I have a Spark DataFrame with a timestamp column in milliseconds since the epoch. The column is a string. I want to convert it to a human-readable time but keep the milliseconds.
For example:
1614088453671 -> 23-02-2021 13:54:13.671
Every example I found converts the timestamp to a human-readable time without milliseconds.
What I have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671 |
+------------------+
What I want:
+------------------+------------------------+
|epoch_time_seconds|human_date |
+------------------+------------------------+
|1614088453671 |23-02-2021 13:54:13.671 |
+------------------+------------------------+

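The part of the time before the milliseconds can be obtained with date_format and from_unixtime, and the milliseconds with a modulo. Combine the two using format_string.
import org.apache.spark.sql.functions.{col, date_format, format_string, from_unixtime}

// The column is a string, so cast it to long before doing arithmetic on it.
val epochMs = col("epoch_time_seconds").cast("long")
val df2 = df.withColumn(
  "human_date",
  format_string(
    "%s.%03d", // zero-pad the milliseconds so that 5 ms prints as .005
    date_format(
      from_unixtime(epochMs / 1000),
      "dd-MM-yyyy HH:mm:ss"
    ),
    epochMs % 1000
  )
)
df2.show(false)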
+------------------+-----------------------+
|epoch_time_seconds|human_date |
+------------------+-----------------------+
|1614088453671 |23-02-2021 13:54:13.671|
+------------------+-----------------------+
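
The same split into whole seconds plus a millisecond remainder can be sketched in plain Python (standard library only). The helper name format_epoch_millis is hypothetical, and the result is rendered in UTC, which is an assumption; Spark uses the session time zone.

```python
from datetime import datetime, timezone

def format_epoch_millis(epoch_ms: int) -> str:
    """Format epoch milliseconds as day-month-year hour:minute:second.millis in UTC."""
    seconds, millis = divmod(epoch_ms, 1000)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Zero-pad the remainder so that 5 ms renders as ".005" rather than ".5".
    return f"{whole.strftime('%d-%m-%Y %H:%M:%S')}.{millis:03d}"

# The column in the question holds strings, so convert to int first.
print(format_epoch_millis(int("1614088453671")))
```

Zero-padding matters whenever the remainder is below 100; without it, a value such as 1614088453005 would lose its leading zeros.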

spark sql datediff in days

I am trying to calculate the number of days between current_timestamp() and max(timestamp_field) from a table.
maxModifiedDate = spark.sql("select date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,date_format(current_timestamp(),'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp, datediff(current_timestamp(), date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss')) as daysDiff from db.tbl")
but I get null for daysDiff. Why is that and how can I fix it?
+-------------------+-------------------+--------+
| maxModifiedDate| CurrentTimeStamp|daysDiff|
+-------------------+-------------------+--------+
|01/29/2020 05:07:51|06/29/2020 08:36:28| null|
+-------------------+-------------------+--------+

I used to_timestamp to parse the strings into timestamps, then the datediff function to calculate the difference in days.
from pyspark.sql import functions as F
# InputDF
# +-------------------+-------------------+
# | maxModifiedDate| CurrentTimeStamp|
# +-------------------+-------------------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28|
# +-------------------+-------------------+
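# Parse with the full pattern so the time part of the string is consumed as well.
df.select(
    "maxModifiedDate",
    "CurrentTimeStamp",
    F.datediff(
        F.to_timestamp("CurrentTimeStamp", format="MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("maxModifiedDate", format="MM/dd/yyyy HH:mm:ss"),
    ).alias("datediff"),
).show()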
# +-------------------+-------------------+--------+
# | maxModifiedDate| CurrentTimeStamp|datediff|
# +-------------------+-------------------+--------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28| 152|
# +-------------------+-------------------+--------+
Using Spark SQL:
spark.sql("select maxModifiedDate, CurrentTimeStamp, datediff(to_timestamp(CurrentTimeStamp, 'MM/dd/yyyy HH:mm:ss'), to_timestamp(maxModifiedDate, 'MM/dd/yyyy HH:mm:ss')) as datediff from table").show()

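date_format turns a timestamp into a formatted string; it does not produce a date. datediff then returns null because a string such as '01/29/2020 05:07:51' cannot be cast to a date. Instead, use to_date(col, 'fmt'), unix_timestamp + from_unixtime, or to_timestamp together with datediff. Note that datediff takes the end date first, so passing maxModifiedDate first, as in the query here, gives a negative result.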
df.show()
#+-------------------+-------------------+
#| maxModifiedDate| CurrentTimeStamp|
#+-------------------+-------------------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28|
#+-------------------+-------------------+
spark.sql("select maxModifiedDate, CurrentTimeStamp, datediff(to_date(maxModifiedDate, 'MM/dd/yyyy HH:mm:ss'), to_date(CurrentTimeStamp, 'MM/dd/yyyy HH:mm:ss')) as daysDiff from tmp").show()
#+-------------------+-------------------+--------+
#| maxModifiedDate| CurrentTimeStamp|daysDiff|
#+-------------------+-------------------+--------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28| -152|
#+-------------------+-------------------+--------+
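
The sign difference between the two answers (152 versus -152) comes only from the argument order. A minimal plain-Python sketch of the same day arithmetic, using the standard library and Python's own strptime codes rather than Spark pattern letters, shows both orders. The variable names are illustrative.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"  # Python strptime codes, not Spark pattern letters

max_modified = datetime.strptime("01/29/2020 05:07:51", FMT)
current_ts = datetime.strptime("06/29/2020 08:36:28", FMT)

# Drop the time of day first, then subtract end - start in whole days.
days_diff = (current_ts.date() - max_modified.date()).days

# Swapping the operands flips the sign.
reversed_diff = (max_modified.date() - current_ts.date()).days

print(days_diff, reversed_diff)
```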

datediff() only computes the difference between dates, in whole days, and ignores the time of day. If you need a finer-grained difference, you can compute it yourself.
I suggest something like this, casting both datetimes to long (seconds since the epoch):
diff_seconds = col("end_time").cast("long") - col("start_time").cast("long")
df = df.withColumn("diff", diff_seconds/60)
Or convert the values to timestamps in SQL before calling datediff:
SELECT datediff(to_timestamp(end_date), to_timestamp(start_date))
The subtraction gives the difference in seconds between the two datetimes; change the divisor to change the unit (60 for minutes, 60*60 for hours, and so on).
Alternatively, if you want to use datediff(), cast your datetime columns to date columns (without hours, minutes, and seconds) using to_date() and then apply datediff().
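
As a rough illustration of the scale factors, here is a plain-Python sketch (standard library only) that takes the difference in seconds between the two sample values from the question and rescales it. The names are illustrative, and naive datetimes are assumed, so no time zone or daylight saving adjustment is applied.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"
start = datetime.strptime("01/29/2020 05:07:51", FMT)
end = datetime.strptime("06/29/2020 08:36:28", FMT)

# Difference in seconds, the analogue of subtracting two long-cast timestamps.
diff_seconds = int((end - start).total_seconds())

# Rescale by dividing: 60 for minutes, 60*60 for hours, 24*60*60 for days.
diff_minutes = diff_seconds / 60
diff_hours = diff_seconds / (60 * 60)
whole_days = diff_seconds // (24 * 60 * 60)

print(diff_seconds, diff_minutes, diff_hours, whole_days)
```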

PySpark custom TimestampType column conversion

I'm trying to convert a String column, which is essentially a date, into a TimestampType column, however I'm having problems with how to split the value.
-RECORD 0-------------------------------------
year | 2016
month | 4
arrival_date | 2016-04-30
date_added | 20160430
allowed_date | 10292016
I have 3 columns, all of which are in different formats so I'm trying to find a way to split the string in a custom way, since the date_added column is yyyymmdd and the allowed_date is mmddyyyy.
I've tried something along the lines of:
df_imigration.withColumn('cc', F.date_format(df_imigration.allowed_date.cast(dataType=t.TimestampType()), "yyyy-mm-dd"))
But with no success, and I'm stuck trying to find the right or best way to solve this.
The t and F aliases are for the following imports:
from pyspark.sql import functions as F
from pyspark.sql import types as t

The problem with your code is that you are casting the string to a timestamp without specifying its format.
To specify the format, use the to_timestamp() function.
Here I created a DataFrame with three different formats, and it worked. Note that the month is uppercase MM; lowercase mm means minutes.
from pyspark.sql import functions as F

df1 = spark.createDataFrame([("20201231", "12312020", "31122020"), ("20201231", "12312020", "31122020")], ["ID", "Start_date", "End_date"])
# Parse each column with its own pattern, then render all of them as yyyy-MM-dd.
df1 = df1.withColumn('cc', F.date_format(F.to_timestamp(df1.ID, 'yyyyMMdd'), "yyyy-MM-dd"))
df1 = df1.withColumn('dd', F.date_format(F.to_timestamp(df1.Start_date, 'MMddyyyy'), "yyyy-MM-dd"))
df1.withColumn('ee', F.date_format(F.to_timestamp(df1.End_date, 'ddMMyyyy'), "yyyy-MM-dd")).show()
Output:
+--------+----------+--------+----------+----------+----------+
| ID|Start_date|End_date| cc| dd| ee|
+--------+----------+--------+----------+----------+----------+
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
|20201231| 12312020|31122020|2020-12-31|2020-12-31|2020-12-31|
+--------+----------+--------+----------+----------+----------+
Hope it helps!
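
The idea of giving each column its own parse format can also be sketched in plain Python with the sample values from the question. This uses the standard library's strptime codes, which differ from Spark's pattern letters, and the dictionary names are illustrative only.

```python
from datetime import datetime

# Sample values from the question; each column uses a different layout.
record = {
    "arrival_date": "2016-04-30",
    "date_added": "20160430",
    "allowed_date": "10292016",
}

# One format per column. In strptime, %m is the month and %M is the minute.
formats = {
    "arrival_date": "%Y-%m-%d",
    "date_added": "%Y%m%d",
    "allowed_date": "%m%d%Y",
}

parsed = {name: datetime.strptime(value, formats[name]) for name, value in record.items()}

for name, value in parsed.items():
    print(name, value.date().isoformat())
```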

Spark 2.0 How to convert DF Date/timestamp column to another date format in scala?

For learning purposes, I have been using the sample dataset below.
+-------------------+-----+-----+-----+-----+-------+
| MyDate| Open| High| Low|Close| Volume|
+-------------------+-----+-----+-----+-----+-------+
|2006-01-03 00:00:00|983.8|493.8|481.1|492.9|1537660|
|2006-01-04 00:00:00|979.6|491.0|483.5|483.8|1871020|
|2006-01-05 00:00:00|972.2|487.8|484.0|486.2|1143160|
|2006-01-06 00:00:00|977.8|489.0|482.0|486.2|1370250|
|2006-01-09 00:00:00|973.4|487.4|483.0|483.9|1680740|
+-------------------+-----+-----+-----+-----+-------+
I tried to change the "MyDate" column values to a different format, such as "YYYY-MON", and wrote this:
citiDataDF.withColumn("New-Mydate",to_timestamp($"MyDate", "yyyy-MON")).show(5)
After running the code, I get the new column "New-Mydate", but not in the desired output format. Can you please help?

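You need date_format instead of to_timestamp: to_timestamp parses a string into a timestamp, while date_format renders a timestamp as a string in the pattern you give it.
val citiDataDF = List("2006-01-03 00:00:00").toDF("MyDate")
citiDataDF.withColumn("New-Mydate",date_format($"MyDate", "yyyy-MMM")).show(5)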
Result:
+-------------------+----------+
| MyDate|New-Mydate|
+-------------------+----------+
|2006-01-03 00:00:00| 2006-Jan|
+-------------------+----------+
Note: three "M" characters give the month as an abbreviated name; if you want the month as a number, use only two "M" characters.
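
The distinction between parsing and formatting can be sketched in plain Python (standard library only). Here strptime plays the role of parsing and strftime the role of formatting; the format codes are Python's, not Spark's, and the variable names are illustrative.

```python
from datetime import datetime

raw = "2006-01-03 00:00:00"

# Parsing: string -> datetime, using a format that describes the input.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Formatting: datetime -> string, using a format that describes the output.
year_month_name = parsed.strftime("%Y-%b")  # abbreviated month name (default C locale)
year_month_num = parsed.strftime("%Y-%m")   # two-digit month number

print(year_month_name, year_month_num)
```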

convert date to integer scala spark

I have a DataFrame that contains two date columns, start_date and finish_date, and I created a new column, moyen_date, for the difference between the two dates.
+--------------------+--------------------+----------+
|          start_date|         finish_date|moyen_date|
+--------------------+--------------------+----------+
|2010-11-03 15:56:...|2010-11-03 17:43:...|         0|
|2010-11-03 17:43:...|2010-11-05 13:21:...|         2|
|2010-11-05 13:21:...|2010-11-05 14:08:...|         0|
|2010-11-05 14:08:...|2010-11-05 14:08:...|         0|
+--------------------+--------------------+----------+
I calculated the difference between the 2 dates:
var result = sqlDF.withColumn("moyen_date",datediff(col("finish_date"), col("start_date")))
However, I want to convert start_date and finish_date to integers, given that each column contains a date and a time.
Can someone help me, please?
Thank you.

Considering this as part of your dataframe:
df.show(false)
+---------------------+
|ts |
+---------------------+
|2010-11-03 15:56:34.0|
+---------------------+
unix_timestamp returns the number of seconds since the epoch. The input column should be of type timestamp. The output column is of type long.
df.withColumn("unix_ts", unix_timestamp($"ts")).show(false)
+---------------------+----------+
|ts |unix_ts |
+---------------------+----------+
|2010-11-03 15:56:34.0|1288817794|
+---------------------+----------+
To convert it back to timestamp format of your choice, you can use from_unixtime which also takes an optional timestamp format as a parameter. You are using to_date, that's why you're only getting the date and not the time.
df.withColumn("unix_ts" , unix_timestamp($"ts") )
.withColumn("from_utime" , from_unixtime($"unix_ts" , "yyyy-MM-dd HH:mm:ss.S"))
.show(false)
+---------------------+----------+---------------------+
|ts |unix_ts |from_utime |
+---------------------+----------+---------------------+
|2010-11-03 15:56:34.0|1288817794|2010-11-03 15:56:34.0|
+---------------------+----------+---------------------+
The column from_utime here will be of type string, though. To convert it to a timestamp, you can simply use:
df.withColumn("from_utime" , $"from_utime".cast("timestamp") )
Since it's already in ISO date format, no specific conversion is needed. For any other format, you will need to use a combination of unix_timestamp and from_unixtime.
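
The round trip between a timestamp and its integer form can be sketched in plain Python (standard library only). The sketch assumes the timestamp is in UTC, which is an assumption made here for reproducibility; Spark interprets the value in the session time zone, which is presumably why the unix_ts in the output above differs from the UTC result by a whole number of hours.

```python
from datetime import datetime, timezone

# The sample timestamp from the answer, assumed here to be in UTC.
ts = datetime(2010, 11, 3, 15, 56, 34, tzinfo=timezone.utc)

# Timestamp -> integer seconds since the epoch.
unix_ts = int(ts.timestamp())

# Integer seconds -> timestamp again.
round_trip = datetime.fromtimestamp(unix_ts, tz=timezone.utc)

print(unix_ts)
print(round_trip.strftime("%Y-%m-%d %H:%M:%S"))
```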
