I have two timestamp columns in a DataFrame and I'd like to get the difference between them in minutes or, alternatively, in hours. Currently I can get the day difference, with rounding, by doing
val df2 = df1.withColumn("time", datediff(df1("ts1"), df1("ts2")))
However, when I looked at the doc page
https://issues.apache.org/jira/browse/SPARK-8185
I didn't see any extra parameters to change the unit. Is there a different function I should be using for this?
You can get the difference in seconds by casting both columns to long (seconds since the epoch):
import org.apache.spark.sql.functions._
val diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
Then you can do some math to get the unit you want. For example:
val df2 = df1
.withColumn( "diff_secs", diff_secs_col )
.withColumn( "diff_mins", diff_secs_col / 60D )
.withColumn( "diff_hrs", diff_secs_col / 3600D )
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
Or, in pyspark:
from pyspark.sql.functions import *
diff_secs_col = col("ts1").cast("long") - col("ts2").cast("long")
df2 = df1 \
.withColumn( "diff_secs", diff_secs_col ) \
.withColumn( "diff_mins", diff_secs_col / 60D ) \
.withColumn( "diff_hrs", diff_secs_col / 3600D ) \
.withColumn( "diff_days", diff_secs_col / (24D * 3600D) )
The answer given by Daniel de Paula works, but that solution does not work in the case where the difference is needed for every row in your table. Here is a solution that will do that for each row:
import org.apache.spark.sql.functions._
val df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2)) / 3600 AS diff_hours")
This first converts the data in the columns to a unix timestamp in seconds, subtracts them and then converts the difference to hours.
A useful list of functions can be found at:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$
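For reference, roughly the same one-liner in PySpark (a sketch, assuming columns named ts1 and ts2 as above):

# unix_timestamp turns each timestamp into seconds since the epoch;
# the difference in seconds divided by 3600 gives hours
df2 = df1.selectExpr("(unix_timestamp(ts1) - unix_timestamp(ts2)) / 3600 AS diff_hours")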
In my PySpark dataframe, I have a column 'TimeStamp' which is in DateTime format. I want to convert that to 'Date' format and then use that in the 'GroupBy'.
df = spark.sql("SELECT * FROM `myTable`")
df.filter((df.somthing!="thing"))
df.withColumn('MyDate', col('Timestamp').cast('date'))
df.groupBy('MyDate').count().show()
But I get this error:
cannot resolve 'MyDate' given input columns:
Can you please help me with this?
Each time you call something like df.filter(...) or df.withColumn(...), you create a new DataFrame; the transformations do not modify df in place.
df was only assigned in your first line of code, so that DataFrame object never gets the new column MyDate.
You can look at the id() of each object to see that they are different:
df = spark.sql("SELECT * FROM `myTable`")
print(id(df))
print(id(df.filter(df.somthing!="thing")))
This is the correct syntax for chaining operations:
df = spark.sql("SELECT * FROM myTable")
df = (df
.filter(df.somthing != "thing")
.withColumn('MyDate', col('Timestamp').cast('date'))
.groupBy('MyDate').count()
)
df.show(truncate=False)
UPDATE: this is a better way to write it
df = (
spark.sql(
"""
SELECT *
FROM myTable
""")
.filter(col("something") != "thing")
.withColumn("MyDate", col("Timestamp").cast("date"))
.groupBy("MyDate").count()
)
I have an event_time column in my data frame.
I would like to convert the event_time into a date/time. I used the below code, but the result does not come out properly:
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
The output I am getting does not look right.
Can anyone advise how to do this properly, as I am new to pyspark?
It seems that your data is in microseconds (1/1,000,000 of a second), so you have to divide by 1,000,000 before passing it to from_unixtime, which expects seconds. (As a side note, in the format pattern minutes are mm and seconds are ss; MM and SS mean something else, which is another reason your formatted output looked wrong.)
df = spark.createDataFrame(
[
('1645904274665267',),
('1645973845823770',),
('1644134156697560',),
('1644722868485010',),
('1644805678702121',),
('1645071502180365',),
('1644220446396240',),
('1645736052650785',),
('1646006645296010',),
('1644544811297016',),
('1644614023559317',),
('1644291365608571',),
('1645643575551339',)
], ['Event_Time']
)
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime(f.col("Event_Time")/1000000))
df.show(truncate = False)
Output:
+----------------+-------------------+
|Event_Time |date |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
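Note that from_unixtime returns a formatted string. If you would rather have a real timestamp column that keeps the microsecond part, one option (a sketch, reusing the same Event_Time column and the f alias imported above) is to divide and cast:

# dividing the microsecond epoch by 1,000,000 gives fractional seconds;
# casting the resulting double to timestamp interprets it as seconds since the epoch
df = df.withColumn(
    "event_ts",
    (f.col("Event_Time").cast("double") / 1000000).cast("timestamp")
)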
I am trying to add a float column of seconds to a TimestampType column in pyspark, but there does not seem to be a way to do this while maintaining the fractional seconds. An example float_seconds value is 19.702300786972046; an example timestamp is 2021-06-17 04:31:32.48761.
What I want:
calculated_df = beginning_df.withColumn("calculated_column", float_seconds_col + TimestampType_col)
I have tried the following methods, but neither completely solves the problem:
# Method 1 adds a single fixed interval, but cannot be used to add a whole column to the timestamp column.
calculated_df = beginning_df.withColumn("calculated_column",col("TimestampType_col") + F.expr('INTERVAL 19.702300786 seconds'))

# Method 2 converts the float column with from_unixtime, but cuts off the decimals (which are important).
timestamp_seconds = beginning_df.select(from_unixtime("float_seconds"))
Image of the two columns in question
You could achieve it using a UDF as follows:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, FloatType, TimestampType
spark = SparkSession \
.builder \
.appName("StructuredStreamTesting") \
.getOrCreate()
schema = (StructType([
StructField('dt', TimestampType(), nullable=True),
StructField('sec', FloatType(), nullable=True),
]))
item1 = {
"dt": datetime.fromtimestamp(1611859271.516),
"sec": 19.702300786,
}
item2 = {
"dt": datetime.fromtimestamp(1611859271.517),
"sec": 19.702300787,
}
item3 = {
"dt": datetime.fromtimestamp(1611859271.518),
"sec": 19.702300788,
}
df = spark.createDataFrame([item1, item2, item3], schema=schema)
df.printSchema()
@udf(returnType=TimestampType())
def add_time(dt, sec):
return dt + timedelta(seconds=sec)
df = df.withColumn("new_dt", add_time(col("dt"), col("sec")))
df.printSchema()
df.show(truncate=False)
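If you prefer to avoid a Python UDF, one possible alternative is to do the arithmetic in plain seconds and cast back (a sketch; going through double keeps roughly microsecond precision for present-day epoch values, not more):

from pyspark.sql.functions import col

# cast the timestamp to double (seconds since the epoch, including the fractional part),
# add the float seconds, then cast the sum back to a timestamp
df = df.withColumn(
    "new_dt_no_udf",
    (col("dt").cast("double") + col("sec")).cast("timestamp")
)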
The timestamp data type supports at most nanosecond precision (9 fractional digits). Your float_seconds_col has more precision than that (15 fractional digits in your example, i.e. femtoseconds), so the extra digits will be lost when converting to a timestamp anyway.
Plain vanilla Hive:
select
    timestamp(
        cast(
            concat(
                cast(unix_timestamp(TimestampType_col) as string),  -- whole seconds
                '.',
                regexp_extract(TimestampType_col, '\\.(\\d+)$')     -- fractional part
            ) as decimal(30, 15)
        ) + float_seconds_col  -- round this value to nanos to get a better timestamp conversion: round(float_seconds_col, 9)
    ) as result                -- max precision is 9 digits (nanoseconds)
from
(
    select 19.702300786972046 float_seconds_col,
           timestamp('2021-06-17 04:31:32.48761') TimestampType_col
) s
Result:
2021-06-17 04:31:52.189910786
//loading DF
val df1 = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")
// df1 contains:
+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error:
=> the datatype does not match (date_time: bigint) in df
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Any solution for solving it?
I tried doing
val a = df1.withColumn("date_time",df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime",a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch milliseconds to a date and time first, and then do the computation. You can try this:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")

val df1 = df
  .withColumn(
    "dateCreated",
    date_format(
      to_date(
        substring(from_unixtime($"date_time".divide(1000)), 0, 10),
        "yyyy-MM-dd"
      ),
      "dd-MM-yyyy"
    )
  )
  .withColumn(
    "timeCreated",
    substring(from_unixtime($"date_time".divide(1000)), 11, 19)
  )
Sample data from my use case:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust for the time zone: by default from_unixtime uses your session time zone, which for me is GMT+05:30. Hope it helps.
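For comparison, a simpler route in PySpark (a sketch, assuming the epoch-millisecond column is named date_time as above) is to cast to a timestamp once and then format the date and time parts:

from pyspark.sql import functions as F

# epoch milliseconds -> seconds -> timestamp, then format the date and time separately
df1 = (
    df.withColumn("ts", (F.col("date_time") / 1000).cast("timestamp"))
      .withColumn("dateCreated", F.date_format("ts", "dd-MM-yyyy"))
      .withColumn("timeCreated", F.date_format("ts", "HH:mm:ss"))
)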
I have two columns, A (year1) and B (year2), in Spark. I need to create a column C which contains an array of the years between year1 and year2.
Suppose A = 1990 and B = 1993;
the output C should be [1990, 1990, 1991, 1991, 1992, 1992, 1993, 1993].
Could anyone come up with a solution (Spark) without using a UDF?
You could try the following, assuming df contains year1 and year2:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
years = spark.range(2020).withColumnRenamed('id', 'year')
df = (
    df
    .withColumn(
        'id',
        F.monotonically_increasing_id()
    )
    .join(
        years,
        F.col('year').between(F.col('year1'), F.col('year2')),
    )
    .groupBy(
        'id'
    )
    .agg(
        F.collect_list('year').alias('years')
    )
)
Let me know if this doesn't work.
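Alternatively, if you are on Spark 2.4 or later (an assumption about your version), the built-in sequence and transform higher-order functions can build the array directly, without the join or a UDF:

from pyspark.sql import functions as F

# sequence(year1, year2) builds [year1, year1+1, ..., year2];
# transform + array duplicate each element to match the example output,
# and flatten turns the array of pairs back into a flat array
df = df.withColumn(
    "C",
    F.expr("flatten(transform(sequence(year1, year2), x -> array(x, x)))")
)
# e.g. year1 = 1990, year2 = 1993 -> [1990, 1990, 1991, 1991, 1992, 1992, 1993, 1993]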