Quick question that I did not find on Google: what is the best way to create a Spark DataFrame of timestamps?
Let's say I have a start point, an end point and a 15-minute interval. What would be the best way to solve this in Spark?
Try the Spark SQL functions with withColumn, e.g. DF.withColumn("regTime", current_timestamp), which adds or replaces a column.
You can also convert a string to a date or timestamp, e.g. DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))),
where regTime is a column of DF.
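For the original question (a series of timestamps between a start point and an end point at a 15-minute step), here is a minimal sketch in PySpark, assuming Spark 2.4+ where sequence() accepts an interval step; the start/end values below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one row per 15-minute step between the two timestamps
df = spark.sql("""
    SELECT explode(sequence(
        to_timestamp('2019-01-01 00:00:00'),
        to_timestamp('2019-01-02 00:00:00'),
        interval 15 minutes
    )) AS ts
""")
df.show(5, truncate=False)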
In DataBricks notebook using pyspark I need to create/add a new timestamp column based on an existing date column while adding hours to it based on an existing hours-bin integer column - this is to support the creation of an event-driven time-series feature set, which requires in this case that the timestamp be limited to date and hour (no minutes, seconds, etc...). I have tried using string-based expr(), date_add(), various formatted-string and cast() combinations but I get a maddening slew of errors related to column access, parsing issues and the like. What is the simplest way to accomplish this?
In my opinion, unix_timestamp is the simplest method:
from pyspark.sql.functions import col, unix_timestamp

dfResult = dfSource.withColumn(
    "yourNewTimestampColName",
    (unix_timestamp(col("yourExistingDateCol")) +
     col("yourExistingHoursCol") * 3600).cast("timestamp"))
Here yourNewTimestampColName is the name of the timestamp column you want to add, yourExistingDateCol is a date column that must be present under this name in the dfSource dataframe, and yourExistingHoursCol is an integer hour column that must also be present under this name in dfSource.
unix_timestamp() converts the date to seconds, so the arithmetic is done in seconds: to add hours multiply yourExistingHoursCol by 3,600, to add minutes multiply by 60, to add days multiply by 3,600*24, and so on.
Executing display(dfResult) should show structure/content of the dfSource dataframe with a new column named yourNewTimestampColName containing the date/hour combination requested.
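For illustration, a minimal self-contained sketch of this pattern with a hypothetical dfSource (the column names follow the placeholders above and the sample rows are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, unix_timestamp

spark = SparkSession.builder.getOrCreate()

# hypothetical input: a date column plus an hours-bin integer column
dfSource = spark.createDataFrame(
    [("2019-09-03", 0), ("2019-09-03", 5), ("2019-09-04", 23)],
    ["yourExistingDateCol", "yourExistingHoursCol"],
).withColumn("yourExistingDateCol", to_date(col("yourExistingDateCol")))

# shift the date by the hour bin and cast the result back to a timestamp
dfResult = dfSource.withColumn(
    "yourNewTimestampColName",
    (unix_timestamp(col("yourExistingDateCol")) +
     col("yourExistingHoursCol") * 3600).cast("timestamp"))
dfResult.show(truncate=False)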
[screenshots: timestamp before / timestamp after #pissall's code]
I have a Timestamp column sampled at 0.5 Hz, which results in millions of rows. I want to reduce the data size by keeping the timestamps at an hourly granularity, i.e. 24 observations for a particular day.
I have already reduced the data size by filtering by year, month and day, but as it is still very big I now want to reduce it to an hourly basis.
I am working on Databricks and using PySpark for this.
I used the following command to reduce the data from years to a single day:
df = df.filter(df.Timestamp.between('2019-09-03 00:00:00','2019-09-04 00:00:00'))
I would appreciate your help.
Thanks
You can replace the minutes and seconds part of your datetime using a UDF. Might not be the best solution, but here you go:
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType

# zero out minutes, seconds and microseconds so every timestamp falls on the hour
date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0), TimestampType())
df = df.withColumn("Timestamp", date_replace_udf(F.col("Timestamp")))
Another reference: How to truncate the time on a DateTime object in Python?
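If you are on Spark 2.3 or later, a possibly simpler alternative (a sketch, not tested against your data) is the built-in date_trunc function, which truncates to the hour without a Python UDF:

from pyspark.sql import functions as F

# truncate every timestamp to the start of its hour using a built-in function
df = df.withColumn("Timestamp", F.date_trunc("hour", F.col("Timestamp")))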
I have passed a string (datestr) to a function that does ETL on a dataframe in Spark using the Scala API; at some point I need to filter the dataframe by a certain date,
something like:
df.filter(col("dt_adpublished_simple") === date_add(datestr, -8))
where datestr is the parameter that I passed to the function.
Unfortunately, the date_add function requires a Column as its first parameter.
Can anyone help me with how to convert the param into a column or a similar solution that will solve the issue?
You probably only need lit to create a String Column from your input String, and then to_date to create a Date Column from it.
df.filter(col("dt_adpublished_simple") === date_add(to_date(lit(datestr), format), -8))
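For reference, the same pattern sketched in PySpark (datestr, the format string and the column name are placeholders taken from the question):

from pyspark.sql import functions as F

datestr = "2019-09-03"
df_filtered = df.filter(
    F.col("dt_adpublished_simple") == F.date_add(F.to_date(F.lit(datestr), "yyyy-MM-dd"), -8)
)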
I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/")
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime functions from Python's standard library:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also access the year, month, day and hour parts separately:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
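Putting it together, a sketch that assumes the bucket really has zero-padded hour-level folders under each day and that the path prefix matches your setup:

from datetime import datetime, timedelta

latest_hour = datetime.now() - timedelta(hours=1)

# build e.g. s3a://data/metrics/2018/05/20/13/ for the most recent full hour
path = "s3a://data/metrics/{:04d}/{:02d}/{:02d}/{:02d}/".format(
    latest_hour.year, latest_hour.month, latest_hour.day, latest_hour.hour)
myDf = spark.read.parquet(path)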
I am processing the data in Spark shell and have a dataframe with a date column. The format of the column is like "2017-05-01 00:00:00.0", but I want to change all the values to "2017-05-01" without the "00:00:00.0".
Thanks!
Just use String.split():
"2017-05-01 00:00:00.0".split(" ")(0)