Spark Scala Convert Int Column into Datetime - scala

I have a datetime stored in the format YYYYMMDDHHMMSS (data type: long int).
The sample data lives in a temp view, ingestionView, which comes from a DataFrame.
Now I want to add a new column, newingestiontime, to the DataFrame in the format YYYY-MM-DD HH:MM:SS.
One of the ways I have tried is the following, but it didn't work either:
val res = ingestiondatetimeDf.select(col("ingestiontime"), unix_timestamp(col("newingestiontime"), "yyyyMMddHHmmss").cast(TimestampType).as("timestamp"))
Please help me here, and if there is a better way to accomplish this, I would be delighted to learn something new.
Thanks in advance.

Use from_unixtime & unix_timestamp.
Check the code below.
scala> df
     .withColumn(
       "newingestiontime",
       from_unixtime(
         unix_timestamp($"ingestiontime".cast("string"), "yyyyMMddHHmmss")
       )
     )
     .show(false)
+--------------+-------------------+
|ingestiontime |newingestiontime |
+--------------+-------------------+
|20200501230000|2020-05-01 23:00:00|
+--------------+-------------------+
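If you are on Spark 2.2 or later, to_timestamp can do the same conversion without the unix_timestamp/from_unixtime round trip. A minimal sketch, assuming the same long ingestiontime column as above:

// Sketch, assuming Spark 2.2+ and the same long ingestiontime column.
// to_timestamp parses the string form directly; date_format renders it
// back as a "yyyy-MM-dd HH:mm:ss" string if that is what is needed.
import org.apache.spark.sql.functions.{col, to_timestamp, date_format}

val withTimestamp = df
  .withColumn("newingestiontime",
    to_timestamp(col("ingestiontime").cast("string"), "yyyyMMddHHmmss"))
  .withColumn("newingestiontime_str",
    date_format(col("newingestiontime"), "yyyy-MM-dd HH:mm:ss"))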

Related

convert a string type(MM/dd/YYYY hh:mm:ss AM/PM) to date format in PySpark?

I have a string in the format 05/26/2021 11:31:56 AM and I want to convert it to a date format like 05-26-2021 in PySpark.
I have tried the things below, but they convert the column type to date while making the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
Another one I have tried is:
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the below method
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
but with every method the column type converts to date while the values become null.
Any suggestion is appreciated.
# Create a data frame like below
df = spark.createDataFrame(
    [("Test", "05/26/2021 11:31:56 AM")],
    ("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create a data frame with a new column new_date holding the data in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date", 'MM/dd/yyyy hh:mm:ss a'), 'yyyy-MM-dd').cast('date'))
The above answer posted by #User12345 works, and the method below works as well:
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/yyyy hh:mm:ss a").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this:
df = data.withColumn("Date", to_date(to_timestamp("Date", "M/d/yyyy")))

Scala date format

I have a data_date that gives a format of yyyymmdd:
beginDate = Some(LocalDate.of(startYearMonthDay(0), startYearMonthDay(1),
  startYearMonthDay(2)))
var Date = beginDate.get
.......
val data_date = Date.toString().replace("-", "")
This will give me a result of '20180202'.
However, I need the result to be 201802 (yyyymm) for my use case. I don't want to change the value of beginDate, I just want to change the data_date value to fit my use case. How do I do that? Is there a split function I can use?
Thanks!
It's not clear from the code snippet that you're using Spark, but the tags imply that, so I'll give an answer using Spark built-in functions. Suppose your DataFrame is called df with date column my_date_column. Then, you can simply use date_format
scala> import org.apache.spark.sql.functions.date_format
import org.apache.spark.sql.functions.date_format
scala> df.withColumn("my_new_date_column", date_format($"my_date_column", "yyyyMM")).
     | select($"my_new_date_column").limit(1).show
// for example:
+------------------+
|my_new_date_column|
+------------------+
| 201808|
+------------------+
The way to do it with DateTimeFormatter:
val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = Date.format(formatter)
I recommend reading through the DateTimeFormatter docs, so you can format the date the way you want.
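For completeness, a self-contained sketch of that DateTimeFormatter route, assuming beginDate resolves to a java.time.LocalDate as in the question:

// Minimal sketch, assuming a java.time.LocalDate as in the question.
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val beginDate = LocalDate.of(2018, 2, 2)
val formatter = DateTimeFormatter.ofPattern("yyyyMM")
val data_date = beginDate.format(formatter)   // "201802"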
You can do this by taking only the first 6 characters of the resulting string, i.e.:
val s = "20180202"
s.substring(0, 6) // returns "201802"

Convert epochmilli to DDMMYYYY - Spark Scala

I have a dataframe with one of the columns containing timestamps represented in epoch milliseconds (column type is long), and I need to convert them to a column with DDMMYY using withColumn.
Something like:
1528102439 ---> 040618
How do I achieve this?
val df_DateConverted = df.withColumn("Date", from_unixtime(df.col("timestamp").divide(1000), "ddMMyy"))
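A quick self-contained check of that one-liner (the single-row DataFrame below is made up purely for illustration; the timestamp column name matches the answer above):

// Hypothetical one-row DataFrame, only to illustrate the conversion.
import spark.implicits._
import org.apache.spark.sql.functions.from_unixtime

val sample = Seq(1528102439000L).toDF("timestamp")
val converted = sample.withColumn("Date",
  from_unixtime($"timestamp" / 1000, "ddMMyy"))
// 1528102439000 ms is 2018-06-04 09:33:59 UTC, so Date shows "040618"
// (subject to the session time zone).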

How to group by on epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date         Count of records
02-10-2017   4
04-10-2017   3
03-10-2017   5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I get the exception below.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You would need to change the date column, which seems to be a long, to a date data type. This can be done using the from_unixtime built-in function. Then it is just a groupBy and an agg call using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The answer above uses a udf function, which should be avoided as much as possible, since a udf is a black box to Spark and requires serialization and deserialization of the columns.
Updated
Thanks to #philantrovert for his suggestion to divide by 1000
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.
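A third option, not from the original answer, is to skip format strings and rely on casts: dividing the epoch milliseconds by 1000 gives seconds, which Spark can cast straight to a timestamp and then to a date. A sketch, assuming the same df with date and Col1 columns as above and a SparkSession named spark:

// Sketch: cast-based alternative, not part of the original answer.
import spark.implicits._
import org.apache.spark.sql.functions.count

df.withColumn("date", ($"date" / 1000).cast("timestamp").cast("date"))
  .groupBy("date")
  .agg(count("Col1").as("Count of records"))
  .show(false)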

Return null from dayofyear function - Spark SQL

I'm new to Databricks & Spark/Scala.
I'm currently working on a machine learning model to do sales forecasting.
I used the function dayofyear to create features.
The only problem is that it returns null values.
I tried with this CSV because I was using another one and thought the problem could come from that.
But apparently, I was wrong.
I read the docs about this function, but the description is really short.
I tried dayofmonth and weekofyear, with the same result.
Can you explain how I can fix this? What am I doing wrong?
val path = "dbfs:/databricks-datasets/asa/planes/plane-data.csv"
val df = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
display(df)
import org.apache.spark.sql.functions._
val df2 = df.withColumn("dateofyear", dayofyear(df("issue_date")))
display(df2)
Here's the result: the new dateofyear column contains only null values.
You can cast issue_date to a timestamp before using the dayofyear function:
import org.apache.spark.sql.types.TimestampType

data.withColumn("issue_date", unix_timestamp($"issue_date", "MM/dd/yyyy").cast(TimestampType))
  .withColumn("dayofyear", dayofyear($"issue_date"))
Hope this helps!
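If you are on Spark 2.2 or later, to_timestamp is a slightly shorter route to the same fix. A sketch, assuming the question's df and MM/dd/yyyy issue_date values as in the answer above:

// Sketch, assuming Spark 2.2+ and MM/dd/yyyy issue_date values.
import spark.implicits._
import org.apache.spark.sql.functions.{to_timestamp, dayofyear}

val df3 = df
  .withColumn("issue_date_ts", to_timestamp($"issue_date", "MM/dd/yyyy"))
  .withColumn("dayofyear", dayofyear($"issue_date_ts"))
display(df3)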