Convert date to another format Scala Spark - scala

I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type, but I can't find a good solution. This is what I tried:
val pr_date = readeve.withColumn("Date",
  when(to_date(col("Date"), "dd-MMM-yyyy hh:mm:ss").isNotNull,
    to_date(col("Date"), "dd/MM/yyyy hh:mm")))
pr_date.show(25)
The entire Date column comes back as null values. I also tried this function:
def to_date_(col: Column,
             formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date", to_date_(readeve.col("Date"))).show(125)
With the first date format I get nulls too.
What am I doing wrong? (I'm new to Scala Spark.)
Scala version: 2.11.7
Spark version: 2.4.3

Try the code below. Note that an hour like 17 needs HH (24-hour clock), not hh (12-hour clock). Also use to_timestamp instead of to_date, because you want to keep the time part.
val pr_date = readeve.withColumn(
  "Date",
  coalesce(
    date_format(to_timestamp(col("Date"), "dd-MMM-yyyy HH:mm:ss"), "dd/MM/yyyy HH:mm"),
    date_format(to_timestamp(col("Date"), "dd/MM/yyyy HH:mm"), "dd/MM/yyyy HH:mm")
  )
)
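If you prefer the helper-function style from the question, here is a minimal sketch of the same coalesce idea. The sample data, the name normalizeDate, and the SparkSession value spark are made up for illustration only:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical sample mixing both input formats
val sample = Seq("13-Dec-2019 17:10:00", "11/02/2020 17:33").toDF("Date")

// Parse with either format, then always render as dd/MM/yyyy HH:mm
def normalizeDate(c: Column): Column = coalesce(
  date_format(to_timestamp(c, "dd-MMM-yyyy HH:mm:ss"), "dd/MM/yyyy HH:mm"),
  date_format(to_timestamp(c, "dd/MM/yyyy HH:mm"), "dd/MM/yyyy HH:mm")
)

sample.withColumn("Date", normalizeDate(col("Date"))).show(false)
// 13-Dec-2019 17:10:00 -> 13/12/2019 17:10
// 11/02/2020 17:33     -> 11/02/2020 17:33 (already in the target format)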

Related

How to add extra date column in DataFrame by using Spark?

I have a variable, for example:
val loadingDate: LocalDateTime = LocalDateTime.of(2020, 1, 2, 0, 0, 0)
I need to add an extra column by using the value of this variable.
When I try to do this:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
DF.withColumn("dttm", expr(s"${loadingDate.format(formatter)}").cast("timestamp"))
I get an error like this:
Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: org.apache.spark.sql.catalyst.parser.ParseException
mismatched input '00' expecting <EOF>(line 1, pos 11)
==SQL==
2020-01-02 00:00:00
-------------^^^
Can I use variables of type LocalDateTime for adding extra columns in Spark? Or do I have to use other types?
I need to get a date from an external system and use this date in Spark. How can I do this the best way? Which types to use?
You can format your date into a string with val dateString = s"${loadingDate.format(formatter)}" and convert it into Spark's DateType using the to_date() function. First you have to represent the string as a column, in other words turn it into a literal with lit(dateString).
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{lit, to_date}

val date: LocalDateTime = LocalDateTime.of(2020, 1, 2, 0, 0, 0)
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val formattedDate = date.format(formatter)
val dfWithYourDate = df.withColumn("your_date", to_date(lit(formattedDate), "yyyy-MM-dd HH:mm:ss"))
If you need TimestampType, use to_timestamp() instead of to_date().
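For instance, a minimal sketch of the timestamp variant, reusing formattedDate from above (the column name dttm is just the one from the question):
import org.apache.spark.sql.functions.{lit, to_timestamp}

// Keeps the time component instead of truncating to a date
val dfWithYourTimestamp = df.withColumn("dttm", to_timestamp(lit(formattedDate), "yyyy-MM-dd HH:mm:ss"))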

convert a string type(MM/dd/YYYY hh:mm:ss AM/PM) to date format in PySpark?

I have a string in the format 05/26/2021 11:31:56 AM and I want to convert it to a date format like 05-26-2021 in PySpark.
I have tried the things below, but they convert the column type to date while making all the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
Another one I have tried is:
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the below method
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
In every case I was able to convert the column type to date, but the values turn null.
Any suggestion is appreciated.
# Create a data frame like below
df = spark.createDataFrame(
    [("Test", "05/26/2021 11:31:56 AM")],
    ("user_name", "login_date"))

# Import functions
from pyspark.sql import functions as f

# Create a data frame with a new column new_date holding the data in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date", 'MM/dd/yyyy hh:mm:ss a'), 'yyyy-MM-dd').cast('date'))
The answer above posted by #User12345 works, and the method below also works:
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/yyyy hh:mm:ss aa").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this:
df = data.withColumn("Date", to_date(to_timestamp("Date", "M/d/yyyy")))

Convert timestamp column from UTC to EST in spark scala

I have a column of timestamp type in a Spark DataFrame, with a date format like '2019-06-13T11:39:10.244Z'.
My goal is to convert this column into EST time (subtracting 4 hours) while keeping the same format.
I tried using the from_utc_timestamp API, but it seems to convert the UTC time to my local timezone (+5:30) and add it to the timestamp before subtracting 4 hours from it. I tried to use Joda time, but for some reason it is adding 33 days to the EST time.
input = 2019-06-13T11:39:10.244Z
using from_utc_timestamp api:
val tDf = df.withColumn("newTimeCol", to_utc_timestamp(col("timeCol"), "America/New_York"))
output = 2019-06-13T13:09:10.244Z+5:30
using Joda time package:
val coder: String => String = (arg: String) => {
  new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-mm-dd'T'HH:mm:s.SS'Z'")
}
val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))
output = 2019-39-13T07:39:10.244Z
desired output = 2019-06-13T07:39:10.244Z
Kindly advise how I should proceed. Thanks in advance.
There is a typo in your format string when creating the output.
Your format string should be yyyy-MM-dd'T'HH:mm:s.SS'Z' but it is yyyy-mm-dd'T'HH:mm:s.SS'Z'.
mm is the format character for minutes, while MM is the format character for months. You can check all the format characters here.
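For reference, a sketch of the corrected Joda-based UDF, using the question's own variable names. Note that beyond the mm/MM fix, ss.SSS is used here so seconds are zero-padded and all three millisecond digits are kept; that part goes slightly beyond the answer above:
import org.joda.time.{DateTime, DateTimeZone}
import org.apache.spark.sql.functions.{col, udf}

// MM = month, mm = minute
val coder: String => String = (arg: String) =>
  new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")

val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))
// 2019-06-13T11:39:10.244Z -> 2019-06-13T07:39:10.244Z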

How to order string of exact format (dd-MM-yyyy HH:mm) using sparkSQL or Dataframe API

I want a dataframe to be reordered in ascending order based on a datetime column which is in the format of "23-07-2018 16:01"
My program sorts at the date level but not down to HH:mm. I want the output to be sorted by the HH:mm details as well.
package com.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object conversion {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder().master("local").appName("conversion").enableHiveSupport().getOrCreate()
    import spark.implicits._

    val sourceDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("D:\\2018_Sheet1.csv")

    val modifiedDF = sourceDF.withColumn("CredetialEndDate", to_date($"CredetialEndDate", "dd-MM-yyyy HH:mm"))
    // This converts into "dd-MM-yyyy" but "dd-MM-yyyy HH:mm" is expected.
    // What is the equivalent DataFrame API to convert the string to HH:mm?

    modifiedDF.createOrReplaceGlobalTempView("conversion")

    val sortedDF = spark.sql("select * from global_temp.conversion order by CredetialEndDate ASC").show(50)
    // dd-MM-YYYY 23-07-2018 16:01
  }
}
So my result should have the column in the format "23-07-2018 16:01" instead of just "23-07-2018", sorted in ascending order.
The method to_date converts the column into a DateType which has date only, no time. Try to use to_timestamp instead.
Edit: If you want to do the sorting but keep the original string representation you can do something like:
val modifiedDF = sourceDF.withColumn("SortingColumn",to_timestamp($"CredetialEndDate","dd-MM-yyyy HH:mm"))
and then modify the result to:
val sortedDF = spark.sql("select * from global_temp.conversion order by SortingColumn ASC").drop("SortingColumn").show(50)
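If you want the pure DataFrame API equivalent the question's comment asks about, a minimal sketch of the same idea (assuming the sourceDF and the import spark.implicits._ from the question) could be:
import org.apache.spark.sql.functions.to_timestamp

// Sort by the parsed timestamp, then drop the helper column to keep the original string
val sortedDF = sourceDF
  .withColumn("SortingColumn", to_timestamp($"CredetialEndDate", "dd-MM-yyyy HH:mm"))
  .orderBy($"SortingColumn".asc)
  .drop("SortingColumn")

sortedDF.show(50)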

Scala: how to convert an integer to a timestamp

I am facing an issue when trying to find the number of months between two dates using the 'months_between' function. When my input date format is 'dd/mm/yyyy' or any other date format, the function returns the correct output. However, when I pass the input date format as yyyymmdd, I get the error below.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below,
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
I get the expected output:
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
When I pass the data below, I face the error mentioned above.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use the to_date function to convert those numbers into dates.
import org.apache.spark.sql.functions._
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .option("inferSchema", "true")
  .csv("MyFile.csv")

val dfWithDates = df
  .withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd"))
  .withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))

val filteredMemberDF = dfWithDates
  .withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))