getting yearmonth date format in sparksql - pyspark

I'm trying to get a year-month column using this function:
date_format(delivery_date,'mmmmyyyy')
but I'm getting wrong values for the month.
For example, if I have the date 16-9-2020, I want to get the format 202009.

If you have 16-9-2020 as a string, you can convert it to a date and then to your format like this (note that Spark's patterns are case-sensitive: lowercase mm means minutes, so months must be uppercase MM):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"delivery_date": "16-9-2020"},
]
df = spark.createDataFrame(data)
df = df.withColumn(
    "result", F.date_format(F.to_date("delivery_date", "dd-M-yyyy"), "yyyyMM")
)
Result:
+-------------+------+
|delivery_date|result|
+-------------+------+
| 16-9-2020|202009|
+-------------+------+
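The parse-then-reformat approach above can be sketched outside Spark in plain Python (strptime/strftime use %-codes instead of Java-style pattern letters, but the logic is identical; the helper name is just for illustration):

```python
from datetime import datetime

def year_month(date_str: str) -> str:
    # Parse a day-month-year string, then re-emit it as yyyyMM.
    # %d=day, %m=month, %Y=4-digit year (Python's analog of dd-M-yyyy).
    parsed = datetime.strptime(date_str, "%d-%m-%Y")
    return parsed.strftime("%Y%m")

print(year_month("16-9-2020"))  # 202009
```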

Related

Is there a date format in Spark only for the year or month?

For a dataframe df with a column date_string - which represents a string like "20220331" - the following works perfectly:
df = df.withColumn("date",to_date(col("date_string"),"yyyyMMdd"))
For "20220331" the column "date" of type date - just as required - now looks like this: 2022-03-31
I now want two columns "year" and "month" of type date. For "20220331" the column year should be 2022 and the column month should be 2022-03. The following does not work:
df = df.withColumn("year",to_date(col("date_string"),"yyyy"))
       .withColumn("month",to_date(col("date_string"),"yyyyMM"))
Is it even possible in Spark to have something in the form of yyyy and yyyy-mm in the date type?
You can use date_format:
scala> Seq(1).toDF("seq").select(
| date_format(current_timestamp(),"yyyyMM")
| ).show
+----------------------------------------+
|date_format(current_timestamp(), yyyyMM)|
+----------------------------------------+
| 202203|
+----------------------------------------+
Alternatively, if your date is stored as a string, you could just substring the values out.
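Since "20220331" is fixed-width, the substring idea can be sketched with plain string slicing (no Spark needed). As the answer implies, the results are strings: a value of date type always carries year, month, and day, so yyyy or yyyy-MM outputs cannot themselves be of date type:

```python
date_string = "20220331"

# Fixed-width yyyyMMdd: slice out the year and year-month parts.
year = date_string[:4]                            # "2022"
month = date_string[:4] + "-" + date_string[4:6]  # "2022-03"

print(year, month)
```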

convert a string type(MM/dd/YYYY hh:mm:ss AM/PM) to date format in PySpark?

I have a string in the format 05/26/2021 11:31:56 AM and I want to convert it to a date format like 05-26-2021 in PySpark.
I have tried the things below, but each converts the column type to date while making the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
Another one I have tried is
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried this method
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
but in every method I was able to convert the column type to date, yet the values become null.
Any suggestion is appreciated.
# Create a data frame like below
df = spark.createDataFrame(
    [("Test", "05/26/2021 11:31:56 AM")],
    ("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create a data frame with a new column new_date holding the data in the desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date", 'MM/dd/yyyy hh:mm:ss a'), 'yyyy-MM-dd').cast('date'))
The answer above posted by User12345 works, and the method below also works (note lowercase yyyy: uppercase YYYY is week-based year in these patterns and causes wrong results):
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/yyyy hh:mm:ss aa").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this
df=data.withColumn("Date",to_date(to_timestamp("Date","M/d/yyyy")))
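The nulls in the question come from a format mismatch: to_date with 'yyyy-MM-dd' cannot parse '05/26/2021 11:31:56 AM'. The working pattern — 12-hour hh plus an AM/PM marker a — has a direct plain-Python analog, sketched here outside Spark:

```python
from datetime import datetime

raw = "05/26/2021 11:31:56 AM"

# %I = 12-hour clock hour, %p = AM/PM marker -- the analog of Java's "hh ... a".
parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
print(parsed.strftime("%Y-%m-%d"))  # 2021-05-26
```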

Convert date to another format Scala Spark

I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type but I can't find a good solution. I am trying this:
val pr_date = readeve.withColumn("Date", when(to_date(col("Date"),"dd-MMM-yyyy hh:mm:ss").isNotNull,
to_date(col("Date"),"dd/MM/yyyy hh:mm")))
pr_date.show(25)
And I get the entire Date column as null values.
I am trying with this function:
def to_date_(col: Column,
             formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date", to_date_(readeve.col("Date"))).show(125)
And for the first type of date I get nulls too.
What am I doing wrong? (New to Scala Spark.)
Scala version: 2.11.7
Spark version: 2.4.3
Try the code below. Note that an hour like 17 requires HH (24-hour clock), not hh (12-hour). Also use to_timestamp instead of to_date, because you want to keep the time:
val pr_date = readeve.withColumn(
"Date",
coalesce(
date_format(to_timestamp(col("Date"),"dd-MMM-yyyy HH:mm:ss"),"dd/MM/yyyy HH:mm"),
date_format(to_timestamp(col("Date"),"dd/MM/yyyy HH:mm"),"dd/MM/yyyy HH:mm")
)
)
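The coalesce trick — try each format and keep the first successful parse — can be sketched in plain Python (the try_formats-style helper below is hypothetical; Spark's coalesce does the same thing per row):

```python
from datetime import datetime

# Analogs of dd-MMM-yyyy HH:mm:ss and dd/MM/yyyy HH:mm.
FORMATS = ["%d-%b-%Y %H:%M:%S", "%d/%m/%Y %H:%M"]

def to_target(raw: str):
    # Try each known input format; re-emit the first successful parse
    # as dd/MM/yyyy HH:mm, mirroring the coalesce(date_format(...)) chain.
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%d/%m/%Y %H:%M")
        except ValueError:
            continue
    return None  # no format matched, like a null in Spark

print(to_target("13-Dec-2019 17:10:00"))  # 13/12/2019 17:10
print(to_target("11/02/2020 17:33"))      # 11/02/2020 17:33
```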

Scala: how to convert an integer to a timestamp

I am facing an issue when I try to find the number of months between two dates using the 'months_between' function. When my input date format is 'dd/MM/yyyy' or any other date format, the function returns the correct output; however, when I pass the input date format as yyyyMMdd, I get the below error.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below,
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
getting expected output,
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
when i am passing the below data, facing the above said error.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use the to_date function to convert those numbers to dates.
import org.apache.spark.sql.functions._
val df = spark.read
.option("header", "true")
.option("dateFormat", "yyyyMMdd")
.option("inferSchema", "true")
.csv("MyFile.csv")
val dfWithDates = df
.withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd"))
.withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
.withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))
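The error arises because inferSchema reads 20150930 as an int, and months_between needs timestamps or dates. The parse step can be sketched in plain Python, with a hand-rolled whole-month difference (helper names are illustrative; Spark's months_between additionally returns a day-based fractional part):

```python
from datetime import datetime

def parse_yyyymmdd(n: int):
    # Convert an int like 20150930 to a date, mirroring
    # to_date(concat(col(...)), "yyyyMMdd") on an int column.
    return datetime.strptime(str(n), "%Y%m%d").date()

def whole_months_between(to_n: int, from_n: int) -> int:
    # Whole-month difference only; Spark's months_between also
    # accounts for day-of-month as a fraction.
    to_d, from_d = parse_yyyymmdd(to_n), parse_yyyymmdd(from_n)
    return (to_d.year - from_d.year) * 12 + (to_d.month - from_d.month)

print(whole_months_between(20081013, 20080713))  # 3
```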

Timestamp changes format when writing to csv file spark

I am trying to save a dataframe that contains a timestamp to a csv file.
The problem is that this column's format changes once written to the csv file. Here is the code I used:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\datasetTest.csv")
//convert all column to numeric value in order to apply aggregation function
df.columns.map { c =>df.withColumn(c, col(c).cast("int")) }
//add a new column inluding the new timestamp column
val result2=df.withColumn("new_time",((unix_timestamp(col("time"))/300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult=result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") //agg(avg(all columns..)
finalresult.coalesce(1).write.option("header",true).option("inferSchema","true").csv("C:/mydata.csv")
When displayed via df.show it shows the correct format, but in the csv file the timestamp appears in a different format.
Use the timestampFormat option to write the timestamp in the format you need (inferSchema is a read option and has no effect on write):
finalresult.coalesce(1).write.option("header",true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")
or
finalresult.coalesce(1).write.format("csv").option("delimiter", "\t").option("header",true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").option("escape", "\\").save("C:/mydata.csv")
Here is the code snippet that worked for me to modify the CSV output format for timestamps.
I needed a 'T' character in there, and no seconds or microseconds. The timestampFormat option did work for this.
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm")
  .csv("C:/mydata.csv")
Such as 2017-02-20T06:53
If you substitute a space for 'T' then you get this:
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd HH:mm")
  .csv("C:/mydata.csv")
Such as 2017-02-20 06:53
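The timestampFormat strings above map directly onto strftime codes; a plain-Python sketch of the two outputs shown:

```python
from datetime import datetime

ts = datetime(2017, 2, 20, 6, 53)

# "yyyy-MM-dd'T'HH:mm" -> ISO-style with a literal T
print(ts.strftime("%Y-%m-%dT%H:%M"))  # 2017-02-20T06:53

# "yyyy-MM-dd HH:mm" -> same, with a space instead of T
print(ts.strftime("%Y-%m-%d %H:%M"))  # 2017-02-20 06:53
```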