I am trying to convert a string column that has no date just mm/yyyy to a date column with format mm/1/yyyy.
The current code I have is
from pyspark.sql.functions import *
monthdf.select(col("PP Month"),to_date(col("PP Month"),"m/yyyy").alias("date")).show()
but this returns null. Example of how the string column is formatted is 6/2021.
try this:
import pyspark.sql.functions as f
df = (
df.withColumn('date', f.date_format(f.to_date(f.concat(f.lit('1/'), f.col('PP Month')), 'd/M/yyyy'), 'MM/d/yyyy'))
)
Related
In a spark dataframe, I will like to convert date column, "Date" which is in string format (eg. 20220124) to 2022-01-24 and then to date format using python.
df_new= df.withColumn('Date',to_date(df.Date, 'yyyy-MM-dd'))
You can do it with to_date function which gets the input col and format of your date.
from pyspark.sql import functions as F
df.withColumn('date', F.to_date('date', 'yyyyMMdd'))
I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code, can anyone help me to convert without changing values.
df=spark.createDataFrame(
data = [ ("1","2020-04-06 15:06:16 +00:00")],
schema=["id","input_timestamp"])
df.printSchema()
#Timestamp String to DateType
df = df.withColumn("timestamp",to_timestamp("input_timestamp"))
# Using Cast to convert TimestampType to DateType
df.withColumn('timestamp_string', \
to_timestamp('timestamp').cast('string')) \
.show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how can I prevent it?
I believe to_timestamp is converting timestamp value to your local time as you have +00:00 in your data.
Try to pass the format to to_timestamp() function.
Example:
from pyspark.sql.functions import to_timestamp
df.withColumn("timestamp",to_timestamp(col("input_timestamp"),"yyyy-MM-dd HH:mm:ss +00:00")).show(10,False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp',
your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
I have a string in format 05/26/2021 11:31:56 AM for mat and I want to convert it to a date format like 05-26-2021 in pyspark.
I have tried below things but its converting the column type to date but making the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
another one which I have tried is
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the below method
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
but in every method I was able to convert the column type to date but it makes the values null.
Any suggestion is appreciated
# Create data frame like below
df = spark.createDataFrame(
[("Test", "05/26/2021 11:31:56 AM")],
("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create data framew with new column new_date with data in desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date",'MM/dd/yyyy hh:mm:ss a'),'yyyy-MM-dd').cast('date'))
The above answer posted by #User12345 works and the below method is also works
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/YYYY hh:mm:ss aa").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this
df=data.withColumn("Date",to_date(to_timestamp("Date","M/d/yyyy")))
I want a dataframe to be reordered in ascending order based on a datetime column which is in the format of "23-07-2018 16:01"
My program sorts to date level but not HH:mm standard.I want output to include HH:mm details as well sorted according to it.
package com.spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object conversion{
def main(args:Array[String]) = {
val spark = SparkSession.builder().master("local").appName("conversion").enableHiveSupport().getOrCreate()
import spark.implicits._
val sourceDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load("D:\\2018_Sheet1.csv")
val modifiedDF = sourceDF.withColumn("CredetialEndDate",to_date($"CredetialEndDate","dd-MM-yyyy HH:mm"))
//This converts into "dd-MM-yyyy" but "dd-MM-yyyy HH:mm" is expected
//what is the equivalent Dataframe API to convert string to HH:mm ?
modifiedDF.createOrReplaceGlobalTempView("conversion")
val sortedDF = spark.sql("select * from global_temp.conversion order by CredetialEndDate ASC ").show(50)
//dd-MM-YYYY 23-07-2018 16:01
}
}
So my result should have the column in the format "23-07-2018 16:01" instead of just "23-07-2018" and having sorted ascending manner.
The method to_date converts the column into a DateType which has date only, no time. Try to use to_timestamp instead.
Edit: If you want to do the sorting but keep the original string representation you can do something like:
val modifiedDF = sourceDF.withColumn("SortingColumn",to_timestamp($"CredetialEndDate","dd-MM-yyyy HH:mm"))
and then modify the result to:
val sortedDF = spark.sql("select * from global_temp.conversion order by SortingColumnASC ").drop("SortingColumn").show(50)
How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp you can cast it to a long to get the epoch seconds
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.functions. It can a timestamp column or from a string column where it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp")))
or if the column is a string with other format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'")))
It can be easily done with unix_timestamp function in spark SQL like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the function unix_timestamp and cast it into any datatype.
Example:
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd HH:mm:ss").cast(LongType).as("epoch_seconds"))