How can I convert a IntegerType and LongType to date (yy-mm-dd)?
I m using spark/scala
here the line:
StructField("date_test", IntegerType,true), //Num YYMMDD10
to convert I used:
df.withColumn("custDOB",to_date(concat(col("custDOB"))))
Result: null
Related
In a spark dataframe, I will like to convert date column, "Date" which is in string format (eg. 20220124) to 2022-01-24 and then to date format using python.
df_new= df.withColumn('Date',to_date(df.Date, 'yyyy-MM-dd'))
You can do it with to_date function which gets the input col and format of your date.
from pyspark.sql import functions as F
df.withColumn('date', F.to_date('date', 'yyyyMMdd'))
Very Simple question - Need to convert timestamp column in spark dataframe to java.time.Instant format
Here you can convert to java.time.instant:
val time1 = spark
.sql("...")
.as[java.sql.Timestamp]
.first()
.toInstant
I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type but I can't find a good solution. I am trying this:
val pr_date = readeve.withColumn("Date", when(to_date(col("Date"),"dd-MMM-yyyy hh:mm:ss").isNotNull,
to_date(col("Date"),"dd/MM/yyyy hh:mm")))
pr_date.show(25)
And I get the entire Date column as null values:
I am trying with this function:
def to_date_(col: Column,
formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date",to_date_(readeve.col(("Date")))).show(125)
And in the first type of date i get nulls too:
What am I doing wrong? (new with Scala Spark)
Scala version: 2.11.7
Spark version: 2.4.3
Try code below? Note that 17 is HH, not hh. Also try to_timestamp instead of to_date because you want to keep the time.
val pr_date = readeve.withColumn(
"Date",
coalesce(
date_format(to_timestamp(col("Date"),"dd-MMM-yyyy HH:mm:ss"),"dd/MM/yyyy HH:mm"),
date_format(to_timestamp(col("Date"),"dd/MM/yyyy HH:mm"),"dd/MM/yyyy HH:mm")
)
)
Please note that I am not asking for unix_timestamp or timestamp or datetime data type I am asking for time data type, is it possible in pyspark or scala?
Lets get in details,
I have a dataframe like this with column Time string type
+--------+
| Time|
+--------+
|10:41:35|
|12:41:35|
|01:41:35|
|13:00:35|
+--------+
I want to convert it in time data type because in my SQL database this column is time data type, so I am trying to insert my data with spark connector applying Bulk Copy
So for bulk copy my both data-frame and DB table schema must be same, that's why I need to convert my Timecolumn into time data type.
Appreciate Any suggestion or help. Thanks in advance.
The following was run in the PySpark shell, the datetime module does allow time format
>>> t = datetime.datetime.strptime('10:41:35', '%H:%M:%S').time()
>>> type(t)
<class 'datetime.time'>
When the above function is to be applied on the dataframe using the map, it fails as the PySpark doesn't have a datatype time and it's unable to infer it.
>>> df2.select("val11").rdd.map(lambda x: datetime.datetime.strptime(str(x[0]), '%H:%M:%S').time()).toDF()
TypeError: Can not infer schema for type: <class 'datetime.time'>
The pyspark.sql.types module for now only supports the below datatypes
NullType
StringType
BinaryType
BooleanType
DateType
TimestampType
DecimalType
DoubleType
FloatType
ByteType
IntegerType
LongType
ShortType
ArrayType
MapType
StructField
StructType
Try it
df.withColumn('time', F.from_unixtime(F.unix_timestamp(F.col('time'), 'HH:mm:ss'), 'HH:mm:ss'))
I am facing an issue when i am trying to find the number of months between two dates using 'months_between'function. when my input date format is 'dd/mm/yyyy' or any other date format then the function is returning the correct output. however when i am passing the input date format as yyyymmdd then i am getting the below error.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below,
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
getting expected output,
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
when i am passing the below data, facing the above said error.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use to_date function to convert those numbers to DateTimes.
import org.apache.spark.sql.functions._
val df = spark.read
.option("header", "true")
.option("dateFormat", "yyyyMMdd")
.option("inferSchema", "true")
.csv("MyFile.csv")
val dfWithDates = df
.withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd"))
.withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
.withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))