Spark fails to convert String to TIMESTAMP - scala

I have a Hive table that contains a String column. Here is an example:
| DT |
| 2019-05-07 00:03:53.837000000 |
When I try to load the table into a Spark Scala DataFrame and convert the String to a timestamp, I only get null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
| null |
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
returns the expected result (a column with the String values), so Spark is reading the column correctly.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMP column.
What is the problem?

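The pattern ends in .SSS (three fraction digits), but your values carry nine, so the parse fails and returns null. Trim the extra trailing zeros first, then parse: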
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
The result is:
|date |new |
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
The schema:
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
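The same trimming idea can be tried outside Spark. The following is a rough plain-Python analogue, not Spark's to_timestamp: it uses the standard library's datetime, whose %f directive accepts at most six fraction digits, so the nine-digit value is rejected until it is trimmed. The helper name parse_trimmed is made up for this illustration, and it assumes the input always contains a fractional part.

```python
from datetime import datetime

PATTERN = "%Y-%m-%d %H:%M:%S.%f"  # %f accepts 1 to 6 fraction digits


def parse_trimmed(ts: str, keep_fraction_digits: int = 6) -> datetime:
    """Cut the fractional part down to a parseable length, then parse."""
    # Assumes ts contains a "." followed by the fraction digits.
    head, _, fraction = ts.partition(".")
    trimmed = f"{head}.{fraction[:keep_fraction_digits]}"
    return datetime.strptime(trimmed, PATTERN)


raw = "2019-05-07 00:03:53.837000000"

# Parsing the untrimmed value fails because three digits are left over.
try:
    datetime.strptime(raw, PATTERN)
    parsed_untrimmed = True
except ValueError:
    parsed_untrimmed = False

parsed = parse_trimmed(raw)
print(parsed_untrimmed, parsed)
```

In Python the mismatch raises an error, whereas the Spark query in the question returned null for the unparseable values.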

I think you should use the format yyyy-MM-dd HH:mm:ss.SSSSSSSSS for data such as 2019-05-07 00:03:53.837000000.


Convert yyyyMM to end of month date using PySpark

I have a column in a PySpark dataframe that holds dates as integers, e.g. 202203 (yyyyMM format). I want to convert each value to the end-of-month date, e.g. 2022-03-31. How do I achieve this?
First cast the column to string, then use to_date to get the date, and then apply last_day.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [{"x": 202203}]
df = spark.createDataFrame(data=data)
df = df.withColumn("date", F.last_day(F.to_date(F.col("x").cast("string"), "yyyyMM")))
df.show()
|     x|      date|
|202203|2022-03-31|
df.printSchema()
|-- x: long (nullable = true)
|-- date: date (nullable = true)
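To sanity-check what the to_date plus last_day combination should return, the same calculation can be sketched in plain Python. This is only an illustration with the standard library, not PySpark; the function name end_of_month is hypothetical.

```python
import calendar
from datetime import date


def end_of_month(yyyymm: int) -> date:
    """Turn an integer such as 202203 into the last day of that month."""
    year, month = divmod(yyyymm, 100)  # 202203 -> (2022, 3)
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


print(end_of_month(202203))
```

Leap years are handled by calendar.monthrange, so a February value resolves to the 28th or 29th as appropriate.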

Pyspark date_trunc without modifying actual value

Consider the below dataframe
I have to convert it to
After using the below code
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output
2022-02-21 11:23:00
My desired output is
Is there any way I can keep the data in the same format and only truncate the seconds?
You simply have a format issue. The output you see is the string representation of a timestamp. Format the column explicitly to control how it is displayed:
from pyspark.sql import functions as F, Window as W, types as T
df = df.withColumn(
    "time_updated",
    # Lowercase yyyy is the calendar year; uppercase YYYY means week-based year.
    F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)
|time |time_updated |
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)
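The distinction between a timestamp value and its string representation can be shown in plain Python as well. This sketch uses the standard library rather than PySpark, and it assumes the input strings look like 2022-02-21T11:23:59. The original sample data is not shown above, so that input value and the helper name truncate_to_minute are both hypothetical.

```python
from datetime import datetime

# Assumed layout of the input strings (ISO-like, with a literal "T").
FMT = "%Y-%m-%dT%H:%M:%S"


def truncate_to_minute(value: str) -> str:
    """Zero out the seconds and render the result in the input's own format."""
    parsed = datetime.strptime(value, FMT)
    return parsed.replace(second=0).strftime(FMT)


print(truncate_to_minute("2022-02-21T11:23:59"))
```

Truncation works on the parsed value, and the final formatting step is what decides how the result looks.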

Error in PySpark code related to df.withColumn

Can I use the following code:
df.withColumn("id", df["id"].cast("integer")).na.drop(subset=["id"])
If id is not a valid integer, it will be NULL and dropped in the subsequent step.
Without changing the type
df = spark.read.text("sample.txt")
valid = df.where(df["id"].cast("integer").isNotNull())
invalid = df.where(df["id"].cast("integer").isNull())
Here is what df.printSchema() prints:
|-- value: string (nullable = true)
| id| name | date |Yes/No|
| 01|abcdefghijklkm |010V2201| 9Ye|
| ab| abcdefghijklmm|010V2201| 9Ye|
This is a sample output.
Expected result:
Rows whose integer column holds null or invalid values should be removed. Can I use df.withColumn for this? If so, how?
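The valid/invalid split above relies on a cast that yields NULL for values that are not integers. The logic can be mimicked in plain Python to see which rows would survive. This is only a sketch: it uses Python's int(), which does not accept exactly the same strings as Spark's cast, and the helper to_int_or_none and the two sample rows are illustrative.

```python
def to_int_or_none(text):
    """Return the integer value of text, or None when it is not an integer."""
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        # ValueError: not a number; AttributeError: text is None.
        return None


rows = [
    {"id": "01", "name": "abcdefghijklkm"},
    {"id": "ab", "name": "abcdefghijklmm"},
]

# Keep the original string ids; only the filter looks at the converted value.
valid = [row for row in rows if to_int_or_none(row["id"]) is not None]
invalid = [row for row in rows if to_int_or_none(row["id"]) is None]

print([row["id"] for row in valid], [row["id"] for row in invalid])
```

As in the where(...) version, the id column keeps its original type because the conversion is used only as a filter condition.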

Spark Scala - difference between current date and max(day)

I need to calculate the difference between two dates:
Currentdate - max(day_id)
"Currentdate" is in simple date format, yyyyMMdd.
"day_id" is a string column whose values look like yyyy-MM-dd.
I have a dataframe in which the date (string format) is converted to date format (yyyy-MM-dd):
df1 = to_date(from_unixtime(unix_timestamp(day_id, 'yyyy-MM-dd')))
Normally, for finding the max(day_id), I would do
def daySince (columnName: String): Column = {
I cannot figure out how to compute the difference
Currentdate - max(day_id)
Given input dataframe with schema as
|id |day_id |
|-- id: string (nullable = true)
|-- day_id: string (nullable = true)
You can use the built-in functions current_date() and datediff() to meet your requirement:
import org.apache.spark.sql.functions._
df.withColumn("diff", datediff(current_date(), to_date(col("day_id"), "yyyy-MM-dd")))
which should give you
|id |day_id |diff|
|id1|2017-11-21|167 |
|id1|2018-01-21|106 |
|id2|2017-12-21|137 |
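The output above gives one difference per row. The question asks for the current date minus max(day_id), which needs the latest day first. The following plain-Python sketch shows that arithmetic with the standard library only. The function name days_since_latest is hypothetical, and the reference date 2018-05-07 is an assumption inferred from the diff values in the table above.

```python
from datetime import date, datetime


def days_since_latest(day_ids, today):
    """Days between `today` and the most recent yyyy-MM-dd string in day_ids."""
    latest = max(datetime.strptime(d, "%Y-%m-%d").date() for d in day_ids)
    return (today - latest).days


day_ids = ["2017-11-21", "2018-01-21", "2017-12-21"]

# In real use, date.today() would replace the fixed reference date.
print(days_since_latest(day_ids, date(2018, 5, 7)))  # 106
```

The result matches the smallest diff in the table, since the most recent day_id is the one closest to the current date.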

Converting pattern of date in spark dataframe

I have a column in spark dataframe of String datatype (with date in yyyy-MM-dd pattern)
I want to display the column value in MM/dd/yyyy pattern
My data is
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
| name| startDate| endDate|price|
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in MM/dd/yyyy pattern. All I am able to do is convert the column to DateType from String
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
| endDate|
|-- endDate: date (nullable = true)
The only reference I found does not solve the problem.
You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100))).toDF("name", "startDate", "endDate", "price")
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output :
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
Use date_format(date, format) from org.apache.spark.sql.functions (PySpark exposes the same function as pyspark.sql.functions.date_format):
val df2 = df.select(date_format($"endDate", "MM/dd/yyyy").alias("endDate"))
Suppose a DataFrame/Dataset has a string column that holds a date value and we need to change the date format.
For the question asked, the date format can be changed as follows:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd", so the explicit to_date call can be dropped and the expression rewritten as:
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we change the datatype of this column from string to Date.
We also tell to_date that the string column is in yyyy-MM-dd format, so it reads the column accordingly.
(ii) Next, we apply date_format to produce the format we need, which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents month and 'mm' represents minutes.
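The two-step idea (parse with the source pattern, then format with the target pattern) also carries over to plain Python, which can be handy for checking expected values. This sketch uses the standard library, not Spark, and the function name reformat_date is made up for the example. Python's directives differ from Spark's pattern letters: here %m is the month and %M is the minute.

```python
from datetime import datetime


def reformat_date(value, source="%Y-%m-%d", target="%m/%d/%Y"):
    """Parse `value` using the source pattern and render it in the target pattern."""
    return datetime.strptime(value, source).strftime(target)


for end_date in ["2000-01-01", "2001-01-13", "2001-01-01"]:
    print(reformat_date(end_date))
```

The three inputs are the endDate values from the question, and the printed values correspond to the MM/dd/yyyy output listed in the first answer.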