How to convert string to time datatype in pyspark or scala? - scala

Please note that I am not asking about unix_timestamp, timestamp, or datetime data types; I am asking about the time data type. Is that possible in PySpark or Scala?
Let's get into the details.
I have a dataframe like this, with a column Time of string type:
+--------+
| Time|
+--------+
|10:41:35|
|12:41:35|
|01:41:35|
|13:00:35|
+--------+
I want to convert it to the time data type because in my SQL database this column is of type time, and I am trying to insert my data with the Spark connector using Bulk Copy.
For bulk copy, both my dataframe and the DB table schema must be the same, which is why I need to convert my Time column into the time data type.
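For reference, a rough sketch of the kind of write I am doing, simplified here to the plain JDBC writer rather than the bulk-copy path; the URL, table name, and credentials are just placeholders:
// Simplified placeholder sketch of the write; not my actual connector configuration.
df.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb") // placeholder URL
  .option("dbtable", "dbo.my_table")                                 // placeholder table
  .option("user", "username")
  .option("password", "password")
  .mode("append")
  .save()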
Any suggestion or help is appreciated. Thanks in advance.

The following was run in the PySpark shell; Python's datetime module does support a time type:
>>> t = datetime.datetime.strptime('10:41:35', '%H:%M:%S').time()
>>> type(t)
<class 'datetime.time'>
When the above function is applied to the dataframe using map, it fails because PySpark does not have a time datatype and is unable to infer a schema for it:
>>> df2.select("val11").rdd.map(lambda x: datetime.datetime.strptime(str(x[0]), '%H:%M:%S').time()).toDF()
TypeError: Can not infer schema for type: <class 'datetime.time'>
The pyspark.sql.types module currently supports only the following datatypes:
NullType
StringType
BinaryType
BooleanType
DateType
TimestampType
DecimalType
DoubleType
FloatType
ByteType
IntegerType
LongType
ShortType
ArrayType
MapType
StructField
StructType

Try this:
df.withColumn('time', F.from_unixtime(F.unix_timestamp(F.col('time'), 'HH:mm:ss'), 'HH:mm:ss'))
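If you need the same thing from Scala, a minimal equivalent sketch; note that the result is still a string formatted as HH:mm:ss, since Spark has no time datatype, and the database side has to map it to the time column:
import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}
// Re-parse and re-format the string column; the column type stays StringType.
val withTime = df.withColumn("Time",
  from_unixtime(unix_timestamp(col("Time"), "HH:mm:ss"), "HH:mm:ss"))
withTime.printSchema()  // Time is still a string column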

Related

How to format datatype to TimestampType in spark DataFrame- Scala

I'm trying to cast the column to TimestampType, where the value is in the format "11/14/2022 4:48:24 PM". However, when I display the results I see the values as null.
Here is the sample code that I'm using to cast the timestamp field:
val messages = df.withColumn("Offset", $"Offset".cast(LongType))
.withColumn("Time(readable)", $"EnqueuedTimeUtc".cast(TimestampType))
.withColumn("Body", $"Body".cast(StringType))
.select("Offset", "Time(readable)", "Body")
display(messages)
Is there any other way I can try to avoid the null values?
Instead of casting to TimestampType, you can use the to_timestamp function and provide the time format explicitly, like so:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
val time_df = Seq((62536, "11/14/2022 4:48:24 PM"), (62537, "12/14/2022 4:48:24 PM")).toDF("Offset", "Time")
val messages = time_df
.withColumn("Offset", $"Offset".cast(LongType))
.withColumn("Time(readable)", to_timestamp($"Time", "MM/dd/yyyy h:mm:ss a"))
.select("Offset", "Time(readable)")
messages.show(false)
+------+-------------------+
|Offset|Time(readable) |
+------+-------------------+
|62536 |2022-11-14 16:48:24|
|62537 |2022-12-14 16:48:24|
+------+-------------------+
messages: org.apache.spark.sql.DataFrame = [Offset: bigint, Time(readable): timestamp]
One thing to remember is that you will have to set one Spark configuration to allow the legacy time parser policy:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
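If you control session creation, the same setting can also be applied when the SparkSession is built, roughly like this (the app name is just a placeholder):
import org.apache.spark.sql.SparkSession
// Set the legacy time parser policy at session creation instead of at runtime.
val spark = SparkSession.builder()
  .appName("timestamp-parsing")  // placeholder app name
  .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
  .getOrCreate()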

Spark Scala Dataframe: How to handle money datatype in PostgreSQL table?

I need to save a dataframe into a PostgreSQL table, which has some fields with the money datatype.
I tried to cast the data to DoubleType before storing, which does not seem to work. The reported error is as follows:
column "cost" is of type money but expression is of type double precision
Which datatype should I cast to in order to store into a PostgreSQL table with the money datatype? Thanks!
import spark.implicits._  // needed for the 'cost symbol-to-Column syntax
// Variant 1: cast using a type string
val df1 = sourceDF.withColumn("cost", 'cost.cast("decimal(25,10)"))
// Variant 2: cast using a DecimalType instance
import org.apache.spark.sql.types.DecimalType
val df2 = sourceDF.withColumn("cost", 'cost.cast(new DecimalType(25, 10)))
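For completeness, a rough sketch of how the write could look after the cast, using the plain JDBC writer; the connection options below are placeholders, and whether PostgreSQL accepts the numeric value into the money column depends on the table and driver setup:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
// Cast the money column to a decimal, then write over JDBC.
// The URL, table, and credentials below are placeholders.
val out = sourceDF.withColumn("cost", col("cost").cast(DecimalType(25, 10)))
out.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb") // placeholder URL
  .option("dbtable", "public.costs")                      // placeholder table
  .option("user", "username")
  .option("password", "password")
  .mode("append")
  .save()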

Filter Scala dataframe by column of arrays

My Scala dataframe has a column with the datatype array(element: String). I want to display the rows of the dataframe that have the word "hello" in that column.
I have this:
display(df.filter($"my_column".contains("hello")))
I get an error because of a data type mismatch. It says that argument 1 requires string type, however 'my_column' is of array<string> type.
You can use the array_contains function:
import org.apache.spark.sql.functions._
df.filter(array_contains(df.col("my_column"), "hello")).show
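A small self-contained sketch, assuming a SparkSession named spark is in scope:
import org.apache.spark.sql.functions.array_contains
import spark.implicits._
// Two rows: only the first array contains "hello".
val sample = Seq(Seq("hello", "world"), Seq("foo", "bar")).toDF("my_column")
// Keeps only the row whose array contains "hello".
sample.filter(array_contains($"my_column", "hello")).show()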

how to do BETWEEN condition in spark 1.6 version

I have tried the between condition in Spark 1.6 but got an error:
between is not a member of string
df.filter($"date".between("2015-07-05", "2015-09-02"))
Either your df("date") is of string type, or your date column is not being resolved as a Column.
I have replicated your code and it does work on a column sent_at, which is a java.sql.Timestamp:
val test= bigDF.filter($"sent_at".between("2015-07-05", "2015-09-02"))
Ensure that your date column is a valid dataframe column, try df("date") or col("date"), and that it is stored as a date/time type, e.g.:
case class Schema(uuid: String, sent_at: java.sql.Timestamp)
val df1 = df.as[Schema]
Run df.printSchema() to verify the type of the date column.
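For example, a minimal sketch that resolves the column with col() and, if it is stored as a string, converts it with to_date (available in Spark 1.6) before comparing:
import org.apache.spark.sql.functions.{col, to_date}
// Resolve the column explicitly and convert the string to a date before filtering.
val filtered = df.filter(to_date(col("date")).between("2015-07-05", "2015-09-02"))
filtered.show()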

Print out types of data frame columns in Spark

I tried using VectorAssembler on my Spark Data Frame and it complained that it didn't support the StringType type. My Data Frame has 2126 columns.
What's the programmatic way to print out all the column types?
df.printSchema() will print the dataframe schema in an easy-to-follow format.
Try:
>>> for name, dtype in df.dtypes:
... print(name, dtype)
or
>>> df.schema
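The Scala equivalent, since dtypes is also available there and returns an Array of (name, type) pairs:
// Print each column name alongside its type name.
df.dtypes.foreach { case (name, dtype) => println(s"$name: $dtype") }
// Or walk the schema fields directly:
df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))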