Convert String to date in Spark SQL - Scala

I'm new to Scala. I have a dataframe in which I'm trying to convert one column from string to date, in two shapes, like below:
1) yyyyMMddHHmmss (20150610120256) -> yyyy-MM-dd HH:mm:ss (2015-06-10 12:02:56)
2) yyyyMMddHHmmss with the time part missing (20150611 followed by padding spaces) -> yyyy-MM-dd (2015-06-11)
I'm able to achieve the first case successfully, but there is a problem with the second case, where the time is missing; because of this I'm not able to convert it to a date. More details are below. Any help will be appreciated.
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: string (nullable = true)
df.show
Input
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+-----------------+
| F | 000544| 2017002| OP | 95032015062763298| 20150610120256 |
| F | 000544| 2017002| LD | 95032015062763261| 20150611 |
| F | 000544| 2017002| AK | 95037854336743246| 20150611012356 |
+-----+-------+---------+---------+-------------------+-----------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| null |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 00:00:00 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+

I'd choose a more precise data type (here TimestampType) and coalesce with different formats:
import org.apache.spark.sql.functions._
val df = Seq("20150610120256", "20150611").toDF("IN_DATE")
df.withColumn("IN_DATE", coalesce(
to_timestamp($"IN_DATE", "yyyyMMddHHmmss"),
to_timestamp($"IN_DATE", "yyyyMMdd"))).show
+-------------------+
| IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+

There are several options for building a date parser:
1) Use the built-in Spark SQL function to_date().
2) Create a user-defined function that does different date parsing based on the input format you like and returns a string; see the sketch below.
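A minimal sketch of such a UDF, assuming Spark 2.x; the function name parseInDate is only illustrative, and it assumes the shorter values differ from the longer ones only by trailing spaces. Returning a string for both shapes lets them share a single StringType column, which is what the expected output above requires since a column can only have one data type.

import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{col, udf}

// Illustrative UDF: pick a pattern from the trimmed length and return a
// normalized string, so both shapes fit into a single StringType column.
val parseInDate = udf { raw: String =>
  if (raw == null) null
  else {
    val s = raw.trim
    val (inFmt, outFmt) =
      if (s.length == 8) ("yyyyMMdd", "yyyy-MM-dd")
      else ("yyyyMMddHHmmss", "yyyy-MM-dd HH:mm:ss")
    new SimpleDateFormat(outFmt).format(new SimpleDateFormat(inFmt).parse(s))
  }
}

df.withColumn("IN_DATE", parseInDate(col("IN_DATE")))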

2015-06-11 is a spark.sql.types.DateType and 2015-06-10 12:02:56 is a spark.sql.types.TimestampType.
You can't have two data types in the same column; a schema should have only one data type for each column.
I would suggest you create two new columns and put the formats you desire in them, as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
.withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType))
this will give you dataframe as
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:56.0|
|F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
|F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:56.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
You can see that the data types are different:
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: string (nullable = true)
|-- IN_DATE: string (nullable = true)
|-- IN_DATE_DateOnly: date (nullable = true)
|-- IN_DATE_DateAndTime: timestamp (nullable = true)
I hope the answer is helpful
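If a single, mixed-format column really is required, the only type that can hold both shapes is string. A possible follow-up, not part of the original answer, assuming the two columns above were kept in a DataFrame named withBoth (an illustrative name):

import org.apache.spark.sql.functions.{coalesce, col, date_format}

// withBoth is assumed to hold the IN_DATE_DateOnly / IN_DATE_DateAndTime columns
// built above; format whichever one is non-null back into the requested string shape.
val result = withBoth.withColumn("IN_DATE",
  coalesce(
    date_format(col("IN_DATE_DateAndTime"), "yyyy-MM-dd HH:mm:ss"),
    date_format(col("IN_DATE_DateOnly"), "yyyy-MM-dd")))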

Try this query:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
df.withColumn("IN_DATE", when(length(regexp_replace(df("IN_DATE"), "\\s+", "")) === 8,
    to_date(unix_timestamp(regexp_replace(df("IN_DATE"), "\\s+", ""), "yyyyMMdd").cast(TimestampType)))
  .otherwise(unix_timestamp(df("IN_DATE"), "yyyyMMddHHmmss").cast(TimestampType)))

Related

Spark Scala - How to explode a column into multiple rows in spark scala

I have a dataframe like this
+----------+-------------+
|A |Devices |
+----------+-------------+
|house1 |[100,101,102]|
|house1 |[103,104] |
+----------+-------------+
And I want to explode the column 'Devices' into multiple rows. My final dataframe should look like this
+----------+--------+
|A |Devices |
+----------+--------+
|house1 |100 |
|house1 |101 |
|house1 |102 |
|house1 |103 |
|house1 |104 |
+----------+--------+
The schema of the table is
root
|-- A: String (nullable = true)
|-- Devices: array (nullable = true)
| |-- element: String (containsNull = true)
I tried doing this, but it is showing an error (UnresolvedAttribute on $"Devices"):
Df.withColumn("c", explode(split($"Devices","\\,")))
Df.select(col("A"),explode(col("devices"))
Using this I am able to find the required answer
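For reference, a self-contained sketch of that approach; the sample data is recreated here, and spark is assumed to be an active SparkSession (e.g. in spark-shell):

import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

// Rebuild the sample data from the question (illustrative values only)
val df = Seq(
  ("house1", Seq("100", "101", "102")),
  ("house1", Seq("103", "104"))
).toDF("A", "Devices")

// explode() produces one output row per array element
df.select(col("A"), explode(col("Devices")).alias("Devices")).show()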

PySpark: Perform logical AND on timestamp

I have a table consisting of main ID, subID and two timestamps (start-end).
+-------+---------------------+---------------------+---------------------+
|main_id|sub_id |start_timestamp |end_timestamp |
+-------+---------------------+---------------------+---------------------+
| 1| 1 | 2021/06/01 19:00 | 2021/06/01 19:10 |
| 1| 2 | 2021/06/01 19:01 | 2021/06/01 19:10 |
| 1| 3 | 2021/06/01 19:01 | 2021/06/01 19:05 |
| 1| 3 | 2021/06/01 19:07 | 2021/06/01 19:09 |
My goal is to pick all the records with the same main_id (different sub_ids) and perform a logical AND on the timestamp columns (the goal is to find periods where all the sub_ids were active).
+-------+---------------------------+---------------------------+
|main_id| global_start | global_stop |
+-------+---------------------------+---------------------------+
| 1| 2021/06/01 19:01 | 2021/06/01 19:05 |
| 1| 2021/06/01 19:07 | 2021/06/01 19:09 |
Thanks!
Partial solution
This kind of logic is probably really difficult to implement in pure Spark. Built-in functions are not enough for that.
Also, the expected output is 2 lines, but a simple group by main_id should output only one line, so the logic behind it is not trivial.
I would advise you to group your data by main_id and process the groups with Python, using a UDF.
from pyspark.sql import functions as F

# Aggregate your data by main_id
df2 = (
    df.groupby("main_id", "sub_id")
    .agg(
        F.collect_list(F.struct("start_timestamp", "end_timestamp")).alias("timestamps")
    )
    .groupby("main_id")
    .agg(F.collect_list(F.struct("sub_id", "timestamps")).alias("data"))
)
df2.show(truncate=False)
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|main_id|data |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[[3, [[2021/06/01 19:01, 2021/06/01 19:05], [2021/06/01 19:07, 2021/06/01 19:09]]], [1, [[2021/06/01 19:00, 2021/06/01 19:10]]], [2, [[2021/06/01 19:01, 2021/06/01 19:10]]]]|
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df2.printSchema()
root
|-- main_id: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sub_id: long (nullable = true)
| | |-- timestamps: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- start_timestamp: string (nullable = true)
| | | | |-- end_timestamp: string (nullable = true)
Once this step is achieved, you can process the column data with Python and perform your logical AND.
@F.udf  # add the required return type depending on what the function returns
def process(data):
    """
    data is a complex type (see df2.printSchema()):
    list(dict(
        "sub_id": value_of_sub_id,
        "timestamps": list(dict(
            "start_timestamp": value_of_start_timestamp,
            "end_timestamp": value_of_end_timestamp,
        ))
    ))
    """
    ...  # implement the "logical AND" here.
df2.select(
    "main_id",
    process(F.col("data"))
)
I hope this can help you or others to build a final solution.

Select key column from data as null if it doesn't exist in pyspark

My Dataframe (df) is structured as follows:
root
|-- val1: string (nullable = true)
|-- val2: string (nullable = true)
|-- val3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _type: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I have two sample records as follows:
+------+------+-----------------------------------+
| val1 | val2 | val3 |
+------+------+-----------------------------------+
| A | a | {k1: A1, k2: A2, k3: A3} |
+------+------+-----------------------------------+
| B | b | {k3: B3} |
+------+------+-----------------------------------+
I'm trying to select data from this as follows:
df.select(val1,val2,val3.k1,val3.k2,val3.k3)
And I want my output to look like:
+------+------+---------+---------+---------+
| val1 | val2 | k1      | k2      | k3      |
+------+------+---------+---------+---------+
| A    | a    | A1      | A2      | A3      |
+------+------+---------+---------+---------+
| B    | b    | NULL    | NULL    | B3      |
+------+------+---------+---------+---------+
But since I don't have the keys k1 and k2 for all records, the select statement throws an error. How do I solve this? I'm relatively new to pyspark.
I think you can use
df.selectExpr('val3.*')
Let me know if this works

Append new column to spark DF based on logic

Need to add a new column to the DF below based on other columns. Here is the DF schema:
scala> a.printSchema()
root
|-- ID: decimal(22,0) (nullable = true)
|-- NAME: string (nullable = true)
|-- AMOUNT: double (nullable = true)
|-- CODE: integer (nullable = true)
|-- NAME1: string (nullable = true)
|-- code1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- revised_code: string (nullable = true)
Now I want to add a column, say flag, as per the conditions below:
1 => if code == revised_code, then flag is P
2 => if code != revised_code, then flag is I
3 => if both code and revised_code are null, then no flag.
This is the UDF that I am trying, but it's giving I for both cases 1 and 3.
def tagsUdf =
udf((code: String, revised_code: String) =>
if (code == null && revised_code == null ) ""
else if (code == revised_code) "P" else "I")
tagsUdf(col("CODE"), col("revised_code"))
Can anyone please point out what mistake I am making?
I/P DF
+-------------+-------+------------+
|NAME | CODE|revised_code|
+-------------+-------+------------+
| amz | null| null|
| Watch | null| 5812|
| Watch | null| 5812|
| Watch | 5812| 5812|
| amz | null| null|
| amz | 9999 | 4352|
+-------------+-------+------------+
Schema:
root
|-- MERCHANT_NAME: string (nullable = true)
|-- CODE: integer (nullable = true)
|-- revised_mcc: string (nullable = true)
O/P DF
+-------------+-------+------------+-----+
|NAME         | CODE  |revised_code|flag |
+-------------+-------+------------+-----+
| amz         | null  | null       | null|
| Watch       | null  | 5812       | I   |
| Watch       | null  | 5812       | I   |
| Watch       | 5812  | 5812       | P   |
| amz         | null  | null       | null|
| amz         | 9999  | 4352       | I   |
+-------------+-------+------------+-----+
You don't need a udf function for that. A simple nested when built-in function should do the trick.
import org.apache.spark.sql.functions._
df.withColumn("CODE", col("CODE").cast("string"))
.withColumn("flag", when(((isnull(col("CODE")) || col("CODE") === "null") && (isnull(col("revised_code")) || col("revised_code") === "null")), "").otherwise(when(col("CODE") === col("revised_code"), "P").otherwise("I")))
.show(false)
Here, the CODE column is cast to string before applying the when logic so that both CODE and revised_code have the same data type when compared.
Note: CODE is an IntegerType column, so it can never contain the string "null" in any case.

Spark Data frame throwing error when trying to query nested column

I have a DataFrame with the schema below. This is basically an XML file which I have converted to a DataFrame for further processing. I'm trying to extract the _Date column, but it looks like some type mismatch is happening.
df1.printSchema
|-- PlayWeek: struct (nullable = true)
| |-- TicketSales: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- PlayDate: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- BoxOfficeDetail: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- VisualFormatCd: struct (nullable = true)
| | | | | | | | |-- Code: struct (nullable = true)
| | | | | | | | | |-- _SequenceId: long (nullable = true)
| | | | | | | | | |-- _VALUE: double (nullable = true)
| | | | | | | |-- _SessionTypeCd: string (nullable = true)
| | | | | | | |-- _TicketPrice: double (nullable = true)
| | | | | | | |-- _TicketQuantity: long (nullable = true)
| | | | | | | |-- _TicketTax: double (nullable = true)
| | | | | | | |-- _TicketTypeCd: string (nullable = true)
| | | | | |-- _Date: string (nullable = true)
| | | |-- _FilmId: long (nullable = true)
| | | |-- _Screen: long (nullable = true)
| | | |-- _TheatreId: long (nullable = true)
| |-- _BusinessEndDate: string (nullable = true)
| |-- _BusinessStartDate: string (nullable = true)
I need to extract the _Date column, but it's throwing the error below:
scala> df1.select(df1.col("PlayWeek.TicketSales.PlayDate._Date")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'PlayWeek.TicketSales.PlayDate[_Date]' due to data type mismatch: argument 2 requires integral type, however, '_Date' is of string type.;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
Any help would be appreciated.
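Not from the original thread, but a minimal sketch of one way to reach _Date, assuming the schema above (df1 is the DataFrame from the question): explode each array level before selecting the leaf field, since the dotted path PlayWeek.TicketSales.PlayDate._Date crosses two arrays and the analyzer appears to treat _Date as an array index.

import org.apache.spark.sql.functions.{col, explode}

// Explode the two array levels (TicketSales, then PlayDate), then select the leaf.
// Column names follow the schema shown above; adjust if your DataFrame differs.
val dates = df1
  .select(explode(col("PlayWeek.TicketSales")).alias("ts"))
  .select(explode(col("ts.PlayDate")).alias("pd"))
  .select(col("pd._Date").alias("Date"))

dates.show()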