pyspark to_timestamp() handling format of milliseconds SSS - pyspark

I have inconsistently formatted data and am using the function below:
to_timestamp("col","yyyy-MM-dd'T'hh:mm:ss.SSS'Z'")
Data:
time | OUTPUT | IDEAL
2022-06-16T07:01:25.346Z | 2022-06-16T07:01:25.346+0000 | 2022-06-16T07:01:25.346+0000
2022-06-16T06:54:21.51Z | 2022-06-16T06:54:21.051+0000 | 2022-06-16T06:54:21.510+0000
2022-06-16T06:54:21.5Z | 2022-06-16T06:54:21.005+0000 | 2022-06-16T06:54:21.500+0000
So the millisecond part of my data appears as S, SS, or SSS. How can I normalise it to SSS correctly? Here, .51 means 510 milliseconds, not 051.
Using spark version : 3.2.1
Code :
import pyspark.sql.functions as F
test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])
timeFmt = "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"
test = test.withColumn("timing2", (F.to_timestamp(F.col('timing1'),format=timeFmt)))
test.select("timing1","timing2").show(truncate=False)
Output:

I also use v3.2.1, and it works for me if you simply don't pass a format string. The data is already in a format that Spark parses by default:
from pyspark.sql import functions as F
test = spark.createDataFrame([(1,'2022-06-16T07:01:25.346Z'),(2,'2022-06-16T06:54:21.51Z'),(3,'2022-06-16T06:54:21.5Z')],['no','timing1'])
new_df = test.withColumn('timing1_ts', F.to_timestamp('timing1'))
new_df.show(truncate=False)
new_df.dtypes
+---+------------------------+-----------------------+
|no |timing1 |timing1_ts |
+---+------------------------+-----------------------+
|1 |2022-06-16T07:01:25.346Z|2022-06-16 07:01:25.346|
|2 |2022-06-16T06:54:21.51Z |2022-06-16 06:54:21.51 |
|3 |2022-06-16T06:54:21.5Z |2022-06-16 06:54:21.5 |
+---+------------------------+-----------------------+
Out[9]: [('no', 'bigint'), ('timing1', 'string'), ('timing1_ts', 'timestamp')]

I had this setting enabled:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
After resetting it, the parsing works as expected.
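If the LEGACY parser policy cannot be switched off, one possible workaround (an assumption, not something tested in this thread) is to right-pad the fractional part to three digits before parsing. A plain-Python sketch of that padding, where `pad_millis` is a hypothetical helper name:

```python
import re
from datetime import datetime


def pad_millis(ts):
    """Right-pad a 1-3 digit fraction in an ISO-like '...Z' string to 3 digits."""
    match = re.fullmatch(r"(.*\.)(\d{1,3})Z", ts)
    if match is None:
        return ts  # no fraction found: leave the value untouched
    prefix, fraction = match.groups()
    return f"{prefix}{fraction.ljust(3, '0')}Z"


samples = [
    "2022-06-16T07:01:25.346Z",
    "2022-06-16T06:54:21.51Z",
    "2022-06-16T06:54:21.5Z",
]
padded = [pad_millis(s) for s in samples]

# Python's %f directive also right-pads, so .51 is read as 510 ms.
parsed = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%fZ") for s in samples]
millis = [p.microsecond // 1000 for p in parsed]
```

After padding, every value has a three-digit fraction, so a fixed SSS pattern reads it with the intended meaning.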

Related

How to calculate standard deviation over a range of dates when there are dates missing in pyspark 2.2.0

I have a pyspark df in which I use a combination of window functions and a UDF to calculate the standard deviation over historical business dates. The challenge is that my df has no rows for dates on which there was no transaction. How do I calculate the std dev including these missing dates without adding them as extra rows, so that the df does not grow too large for memory?
Sample Table & Current output
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |2886.751|
Current Code
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.types import IntegerType
windowSpec = Window.partitionBy("ID").orderBy("date")
# UDF to calculate the difference in business days between two dates
workdaysUDF = F.udf(lambda date1, date2: int(np.busday_count(date2, date1)) if (date1 is not None and date2 is not None) else None, IntegerType())
# Column holding the business-day difference from the first date in the partition
df = df.withColumn("date_dif", workdaysUDF(F.col('Date'), F.first(F.col('Date')).over(windowSpec)))
# Rolling window covering the current row and the preceding `days` business days
windowval = lambda days: Window.partitionBy('id').orderBy('date_dif').rangeBetween(-days, 0)
df = df.withColumn("std_dev", F.stddev("amount").over(windowval(6)))\
.drop("date_dif")
Desired output, where the missing dates between 24 and 29 March are treated as having an amount of 0:
| ID | Date | Amount | Std_Dev|
|----|----------|--------|--------|
|1 |2021-03-24| 10000 | |
|1 |2021-03-26| 5000 | |
|1 |2021-03-29| 10000 |4915.96 |
Please note that I am showing the std dev for a single date only for illustration; there would be a value on every row, since I am using a rolling window function.
Any help would be greatly appreciated.
PS: Pyspark version is 2.2.0 at enterprise so I do not have flexibility to change the version.
Thanks,
VSG
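
The desired 4915.96 matches the sample standard deviation over the six calendar days from 24 to 29 March, with the three missing days counted as 0. Zeros add nothing to the sum or to the sum of squares, so only the day count has to change; the rows themselves never need to be added. A plain-Python sketch of that arithmetic follows. The helper name is hypothetical, and wiring it into a Spark window through sum aggregates and a known window length is an assumption, not something from the post:

```python
import math
import statistics


def stddev_with_implicit_zeros(amounts, n_days):
    """Sample std dev over n_days observations, treating days without a row as 0.

    amounts: the values that actually exist inside the window
    n_days:  total number of days the window spans (present + missing)
    """
    if n_days < 2:
        return None  # sample std dev is undefined for fewer than 2 points
    total = sum(amounts)
    total_sq = sum(a * a for a in amounts)
    variance = (total_sq - total * total / n_days) / (n_days - 1)
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding


# Window 2021-03-24 .. 2021-03-29: 6 calendar days, 3 of which have rows.
result = stddev_with_implicit_zeros([10000, 5000, 10000], n_days=6)

# Same figure computed the expensive way, with the zero rows materialised.
explicit = statistics.stdev([10000, 0, 5000, 0, 0, 10000])

# With n_days equal to the number of rows, this reproduces the current output.
current = stddev_with_implicit_zeros([10000, 5000, 10000], n_days=3)
```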

complex logic on pyspark dataframe including previous row existing value as well as previous row value generated on the fly

I have to apply some logic to a Spark DataFrame or RDD (preferably a DataFrame) that requires generating two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is representation of problem statement in tabular format. A and B columns are available in dataframe. C and D columns are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input DataFrame to one partition, convert it to an RDD, and then pass a Python function containing all the calculation logic to the mapPartitions API.
However, this approach puts the whole load on a single executor.
Looking at it mathematically, C2 = D1-C1 where D1 = C1-B1, so D1-C1 becomes C1-B1-C1 => -B1. In other words, every C after the first row is just the negated B of the previous row.
In pyspark, the lag function accepts a parameter called default, which simplifies your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1,100),(2,200),(3,300),(4,400),(5,500)],['a','b'])
w=Window.orderBy('a')
df_lag =df.withColumn('c',F.lag((F.col('b')*-1),default=1000).over(w))
df_final = df_lag.withColumn('d',F.col('c')-F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    .withColumn("C",
                F.when(F.lag("C").over(w).isNotNull(),
                       F.lag("D").over(w) - F.lag("C").over(w))
                .otherwise(F.col("C")))
    .withColumn("D", F.col("C") - F.col("B"))
)
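
Both answers rely on the row-by-row recurrence collapsing to C_n = -B_(n-1). A small plain-Python check of that claim on the question's sample data; the variable names are illustrative only:

```python
b = [100, 200, 300, 400, 500]
default_c = 1000

# Sequential version: each C depends on the previous row's C and D.
c_seq, d_seq = [], []
for i, b_i in enumerate(b):
    c_i = default_c if i == 0 else d_seq[i - 1] - c_seq[i - 1]
    c_seq.append(c_i)
    d_seq.append(c_i - b_i)

# Closed form: C is the default on the first row, then the negated previous B.
c_closed = [default_c] + [-x for x in b[:-1]]
d_closed = [c - x for c, x in zip(c_closed, b)]
```

Because the closed form needs only the previous row's B, a single lag over an ordered window is enough and no sequential pass over one partition is required. This shortcut is specific to subtraction; a different operation may not collapse the same way.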

how to update all the values of a column in a dataFrame

I have a DataFrame with an unformatted Date column:
+--------+-----------+--------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+--------+
| AAA|bbbbbbbbbbb|13190326|
| AAA|bbbbbbbbbbb|10190309|
| AAA|bbbbbbbbbbb|36190908|
| AAA|bbbbbbbbbbb|07190214|
| AAA|bbbbbbbbbbb|13190328|
| AAA|bbbbbbbbbbb|23190608|
| AAA|bbbbbbbbbbb|13190330|
| AAA|bbbbbbbbbbb|26190630|
+--------+-----------+--------+
The Date column is formatted as wwyyMMdd (week, year, month, day), and I want to convert it to yyyy/MM/dd. I have a method, format, that does this for a single value.
So my question is: how can I map all the values of the Date column to the needed format? Here is the output that I want:
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
In Spark 2.4.3 you can use unix_timestamp to convert the data to the expected output:
scala> var df2 =spark.createDataFrame(Seq(("AAA","bbbbbbbbbbb","13190326"),("AAA","bbbbbbbbbbb","10190309"),("AAA","bbbbbbbbbbb","36190908"),("AAA","bbbbbbbbbbb","07190214"),("AAA","bbbbbbbbbbb","13190328"),("AAA","bbbbbbbbbbb","23190608"),("AAA","bbbbbbbbbbb","13190330"),("AAA","bbbbbbbbbbb","26190630"))).toDF("CDOPEINT","bbbbbbbbbb","Date")
scala> df2.withColumn("Date",from_unixtime(unix_timestamp(substring(col("Date"),3,7),"yyMMdd"),"yyyy/MM/dd")).show
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb| Date|
+--------+-----------+----------+
| AAA|bbbbbbbbbbb|2019/03/26|
| AAA|bbbbbbbbbbb|2019/03/09|
| AAA|bbbbbbbbbbb|2019/09/08|
| AAA|bbbbbbbbbbb|2019/02/14|
| AAA|bbbbbbbbbbb|2019/03/28|
| AAA|bbbbbbbbbbb|2019/06/08|
| AAA|bbbbbbbbbbb|2019/03/30|
| AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
Let me know if you have any questions about this.
If the dates are all from 2000 onwards and the Date column in your original DataFrame is of Integer type, you could try something like this:
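val getDate = (date: Int) => {
  // Left-pad to 8 digits so a leading zero in the week (e.g. 07190214) is not lost
  val dateParts = f"$date%08d".drop(2).sliding(2, 2)
  dateParts.zipWithIndex.map {
    // Prefix the two-digit year with "20"; month and day are kept as they are
    case (value, index) => if (index == 0) "20" + value else value
  }.mkString("/")
}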
Then create a UDF which calls this function
val updateDateUdf = udf(getDate)
If originalDF is the original Dataframe that you have, you could then change the dataframe like this
val updatedDF = originalDF.withColumn("Date",updateDateUdf(col("Date")))
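
The conversion itself is easy to sanity-check outside Spark. A minimal plain-Python sketch, assuming the first two characters are always the week and can be discarded; `wwyymmdd_to_date` is a hypothetical name, not the asker's format method:

```python
from datetime import datetime


def wwyymmdd_to_date(raw):
    """Convert 'wwyyMMdd' (week, year, month, day) to 'yyyy/MM/dd'."""
    text = str(raw).zfill(8)  # restore a leading zero lost by integer storage
    parsed = datetime.strptime(text[2:], "%y%m%d")  # skip the two-digit week
    return parsed.strftime("%Y/%m/%d")


converted = [wwyymmdd_to_date(v) for v in ["13190326", "36190908", "07190214"]]
from_int = wwyymmdd_to_date(7190214)  # integer input without its leading zero
```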

Hourly Aggregation in Scala Spark

I'm looking for a way to aggregate my data by hour. As a first step, I want to truncate evtTime to the hour. My DataFrame looks like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:23:06.426|1 |
|X166815|2018-01-01 02:20:06.426|2 |
|X166816|2018-01-01 11:25:06.429|5 |
|X166817|2018-02-01 10:23:06.429|1 |
|X166818|2018-01-01 09:23:06.430|3 |
|X166819|2018-01-01 10:15:06.430|8 |
|X166820|2018-08-01 11:00:06.431|20 |
|X166821|2018-03-01 06:23:06.431|7 |
|X166822|2018-01-01 07:23:06.434|2 |
|X166823|2018-01-01 11:23:06.434|1 |
+-------+-----------------------+-----------+
My objective is to get something like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:00:00.000|1 |
|X166815|2018-01-01 02:00:00.000|2 |
|X166816|2018-01-01 11:00:00.000|5 |
|X166817|2018-02-01 10:00:00.000|1 |
|X166818|2018-01-01 09:00:00.000|3 |
|X166819|2018-01-01 10:00:00.000|8 |
|X166820|2018-08-01 11:00:00.000|20 |
|X166821|2018-03-01 06:00:00.000|7 |
|X166822|2018-01-01 07:00:00.000|2 |
|X166823|2018-01-01 11:00:00.000|1 |
+-------+-----------------------+-----------+
I'm using Scala 2.10.5 and Spark 1.6.3. My next objective is to group by reqUser and calculate the sum of event_count. I tried this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{round, sum}
val new_df = df
.groupBy($"reqUser",
Window(col("evtTime"), "1 hour"))
.agg(sum("event_count") as "aggregate_sum")
This is my error message :
Error:(81, 15) org.apache.spark.sql.expressions.Window.type does not take parameters
Window(col("time"), "1 hour"))
Any help? Thanks!
In Spark 1.x you can use the date formatting functions:
import org.apache.spark.sql.functions.date_format
val df = Seq("2018-01-01 10:15:06.430").toDF("evtTime")
df.select(date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00")).show
+------------------------------------------------------------+
|date_format(CAST(evtTime AS TIMESTAMP), yyyy-MM-dd HH:00:00)|
+------------------------------------------------------------+
| 2018-01-01 10:00:00|
+------------------------------------------------------------+
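
The truncate-then-sum idea can be illustrated in plain Python. This is only a sketch of the logic, not Spark code; three rows come from the question and the fourth is a hypothetical extra row added so that one bucket actually sums two events:

```python
from collections import defaultdict
from datetime import datetime

rows = [
    ("X166814", "2018-01-01 11:23:06.426", 1),
    ("X166816", "2018-01-01 11:25:06.429", 5),
    ("X166819", "2018-01-01 10:15:06.430", 8),
    ("X166814", "2018-01-01 11:59:00.000", 4),  # hypothetical, not in the question
]

totals = defaultdict(int)
for user, evt_time, count in rows:
    ts = datetime.strptime(evt_time, "%Y-%m-%d %H:%M:%S.%f")
    hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
    totals[(user, hour)] += count  # sum event_count per (user, hour) bucket
```

In Spark the equivalent would be to group by reqUser and the truncated evtTime column, then sum event_count.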

List to DataFrame in pyspark

Can someone tell me how to convert a list of string lists to a DataFrame in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning Spark, and my data looks like this:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now I want to create a DataFrame as follows:
| ID | words                     |
|----|---------------------------|
| 1  | ['apple','ball','ballon'] |
| 2  | ['cat','camel','james']   |
I also want to add an ID column, which is not part of the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
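The table in the question numbers rows from 1, while enumerate starts at 0 by default. A small sketch of building the (ID, words) pairs with start=1; the resulting list of tuples could then be handed to spark.createDataFrame together with the column names ["ID", "words"]:

```python
my_data = [
    ['apple', 'ball', 'ballon'],
    ['cat', 'camel', 'james'],
    ['none', 'focus', 'cake'],
]

# enumerate(..., start=1) yields (1, first_list), (2, second_list), ...
rows = [(row_id, words) for row_id, words in enumerate(my_data, start=1)]
```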
Try this:
data_array = []
# Pair each inner list with its index to form (ID, words) tuples
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this, the simplest approach:
from pyspark.sql import Row
# `utc` is not defined in this snippet; it is assumed to hold a timestamp value set earlier
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
# zipWithIndex yields (element, index) pairs, so the list column comes first
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+