Update a JSON column with a timestamp in PySpark

+----------------+
|application_name|
+----------------+
|{"application_name": "DIMENSIONS_USER",
"dq_test_name": "contra_cp_dimension_agentpresence_isBillable_should_be_set"
}
+----------------+
In this I need to update the column value with a date for all the rows. Can someone help? I am new to PySpark and was unable to find any working solution. The expected output is:
+----------------+
|application_name|
+----------------+
|{"application_name": "DIMENSIONS_USER",
"dq_test_name": "contra_cp_dimension_agentpresence_isBillable_should_be_set",
"date":01/001/2020
}
+----------------+

You could define a UDF that adds a new value to the JSON string.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import json


def update_app_data(data):
    app_data = json.loads(data)
    app_data["date"] = "01/001/2020"
    return json.dumps(app_data)


if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

    application_data = {
        "application_name": "DIMENSIONS_USER",
        "dq_test_name": "contra_cp_dimension_agentpresence_isBillable_should_be_set",
    }
    data = [
        {
            "application_name": json.dumps(application_data),
        }
    ]

    df = spark.createDataFrame(data=data)

    update_app_data_udf = F.udf(lambda x: update_app_data(x))
    df = df.withColumn("application_name", update_app_data_udf(F.col("application_name")))
Input dataframe looks like:
+---------------------------------------------------------------------------------------------------------------------+
|application_name |
+---------------------------------------------------------------------------------------------------------------------+
|{"application_name": "DIMENSIONS_USER", "dq_test_name": "contra_cp_dimension_agentpresence_isBillable_should_be_set"}|
+---------------------------------------------------------------------------------------------------------------------+
Output:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|application_name |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|{"application_name": "DIMENSIONS_USER", "dq_test_name": "contra_cp_dimension_agentpresence_isBillable_should_be_set", "date": "01/001/2020"}|
+--------------------------------------------------------------------------------------------------------------------------------------------+

Related

scala spark dataframe modify column with udf return value

I have a Spark dataframe which has a timestamp field and I want to convert it to the long datatype. I used a UDF and the standalone code works fine, but when I plug it into a generic logic where any timestamp needs to be converted, I am not able to get it working. The issue is: how can I assign the return value from the UDF back to the dataframe column?
Below is the code snippet
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test3").getOrCreate()
import org.apache.spark.sql.functions._

val sqlContext = spark.sqlContext
val df2 = sqlContext.jsonRDD(spark.sparkContext.parallelize(Array(
  """{"year":2012, "make": "Tesla", "model": "S", "comment": "No Comment", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":1997, "make": "Ford", "model": "E350", "comment": "Get one", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}""",
  """{"year":2015, "make": "Chevy", "model": "Volt", "comment": "", "blank": "","manufacture_ts":"2017-10-16 00:00:00"}"""
)))

val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

df2.withColumn("manufacture_ts", convertTimeStamp(df2("manufacture_ts"))).show
+-----+----------+-----+--------------+-----+----+
|blank|   comment| make|manufacture_ts|model|year|
+-----+----------+-----+--------------+-----+----+
|     |No Comment|Tesla| 1508126400000|    S|2012|
|     |   Get one| Ford| 1508126400000| E350|1997|
|     |          |Chevy| 1508126400000| Volt|2015|
+-----+----------+-----+--------------+-----+----+
Now I want to invoke this from a dataframe, to be called on all columns which are of type long.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object Test4 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("Test").getOrCreate()
  import spark.implicits._
  import scala.collection.JavaConversions._

  val long: Long = "1508299200000".toLong
  val data = Seq(Row("10000020_LUX_OTC", long, "2020-02-14"))
  val schema = List(
    StructField("rowkey", StringType, true),
    StructField("order_receipt_dt", LongType, true),
    StructField("maturity_dt", StringType, true)
  )
  val dataDF = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))

  val modifedDf2 = schema.foldLeft(dataDF) { case (newDF, StructField(name, dataType, flag, metadata)) =>
    newDF.withColumn(name, DataTypeUtil.transformLong(newDF, name, dataType.typeName))
  }
  modifedDf2.show
}
val convertTimeStamp = udf { (manTs: java.sql.Timestamp) =>
  manTs.getTime
}

def transformLong(dataFrame: DataFrame, name: String, fieldType: String): Column = {
  import org.apache.spark.sql.functions._
  fieldType.toLowerCase match {
    case "timestamp" => convertTimeStamp(dataFrame(name))
    case _           => dataFrame.col(name)
  }
}
Maybe your UDF crashed because the timestamp was null. You can do the following:
use unix_timestamp instead of a UDF, or make your UDF null-safe (a sketch follows below),
and only apply the conversion on fields which actually need to be converted.
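For the null-safe option, here is a minimal sketch that wraps the question's convertTimeStamp logic in Option, so a null timestamp becomes a SQL NULL instead of throwing:
import org.apache.spark.sql.functions.udf

// Option(...) turns a null java.sql.Timestamp into None, which Spark
// writes back as NULL instead of failing with a NullPointerException.
val convertTimeStampSafe = udf { (manTs: java.sql.Timestamp) =>
  Option(manTs).map(_.getTime)
}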
Given the data:
import java.sql.Timestamp
import java.time.LocalDateTime

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val df = Seq(
  (1L, Timestamp.valueOf(LocalDateTime.now()), Timestamp.valueOf(LocalDateTime.now()))
).toDF("id", "ts1", "ts2")
you can do:
val newDF = df.schema.fields.filter(_.dataType == TimestampType).map(_.name)
  .foldLeft(df)((df, field) => df.withColumn(field, unix_timestamp(col(field))))
newDF.show()
which gives:
+---+----------+----------+
| id| ts1| ts2|
+---+----------+----------+
| 1|1589109282|1589109282|
+---+----------+----------+

Pass Spark SQL function name as parameter in Scala

I am trying to pass a Spark SQL function name to my defined function in Scala.
I am trying to get same functionality as:
myDf.agg(max($"myColumn"))
my attempt doesn't work:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(myParameter($"myColumn"))
}
Obviously it shouldn't work, as I'm providing a string type. I am unable to find a way to make it work.
Is it even possible?
Edit:
I have to provide the SQL function name (and it can be any other aggregate function) as a parameter when calling my function.
myFunc(anyDf, max) or myFunc(anyDf, "max")
agg also takes a Map[String,String], which allows you to do what you want:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
  myDf.agg(Map("myColumn" -> myParameter))
}
example:
val df = Seq(1.0, 2.0, 3.0).toDF("myColumn")
myFunc(df, "avg")
  .show()
gives:
+-------------+
|avg(myColumn)|
+-------------+
| 2.0|
+-------------+
Try this:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, max}
import spark.implicits._

val df = Seq((1, 2, 12), (2, 1, 21), (1, 5, 10), (5, 3, 9), (2, 5, 4)).toDF("a", "b", "c")

def myFunc(df: DataFrame, f: Column): DataFrame = {
  df.agg(f)
}

myFunc(df, max(col("a"))).show
+------+
|max(a)|
+------+
| 5|
+------+
Hope it helps!

Spark Scala CSV Column names to Lower Case

Please find the code below and let me know how I can change the column names to lower case. I tried withColumnRenamed, but I would have to do it for each column and type every column name. I just want to do it for all columns at once, without mentioning each column name, as there are too many of them.
Scala Version: 2.11
Spark : 2.2
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
import com.datastax
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._

object dataframeset {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd1 = sc.cassandraTable("tdata", "map3")
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark1 = org.apache.spark.sql.SparkSession.builder().master("local")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .appName("Spark SQL basic example").getOrCreate()
    val df = spark1.read.format("csv").option("header", "true").option("inferschema", "true")
      .load("/Users/Desktop/del2.csv")
    import spark1.implicits._

    println("\nTop Records are:")
    df.show(1)

    val dfprev1 = df.select(col = "sno", "year", "StateAbbr")
    dfprev1.show(1)
  }
}
Required output:
|sno|year|stateabbr| statedesc|cityname|geographiclevel
All the column names should be in lower case.
Actual output:
Top Records are:
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
|sno|year|StateAbbr| StateDesc|CityName|GeographicLevel|DataSource| category|UniqueID| Measure|Data_Value_Unit|DataValueTypeID| Data_Value_Type|Data_Value|Low_Confidence_Limit|High_Confidence_Limit|Data_Value_Footnote_Symbol|Data_Value_Footnote|PopulationCount|GeoLocation|categoryID|MeasureId|cityFIPS|TractFIPS|Short_Question_Text|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
| 1|2014| US|United States| null| US| BRFSS|Prevention| 59|Current lack of h...| %| AgeAdjPrv|Age-adjusted prev...| 14.9| 14.6| 15.2| null| null| 308745538| null| PREVENT| ACCESS2| null| null| Health Insurance|
+---+----+---------+-------------+--------+---------------+----------+----------+--------+--------------------+---------------+---------------+--------------------+----------+--------------------+---------------------+--------------------------+-------------------+---------------+-----------+----------+---------+--------+---------+-------------------+
only showing top 1 row
+---+----+---------+
|sno|year|StateAbbr|
+---+----+---------+
| 1|2014| US|
+---+----+---------+
only showing top 1 row
Just use toDF:
df.toDF(df.columns map(_.toLowerCase): _*)
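For example, a minimal sketch (assuming a SparkSession named spark is in scope; the column names are made up for illustration):
import spark.implicits._

// Hypothetical dataframe with mixed-case column names
val sample = Seq((1, 2014, "US")).toDF("sno", "Year", "StateAbbr")

// Rename every column to its lower-case form in one go
val lowered = sample.toDF(sample.columns.map(_.toLowerCase): _*)
lowered.printSchema()  // columns are now sno, year, stateabbr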
Another way to achieve it is by using the foldLeft method.
val myDFcolNames = myDF.columns.toList
val rdoDenormDF = myDFcolNames.foldLeft(myDF)((myDF, c) =>
  myDF.withColumnRenamed(c.toString.split(",")(0), c.toString.toLowerCase()))

How do I convert a date to a Unix timestamp in milliseconds? [duplicate]

I am using Spark 2.1 with Scala.
How to convert a string column with milliseconds to a timestamp with milliseconds?
I tried the following code from the question Better way to convert a string field into timestamp in Spark
import org.apache.spark.sql.functions.unix_timestamp
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss.SSS").cast("timestamp")
tdf.withColumn("ts", tts).show(2, false)
But I get the result without milliseconds:
+---+-----------------------+---------------------+
|id |dts |ts |
+---+-----------------------+---------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.0|
|2 |#$#### |null |
+---+-----------------------+---------------------+
A UDF with SimpleDateFormat works. The idea is taken from Ram Ghadiyaram's link to a UDF-based approach.
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import scala.util.{Try, Success, Failure}
val getTimestamp: (String => Option[Timestamp]) = s => s match {
  case "" => None
  case _ => {
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss.SSS")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
  }
}
val getTimestampUDF = udf(getTimestamp)
val tdf = Seq((1L, "05/26/2016 01:01:01.601"), (2L, "#$####")).toDF("id", "dts")
val tts = getTimestampUDF($"dts")
tdf.withColumn("ts", tts).show(2, false)
with output:
+---+-----------------------+-----------------------+
|id |dts |ts |
+---+-----------------------+-----------------------+
|1 |05/26/2016 01:01:01.601|2016-05-26 01:01:01.601|
|2 |#$#### |null |
+---+-----------------------+-----------------------+
There is an easier way than writing a UDF. Just parse the millisecond part and add it to the Unix timestamp (the following code works with PySpark and should be very close to the Scala equivalent; a rough Scala translation is sketched after the result below):
timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"
df = df.withColumn('ux_t', unix_timestamp(df.t, format=timeFmt) + substring(df.t, -3, 3).cast('float')/1000)
Result:
'2017/03/05 14:02:41.865' is converted to 1488722561.865
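A rough Scala translation of the PySpark snippet above, as a sketch only (it assumes the same string column name t and the same timeFmt pattern):
import org.apache.spark.sql.functions.{col, substring, unix_timestamp}

val timeFmt = "yyyy/MM/dd HH:mm:ss.SSS"

// unix_timestamp truncates to whole seconds, so the last three characters
// of the string (the milliseconds) are parsed separately and added back.
val withMillis = df.withColumn(
  "ux_t",
  unix_timestamp(col("t"), timeFmt) + substring(col("t"), -3, 3).cast("float") / 1000
)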
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
dataFrame.withColumn(
    "time_stamp",
    dataFrame.col("milliseconds_in_string")
        .cast(DataTypes.LongType)
        .cast(DataTypes.TimestampType)
);
The code is in Java and is easy to convert to Scala; a possible Scala version is sketched below.
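A possible Scala version of the Java snippet, as a sketch (the column name milliseconds_in_string is taken from the example above; note that Spark's numeric-to-timestamp cast interprets the value as seconds since the epoch, so a millisecond value should be divided by 1000 first):
import org.apache.spark.sql.functions.col

// cast("timestamp") treats the number as seconds, hence the division by 1000
val withTimestamp = dataFrame.withColumn(
  "time_stamp",
  (col("milliseconds_in_string").cast("long") / 1000).cast("timestamp")
)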
