Spark - Convert all Timestamp columns to a certain date format - scala

I have a use case where I need to read data from Hive tables (Parquet), convert the Timestamp columns to a certain format and write the output as csv.
For the date formatting, I want to write a function that takes a StructField and returns either the original field name or date_format($"col_name", "dd-MMM-yyyy hh.mm.ss a") if the dataType is TimestampType. This is what I have come up with so far:
def main(args: Array[String]): Unit = {
  val hiveSchema = args(0)
  val hiveTable = args(1)
  val myDF = spark.table(s"${hiveSchema}.${hiveTable}")
  val colArray = myDF.schema.fields.map(getColumns)
  val colString = colArray.mkString(",")
  myDF.select(colString).write.format("csv").mode("overwrite").option("header", "true").save("/tmp/myDF")
}

def getColumns(structField: StructField): String = structField match {
  case structField if (structField.dataType.simpleString.equalsIgnoreCase("timestamp")) => s"""date_format($$"${structField.name}", "dd-MMM-yy hh.mm.ss a")"""
  case _ => structField.name
}
But I get the following error at runtime:
org.apache.spark.sql.AnalysisException: cannot resolve '`date_format($$"my_date_col", "dd-MMM-yy hh.mm.ss a")`' given input columns [mySchema.myTable.first_name, mySchema.myTable.my_date_col];
Is there a better way to do this?

Remove the double dollar sign and quotes. Also, no need to mkString; just use selectExpr:
def main(args: Array[String]): Unit = {
  val hiveSchema = args(0)
  val hiveTable = args(1)
  val myDF = spark.table(s"${hiveSchema}.${hiveTable}")
  val colArray = myDF.schema.fields.map(getColumns)
  myDF.selectExpr(colArray: _*).write.format("csv").mode("overwrite").option("header", "true").save("/tmp/myDF")
}

def getColumns(structField: StructField): String = structField match {
  case structField if (structField.dataType.simpleString.equalsIgnoreCase("timestamp")) => s"""date_format(${structField.name}, "dd-MMM-yy hh.mm.ss a") as ${structField.name}"""
  case _ => structField.name
}

Related

Date validation function scala

I have an RDD[(String, String)]. Each String contains a datetime stamp in the format "yyyy-MM-dd HH:mm:ss". I am converting it to epoch time using the function below, where dateFormats is a SimpleDateFormat("yyyy-MM-dd HH:mm:ss"):
def epochTime(stringOfTime: String): Long = dateFormats.parse(stringOfTime).getTime
I want to modify the function to drop a row if it contains a null, empty, or incorrectly formatted date, and to apply it to the RDD[(String, String)] so the string values get converted to epoch time as below.
Input
(2020-10-10 05:17:12,2015-04-10 09:18:20)
(2020-10-12 06:15:58,2015-04-10 09:17:42)
(2020-10-11 07:16:40,2015-04-10 09:17:49)
Output
(1602303432,1428653900)
(1602479758,1428653862)
(1602397000,1428653869)
You can use a filter to keep only the values that are not None. To do this, change the epochTime method so that it returns Option[Long] (def epochTime(stringOfTime: String): Option[Long]). Inside the method, check that the string is non-empty with .nonEmpty, then use Try to see whether dateFormats can parse it.
After these changes, filter the RDD to remove the None values, and then unwrap each value from Option to Long.
The code itself:
import java.text.SimpleDateFormat
import scala.util.{Success, Try}

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("Data Validation")
  .master("local[*]")
  .getOrCreate()

val data = Seq(("2020-10-10 05:17:12", "2015-04-10 09:18:20"), ("2020-10-12 06:15:58", "2015-04-10 09:17:42"),
  ("2020-10-11 07:16:40", "2015-04-10 09:17:49"), ("t", "t"))

val rdd: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)

def epochTime(stringOfTime: String): Option[Long] = {
  val dateFormats = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  if (stringOfTime.nonEmpty) {
    val parseDate = Try(dateFormats.parse(stringOfTime).getTime)
    parseDate match {
      case Success(value) => Some(value)
      case _ => None
    }
  } else {
    None
  }
}

rdd.map(pair => (epochTime(pair._1), epochTime(pair._2)))
  .filter(pair => pair._1.isDefined && pair._2.isDefined)
  .map(pair => (pair._1.get, pair._2.get))
  .foreach(pair => println(s"Results: (${pair._1}, ${pair._2})"))
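Equivalently, the filtering and unwrapping steps can be combined in one pass with RDD.collect over a partial function; a minimal sketch using the same rdd and epochTime as above:

// Keep only the pairs where both sides parsed, unwrapping the Options in the same step.
rdd.map { case (a, b) => (epochTime(a), epochTime(b)) }
  .collect { case (Some(x), Some(y)) => (x, y) }
  .foreach(pair => println(s"Results: (${pair._1}, ${pair._2})"))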

Spark scala UDF in DataFrames is not working

I have defined a function to convert epoch time to CET and am using that function, wrapped as a UDF, on a Spark DataFrame. It is throwing an error and not allowing me to use it. Please find my code below.
Function used to convert Epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
    d: Long,
    f: String = "dd/MM/yyyy HH:mm:ss.SSS",
    z: String = "CET",
    msPrecision: Int = 9
): String = {
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
The Spark DataFrame below uses the UDF defined above to convert epoch time to CET:
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I am getting an error when I run the code. Any idea how I am supposed to proceed in this scenario?
The Spark optimizer is telling you that your function is not a Function1; that means it is not a function that accepts one parameter. You have a function with four input parameters. Although you may think that in Scala you are allowed to call that function with only one parameter because you have default values for the other three, Catalyst does not work this way, so you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
    f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
  convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
and try to define the SimpleDateFormat as a static transient value out of the function.
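One way to read that last suggestion (only a sketch, with names of my own choosing, not the answer's code) is to keep the formatter in an object so it is created once per JVM and not serialized with the closure:

import java.text.SimpleDateFormat
import java.util.TimeZone

object Formats {
  // Built lazily once per JVM; @transient keeps it out of serialized closures.
  // Note: SimpleDateFormat is not thread-safe, so this is illustrative only.
  @transient lazy val cetFormat: SimpleDateFormat = {
    val sdf = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss.SSS")
    sdf.setTimeZone(TimeZone.getTimeZone("CET"))
    sdf
  }
}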
I found what the error was due to and resolved it. The problem is that when I wrap the Scala function as a UDF, it expects 4 parameters, but I was passing only one. I removed 3 parameters from the function and moved those values inside the function itself, since they are constant. Now in the Spark DataFrame I call the function with only 1 parameter and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(d: Long): String = {
  val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
  val z: String = "CET"
  val msPrecision: Int = 9
  val sdf = new SimpleDateFormat(f)
  sdf.setTimeZone(TimeZone.getTimeZone(z))
  val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
  val stringTime = sdf.format(date)
  if (f.contains(".S")) {
    val lng = d.toString.length
    val milliSecondsStr = d.toString.substring(lng - 9, lng)
    stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0, msPrecision)
  }
  else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))

Dynamic Query on Scala Spark

I'm basically trying to do something like this, but Spark doesn't recognize it.
val colsToLower: Array[String] = Array("col0", "col1", "col2")
val selectQry: String = colsToLower.map((x: String) => s"""lower(col(\"${x}\")).as(\"${x}\"), """).mkString.dropRight(2)
df
  .select(selectQry)
  .show(5)
Is there a way to do something like this in spark/scala?
If you need to lowercase the names of your columns, there is a simple way of doing it. Here is one example:
// Note: df must be declared as a var for the reassignment below to compile.
df.columns.foreach(c => {
  val newColumnName = c.toLowerCase
  df = df.withColumnRenamed(c, newColumnName)
})
This will lowercase the column names and update them in the Spark DataFrame.
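If df is declared as a val, the same renaming can be written as a fold instead of reassigning inside a loop; a small sketch of that variant:

// Same renaming without mutation: fold over the column names, threading the DataFrame through.
val renamedDF = df.columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, c.toLowerCase))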
I believe I found a way to build it:
def lowerTextColumns(cols: Array[String])(df: DataFrame): DataFrame = {
  val remainingCols: String = (df.columns diff cols).mkString(", ")
  val lowerCols: String = cols.map((x: String) => s"""lower(${x}) as ${x}, """).mkString.dropRight(2)
  val selectQry: String =
    if (remainingCols.nonEmpty) lowerCols + ", " + remainingCols
    else lowerCols

  df.selectExpr(selectQry.split(","): _*)
}
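The same selection can also be built without a string query by mapping the columns to Column expressions; a minimal sketch, reusing colsToLower from the question:

import org.apache.spark.sql.functions.{col, lower}

// Lowercase the values of the listed columns and keep the rest unchanged.
val loweredDF = df.select(df.columns.map { c =>
  if (colsToLower.contains(c)) lower(col(c)).as(c) else col(c)
}: _*)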

Spark: dynamic schema definition out of csv file

I am receiving schema information as a csv file, shown below. Assume I have around 100+ columns.
FIRSTNAME|VARCHAR2
LASTANME|VARCHAR2
MIDDLENAME|VARCHAR2
BIRTHDATE|DATE
ADULTS|NUMBER
ADDRESS|VARCHAR2
How can I generate a schema dynamically in Spark in this scenario?
You can use string splitting and pattern matching, assuming that the schema file is validly formatted csv. If you already have the schema loaded as a single comma-separated string, the following will work:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType, TimestampType}

def toSchema(str: String) = {
  val structFields = str.split(",").map { s =>
    val split = s.split("\\|")
    val name: String = split.head
    val typeStr = split.tail.head
    val varCharPattern = "varchar[0-9]+".r
    val datePattern = "date".r
    val numPattern = "number".r
    val t = typeStr.toLowerCase match {
      case varCharPattern() => StringType
      case datePattern() => TimestampType
      case numPattern() => IntegerType
      case _ => throw new Exception("unknown type string")
    }
    StructField(name, t)
  }
  StructType(structFields)
}
You can add more types easily enough by just adding new cases to the pattern matching statement.
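Hypothetical usage (the file paths, the delimiter option, and an existing SparkSession named spark are assumptions, not from the question): join the schema file's lines into the comma-separated string toSchema expects, then apply the result when reading the data.

import scala.io.Source

// Hypothetical path; the schema file has one NAME|TYPE entry per line as in the question.
val schemaStr = Source.fromFile("/path/to/schema_file.csv").getLines().mkString(",")
val customSchema = toSchema(schemaStr)

// Assumes an existing SparkSession named spark and a pipe-delimited data file.
val df = spark.read.option("delimiter", "|").schema(customSchema).csv("/path/to/data.csv")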

Add scoped variable per row iteration in Apache Spark

I'm reading multiple html files into a dataframe in Spark.
I'm converting elements of the html to columns in the dataframe using a custom udf:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call results in parsing the html string again, which is redundant.
Is there a way (without using lookup tables or such) that I can generate one parsed Document (Jsoup.parse(html)) from the "filecontent" column per row and make it available to all the withColumn calls on the dataframe?
Or shouldn't I even try using DataFrames and just use RDDs?
So the final answer was in fact quite simple: just map over the rows and create the object once there.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })
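If the empty-result handling of the Option-based docValue above still matters, the array UDF could map empty selections to null; a sketch under that assumption, not the poster's code:

def parseDocValueNullable(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val doc = Jsoup.parse(html)
    cssSelectorQueries.map { query =>
      val text = doc.select(query).text()
      // Empty matches become null so the resulting columns stay nullable, mirroring docValue's None.
      if (text == null || text.isEmpty) null else text
    }
  })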