Date validation function Scala

I have an RDD[(String, String)]. Each String contains a datetime stamp in the format "yyyy-MM-dd HH:mm:ss". I am converting it to epoch time using the function below, where dateFormats is a SimpleDateFormat("yyyy-MM-dd HH:mm:ss"):
def epochTime (stringOfTime: String): Long = dateFormats.parse(stringOfTime).getTime
I want to modify the function so that a row is dropped if it contains a null, empty or badly formatted date, and I want to know how to apply it to the RDD[(String, String)] so that the string values get converted to epoch time as below.
Input
(2020-10-10 05:17:12,2015-04-10 09:18:20)
(2020-10-12 06:15:58,2015-04-10 09:17:42)
(2020-10-11 07:16:40,2015-04-10 09:17:49)
Output
(1602303432,1428653900)
(1602479758,1428653862)
(1602397000,1428653869)

You can use a filter to determine which values are not None. To do this, change the epochTime method so that it returns an Option[Long] (def epochTime(stringOfTime: String): Option[Long]). Inside the method, check that the string is not null or empty, then use Try to see whether dateFormats can parse it.
After these changes, filter the RDD to remove the None rows, and then unwrap each remaining value from Option[Long] to Long.
The code itself:
import java.text.SimpleDateFormat
import scala.util.{Success, Try}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("Data Validation")
  .master("local[*]")
  .getOrCreate()

val data = Seq(("2020-10-10 05:17:12", "2015-04-10 09:18:20"), ("2020-10-12 06:15:58", "2015-04-10 09:17:42"),
  ("2020-10-11 07:16:40", "2015-04-10 09:17:49"), ("t", "t"))

val rdd: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)

def epochTime(stringOfTime: String): Option[Long] = {
  val dateFormats = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  if (stringOfTime != null && stringOfTime.nonEmpty) {
    // Try absorbs the ParseException thrown for badly formatted dates.
    Try(dateFormats.parse(stringOfTime).getTime) match {
      case Success(value) => Some(value)
      case _              => None
    }
  } else {
    None
  }
}

rdd.map(pair => (epochTime(pair._1), epochTime(pair._2)))
  .filter(pair => pair._1.isDefined && pair._2.isDefined)
  .map(pair => (pair._1.get, pair._2.get))
  .foreach(pair => println(s"Results: (${pair._1}, ${pair._2})"))
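Note that getTime returns epoch milliseconds, while the sample output above is in seconds, so divide by 1000 if you need seconds. As a minimal alternative sketch, the None rows can also be dropped and the Options unwrapped in a single flatMap, avoiding the isDefined/get pair:
rdd.flatMap { case (first, second) =>
  // The for-comprehension yields Some((f, s)) only when both strings parse;
  // flatMap drops the None rows, so no filter/get is needed.
  for {
    f <- epochTime(first)
    s <- epochTime(second)
  } yield (f, s)
}.foreach { case (f, s) => println(s"Results: ($f, $s)") }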

Related

How to input and output a Seq of an object to a function in Scala

I want to parse a column to get split values using a Seq of an object:
case class RawData(rawId: String, rawData: String)

case class SplitData(
  rawId: String,
  rawData: String,
  split1: Option[Int],
  split2: Option[String],
  split3: Option[String],
  split4: Option[String]
)

def rawDataParser(unparsedRawData: Seq[RawData]): Seq[SplitData] = {
  unparsedRawData.map(rawData => {
    val split = rawData.rawData.split(", ")
    SplitData(
      rawId = rawData.rawId,
      rawData = rawData.rawData,
      split1 = Some(split(0).toInt),
      split2 = Some(split(1)),
      split3 = Some(split(2)),
      split4 = Some(split(3))
    )
  })
}
val rawDataDF = Seq[(String, String)](
  ("001", "Split1, Split2, Split3, Split4"),
  ("002", "Split1, Split2, Split3, Split4")
).toDF("rawId", "rawData")

val rawDataDS: Dataset[RawData] = rawDataDF.as[RawData]
I need to use the rawDataParser function to parse my rawData. However, the parameter to the function is of type Seq. I am not sure how I should convert rawDataDS into an input to the function to parse the raw data. Some guidance on solving this would be appreciated.
Each Dataset is divided into partitions. You can use mapPartitions with a mapping Iterator[T] => Iterator[U] to convert a Dataset[T] into a Dataset[U].
So you can simply pass your parser (called addressParser in the snippet below) as the argument to mapPartitions.
val rawAddressDataDS =
  spark.read
    .option("header", "true")
    .csv(csvFilePath)
    .as[AddressRawData]

val addressDataDS =
  rawAddressDataDS
    .map { rad =>
      AddressData(
        addressId = rad.addressId,
        address = rad.address,
        number = None,
        road = None,
        city = None,
        country = None
      )
    }
    .mapPartitions { unparsedAddresses =>
      addressParser(unparsedAddresses.toSeq).toIterator
    }
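The snippet above refers to AddressRawData, AddressData and addressParser from the answerer's own setup rather than the names in the question; a hypothetical sketch of what they might look like, mirroring the question's rawDataParser:
// Hypothetical definitions, not part of the original answer.
case class AddressRawData(addressId: String, address: String)

case class AddressData(
  addressId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Splits the free-form address field into its parts, like the question's parser does.
def addressParser(unparsed: Seq[AddressData]): Seq[AddressData] =
  unparsed.map { a =>
    val split = a.address.split(", ")
    a.copy(
      number  = Some(split(0).toInt),
      road    = Some(split(1)),
      city    = Some(split(2)),
      country = Some(split(3))
    )
  }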

Spark - Convert all Timestamp columns to a certain date format

I have a use case where I need to read data from Hive tables (Parquet), convert Timestamp columns to a certain format and write the output as csv.
For the date formatting, I want to write a function that takes a StructField and returns either the original field name, or date_format($"col_name", "dd-MMM-yyyy hh.mm.ss a") if the dataType is TimestampType. This is what I have come up with so far:
def main(args: Array[String]): Unit = {
  val hiveSchema = args(0)
  val hiveTable = args(1)
  val myDF = spark.table(s"${hiveSchema}.${hiveTable}")
  val colArray = myDF.schema.fields.map(getColumns)
  val colString = colArray.mkString(",")
  myDF.select(colString).write.format("csv").mode("overwrite").option("header", "true").save("/tmp/myDF")
}

def getColumns(structField: StructField): String = structField match {
  case structField if (structField.dataType.simpleString.equalsIgnoreCase("TimestampType")) => s"""date_format($$"${structField.name}", "dd-MMM-yy hh.mm.ss a")"""
  case _ => structField.name
}
But I get the following error at runtime
org.apache.spark.sql.AnalysisException: cannot resolve '`date_format($$"my_date_col", "dd-MMM-yy hh.mm.ss a")`' given input columns [mySchema.myTable.first_name, mySchema.myTable.my_date_col];
Is there a better way to do this?
Remove the double dollar sign and quotes. Also, no need to mkString; just use selectExpr:
def main(args: Array[String]): Unit = {
  val hiveSchema = args(0)
  val hiveTable = args(1)
  val myDF = spark.table(s"${hiveSchema}.${hiveTable}")
  val colArray = myDF.schema.fields.map(getColumns)
  myDF.selectExpr(colArray: _*).write.format("csv").mode("overwrite").option("header", "true").save("/tmp/myDF")
}

def getColumns(structField: StructField): String = structField match {
  case structField if (structField.dataType.simpleString.equalsIgnoreCase("TimestampType")) => s"""date_format(${structField.name}, "dd-MMM-yy hh.mm.ss a") as ${structField.name}"""
  case _ => structField.name
}

Date/Time formatting Scala

I'm trying to assert the date and time displayed on the page.
The problem is that it returns "2017-03-11T09:00" instead of "2017-03-11 09:00:00", and I'm confused why, as the pattern is yyyy-MM-dd HH:mm:ss.
Any ideas?
def getDate: String = {
  val timeStamp = find(xpath("//*[@id=\"content\"]/article/div/div/table/tbody/tr[5]/td/div/p[4]")).get.underlying.getText
  val stripDate: Array[String] = timeStamp.split("Timestamp:\n")
  stripDate(1)
}

def datePattern(date: String): LocalDateTime = {
  val pattern: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val result = LocalDateTime.parse(date, pattern)
  result
}

def checkDatePattern() = datePattern(getDate).toString shouldBe getDate
The DateTimeFormatter only gets used for the parse operation. It doesn't influence the result of toString. If you want to convert your LocalDateTime to a String in a certain format you have to call
date.format(pattern)
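A minimal sketch of the difference, using the same pattern:
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val parsed = LocalDateTime.parse("2017-03-11 09:00:00", pattern)

println(parsed.toString)        // 2017-03-11T09:00 (ISO-8601 default, seconds dropped when zero)
println(parsed.format(pattern)) // 2017-03-11 09:00:00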
I've managed to get the result I wanted by just deleting some parts of the code. As long as the date is displayed in the correct format the test passes, and if it's displayed in an incorrect format it fails, which is good enough for me. Thanks for your input. CASE CLOSED
def getDate: String = {
  val timeStamp = find(xpath("//*[@id=\"content\"]/article/div/div/table/tbody/tr[5]/td/div/p[4]")).get.underlying.getText
  val stripDate: Array[String] = timeStamp.split("Timestamp:\n")
  stripDate(1)
}

def datePattern(date: String): LocalDateTime = {
  val pattern: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  LocalDateTime.parse(date, pattern)
}

def checkDatePattern() = datePattern(getDate)
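If you ever do want to assert the displayed format explicitly, a hedged variant (assuming the same imports and ScalaTest matchers as above) is to format the parsed value back with the same pattern before comparing:
def checkDatePattern() = {
  val pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val date = getDate
  // Round-tripping through parse + format asserts both that the string parses
  // and that it is displayed in exactly this pattern.
  LocalDateTime.parse(date, pattern).format(pattern) shouldBe date
}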

Add scoped variable per row iteration in Apache Spark

I'm reading multiple html files into a dataframe in Spark.
I'm converting elements of the html to columns in the dataframe using a custom udf
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
This works perfectly; however, each withColumn call re-parses the html string, which is redundant.
Is there a way (without using lookup tables or such) to generate one parsed Document (Jsoup.parse(html)) per row based on the "filecontent" column and make it available to all the withColumn calls on the dataframe?
Or shouldn't I even try using DataFrames and just use RDDs?
So the final answer was in fact quite simple:
Just map over the rows and create the objects once there.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y => Some(y)
  }
}

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)
      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
  .withColumn("biz_name", col("temp")(0))
  .withColumn("biz_website", col("temp")(1))
  .drop("temp")

def parseDocValue(cssSelectorQueries: Array[String]) =
  udf((html: String) => {
    val j = Jsoup.parse(html)
    cssSelectorQueries.map(query => j.select(query).text())
  })
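For completeness, a minimal sketch of the setup this rewrite assumes (spark being a SparkSession, Jsoup coming from the org.jsoup dependency; the app name is just illustrative):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.jsoup.Jsoup

val spark = SparkSession.builder()
  .appName("html-to-columns") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Needed for .toDF on the (filepath, filecontent) RDD and for the 'filecontent column syntax.
import spark.implicits._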

formatters for List[DateTime] play scala

I am working on a project using Play, Scala and MongoDB. I want to store a List[DateTime] in a collection, so I need formatters for it. To store a DateTime I used this formatter:
implicit def dateFormat = {
  val dateStandardFormat = "yyyy-MM-dd'T'HH:mm:ss.SSS"
  val dateReads: Reads[DateTime] = Reads[DateTime](js =>
    js.validate[JsObject].map(_.value.toSeq).flatMap {
      case Seq(("$date", JsNumber(ts))) if ts.isValidLong =>
        JsSuccess(new DateTime(ts.toLong))
      case _ =>
        JsError(__, "validation.error.expected.$date")
    }
  )
  val dateWrites: Writes[DateTime] = new Writes[DateTime] {
    def writes(dateTime: DateTime): JsValue = Json.obj("$date" -> dateTime.getMillis())
  }
  Format(dateReads, dateWrites)
}
but it is not working for storing a list of DateTimes. Thanks in advance for the help.
You need to create an implicit JSON Reads and Writes for List[DateTime]. In your example, you only define how to serialize and deserialize the DateTime type itself. Adding the list format below your DateTime formatter lets the framework know how to JSONify lists of DateTimes.
See the working example below:
val dateStandardFormat = "yyyy-MM-dd'T'HH:mm:ss.SSS"

val dateReads: Reads[DateTime] = Reads[DateTime](js =>
  js.validate[JsObject].map(_.value.toSeq).flatMap {
    case Seq(("$date", JsNumber(ts))) if ts.isValidLong =>
      JsSuccess(new DateTime(ts.toLong))
    case _ =>
      JsError(__, "validation.error.expected.$date")
  }
)

val dateWrites: Writes[DateTime] = new Writes[DateTime] {
  def writes(dateTime: DateTime): JsValue = Json.obj("$date" -> dateTime.getMillis())
}

implicit def dateFormat = Format(dateReads, dateWrites)
implicit val listDateTimeFormat = Format(Reads.list[DateTime](dateReads), Writes.list[DateTime](dateWrites))

val m = List(DateTime.now(), DateTime.now(), DateTime.now(), DateTime.now(), DateTime.now())
println(Json.toJson(m).toString())
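A small round-trip sketch on top of the example above, passing the list reads explicitly to sidestep implicit-resolution surprises:
val json = Json.toJson(m)(listDateTimeFormat) // e.g. [{"$date":1489222800000}, ...]
val back = json.validate[List[DateTime]](Reads.list[DateTime](dateReads))
println(back.isSuccess) // true if every element parses back into a DateTime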
You could also use the MongoDateFormats from the simple-reactivemongo project.