Joda-Time DateTimeFormat throws: java.lang.IllegalArgumentException: Invalid format is malformed - Scala

I am converting some dates from the "ddMMMyyyy" format to "yyyy-MM-dd". I have written the code below:
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
import scala.util.{Failure, Success, Try}
def dateFormatter(date: String, inputFormat: String, outputFormat: String): String = {
  var returnDate: String = ""
  val inputDateFormat = DateTimeFormat.forPattern(inputFormat)
  val outputDateFormat = DateTimeFormat.forPattern(outputFormat)
  Try {
    if (date != null && date.trim != "") {
      val parsedDate = DateTime.parse(date.trim(), inputDateFormat)
      returnDate = outputDateFormat.print(parsedDate)
    } else {
      returnDate = null
    }
  } match {
    case Success(_) => returnDate
    case Failure(e) =>
      println("exception!" + e)
      date
  }
}
I am using Scala 2.12.
If I pass as input:
"09September2032"
Then I get
"2032-09-09"
However, if I pass
"09sep2032"
I get the following exception:
java.lang.IllegalArgumentException: Invalid format: "09sep2032" is malformed at "sep2032"
What is wrong with the provided pattern?

It seems that for September we need to pass four characters, "Sept":
println(dateFormatter("09Jan2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Feb2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Mar2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Apr2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09May2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Jun2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Jul2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Aug2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Sept2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Oct2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Nov2032", "ddMMMyyyy", "yyyy-MM-dd"))
println(dateFormatter("09Dec2032", "ddMMMyyyy", "yyyy-MM-dd"))
Would output:
2032-01-09
2032-02-09
2032-03-09
2032-04-09
2032-05-09
2032-06-09
2032-07-09
2032-08-09
2032-09-09
2032-10-09
2032-11-09
2032-12-09
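If the short-name failure is caused by the JVM's locale data (some newer JDK locales abbreviate September as "Sept" rather than "Sep"), one possible workaround is to pin the formatter to an explicit locale whose short month names are three letters. A minimal sketch, assuming Locale.US behaves that way on your JVM:
import java.util.Locale
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

// Sketch: pin the input formatter to an explicit locale instead of relying
// on the JVM default; under Locale.US the short form of September is "Sep".
val inputDateFormat = DateTimeFormat.forPattern("ddMMMyyyy").withLocale(Locale.US)
val outputDateFormat = DateTimeFormat.forPattern("yyyy-MM-dd")
println(outputDateFormat.print(DateTime.parse("09Sep2032", inputDateFormat))) // expected: 2032-09-09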

Related

Scala loop to add date strings into a Seq

I am trying to add date strings from an array into a Seq while determining whether each one is a weekend day.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, GregorianCalendar}
import org.apache.spark.sql.SparkSession
val arrDateEsti = dtfBaseNonLong.select("AAA").distinct().collect.map(_(0).toString)
var dtfDateCate = Seq(
  ("0000", "0")
)
for (a <- 0 to arrDateEsti.length - 1) {
  val dayDate: Date = dateFormat.parse(arrDateEsti(a))
  val cal = new GregorianCalendar
  cal.setTime(dayDate)
  if (cal.get(Calendar.DAY_OF_WEEK) == 1 || cal.get(Calendar.DAY_OF_WEEK) == 7) {
    dtfDateCate :+ (arrDateEsti(a), "1")
  } else {
    dtfDateCate :+ (arrDateEsti(a), "0")
  }
}
scala> dtfDateCate
res20: Seq[(String, String)] = List((0000,0))
It returns the same initial sequence. But if I run it for a single element, it works. What went wrong?
scala> val dayDate:Date = dateFormat.parse(arrDateEsti(0));
dayDate: java.util.Date = Thu Oct 15 00:00:00 CST 2020
scala> cal.setTime(dayDate);
scala> if (cal.get(Calendar.DAY_OF_WEEK)==1 || cal.get(Calendar.DAY_OF_WEEK)==7){
| dtfDateCate:+(arrDateEsti(0),"1")
| }else{
| dtfDateCate:+(arrDateEsti(0),"0")
| };
res14: Seq[(String, String)] = List((0000,0), (20201015,0))
I think this gets at what you're trying to do.
import java.time.LocalDate
import java.time.DayOfWeek.{SATURDAY, SUNDAY}
import java.time.format.DateTimeFormatter
// replace with the dtfBaseNonLong.select(... code
val arrDateEsti = Seq("20201015", "20201017") // placeholder
val dtFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

val dtfDateCate = ("0000", "0") +:
  arrDateEsti.map { dt =>
    val day = LocalDate.parse(dt, dtFormat).getDayOfWeek()
    if (day == SATURDAY || day == SUNDAY) (dt, "1")
    else (dt, "0")
  }
// dtfDateCate = Seq((0000,0), (20201015,0), (20201017,1))
Yeah, it should be dtfDateCate = dtfDateCate :+ (arrDateEsti(a), "1") — :+ returns a new Seq, so you have to reassign it.
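For reference, a minimal sketch of the original loop with that reassignment applied (the "yyyyMMdd" pattern for dateFormat is an assumption based on the sample value 20201015):
import java.text.SimpleDateFormat
import java.util.{Calendar, GregorianCalendar}

val dateFormat = new SimpleDateFormat("yyyyMMdd") // assumed input format
var dtfDateCate = Seq(("0000", "0"))
for (a <- arrDateEsti.indices) {
  val cal = new GregorianCalendar
  cal.setTime(dateFormat.parse(arrDateEsti(a)))
  val dow = cal.get(Calendar.DAY_OF_WEEK)
  val flag = if (dow == Calendar.SUNDAY || dow == Calendar.SATURDAY) "1" else "0"
  dtfDateCate = dtfDateCate :+ ((arrDateEsti(a), flag)) // reassign: :+ returns a new Seq
}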

Throwing NullPointerException while reading from MySQL in Apache Spark with Scala

I am trying to read data from MySQL but it is throwing a NullPointerException. I am not sure what the reason is.
Code in main.scala:
object main extends App {
  val dt = args.lift(0)
  if (dt.isEmpty || !PairingbatchUtil.validatePartitionDate(dt.get)) {
    throw new Exception("Partition date is mandatory or enter valid format 'yyyy-MM-dd'")
  }
  var mailProperties: Properties = new Properties
  var templateMappingData: Map[String, Object] = Map(
    "job" -> "Load merchant count Data from hdfs to mongo",
    "jobProcessedDate" -> dt.get,
    "batch" -> "Pairing Batch")
  val startTime = System.currentTimeMillis()
  try {
    val conf = new SparkConf().setAppName("read_from_mysql") //.setMaster("local")
    conf.set("spark.sql.warehouse.dir", "/user/local/warehouse/")
    conf.set("hive.exec.dynamic.partition", "true")
    conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    conf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/db.table_name")
    conf.set("spark.mongodb.output.uri", "mongodb://127.0.0.1/db.table_name")
    val spark = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
    val schemaName = "/user/local/warehouse/"
    val aid = "1000"
    val resultPath = "/usr/local/proad" + "/" + dt.get
    val dbDataPartitionsMap = Map("aid" -> aid, "dt" -> dt.get)
    spark.sql("set aid=" + aid)
    spark.sql("set dt=" + dt.get)
    val configs = spark.sparkContext.getConf.getAll
    configs.foreach(i => println(i))
    val registerBaseTablesMap = Map(
      "DAILY_COUNT" -> ("SELECT * FROM " + schemaName + ".table_name WHERE aid = '${aid}' and dt ='${dt}'"),
      "DAILY_COUNT_FINAL" -> ("SELECT * FROM " + schemaName + ".second_table_name WHERE aid = '${aid}' and dt ='${dt}'"))
    val parentDF = PairingbatchUtil.readDataFromHive(registerBaseTablesMap.get("DAILY_COUNT").get, spark)
    val finalMerchantAffiliateDailyCountDF = Processor.process(parentDF, dbDataPartitionsMap, spark)
  }
Code in Processor.scala:
object Processor {
  case class MerchantDailyCount(_id: String, date: Date, totalClicks: String, totalLinks: String, shopUrl: String, shopUUID: String, shopName: String, publisherId: String)

  def process(parentDF: DataFrame, partitionsMap: Map[String, String], spark: SparkSession): DataFrame = {
    val schemaString = "_id date total_clicks total_links shop_url shop_uuid shop_name publisher_id"
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)
    var finalDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
    parentDF.foreach(row => {
      if (parentDF == null || row.getAs("publisher_id") == null || StringUtils.isBlank(row.getAs("shop_uuid"))) {
      } else {
        val shopUUID = row.getAs("shop_uuid").toString
        val currentDate = row.getAs("cur_date").toString
        val date = PairingbatchUtil.parseDate(currentDate, Constants.DATE_FORMAT_YYYY_MM_DD, Constants.TAIWAN_TIMEZONE)
        val publisherId = row.getAs("publisher_id").toString
        val totalClicks = row.getAs("total_clicks").toString
        val totalLinks = row.getAs("total_links").toString
        val shopUrl = PairingbatchUtil.setShopUrlInfo(shopUUID, "com.mysql.jdbc.Driver", "user_mame", "password", s"""select shop_url, shop_name from db.table_name where shop_uuid ='$shopUUID'""", "shopUrl", spark)._1
        val id = PairingbatchUtil.isNeedToSet(spark, shopUrl, publisherId, date)
        val merchantDailyCount = MerchantDailyCount(id, date, totalClicks, totalLinks, shopUrl, shopUUID, shopName, publisherId)
        import spark.implicits._
        val merchantCountDF = Seq(merchantDailyCount).toDF()
        finalDF = finalDF.union(merchantCountDF)
      }
    })
    finalDF
  }
}
Code in PairingBatchUtil.scala:
def setShopUrlInfo(shopUUID: String, driverClass: String, user: String, pass: String, query: String, url: String, sparkSession: SparkSession) = {
  val merchantDetailsDF = sparkSession.read // line no 139
    .format("jdbc")
    .option("url", url)
    .option("driver", driverClass)
    .option("dbtable", s"( $query ) t")
    .option("user", user)
    .option("password", pass)
    .load()
  if (merchantDetailsDF.count() == 0) {
    ("INVALID SHOP URL", "INVALID SHOP NAME")
  } else {
    (merchantDetailsDF.select(col = "shop_url").first().getAs("shop_url"), merchantDetailsDF.select(col = "shop_name").first().getAs("shop_name"))
  }
}
I expect the output of the query to be:
+--------------+---------+
| shop_url|shop_name|
+--------------+---------+
| parimal | roy |
+--------------+---------+
but the actual output is:
19/07/04 14:48:50 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:117)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:115)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:549)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:613)
at com.rakuten.affiliate.order.pairing.batch.util.PairingbatchUtil$.setShopUrlInfo(PairingbatchUtil.scala:139)
at com.rakuten.affiliate.order.pairing.batch.Processors.MechantAffDailyCountProcessor$$anonfun$process$1.apply(MechantAffDailyCountProcessor.scala:40)
at com.rakuten.affiliate.order.pairing.batch.Processors.MechantAffDailyCountProcessor$$anonfun$process$1.apply(MechantAffDailyCountProcessor.scala:30)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1954)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Are you using Spark 2.1?
In that case, I think you might have a problem with your configuration, as you can see in the source at line 117:
https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

Get next week date in Spark Dataframe using scala

I have a DateType input to the function. I would like to exclude Saturday and Sunday and get the next weekday if the input date falls on a weekend; otherwise it should give the next day's date.
Example:
Input: Monday 1/1/2017 output: 1/2/2017 (which is Tuesday)
Input: Saturday 3/4/2017 output: 3/5/2017 (which is Monday)
I have gone through https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html but I don't see a ready-made function, so I think it will need to be created.
So far I have something like this:
val nextWeekDate = udf { (startDate: DateType) =>
  val day = date_format(startDate, 'E'
  if (day == 'Sat' or day == 'Sun') {
    nextWeekDate = next_day(startDate, 'Mon')
  } else {
    nextWeekDate = date_add(startDate, 1)
  }
}
Need help to get it valid and working.
Using dates as strings:
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter

// If that is your date format
object MyFormat {
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
}

object MainSample {
  import MyFormat._

  def main(args: Array[String]): Unit = {
    import java.sql.Date
    import org.apache.spark.sql.{Row, SparkSession} // needed for Row and SparkSession below
    import org.apache.spark.sql.types.{DateType, IntegerType, StringType, StructField, StructType}
    import org.apache.spark.sql.functions._

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("YourApp")
        .config("spark.master", "local")
        .getOrCreate()

    import spark.implicits._ // imported after `spark` is defined

    val someData = Seq(
      Row(1, "2013-01-30"),
      Row(2, "2012-01-01")
    )
    val schema = List(StructField("id", IntegerType), StructField("date", StringType))

    val sourceDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))
    sourceDF.show()

    val _udf = udf { (dt: String) =>
      // Parse your date; dt is a string
      val localDate = LocalDate.parse(dt, formatter)

      // Check the day of week and add days in each case
      val newDate = if (localDate.getDayOfWeek == DayOfWeek.SATURDAY) {
        localDate.plusDays(2)
      } else if (localDate.getDayOfWeek == DayOfWeek.SUNDAY) {
        localDate.plusDays(1)
      } else {
        localDate.plusDays(1)
      }
      newDate.toString
    }

    sourceDF.withColumn("NewDate", _udf('date)).show()
  }
}
Here's a much simpler answer that's defined in spark-daria:
def nextWeekday(col: Column): Column = {
  val d = dayofweek(col)
  val friday = lit(6)
  val saturday = lit(7)
  when(col.isNull, null)
    .when(d === friday || d === saturday, next_day(col, "Mon"))
    .otherwise(date_add(col, 1))
}
You always want to stick with the native Spark functions whenever possible. This post explains the derivation of this function in greater detail.
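Assuming a DataFrame df with a date column (the column name below is hypothetical), it can be applied like any other Column function:
import org.apache.spark.sql.functions.col

// "event_date" is a hypothetical column name; nextWeekday is defined above
val withNext = df.withColumn("next_weekday", nextWeekday(col("event_date")))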

Spark scala udf error for if else

I am trying to define a UDF with the function getTime for Spark Scala, but I am getting the error error: illegal start of declaration. What might be the error in the syntax? It should return the date, and if there is a parse exception, instead of returning null it should return some error string.
def getTime = udf((x: String): java.sql.Timestamp => {
  if (x.toString() == "") return null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss");
    val d = format.parse(x.toString());
    val t = new Timestamp(d.getTime()); return t
  }
})
Thank you!
The return type for the udf is derived and should not be specified. Change the first line of code to:
def getTime = udf((x: String) => {
  // your code
})
This should get rid of the error.
The following is fully working code written in a functional style, making use of Scala constructs:
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the String encoder and $-syntax

val data: Seq[String] = Seq("", null, "2017-01-15 10:18:30")
val ds = spark.createDataset(data).as[String]

val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

// ******** HERE is the udf completely re-written: **********
val f = udf((input: String) => {
  Option(input).filter(_.nonEmpty).map(str => new Timestamp(fmt.parse(str).getTime)).orNull
})

val ds2 = ds.withColumn("parsedTimestamp", f($"value"))
The following is the output:
+-------------------+--------------------+
| value| parsedTimestamp|
+-------------------+--------------------+
| | null|
| null| null|
|2017-01-15 10:18:30|2017-01-15 10:18:...|
+-------------------+--------------------+
You should be using Scala datatypes, not Java datatypes. It would go like this:
def getTime(x: String): Timestamp = {
  // your code here
}
You can easily do it this way:
def getTimeFunction(timeAsString: String): java.sql.Timestamp = {
  if (timeAsString.isEmpty)
    null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
    val date = format.parse(timeAsString.toString())
    val time = new Timestamp(date.getTime())
    time
  }
}

val getTimeUdf = udf(getTimeFunction _)
Then use this getTimeUdf accordingly.
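For example, applied to a hypothetical DataFrame df with a string column event_time:
import org.apache.spark.sql.functions.col

// Column names are hypothetical; getTimeUdf is the udf defined above
val withTs = df.withColumn("event_ts", getTimeUdf(col("event_time")))
withTs.show(false)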

Better way to convert a string field into timestamp in Spark

I have a CSV in which a field is a datetime in a specific format. I cannot import it directly into my Dataframe because it needs to be a timestamp. So I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x.toString() == "")
    return null
  else {
    val d = format.parse(x.toString());
    val t = new Timestamp(d.getTime());
    return t
  }
}

def convert(row: Row): Row = {
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
Is there a better, more concise way to do this with the Dataframe API or spark-sql? The above method requires the creation of an RDD and specifying the schema for the Dataframe again.
Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
You can use unix_timestamp to parse strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
You should be able to use these with expr and HiveContext.
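For example, a minimal sketch via selectExpr, assuming sqlContext is a HiveContext so that Hive's unix_timestamp(string, pattern) UDF is available, and mirroring the double-cast workaround above:
// Sketch only: requires a HiveContext for the unix_timestamp(string, pattern) UDF
val withTs = df.selectExpr(
  "id",
  "dts",
  "cast(cast(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss') as double) as timestamp) as ts"
)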
I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (using null is not considered good practice):
def getTimestamp(s: String): Option[Timestamp] = s match {
  case "" => None
  case _ =>
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
}
Please note that I assume you know the Row element types beforehand (if you read them from a CSV file, they are all String), which is why I use a proper type like String and not Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row(3))))
I have ISO8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}

object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
  def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}

sqlContext.udf.register("formatTimeStamp", (isoTimestamp: String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
And you can just use the UDF in your Spark SQL query.
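For example (the table and column names here are hypothetical):
// formatTimeStamp is the UDF registered above; table/column names are made up
val days = sqlContext.sql("SELECT formatTimeStamp(event_ts) AS event_day FROM events")
days.show()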
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow among the rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/cvs-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
  new Iterator[Row] {
    val row = new GenericMutableRow(4)
    var current: Array[String] = _

    def hasNext = iter.hasNext

    def next() = {
      current = iter.next()
      row(0) = current(0)
      row(1) = current(1)
      row(2) = current(2)

      val ts = getTimestamp(current(3))
      if (ts != null) {
        row.update(3, ts)
      } else {
        row.setNullAt(3)
      }
      row
    }
  }
}
And you should still use the schema to generate a DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
Usage of GenericMutableRow inside an iterator implementation can be found in Aggregate Operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._

val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df: DataFrame = new CsvParser()
  .withDelimiter('|')
  .withInferSchema(true)
  .withParseMode("DROPMALFORMED")
  .csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting to a timestamp and then casting back to a string. I hope this helps anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
  val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
  newDf.withColumn(col, conversionFunc)
})