Casting a column in a spark DataFrame without nulls being created - scala

Is there a way to cast columns in Spark and have it fail in case of a type mismatch rather than getting a null returned?
As an example I have a DF with all string columns but one of them I want to cast to date
+----------+------------+------------+
| service| eventType|process_date|
+----------+------------+------------+
| myservice| myeventtype| 2020-10-15|
| myservice| myeventtype| 2020-02-15|
|myservice2|myeventtype3| notADate |
+----------+------------+------------+
If I try to cast this with the main cast function df.withColumn("process_date", df("process_date").cast(targetType)) it will replace the bad data with a null
+----------+------------+------------+
| service| eventType|process_date|
+----------+------------+------------+
| myservice| myeventtype| 2020-10-15|
| myservice| myeventtype| 2020-02-15|
|myservice2|myeventtype3| null|
+----------+------------+------------+
Using this function in my current program could result in dangerous loss of data that I might not catch until it's too late.

I find two ways of doing what you want.
First, if you really want the process to fail when a Date is not parseable you can use UDF:
import java.time.LocalDate
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
object Data {
val tuples = List(
("myservice", "myeventtype", "2020-10-15"),
("myservice", "myeventtype", "2020-02-15"),
("myservice2", "myeventtype3", "notADate")
)
}
object BadDates {
def main(args: Array[String]) {
val spark = SparkSession.builder.master("local[2]").appName("Simple Application").getOrCreate()
import spark.implicits._
val dfBad = Data.tuples.toDF("service","eventType","process_date")
val dateConvertUdf = udf({str : String => java.sql.Date.valueOf(LocalDate.parse(str))})
dfBad
.withColumn("process_date", dateConvertUdf(col("process_date")))
.show()
}
}
This will fail with the following exception:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(BadDates$$$Lambda$1122/934288610: (string) => date)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
...
Caused by: java.time.format.DateTimeParseException: Text 'notADate' could not be parsed at index 0
at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1949)
Alternatively you can do the conversion and check if the converted value is null but the original isn't for any line:
object BadDates2 {
def main(args: Array[String]) {
val spark = SparkSession.builder.master("local[2]").appName("Simple Application").getOrCreate()
import spark.implicits._
val dfBad = Data.tuples.toDF("service","eventType","process_date")
val df = dfBad
.withColumn("process_date_dat", col("process_date").cast(DateType))
val badLines = df
.filter(col("process_date").isNotNull && col("process_date_dat").isNull)
.count()
assert(badLines==0) //This will fail, badLines is 1
}
}

Related

Spark : Parse a Date / Timestamps with different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm ) in same column of a Dataframe

The problem is: I have a dataset where a column having 2 or more types of date format.
In general I select all values as String type and then use the to_date to parse the date.
But I don't know how do I parse a column having two or more types of date formats.
val DF= Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))
Output from the above command:
null
null
null
0019-04-13
0019-04-14
0019-04-15
0019-04-16
The code above I have written is not working for the format MM/dd/yyyy and the format which did not provided for that I am getting the null as a output.
So seeking the help to parse the file with different date formats.
If possible kindly also share some tutorial or notes to the deal with the date formats.
Please note: I am using Scala for the spark framework.
Thanks in advance.
Check EDIT section to use Column functions instead of UDF for performance benefits in later part of this solution --
Well, Let's do it try-catch way.. Try a column conversion against each format and keep the success value.
You may have to provide all possible format from outside as parameter or keep a master list of all possible formats somewhere in code itself..
Here is the possible solution.. ( Instead of SimpleDateFormatter which sometimes have issues on timestamps beyond milliseconds, I use new library - java.time.format.DateTimeFormatter)
Create a to_timestamp Function, which accepts string to convert to timestamp and all possible Formats
import java.time.LocalDate
import java.time.LocalDateTime
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import scala.util.Try
def toTimestamp(date: String, tsformats: Seq[String]): Option[java.sql.Timestamp] = {
val out = (for (tsft <- tsformats) yield {
val formatter = new DateTimeFormatterBuilder()
.parseCaseInsensitive()
.appendPattern(tsft).toFormatter()
if (Try(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter))).isSuccess)
Option(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter)))
else None
}).filter(_.isDefined)
if (out.isEmpty) None else out.head
}
Create a UDF on top of it - ( this udf takes Seq of Format strings as parameter)
def UtoTimestamp(tsformats: Seq[String]) = org.apache.spark.sql.functions.udf((date: String) => toTimestamp(date, tsformats))
And now, simply use it in your spark code.. Here's the test with your Data -
val DF = Seq(("02-04-2020 08:02"), ("03-04-2020 10:02"), ("04-04-2020 09:00"), ("04/13/19 9:12"), ("04/14/19 2:13"), ("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
val tsformats = Seq("MM-dd-yyyy HH:mm", "MM/dd/yy H:mm")
DF.select(UtoTimestamp(tsformats)('DOB)).show
And here is the output -
+-------------------+
| UDF(DOB)|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
Cherry on top would be to avoid having to write UtoTimestamp(colname) for many columns in your dataframe.
Let's write a function which accepts a Dataframe, List of all Timestamp columns, And all possible formats which your source data may have coded timestamps in..
It'd parse all timestamp columns for you with trying against formats..
def WithTimestampParsed(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {
val colSelector = df.columns.map {
c =>
{
if (tsCols.contains(c)) UtoTimestamp(tsformats)(col(c)) alias (c)
else col(c)
}
}
Use it like this -
// You can pass as many column names in a sequence to be parsed
WithTimestampParsed(DF, Seq("DOB"), tsformats).show
Output -
+-------------------+
| DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
EDIT -
I saw latest spark code, and they are also using java.time._ utils now to parse dates and timestamps which enable handling beyond Milliseconds.. Earlier these functions were based on SimpleDateFormat ( I wasn't relying on to_timestamps of spark earlier due to this limit) .
So with to_date & to_timestamp functions being so reliable now.. Let's use them instead of having to write a UDF.. Let's write a function which operates on Columns.
def to_timestamp_simple(col: org.apache.spark.sql.Column, formats: Seq[String]): org.apache.spark.sql.Column = {
coalesce(formats.map(fmt => to_timestamp(col, fmt)): _*)
}
and with this WithTimestampParsedwould look like -
def WithTimestampParsedSimple(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {
val colSelector = df.columns.map {
c =>
{
if (tsCols.contains(c)) to_timestamp_simple(col(c), tsformats) alias (c)
else col(c)
}
}
df.select(colSelector: _*)
}
And use it like -
DF.select(to_timestamp_simple('DOB,tsformats)).show
//OR
WithTimestampParsedSimple(DF, Seq("DOB"), tsformats).show
Output looks like -
+---------------------------------------------------------------------------------------+
|coalesce(to_timestamp(`DOB`, 'MM-dd-yyyy HH:mm'), to_timestamp(`DOB`, 'MM/dd/yy H:mm'))|
+---------------------------------------------------------------------------------------+
| 2020-02-04 08:02:00|
| 2020-03-04 10:02:00|
| 2020-04-04 09:00:00|
| 2019-04-13 09:12:00|
| 2019-04-14 02:13:00|
| 2019-04-15 10:14:00|
| 2019-04-16 05:15:00|
+---------------------------------------------------------------------------------------+
+-------------------+
| DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
I put some code that maybe can help you in some way.
I tried this
mport org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import java.sql.Date
import java.util.{GregorianCalendar}
object DateFormats {
val spark = SparkSession
.builder()
.appName("Multiline")
.master("local[*]")
.config("spark.sql.shuffle.partitions", "4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id", "Multiline") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
def main(args: Array[String]): Unit = {
Logger.getRootLogger.setLevel(Level.ERROR)
try {
import spark.implicits._
val DF = Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))
DOBDF.show()
// todo: my code below
DF
.rdd
.map(r =>{
if(r.toString.contains("-")) {
val dat = r.toString.substring(1,11).split("-")
val calendar = new GregorianCalendar(dat(2).toInt,dat(1).toInt - 1,dat(0).toInt)
(r.toString, new Date(calendar.getTimeInMillis))
} else {
val dat = r.toString.substring(1,9).split("/")
val calendar = new GregorianCalendar(dat(2).toInt + 2000,dat(0).toInt - 1,dat(1).toInt)
(r.toString, new Date(calendar.getTimeInMillis))
}
})
.toDF("DOB","DATE")
.show()
// To have the opportunity to view the web console of Spark: http://localhost:4040/
println("Type whatever to the console to exit......")
scala.io.StdIn.readLine()
} finally {
sc.stop()
println("SparkContext stopped.")
spark.stop()
println("SparkSession stopped.")
}
}
}
+------------------+----------+
| DOB| DATE|
+------------------+----------+
|[02-04-2020 08:02]|2020-04-02|
|[03-04-2020 10:02]|2020-04-03|
|[04-04-2020 09:00]|2020-04-04|
| [04/13/19 9:12]|2019-04-13|
| [04/14/19 2:13]|2019-04-14|
| [04/15/19 10:14]|2019-04-15|
| [04/16/19 5:15]|2019-04-16|
+------------------+----------+
Regards
We can use coalesce function as mentioned in the accepted answer. On each format mismatch, to_date returns null, which makes coalesce to move to the next format in the list.
But with to_date, if you have issues in parsing the correct year component in the date in yy format (In the date 7-Apr-50, if you want 50 to be parsed as 1950 or 2050), refer to this stackoverflow post
import org.apache.spark.sql.functions.coalesce
// Reference: https://spark.apache.org/docs/3.0.0/sql-ref-datetime-pattern.html
val parsedDateCol: Column = coalesce(
// Four letters of M looks for full name of the Month
to_date(col("original_date"), "MMMM, yyyy"),
to_date(col("original_date"), "dd-MMM-yy"),
to_date(col("original_date"), "yyyy-MM-dd"),
to_date(col("original_date"), "d-MMM-yy")
)
// I have used some dummy dataframe name.
dataframeWithDateCol.select(
parsedDateCol.as("parsed_date")
)
.show()

How to convert org.apache.spark.sql.ColumnName to string,Decimal type in Spark Scala?

I have a JSON like below
{"name":"method1","parameter1":"P1name","parameter2": 1.0}
I am loading my JSON file
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("C:/Users/test/Desktop/te.txt")
scala> df.show()
+-------+----------+----------+
| name|parameter1|parameter2|
+-------+----------+----------+
|method1| P1name| 1.0 |
+-------+----------+----------+
I have a function like below:
def method1(P1:String, P2:Double)={
| print(P1)
print(P2)
| }
I am calling my method1 based on column name after executing below code it should execute method1.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
But I am getting bellow error.
<console>:63: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how to convert org.apache.spark.sql.ColumnName data type to String
When you pass arguments as
method1($"parameter1",$"parameter2")
You are passing columns to the function and not primitive datatypes. So, I would suggest you to change your method1 and method2 as udf functions, if you want to apply primitive datatype manipulations inside functions. And udf functions would have to return a value for each row of the new column.
import org.apache.spark.sql.functions._
def method1 = udf((P1:String, P2:Double)=>{
print(P1)
print(P2)
P1+P2
})
def method2 = udf((P1:String, P2:Double)=>{
print(P1)
print(P2)
P1+P2
})
Then your withColumn api should work properly
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
Note: udf functions perform data serialization and deserialzation for changing the column dataTypes to be processed row-wise which would increase complexity and a lot of memory usages. spark functions should be used as much as possible
You can try like this:
scala> def method1(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 0
| }
scala> def method2(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 1
| }
df.withColumn("methodCalling", when($"name" === "method1", method1(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head))
.otherwise(when($"name" === "method2", method2(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head)))).show
//output
P1name
1.0
+-------+----------+----------+-------------+
| name|parameter1|parameter2|methodCalling|
+-------+----------+----------+-------------+
|method1| P1name| 1.0| 0|
+-------+----------+----------+-------------+
You have to return something from your method otherwise it will retun unit and it will give error after printing result:
java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:75)
at org.apache.spark.sql.functions$.lit(functions.scala:101)
at org.apache.spark.sql.functions$.when(functions.scala:1245)
... 50 elided
Thanks.
I think you just want to read the JSON and based on that call the methods.
Since you have already created a dataframe, you can do something like :
df.map( row => (row.getString(0), row.getString(1) , row.getDouble(2) ) ).collect
.foreach { x =>
x._1.trim.toLowerCase match {
case "method1" => method1(x._2, x._3)
//case "method2" => method2(x._2, x._3)
//case _ => methodn(x._2, x._3)
}
}
// Output : P1name1.0
// Because you used `print` and not `println` ;)

passing UDF to a method or class

I have a UDF say
val testUDF = udf{s: string=>s.toUpperCase}
I want to create this UDF in a separate method or may be something else like an implementation class and pass it on another class which uses it. Is it possible?
Say suppose I have a class A
class A(df: DataFrame) {
def testMethod(): DataFrame = {
val demo=df.select(testUDF(col))
}
}
class A should be able to use UDF. Can this be achieved?
Given a dataframe as
+----+
|col1|
+----+
|abc |
|dBf |
|Aec |
+----+
And a udf function
import org.apache.spark.sql.functions._
val testUDF = udf{s: String=>s.toUpperCase}
You can definitely use that udf function from another class as
val demo = df.select(testUDF(col("col1")).as("upperCasedCol"))
which should give you
+-------------+
|upperCasedCol|
+-------------+
|ABC |
|DBF |
|AEC |
+-------------+
But I would suggest you to use other functions if possible as udf function requires columns to be serialized and deserialized which would consume time and memory more than other functions available. UDF function should be the last choice.
You can use upper function for your case
val demo = df.select(upper(col("col1")).as("upperCasedCol"))
This will generate the same output as the original udf function
I hope the answer is helpful
Updated
Since your question is asking for information on how to call the udf function defined in another class or object, here is the method
suppose you have an object where you defined the udf function or a function that i suggested as
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
object UDFs {
def testUDF = udf{s: String=>s.toUpperCase}
def testUpper(column: Column) = upper(column)
}
Your A class is as in your question, I just added another function
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
class A(df: DataFrame) {
def testMethod(): DataFrame = {
val demo = df.select(UDFs.testUDF(col("col1")))
demo
}
def usingUpper() = {
df.select(UDFs.testUpper(col("col1")))
}
}
Then you can call the functions from main as below
import org.apache.spark.sql.SparkSession
object TestUpper {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder().appName("Simple Application")
.master("local")
.config("", "")
.getOrCreate()
import sparkSession.implicits._
val df = Seq(
("abc"),
("dBf"),
("Aec")
).toDF("col1")
val a = new A(df)
//calling udf function
a.testMethod().show(false)
//calling upper function
a.usingUpper().show(false)
}
}
I guess this is more than helpful
If I understand correctly you would actually like some kind of factory to create this user-defined-function for a specific class A.
This could be achieve using a type class which gets injected implicitly.
E.g. (I had to define UDF and DataFrame to be able to test this)
type UDF = String => String
case class DataFrame(col: String) {
def select(in: String) = s"col:$col, in:$in"
}
trait UDFFactory[A] {
def testUDF: UDF
}
implicit object UDFFactoryA extends UDFFactory[AClass] {
def testUDF: UDF = _.toUpperCase
}
class AClass(df: DataFrame) {
def testMethod(implicit factory: UDFFactory[AClass]) = {
val demo = df.select(factory.testUDF(df.col))
println(demo)
}
}
val a = new AClass(DataFrame("test"))
a.testMethod // prints 'col:test, in:TEST'
Like you mentioned, create a method exactly like your UDF in your object body or companion class,
val myUDF = udf((str:String) => { str.toUpperCase })
Then for some dataframe df do this,
val res=df withColumn("NEWCOLNAME", myUDF(col("OLDCOLNAME")))
This will change something like this,
+-------------------+
| OLDCOLNAME |
+-------------------+
| abc |
+-------------------+
to
+-------------------+-------------------+
| OLDCOLNAME | NEWCOLNAME |
+-------------------+-------------------+
| abc | ABC |
+-------------------+-------------------+
Let me know if this helped, Cheers.
Yes thats possible as functions are objects in scala which can be passed around:
import org.apache.spark.sql.expressions.UserDefinedFunction
class A(df: DataFrame, testUdf:UserDefinedFunction) {
def testMethod(): DataFrame = {
df.select(testUdf(col))
}
}

Spark scala udf error for if else

I am trying to define udf with the function getTIme for spark scala udf but i am getting the error as error: illegal start of declaration. What might be error in the syntax and retutrn the date and also if there is parse exception instead of returing the null, send the some string as error
def getTime=udf((x:String) : java.sql.Timestamp => {
if (x.toString() == "") return null
else { val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss");
val d = format.parse(x.toString());
val t = new Timestamp(d.getTime()); return t
}})
Thank you!
The return type for the udf is derived and should not be specified. Change the first line of code to:
def getTime=udf((x:String) => {
// your code
}
This should get rid of the error.
The following is a fully working code written in functional style and making use of Scala constructs:
val data: Seq[String] = Seq("", null, "2017-01-15 10:18:30")
val ds = spark.createDataset(data).as[String]
import java.text.SimpleDateFormat
import java.sql.Timestamp
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
// ********HERE is the udf completely re-written: **********
val f = udf((input: String) => {
Option(input).filter(_.nonEmpty).map(str => new Timestamp(fmt.parse(str).getTime)).orNull
})
val ds2 = ds.withColumn("parsedTimestamp", f($"value"))
The following is output:
+-------------------+--------------------+
| value| parsedTimestamp|
+-------------------+--------------------+
| | null|
| null| null|
|2017-01-15 10:18:30|2017-01-15 10:18:...|
+-------------------+--------------------+
You should be using Scala datatypes, not Java datatypes. It would go like this:
def getTime(x: String): Timestamp = {
//your code here
}
You can easily do it in this way :
def getTimeFunction(timeAsString: String): java.sql.Timestamp = {
if (timeAsString.isEmpty)
null
else {
val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
val date = format.parse(timeAsString.toString())
val time = new Timestamp(date.getTime())
time
}
}
val getTimeUdf = udf(getTimeFunction _)
Then use this getTimeUdf accordingly. !

Better way to convert a string field into timestamp in Spark

I have a CSV in which a field is datetime in a specific format. I cannot import it directly in my Dataframe because it needs to be a timestamp. So I import it as string and convert it into a Timestamp like this
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x:Any) : Timestamp = {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
if (x.toString() == "")
return null
else {
val d = format.parse(x.toString());
val t = new Timestamp(d.getTime());
return t
}
}
def convert(row : Row) : Row = {
val d1 = getTimestamp(row(3))
return Row(row(0),row(1),row(2),d1)
}
Is there a better, more concise way to do this, with the Dataframe API or spark-sql? The above method requires the creation of an RDD and to give the schema for the Dataframe again.
Spark >= 2.2
Since you 2.2 you can provide format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use date processing functions which have been introduced in Spark 1.5. Assuming you have following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
You can use unix_timestamp to parse strings and cast it to timestamp
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
you should be able to use these with expr and HiveContext.
I haven't played with Spark SQL yet but I think this would be more idiomatic scala (null usage is not considered a good practice):
def getTimestamp(s: String) : Option[Timestamp] = s match {
case "" => None
case _ => {
val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
Try(new Timestamp(format.parse(s).getTime)) match {
case Success(t) => Some(t)
case Failure(_) => None
}
}
}
Please notice I assume you know Row elements types beforehand (if you read it from a csv file, all them are String), that's why I use a proper type like String and not Any (everything is subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0),row(1),row(2), getTimestamp(row(3))
I have ISO8601 timestamp in my dataset and I needed to convert it to "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}
object DateUtils extends Serializable {
def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}
sqlContext.udf.register("formatTimeStamp", (isoTimestamp : String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
And you can just use the UDF in your spark SQL query.
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would like to move the getTimeStamp method wrote by you into rdd's mapPartitions and reuse GenericMutableRow among rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/cvs-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
new Iterator[Row] {
val row = new GenericMutableRow(4)
var current: Array[String] = _
def hasNext = iter.hasNext
def next() = {
current = iter.next()
row(0) = current(0)
row(1) = current(1)
row(2) = current(2)
val ts = getTimestamp(current(3))
if(ts != null) {
row.update(3, ts)
} else {
row.setNullAt(3)
}
row
}
}
}
And you should still use schema to generate a DataFrame
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
The usage of GenericMutableRow inside an iterator implementation could be find in Aggregate Operator, InMemoryColumnarTableScan, ParquetTableOperations etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._
val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df : DataFrame = new CsvParser().withDelimiter('|')
.withInferSchema(true)
.withParseMode("DROPMALFORMED")
.csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting as a timestamp, and then casting back as a string. I hope this helps for anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
newDf.withColumn(col, conversionFunc)
})