How to extract year from a date string? - scala

I'm using Spark 2.1.2.
I'm working with datetime data, and would like to get the year from a date string using Spark SQL functions.
The code I use is as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder().
appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13")).toDF()
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
I expect to get the year of the date, 13, in the example above.
Actual: org.apache.spark.sql.AnalysisException: cannot resolve 'date' given input columns: [CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP)];;
'Project [year('date) AS year(date)#11]

case class Person(id: Int, date: String)
val mydf = Seq(Person(1,"9/16/13")).toDF
val solution = mydf.withColumn("year", year(to_timestamp($"date", "MM/dd/yy")))
scala> solution.show
+---+-------+----+
| id| date|year|
+---+-------+----+
| 1|9/16/13|2013|
+---+-------+----+
It looks like year gives you four digits for the year, not two. I'm leaving the string truncation as a home exercise for you :)
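If you do want the two-digit form, one possible shortcut (not covered above) is to format the parsed timestamp back to text with date_format and the "yy" pattern:
// date_format renders the parsed timestamp as a string, here just the two-digit year
val twoDigit = mydf.withColumn("year", date_format(to_timestamp($"date", "MM/dd/yy"), "yy"))
twoDigit.show()   // the year column holds "13" (as a string)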
Actual: org.apache.spark.sql.AnalysisException: cannot resolve 'date' given input columns: [CAST(unix_timestamp(date, MM/dd/yy) AS TIMESTAMP)];; 'Project [year('date) AS year(date)#11]
The reason for the exception is that you're trying to access the "old" date column (in select(year($"date"))), which is no longer available after the earlier select (select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))).
You could use alias or as to change the weird-looking auto-generated name into something else like date again, and that would work.
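As a quick illustration of that alias fix (not part of the original code), keeping the two-step approach from the question:
val select_df = mydf.select(
  unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).as("date"))
select_df.select(year($"date")).show()   // prints 2013 (four digits, as noted above)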

Related

Update date format in spark dataframe for multiple spark columns

I have a Spark dataframe where a few columns have different date formats.
To handle this I have written the code below to keep a consistent format for all the date columns.
As the date format may change over time, I have defined a set of candidate formats in dt_formats.
def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
coalesce(formats.map(fmt => to_timestamp(s, fmt)):_*)
}
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD1", date_format(to_timestamp_multiple($"ETD",Seq("dd-MMM-yyyy", dt_formats)).cast("date"), "yyyy-MM-dd")).drop("ETD").withColumnRenamed("ETD1","ETD")
But this way I have to create a new column, drop the old column, and then rename the new one, which makes the code unnecessarily clumsy; I would rather overwrite the column in place.
I am trying to implement similar functionality with the Scala snippet below, but it throws org.apache.spark.sql.catalyst.parser.ParseException, and I am unable to identify what change I should make to get it to work.
val CleansedData = rawDF.selectExpr(rawDF.columns.map { x =>
  x match {
    case "ETA" => s"""date_format(to_timestamp_multiple($x, dt_formats).cast("date"), "yyyy-MM-dd") as ETA"""
    case _ => x
  }
}: _*)
Hence seeking help.
Thanks in advance.
Create a UDF in order to use with select. The select method takes columns and produces another DataFrame.
Also, instead of using coalesce, it might be more straightforward simply to build a parser that handles all of the formats. You can use DateTimeFormatterBuilder for this.
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeFormatterBuilder
import org.apache.spark.sql.functions.{col, udf}
import java.time.LocalDate
import scala.util.Try
import java.sql.Date
val dtFormatStrings:Seq[String] = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
// use foldLeft with appendOptional, which for each format
// adds that format as another optional pattern on the builder
val initBuilder = new DateTimeFormatterBuilder()
val builder: DateTimeFormatterBuilder = dtFormatStrings.foldLeft(initBuilder)(
(b: DateTimeFormatterBuilder, s:String) => b.appendOptional(DateTimeFormatter.ofPattern(s)))
val formatter = builder.toFormatter()
// Create the UDF, which just takes
// any function returning a sql-compatible type (java.sql.Date, here)
def toTimeStamp2(dateString:String): Date = {
val dateTry: Try[Date] = Try(java.sql.Date.valueOf(LocalDate.parse(dateString, formatter)))
dateTry.toOption.getOrElse(null)
}
val timeConversionUdf = udf(toTimeStamp2 _)
// example DF and new DF
val df = Seq(("05/08/20"), ("2020-04-03"), ("unparseable")).toDF("ETD")
df.select(timeConversionUdf(col("ETD"))).toDF("ETD2").show
Output:
+----------+
| ETD2|
+----------+
|2020-05-08|
|2020-04-03|
| null|
+----------+
Note that unparseable values end up null, as shown.
Try withColumn(...) with the same column name and coalesce, as below:
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD", coalesce(dt_formats.map(fmt => to_date($"ETD", fmt)):_*))

java.lang.RuntimeException: Unsupported literal type class org.joda.time.DateTime

I'm working on a project where I use a library that is quite new to me, although I have used it in other projects without any problems:
org.joda.time.DateTime
I work with Scala and run the project as a job on Databricks.
scalaVersion := "2.11.12"
The code the exception comes from, according to my investigation so far, is the following:
var lastEndTime = config.getState("some parameters")
val timespanStart: Long = lastEndTime // last query ending time
var timespanEnd: Long = (System.currentTimeMillis / 1000) - (60*840) // 14 hours ago
val start = new DateTime(timespanStart * 1000)
val end = new DateTime(timespanEnd * 1000)
val date = DateTime.now()
The getState() function returns 1483228800 as a Long value.
EDIT: I use the start and end dates for filtering while building a dataframe; I compare timestamp columns against these values!
val df2= df
.where(col("column_name").isNotNull)
.where(col("column_name") > start &&
col("column_name") <= end)
The error I get:
ERROR Uncaught throwable from user code: java.lang.RuntimeException:
Unsupported literal type class org.joda.time.DateTime
2017-01-01T00:00:00.000Z
I am not sure I actually understand how and why this is an error, so every kind of help is more than welcome!! Thank you a lot in advance!!
This is a common problem when people start working with Spark SQL. Spark SQL has its own types, and you need to work with them if you want to take advantage of the DataFrame API. In your example, you cannot compare a DataFrame column (referenced via col) with a DateTime object directly unless you use a UDF.
If you want to make your comparison using Spark SQL functions, you can take a look at this post, which covers the differences between Dates and Timestamps with Spark DataFrames.
If you (for any reason) need to use Joda you will inevitably need to build your UDF:
import org.apache.spark.sql.DataFrame
import org.joda.time.DateTime
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}
object JodaFormater {
  val formatter: DateTimeFormatter = DateTimeFormat.forPattern("dd/MM/yyyy HH:mm:ss")
}

object testJoda {

  import org.apache.spark.sql.functions.{udf, col}
  import JodaFormater._

  def your_joda_compare_udf = (start: DateTime) => (end: DateTime) => udf { str: String =>
    val dt: DateTime = formatter.parseDateTime(str)
    dt.isAfter(start.getMillis) && dt.isBefore(end.getMillis)
  }

  def main(args: Array[String]): Unit = {
    val start: DateTime = ???
    val end: DateTime = ???
    // Your dataframe with your date as StringType
    val df: DataFrame = ???
    df.where(your_joda_compare_udf(start)(end)(col("your_date")))
  }
}
Note that using this implementation implies some overhead (memory and GC) because of the conversion from StringType to a Joda DateTime object, so you should use the Spark SQL functions whenever you can. Some posts say that UDFs are black boxes because Spark cannot optimize their execution, but sometimes they help.
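For comparison, here is a minimal UDF-free sketch (not from the original answer) that assumes column_name is already a timestamp column: convert the Joda millis into java.sql.Timestamp literals and let Spark compare them natively.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit}

// start and end are the Joda DateTime values from the question
val startTs = new Timestamp(start.getMillis)
val endTs   = new Timestamp(end.getMillis)

val df2 = df
  .where(col("column_name").isNotNull)
  .where(col("column_name") > lit(startTs) && col("column_name") <= lit(endTs))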

Extract date from unix_timestamp which is in string format Apache Spark? [duplicate]

I have a data frame with a column of unix timestamps (e.g. 1435655706000), and I want to convert it to a date with format 'yyyy-MM-dd'. I've tried nscala-time but it doesn't work.
val time_col = sqlc.sql("select ts from mr").map(_(0).toString.toDateTime)
time_col.collect().foreach(println)
and I got error:
java.lang.IllegalArgumentException: Invalid format: "1435655706000" is malformed at "6000"
Here it is using Scala DataFrame functions: from_unixtime and to_date
// NOTE: divide by 1000 required if milliseconds
// e.g. 1446846655609 -> 2015-11-06 21:50:55 -> 2015-11-06
mr.select(to_date(from_unixtime($"ts" / 1000)))
Since Spark 1.5, there is a built-in function for doing that.
val df = sqlContext.sql("select from_unixtime(ts,'yyyy-MM-dd') as `ts` from mr")
Please check Spark 1.5.2 API Doc for more info.
You need to import the following libraries.
import org.joda.time.{DateTime, DateTimeZone}
import org.joda.time.format.DateTimeFormat
val stri = new DateTime(timeInMillisec).toString("yyyy/MM/dd")
Or adjusting to your case :
val time_col = sqlContext.sql("select ts from mr")
  .map(line => new DateTime(line(0).toString.toLong).toString("yyyy/MM/dd"))
There could be another way :
import com.github.nscala_time.time.Imports._
val date = (new DateTime() + ((threshold.toDouble)/1000).toInt.seconds )
.toString("yyyy/MM/dd")
Hope this helps :)
You needn't convert to String before applying toDateTime with nscala_time
import com.github.nscala_time.time.Imports._
scala> 1435655706000L.toDateTime
res4: org.joda.time.DateTime = 2015-06-30T09:15:06.000Z
I have solved this issue using the joda-time library by mapping on the DataFrame and converting the DateTime into a String :
import org.joda.time._
val time_col = sqlContext.sql("select ts from mr")
.map(line => new DateTime(line(0)).toString("yyyy-MM-dd"))
You can use the following syntax in Java
input.select("timestamp)
.withColumn("date", date_format(col("timestamp").$div(1000).cast(DataTypes.TimestampType), "yyyyMMdd").cast(DataTypes.IntegerType))
What you can do is:
input.withColumn("time", concat(from_unixtime(input.col("COL_WITH_UNIX_TIME")/1000,
"yyyy-MM-dd'T'HH:mm:ss"), typedLit("."), substring(input.col("COL_WITH_UNIX_TIME"), 11, 3),
typedLit("Z")))
where time is the new column name and COL_WITH_UNIX_TIME is the name of the column you want to convert. This keeps the milliseconds, making your data more accurate, e.g. "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'".

Cannot access Spark dataframe methods

In Zeppelin I am using a dataframe created in another paragraph. I display the type of my df variable and get:
res35: String = DataFrame
suggesting it is a dataframe. But when I try and use select on the df variable I get an error:
<console>:62: error: value select is not a member of Object
Do I have to convert Object to Dataframe or something? Can someone tell me what I am missing? TIA!
My code is:
val df = z.get("wds")
df.getClass.getSimpleName
df.select(explode($"filtered").as("value")).groupBy("value").count.show
This gives the following (edited) output:
df: Object = [racist: boolean, contributors:
string, coordinates: string, ...n: Int = 20
res35: String = DataFrame
<console>:62: error: value select is not a member of Object
df.select(explode($"filtered").as("value")).groupBy("value").count.show
Seems I was missing
.asInstanceOf[DataFrame]
i.e.
import org.apache.spark.sql.DataFrame
val df = z.get("wds").asInstanceOf[DataFrame]
