How to replace epoch column in DataFrame with date in scala - scala

I am writing a spark application which receives an avro record. I am converting that avro record into Spark DataFrame (df) object.
The df contains a time stamp attribute which is in seconds. (Epoch time)
I want to replace the seconds column with the date column.
How to do that?
My code snippet is :
val df = sqlContext.read.avro("/root/Work/PixelReporting/input_data/pixel.avro")
val pixelGeoOutput = df.groupBy("current_time", "pixel_id", "geo_id", "operation_type", "is_piggyback").count()
pixelGeoOutput.write.json("/tmp/pixelGeo")
"current_time" is in seconds right now. I want to convert it into date.

Since Spark 1.5, there's a built in sql.function called from_unixtime, you can do:
val df = Seq(Tuple1(1462267668L)).toDF("epoch")
df.withColumn("date", from_unixtime(col("epoch")))

Thanks guys,
I used withColumn method to solve my problem.
Code snippet is :
val newdf = df.withColumn("date", epochToDateUDF(df("current_time")))
def epochToDateUDF = udf((current_time : Long) =>{
DateTimeFormat.forPattern("YYYY-MM-dd").print(current_time *1000)
})

This should give you an idea:
import java.util.Date
val df = sc.parallelize(List(1462267668L, 1462267672L, 1462267678L)).toDF("current_time")
val dfWithDates = df.map(row => new Date(row.getLong(0) * 1000))
dfWithDates.collect()
Output:
Array[java.util.Date] = Array(Tue May 03 11:27:48 CEST 2016, Tue May 03 11:27:52 CEST 2016, Tue May 03 11:27:58 CEST 2016)
You might also try this in a UDF and using withColumn to just replace that single column.

Related

Convert date to another format Scala Spark

I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type but I can't find a good solution. I am trying this:
val pr_date = readeve.withColumn("Date", when(to_date(col("Date"),"dd-MMM-yyyy hh:mm:ss").isNotNull,
to_date(col("Date"),"dd/MM/yyyy hh:mm")))
pr_date.show(25)
And I get the entire Date column as null values:
I am trying with this function:
def to_date_(col: Column,
formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date",to_date_(readeve.col(("Date")))).show(125)
And in the first type of date i get nulls too:
What am I doing wrong? (new with Scala Spark)
Scala version: 2.11.7
Spark version: 2.4.3
Try code below? Note that 17 is HH, not hh. Also try to_timestamp instead of to_date because you want to keep the time.
val pr_date = readeve.withColumn(
"Date",
coalesce(
date_format(to_timestamp(col("Date"),"dd-MMM-yyyy HH:mm:ss"),"dd/MM/yyyy HH:mm"),
date_format(to_timestamp(col("Date"),"dd/MM/yyyy HH:mm"),"dd/MM/yyyy HH:mm")
)
)

Convert timestamp column from UTC to EST in spark scala

I have a column in spark dataframe of timestamp type with date format like '2019-06-13T11:39:10.244Z'
My goal is to convert this column into EST time(subtracting 4 hours) keeping the same format.
I tried it using from_utc_timestamp api but it seems it is converting the UTC time to my local timezone (+5:30) and adding it to the timestamp then subtracting 4 hours from it. I tried to use Joda time but for some reason it is adding 33 days to the EST time
innput = 2019-06-13T11:39:10.244Z
using from_utc_timestamp api:
val tDf = df.withColumn("newTimeCol", to_utc_timestamp(col("timeCol"), "America/New_York"))
output = 2019-06-13T13:09:10.244Z+5:30
using Joda time package:
val coder : (String => String) = (arg: String) => {
new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-mm-dd'T'HH:mm:s.SS'Z'")}
val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))
output = 2019-39-13T07:39:10.244Z
desired output = 2019-06-13T07:39:10.244Z
Kindly advise how should I proceed. Thanks in advance
There is a typo in your format string when creating the output.
Your format string should be yyyy-MM-dd'T'HH:mm:s.SS'Z' but it is yyyy-mm-dd'T'HH:mm:s.SS'Z'.
mm is the format char for minutes while MM is the format char for the months. You can check all format chars here.

How to group by on epoch timestame field in Scala spark

I want to group by the records by date. but the date is in epoch timestamp in millisec.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
**date Count of records**
02-10-2017 4
04-10-2017 3
03-10-2017 5
Here is the code which I tried to group by,
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing code I am getting below exception.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
you would need to change the date column, which seems to be in long, to date data type. This can be done by using from_unixtime built-in function. And then its just a groupBy and agg function calls and use count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Above answer is using udf function which should be avoided as much as possible, since udf is a black box and requires serialization and deserialisation of columns.
Updated
Thanks to #philantrovert for his suggestion to divide by 1000
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.

Change column value in a dataframe spark scala

This is how my dataframe looks like at the moment
+------------+
| DATE |
+------------+
| 19931001|
| 19930404|
| 19930603|
| 19930805|
+------------+
I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string not a date type or time stamp.
How would I do that using the withColumn method ?
Here is the solution using UDF and withcolumn, I have assumed that you have a string date field in Dataframe
//Create dfList dataframe
val dfList = spark.sparkContext
.parallelize(Seq("19931001","19930404", "19930603", "19930805")).toDF("DATE")
dfList.withColumn("DATE", dateToTimeStamp($"DATE")).show()
val dateToTimeStamp = udf((date: String) => {
val stringDate = date.substring(0,4)+"/"+date.substring(4,6)+"/"+date.substring(6,8)
val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
format.format(new SimpleDateFormat("yyy/MM/dd").parse(stringDate))
})
withClumn("date",
from_unixtime(unix_timestamp($"date", "yyyyMMdd"), "yyyy-MM-dd hh:mm:ss.fff") as "date")
this should work.
Another notice is the that mm gives minutes and MM gives months, hope this help you.
First, I created this DF:
val df = sc.parallelize(Seq("19931001","19930404","19930603","19930805")).toDF("DATE")
For date management we are going to use joda time Library (don't forget to join the joda-time.jar file)
import org.joda.time.format.DateTimeFormat
import org.joda.time.format.DateTimeFormatter
def func(s:String):String={
val dateFormat = DateTimeFormat.forPattern("yyyymmdd");
val resultDate = dateFormat.parseDateTime(s);
return resultDate.toString();
}
Finally, apply the function to dataframe:
val temp = df.map(l => func(l.get(0).toString()))
val df2 = temp.toDF("DATE")
df2.show()
This answer still needs some work, me myself is new to spark, but it is getting the job done, I think!

How to map a RDD to another RDD with Scala, in Spark?

I have a RDD:
RDD1 = (big,data), (apache,spark), (scala,language) ...
and I need to map that with the time stamp
RDD2 = ('2015-01-01 13.00.00')
so that I get
RDD3 = (big, data, 2015-01-01 13.00.00), (apache, spark, 2015-01-01 13.00.00), (scala, language, 2015-01-01 13.00.00)
I wrote a simple map function for this:
RDD3 = RDD1.map(rdd => (rdd, RDD2))
but it is not working, and I think it is not the way to go.
How to do it? I am new to Scala and Spark. Thank you.
You can use zip:
val rdd1 = sc.parallelize(("big","data") :: ("apache","spark") :: ("scala","language") :: Nil)
// RDD[(String, String)]
val rdd2 = sc.parallelize(List.fill(3)(new java.util.Date().toString))
// RDD[String]
rdd1.zip(rdd2).map{ case ((a,b),c) => (a,b,c) }.collect()
// Array((big,data,Fri Jul 24 22:25:01 CEST 2015), (apache,spark,Fri Jul 24 22:25:01 CEST 2015), (scala,language,Fri Jul 24 22:25:01 CEST 2015))
If you want the same time stamp with every element of rdd1 :
val now = new java.util.Date().toString
rdd1.map{ case (a,b) => (a,b,now) }.collect()