spark rdd time stamp conversion - scala

I have a text file. In that I have 2 fields start-time and end-time. I want to find the difference between these 2 times.
name,id,starttime,endtime,loc
xxx,123,2017-10-23T07:13:45.567+5:30,2017-10-23T07:17:40.567+5:30,zzz
xya,134,2017-10-23T14:17:25.567+5:30,2017-10-23T15:13:45.567+5:30,yyy
I have loaded this file into rdd.
val rdd1=sparkcontext.textFile("/user/root/file1.txt")
case class xyz(name:String,id:Int,starttime:String,endtime:String,loc:String)
val rdd2=rdd1.map{x =>
val w=x.split(',')
xyz(w(0),w(1).toInt,w(2),w(3),w(4))
}
How do I find the timestamp difference between starttime (w(2)) and endtime (w(3)) using an RDD?

I would suggest using a Dataset rather than an RDD: you can still use the case class, Datasets are better optimized than RDDs, and they offer many more built-in functions.
Assuming that you have a text file with following data without header
xxx,123,2017-10-23T07:13:45.567+5:30,2017-10-23T07:17:40.567+5:30,zzz
xya,134,2017-10-23T14:17:25.567+5:30,2017-10-23T15:13:45.567+5:30,yyy
And a case class as
case class xyz(name:String,id:Int,starttime:String,endtime:String,loc:String)
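The first step would be to convert the text file to a Dataset (toDS() needs the session implicits in scope, e.g. import spark.implicits._ if your SparkSession is named spark)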
val rdd1=sparkcontext.textFile("/user/root/file1.txt")
val dataSet = rdd1
.map(x => x.split(','))
.map(w => xyz(w(0), w(1).toInt, w(2).replace("T", " ").substring(0, w(2).indexOf(".")), w(3).replace("T", " ").substring(0, w(3).indexOf(".")), w(4)))
.toDS()
If you do dataSet.show(false) then you should get the dataset
+----+---+-------------------+-------------------+---+
|name|id |starttime          |endtime            |loc|
+----+---+-------------------+-------------------+---+
|xxx |123|2017-10-23 07:13:45|2017-10-23 07:17:40|zzz|
|xya |134|2017-10-23 14:17:25|2017-10-23 15:13:45|yyy|
+----+---+-------------------+-------------------+---+
Now you can just call the unix_timestamp function to find the difference in seconds
import org.apache.spark.sql.functions._
dataSet.withColumn("difference", unix_timestamp($"endtime") - unix_timestamp($"starttime")).show(false)
which should result in
+----+---+-------------------+-------------------+---+----------+
|name|id |starttime          |endtime            |loc|difference|
+----+---+-------------------+-------------------+---+----------+
|xxx |123|2017-10-23 07:13:45|2017-10-23 07:17:40|zzz|235       |
|xya |134|2017-10-23 14:17:25|2017-10-23 15:13:45|yyy|3380      |
+----+---+-------------------+-------------------+---+----------+
I hope the answer is helpful

You will have to convert the date string to a parseable format, i.e. from 2017-10-23T07:13:45.567+5:30 to 2017-10-23 07:13:45. You can then use SimpleDateFormat to convert each date to a Long (milliseconds since the epoch) so that arithmetic can be done on them.
Concisely, you can do something like the below
import java.text.SimpleDateFormat
val rdd1=sparkcontext.textFile("/user/root/file1.txt")
// keep only "yyyy-MM-dd HH:mm:ss": swap the "T" for a space and cut at the fractional seconds
val rdd2=rdd1
.map(x => x.split(','))
.map(w => (w(2).replace("T", " ").substring(0, w(2).indexOf(".")), w(3).replace("T", " ").substring(0, w(3).indexOf("."))))
val difference = rdd2.map(tuple => {
val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val startDate = format.parse(tuple._1).getTime
val endDate = format.parse(tuple._2).getTime
endDate - startDate // difference in milliseconds
})
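As a variation on the snippet above, here is a minimal sketch that uses java.time instead of SimpleDateFormat. The helper names parseLocal and secondsBetween are made up for illustration, and the sketch assumes both timestamps carry the same positive offset (as in the sample data), so dropping the offset does not change the difference.

```scala
import java.time.{Duration, LocalDateTime}

// Hypothetical helper: keep only the local date-time part of the string.
// "2017-10-23T07:13:45.567+5:30" -> 2017-10-23T07:13:45.567
def parseLocal(ts: String): LocalDateTime = {
  val plus = ts.indexOf('+')
  LocalDateTime.parse(if (plus >= 0) ts.substring(0, plus) else ts)
}

// Difference in whole seconds between two timestamps that share the same offset.
def secondsBetween(start: String, end: String): Long =
  Duration.between(parseLocal(start), parseLocal(end)).getSeconds

val diff = secondsBetween("2017-10-23T07:13:45.567+5:30", "2017-10-23T07:17:40.567+5:30")
println(diff) // 235
```

Inside the RDD this could be applied as rdd1.map(_.split(',')).map(w => secondsBetween(w(2), w(3))).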
I hope the answer is helpful

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" has a large number of records, and hence the BQ call is invoked again and again. Keeping the filter outside of the BQ read would help. The idea is to first read the "mainTable" from BQ, store it in a Spark view, and then apply the "wherePart" filter to this view in Spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)
def getFb(config: DataFrame, mainTable: String, ds: String) : DataFrame = {
val fb = config.map(row => Target.Pfb(
row.getAs[String]("m1"),
row.getAs[String]("m2"),
row.getAs[Seq[Int]]("days")))
.collect
val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
mkString(" OR ")
val q = new Q()
val tempView = "tempView"
spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
val Df = q.mainTableLogs(tempView)
Df
}
Could someone please help me here.
Are you using the spark-bigquery-connector? If so the right syntax is
spark.read.format("bigquery")
.load(mainTable)
.where(wherePart)
.createOrReplaceTempView(tempView)
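For readers following the question, here is a minimal sketch of how the wherePart string is assembled before it is handed to .where(...), which accepts a SQL expression string. Pfb is a stand-in for the asker's Target.Pfb, buildWherePart is an illustrative name, and subDate is an assumption: it is taken here to return the date n days before ds in yyyy-MM-dd form, which is how the snippet uses it.

```scala
import java.time.LocalDate

// Stand-in for the asker's Target.Pfb; field names are taken from the snippet.
case class Pfb(m1: String, m2: String, days: Seq[Int])

// Assumption: subDate(ds, n) returns the date n days before ds (yyyy-MM-dd).
def subDate(ds: String, n: Int): String =
  LocalDate.parse(ds).minusDays(n.toLong).toString

// Builds one "(... AND ds BETWEEN ... AND ...)" clause per config row, joined with OR.
def buildWherePart(fb: Seq[Pfb], ds: String): String =
  fb.map(x => s"(idata_${x.m1} = '${x.m2}' AND ds BETWEEN '${subDate(ds, x.days.max - 1)}' AND '$ds')")
    .mkString(" OR ")

val wherePart = buildWherePart(Seq(Pfb("a", "b", Seq(1, 3)), Pfb("c", "d", Seq(7))), "2020-01-10")
println(wherePart)
```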

Spark : Parse a Date / Timestamps with different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm ) in same column of a Dataframe

The problem is: I have a dataset where one column contains two or more date formats.
In general I select all values as String type and then use to_date to parse the date.
But I don't know how to parse a column that has two or more date formats.
val DF= Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))
Output from the above command:
null
null
null
0019-04-13
0019-04-14
0019-04-15
0019-04-16
The code above does not work: rows in the MM/dd/yy format are parsed with the wrong year, and rows in the format I did not provide come back as null.
So I am seeking help to parse a column with different date formats.
If possible, kindly also share some tutorial or notes on dealing with date formats.
Please note: I am using Scala for the spark framework.
Thanks in advance.
Check the EDIT section in the later part of this solution to use Column functions instead of a UDF for performance benefits.
Well, let's do it the try-catch way: try a column conversion against each format and keep the value that succeeds.
You may have to provide all possible formats from outside as a parameter, or keep a master list of all possible formats somewhere in the code itself.
Here is a possible solution. (Instead of SimpleDateFormat, which sometimes has issues with timestamps beyond milliseconds, I use the newer java.time.format.DateTimeFormatter library.)
Create a toTimestamp function, which accepts the string to convert to a timestamp and all possible formats
import java.time.LocalDate
import java.time.LocalDateTime
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeFormatterBuilder
import scala.util.Try
def toTimestamp(date: String, tsformats: Seq[String]): Option[java.sql.Timestamp] = {
val out = (for (tsft <- tsformats) yield {
val formatter = new DateTimeFormatterBuilder()
.parseCaseInsensitive()
.appendPattern(tsft).toFormatter()
if (Try(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter))).isSuccess)
Option(java.sql.Timestamp.valueOf(LocalDateTime.parse(date, formatter)))
else None
}).filter(_.isDefined)
if (out.isEmpty) None else out.head
}
Create a UDF on top of it - ( this udf takes Seq of Format strings as parameter)
def UtoTimestamp(tsformats: Seq[String]) = org.apache.spark.sql.functions.udf((date: String) => toTimestamp(date, tsformats))
And now, simply use it in your spark code.. Here's the test with your Data -
val DF = Seq(("02-04-2020 08:02"), ("03-04-2020 10:02"), ("04-04-2020 09:00"), ("04/13/19 9:12"), ("04/14/19 2:13"), ("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
val tsformats = Seq("MM-dd-yyyy HH:mm", "MM/dd/yy H:mm")
DF.select(UtoTimestamp(tsformats)('DOB)).show
And here is the output -
+-------------------+
|           UDF(DOB)|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
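The try-each-format idea above can also be written so that each string is parsed at most once per format and the search stops at the first success. This is only a sketch: firstMatch is an illustrative name, and unlike the answer's version it uses DateTimeFormatter.ofPattern directly, without parseCaseInsensitive.

```scala
import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.{Success, Try}

// Tries the formats in order and returns the first successful parse, if any.
// The iterator is lazy, so later formats are not tried once one has matched.
def firstMatch(date: String, formats: Seq[String]): Option[Timestamp] =
  formats.iterator
    .map(fmt => Try(LocalDateTime.parse(date, DateTimeFormatter.ofPattern(fmt))))
    .collectFirst { case Success(ldt) => Timestamp.valueOf(ldt) }

val formats = Seq("MM-dd-yyyy HH:mm", "MM/dd/yy H:mm")
println(firstMatch("02-04-2020 08:02", formats)) // Some(2020-02-04 08:02:00.0)
println(firstMatch("04/13/19 9:12", formats))    // Some(2019-04-13 09:12:00.0)
println(firstMatch("not a date", formats))       // None
```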
The cherry on top would be to avoid having to write UtoTimestamp(colname) for many columns in your dataframe.
Let's write a function which accepts a DataFrame, a list of all timestamp columns, and all possible formats in which your source data may have coded timestamps.
It parses all timestamp columns for you, trying each format in turn.
def WithTimestampParsed(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {
val colSelector = df.columns.map {
c =>
{
if (tsCols.contains(c)) UtoTimestamp(tsformats)(col(c)) alias (c)
else col(c)
}
}
df.select(colSelector: _*)
}
Use it like this -
// You can pass as many column names in a sequence to be parsed
WithTimestampParsed(DF, Seq("DOB"), tsformats).show
Output -
+-------------------+
|                DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
EDIT -
I saw the latest Spark code, and it now also uses java.time._ utilities to parse dates and timestamps, which enables handling beyond milliseconds. Earlier these functions were based on SimpleDateFormat (I wasn't relying on Spark's to_timestamp earlier due to this limit).
So with the to_date & to_timestamp functions being so reliable now, let's use them instead of having to write a UDF. Let's write a function which operates on Columns.
def to_timestamp_simple(col: org.apache.spark.sql.Column, formats: Seq[String]): org.apache.spark.sql.Column = {
coalesce(formats.map(fmt => to_timestamp(col, fmt)): _*)
}
and with this, WithTimestampParsed would look like -
def WithTimestampParsedSimple(df: DataFrame, tsCols: Seq[String], tsformats: Seq[String]): DataFrame = {
val colSelector = df.columns.map {
c =>
{
if (tsCols.contains(c)) to_timestamp_simple(col(c), tsformats) alias (c)
else col(c)
}
}
df.select(colSelector: _*)
}
And use it like -
DF.select(to_timestamp_simple('DOB,tsformats)).show
//OR
WithTimestampParsedSimple(DF, Seq("DOB"), tsformats).show
Output looks like -
+---------------------------------------------------------------------------------------+
|coalesce(to_timestamp(`DOB`, 'MM-dd-yyyy HH:mm'), to_timestamp(`DOB`, 'MM/dd/yy H:mm'))|
+---------------------------------------------------------------------------------------+
|                                                                    2020-02-04 08:02:00|
|                                                                    2020-03-04 10:02:00|
|                                                                    2020-04-04 09:00:00|
|                                                                    2019-04-13 09:12:00|
|                                                                    2019-04-14 02:13:00|
|                                                                    2019-04-15 10:14:00|
|                                                                    2019-04-16 05:15:00|
+---------------------------------------------------------------------------------------+
+-------------------+
|                DOB|
+-------------------+
|2020-02-04 08:02:00|
|2020-03-04 10:02:00|
|2020-04-04 09:00:00|
|2019-04-13 09:12:00|
|2019-04-14 02:13:00|
|2019-04-15 10:14:00|
|2019-04-16 05:15:00|
+-------------------+
Here is some code that may help you in some way.
I tried this
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import java.sql.Date
import java.util.{GregorianCalendar}
object DateFormats {
val spark = SparkSession
.builder()
.appName("Multiline")
.master("local[*]")
.config("spark.sql.shuffle.partitions", "4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id", "Multiline") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
def main(args: Array[String]): Unit = {
Logger.getRootLogger.setLevel(Level.ERROR)
try {
import spark.implicits._
val DF = Seq(("02-04-2020 08:02"),("03-04-2020 10:02"),("04-04-2020 09:00"),("04/13/19 9:12"),("04/14/19 2:13"),("04/15/19 10:14"), ("04/16/19 5:15")).toDF("DOB")
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val DOBDF = DF.withColumn("Date", to_date($"DOB", "MM/dd/yyyy"))
DOBDF.show()
// todo: my code below
DF
.rdd
.map(r =>{
if(r.toString.contains("-")) {
val dat = r.toString.substring(1,11).split("-")
val calendar = new GregorianCalendar(dat(2).toInt,dat(1).toInt - 1,dat(0).toInt)
(r.toString, new Date(calendar.getTimeInMillis))
} else {
val dat = r.toString.substring(1,9).split("/")
val calendar = new GregorianCalendar(dat(2).toInt + 2000,dat(0).toInt - 1,dat(1).toInt)
(r.toString, new Date(calendar.getTimeInMillis))
}
})
.toDF("DOB","DATE")
.show()
// To have the opportunity to view the web console of Spark: http://localhost:4040/
println("Type whatever to the console to exit......")
scala.io.StdIn.readLine()
} finally {
sc.stop()
println("SparkContext stopped.")
spark.stop()
println("SparkSession stopped.")
}
}
}
+------------------+----------+
|               DOB|      DATE|
+------------------+----------+
|[02-04-2020 08:02]|2020-04-02|
|[03-04-2020 10:02]|2020-04-03|
|[04-04-2020 09:00]|2020-04-04|
|   [04/13/19 9:12]|2019-04-13|
|   [04/14/19 2:13]|2019-04-14|
|  [04/15/19 10:14]|2019-04-15|
|   [04/16/19 5:15]|2019-04-16|
+------------------+----------+
Regards
We can use the coalesce function as mentioned in the accepted answer. On each format mismatch, to_date returns null, which makes coalesce move on to the next format in the list.
But with to_date, if you have issues parsing the correct year component of a date in yy format (e.g. in the date 7-Apr-50, whether 50 should be parsed as 1950 or 2050), refer to this stackoverflow post
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, to_date}
// Reference: https://spark.apache.org/docs/3.0.0/sql-ref-datetime-pattern.html
val parsedDateCol: Column = coalesce(
// Four letters of M looks for full name of the Month
to_date(col("original_date"), "MMMM, yyyy"),
to_date(col("original_date"), "dd-MMM-yy"),
to_date(col("original_date"), "yyyy-MM-dd"),
to_date(col("original_date"), "d-MMM-yy")
)
// I have used some dummy dataframe name.
dataframeWithDateCol.select(
parsedDateCol.as("parsed_date")
)
.show()
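On the two-digit-year ambiguity mentioned above, here is a minimal sketch of one way to control how yy is resolved, using plain java.time rather than Spark's to_date (so inside Spark it would have to be wrapped in a UDF). It is not taken from the post referred to above; twoDigitYearFormatter and baseYear are illustrative names.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.time.temporal.ChronoField
import java.util.Locale

// Builds a "d-MMM-yy" style formatter whose two-digit year is resolved
// into the 100-year window starting at baseYear.
def twoDigitYearFormatter(baseYear: Int): DateTimeFormatter =
  new DateTimeFormatterBuilder()
    .appendPattern("d-MMM-")
    .appendValueReduced(ChronoField.YEAR, 2, 2, baseYear)
    .toFormatter(Locale.ENGLISH)

val as1950 = LocalDate.parse("7-Apr-50", twoDigitYearFormatter(1900))
val as2050 = LocalDate.parse("7-Apr-50", twoDigitYearFormatter(2000))
println(as1950) // 1950-04-07
println(as2050) // 2050-04-07
```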

TimestampType Difference and reset hour in Scala

I have following two columns in
import org.apache.spark.sql.types.{TimestampType, ArrayType}
statusWithOutDuplication.withColumn("requestTime", unix_timestamp( col("requestTime"), "yyyy-MM-dd HH:mm:ss").cast("Timestamp"))
statusWithOutDuplication.withColumn("responseTime", unix_timestamp( col("responseTime"), "yyyy-MM-dd HH:mm:ss").cast("Timestamp"))
I want to pass requestTime & responseTime into the following UDF and find the difference after
setting Minute and Seconds into "0"
val split_hour_range_udf = udf { (startDateTime: TimestampType ,
endDateTime: TimestampType ) =>
}
In Python we have "replace" (startDateTime.replace(second=0,minute=0)); what is the equivalent in Scala?
You can create a UDF as below: send the values as strings and convert them to Timestamp inside the UDF.
val timeDiff = udf((start: String , end : String) => {
//convert to timestamp and find the difference
})
and use it as
df.withColumn("responseTime", timeDiff($"requestTime", $"responseTime"))
Rather than using a UDF, you can use built-in Spark functions such as datediff.
You can do this:
import org.apache.spark.sql.functions.{col, minute, second, unix_timestamp}
// withColumn returns a new DataFrame, so keep the result; both columns become Unix time in seconds
val withUnixTimes = statusWithOutDuplication
.withColumn("requestTime", unix_timestamp( col("requestTime"), "yyyy-MM-dd HH:mm:ss"))
.withColumn("responseTime", unix_timestamp( col("responseTime"), "yyyy-MM-dd HH:mm:ss"))
//This resets minute and second to 0
def resetMinSec(colName: String) = {
col(colName) - minute(col(colName).cast("Timestamp"))*60 - second(col(colName).cast("Timestamp"))
}
//create a new column with the difference between unixtimes
withUnixTimes.select((resetMinSec("responseTime") - resetMinSec("requestTime")).as("diff"))
Note that I didn't cast requestTime/responseTime to "Timestamp"; you should cast after finding the difference.
The udf approach should be similar but using some scala methods to obtain minutes/seconds from a Timestamp.
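A minimal sketch of what those Scala methods could look like, assuming the values reach the function as java.sql.Timestamp. The closest equivalent of Python's replace(minute=0, second=0) here is truncatedTo(ChronoUnit.HOURS) from java.time; truncateToHour and hourDiff are illustrative names.

```scala
import java.sql.Timestamp
import java.time.Duration
import java.time.temporal.ChronoUnit

// Resets minutes, seconds and fractional seconds to zero.
def truncateToHour(ts: Timestamp): Timestamp =
  Timestamp.valueOf(ts.toLocalDateTime.truncatedTo(ChronoUnit.HOURS))

// Whole hours between the two timestamps after truncating both to the hour.
def hourDiff(start: Timestamp, end: Timestamp): Long =
  Duration.between(
    truncateToHour(start).toLocalDateTime,
    truncateToHour(end).toLocalDateTime
  ).toHours

val start = Timestamp.valueOf("2017-10-23 07:13:45")
val end = Timestamp.valueOf("2017-10-23 09:02:10")
println(truncateToHour(start)) // 2017-10-23 07:00:00.0
println(hourDiff(start, end))  // 2
```

Such a function could then be wrapped with udf((s: Timestamp, e: Timestamp) => hourDiff(s, e)) and applied to the two timestamp columns.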
Hope this helps a little!

Scala Spark Filter RDD using Cassandra

I am new to spark-Cassandra and Scala. I have an existing RDD. let say:
((url_hash, url, created_timestamp )).
I want to filter this RDD based on url_hash. If url_hash exists in the Cassandra table then I want to filter it out from the RDD so I can do processing only on the new urls.
Cassandra Table looks like following:
url_hash| url | created_timestamp | updated_timestamp
Any pointers will be great.
I tried something like this:
import java.util.Date
case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
def timestamp = new java.util.Date()
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256,(row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)
I am getting cassandra error
java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info. If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper
There are no null values in cassandra table
Thanks The Archetypal Paul!
I hope somebody finds this useful. Had to add Option to case class.
Looking forward to better solutions
import java.util.Date
// Option lets the connector map null column values instead of throwing
case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])
def timestamp = new java.util.Date()
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256,(row.full_url, row.created_ts)))
val newUrlsRDD = rdd1.subtractByKey(rdd3)

Working with Dates in Spark

I have a requirement where a CSV file is to be parsed, the records between specific dates identified, and the total and average sales found for each salesperson per ProductCategory in that duration. Below is the CSV file structure:
SalesPersonId,SalesPersonName,SaleDate,SaleAmount,ProductCategory
Please help in resolving this query. Looking for solution in Scala
What I tried:
Used the SimpleDateFormat as mentioned below:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
and created an RDD with the below piece of code:
val onlyHouseLoan = readFile.map(line => (line.split(",")(0), line.split(",")(2), line.split(",")(3).toLong, format.parse(line.split(",")(4).toString())))
However, I tried using Calendar on top of the highlighted expression but I am getting a NumberFormatException.
So by just creating a quick rdd in the format of the csv-file you describe
val list = sc.parallelize(List(("1","Timothy","04/02/2015","100","TV"), ("1","Timothy","04/03/2015","10","Book"), ("1","Timothy","04/03/2015","20","Book"), ("1","Timothy","04/05/2015","10","Book"),("2","Ursula","04/02/2015","100","TV")))
And then running
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val startDate = LocalDate.of(2015,1,4)
val endDate = LocalDate.of(2015,4,5)
val result = list
.filter{case(_,_,date,_,_) => {
val localDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("MM/dd/yyyy"))
localDate.isAfter(startDate) && localDate.isBefore(endDate)}}
.map{case(id, _, _, amount, category) => ((id, category), (amount.toDouble, 1))}
.reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
.map{case((id, category),(total, sales)) => (id, List((category, total, total/sales)))}
.reduceByKey(_ ++ _)
will give you
(1,List((Book,30.0,15.0), (TV,100.0,100.0)))
(2,List((TV,100.0,100.0)))
in the format of (SalesPersonId, [(ProductCategory, TotalSaleAmount, AvgSaleAmount)]). Is that what you are looking for?
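Note that isAfter and isBefore are both exclusive, which is why the 04/05/2015 sale does not appear in the result above. If the boundary dates should be included, the filter predicate could be sketched as follows; inRangeInclusive is an illustrative name, not part of the answer's code.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

// True when start <= date <= end, with both boundary dates included.
def inRangeInclusive(date: String, start: LocalDate, end: LocalDate): Boolean = {
  val d = LocalDate.parse(date, fmt)
  !d.isBefore(start) && !d.isAfter(end)
}

val startDate = LocalDate.of(2015, 1, 4)
val endDate = LocalDate.of(2015, 4, 5)
println(inRangeInclusive("04/05/2015", startDate, endDate)) // true
println(inRangeInclusive("04/06/2015", startDate, endDate)) // false
```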