Working with Dates in Spark

I have a requirement to parse a CSV file, identify the records that fall between two specific dates, and find the total and average sales for each salesperson per ProductCategory in that period. Below is the CSV file structure:
SalesPersonId,SalesPersonName,SaleDate,SaleAmount,ProductCategory
Please help me resolve this. I am looking for a solution in Scala.
What I tried:
I used SimpleDateFormat as shown below:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
and created an RDD with the below piece of code:
val onlyHouseLoan = readFile.map(line => (line.split(",")(0), line.split(",")(2), line.split(",")(3).toLong, format.parse(line.split(",")(4).toString())))
However, when I tried using Calendar on top of the highlighted expression, I got a NumberFormatException.

So, by just creating a quick RDD in the format of the CSV file you describe:
val list = sc.parallelize(List(
  ("1","Timothy","04/02/2015","100","TV"),
  ("1","Timothy","04/03/2015","10","Book"),
  ("1","Timothy","04/03/2015","20","Book"),
  ("1","Timothy","04/05/2015","10","Book"),
  ("2","Ursula","04/02/2015","100","TV")))
And then running
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val startDate = LocalDate.of(2015,1,4)
val endDate = LocalDate.of(2015,4,5)
val result = list
  .filter { case (_, _, date, _, _) =>
    val localDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("MM/dd/yyyy"))
    localDate.isAfter(startDate) && localDate.isBefore(endDate)
  }
  .map { case (id, _, _, amount, category) => ((id, category), (amount.toDouble, 1)) }
  .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
  .map { case ((id, category), (total, sales)) => (id, List((category, total, total / sales))) }
  .reduceByKey(_ ++ _)
will give you
(1,List((Book,30.0,15.0), (TV,100.0,100.0)))
(2,List((TV,100.0,100.0)))
in the format (SalesPersonId, [(ProductCategory, TotalSaleAmount, AvgSaleAmount)]). Is that what you are looking for?
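As a side note, the same totals and averages can be computed with the DataFrame API if you load the CSV directly. A minimal sketch, assuming Spark 2.2+ (where to_date accepts a format string), a SparkSession named spark, a header row in the file, and a hypothetical path:
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical file location; adjust to your environment
val sales = spark.read.option("header", "true").csv("/path/to/sales.csv")
  .withColumn("SaleDate", to_date($"SaleDate", "MM/dd/yyyy"))
  .withColumn("SaleAmount", $"SaleAmount".cast("double"))

val byCategory = sales
  .filter($"SaleDate".between("2015-04-01", "2015-04-04"))  // between is inclusive on both ends
  .groupBy($"SalesPersonId", $"ProductCategory")
  .agg(
    sum($"SaleAmount").as("TotalSaleAmount"),
    avg($"SaleAmount").as("AvgSaleAmount"))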

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" has a large number of records, and hence the BQ call is invoked again and again. Keeping the filter outside of the BQ read would help. The idea is to first read the "mainTable" from BQ, store it in a Spark view, and then apply the "wherePart" filter to this view in Spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)

def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect

  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
    map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
    mkString(" OR ")

  val q = new Q()
  val tempView = "tempView"
  spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
  val Df = q.mainTableLogs(tempView)
  Df
}
Could someone please help me here.
Are you using the spark-bigquery-connector? If so, the right syntax is
spark.read.format("bigquery")
  .load(mainTable)
  .where(wherePart)
  .createOrReplaceTempView(tempView)
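For completeness, here is a sketch of how that read could slot into the original getFb; Q, mainTableLogs, subDate, and Target.Pfb are the asker's own definitions and are assumed to be in scope:
def getFb(config: DataFrame, mainTable: String, ds: String)(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._

  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect()

  val wherePart = fb
    .map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1)))
    .map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '$ds')")
    .mkString(" OR ")

  // Read the table once and apply the combined predicate through Spark's where
  spark.read.format("bigquery")
    .load(mainTable)
    .where(wherePart)
    .createOrReplaceTempView("tempView")

  new Q().mainTableLogs("tempView")
}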

How take data from several parquet files at once?

I need your help because I am new to the Spark framework.
I have a folder with a lot of parquet files. The names of these files all have the same format: DD-MM-YYYY, for example '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try to use the following code, the application hangs. It seems like the application scans all the files in the folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
  .filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to pull the data for the period as fast as possible.
I think it would be great to divide the period into days, read the files separately, and then union them like this:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018")
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018")
val finalDf = mf1.union(mf2).distinct()
dateFrom and dateTo are dynamic, so I don't know how to organize the code correctly right now. Please help!
#y2k-shubham I tried to test the following code, but it raises an error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")

def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)

val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
  intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems like the intermediateDf DataFrame is empty at the start. How can I fix the problem?
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
  val startDate = LocalDate.parse(start, formatter)
  val endDate = LocalDate.parse(end, formatter)
  Iterator.iterate(startDate)(_.plusDays(1))
    .takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}

val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
  .reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption; then you'll get an Option[DataFrame], which will be None if the input date range is empty. Also, if it is possible for some input directories/files to be missing, you'd want to check for their existence before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))

val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .filter(p => fs.exists(new Path(p)))
  .map(spark.read.parquet(_))
  .reduceOption(_ union _)
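As an aside (not part of the original answer), spark.read.parquet also accepts several paths in a single call, so another option is to build the per-day paths first and read them all at once, reusing the dateRangeInclusive and formatter defined above:
val paths: Seq[String] = dateRangeInclusive("2018-10-01", "2018-10-05")
  .map(d => s"/path/to/directory/${formatter.format(d)}")
  .toSeq

// DataFrameReader.parquet takes varargs, so all per-day folders are read in one call
val data: DataFrame = spark.read.parquet(paths: _*)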
I haven't tested this piece of code, but it should work (perhaps with a slight modification):
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}

// return the number of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays

// return the sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
  val days = getDaysInBetween(from, to)
  (0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}

// read parquet data of the given date-range from the given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
  // get date-range sequence
  val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)

  // read data of from-date (needed because the schema of all DataFrames must match for union)
  val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))

  // read and union the remaining DataFrames (functionally)
  val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
    intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
  }

  // return the union DataFrame
  unionDf
}
Reference: How to calculate 'n' days interval date in functional style?

spark rdd time stamp conversion

I have a text file with two fields, start-time and end-time, and I want to find the difference between these two times.
name,id,starttime,endtime,loc
xxx,123,2017-10-23T07:13:45.567+5:30,2017-10-23T07:17:40.567+5:30,zzz
xya,134,2017-10-23T14:17:25.567+5:30,2017-10-23T15:13:45.567+5:30,yyy
I have loaded this file into an RDD.
val rdd1 = sparkcontext.textFile("/user/root/file1.txt")

case class xyz(name: String, id: Int, starttime: String, endtime: String, loc: String)

val rdd2 = rdd1.map { x =>
  val w = x.split(',')
  xyz(w(0), w(1).toInt, w(2), w(3), w(4))
}
How do I find the timestamp difference between starttime (w(2)) and endtime (w(3)) using the RDD?
I would suggest using a Dataset rather than an RDD, so that you can take advantage of the case class; Datasets are also more optimized than RDDs and offer many more options.
Assuming that you have a text file with the following data and no header:
xxx,123,2017-10-23T07:13:45.567+5:30,2017-10-23T07:17:40.567+5:30,zzz
xya,134,2017-10-23T14:17:25.567+5:30,2017-10-23T15:13:45.567+5:30,yyy
And a case class as
case class xyz(name:String,id:Int,starttime:String,endtime:String,loc:String)
The first step is to convert the text file to a Dataset:
import spark.implicits._   // required for .toDS() (assumes a SparkSession named spark)

val rdd1 = sparkcontext.textFile("/user/root/file1.txt")
val dataSet = rdd1
  .map(x => x.split(','))
  .map(w => xyz(w(0), w(1).toInt, w(2).replace("T", " ").substring(0, w(2).indexOf(".")), w(3).replace("T", " ").substring(0, w(3).indexOf(".")), w(4)))
  .toDS()
If you do dataSet.show(false), you should see the following:
+----+---+-------------------+-------------------+---+
|name|id |starttime |endtime |loc|
+----+---+-------------------+-------------------+---+
|xxx |123|2017-10-23 07:13:45|2017-10-23 07:17:40|zzz|
|xya |134|2017-10-23 14:17:25|2017-10-23 15:13:45|yyy|
+----+---+-------------------+-------------------+---+
Now you can just call the unix_timestamp function to find the difference:
import org.apache.spark.sql.functions._
dataSet.withColumn("difference", unix_timestamp($"endtime") - unix_timestamp($"starttime")).show(false)
which should result in:
+----+---+-------------------+-------------------+---+----------+
|name|id |starttime |endtime |loc|difference|
+----+---+-------------------+-------------------+---+----------+
|xxx |123|2017-10-23 07:13:45|2017-10-23 07:17:40|zzz|235 |
|xya |134|2017-10-23 14:17:25|2017-10-23 15:13:45|yyy|3380 |
+----+---+-------------------+-------------------+---+----------+
I hope the answer is helpful
You will have to convert the String date to a valid date, i.e. convert 2017-10-23T07:13:45.567+5:30 to 2017-10-23 07:13:45, and then you can use SimpleDateFormat to turn each date into a long so that arithmetic can be done on the values.
Concisely, you can do something like this:
import java.text.SimpleDateFormat

val rdd1 = sparkcontext.textFile("/user/root/file1.txt")
val rdd2 = rdd1
  .map(x => x.split(','))
  .map(w => (w(2).replace("T", " ").substring(0, w(2).indexOf(".")), w(3).replace("T", " ").substring(0, w(3).indexOf("."))))

val difference = rdd2.map { tuple =>
  // difference in milliseconds
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val startDate = format.parse(tuple._1).getTime
  val endDate = format.parse(tuple._2).getTime
  endDate - startDate
}
I hope the answer is helpful
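As a side note (not in the original answers), SimpleDateFormat is not thread-safe, so if you prefer the java.time API the same difference (here in seconds rather than milliseconds) could be computed like this, assuming the timestamps have already been normalized to yyyy-MM-dd HH:mm:ss as above:
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

// rdd2 holds (starttime, endtime) string pairs, as built in the previous snippet
val differenceSeconds = rdd2.map { case (start, end) =>
  // the formatter is created inside the closure so nothing non-serializable is captured
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  Duration.between(LocalDateTime.parse(start, fmt), LocalDateTime.parse(end, fmt)).getSeconds
}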

How to automate the creation of String elements with datetime

This thread arises from my previous question. I need to create a Seq[String] that contains paths as String elements; however, now I also need to add the numbers 7, 8, ..., 22 after a date. Also, I cannot use LocalDate as was suggested in the answer to the above-cited question:
path/file_2017-May-1-7
path/file_2017-May-1-8
...
path/file_2017-May-1-22
path/file_2017-April-30-7
path/file_2017-April-30-8
...
path/file_2017-April-30-22
..
I am searching for a flexible solution. My current solution involves manually defining the dates in the yyyy-MMM-dd format. However, it is not efficient if I need to include more than 2 dates, e.g. 10 or 100. Moreover, filePathsList is currently a Set[Seq[String]] and I don't know how to convert it into a Seq[String].
import java.text.SimpleDateFormat
import java.util.Calendar

val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
currDay.add(Calendar.DATE, -1)
val day_1_ago = currDay.getTime
currDay.add(Calendar.DATE, -1)
val day_2_ago = currDay.getTime

val dates = Set(formatter.format(day_1_ago), formatter.format(day_2_ago))

val filePathsList = dates.map(date => {
  var list = Seq.empty[String]
  for (num <- 7 to 22) {
    list = list :+ s"path/file_$date-$num"
  }
  list
})
Here is how I was able to achieve what you outlined; adjust the days val to configure the number of days you care about:
import java.text.SimpleDateFormat
import java.util.Calendar

val currDay = Calendar.getInstance
val days = 5
val dates = currDay.getTime +: List.fill(days) {
  currDay.add(Calendar.DATE, -1)
  currDay.getTime
}

val formatter = new SimpleDateFormat("yyyy-MMM-dd")

val filePathsList = for {
  date <- dates
  num <- 7 to 22
} yield s"path/file_${formatter.format(date)}-$num"
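As a small addendum (not part of the original answer), the conversion asked about in the question, from a Set[Seq[String]] to a Seq[String], is just a flatten:
// filePathsList here refers to the Set[Seq[String]] built in the question's own code
val flatPaths: Seq[String] = filePathsList.toSeq.flatten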

Filtering in Scala

So suppose I have the following data (only the first few rows, this data covers an entire year) -
(2014-08-31T00:05:00.000+01:00, John)
(2014-08-31T00:11:00.000+01:00, Sarah)
(2014-08-31T00:12:00.000+01:00, George)
(2014-08-31T00:05:00.000+01:00, John)
(2014-09-01T00:05:00.000+01:00, Sarah)
(2014-09-01T00:05:00.000+01:00, George)
(2014-09-01T00:05:00.000+01:00, Jason)
I would like to filter the data so that I only see what the names are for a specific date (say, 2014-09-05). I've tried doing this using the filter function in Scala but I keep receiving the following error -
error: value xxxx is not a member of (org.joda.time.DateTime, String)
Is there another way of doing this?
The filter method takes a function, called a predicate, that takes as parameter an element of your (I'm assuming) RDD, and returns a Boolean.
The returned RDD will keep only the rows for which the predicate evaluates to true.
In your case, it seems that what you want is something like
rdd.filter {
  case (date, _) => date.withTimeAtStartOfDay() == new DateTime("2014-09-05")
}
I presume from the tag your question is in the context of Spark and not pure Scala. Given that, you could filter a dataframe on a date and get the associated name(s) like this:
import org.apache.spark.sql.functions._
import sparkSession.implicits._
Seq(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah")
...
)
.toDF("date", "name")
.filter(to_date('date).equalTo(Date.valueOf("2014-09-05")))
.select("name")
Note that the Date above is java.sql.Date.
Here's a function that takes a date, a list of datetime-name pairs, and returns a list of names for the date:
def getNames(d: String, l: List[(String, String)]): List[String] = {
  val date = """^([^T]*).*""".r
  val dateMap = l.map {
      case (x, y) => (x match { case date(z) => z }, y)
    }
    .groupBy(_._1)
    .mapValues(_.map(_._2))
  dateMap.getOrElse(d, List[String]())
}
val list = List(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah"),
("2014-08-31T00:12:00.000+01:00", "George"),
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-09-01T00:05:00.000+01:00", "Sarah"),
("2014-09-01T00:05:00.000+01:00", "George"),
("2014-09-01T00:05:00.000+01:00", "Jason")
)
getNames("2014-09-01", list)
res1: List[String] = List(Sarah, George, Jason)
val dateTimeStringZero = "2014-08-12T00:05:00.000+01:00"
val dateTimeOne: DateTime = org.joda.time.format.ISODateTimeFormat.dateTime.withZoneUTC.parseDateTime(dateTimeStringZero)

import java.text.SimpleDateFormat
val df = new DateTime(new SimpleDateFormat("yyyy-MM-dd").parse("2014-08-12"))

println(dateTimeOne.getYear == df.getYear)
println(dateTimeOne.getMonthOfYear == df.getMonthOfYear)
...