How take data from several parquet files at once? - scala

I need your help cause I am new in Spark Framework.
I have folder with a lot of parquet files. The name of these files has the same format: DD-MM-YYYY. For example: '01-10-2018', '02-10-2018', '03-10-2018', etc.
My application has two input parameters: dateFrom and dateTo.
When I try to use next code application hangs. It seems like application scan all files in folder.
val mf = spark.read.parquet("/PATH_TO_THE_FOLDER/*")
.filter($"DATE".between(dateFrom + " 00:00:00", dateTo + " 23:59:59"))
mf.show()
I need to take data pool for period as fast as it possible.
I think it would be great to divide period into days and then read files separately, join them like that:
val mf1 = spark.read.parquet("/PATH_TO_THE_FOLDER/01-10-2018");
val mf2 = spark.read.parquet("/PATH_TO_THE_FOLDER/02-10-2018");
val final = mf1.union(mf2).distinct();
dateFrom and dateTo are dynamic, so I don't know how correctly organize code right now. Please help!
#y2k-shubham I tried to test next code, but it raise error:
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
val dateFrom = DateTime.parse("2018-10-01")
val dateTo = DateTime.parse("2018-10-05")
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
val days = getDaysInBetween(from, to)
(0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
val datesInBetween: Seq[DateTime] = getDatesInBetween(dateFrom, dateTo)
val unionDf: DataFrame = datesInBetween.foldLeft(spark.emptyDataFrame) { (intermediateDf: DataFrame, date: DateTime) =>
intermediateDf.union(spark.read.parquet("PATH" + date.toString("yyyy-MM-dd") + "/*.parquet"))
}
unionDf.show()
ERROR:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 20 columns;
It seems like intermediateDf DateFrame at start is empty. How to fix the problem?

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.{DataFrame, SparkSession}
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
def dateRangeInclusive(start: String, end: String): Iterator[LocalDate] = {
val startDate = LocalDate.parse(start, formatter)
val endDate = LocalDate.parse(end, formatter)
Iterator.iterate(startDate)(_.plusDays(1))
.takeWhile(d => d.isBefore(endDate) || d.isEqual(endDate))
}
val spark = SparkSession.builder().getOrCreate()
val data: DataFrame = dateRangeInclusive("2018-10-01", "2018-10-05")
.map(d => spark.read.parquet(s"/path/to/directory/${formatter.format(d)}"))
.reduce(_ union _)
I also suggest using the native JSR 310 API (part of Java SE since Java 8) rather than joda-time, since it is more modern and does not require external dependencies. Note that first creating a sequence of paths and doing map+reduce is probably simpler for this use case than a more general foldLeft-based solution.
Additionally, you can use reduceOption, then you'll get an Option[DataFrame] if the input date range is empty. Also, if it is possible for some input directories/files to be missing, you'd want to do a check before invoking spark.read.parquet. If your data is on HDFS, you should probably use the Hadoop FS API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(new Configuration(spark.sparkContext.hadoopConfiguration))
val data: Option[DataFrame] = dateRangeInclusive("2018-10-01", "2018-10-05")
.map(d => s"/path/to/directory/${formatter.format(d)}")
.filter(p => fs.exists(new Path(p)))
.map(spark.read.parquet(_))
.reduceOption(_ union _)

While I haven't tested this piece of code, it must work (probably slight modification?)
import org.joda.time.{DateTime, Days}
import org.apache.spark.sql.{DataFrame, SparkSession}
// return no of days between two dates
def getDaysInBetween(from: DateTime, to: DateTime): Int = Days.daysBetween(from, to).getDays
// return sequence of dates between two dates
def getDatesInBetween(from: DateTime, to: DateTime): Seq[DateTime] = {
val days = getDaysInBetween(from, to)
(0 to days).map(day => from.plusDays(day).withTimeAtStartOfDay())
}
// read parquet data of given date-range from given path
// (you might want to pass SparkSession in a different manner)
def readDataForDateRange(path: String, from: DateTime, to: DateTime)(implicit spark: SparkSession): DataFrame = {
// get date-range sequence
val datesInBetween: Seq[DateTime] = getDatesInBetween(from, to)
// read data of from-date (needed because schema of all DataFrames should be same for union)
val fromDateDf: DataFrame = spark.read.parquet(path + "/" + datesInBetween.head.toString("yyyy-MM-dd"))
// read and union remaining dataframes (functionally)
val unionDf: DataFrame = datesInBetween.tail.foldLeft(fromDateDf) { (intermediateDf: DataFrame, date: DateTime) =>
intermediateDf.union(spark.read.parquet(path + "/" + date.toString("yyyy-MM-dd")))
}
// return union-df
unionDf
}
Reference: How to calculate 'n' days interval date in functional style?

Related

Spark scala UDF in DataFrames is not working

I have defined a function to convert Epoch time to CET and using that function after wrapping as UDF in Spark dataFrame. It is throwing error and not allowing me to use it. Please find below my code.
Function used to convert Epoch time to CET:
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
d: Long,
f: String = "dd/MM/yyyy HH:mm:ss.SSS",
z: String = "CET",
msPrecision: Int = 9
): String = {
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
Below given Spark DataFrame uses the above defined UDF for converting Epoch time to CET
val df2 = df1.select($"messageID",$"messageIndex",epochToDateTime($"messageTimestamp").as("messageTimestamp"))
I am getting the below shown error, when I run the code
Any idea how am I supposed to proceed in this scenario ?
The spark optimizer execution tells you that your function is not a Function1, that means that it is not a function that accepts one parameter. You have a function with four input parameters. And, although you may think that in Scala you are allowed to call that function with only one parameter because you have default values for the other three, it seems that Catalyst does not work in this way, so you will need to change the definition of your function to something like:
def convertNanoEpochToDateTime(
f: String = "dd/MM/yyyy HH:mm:ss.SSS"
)(z: String = "CET")(msPrecision: Int = 9)(d: Long): String
or
def convertNanoEpochToDateTime(f: String)(z: String)(msPrecision: Int)(d: Long): String
and put the default values in the udf creation:
val epochToDateTime = udf(
convertNanoEpochToDateTime("dd/MM/yyyy HH:mm:ss.SSS")("CET")(9) _
)
and try to define the SimpleDateFormat as a static transient value out of the function.
I found why the error is due to and resolved it. The problem is when I wrap the scala function as UDF, its expecting 4 parameters, but I was passing only one parameter. Now, I removed 3 parameters from the function and took those values inside the function itself, since they are constant values. Now in Spark Dataframe, I am calling the function with only 1 parameter and it works perfectly fine.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, TimeZone}
import java.util.concurrent.TimeUnit
def convertNanoEpochToDateTime(
d: Long
): String = {
val f: String = "dd/MM/yyyy HH:mm:ss.SSS"
val z: String = "CET"
val msPrecision: Int = 9
val sdf = new SimpleDateFormat(f)
sdf.setTimeZone(TimeZone.getTimeZone(z))
val date = new Date((d / Math.pow(10, 9).toLong) * 1000L)
val stringTime = sdf.format(date)
if (f.contains(".S")) {
val lng = d.toString.length
val milliSecondsStr = d.toString.substring(lng-9,lng)
stringTime.substring(0, stringTime.lastIndexOf(".") + 1) + milliSecondsStr.substring(0,msPrecision)
}
else stringTime
}
val epochToDateTime = udf(convertNanoEpochToDateTime _)
import spark.implicits._
val df1 = List(1659962673251388155L,1659962673251388155L,1659962673251388155L,1659962673251388155L).toDF("epochTime")
val df2 = df1.select(epochToDateTime($"epochTime"))

Create two rows based on a date time column in spark using scala

I have a DF with column session_start and session end. I need to create another row so if the start and end fall in different dates.
For ex :
We have df as
session_start
session_stop
01-05-2021 23:11:40
02-05-2021 02:13:25
So the new output df should break this into two rows like :
session_start
session_stop
01-05-2021 23:11:40
01-05-2021 23:59:59
02-05-2021 00:00:00
02-05-2021 02:13:25
Will all other columns should remain common in both the rows.
You can use a flatMap operation on your DF.
The function you use in the flatMap will produce either one or two row(s).
I did it without flatMap function. Created a UDF generateOverlappedSessionsFromTimestampRanges which does the conversion and used it as below
// UDF
import java.sql.Timestamp
import java.time.temporal.ChronoUnit
import java.time.LocalDateTime
val generateOverlappedSessionsFromTimestampRanges = udf {(localStartTimestamp: Timestamp, localEndTimestamp: Timestamp) =>
val localStartLdt = localStartTimestamp.toLocalDateTime
val localEndLdt = localEndTimestamp.toLocalDateTime
var output : List[(Timestamp, Timestamp)] = List()
if(localStartLdt.toLocalDate().until(localEndLdt.toLocalDate(), ChronoUnit.DAYS) > 0) {
val newLocalEndLdt = LocalDateTime.of(localStartLdt.getYear(), localStartLdt.getMonth(), localStartLdt.getDayOfMonth(), 23, 59, 59)
val newLocalStartLdt = LocalDateTime.of(localEndLdt.getYear(), localEndLdt.getMonth(), localEndLdt.getDayOfMonth(), 0, 0, 0)
output = output :+ (Timestamp.valueOf(localStartLdt),
Timestamp.valueOf(newLocalEndLdt)
)
output = output :+ (Timestamp.valueOf(newLocalStartLdt),
Timestamp.valueOf(localEndLdt)
)
} else {
output = output :+ (Timestamp.valueOf(localStartLdt),
Timestamp.valueOf(localEndLdt)
)
}
output
}
//Unit test case for above UDF
import org.apache.spark.sql.functions._
import java.sql.Timestamp
import org.apache.spark.sql.types.TimestampType
val timestamps: Seq[(Timestamp, Timestamp)] = Seq(
(Timestamp.valueOf("2020-02-10 22:07:25.000"),
Timestamp.valueOf("2020-02-11 02:07:25.000")
)
)
val timestampsDf = timestamps.toDF("local_session_start_timestamp", "local_session_stop_timestamp")
var output = timestampsDf.withColumn("to_be_explode", TimeUtil.generateOverlappedSessionsFromTimestampRanges1(timestampsDf("local_session_start_timestamp"),
timestampsDf("local_session_stop_timestamp")
))
output = output.withColumn("exploded_session_time",explode(col("to_be_explode")))
.withColumn("new_local_session_start",col("exploded_session_time._1"))
.withColumn("new_local_session_stop", col("exploded_session_time._2"))
.drop("to_be_explode", "exploded_session_time")
display(output)
df.withColumn("to_be_explode", generateOverlappedSessionsFromTimestampRanges(df("session_start"), df("session_stop")))
.withColumn("exploded_session_time",explode(col("to_be_explode")))
.withColumn("session_start",col("exploded_session_time._1"))
.withColumn("session_stop", col("exploded_session_time._2"))
.drop("to_be_explode", "exploded_session_time")

How to get creation date of a file in a Scala dataframe

How to print the date of a file in Scala is explained here.
My question is how I can get a variable containing this information which can be returned as a column to a dataframe. None of the conversions I would expect to be allowed, actually are allowed.
My code (using Scala 2.11):
import org.apache.spark.sql.functions._
import java.nio.file.{Files, Paths} // Needed for file time
import java.nio.file.attribute.BasicFileAttributes
import java.util.Date
def GetFileTimeFunc(pathStr: String): String = {
// From: https://stackoverflow.com/questions/47453193/how-to-get-creation-date-of-a-file-using-scala
val FileTime = Files.readAttributes(Paths.get(pathStr), classOf[BasicFileAttributes]).creationTime;
val JavaDate = Date.from(FileTime.toInstant);
return(JavaDate.toString())
}
#transient val GetFileTime = udf(GetFileTimeFunc _)
val filePath = "dbfs:/mnt/myData/" // location of data
val file_df = dbutils.fs.ls(filePath).toDF // Output columns are $"path", $"name", and $"size"
.withColumn("FileTimeCreated", GetFileTime($"path"))
display(file_df)//.select("name", "size"))
Output:
SparkException: Failed to execute user defined function($anonfun$2: (string) => string)
For some reason, Instant is not allowed as a column type, so I cannot use it as a return type.The same for FileTime, JavDate, etc.

Get next week date in Spark Dataframe using scala

I have a DateType input in the function. I would like to exclude Saturday and Sunday and get the next week day, if the input date falls on the weekend, otherwise it should give the next day's date
Example:
Input: Monday 1/1/2017 output: 1/2/2017 (which is Tuesday)
Input: Saturday 3/4/2017 output: 3/5/2017 (which is Monday)
I have gone through https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html but I don't see a ready made function, so I think it will need to be created.
So far I have something that is:
val nextWeekDate = udf {(startDate: DateType) =>
val day= date_format(startDate,'E'
if(day=='Sat' or day=='Sun'){
nextWeekDate = next_day(startDate,'Mon')
}
else{
nextWeekDate = date_add(startDate, 1)
}
}
Need help to get it valid and working.
Using dates as strings:
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter
// If that is your format date
object MyFormat {
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
}
object MainSample {
import MyFormat._
def main(args: Array[String]): Unit = {
import java.sql.Date
import org.apache.spark.sql.types.{DateType, IntegerType}
import spark.implicits._
import org.apache.spark.sql.types.{ StringType, StructField, StructType }
import org.apache.spark.sql.functions._
implicit val spark: SparkSession =
SparkSession
.builder()
.appName("YourApp")
.config("spark.master", "local")
.getOrCreate()
val someData = Seq(
Row(1,"2013-01-30"),
Row(2,"2012-01-01")
)
val schema = List(StructField("id", IntegerType), StructField("date",StringType))
val sourceDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))
sourceDF.show()
val _udf = udf { (dt: String) =>
// Parse your date, dt is a string
val localDate = LocalDate.parse(dt, formatter)
// Check the week day and add days in each case
val newDate = if ((localDate.getDayOfWeek == DayOfWeek.SATURDAY)) {
localDate.plusDays(2)
} else if (localDate.getDayOfWeek == DayOfWeek.SUNDAY) {
localDate.plusDays(1)
} else {
localDate.plusDays(1)
}
newDate.toString
}
sourceDF.withColumn("NewDate", _udf('date)).show()
}
}
Here's a much simpler answer that's defined in spark-daria:
def nextWeekday(col: Column): Column = {
val d = dayofweek(col)
val friday = lit(6)
val saturday = lit(7)
when(col.isNull, null)
.when(d === friday || d === saturday, next_day(col,"Mon"))
.otherwise(date_add(col, 1))
}
You always want to stick with the native Spark functions whenever possible. This post explains the derivation of this function in greater detail.

Working with Dates in Spark

I have a requirement where a CSV file is to be parsed, identify the records between the specific dates and find total and average sales for each sales person per ProductCategory in that duration. Below is the CSV file structure:
SalesPersonId,SalesPersonName,SaleDate,SaleAmount,ProductCategory
Please help in resolving this query. Looking for solution in Scala
What I tried:
Used the SimpleDateFormat as mentioned below:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
and created an RDD with the below piece of code:
val onlyHouseLoan = readFile.map(line => (line.split(",")(0), line.split(",")(2), line.split(",")(3).toLong, format.parse(line.split(",")(4).toString())))
However, I tried using the Calendar on top of the highlighted expression but getting error that NumberformatExpression.
So by just creating a quick rdd in the format of the csv-file you describe
val list = sc.parallelize(List(("1","Timothy","04/02/2015","100","TV"), ("1","Timothy","04/03/2015","10","Book"), ("1","Timothy","04/03/2015","20","Book"), ("1","Timothy","04/05/2015","10","Book"),("2","Ursula","04/02/2015","100","TV")))
And then running
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val startDate = LocalDate.of(2015,1,4)
val endDate = LocalDate.of(2015,4,5)
val result = list
.filter{case(_,_,date,_,_) => {
val localDate = LocalDate.parse(date, DateTimeFormatter.ofPattern("MM/dd/yyyy"))
localDate.isAfter(startDate) && localDate.isBefore(endDate)}}
.map{case(id, _, _, amount, category) => ((id, category), (amount.toDouble, 1))}
.reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2 + v2._2))
.map{case((id, category),(total, sales)) => (id, List((category, total, total/sales)))}
.reduceByKey(_ ++ _)
will give you
(1,List((Book,30.0,15.0), (TV,100.0,100.0)))
(2,List((TV,100.0,100.0)))
in the format of (SalesPersonId, [(ProductCategory,TotalSaleAmount, AvgSaleAmount)]. Is that what you are looking for?