I need to find all the year weeks between the given weeks.
201824 is an example of a year week: it means the 24th week of the year 2018.
Assuming that there are 52 weeks in a year, the year weeks of 2018 start with 201801 and end with 201852. After that, it continues with 201901.
I was able to find the range of all year weeks between two weeks if the start week and the end week are in the same year, like below:
val range = udf((i: Int, j: Int) => (i to j).toArray)
The above code only works when the start week and the end week are in the same year, for example 201912 - 201917.
How do I make it work if the start week and the end week belong to different years?
Example: 201849 - 201903
The above weeks should give the output as:
201849,201850,201851,201852,201901,201902,201903
Well, there is still a lot of optimization to do, but for the general direction you could use:
I am using org.joda.time.format here, but java.time should also fit.
import org.joda.time.Weeks
import org.joda.time.format.DateTimeFormat

def rangeOfYearWeeks(weeksRange: String): Array[String] = {
  try {
    val left = weeksRange.split("-")(0).trim
    val right = weeksRange.split("-")(1).trim
    val leftPattern = s"${left.substring(0, 4)}-${left.substring(4)}"
    val rightPattern = s"${right.substring(0, 4)}-${right.substring(4)}"
    val fmt = DateTimeFormat.forPattern("yyyy-w")
    val leftDate = fmt.parseDateTime(leftPattern)
    val rightDate = fmt.parseDateTime(rightPattern)
    //if (leftDate.isAfter(rightDate))
    val weeksBetween = Weeks.weeksBetween(leftDate, rightDate).getWeeks
    val dates = for (one <- 0 to weeksBetween) yield {
      leftDate.plusWeeks(one)
    }
    val result: Array[String] = dates.map(date => fmt.print(date)).map(_.replaceAll("-", "")).toArray
    result
  } catch {
    case e: Exception => Array.empty
  }
}
Example:
val dates = Seq("201849 - 201903", "201912 - 201917").toDF("col")
val weeks = udf((d: String) => rangeOfYearWeeks(d))
dates.select(weeks($"col")).show(false)
+-----------------------------------------------------+
|UDF(col) |
+-----------------------------------------------------+
|[201849, 201850, 201851, 201852, 20181, 20192, 20193]|
|[201912, 201913, 201914, 201915, 201916, 201917] |
+-----------------------------------------------------+
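Note that the first row above prints 20181 instead of 201901: the yyyy-w pattern mixes the calendar year with the week of the week-based year, and the week number is not zero-padded. A minimal, untested sketch of the same approach using Joda's week-year pattern (xxxx-ww), which avoids both issues:

import org.joda.time.Weeks
import org.joda.time.format.DateTimeFormat

// Sketch only: "xxxx" is Joda's week-based year and "ww" zero-pads the week,
// so 2018-12-31 prints as "2019-01" -> "201901" rather than "2018-1" -> "20181".
def rangeOfYearWeeksFixed(weeksRange: String): Array[String] = {
  try {
    val Array(left, right) = weeksRange.split("-").map(_.trim)
    val fmt = DateTimeFormat.forPattern("xxxx-ww")
    val leftDate  = fmt.parseDateTime(s"${left.substring(0, 4)}-${left.substring(4)}")
    val rightDate = fmt.parseDateTime(s"${right.substring(0, 4)}-${right.substring(4)}")
    val weeksBetween = Weeks.weeksBetween(leftDate, rightDate).getWeeks
    (0 to weeksBetween).map(n => fmt.print(leftDate.plusWeeks(n)).replace("-", "")).toArray
  } catch {
    case _: Exception => Array.empty[String]
  }
}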
Here's a solution with a UDF that uses the java.time API:
def weeksBetween = udf{ (startWk: Int, endWk: Int) =>
  import java.time.LocalDate
  import java.time.format.DateTimeFormatter
  import scala.util.{Try, Success, Failure}

  def formatYW(yw: Int): String = {
    val pattern = "(\\d{4})(\\d+)".r
    s"$yw" match { case pattern(y, w) => s"$y-$w-1" }
  }

  val formatter = DateTimeFormatter.ofPattern("YYYY-w-e") // week-based year
  Try(
    Iterator.iterate(LocalDate.parse(formatYW(startWk), formatter))(_.plusWeeks(1)).
      takeWhile(_.isBefore(LocalDate.parse(formatYW(endWk), formatter))).
      map{ s =>
        val a = s.format(formatter).split("-")
        (a(0) + f"${a(1).toInt}%02d").toInt
      }.
      toList.tail
  ) match {
    case Success(ls) => ls
    case Failure(_) => List.empty[Int] // return an empty list
  }
}
Testing the UDF:
val df = Seq(
(1, 201849, 201903), (2, 201908, 201916), (3, 201950, 201955)
).toDF("id", "start_wk", "end_wk")
df.withColumn("weeks_between", weeksBetween($"start_wk", $"end_wk")).show(false)
// +---+--------+------+--------------------------------------------------------+
// |id |start_wk|end_wk|weeks_between |
// +---+--------+------+--------------------------------------------------------+
// |1 |201849 |201903|[201850, 201851, 201852, 201901, 201902] |
// |2 |201908 |201916|[201909, 201910, 201911, 201912, 201913, 201914, 201915]|
// |3 |201950 |201955|[] |
// +---+--------+------+--------------------------------------------------------+
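Note that this UDF excludes both endpoints (the .tail drops the start week and takeWhile(_.isBefore(...)) drops the end week), while the question's expected output includes them. An untested variation of the same UDF that keeps both endpoints:

def weeksBetweenInclusive = udf{ (startWk: Int, endWk: Int) =>
  import java.time.LocalDate
  import java.time.format.DateTimeFormatter
  import scala.util.{Try, Success, Failure}

  def formatYW(yw: Int): String = {
    val pattern = "(\\d{4})(\\d+)".r
    s"$yw" match { case pattern(y, w) => s"$y-$w-1" }
  }

  val formatter = DateTimeFormatter.ofPattern("YYYY-w-e") // week-based year, as above
  Try {
    val end = LocalDate.parse(formatYW(endWk), formatter)
    Iterator.iterate(LocalDate.parse(formatYW(startWk), formatter))(_.plusWeeks(1)).
      takeWhile(!_.isAfter(end)).   // keep the end week
      map{ d =>
        val a = d.format(formatter).split("-")
        (a(0) + f"${a(1).toInt}%02d").toInt
      }.
      toList                        // no .tail, keep the start week
  } match {
    case Success(ls) => ls
    case Failure(_)  => List.empty[Int]
  }
}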
How can I compute the 15th and 50th percentiles of the students column, taking the occ column into consideration, without using array_repeat and without exploding? I have a huge input DataFrame, and the explosion blows out the memory.
My DF is:
name | occ | students
aaa  | 1   | 1
aaa  | 3   | 7
aaa  | 6   | 11
...
For example, if I consider students and occ as both being arrays, then to compute the 50th percentile of the students array while taking occ into consideration, I would normally compute it like this:
val students = Array(1,7,11)
val occ = Array(1,3,6)
This gives:
val student_repeated = Array(1,7,7,7,11,11,11,11,11,11)
Then student_50th would be the 50th percentile of student_repeated, i.e. 11.
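For intuition, the same number can be computed straight from the (value, weight) pairs via cumulative weights, without building the repeated array; weightedPercentile below is just an illustrative plain-Scala helper, not library code:

// Sketch: the p-th percentile of the conceptual "repeated" array,
// computed from cumulative weights instead of repeating the values.
def weightedPercentile(values: Seq[Int], weights: Seq[Int], p: Double): Int = {
  val sorted = values.zip(weights).sortBy { case (v, _) => v }
  val total  = weights.map(_.toLong).sum
  val target = math.ceil(p * total).toLong             // 1-based rank in the repeated array
  val cumulative = sorted.scanLeft(0L)(_ + _._2).tail  // running sum of the weights
  sorted.zip(cumulative).collectFirst { case ((v, _), c) if c >= target => v }
    .getOrElse(sorted.last._1)
}

weightedPercentile(Seq(1, 7, 11), Seq(1, 3, 6), 0.5)   // 11, matches the example above
weightedPercentile(Seq(1, 7, 11), Seq(1, 3, 6), 0.15)  // 7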
My current code:
import spark.implicits._
val inputDF = Seq(
  ("aaa", 1, 1),
  ("aaa", 3, 7),
  ("aaa", 6, 11),
).toDF("name", "occ", "student")

// Solution 1
inputDF
  .withColumn("student", array_repeat(col("student"), col("occ")))
  .withColumn("student", explode(col("student")))
  .groupBy("name")
  .agg(
    percentile_approx(col("student"), lit(0.5), lit(10000)).alias("student_50"),
    percentile_approx(col("student"), lit(0.15), lit(10000)).alias("student_15"),
  )
  .show(false)
which outputs:
+----+----------+----------+
|name|student_50|student_15|
+----+----------+----------+
|aaa |11 |7 |
+----+----------+----------+
EDIT:
I am looking for a Scala equivalent of this solution:
https://stackoverflow.com/a/58309977/4450090
EDIT2:
I am proceeding with sketches-java:
https://github.com/DataDog/sketches-java
I have decided to use DDSketch, which has an accept method that allows the sketch to be updated:
"com.datadoghq" % "sketches-java" % "0.8.2"
First, I initialize an empty sketch.
Then, I accept pairs of values (value, weight).
Finally, I call the DDSketch method getValueAtQuantile.
I execute all of this as a Spark Scala Aggregator.
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import com.datadoghq.sketch.ddsketch.{DDSketch, DDSketches}

class DDSInitAgg(pct: Double, accuracy: Double) extends Aggregator[ValueWithWeigth, SketchData, Double] {
  private val precision: String = "%.6f"

  override def zero: SketchData = DDSUtils.sketchToTuple(DDSketches.unboundedDense(accuracy))

  override def reduce(b: SketchData, a: ValueWithWeigth): SketchData = {
    val s = DDSUtils.sketchFromTuple(b)
    s.accept(a.value, a.weight)
    DDSUtils.sketchToTuple(s)
  }

  override def merge(b1: SketchData, b2: SketchData): SketchData = {
    val s1: DDSketch = DDSUtils.sketchFromTuple(b1)
    val s2: DDSketch = DDSUtils.sketchFromTuple(b2)
    s1.mergeWith(s2)
    DDSUtils.sketchToTuple(s1)
  }

  override def finish(reduction: SketchData): Double = {
    val percentile: Double = DDSUtils.sketchFromTuple(reduction).getValueAtQuantile(pct)
    precision.format(percentile).toDouble
  }

  override def bufferEncoder: Encoder[SketchData] = ExpressionEncoder()
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
You can execute it as a udaf taking two columns as the input.
Additionally, I developed methods for encoding/decoding back and forth between DDSketch and Array[Byte]:
case class SketchData(backingArray: Array[Byte], numWrittenBytes: Int)

object DDSUtils {
  val emptySketch: DDSketch = DDSketches.unboundedDense(0.01)
  val supplierStore: Supplier[Store] = () => new UnboundedSizeDenseStore()

  def sketchToTuple(s: DDSketch): SketchData = {
    val o = GrowingByteArrayOutput.withDefaultInitialCapacity()
    s.encode(o, false)
    SketchData(o.backingArray(), o.numWrittenBytes())
  }

  def sketchFromTuple(sketchData: SketchData): DDSketch = {
    val i: ByteArrayInput = ByteArrayInput.wrap(sketchData.backingArray, 0, sketchData.numWrittenBytes)
    DDSketch.decode(i, supplierStore)
  }
}
This is how I call it as a udaf:
val ddsInitAgg50UDAF: UserDefinedFunction = udaf(new DDSInitAgg(0.50, 0.50), ExpressionEncoder[ValueWithWeigth])
and finally, in the aggregation:
ddsInitAgg50UDAF(col("weigthCol"), col("valueCol")).alias("value_pct_50")
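For completeness, a hedged sketch of how that aggregation call could sit inside a groupBy (the df and its name/weigthCol/valueCol columns are illustrative, following the call above):

val pctDF = df
  .groupBy("name")
  .agg(ddsInitAgg50UDAF(col("weigthCol"), col("valueCol")).alias("value_pct_50"))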
I am trying to pass a list of date ranges that needs to be in the below format:
val predicates =
  Array(
    "2021-05-16" -> "2021-05-17",
    "2021-05-18" -> "2021-05-19",
    "2021-05-20" -> "2021-05-21")
I am then using map to create a range of conditions that will be passed to the jdbc method.
val predicates =
  Array(
    "2021-05-16" -> "2021-05-17",
    "2021-05-18" -> "2021-05-19",
    "2021-05-20" -> "2021-05-21"
  ).map { case (start, end) =>
    s"cast(NEW_DT as date) >= date '$start' AND cast(NEW_DT as date) <= date '$end'"
  }
The process will need to run daily and I need to dynamically populate these values, as I cannot use the hard-coded approach.
I need help returning these values from a method as incrementing (start_date, end_date) tuples that can generate an array like the one above. I had a rough idea like the one below, but as I am new to Scala I was not able to figure it out. Please help.
def predicateRange(start_date: String, end_date: String): Array[(String,String)] = {
// iterate over the date values and add + 1 to both start and end and return the tuple
}
This assumes that every range is the same duration, and that each date range starts the next day after the end of the previous range.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
def dateRanges(start: String,
               rangeLen: Int,
               ranges: Int): Array[(String, String)] = {
  val startDate =
    LocalDate.parse(start, DateTimeFormatter.ofPattern("yyyy-MM-dd"))
  Array.iterate(startDate -> startDate.plusDays(rangeLen), ranges) {
    case (_, end) => end.plusDays(1) -> end.plusDays(rangeLen + 1)
  }.map { case (s, e) => (s.toString, e.toString) }
}
usage:
dateRanges("2021-05-16", 1, 3)
//res0: Array[(String, String)] = Array((2021-05-16,2021-05-17), (2021-05-18,2021-05-19), (2021-05-20,2021-05-21))
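Since the process runs daily, the number of ranges can be derived from a fixed start date and today's date before building the JDBC predicates; a sketch under that assumption, reusing dateRanges from above (the anchor date and range length are illustrative):

import java.time.LocalDate
import java.time.temporal.ChronoUnit

val firstStart = "2021-05-16"                // assumed fixed anchor date
val rangeLen   = 1                           // each range spans rangeLen + 1 days
val daysSoFar  = ChronoUnit.DAYS.between(LocalDate.parse(firstStart), LocalDate.now()).toInt
val rangeCount = daysSoFar / (rangeLen + 1) + 1   // enough ranges to reach today

val predicates = dateRanges(firstStart, rangeLen, rangeCount).map { case (start, end) =>
  s"cast(NEW_DT as date) >= date '$start' AND cast(NEW_DT as date) <= date '$end'"
}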
You can use the following method to generate your tuple array:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
def generateArray(startDateString: String, endDateString: String): Array[(String, String)] = {
  val dateFormatter = DateTimeFormatter.ISO_LOCAL_DATE
  val startDate = LocalDate.parse(startDateString)
  val endDate = LocalDate.parse(endDateString)
  val daysCount = startDate.until(endDate).getDays
  val dateStringTuples = Array.tabulate(daysCount)(i => {
    val firstDate = startDate.plusDays(i)
    val secondDate = startDate.plusDays(i + 1)
    (dateFormatter.format(firstDate), dateFormatter.format(secondDate))
  })
  dateStringTuples
}
Usage:
println("--------------------------")
generateArray("2021-02-27", "2021-03-02").foreach(println)
println("--------------------------")
generateArray("2021-05-27", "2021-06-02").foreach(println)
println("--------------------------")
generateArray("2021-12-27", "2022-01-02").foreach(println)
println("--------------------------")
output :
--------------------------
(2021-02-27,2021-02-28)
(2021-02-28,2021-03-01)
(2021-03-01,2021-03-02)
--------------------------
(2021-05-27,2021-05-28)
(2021-05-28,2021-05-29)
(2021-05-29,2021-05-30)
(2021-05-30,2021-05-31)
(2021-05-31,2021-06-01)
(2021-06-01,2021-06-02)
--------------------------
(2021-12-27,2021-12-28)
(2021-12-28,2021-12-29)
(2021-12-29,2021-12-30)
(2021-12-30,2021-12-31)
(2021-12-31,2022-01-01)
(2022-01-01,2022-01-02)
--------------------------
I am trying to add dates (as strings) from an array into a Seq while determining whether each one is a weekend day.
import java.text.SimpleDateFormat
import java.util.{Calendar, Date, GregorianCalendar}
import org.apache.spark.sql.SparkSession

val arrDateEsti = dtfBaseNonLong.select("AAA").distinct().collect.map(_(0).toString);
val dateFormat = new SimpleDateFormat("yyyyMMdd"); // assumed; the dates look like "20201015"

var dtfDateCate = Seq(
  ("0000", "0")
);
for (a <- 0 to arrDateEsti.length - 1) {
  val dayDate: Date = dateFormat.parse(arrDateEsti(a));
  val cal = new GregorianCalendar
  cal.setTime(dayDate);
  if (cal.get(Calendar.DAY_OF_WEEK) == 1 || cal.get(Calendar.DAY_OF_WEEK) == 7) {
    dtfDateCate :+ (arrDateEsti(a), "1")
  } else {
    dtfDateCate :+ (arrDateEsti(a), "0")
  }
};
scala> dtfDateCate
res20: Seq[(String, String)] = List((0000,0))
It returns the same initial sequence. But if I run it on a single element, it works. What went wrong?
scala> val dayDate:Date = dateFormat.parse(arrDateEsti(0));
dayDate: java.util.Date = Thu Oct 15 00:00:00 CST 2020
scala> cal.setTime(dayDate);
scala> if (cal.get(Calendar.DAY_OF_WEEK)==1 || cal.get(Calendar.DAY_OF_WEEK)==7){
| dtfDateCate:+(arrDateEsti(0),"1")
| }else{
| dtfDateCate:+(arrDateEsti(0),"0")
| };
res14: Seq[(String, String)] = List((0000,0), (20201015,0))
I think this gets at what you're trying to do.
import java.time.LocalDate
import java.time.DayOfWeek.{SATURDAY, SUNDAY}
import java.time.format.DateTimeFormatter
//replace with dtfBaseNonLong.select(... code
val arrDateEsti = Seq("20201015", "20201017") //place holder
val dtFormat = DateTimeFormatter.ofPattern("yyyyMMdd")
val dtfDateCate = ("0000", "0") +:
  arrDateEsti.map { dt =>
    val day = LocalDate.parse(dt, dtFormat).getDayOfWeek()
    if (day == SATURDAY || day == SUNDAY) (dt, "1")
    else (dt, "0")
  }
//dtfDateCate = Seq((0000,0), (20201015,0), (20201017,1))
Yeah, it should be dtfDateCate = dtfDateCate :+ (arrDateEsti(a), "1"); the result of :+ has to be assigned back.
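Put together, a minimal fixed version of the loop from the question (a sketch; dateFormat is the yyyyMMdd formatter from the question):

var dtfDateCateFixed = Seq(("0000", "0"))
for (a <- arrDateEsti.indices) {
  val cal = new GregorianCalendar
  cal.setTime(dateFormat.parse(arrDateEsti(a)))
  val dow = cal.get(Calendar.DAY_OF_WEEK)
  val flag = if (dow == Calendar.SUNDAY || dow == Calendar.SATURDAY) "1" else "0"
  // Seq is immutable, so reassign the result of :+ to the var
  dtfDateCateFixed = dtfDateCateFixed :+ (arrDateEsti(a) -> flag)
}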
This thread arises from my previous question. I need to create a Seq[String] that contains paths as String elements; however, now I also need to add the numbers 7, 8, ..., 22 after a date. Also, I cannot use LocalDate as was suggested in the answer to the above-cited question:
path/file_2017-May-1-7
path/file_2017-May-1-8
...
path/file_2017-May-1-22
path/file_2017-April-30-7
path/file_2017-April-30-8
...
path/file_2017-April-30-22
..
I am searching for a flexible solution. My current solution relies on the manual definition of dates in the yyyy-MMM-dd format. However, it is not efficient if I need to include more than 2 dates, e.g. 10 or 100. Moreover, filePathsList is currently a Set[Seq[String]] and I don't know how to convert it into a Seq[String].
import java.text.SimpleDateFormat
import java.util.Calendar

val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val currDay = Calendar.getInstance
currDay.add(Calendar.DATE, -1)
val day_1_ago = currDay.getTime
currDay.add(Calendar.DATE, -1)
val day_2_ago = currDay.getTime
val dates = Set(formatter.format(day_1_ago), formatter.format(day_2_ago))

val filePathsList = dates.map(date => {
  var list = Seq.empty[String]
  for (num <- 7 to 22) {
    list = list :+ s"path/file_$date-$num"
  }
  list
})
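Converting the Set[Seq[String]] into a Seq[String], by the way, is a one-line flatten (a sketch, using the filePathsList from above):

val flatPaths: Seq[String] = filePathsList.toSeq.flatten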
Here is how I was able to achieve what you outlined; adjust the days val to configure the number of days you care about:
import java.text.SimpleDateFormat
import java.util.Calendar
val currDay = Calendar.getInstance
val days = 5
val dates = currDay.getTime +: List.fill(days) {
  currDay.add(Calendar.DATE, -1)
  currDay.getTime
}
val formatter = new SimpleDateFormat("yyyy-MMM-dd")
val filePathsList = for {
  date <- dates
  num  <- 7 to 22
} yield s"path/file_${formatter.format(date)}-$num"
I have a CSV in which a field is a datetime in a specific format. I cannot import it directly into my DataFrame because it needs to be a timestamp. So I import it as a string and convert it into a Timestamp like this:
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.Row
def getTimestamp(x: Any): Timestamp = {
  val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  if (x.toString() == "")
    return null
  else {
    val d = format.parse(x.toString());
    val t = new Timestamp(d.getTime());
    return t
  }
}

def convert(row: Row): Row = {
  val d1 = getTimestamp(row(3))
  return Row(row(0), row(1), row(2), d1)
}
Is there a better, more concise way to do this with the DataFrame API or Spark SQL? The above method requires creating an RDD and specifying the schema for the DataFrame again.
Spark >= 2.2
Since 2.2 you can provide the format string directly:
import org.apache.spark.sql.functions.to_timestamp
val ts = to_timestamp($"dts", "MM/dd/yyyy HH:mm:ss")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+-------------------+
// |id |dts |ts |
// +---+-------------------+-------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01|
// |2 |#$#### |null |
// +---+-------------------+-------------------+
Spark >= 1.6, < 2.2
You can use the date processing functions which were introduced in Spark 1.5. Assuming you have the following data:
val df = Seq((1L, "05/26/2016 01:01:01"), (2L, "#$####")).toDF("id", "dts")
You can use unix_timestamp to parse strings and cast the result to timestamp:
import org.apache.spark.sql.functions.unix_timestamp
val ts = unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("timestamp")
df.withColumn("ts", ts).show(2, false)
// +---+-------------------+---------------------+
// |id |dts |ts |
// +---+-------------------+---------------------+
// |1 |05/26/2016 01:01:01|2016-05-26 01:01:01.0|
// |2 |#$#### |null |
// +---+-------------------+---------------------+
As you can see it covers both parsing and error handling. The format string should be compatible with Java SimpleDateFormat.
Spark >= 1.5, < 1.6
You'll have to use something like this:
unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss").cast("double").cast("timestamp")
or
(unix_timestamp($"dts", "MM/dd/yyyy HH:mm:ss") * 1000).cast("timestamp")
due to SPARK-11724.
Spark < 1.5
You should be able to use these with expr and a HiveContext.
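A hedged sketch of what that could look like (assuming a HiveContext-backed DataFrame with the same id/dts columns as above; unix_timestamp and from_unixtime are called as Hive functions through selectExpr):

val dfWithTs = df.selectExpr(
  "id",
  "dts",
  // parse with the pattern, then round-trip through a string to get a timestamp
  "cast(from_unixtime(unix_timestamp(dts, 'MM/dd/yyyy HH:mm:ss')) as timestamp) as ts"
)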
I haven't played with Spark SQL yet, but I think this would be more idiomatic Scala (using null is not considered good practice):
import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.{Try, Success, Failure}

def getTimestamp(s: String): Option[Timestamp] = s match {
  case "" => None
  case _ => {
    val format = new SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
    Try(new Timestamp(format.parse(s).getTime)) match {
      case Success(t) => Some(t)
      case Failure(_) => None
    }
  }
}
Please notice that I assume you know the Row element types beforehand (if you read it from a CSV file, all of them are String); that's why I use a proper type like String and not Any (everything is a subtype of Any).
It also depends on how you want to handle parsing exceptions. In this case, if a parsing exception occurs, a None is simply returned.
You could use it further on with:
rows.map(row => Row(row(0), row(1), row(2), getTimestamp(row(3).toString)))
I have ISO 8601 timestamps in my dataset and I needed to convert them to the "yyyy-MM-dd" format. This is what I did:
import org.joda.time.{DateTime, DateTimeZone}
object DateUtils extends Serializable {
  def dtFromUtcSeconds(seconds: Int): DateTime = new DateTime(seconds * 1000L, DateTimeZone.UTC)
  def dtFromIso8601(isoString: String): DateTime = new DateTime(isoString, DateTimeZone.UTC)
}
sqlContext.udf.register("formatTimeStamp", (isoTimestamp : String) => DateUtils.dtFromIso8601(isoTimestamp).toString("yyyy-MM-dd"))
And you can just use the UDF in your Spark SQL query.
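For example (the table and column names are illustrative):

val dailyDF = sqlContext.sql("SELECT formatTimeStamp(event_time) AS event_date FROM events")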
Spark Version: 2.4.4
scala> import org.apache.spark.sql.types.TimestampType
import org.apache.spark.sql.types.TimestampType
scala> val df = Seq("2019-04-01 08:28:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]
scala> val df_mod = df.select($"ts".cast(TimestampType))
df_mod: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df_mod.printSchema()
root
|-- ts: timestamp (nullable = true)
I would like to move the getTimestamp method you wrote into the RDD's mapPartitions and reuse a GenericMutableRow among the rows in an iterator:
val strRdd = sc.textFile("hdfs://path/to/cvs-file")
val rowRdd: RDD[Row] = strRdd.map(_.split('\t')).mapPartitions { iter =>
  new Iterator[Row] {
    val row = new GenericMutableRow(4)
    var current: Array[String] = _

    def hasNext = iter.hasNext

    def next() = {
      current = iter.next()
      row(0) = current(0)
      row(1) = current(1)
      row(2) = current(2)

      val ts = getTimestamp(current(3))
      if (ts != null) {
        row.update(3, ts)
      } else {
        row.setNullAt(3)
      }
      row
    }
  }
}
And you should still use the schema to generate a DataFrame:
val df = sqlContext.createDataFrame(rowRdd, tableSchema)
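For reference, tableSchema could look like this (a sketch; the column names are illustrative, only the fourth field's TimestampType matters here):

import org.apache.spark.sql.types._

val tableSchema = StructType(Seq(
  StructField("col1", StringType,    nullable = true),
  StructField("col2", StringType,    nullable = true),
  StructField("col3", StringType,    nullable = true),
  StructField("ts",   TimestampType, nullable = true)
))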
The usage of GenericMutableRow inside an iterator implementation can be found in the Aggregate operator, InMemoryColumnarTableScan, ParquetTableOperations, etc.
I would use https://github.com/databricks/spark-csv
This will infer timestamps for you.
import com.databricks.spark.csv._
val rdd: RDD[String] = sc.textFile("csvfile.csv")
val df: DataFrame = new CsvParser()
  .withDelimiter('|')
  .withInferSchema(true)
  .withParseMode("DROPMALFORMED")
  .csvRdd(sqlContext, rdd)
I had some issues with to_timestamp where it was returning an empty string. After a lot of trial and error, I was able to get around it by casting as a timestamp and then casting back as a string. I hope this helps anyone else with the same issue:
df.columns.intersect(cols).foldLeft(df)((newDf, col) => {
  val conversionFunc = to_timestamp(newDf(col).cast("timestamp"), "MM/dd/yyyy HH:mm:ss").cast("string")
  newDf.withColumn(col, conversionFunc)
})