Iterating over timestamps (date and hour) - Scala

I have a function that takes a specific start timestamp and end timestamp as input (e.g. "2018-01-01 16:00:00" and "2018-01-01 17:00:00"); at the beginning of the code I import java.sql.Timestamp.
I want to iterate this function over time (e.g. from 2018-01-01 until 2018-01-10, over each hour separately).
The furthest I got so far was iterating over the date, using import java.time.{LocalDate, Period}
but when I tried to change my code to import java.time.{LocalDateTime, Period}, it didn't work:
import java.time.temporal.ChronoUnit
import java.time.temporal.ChronoField.HOUR_OF_DAY
import java.time.{LocalDateTime, Period}

val start = LocalDateTime.of(2018, 1, 1, 6, 20)
val end = LocalDateTime.of(2018, 1, 11, 6, 30)

// does not compile: LocalDateTime.toEpochSecond requires a ZoneOffset argument,
// and the range would count seconds while plusHours expects hours
val dates: IndexedSeq[LocalDateTime] =
  (0L to (end.toEpochSecond() - start.toEpochSecond())).map(hours =>
    start.plusHours(hours)
  )
dates.foreach(println)
Would highly appreciate your help!

You can take advantage of both Scala Streams and the LocalDateTime API to make things easier than what you tried, which is, let's say, a bit too low-level ^^ !
val allDatesBeforeEnd = Stream.iterate(start)(_.plusHours(1)).takeWhile(_.isBefore(end)).toList
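If the goal is then to feed each one-hour window into your function, a minimal usage sketch could look like this (myFunction stands in for your own function, and Timestamp.valueOf converts a LocalDateTime into the java.sql.Timestamp you mentioned):

import java.sql.Timestamp

// hypothetical usage: call the function once per consecutive one-hour window
allDatesBeforeEnd.foreach { hourStart =>
  myFunction(Timestamp.valueOf(hourStart), Timestamp.valueOf(hourStart.plusHours(1)))
}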

import java.time.{LocalDateTime, Period}

val start = LocalDateTime.of(2018, 1, 1, 6, 20)
val end = LocalDateTime.of(2018, 1, 11, 6, 30)

// note: Period.getDays only returns the days component of the period,
// so this only works while the whole span is shorter than a month
val periodInHours = Period.between(start.toLocalDate(), end.toLocalDate()).getDays * 24
val dates: IndexedSeq[LocalDateTime] = (0L to periodInHours).map(start.plusHours(_))
dates.foreach(println)

You could first get the number of hours between the two datetimes and then loop over the range formed by this number of hours to create the range of datetimes:
val allDatesBeforeEnd =
  (0L until ChronoUnit.HOURS.between(start, end)).map(start.plusHours(_))
Although I do prefer reading C4stor's solution, this one might be slightly better in terms of performance, as it doesn't perform the takeWhile/isBefore check on each iteration.

Try this, it prints all the hours between yesterday and today; just adapt it to your use case:
val yesterday = LocalDateTime.now().minusDays(1)
val today = LocalDateTime.now()

Stream.iterate(yesterday) { h =>
  h.plusHours(1)
}.takeWhile(_.isBefore(today)).foreach(println(_))
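As a small aside, Stream is deprecated since Scala 2.13 in favour of LazyList; a minimal equivalent sketch, assuming Scala 2.13+:

// same idea with LazyList, which replaces the deprecated Stream in Scala 2.13+
LazyList.iterate(yesterday)(_.plusHours(1)).takeWhile(_.isBefore(today)).foreach(println)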

Related

Serialization on rdd vs dataframe Spark

EX1. This with an RDD gives a serialization error, as we expect, with or without the wrapping object, val num being the culprit - fine:
object Example {
  val r = (1 to 1000000).toList
  val rdd = sc.parallelize(r, 3)
  val num = 1
  val rdd2 = rdd.map(_ + num)
  rdd2.collect
}
Example
EX2. Using a DataFrame in a similar fashion, however, does not. Why is that, given it looks sort of the same? What am I missing here?
object Example {
  import spark.implicits._
  import org.apache.spark.sql.functions._

  val n = 1
  val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4),
    ("r3", 4, 1),
    ("r4", 1, 2)
  )).toDF("ID", "a", "b")
  df.repartition(3).withColumn("plus1", $"b" + n).show(false)
}
Example
The reasons are not entirely clear to me for the DataFrame case; I would expect similar behaviour. It looks like Datasets circumvent some issues, but I may well be missing something.
I'm running on Databricks, which surfaces plenty of serialization issues in general, so I do not think the environment is affecting things; it's handy for testing.
The reason is simple and more fundamental than distinction between RDD and Dataset:
The first piece of code evaluates a function
_ + num
therefore it has to be computed and evaluated.
The second piece of code doesn't. The following
$"b" + n
is just a value, therefore no closure computation and subsequent serialization is required.
If this is still not clear you can think about it this way:
The former piece of code tells Spark how to do something.
The latter piece of code tells Spark what to do. Actual code that is executed is generated in different scope.
If your Dataset code were closer to its RDD counterpart, for example:
object Example {
  import spark.implicits._

  val num = 1
  spark.range(1000).map(_ + num).collect
}
or
object Example {
  import spark.implicits._
  import org.apache.spark.sql.functions._

  val num = 1
  val f = udf((x: Int) => x + num)
  spark.range(1000).select(f($"id")).collect
}
it would fail with a serialization exception, just as the RDD version does.
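As a side note, the usual workaround for the RDD case is to copy the field into a local variable inside a method, so the closure captures only the value and not the enclosing object; a minimal sketch, assuming the same notebook-provided sc as in the question:

object Example {
  val num = 1

  def run(): Array[Int] = {
    // local copy: the closure below captures an Int, not the enclosing Example object
    val localNum = num
    val rdd = sc.parallelize((1 to 1000000).toList, 3)
    rdd.map(_ + localNum).collect()
  }
}

Example.run()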

Weighted average with Spark Datasets without UDF

While someone has already asked about computing a Weighted Average in Spark, in this question, I'm asking about using Datasets/DataFrames instead of RDDs.
How do I compute a weighted average in Spark? I have two columns: counts and previous averages:
case class Stat(name: String, count: Int, average: Double)

val statset = spark.createDataset(Seq(
  Stat("NY", 1, 5.0),
  Stat("NY", 2, 1.5),
  Stat("LA", 12, 1.0),
  Stat("LA", 15, 3.0)))
I would like to be able to compute a weighted average like this:
display(statset.groupBy($"name").agg(sum($"count").as("count"),
  weightedAverage($"count", $"average").as("average")))
One can use a UDF to get close:
val weightedAverage = udf(
  (row: Row) => {
    val counts = row.getAs[WrappedArray[Int]](0)
    val averages = row.getAs[WrappedArray[Double]](1)
    val (count, total) = (counts zip averages).foldLeft((0, 0.0)) {
      case ((cumcount: Int, cumtotal: Double), (newcount: Int, newaverage: Double)) =>
        (cumcount + newcount, cumtotal + newcount * newaverage)
    }
    total / count // Tested by returning count here and then extracting. Got same result as sum.
  }
)

display(statset.groupBy($"name").agg(sum($"count").as("count"),
  weightedAverage(struct(collect_list($"count"),
    collect_list($"average"))).as("average")))
(Thanks to answers to Passing a list of tuples as a parameter to a spark udf in scala for help in writing this)
Newbies: Use these imports:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.WrappedArray
Is there a way of accomplishing this with built-in column functions instead of UDFs? The UDF feels clunky, and if the numbers get large you have to convert the Ints to Longs.
Looks like you could do it in two passes:
val totalCount = statset.select(sum($"count")).collect.head.getLong(0)
statset.select(lit(totalCount) as "count", sum($"average" * $"count" / lit(totalCount)) as "average").show
Or, including the groupBy you just added:
display(statset.groupBy($"name").agg(sum($"count").as("count"),
    sum($"count" * $"average").as("total"))
  .select($"name", $"count", ($"total" / $"count")))

Spark UDF called more than once per record when DF has too many columns

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computation (a physics simulation) on a dataframe containing some input data, and building up a result DataFrame containing many columns (~40).
Strangely, my UDF is called more than once per record of my input DataFrame in this case (1.6 times more often), which I find unacceptable because it's very expensive. If I reduce the number of columns (e.g. to 20), this behaviour disappears.
I managed to write down a small script which demonstrates this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo {

  case class Result(a: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val numRuns = sc.accumulator(0) // to count the number of UDF calls

    val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) })

    val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")

    // get results of UDF
    var results = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")

    // add many columns to dataframe (must depend on the UDF's result)
    for (i <- 1 to 42) {
      results = results.withColumn(s"col_$i", $"result")
    }

    // trigger action
    val res = results.collect()
    println(res.size)      // prints 100
    println(numRuns.value) // prints 160
  }
}
Now, is there a way to solve this without reducing the number of columns?
I can't really explain this behaviour - but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF), we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected - the UDF is called exactly 100 times:
// get results of UDF
var results = data
  .withColumn("tmp", myUdf($"id"))
  .withColumn("result", $"tmp.a")
  .cache()
Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
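As an aside, if memory is the worry, the same trick works with a storage level that is allowed to spill to disk; a sketch, assuming the same data and myUdf as in the question:

import org.apache.spark.storage.StorageLevel

// cache the intermediate result, but let it spill to disk if it does not fit in memory
var results = data
  .withColumn("tmp", myUdf($"id"))
  .withColumn("result", $"tmp.a")
  .persist(StorageLevel.MEMORY_AND_DISK)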
We had this same problem about a year ago and spent a lot of time until we finally figured out what the problem was.
We also had a very expensive UDF to calculate, and we found out that it gets calculated again and again every time we refer to its column. It just happened to us again a few days ago, so I decided to open a bug on this:
SPARK-18748
We also opened a question here then, but now I see the title wasn't so good:
Trying to turn a blob into multiple columns in Spark
I agree with Tzach about somehow "forcing" the plan to calculate the UDF. Our solution was uglier, but we had to do it because we couldn't cache() the data - it was too big:
val df = data.withColumn("tmp", myUdf($"id"))
val results = sqlContext.createDataFrame(df.rdd, df.schema)
  .withColumn("result", $"tmp.a")
Update:
Now I see that my Jira ticket was linked to another one, SPARK-17728, which still didn't really handle this issue the right way, but it gives one more optional workaround:
val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
  .withColumn("result", $"tmp.a")
In newer Spark versions (2.3+) we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction
i.e. use
val myUdf = udf(...).asNondeterministic()
This makes sure the UDF is only called once per record.
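Applied to the Demo code from the question, that would look roughly like this (a sketch, assuming Spark 2.3+):

// marking the UDF non-deterministic keeps the optimizer from re-evaluating it per column
val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) }).asNondeterministic()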

How do I write a script in Scala that converts the current time to the number of seconds since midnight?

I'm new to all of this; I'm currently taking CSC 101 and learning how to use Scala. This is our first assignment, and I don't know where to start. I was wondering if you could explain what each variable means and what it stands for?
I've been seeing the code below a lot, and I don't know what the percent signs mean or what they do.
val s = System.currentTimeMillis / 1000
val m = (s/60) % 60
val h = (s/60/60) % 24
Unfortunately it is not very straightforward. Here is some code for you:
import java.text.SimpleDateFormat
import java.util.Calendar
val formatter = new SimpleDateFormat("yyyy/MM/dd")
val cal = Calendar.getInstance()
val now = cal.getTime()
val thisDay = formatter.format(now)
val midnight = formatter.parse(thisDay)
// milliseconds since midnight
val msSinceMidnight = now.getTime - midnight.getTime
val secSinceMidnight = msSinceMidnight / 1000
println(secSinceMidnight)
You have to use the Java APIs as shown above, or you could choose to use the Joda-Time library: http://www.joda.org/joda-time/.
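For what it's worth, the % in the snippet from the question is the remainder (modulo) operator (e.g. 125 % 60 is 5), and on Java 8 or newer the java.time API turns the whole task into a one-liner; a minimal sketch:

import java.time.LocalTime

// seconds elapsed since local midnight, using java.time (Java 8+)
val secSinceMidnight = LocalTime.now().toSecondOfDay
println(secSinceMidnight)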

Splitting a number into parts in Scala

I am trying to split a number, like
20130405
into three parts: year, month and day. One way is to convert it to a string and use a regex, something like:
(\d{4})(\d{2})(\d{2}).r
A better way is to divide it by 100. Something like:
var date = dateNumber
val day = date % 100
date /= 100
val month = date % 100
date /= 100
val year = date
I get itchy while using 'var' in Scala. Is there a better way to do it?
I would go with the former:
scala> val regex = """(\d{4})(\d{2})(\d{2})""".r
regex: scala.util.matching.Regex = (\d{4})(\d{2})(\d{2})
scala> val regex(year, month, day) = "20130405"
year: String = 2013
month: String = 04
day: String = 05
This is probably not much better than your own solution, but it doesn't use var and doesn't require transforming the number to a string. On the other hand, it's not very safe - if you're not 100% sure that your number is going to be well formatted, better use a SimpleDateFormat - granted, it's more expensive, but at least you're safe from illegal input.
val num = 20130405
val (date, month, year) = (num % 100, num / 100 % 100, num / 10000)
println(date) // Prints 5
println(month) // Prints 4
println(year) // Prints 2013
I'd personally use a SimpleDateFormat even if I were sure the input will always be legal. The only certainty there is is that I'm wrong and the input will someday be illegal.
Better than substrings would be to use the Java Date and SimpleDateFormat classes; see:
https://stackoverflow.com/a/4216767/88588
Not very Scala-ish, but...
scala> import java.util.Calendar
import java.util.Calendar
scala> val format = new java.text.SimpleDateFormat("yyyyMMdd")
format: java.text.SimpleDateFormat = java.text.SimpleDateFormat@ef87e460
scala> format.format(new java.util.Date())
res0: String = 20131025
scala> val d=format.parse("20130405")
d: java.util.Date = Fri Apr 05 00:00:00 CEST 2013
scala> val calendar = Calendar.getInstance
calendar: java.util.Calendar = [cut...]
scala> calendar.setTime(d)
scala> calendar.get(Calendar.YEAR)
res1: Int = 2013
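As an aside, on Java 8+ the java.time API handles the same parse without a mutable Calendar; a minimal sketch using the built-in BASIC_ISO_DATE (yyyyMMdd) formatter:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// parse the compact yyyyMMdd form and pull out the three parts
val d = LocalDate.parse("20130405", DateTimeFormatter.BASIC_ISO_DATE)
val (year, month, day) = (d.getYear, d.getMonthValue, d.getDayOfMonth)
// year = 2013, month = 4, day = 5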