Last Date Id Of Previous Months - scala

My data frame has a DateId column (i.e. an integer defining a date as the number of days since 1993-06-25). The objective is to calculate the date id of the last day of the month prior to each date in the column:
DateId -> intermediate calc Date -> result LastDayOfPriorMonthId
9063 -> 2018-04-18 -> 9045 (i.e. 2018-03-31)
8771 -> 2017-06-30 -> 8741 (i.e. 2017-05-31)
9175 -> 2018-08-08 -> 9167 (i.e. 2018-07-31)
The solution should be really easy, but I'm running into issues with type conversion:
val a = Seq(9063, 8771, 9175).toDF("DateId")
val timeStart = to_date(lit("1993-06-25"))
val dateIdAdd : (Column) => Column = x => {x - date_add(timeStart, x).DATE_OF_MONTH}
The function compilation is failing with following error:
notebook:2: error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
x - date_add(timeStart, x).DATE_OF_MONTH
Expressions like .cast(IntegerType) do not change the outcome (x is still a Spark Column type, and .cast(Int) is not applicable).
Please note: a similar problem was addressed in this SO question, but the same approach fails when the timeStart constant is applied here. Also, a function would be preferred over an expression, because the same calculation is used for multiple columns in the real data.

Can you translate from Java? Sorry, I don’t code Scala (yet).
import java.time.LocalDate;
import java.time.Month;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

private static final LocalDate baseDate = LocalDate.of(1993, Month.JUNE, 25);

public static long dateIdAdd(long dateId) {
    LocalDate date = baseDate.plusDays(dateId);
    LocalDate lastOfPrevMonth = YearMonth.from(date).minusMonths(1).atEndOfMonth();
    return ChronoUnit.DAYS.between(baseDate, lastOfPrevMonth);
}
Edit: according to you (Dan, the asker), the Scala version is:
import java.time.{LocalDate, Month, YearMonth}
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.udf

val baseDate = LocalDate.of(1993, Month.JUNE, 25)

val lastDayIdOfPriorMonth = udf((dateId: Long) => {
  val date = baseDate.plusDays(dateId)
  val lastOfPrevMonth = YearMonth.from(date).minusMonths(1).atEndOfMonth()
  ChronoUnit.DAYS.between(baseDate, lastOfPrevMonth)
})
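A quick usage sketch of that UDF against the example dataframe a from the question (the explicit cast to long and the col helper are additions of mine, so treat this as an assumption about the wiring):
import org.apache.spark.sql.functions.col

// DateId is an Int column, so cast it to long to match the UDF's parameter type
a.withColumn("LastDayOfPriorMonthId", lastDayIdOfPriorMonth(col("DateId").cast("long"))).show()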
Let’s try it with your example dates (Java again):
System.out.println("9063 -> " + dateIdAdd(9063));
System.out.println("8771 -> " + dateIdAdd(8771));
System.out.println("9175 -> " + dateIdAdd(9175));
This prints:
9063 -> 9045
8771 -> 8741
9175 -> 9167
In your question you gave 9176 as the desired result in the last case, but I believe that was a typo?
And please enjoy how clear and self-explanatory the code is.

After testing many options with a Scala conversion function, a hack based on a UDF with Java strings and SimpleDateFormat is the only thing I could figure out:
import java.text.SimpleDateFormat
import java.util.Date

val dateIdAdd = udf((dateId: Long) => {
  val d = new SimpleDateFormat("yyyy-MM-dd")
  val ts = d.parse("1993-06-25")
  val tc = d.format(new Date(ts.getTime() + (24 * 3600 * 1000 * dateId)))
  // subtract the day-of-month (last two characters of yyyy-MM-dd) to land on the last day of the prior month
  dateId - Integer.parseInt(tc.substring(tc.length() - 2))
})
After adding another support function for validation and a simple select:
val dateIdToDate = udf((dateId: Long) => {
  val d = new SimpleDateFormat("yyyy-MM-dd")
  val ts = d.parse("1993-06-25")
  d.format(new Date(ts.getTime() + (24 * 3600 * 1000 * dateId)))
})
val aa = a.select($"*"
  , dateIdToDate($"DateId") as "CalcDateFromId"
  , dateIdAdd($"DateId") as "CalcLastDayOfMonthId")
display(aa)
The expected results are generated (but I doubt this is the most efficient approach available):
DateId CalcDateFromId CalcLastDayOfMonthId
9063 4/18/2018 9045
8771 6/30/2017 8741
9175 8/8/2018 9167
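If the UDF overhead ever matters, here is a minimal UDF-free sketch using only built-in Column functions (the dataframe a and the 1993-06-25 epoch come from the question; everything else here is an assumption):
import org.apache.spark.sql.functions._

// DateId -> calendar date -> last day of the prior month -> back to a day offset from 1993-06-25
val withIds = a
  .withColumn("CalcDate", expr("date_add(to_date('1993-06-25'), DateId)"))
  .withColumn("CalcLastDayOfMonthId", datediff(last_day(add_months(col("CalcDate"), -1)), to_date(lit("1993-06-25"))))
withIds.show()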

Retain MAX value of Aerospike CDT List

Context
Consider a stream of tuples (string, timestamp), with the goal of maintaining a bin containing a Map from each string to the minimal timestamp received for it.
Another constraint is that the update of this Map must be atomic.
For this input example:
("a", 1)
("b", 2)
("a", 3)
the expected output is Map("a" -> 1, "b" -> 2), without the later timestamp 3.
The current implementation I chose uses a list CDT to hold the timestamps, so my result is Map("a" -> ArrayList(1), "b" -> ArrayList(2)), built as follows:
private val boundedAndOrderedListPolicy = new ListPolicy(ListOrder.ORDERED, ListWriteFlags.INSERT_BOUNDED)

private def bundleContext(str: String) =
  CTX.mapKeyCreate(Value.get(str), MapOrder.UNORDERED)

private def buildInitTimestampOp(tuple: (String, Long)): List[Operation] = {
  // having these two ops together ensures that the size of the list is always 1
  val str = tuple._1
  val timestamp = tuple._2
  val bundleCtx: CTX = bundleContext(str)
  List(
    ListOperation.append(boundedAndOrderedListPolicy, initBin, Value.get(timestamp), bundleCtx),
    ListOperation.removeByRankRange(initBin, 1, ListReturnType.NONE, bundleCtx) // keep the first element of the ordered list - earliest time
  )
}
This works as expected. However, if you have a better way to achieve this without the list and in an atomic manner - I would love to hear it.
My question
What does not work for me is retaining the max timestamp received for each input str. For the example above, the desired result should be Map("a" -> ArrayList(3), "b" -> ArrayList(2)). My implementation attempt is:
private def buildLastSeenTimestampOp(tuple: (String, Long)): List[Operation] = {
  // having these two ops together ensures that the size of the list is always 1
  val str = tuple._1
  val timestamp = tuple._2
  val bundleCtx: CTX = bundleContext(str)
  List(
    ListOperation.append(boundedAndOrderedListPolicy, lastSeenBin, Value.get(timestamp), bundleCtx),
    ListOperation.removeByRankRange(lastSeenBin, 1, ListReturnType.NONE | ListReturnType.REVERSE_RANK, bundleCtx) // keep the last element of the ordered list - latest time
  )
}
Any idea why it doesn't work?
So, I've solved it:
private def buildLastSeenTimestampOp(tuple: (String, Long)): List[Operation] = {
  // having these two ops together ensures that the size of the list is always 1
  val str = tuple._1
  val timestamp = tuple._2
  val bundleCtx: CTX = bundleContext(str)
  List(
    ListOperation.append(boundedAndOrderedListPolicy, lastSeenBin, Value.get(timestamp), bundleCtx),
    ListOperation.removeByRankRange(lastSeenBin, -1, ListReturnType.NONE | ListReturnType.INVERTED, bundleCtx) // keep the last element of the ordered list - latest time
  )
}
When dealing with rank/index (removeByIndexRange for indexes) in ascending order, -1 represents the maximum rank/index.
Using ListReturnType.INVERTED retains the range (or single element, in this case) selected by the initial rank/index and count, by deleting all elements of the list that are not in the selected range.
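For completeness, a minimal sketch of applying such an operation list in one atomic operate() call (the client host, namespace, set and key below are assumptions, not part of the original code):
import com.aerospike.client.{AerospikeClient, Key}

val client = new AerospikeClient("127.0.0.1", 3000)
val key = new Key("test", "timestamps", "some-record-key")
// both list operations execute inside a single record transaction, so the bin update stays atomic
val ops = buildLastSeenTimestampOp(("a", 3L))
client.operate(client.writePolicyDefault, key, ops: _*)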

How to best read a file and convert to a spark sqlContext dataset

I want to read a tab-separated file with no header (sample rows below):
196 242 3 881250949
186 302 3 891717742
I have 2 solutions to read the file and convert it to a Dataset. Can anybody tell me which is the better solution and why?
Solution1
final case class Movie(movieID: Int)
import spark.implicits._
val moviesDS1 = spark.sparkContext.textFile("file path")
.map(x => Movie(x.split("\t")(1).toInt))
.toDS
.select("movieID")
Solution 2
final case class Movie(movieID: Int, Somenum1:Int, Somenum2: Int, Somenum3:Int)
import spark.implicits._
var movieSchema = Encoders.product[Movie].schema
val moviesDS2 = spark.read.options(Map("delimiter" -> "\t"))
.schema(movieSchema)
.csv("file path")
.select("movieID")
Solution 2 is always going to be at least 5x faster than Solution 1.
Solution 2 also provides implicit validation against your input data and marks all the column values null if there is a single schema mismatch.
Solution 2 also uses the higher-level reader API, which gives you a DataFrame directly, whereas Solution 1 first loads the data as an RDD and then transforms it to a Dataset.
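To make the validation point concrete, a small illustrative sketch (the file path and the malformed row are assumptions): with the explicit schema, a row such as 196<TAB>abc<TAB>3<TAB>881250949 is treated as malformed, and the reader's mode option controls what happens to it:
// the default PERMISSIVE mode nulls out the malformed record; DROPMALFORMED removes it entirely
spark.read.options(Map("delimiter" -> "\t", "mode" -> "DROPMALFORMED"))
  .schema(movieSchema)
  .csv("file path")
  .show(false)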
OK, let's have a small test to see which one is better.
final case class Movie(movieID: Int)

exec {
  import spark.implicits._
  val moviesDS1 = spark.sparkContext.textFile("mydata.csv/Movies.csv").toDS()
    .map { x =>
      Movie(x.split("\t")(0).toInt)
    }
    .select("movieID").show(false)
  // moviesDS1.show(false)
}

final case class Movie1(movieID: Int, Somenum1: Int, Somenum2: Int, Somenum3: Int)

exec {
  var movieSchema = Encoders.product[Movie1].schema
  val moviesDS2 = spark.read.options(Map("delimiter" -> "\t"))
    .schema(movieSchema)
    .csv("mydata.csv/Movies.csv")
    .select("movieID")
  moviesDS2.show(false)
}
where the exec method measures time in nanoseconds:
/**
 * Evaluates the given block and prints its result together with the elapsed time.
 *
 * @param f
 * @tparam T
 * @return
 */
def exec[T](f: => T) = {
  val starttime = System.nanoTime()
  println("t = " + f)
  val endtime = System.nanoTime()
  val elapsedTime = (endtime - starttime)
  // import java.util.concurrent.TimeUnit
  // val convertToSeconds = TimeUnit.MINUTES.convert(elapsedTime, TimeUnit.NANOSECONDS)
  println("time Elapsed " + elapsedTime)
}
Result:
+-------+
|movieID|
+-------+
|196 |
|186 |
+-------+
t = ()
time Elapsed 5053611258
+-------+
|movieID|
+-------+
|196 |
|186 |
+-------+
t = ()
time Elapsed 573163989
Conclusion:
As per the numbers, the second approach is faster (573163989 ns < 5053611258 ns) than the first.
In Solution 1 we have to take care of parsing and mapping to the corresponding case class, whereas Solution 2 eliminates that low-level parsing and mapping of data to a case class.
So Solution 2 is the better option.
So @QuickSilver's answer is right for this test case.
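As a small aside (a sketch, not part of either original solution): Solution 2 can also produce a typed Dataset instead of a DataFrame by adding .as[...], which keeps the schema validation and gives compile-time field access:
import spark.implicits._

val typedMovies = spark.read.options(Map("delimiter" -> "\t"))
  .schema(movieSchema)
  .csv("mydata.csv/Movies.csv")
  .as[Movie1]            // Dataset[Movie1] rather than a plain DataFrame
typedMovies.map(_.movieID).show(false)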

How to combine the results of spark computations in the following case?

The task is to calculate the average of each of the columns corresponding to each class. The class number is given in the first column.
I am giving part of the test file for clarity:
2 0.819039 -0.408442 0.120827
3 -0.063763 0.060122 0.250393
4 -0.304877 0.379067 0.092391
5 -0.168923 0.044400 0.074417
1 0.053700 -0.088746 0.228501
2 0.196758 0.035607 0.008134
3 0.006971 -0.096478 0.123718
4 0.084281 0.278343 -0.350414
So the task is to calculate
1: avg(), avg(), avg()
...
I am very new to Scala. After juggling with the code a lot, I came up with the following:
val inputfile = sc.textFile("testfile.txt")
val myArray = inputfile.map { line =>
  line.split(" ").toList
}

var Avgmap: Map[String, List[Double]] = Map()
var countmap: Map[String, Int] = Map()

for (a <- myArray) {
  //println( "Value of a: " + a + " " + a.size );
  if (!countmap.contains(a(0))) {
    countmap += (a(0) -> 0)
    Avgmap += (a(0) -> List.fill(a.size - 1)(1.0))
  }
  var c = countmap(a(0)) + 1
  val countmap2 = countmap + (a(0) -> c)
  countmap = countmap2

  var p = List[Double]()
  for (i <- 1 to a.size - 1) {
    var temp = (Avgmap(a(0))(i - 1) * (countmap(a(0)) - 1) + a(i).toDouble) / countmap(a(0))
    // println("i: "+i+" temp: "+temp)
    var q = p :+ temp
    p = q
  }
  val Avgmap2 = Avgmap + (a(0) -> p)
  Avgmap = Avgmap2

  println("--------------------------------------------------")
  println(countmap)
  println(Avgmap)
}
When I execute this code I seem to be getting the results in two halves of the dataset. Please help me combine them.
Edit: about the variables I am using: countmap keeps a record of class number -> number of vectors encountered so far. Similarly, Avgmap keeps a record of the running average of each column corresponding to that key.
First, use the DataFrame API. Second, what you want is just one row:
import org.apache.spark.sql.functions._
df.select(df.columns.map(c => mean(col(c))): _*).show
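If the per-class breakdown from the question is needed, the same DataFrame API extends naturally with groupBy; here is a hedged sketch (the column names c1..c3 and the file path are assumptions):
import org.apache.spark.sql.functions._

val df = spark.read
  .option("delimiter", " ")
  .csv("testfile.txt")
  .toDF("class", "c1", "c2", "c3")
  .select(col("class"), col("c1").cast("double"), col("c2").cast("double"), col("c3").cast("double"))

// one row per class, with the average of every value column
df.groupBy("class")
  .agg(avg("c1"), avg("c2"), avg("c3"))
  .show()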

How can I benchmark performance in Spark console?

I have just started using Spark and my interactions with it revolve around spark-shell at the moment. I would like to benchmark how long various commands take, but could not find how to get the time or run a benchmark. Ideally I would want to do something super-simple, such as:
val t = [current_time]
data.map(etc).distinct().reduceByKey(_ + _)
println([current time] - t)
Edit: Figured it out --
import org.joda.time._
val t_start = DateTime.now()
[[do stuff]]
val t_end = DateTime.now()
new Period(t_start, t_end).toStandardSeconds()
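As an aside, on Spark 2.1+ the spark session available in spark-shell also has a small built-in timer, so the same measurement works without joda-time (a sketch with a throwaway job):
// prints something like "Time taken: 1234 ms" and returns the block's result
spark.time {
  spark.range(100000000).selectExpr("sum(id)").collect()
}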
I suggest you do the following:
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e9 + " seconds")
  ret
}
You can pass an expression as an argument to the time function; it will compute the result of that expression and print the time it took to evaluate.
Let's consider a function foobar that takes data as an argument; then do the following:
val test = time(foobar(data))
test will contain the result of foobar, and the time needed will be printed as well.
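One caveat worth adding (my note, not part of the original answer): Spark transformations are lazy, so make sure the block you pass to time ends with an action (count, collect, ...), otherwise you mostly measure the cost of building the lineage rather than the job itself:
val rdd = sc.parallelize(1 to 1000000).map(x => (x % 10, 1))
// the collect() is what actually triggers the distributed work being timed
val result = time { rdd.reduceByKey(_ + _).collect() }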

Scala - What type are the numbers in the List using x.toString.toList?

I have written a function in Scala that should calculate the sum of the squares of the digits of a number, e.g. 44 -> 32 (4^2 + 4^2 = 16 + 16 = 32).
Here it is:
def digitSum(x: BigInt): BigInt = {
  var sum = 0
  val leng = x.toString.toList.length
  var y = x.toString.toList
  for (i <- 0 until leng) {
    sum += y(i).toInt * y(i).toInt
  }
  return sum
}
However, when I call the function, say with digitSum(44), I get 5408 instead of 32.
Why is this happening? Does it have to do with the fact that the list contains Strings? If so, why does the .toInt method not work?
Thanks!
The answer to your question has already been covered here: Scala int value of String characters. Have a good read through and you will have more information than required ;)
Also, looking at your code, it could benefit from Scala's expressiveness and functional features. The same function can be written in the following manner:
def digitSum(x: BigInt) = x.toString
  .map(_.asDigit)
  .map(a => a * a)
  .sum
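For a quick sanity check of the rewritten version:
digitSum(44) // 4*4 + 4*4 = 32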
In the future, try to avoid mutable variables and index-based loops if you can.
When you call toList on the String you get Chars, not Ints; calling .toInt on a Char gives its character code rather than its digit value. This is what it looks like in the REPL:
scala> "1".toList.map(_.toInt)
res0: List[Int] = List(49)
What you want is probably something like this:
def digitSum(x: BigInt): BigInt = {
  var sum = 0
  val leng = x.toString.toList.length
  var y = x.toString.toList
  for (i <- 0 until leng) {
    sum += (y(i).toInt - 48) * (y(i).toInt - 48) // Subtract out char base
  }
  sum
}