Type parameterized arithmetic? - scala

Trying to think of a way to subtract 5 minutes from 2 hours.
It doesn't make sense to subtract 5 from 2, because we end up with -3 generic time units, which is useless. But if "hour" is a subtype of "minute", we could convert 2 hours to 120 minutes, and yield 115 minutes, or 1 hour and 55 minutes.
Similarly, if we want to add 5 apples to 5 oranges, we cannot evaluate this in terms of apples, but might expect to end up with 10 fruit.
It seems in the above examples, and generally when using a number as an adjective, the integers need to be parameterized by the type of object they are describing. I think it would be very useful if instead of declaring
val hours = 2
val minutes = 5
you could do something like
val hours = 2[Hour]
val minutes = 5[Minute]
val result = hours - minutes
assert (result == 115[Minute])
Does anything like this exist, would it be useful, and is it something that could be implemented?
EDIT: to clarify, the time example above is just a random example I thought up. My question is more whether in general the idea of parameterized Numerics is a useful concept, just as you have parameterized Lists etc. (The answer might be "no", I don't know!)
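For concreteness, here is a minimal sketch (all names invented for illustration) of what such parameterized numbers could look like, using a phantom type parameter much like the element type of a parameterized List:

```scala
// A quantity tagged with a phantom type U recording what it counts
final case class Qty[U](n: Int) {
  // only quantities of the same unit U can be added
  def +(other: Qty[U]): Qty[U] = Qty(n + other.n)
}
trait Apple
trait Orange

val apples = Qty[Apple](5) + Qty[Apple](3) // fine
// Qty[Apple](5) + Qty[Orange](5)          // would not compile
assert(apples.n == 8)
```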

You can accomplish this by having two classes for Hours and Minutes, along with an implicit conversion function from hours to minutes
trait TimeUnit
case class Hour(num: Int) extends TimeUnit
case class Minute(num: Int) extends TimeUnit {
  def -(sub: Minute): Minute = Minute(num - sub.num)
}
implicit def hour2Minute(hour: Hour): Minute = Minute(hour.num * 60)
This allows you to do something like
val h = Hour(2) - Minute(30) //returns Minute(90)

You can find some examples for this in the lift framework (spec).
import net.liftweb.utils.TimeHelpers._
3.minutes == 6 * 30.seconds
(Note: it seems the values need to be within their normal ranges for comparisons to work correctly, e.g. no more than 60 seconds.)

You might try scala-time, which is a wrapper around Joda Time and makes it a bit more idiomatic for Scala, including some DSL to do time period computations, similar to what Brian Agnew suggested in his answer.
For instance,
2.hours + 45.minutes + 10.seconds
creates a Joda Period.

It seems to me a DSL would be of use here. So you could write
2.hours - 5.minutes
and the appropriate conversions would take place to convert 2 hours into a Hours object (value 2) etc.
Lots of resources exist describing Scala's DSL capabilities. e.g. see this from O'Reilly
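As a rough illustration of such a DSL (all names here are invented for the sketch, not from any library), an implicit class on Int plus normalization to minutes is enough:

```scala
object TimeDsl {
  final case class Minutes(value: Int) {
    def -(other: Minutes): Minutes = Minutes(value - other.value)
    def +(other: Minutes): Minutes = Minutes(value + other.value)
  }

  // Enrich Int with .hours / .minutes; hours normalize to minutes
  implicit class TimeInt(private val n: Int) extends AnyVal {
    def hours: Minutes   = Minutes(n * 60)
    def minutes: Minutes = Minutes(n)
  }
}

object TimeDslDemo extends App {
  import TimeDsl._
  val result = 2.hours - 5.minutes
  assert(result == Minutes(115))
}
```

Normalizing to the finest unit keeps subtraction well-defined, which is exactly the Hour-to-Minute conversion idea from the first answer.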

Related

Scala - divide the dataset into dataset of arrays with a fixed size

I have a function whose purpose is to divide a dataset into arrays of a given size.
For example: I have a dataset with 123 objects of the Foo type and I pass arraysSize = 10 to the function; as a result I will have a Dataset[Array[Foo]] with 12 arrays of 10 Foo's and 1 array with 3 Foo's.
Right now the function works on collected data. I would like to change it to operate on the Dataset directly for performance reasons, but I don't know how.
This is my current solution:
private def mapToFooArrays(data: Dataset[Foo],
                           arraysSize: Int): Dataset[Array[Foo]] = {
  data.collect().grouped(arraysSize).toSeq.toDS()
}
The reason for doing this transformation is because the data will be sent in the event. Instead of sending 1 million events with information about 1 object, I prefer to send, for example, 10 thousand events with information about 100 objects
IMO, this is a weird use case. I can not think of any efficient solution to do this, as it is going to require a lot of shuffling no matter how we do it.
But, the following is still better, as it avoids collecting to the driver node and will thus be more scalable.
Things to keep in mind -
what is the value of data.count() ?
what is the size of a single Foo ?
what is the value of arraySize ?
what is your executor configuration ?
Based on these factors you will be able to come up with the desiredArraysPerPartition value.
val desiredArraysPerPartition = 50

private def mapToFooArrays(
    data: Dataset[Foo],
    arraysSize: Int
): Dataset[Array[Foo]] = {
  val size = data.count()
  val numArrays = (size.toDouble / arraysSize).ceil
  val numPartitions = (numArrays / desiredArraysPerPartition).ceil.toInt

  data
    .repartition(numPartitions)
    .mapPartitions(_.grouped(arraysSize).map(_.toArray))
}
After reading the edited part, I think that the exact size of 100 in "10 thousand events with information about 100 objects" is not really important, since it is described as about 100. There can be more than one event with fewer than 100 Foo's.
If we are not very strict about that size of 100, then there is no need to reshuffle.
We can locally group the Foo's present in each of the partitions. As this grouping is being done locally and not globally, this might result in more than one (potentially one for each partition) Arrays with less than 100 Foo's.
private def mapToFooArrays(
    data: Dataset[Foo],
    arraysSize: Int
): Dataset[Array[Foo]] =
  data
    .mapPartitions(_.grouped(arraysSize).map(_.toArray))
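Both versions rely on Iterator#grouped, which chunks elements into fixed-size groups with one possibly smaller group at the end; its behavior is easy to check outside Spark using the question's 123/10 example:

```scala
// Iterator#grouped chunks an iterator into fixed-size groups,
// leaving at most one smaller group at the end
val groups = (1 to 123).iterator.grouped(10).map(_.toArray).toList
assert(groups.length == 13)
assert(groups.init.forall(_.length == 10))
assert(groups.last.length == 3)
```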

Scala Future multiple cores - slow performance

import java.util.concurrent.{Executors, TimeUnit}
import scala.annotation.tailrec
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.util.{Failure, Success}
object Fact extends App {
  def time[R](block: => R): Long = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    val t: Long = TimeUnit.SECONDS.convert(t1 - t0, TimeUnit.NANOSECONDS)
    //println("Time taken seconds: " + t)
    t
  }

  def factorial(n: BigInt): BigInt = {
    @tailrec
    def process(n: BigInt, acc: BigInt): BigInt = {
      //println(acc)
      if (n <= 0) acc
      else process(n - 1, n * acc)
    }
    process(n, 1)
  }

  implicit val ec =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

  val f1: Future[Stream[Long]] =
    Future.sequence(
      (1 to 50).toStream.map(x => Future { time(factorial(100000)) }))

  f1.onComplete {
    case Success(s) => println("Success : " + s.foldLeft(0L)(_ + _) + " seconds!")
    case Failure(f) => println("Fails " + f)
  }

  import scala.concurrent.duration._
  Await.ready(Future { 10 }, 10000 minutes)
}
I have the above factorial code that needs to use multiple cores so the program completes faster.
So I change,
Executors.newFixedThreadPool(1) to utilize 1 core
Executors.newFixedThreadPool(2) to utilize 2 cores
With 1 core, the result appears in 127 seconds.
But with 2 cores, I get 157 seconds.
My doubt is: when I increase the number of cores (parallelism), performance should improve. But it does not. Why?
Please correct me, if I am wrong or missing something.
Thanks in Advance.
How are you measuring the time? The result you are printing out is not the time execution took, but the sum of times of each individual call.
Running Fact.time(Fact.main(Array.empty)) in REPL I get 90 and 178 with two and one threads respectively. Seems to make sense ...
First of all, Dima is right that what you print is total execution time of all tasks rather than total time till the last task is finished. The difference is that the first sums time for all the work done in parallel and only the latter shows actual speed up from multi-threading.
However there is another important effect. When I run this code with 1, 2 and 3 threads and measure both total time (time until f1 is ready) and total parallel time (the one that you print), I get following data (I also reduce number of calculations from 50 to 20 to speed up my tests):
1 - 70 - 70
2 - 47 - 94
3 - 43 - 126
At first glance this looks OK, as the parallel time divided by the real time is about the same as the number of threads. But if you look a bit closer, you may notice that the speed-up going from 1 thread to 2 is only about 1.5x, and only 1.1x for the third thread. Also, these figures mean that the total time of all tasks actually goes up when you add threads. This might seem puzzling.
The answer to this puzzle is that your calculation is actually not CPU-bound. The thing is that the answer (factorial(100000)) is actually a pretty big number. In fact, it is so big that it takes about 185 KB of memory to store it. What this means is that at the later stages of computation your factorial method actually becomes more memory-bound than CPU-bound, because this size is big enough to overflow the fastest CPU caches. And this is the reason why adding more threads slows down each calculation: yes, you do the calculation faster, but memory doesn't get any faster. So once you saturate the CPU caches and then the memory channel, adding more threads (cores) doesn't improve performance that much.
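The 185 KB figure can be sanity-checked with Stirling's approximation (ln n! ≈ n ln n − n), without computing the factorial itself; this is just a back-of-the-envelope sketch:

```scala
object FactorialSize extends App {
  // Stirling: ln(n!) ≈ n ln n - n, so bits(n!) ≈ (n ln n - n) / ln 2
  val n = 100000.0
  val bits = (n * math.log(n) - n) / math.log(2)
  val kiloBytes = bits / 8 / 1024
  println(f"factorial(100000) takes roughly $kiloBytes%.0f KB") // ~185 KB
}
```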

How to generate a list of random numbers?

This might be the least important Scala question ever, but it's bothering me. How would I generate a list of n random numbers? What I have so far:
def n_rands(n: Int) = {
  val r = new scala.util.Random
  1 to n map { _ => r.nextInt(100) }
}
Which works, but doesn't look very Scalarific to me. I'm open to suggestions.
EDIT
Not because it's relevant so much as it's amusing and obvious in retrospect, the following looks like it works:
1 to 20 map r.nextInt
But the index of each entry in the returned list is also the upper bound of that entry. The first number must be less than 1, the second less than 2, and so on. I ran it three or four times and noticed "Hmmm, the result always starts with 0..."
You can either use Don's solution or:
Seq.fill(n)(Random.nextInt)
Note that you don't need to create a new Random object, you can use the default companion object Random, as stated above.
How about:
import util.Random.nextInt
Stream.continually(nextInt(100)).take(10)
regarding your EDIT,
nextInt can take an Int argument as an upper bound for the random number, so 1 to 20 map r.nextInt is the same as 1 to 20 map (i => r.nextInt(i)), rather than a more useful compilation error.
1 to 20 map (_ => r.nextInt(100)) does what you intended. But it's better to use Seq.fill since that more accurately represents what you're doing.
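The EDIT's observation is easy to demonstrate: mapping a range through r.nextInt uses each element as the upper bound, so the first result is always 0:

```scala
val r = new scala.util.Random
// element at position i is drawn from [0, i+1), since the range starts at 1
val xs = (1 to 20).map(r.nextInt)
assert(xs.head == 0) // nextInt(1) can only return 0
assert(xs.zip(1 to 20).forall { case (x, bound) => 0 <= x && x < bound })
```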

subtracting a DateTime from a DateTime in scala

I'm relatively new to both scala and jodatime, but have been pretty impressed with both. I'm trying to figure out if there is a more elegant way to do some date arithmetic. Here's a method:
private def calcDuration() : String = {
val p = new Period(calcCloseTime.toInstant.getMillis - calcOpenTime.toInstant.getMillis)
val s : String = p.getHours.toString + ":" + p.getMinutes.toString +
":" + p.getSeconds.toString
return s
}
I convert everything to a string because I am putting it into a MongoDB and I'm not sure how to serialize a joda Duration or Period. If someone knows that I would really appreciate the answer.
Anyway, the calcCloseTime and calcOpenTime methods return DateTime objects. Converting them to Instants is the best way I found to get the difference. Is there a better way?
Another side question: When the hours, minutes or seconds are single digit, the resulting string is not zero filled. Is there a straightforward way to make that string look like HH:MM:SS?
Thanks,
John
Period formatting is done by the PeriodFormatter class. You can use a default one, or construct your own using PeriodFormatterBuilder. It may take some more code as you might like to set this builder up properly, but you can use it for example like so:
scala> import org.joda.time._
import org.joda.time._
scala> import org.joda.time.format._
import org.joda.time.format._
scala> val d1 = new DateTime(2010,1,1,10,5,1,0)
d1: org.joda.time.DateTime = 2010-01-01T10:05:01.000+01:00
scala> val d2 = new DateTime(2010,1,1,13,7,2,0)
d2: org.joda.time.DateTime = 2010-01-01T13:07:02.000+01:00
scala> val p = new Period(d1, d2)
p: org.joda.time.Period = PT3H2M1S
scala> val hms = new PeriodFormatterBuilder() minimumPrintedDigits(2) printZeroAlways() appendHours() appendSeparator(":") appendMinutes() appendSuffix(":") appendSeconds() toFormatter
hms: org.joda.time.format.PeriodFormatter = org.joda.time.format.PeriodFormatter@4d2125
scala> hms print p
res0: java.lang.String = 03:02:01
You should perhaps also be aware that day transitions are not taken into account:
scala> val p2 = new Period(new LocalDate(2010,1,1), new LocalDate(2010,1,2))
p2: org.joda.time.Period = P1D
scala> hms print p2
res1: java.lang.String = 00:00:00
so if you need to handle those as well, you would also need to add the required fields (days, weeks, years maybe) to the formatter.
You might want to take a look at Jorge Ortiz's wrapper for Joda-Time, scala-time for something that's a bit nicer to work with in Scala.
You should then be able to use something like
(calcOpenTime to calcCloseTime).millis
Does this link help?
How do I calculate the difference between two dates?
This question has more than one answer! If you just want the number of whole days between two dates, then you can use the new Days class in version 1.4 of Joda-Time.
Days d = Days.daysBetween(startDate, endDate);
int days = d.getDays();
This method, and other static methods on the Days class have been designed to operate well with the JDK5 static import facility.
If however you want to calculate the number of days, weeks, months and years between the two dates, then you need a Period. By default, this will split the difference between the two datetimes into parts, such as "1 month, 2 weeks, 4 days and 7 hours".
Period p = new Period(startDate, endDate);
You can control which fields get extracted using a PeriodType.
Period p = new Period(startDate, endDate, PeriodType.yearMonthDay());
This example will not return any weeks or time fields, so the previous example becomes "1 month and 18 days".

Scala - calculate average of SomeObj.double in a List[SomeObj]

I'm on my second evening of scala, and I'm resisting the urge to write things in scala how I used to do them in java and trying to learn all of the idioms. In this case I'm looking to just compute an average using such things as closures, mapping, and perhaps list comprehension. Irrespective of whether this is the best way to compute an average, I just want to know how to do these things in scala, for learning purposes only.
Here's an example: the average method below is left pretty much unimplemented. I've got a couple of other methods for looking up the rating an individual userid gave that uses the find method of TraversableLike (I think), but nothing more that is scala specific, really. How would I compute an average given a List[RatingEvent] where RatingEvent.rating is a double value that I'd to compute an average of across all values of that List in a scala-like manner?.
package com.brinksys.liftnex.model
class Movie(val id: Int, val ratingEvents: List[RatingEvent]) {
  def getRatingByUser(userId: Int): Int = {
    return getRatingEventByUserId(userId).rating
  }

  def getRatingEventByUserId(userId: Int): RatingEvent = {
    var result = ratingEvents find { e => e.userId == userId }
    return result.get
  }

  def average(): Double = {
    /*
     fill in the blanks where an average of all ratingEvent.rating values is expected
    */
    return 3.8
  }
}
How would a seasoned scala pro fill in that method and use the features of scala to make it as concise as possible? I know how I would do it in java, which is what I want to avoid.
If I were doing it in python, I assume the most pythonic way would be:
sum([re.rating for re in ratingEvents]) / len(ratingEvents)
or if I were forcing myself to use a closure (which is something I at least want to learn in scala):
reduce(lambda x, y : x + y, [re.rating for re in ratingEvents]) / len(ratingEvents)
It's the usage of these types of things I want to learn in scala.
Your suggestions? Any pointers to good tutorials/reference material relevant to this are welcome :D
If you're going to be doing math on things, using List is not always the fastest way to go because List has no idea how long it is--so ratingEvents.length takes time proportional to the length. (Not very much time, granted, but it does have to traverse the whole list to tell.) But if you're mostly manipulating data structures and only occasionally need to compute a sum or whatever, so it's not the time-critical core of your code, then using List is dandy.
Anyway, the canonical way to do it would be with a fold to compute the sum:
(0.0 /: ratingEvents){_ + _.rating} / ratingEvents.length
// Equivalently, though more verbosely:
// ratingEvents.foldLeft(0.0)(_ + _.rating) / ratingEvents.length
or by mapping and then summing (2.8 only):
ratingEvents.map(_.rating).sum / ratingEvents.length
For more information on maps and folds, see this question on that topic.
You might calculate sum and length in one go, but I doubt that this helps except for very long lists. It would look like this:
val (s,l) = ratingEvents.foldLeft((0.0, 0))((t, r)=>(t._1 + r.rating, t._2 + 1))
val avg = s / l
I think for this example Rex's solution is much better, but in other use cases the "fold-over-tuple" trick can be essential.
Since mean and other descriptive statistics like standard deviation or median are needed in different contexts, you could also use a small reusable implicit helper class to allow for more streamlined chained commands:
implicit class ImplDoubleVecUtils(values: Seq[Double]) {
def mean = values.sum / values.length
}
val meanRating = ratingEvents.map(_.rating).mean
It even seems to be possible to write this in a generic fashion for all number types.
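Indeed, a sketch of that generic version using the standard Numeric type class (the names here are mine, not from the answer):

```scala
// mean for any element type that has a Numeric instance
def mean[T](xs: Seq[T])(implicit num: Numeric[T]): Double =
  num.toDouble(xs.sum) / xs.length

assert(mean(List(1, 2, 3, 4)) == 2.5)
assert(mean(Vector(1.5, 2.5, 3.5)) == 2.5)
```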
A tail-recursive solution can achieve both a single traversal and low memory-allocation rates:
def tailrec(input: List[RatingEvent]): Double = {
  @annotation.tailrec
  def go(next: List[RatingEvent], sum: Double, count: Int): Double = {
    next match {
      case Nil => sum / count
      case h :: t => go(t, sum + h.rating, count + 1)
    }
  }
  go(input, 0.0, 0)
}
Here are jmh measurements of approaches from above answers on a list of million elements:
[info] Benchmark Mode Score Units
[info] Mean.foldLeft avgt 0.007 s/op
[info] Mean.foldLeft:·gc.alloc.rate avgt 4217.549 MB/sec
[info] Mean.foldLeft:·gc.alloc.rate.norm avgt 32000064.281 B/op
...
[info] Mean.mapAndSum avgt 0.039 s/op
[info] Mean.mapAndSum:·gc.alloc.rate avgt 1690.077 MB/sec
[info] Mean.mapAndSum:·gc.alloc.rate.norm avgt 72000009.575 B/op
...
[info] Mean.tailrec avgt 0.004 s/op
[info] Mean.tailrec:·gc.alloc.rate avgt ≈ 10⁻⁴ MB/sec
[info] Mean.tailrec:·gc.alloc.rate.norm avgt 0.196 B/op
I can suggest 2 ways:
def average(x: Array[Double]): Double = x.foldLeft(0.0)(_ + _) / x.length
def average(x: Array[Double]): Double = x.sum / x.length
Both are fine, but in the first case, since you use fold, you are not limited to the "+" operation: you can replace it with another one (- or *, for example).
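To illustrate that last point, the same fold skeleton works with other operators (a quick sketch):

```scala
val xs = Array(1.0, 2.0, 3.0, 4.0)

// fold with + gives the sum, hence the mean
assert(xs.foldLeft(0.0)(_ + _) / xs.length == 2.5)

// swap + for * and the same shape computes a product instead
assert(xs.foldLeft(1.0)(_ * _) == 24.0)
```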