Scala Map collection and map method

I don't understand why def + and def adjust in the following Scala code can be correct. I understand that def adjust is used to adjust coefficients: when p1 and p2 have the same exponent, their respective coefficients need to be summed when adding p1 and p2. But what I don't understand is: 1) this should already be taken care of by the code "other.terms map adjust" under def +; 2) and if 1) is correct, "terms ++" in the same def will add p1's coefficient one more time, which should be wrong.
I'm confused as this code works well. Can someone please help me? Thanks a lot!
object polynomials {
  class Poly(terms0: Map[Int, Double]) {
    val terms = terms0 withDefaultValue 0.0
    def +(other: Poly) = new Poly(terms ++ (other.terms map adjust))
    def adjust(term: (Int, Double)): (Int, Double) = {
      val (exp, coeff) = term
      exp -> (coeff + terms(exp))
    }
    override def toString =
      (for ((exp, coeff) <- terms.toList.sorted.reverse) yield coeff + "x^" + exp) mkString "+"
  }
  val p1 = new Poly(Map(1 -> 2.0, 3 -> 4.0, 5 -> 6.2))
  val p2 = new Poly(Map(0 -> 3.0, 3 -> 7.0))
  p1 + p2
}

Quick answers:
other.terms map adjust only includes terms from other, but there may be terms in this that are not in other. In order to retain those terms, the adjusted terms are added to the existing ones.
++ on two maps does not merge values with the same key; it replaces any values from the left-hand Map with those from the right-hand Map that have the same key. So the terms from other.terms map adjust will replace those in this.terms, not modify them.
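For the question's own p1 and p2 this is easy to see in the REPL (a small illustration; adjusted is what other.terms map adjust produces):
val p1terms  = Map(1 -> 2.0, 3 -> 4.0, 5 -> 6.2)
val adjusted = Map(0 -> 3.0, 3 -> 11.0) // adjust turned 3 -> 7.0 into 3 -> (7.0 + terms(3)) = 11.0
p1terms ++ adjusted
// Map(1 -> 2.0, 3 -> 11.0, 5 -> 6.2, 0 -> 3.0): 3 -> 4.0 was replaced, not summed again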

Related

How can we sample from a list of strings based on their probability of occurrence in the list in Scala?

I have a List[(String, Double)] variable where the second element of the tuple denotes the probability of the string in the first element appearing in a corpus. An example would be [(Apple, 0.2), (Banana, 0.3), (Lemon, 0.5)], where an Apple appears with a probability of 0.2 in the list of strings. I want to randomly sample from the list of strings based on their probability of appearance, something along the lines of numpy's random.choice() method. What would be the correct way to do this in Scala?
Another solution:
def choice(samples: Seq[(String, Double)], n: Int): Seq[String] = {
  val (strings, probs) = samples.unzip
  // cumulative probabilities, e.g. (0.2, 0.3, 0.5) => (0.0, 0.2, 0.5)
  val cumprobs = probs.scanLeft(0.0) { _ + _ }.init
  // map a uniform random number in [0, 1) to the string whose interval contains it
  def p2s(p: Double): String = strings(cumprobs.lastIndexWhere(_ <= p))
  Seq.fill(n)(math.random).map(p2s)
}
Usage (and verification):
>> val ss = choice(Seq(("Apple", 0.2), ("Banana", 0.3), ("Lemon", 0.5)), 10000)
>> ss.groupBy(identity).map{ case(k, v) => (k, v.size)}
Map[String, Int] = Map(Banana -> 3013, Lemon -> 4971, Apple -> 2016)
A very naive (and inefficient) solution would be to create a List of 100 elements that repeats each of the original elements as many times as needed to respect its probability. Then you can randomly shuffle that List and finally take the first element.
import scala.util.Random

final val percent_100 = BigDecimal(100)

def choice[T](data: List[(T, Double)]): T = {
  val distribution = data.flatMap {
    case (elem, probability) =>
      // round each probability to two decimal places...
      val scaledProbability = BigDecimal(probability).setScale(
        scale = 2,
        BigDecimal.RoundingMode.HALF_EVEN
      )
      // ...and repeat the element once per percentage point
      val n = (scaledProbability * percent_100).toIntExact
      List.fill(n)(elem)
  }
  Random.shuffle(distribution).head
}
However, I am sure there should be better ways of solving this.

Scala Generic Type slow

I need to create a method that compares either Int or String or Char. Using AnyVal did not make it possible, as it has no methods for < and > comparison.
However, typing it as Ordered shows a significant slowness. Are there better ways to achieve this? The plan is to do a generic binary sort, and I found that generic typing decreases the performance.
def sample1[T <% Ordered[T]](x:T) = { x < (x) }
def sample2(x:Ordered[Int]) = { x < 1 }
def sample3(x:Int) = { x < 1 }
val start1 = System.nanoTime
sample1(5)
println(System.nanoTime - start1)
val start2 = System.nanoTime
sample2(5)
println(System.nanoTime - start2)
val start3 = System.nanoTime
sample3(5)
println(System.nanoTime - start3)
val start4 = System.nanoTime
sample3(5)
println(System.nanoTime - start4)
val start5 = System.nanoTime
sample2(5)
println(System.nanoTime - start5)
val start6 = System.nanoTime
sample1(5)
println(System.nanoTime - start6)
The results show:
Sample1:696122
Sample2:45123
Sample3:13947
Sample3:5332
Sample2:194438
Sample1:497992
Am I handling generics incorrectly? Or should I be doing it the old Java way, using a Comparator in this case, as in:
import java.util.Comparator

object C extends Comparator[Int] {
  override def compare(a: Int, b: Int): Int = {
    a - b
  }
}

def sample4[T](a: T, b: T, x: Comparator[T]) { x.compare(a, b) }
The Scala equivalent of Java Comparator is Ordering. One of the main differences is that, if you don't provide one manually, a suitable Ordering can be injected implicitly by the compiler. By default, this will be done for Byte, Int, Float and other primitives, for any subclass of Ordered or Comparable, and for some other obvious cases.
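For example, a default instance can be summoned with no manual wiring (a minimal illustration):
val intOrd = implicitly[Ordering[Int]] // resolved by the compiler
intOrd.compare(1, 2)                   // -1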
Also, Ordering provides method definitions for all the main comparison methods as extension methods, so you can write the following:
import Ordering.Implicits._
def sample5[T : Ordering](a: T, b: T) = a < b
def run() = sample5(1, 2)
As of Scala 2.12, those extension operations (i.e., a < b) involve wrapping in a temporary Ordering#Ops object, so the code will be slower than with a Comparator. Not by much in most real cases, but it is still significant if you care about micro-optimisations.
But you can use an alternative syntax to define an implicit Ordering[T] parameter and invoke methods on the Ordering object directly.
Actually even the generated bytecode for the following two methods will be identical (except for the type of the third argument, and potentially the implementation of the respective compare methods):
def withOrdering[T](x: T, y: T)(implicit cmp: Ordering[T]) = {
  cmp.compare(x, y) // also supports other methods, like `cmp.lt(x, y)`
}

def withComparator[T](x: T, y: T, cmp: Comparator[T]) = {
  cmp.compare(x, y)
}
In practice the runtime on my machine is the same, when invoking these methods with Int arguments.
So, if you want to compare types generically in Scala, you should usually use Ordering.
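For completeness, invoking the two methods above might look like this (note that Ordering extends java.util.Comparator, which makes the second call possible):
withOrdering(1, 2)                 // Ordering[Int] injected implicitly
withComparator(1, 2, Ordering.Int) // an Ordering also works as a Comparator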
Do not write micro-benchmarks this way if you want results similar to what you will get in a production environment.
First of all, you need to warm up the JVM, and after that run your test as an average over many iterations. You also need to prevent possible JVM optimizations due to constant data. E.g.
import scala.util.Random

def sample1[T <% Ordered[T]](x: T) = { x < (x) }
def sample2(x: Ordered[Int]) = { x < 1 }
def sample3(x: Int) = { x < 1 }

val r = new Random()

def measure(f: => Unit): Long = {
  val start = System.nanoTime
  f
  System.nanoTime - start
}

val n = 1000000
// warm-up runs, results discarded
(1 to n).map(_ => measure { val k = r.nextInt(); sample1(k) })
(1 to n).map(_ => measure { val k = r.nextInt(); sample2(k) })
(1 to n).map(_ => measure { val k = r.nextInt(); sample3(k) })
// measured runs, averaged over n iterations
val avg1 = (1 to n).map(_ => measure { val k = r.nextInt(); sample1(k) }).sum / n
println(avg1)
val avg2 = (1 to n).map(_ => measure { val k = r.nextInt(); sample2(k) }).sum / n
println(avg2)
val avg3 = (1 to n).map(_ => measure { val k = r.nextInt(); sample3(k) }).sum / n
println(avg3)
I got results that look fairer to me:
134
92
83
This book could shed some light on performance tests.

Refactoring a small Scala function

I have this function to compute the distance between two n-dimensional points using Pythagoras' theorem.
def computeDistance(neighbour: Point) = math.sqrt(
  coordinates.zip(neighbour.coordinates).map {
    case (c1: Int, c2: Int) => math.pow(c1 - c2, 2)
  }.sum
)
The Point class (simplified) looks like:
class Point(val coordinates: List[Int])
I'm struggling to refactor the method so it's a little easier to read, can anybody help please?
Here's another way that makes the following three assumptions:
The length of the list is the number of dimensions for the point
Each List is correctly ordered, i.e. List(x, y) or List(x, y, z). We do not know how to handle List(x, z, y)
All lists are of equal length
def computeDistance(other: Point): Double = sqrt(
  coordinates.zip(other.coordinates)
    .flatMap(i => List(pow(i._2 - i._1, 2)))
    .fold(0.0)(_ + _)
)
The obvious disadvantage here is that we don't have any safety around list length. The quick fix for this is to simply have the function return an Option[Double] like so:
def computeDistance(other: Point): Option[Double] = {
  if (other.coordinates.length != coordinates.length) {
    return None
  }
  Some(sqrt(coordinates.zip(other.coordinates)
    .flatMap(i => List(pow(i._2 - i._1, 2)))
    .fold(0.0)(_ + _)))
}
I'd be curious if there is a type safe way to ensure equal list length.
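As an aside, the same Option-based version can be written without the early return; a minimal sketch:
def computeDistance(other: Point): Option[Double] =
  if (other.coordinates.length != coordinates.length) None
  else Some(sqrt(coordinates.zip(other.coordinates)
    .map { case (a, b) => pow(a - b, 2) }
    .sum))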
EDIT
It was politely pointed out to me that flatMap(x => List(foo(x))) is equivalent to map(foo), which I forgot to refactor when I was originally playing w/ this. Slightly cleaner version w/ map instead of flatMap:
def computeDistance(other: Point): Double = sqrt(
  coordinates.zip(other.coordinates)
    .map(i => pow(i._2 - i._1, 2))
    .fold(0.0)(_ + _)
)
Most of your problem is that you're trying to do math with really long variable names. It's almost always painful. There's a reason why mathematicians use single letters. And assign temporary variables.
Try this:
import math._

class Point(val coordinates: List[Int]) {
  def c = coordinates
  def d(p: Point) = {
    val delta = for ((a, b) <- c zip p.c) yield pow(a - b, 2)
    sqrt(delta.sum)
  }
}
Consider type aliases and case classes, like this,
type Coord = List[Int]

case class Point(c: Coord) {
  def distTo(p: Point) = {
    val z = (c zip p.c).par
    val pw = z.aggregate(0.0)((a, v) => a + math.pow(v._1 - v._2, 2), _ + _)
    math.sqrt(pw)
  }
}
so that for any two points, for instance,
val p = Point( (1 to 5).toList )
val q = Point( (2 to 6).toList )
we have that
p distTo q
res: Double = 2.23606797749979
Note method distTo uses aggregate on a parallelised collection of tuples, and combines the partial results by the last argument (summation). For high dimensional points this may prove more efficient than the sequential counterpart.
For simplicity of use, consider also implicit classes, as suggested in a comment above,
implicit class RichPoint(val c: Coord) extends AnyVal {
  def distTo(d: Coord) = Point(c) distTo Point(d)
}
Hence
List(1,2,3,4,5) distTo List(2,3,4,5,6)
res: Double = 2.23606797749979

Idiomatic Scala for applying functions in a chain if Option(s) are defined

Is there a pre-existing / Scala-idiomatic / better way of accomplishing this?
def sum(x: Int, y: Int) = x + y
var x = 10
x = applyOrBypass(target=x, optValueToApply=Some(22), sum)
x = applyOrBypass(target=x, optValueToApply=None, sum)
println(x) // will be 32
My applyOrBypass could be defined like this:
def applyOrBypass[A, B](target: A, optValueToApply: Option[B], func: (A, B) => A) = {
  optValueToApply map { valueToApply =>
    func(target, valueToApply)
  } getOrElse {
    target
  }
}
Basically I want to apply operations depending on whether certain Option values are defined or not. If they are not, I should get the pre-existing value. Ideally I would like to chain these operations without having to use a var.
My intuition tells me that folding or reducing would be involved, but I am not sure how it would work. Or maybe there is another approach with monadic for-comprehensions...
Any suggestions / hints appreciated!
Scala has a way to do this with for comprehensions (the syntax is similar to Haskell's do notation, if you are familiar with it):
(for (v <- optValueToApply)
  yield func(target, v)).getOrElse(target)
Of course, this is more useful if you have several variables that you want to check the existence of:
(for (v1 <- optV1;
      v2 <- optV2;
      v3 <- optV3)
  yield func(target, v1, v2, v3)).getOrElse(target)
If you are trying to accumulate a value over a list of options, then I would recommend a fold, so your optional sum would look like this:
val vs = List(Some(1), None, None, Some(2), Some(3))
(target /: vs) ( (x, v) => x + v.getOrElse(0) )
// => 6 + target
You can generalise this, under the condition that your operation func has some identity value, identity:
(target /: vs) ( (x, v) => func(x, v.getOrElse(identity)) )
Mathematically speaking, this condition is that (func, identity) forms a monoid. But that's by the by. The actual effect is that whenever a None is reached, func is applied to x and the identity, which always produces x (Nones are ignored, and Some values are unwrapped and applied as normal), which is what you want.
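For instance, with multiplication, whose identity element is 1 (a hypothetical second operation, just to illustrate):
val vs = List(Some(2), None, Some(3))
(10 /: vs)((x, v) => x * v.getOrElse(1)) // 60: the None contributes the identity and is skipped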
What I would do in a case like this is use partially applied functions and identity:
def applyOrBypass[A, B](optValueToApply: Option[B], func: B => A => A): A => A =
  optValueToApply.map(func).getOrElse(identity)
You would apply it like this:
def sum(x: Int)(y: Int) = x + y
var x = 10
x = applyOrBypass(optValueToApply=Some(22), sum)(x)
x = applyOrBypass(optValueToApply=None, sum)(x)
println(x)
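Since each call now yields an A => A, the operations also chain naturally without a var; a minimal sketch with the same definitions:
val result = List(applyOrBypass(Some(22), sum), applyOrBypass(None, sum))
  .foldLeft(10)((acc, f) => f(acc))
println(result) // 32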
Yes, you can use fold. If you have multiple optional operands, there are some useful abstractions in the Scalaz library I believe.
var x = 10
x = Some(22).fold(x)(sum(_, x))
x = None .fold(x)(sum(_, x))
If you have multiple functions, it can be done with Scalaz.
There are several ways to do it, but here is one of the most concise.
First, add your imports:
import scalaz._, Scalaz._
Then, create your functions (this way isn't worth it if your functions are always the same, but if they are different, it makes sense):
// assuming a second function for illustration, e.g. def multiply(i: Int, j: Int) = i * j
val s = List(Some(22).map((i: Int) => (j: Int) => sum(i, j)),
             None .map((i: Int) => (j: Int) => multiply(i, j)))
Finally, apply them all:
(s.flatten.foldMap(Endo(_)))(x)

Scala: groupBy (identity) of List Elements

I am developing an application that builds pairs of words in (tokenised) text and produces the number of times each pair occurs (even when same-word pairs occur multiple times, this is OK, as it will be evened out later in the algorithm).
When I use
elements groupBy()
I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x
/**
 * Maps a collection of words to a map where the key is a pair of words
 * and the value is the number of times this pair occurs in the passed array.
 */
def producePairs(words: Array[String]): Map[(String, String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))
  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:
grouppedPairs = table groupBy (x => x)
This way it produced what I want. However, I still feel that I am clearly missing something and there should be an easier way of doing it. Any ideas at all, dear all?
Also, if you'd help me to improve the pairs extraction part, it'll also help a lot – it looks very imperative, C++-ish right now. Many thanks in advance!
I'd suggest this:
def producePairs(words: Array[String]): Map[(String, String), Double] = {
  val table = for (w1 <- words; w2 <- words) yield (w1, w2)
  val grouppedPairs = table.groupBy(identity)
  val size = grouppedPairs.size.toDouble
  grouppedPairs.mapValues(_.length / size)
}
The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
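For instance, in the REPL:
scala> List("a", "b", "a").groupBy(identity)
res0: Map[String,List[String]] = Map(a -> List(a, a), b -> List(b))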
You are creating a list of pairs of all words against all words by iterating over words twice, where I guess you just want the neighbouring pairs. The easiest way is to use a sliding view instead.
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}
Another approach would be to fold over the list of pairs, summing them up. I am not sure, though, that this is more efficient:
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
    m + (p -> (m.getOrElse(p, 0) + 1))
  }
}
I see you are returning a relative number (Double). For simplicity I have just counted the occurrences, so you need to do the final division yourself. I think you want to divide by the total number of pairs (words.size - 1) and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.
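That division could look like this (a sketch, reusing the grouped counts from the sliding version above):
val grouped = producePairs(words)
val freqs = grouped.mapValues(_.toDouble / (words.size - 1)) // relative frequencies sum to 1.0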
An alternative approach, which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
def producePairs[T <% Traversable[String]](words: T): Map[(String, String), Double] = {
  val counts = words.groupBy(identity).map { case (w, ws) => (w -> ws.size) }
  val size = (counts.size * counts.size).toDouble
  for (w1 <- counts; w2 <- counts) yield {
    ((w1._1, w2._1) -> ((w1._2 * w2._2) / size))
  }
}