Getting a HashMap from Scala's HashMap.mapValues? - scala

The example below is a self-contained example I've extracted from my larger app.
Is there a better way to get a HashMap after calling mapValues below? I'm new to Scala, so it's very likely that I'm going about this all wrong, in which case feel free to suggest a completely different approach. (An apparently obvious solution would be to move the logic in the mapValues to inside the accum but that would be tricky in the larger app.)
#!/bin/sh
exec scala "$0" "$@"
!#
import scala.collection.immutable.HashMap

case class Quantity(val name: String, val amount: Double)

class PercentsUsage {
  type PercentsOfTotal = HashMap[String, Double]
  var quantities = List[Quantity]()
  def total: Double = (quantities map { t => t.amount }).sum
  def addQuantity(qty: Quantity) = {
    quantities = qty :: quantities
  }
  def percentages: PercentsOfTotal = {
    def accum(m: PercentsOfTotal, qty: Quantity) = {
      m + (qty.name -> (qty.amount + (m getOrElse (qty.name, 0.0))))
    }
    val emptyMap = new PercentsOfTotal()
    // The `emptyMap ++` at the beginning feels clumsy, but it does the
    // job of giving me a PercentsOfTotal as the result of the method.
    emptyMap ++ (quantities.foldLeft(emptyMap)(accum(_, _)) mapValues (dollars => dollars / total))
  }
}

val pu = new PercentsUsage()
pu.addQuantity(new Quantity("A", 100))
pu.addQuantity(new Quantity("B", 400))

val pot = pu.percentages
println(pot("A")) // prints 0.2
println(pot("B")) // prints 0.8

Rather than using a mutable HashMap to build up your Map, you can just use Scala collections' built-in groupBy function. This creates a map from the grouping property to a list of the values in that group, which can then be aggregated, e.g. by taking a sum:
def percentages: Map[String, Double] = {
  val t = total
  quantities.groupBy(_.name).mapValues(_.map(_.amount).sum / t)
}
This pipeline transforms your List[Quantity] => Map[String, List[Quantity]] => Map[String, Double] giving you the desired result.
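Note that on Scala 2.13 and later, mapValues on a Map is deprecated in favour of a lazy MapView, so to get a strict Map back you would go through .view and .toMap. A small sketch of the same method under 2.13:
def percentages: Map[String, Double] = {
  val t = total
  // toMap makes the lazily mapped view strict again
  quantities.groupBy(_.name).view.mapValues(_.map(_.amount).sum / t).toMap
}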

Related

Spark: Not able to use accumulator on a tuple/count using scala

I am trying to replace reduceByKey with accumulator logic for word count.
wc.txt
Hello how are are you
Here's what I've got so far:
val words = sc.textFile("wc.txt").flatMap(_.split(" "))
val accum = sc.accumulator(0,"myacc")
for (i <- 1 to words.count.toInt)
  foreach(x => accum += x)
.....
How should I proceed? Any thoughts or ideas are appreciated.
Indeed, using Accumulators for this is cumbersome and not recommended - but for completeness - here's how it can be done (at least with Spark versions 1.6 <= V <= 2.1). Do note that this uses a deprecated API that will not be part of future versions.
You'll need a Map[String, Long] accumulator, which is not available by default, so you'll need to create your own AccumulableParam implementation and use it implicitly:
// some data:
val words = sc.parallelize(Seq("Hello how are are you")).flatMap(_.split(" "))
// aliasing the type, just for convenience
type AggMap = Map[String, Long]
// creating an implicit AccumulableParam that counts by String key
implicit val param: AccumulableParam[AggMap, String] = new AccumulableParam[AggMap, String] {
  // increase matching value by 1, or create it if missing
  override def addAccumulator(r: AggMap, t: String): AggMap =
    r.updated(t, r.getOrElse(t, 0L) + 1L)
  // merge two maps by summing matching values
  override def addInPlace(r1: AggMap, r2: AggMap): AggMap =
    r1 ++ r2.map { case (k, v) => k -> (v + r1.getOrElse(k, 0L)) }
  // start with an empty map
  override def zero(initialValue: AggMap): AggMap = Map.empty
}
// create the accumulator; this will use the above `param` implicitly
val acc = sc.accumulable[AggMap, String](Map.empty[String, Long])
// add each word to the accumulator; the `count()` can be replaced by any Spark action -
// we just need to trigger the calculation of the mapped RDD
words.map(w => { acc.add(w); w }).count()
// after the action, we can read the value of the accumulator
val result: AggMap = acc.value
result.foreach(println)
// (Hello,1)
// (how,1)
// (are,2)
// (you,1)
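For completeness: on Spark 2.x the non-deprecated replacement is AccumulatorV2. A minimal sketch of the same per-word count (the class and variable names here are illustrative, not from the original post):
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable
// a custom AccumulatorV2 that counts occurrences per String key
class WordCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]
  override def isZero: Boolean = counts.isEmpty
  override def copy(): WordCountAccumulator = {
    val acc = new WordCountAccumulator
    acc.counts ++= counts
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(w: String): Unit = counts(w) = counts.getOrElse(w, 0L) + 1L
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) = counts.getOrElse(k, 0L) + v }
  override def value: Map[String, Long] = counts.toMap
}
val acc2 = new WordCountAccumulator
sc.register(acc2, "wordCounts")
words.foreach(acc2.add)
// acc2.value: Map(Hello -> 1, how -> 1, are -> 2, you -> 1)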
As I understand it, you want to count all the words in your text file using a Spark accumulator; in this case you can use:
words.foreach(_ => accum.add(1L))
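On Spark 2.x, where sc.accumulator is deprecated, the same total count would look like this (a minimal sketch):
// assuming Spark 2.x: sc.longAccumulator replaces the deprecated sc.accumulator
val accum = sc.longAccumulator("myacc")
words.foreach(_ => accum.add(1L))
println(accum.value) // 5 for "Hello how are are you"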

Working scala code using a var in a pure function. Is this possible without a var?

Is it possible (or even worthwhile) to try to write the below code block without a var? It works with a var. This is not for an interview; it's my first attempt at Scala (I came from Java).
The problem: Fit people as close to the front of a theatre as possible, while keeping each request (eg. Jones, 4 tickets) in a single theatre section. The theatre sections, starting at the front, are sized 6, 6, 3, 5, 5... and so on. I'm trying to accomplish this by putting together all of the potential groups of ticket requests, and then choosing the best fitting group per section.
Here are the classes. A SeatingCombination is one possible combination of SeatingRequest (just the IDs) and the sum of their ticketCount(s):
class SeatingCombination(val idList: List[Int], val seatCount: Int){}
class SeatingRequest(val id: Int, val partyName: String, val ticketCount: Int){}
class TheatreSection(val sectionSize: Int, rowNumber: Int, sectionNumber: Int) {
  def id: String = rowNumber.toString + "_" + sectionNumber.toString
}
By the time we get to the below function...
1.) all of the possible combinations of SeatingRequest are in a list of SeatingCombination and ordered by descending size.
2.) all of the TheatreSection are listed in order.
def getSeatingMap(groups: List[SeatingCombination], sections: List[TheatreSection]): HashMap[Int, TheatreSection] = {
  var seatedMap = new HashMap[Int, TheatreSection]
  for (sect <- sections) {
    val bestFitOpt = groups.find(g => { g.seatCount <= sect.sectionSize && !isAnyListIdInMap(seatedMap, g.idList) })
    bestFitOpt.filter(_.idList.size > 0).foreach(_.idList.foreach(seatedMap.update(_, sect)))
  }
  seatedMap
}
def isAnyListIdInMap(map: HashMap[Int, TheatreSection], list: List[Int]): Boolean = {
  (for (id <- list) yield !map.get(id).isEmpty).reduce(_ || _)
}
I wrote the rest of the program without a var, but in this iterative section it seems impossible. Maybe it's impossible with my implementation strategy. From what else I've read, a var in a pure function is still functional. But it's been bothering me that I can't think of how to remove the var, because my textbook told me to try to avoid them, and I don't know what I don't know.
You can use foldLeft to iterate on sections with a running state (and again, inside, on your state to add iteratively all the ids in a section):
sections.foldLeft(Map.empty[Int, TheatreSection]) {
  case (seatedMap, sect) =>
    val bestFitOpt = groups.find(g => g.seatCount <= sect.sectionSize && !isAnyListIdInMap(seatedMap, g.idList))
    bestFitOpt.
      filter(_.idList.size > 0).toList. // convert option to list
      flatMap(_.idList).                // flatten list from option and idList
      foldLeft(seatedMap)(_ + (_ -> sect)) // add all ids to the map with sect as value
}
By the way, you can simplify the second method using exists and map.contains:
def isAnyListIdInMap(map: HashMap[Int, TheatreSection], list: List[Int]): Boolean = {
  list.exists(id => map.contains(id))
}
list.exists(predicate: Int => Boolean) is a Boolean which is true if the predicate is true for any element in list.
map.contains(key) checks if map is defined at key.
If you want to be even more concise, you don't need to give a name to the argument of the predicate:
list.exists(map.contains)
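For example, in the REPL:
scala> val m = Map(1 -> "a", 2 -> "b")
m: scala.collection.immutable.Map[Int,String] = Map(1 -> a, 2 -> b)
scala> List(3, 4).exists(m.contains)
res0: Boolean = false
scala> List(2, 5).exists(m.contains)
res1: Boolean = true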
Simply changing var to val should do it :)
I think you may be asking about getting rid of the mutable map, not of the var (it doesn't need to be a var in your code).
Things like this are usually written recursively in Scala or using foldLeft, like other answers suggest. Here is a recursive version:
@tailrec
def getSeatingMap(
  groups: List[SeatingCombination],
  sections: List[TheatreSection],
  result: Map[Int, TheatreSection] = Map.empty): Map[Int, TheatreSection] = sections match {
  case Nil => result
  case head :: tail =>
    val seated = groups
      .iterator
      .filter(_.idList.nonEmpty)
      .filterNot(_.idList.exists(result.contains))
      .find(_.seatCount <= head.sectionSize)
      .fold(List.empty[(Int, TheatreSection)])(_.idList.map(id => id -> head))
    getSeatingMap(groups, tail, result ++ seated)
}
btw, I don't think you need to test every id in the list for presence in the map - it should suffice to just look at the first one. You could also make it a bit more efficient, probably, if instead of checking the map every time to see whether the group is already seated, you just dropped it from the input list as soon as a section is assigned:
@tailrec
def selectGroup(
  sect: TheatreSection,
  groups: List[SeatingCombination],
  result: List[SeatingCombination] = Nil
): (List[(Int, TheatreSection)], List[SeatingCombination]) = groups match {
  case Nil => (Nil, result.reverse)
  case head :: tail if head.idList.nonEmpty && head.seatCount <= sect.sectionSize =>
    (head.idList.map(_ -> sect), result.reverse ++ tail)
  case head :: tail => selectGroup(sect, tail, head :: result)
}
and then in getSeatingMap:
...
case head :: tail =>
  val (seated, remaining) = selectGroup(head, groups)
  getSeatingMap(remaining, tail, result ++ seated)
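Assembled, the full recursion using selectGroup would look like this (a sketch combining the two snippets above):
@tailrec
def getSeatingMap(
  groups: List[SeatingCombination],
  sections: List[TheatreSection],
  result: Map[Int, TheatreSection] = Map.empty): Map[Int, TheatreSection] = sections match {
  case Nil => result
  case head :: tail =>
    // seat the first fitting group in this section and recurse on what's left
    val (seated, remaining) = selectGroup(head, groups)
    getSeatingMap(remaining, tail, result ++ seated)
}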
Here is how I was able to achieve this without the mutable HashMap, using foldLeft as suggested in the comments:
class SeatingCombination(val idList: List[Int], val seatCount: Int){}
class SeatingRequest(val id: Int, val partyName: String, val ticketCount: Int){}
class TheatreSection(val sectionSize: Int, rowNumber: Int, sectionNumber: Int) {
  def id: String = rowNumber.toString + "_" + sectionNumber.toString
}
def getSeatingMap(groups: List[SeatingCombination], sections: List[TheatreSection]): Map[Int, TheatreSection] = {
  sections.foldLeft(Map.empty[Int, TheatreSection]) { (m, sect) =>
    val bestFitOpt = groups.find(g => {
      g.seatCount <= sect.sectionSize && !isAnyListIdInMap(m, g.idList)
    }).filter(_.idList.nonEmpty)
    val newEntries = bestFitOpt.map(_.idList.map(_ -> sect)).getOrElse(List.empty)
    m ++ newEntries
  }
}
def isAnyListIdInMap(map: Map[Int, TheatreSection], list: List[Int]): Boolean = {
  list.exists(id => map.get(id).isDefined)
}

Failure parsing views

I define the following diff function on Seq[Int] which uses view to avoid copying data:
object viewDiff {
  def main(args: Array[String]) {
    val values = 1 to 10
    println("diff=" + diffInt(values).toList)
  }
  def diffInt(seq: Seq[Int]): Seq[Int] = {
    val v1 = seq.view(0, seq.size - 1)
    val v2 = seq.view(1, seq.size)
    (v2, v1).zipped.map(_ - _)
  }
}
This code fails with an UnsupportedOperationException. If I use slice instead of view, it works.
Can anyone explain this?
[tested with Scala 2.10.5 and 2.11.6]
Edit
I selected Carlos's answer because it was the (first) correct explanation of the problem. However, som-snytt's answer is more detailed, and provides a simple solution using view on the zipped object.
I also posted a very simple solution that works for this specific case.
Note
In the code above, I also made a mistake on the algorithm to compute a seq derivative. The last line should be: seq.head +: (v2,v1).zipped.map( _-_ )
When you use seq.view in your code, you are creating SeqView[Int, Seq[Int]] objects that cannot be zipped, as they don't support TraversableView.Builder.result. But you can use something like this:
def diffInt(seq: Seq[Int]) = {
  val v1 = seq.view(0, seq.size - 1)
  val v2 = seq.view(1, seq.size)
  (v2.toList, v1.toList).zipped.map {
    case (x1: Int, y1: Int) => x1 - y1
    case _ => 0
  }
}
That looks strange indeed, and zipped seems to be the culprit. What you can do instead, as a minimal change, is to use zip:
def diffInt(seq: Seq[Int]): Seq[Int] = {
  val v1 = seq.view(0, seq.size - 1)
  val v2 = seq.view(1, seq.size)
  v2.zip(v1).map { case (x1, x2) => x1 - x2 }
}
Normally, you don't build views when mapping them, since you want to defer building the result collection until you force the view.
Since Tuple2Zipped is not a view, on map it tries to build a result that is the same type as its first tupled collection, which is a view.
SeqView's CanBuildFrom yields the NoBuilder that refuses to be forced.
Since the point of using Tuple2Zipped is to avoid intermediate collections, you also want to avoid forcing prematurely, so take a view before mapping:
scala> Seq(1,2,3).view(1,3)
res0: scala.collection.SeqView[Int,Seq[Int]] = SeqViewS(...)
scala> Seq(1,2,3).view(0,2)
res1: scala.collection.SeqView[Int,Seq[Int]] = SeqViewS(...)
scala> (res0, res1).zipped
res2: scala.runtime.Tuple2Zipped[Int,scala.collection.SeqView[Int,Seq[Int]],Int,scala.collection.SeqView[Int,Seq[Int]]] = (SeqViewS(...), SeqViewS(...)).zipped
scala> res2.view map { case (i: Int, j: Int) => i - j }
res3: scala.collection.TraversableView[Int,Traversable[_]] = TraversableViewM(...)
scala> .force
res4: Traversable[Int] = List(1, 1)
Here's a look at the mechanism:
import collection.generic.CanBuildFrom
import collection.SeqView
import collection.mutable.ListBuffer
import language._
object Test extends App {
  implicit val cbf = new CanBuildFrom[SeqView[Int, Seq[Int]], Int, Seq[Int]] {
    def apply(): scala.collection.mutable.Builder[Int, Seq[Int]] = ListBuffer.empty[Int]
    def apply(from: scala.collection.SeqView[Int, Seq[Int]]): scala.collection.mutable.Builder[Int, Seq[Int]] = apply()
  }
  //val res = (6 to 10 view, 1 to 5 view).zipped.map[Int, List[Int]](_ - _)
  val res = (6 to 10 view, 1 to 5 view).zipped.map(_ - _)
  Console println res
}
Ah, those good old times of imperative programming:
val seq = 1 to 10
val i1 = seq.iterator
val i2 = seq.iterator.drop(1)
val i = scala.collection.mutable.ArrayBuffer.empty[Int]
while (i1.hasNext && i2.hasNext) i += i2.next - i1.next
println(i)
I'd say it's as efficient as it gets (no copying and excessive allocations), and pretty readable.
As Carlos Vilchez wrote, zipped cannot work with view. Looks like a bug to me...
But this only happens if the first zipped sequence is a view. Since zipped stops when either of its sequences is exhausted, it is possible to use the whole input seq as the first zipped item and invert the - operation:
def diffInt2(seq: Seq[Int]): Seq[Int] = {
  val v1 = seq //.view(0, seq.size - 1)
  val v2 = seq.view(1, seq.size)
  seq.head +: (v1, v2).zipped.map((a, b) => b - a) // v1 and v2 swapped, subtraction inverted
}
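For this derivative-style diff specifically, a sliding window over an iterator is another way to avoid copying the data, without touching views at all. A sketch, assuming the input has at least two elements:
def diffSliding(seq: Seq[Int]): Seq[Int] =
  // each window is two consecutive elements; subtract and prepend the head,
  // matching the corrected algorithm from the question's edit
  seq.head +: seq.iterator.sliding(2).map { case Seq(a, b) => b - a }.toList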

What is the ideal collection for incremental (with multiple passings) filtering of collection?

I've seen many questions about Scala collections and could not decide.
This question was the most useful so far.
I think the core of the question is twofold:
1) What are the best collections for this use case?
2) What are the recommended ways to use them?
Details:
I am implementing an algorithm that iterates over all elements in a collection
searching for the one that matches a certain criterion.
After the search, the next step is to search again with a new criterion, but without the chosen element among the possibilities.
The idea is to create a sequence with all original elements ordered by the criterion (which changes at every new selection).
The original sequence doesn't really need to be ordered, but there can be duplicates (the algorithm will only pick one at a time).
Example with a small sequence of Ints (just to simplify):
import scala.util.Random
object Foo extends App {
  def f(already_selected: Seq[Int])(element: Int): Double =
    // something more complex happens here,
    // especially something that takes 'already_selected' into account
    math.sqrt(element)
  // call to the algorithm
  val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()), Seq()))
  println("ti = " + ti)
  // algorithm
  def recur(collection: Seq[Int], already_selected: Seq[Int]): (Seq[Int], Seq[Int]) =
    if (collection.isEmpty) (Seq(), already_selected)
    else {
      val selected = collection maxBy f(already_selected)
      val rest = collection diff Seq(selected) // this part doesn't seem to be efficient
      recur(rest, selected +: already_selected)
    }
}
object Tempo {
  def time[T](f: => T): (T, Double) = {
    val s = System.currentTimeMillis
    (f, (System.currentTimeMillis - s) / 1000d)
  }
}
Try @inline and, as icn suggested, How can I idiomatically "remove" a single element from a list in Scala and close the gap?:
import scala.annotation.tailrec
import scala.util.Random
object Foo extends App {
  @inline
  def f(already_selected: Seq[Int])(element: Int): Double =
    // something more complex happens here,
    // especially something that takes 'already_selected' into account
    math.sqrt(element)
  // call to the algorithm
  val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()), Seq()))
  println("ti = " + ti)
  // algorithm
  @tailrec
  def recur(collection: Seq[Int], already_selected: Seq[Int]): Seq[Int] =
    if (collection.isEmpty) already_selected
    else {
      // pair each element with its current position, so the selected one can be
      // removed by position with `patch` instead of scanning with `diff`
      val (selected, i) = collection.zipWithIndex.maxBy { case (value, _) => f(already_selected)(value) }
      val rest = collection.patch(i, Nil, 1)
      recur(rest, selected +: already_selected)
    }
}
object Tempo {
  def time[T](f: => T): (T, Double) = {
    val s = System.currentTimeMillis
    (f, (System.currentTimeMillis - s) / 1000d)
  }
}

Allocation of Function Literals in Scala

I have a class that represents sales orders:
class SalesOrder(val f01:String, val f02:Int, ..., f50:Date)
The fXX fields are of various types. I am faced with the problem of creating an audit trail of my orders. Given two instances of the class, I have to determine which fields have changed. I have come up with the following:
class SalesOrder(val f01: String, val f02: Int, ..., val f50: Date) {
  def auditDifferences(that: SalesOrder): List[String] = {
    def diff[A](fieldName: String, getField: SalesOrder => A) =
      if (getField(this) != getField(that)) Some(fieldName) else None
    val diffList = diff("f01", _.f01) :: diff("f02", _.f02) :: ...
      :: diff("f50", _.f50) :: Nil
    diffList.flatten
  }
}
I was wondering what the compiler does with all the _.fXX functions: are they instanced just once (statically), and can be shared by all instances of my class, or will they be instanced every time I create an instance of my class?
My worry is that, since I will use a lot of SalesOrder instances, it may create a lot of garbage. Should I use a different approach?
One clean way of solving this problem would be to use the standard library's Ordering type class. For example:
class SalesOrder(val f01: String, val f02: Int, val f03: Char) {
  def diff(that: SalesOrder) = SalesOrder.fieldOrderings.collect {
    case (name, ord) if !ord.equiv(this, that) => name
  }
}
object SalesOrder {
  val fieldOrderings: List[(String, Ordering[SalesOrder])] = List(
    "f01" -> Ordering.by(_.f01),
    "f02" -> Ordering.by(_.f02),
    "f03" -> Ordering.by(_.f03)
  )
}
And then:
scala> val orderA = new SalesOrder("a", 1, 'a')
orderA: SalesOrder = SalesOrder@5827384f
scala> val orderB = new SalesOrder("b", 1, 'b')
orderB: SalesOrder = SalesOrder@3bf2e1c7
scala> orderA diff orderB
res0: List[String] = List(f01, f03)
You almost certainly don't need to worry about the performance of your original formulation, but this version is (arguably) nicer for unrelated reasons.
Yes, that creates 50 short-lived functions. I don't think you should be worried unless you have manifest evidence that it causes a performance problem in your case.
But I would define a method that transforms a SalesOrder into a Map[String, Any]; then you would just have:
trait SalesOrder {
  def fields: Map[String, Any]
}
def diff(a: SalesOrder, b: SalesOrder): Iterable[String] = {
  val af = a.fields
  val bf = b.fields
  af.collect { case (key, value) if bf(key) != value => key }
}
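For instance, a concrete order class might supply the field map like this (a hypothetical sketch with only three fields):
class SimpleSalesOrder(f01: String, f02: Int, f03: Char) extends SalesOrder {
  // each field is named explicitly once, in one place
  def fields: Map[String, Any] = Map("f01" -> f01, "f02" -> f02, "f03" -> f03)
}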
If the field names are indeed just incremental numbers, you could simplify
trait SalesOrder {
  def fields: Iterable[Any]
}
def diff(a: SalesOrder, b: SalesOrder): Iterable[String] =
  (a.fields zip b.fields).zipWithIndex.collect {
    case ((av, bv), idx) if av != bv => f"f${idx + 1}%02d"
  }
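A quick usage sketch of this last version, with hypothetical anonymous instances:
val a = new SalesOrder { def fields = Seq("x", 1, 'a') }
val b = new SalesOrder { def fields = Seq("y", 1, 'b') }
diff(a, b) // List(f01, f03)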