Suppose I want to use a mutable map in Scala to keep track of the number of times I've seen some strings. In a single-threaded context, this is easy:
import scala.collection.mutable.{ Map => MMap }
class Counter {
val counts = MMap.empty[String, Int].withDefaultValue(0)
def add(s: String): Unit = counts(s) += 1
}
Unfortunately this isn't thread-safe, since the get and the update don't happen atomically.
Concurrent maps add a few atomic operations to the mutable map API, but not the one I need, which would look something like this:
def replace(k: A, f: B => B): Option[B]
I know I can use ScalaSTM's TMap:
import scala.concurrent.stm._
class Counter {
val counts = TMap.empty[String, Int]
def add(s: String): Unit = atomic { implicit txn =>
counts(s) = counts.get(s).getOrElse(0) + 1
}
}
But (for now) that's still an extra dependency. Other options would include actors (another dependency), synchronization (potentially less efficient), or Java's atomic references (less idiomatic).
In general I'd avoid mutable maps in Scala, but I've occasionally needed this kind of thing, and most recently I've used the STM approach (instead of just crossing my fingers and hoping I don't get bitten by the naïve solution).
I know there are a number of trade-offs here (extra dependencies vs. performance vs. clarity, etc.), but is there anything like a "right" answer to this problem in Scala 2.10?
How about this one? Assuming you don't really need a general replace method right now, just a counter.
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
object CountedMap {
private val counts = new ConcurrentHashMap[String, AtomicInteger]
def add(key: String): Int = {
val zero = new AtomicInteger(0)
val value = Option(counts.putIfAbsent(key, zero)).getOrElse(zero)
value.incrementAndGet
}
}
You get better performance than synchronizing on the whole map, and you also get atomic increments.
The simplest solution is definitely synchronization. If there is not too much contention, performance might not be that bad.
Otherwise, you could try to roll up your own STM-like replace implementation. Something like this might do:
object ConcurrentMapOps {
private val rng = new util.Random
private val MaxReplaceRetryCount = 10
private val MinReplaceBackoffTime: Long = 1
private val MaxReplaceBackoffTime: Long = 20
}
implicit class ConcurrentMapOps[A, B]( val m: collection.concurrent.Map[A,B] ) {
import ConcurrentMapOps._
private def replaceBackoff() {
Thread.sleep( (MinReplaceBackoffTime + rng.nextFloat * (MaxReplaceBackoffTime - MinReplaceBackoffTime) ).toLong ) // A bit crude, I know
}
def replace(k: A, f: B => B): Option[B] = {
m.get( k ) match {
case None => return None
case Some( old ) =>
var retryCount = 0
while ( retryCount <= MaxReplaceRetryCount ) {
val done = m.replace( k, old, f( old ) )
if ( done ) {
return Some( old )
}
else {
retryCount += 1
replaceBackoff()
}
}
sys.error("Could not concurrently modify map")
}
}
}
Note that collision issues are localized to a given key. If two threads access the same map but work on distinct keys, you'll have no collisions and the replace operation will always succeed the first time. If a collision is detected, we wait a bit (a random amount of time, so as to minimize the likeliness of threads fighting forever for the same key) and try again.
I cannot guarantee that this is production-ready (I just tossed it right now), but that might do the trick.
UPDATE: Of course (as Ionuț G. Stan pointed out), if all you want is increment/decrement a value, java's ConcurrentHashMap already provides thoses operations in a lock-free manner.
My above solution applies if you need a more general replace method that would take the transformation function as a parameter.
You're asking for trouble if your map is just sitting there as a val. If it meets your use case, I'd recommend something like
class Counter {
private[this] myCounts = MMap.empty[String, Int].withDefaultValue(0)
def counts(s: String) = myCounts.synchronized { myCounts(s) }
def add(s: String) = myCounts.synchronized { myCounts(s) += 1 }
def getCounts = myCounts.synchronized { Map[String,Int]() ++ myCounts }
}
for low-contention usage. For high-contention, you should use a concurrent map designed to support such use (e.g. java.util.concurrent.ConcurrentHashMap) and wrap the values in AtomicWhatever.
If you are ok to work with future based interface:
trait SingleThreadedExecutionContext {
val ec = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
}
class Counter extends SingleThreadedExecutionContext {
private val counts = MMap.empty[String, Int].withDefaultValue(0)
def get(s: String): Future[Int] = future(counts(s))(ec)
def add(s: String): Future[Unit] = future(counts(s) += 1)(ec)
}
Test will look like:
class MutableMapSpec extends Specification {
"thread safe" in {
import ExecutionContext.Implicits.global
val c = new Counter
val testData = Seq.fill(16)("1")
await(Future.traverse(testData)(c.add))
await(c.get("1")) mustEqual 16
}
}
Related
When incrementing concurrent counters by key in ConcurrentHashMap is it safe to use regular Int for value or do we have to use AtomicInteger? For example consider the following two implementations
ConcurrentHashMap[String, Int]
final class ExpensiveMetrics(implicit system: ActorSystem, ec: ExecutionContext) {
import scala.collection.JavaConverters._
private val chm = new ConcurrentHashMap[String, Int]().asScala
system.scheduler.schedule(5.seconds, 60.seconds)(publishAllMetrics())
def countRequest(key: String): Unit =
chm.get(key) match {
case Some(value) => chm.update(key, value + 1)
case None => chm.update(key, 1)
}
private def resetCount(key: String) = chm.replace(key, 0)
private def publishAllMetrics(): Unit =
chm foreach { case (key, value) =>
// publishMetric(key, value.doubleValue())
resetCount(key)
}
}
ConcurrentHashMap[String, AtomicInteger]
final class ExpensiveMetrics(implicit system: ActorSystem, ec: ExecutionContext) {
import scala.collection.JavaConverters._
private val chm = new ConcurrentHashMap[String, AtomicInteger]().asScala
system.scheduler.schedule(5.seconds, 60.seconds)(publishAllMetrics())
def countRequest(key: String): Unit =
chm.getOrElseUpdate(key, new AtomicInteger(1)).incrementAndGet()
private def resetCount(key: String): Unit =
chm.getOrElseUpdate(key, new AtomicInteger(0)).set(0)
private def publishAllMetrics(): Unit =
chm foreach { case (key, value) =>
// publishMetric(key, value.doubleValue())
resetCount(key)
}
}
Is the former implementation safe? If not, at what point in the snippet can race-condition be introduced and why?
The context of the question are AWS CloudWatch metrics which can get very expensive on high-frequency APIs if posted on each request. So I am trying to "batch" them up and publish them periodically.
The first implementation is not correct, because the countRequest method is not atomic. Consider this sequence of events:
Threads A and B both call countRequest with key "foo"
Thread A obtains the counter value, let's call it x
Thread B obtains the counter value. It's the same value x, because Thread A hasn't updated the counter yet.
Thread B updates the map with the new counter value, x+1
Thread A updates the map, and because it obtained the counter value before B wrote the new counter value, it also writes x+1.
The counter should be x+2, but it is x+1. It's a classic lost update problem.
The second implementation has a similar problem due to the use of the `getOrElseUpdate` method. `ConcurrentHashMap` does not have that method, therefore the Scala wrapper needs to emulate it. I think the implementation is the one inherited from `scala.collection.mutable.MapOps`, and it is defined like so:
```
def getOrElseUpdate(key: K, op: => V): V =
get(key) match {
case Some(v) => v
case None => val d = op; this(key) = d; d
}
```
This is obviously not atomic.
To implement this correctly, use the compute method on ConcurrentHashMap.
This method will execute atomically, so you won't need an AtomicInteger.
How to implement cache using functional programming
A few days ago I came across callbacks and proxy pattern implementation using scala.
This code should only apply inner function if the value is not in the map.
But every time map is reinitialized and values are gone (which seems obivous.
How to use same cache again and again between different function calls
class Aggregator{
def memoize(function: Function[Int, Int] ):Function[Int,Int] = {
val cache = HashMap[Int, Int]()
(t:Int) => {
if (!cache.contains(t)) {
println("Evaluating..."+t)
val r = function.apply(t);
cache.put(t,r)
r
}
else
{
cache.get(t).get;
}
}
}
def memoizedDoubler = memoize( (key:Int) => {
println("Evaluating...")
key*2
})
}
object Aggregator {
def main( args: Array[String] ) {
val agg = new Aggregator()
agg.memoizedDoubler(2)
agg.memoizedDoubler(2)// It should not evaluate again but does
agg.memoizedDoubler(3)
agg.memoizedDoubler(3)// It should not evaluate again but does
}
I see what you're trying to do here, the reason it's not working is that every time you call memoizedDoubler it's first calling memorize. You need to declare memoizedDoubler as a val instead of def if you want it to only call memoize once.
val memoizedDoubler = memoize( (key:Int) => {
println("Evaluating...")
key*2
})
This answer has a good explanation on the difference between def and val. https://stackoverflow.com/a/12856386/37309
Aren't you declaring a new Map per invocation ?
def memoize(function: Function[Int, Int] ):Function[Int,Int] = {
val cache = HashMap[Int, Int]()
rather than specifying one per instance of Aggregator ?
e.g.
class Aggregator{
private val cache = HashMap[Int, Int]()
def memoize(function: Function[Int, Int] ):Function[Int,Int] = {
To answer your question:
How to implement cache using functional programming
In functional programming there is no concept of mutable state. If you want to change something (like cache), you need to return updated cache instance along with the result and use it for the next call.
Here is modification of your code that follows that approach. function to calculate values and cache is incorporated into Aggregator. When memoize is called, it returns tuple, that contains calculation result (possibly taken from cache) and new Aggregator that should be used for the next call.
class Aggregator(function: Function[Int, Int], cache:Map[Int, Int] = Map.empty) {
def memoize:Int => (Int, Aggregator) = {
t:Int =>
cache.get(t).map {
res =>
(res, Aggregator.this)
}.getOrElse {
val res = function(t)
(res, new Aggregator(function, cache + (t -> res)))
}
}
}
object Aggregator {
def memoizedDoubler = new Aggregator((key:Int) => {
println("Evaluating..." + key)
key*2
})
def main(args: Array[String]) {
val (res, doubler1) = memoizedDoubler.memoize(2)
val (res1, doubler2) = doubler1.memoize(2)
val (res2, doubler3) = doubler2.memoize(3)
val (res3, doubler4) = doubler3.memoize(3)
}
}
This prints:
Evaluating...2
Evaluating...3
I have a Traversable, and I want to make it into a Java Iterator. My problem is that I want everything to be lazily done. If I do .toIterator on the traversable, it eagerly produces the result, copies it into a List, and returns an iterator over the List.
I'm sure I'm missing something simple here...
Here is a small test case that shows what I mean:
class Test extends Traversable[String] {
def foreach[U](f : (String) => U) {
f("1")
f("2")
f("3")
throw new RuntimeException("Not lazy!")
}
}
val a = new Test
val iter = a.toIterator
The reason you can't get lazily get an iterator from a traversable is that you intrinsically can't. Traversable defines foreach, and foreach runs through everything without stopping. No laziness there.
So you have two options, both terrible, for making it lazy.
First, you can iterate through the whole thing each time. (I'm going to use the Scala Iterator, but the Java Iterator is basically the same.)
class Terrible[A](t: Traversable[A]) extends Iterator[A] {
private var i = 0
def hasNext = i < t.size // This could be O(n)!
def next: A = {
val a = t.slice(i,i+1).head // Also could be O(n)!
i += 1
a
}
}
If you happen to have efficient indexed slicing, this will be okay. If not, each "next" will take time linear in the length of the iterator, for O(n^2) time just to traverse it. But this is also not necessarily lazy; if you insist that it must be you have to enforce O(n^2) in all cases and do
class Terrible[A](t: Traversable[A]) extends Iterator[A] {
private var i = 0
def hasNext: Boolean = {
var j = 0
t.foreach { a =>
j += 1
if (j>i) return true
}
false
}
def next: A = {
var j = 0
t.foreach{ a =>
j += 1
if (j>i) { i += 1; return a }
}
throw new NoSuchElementException("Terribly empty")
}
}
This is clearly a terrible idea for general code.
The other way to go is to use a thread and block the traversal of foreach as it's going. That's right, you have to do inter-thread communication on every single element access! Let's see how that works--I'm going to use Java threads here since Scala is in the middle of a switch to Akka-style actors (though any of the old actors or the Akka actors or the Scalaz actors or the Lift actors or (etc.) will work)
class Horrible[A](t: Traversable[A]) extends Iterator[A] {
private val item = new java.util.concurrent.SynchronousQueue[Option[A]]()
private class Loader extends Thread {
override def run() { t.foreach{ a => item.put(Some(a)) }; item.put(None) }
}
private val loader = new Loader
loader.start
private var got: Option[A] = null
def hasNext: Boolean = {
if (got==null) { got = item.poll; hasNext }
else got.isDefined
}
def next = {
if (got==null) got = item.poll
val ans = got.get
got = null
ans
}
}
This avoids the O(n^2) disaster, but ties up a thread and has desperately slow element-by-element access. I get about two million accesses per second on my machine, as compared to >100M for a typical traversable. This is clearly a horrible idea for general code.
So there you have it. Traversable is not lazy in general, and there is no good way to make it lazy without compromising performance tremendously.
I've run into this problem before and as far as I can tell, no one's particularly interested in making it easier to get an Iterator when all you've defined is foreach.
But as you've noted, toStream is the problem, so you could just override that:
class Test extends Traversable[String] {
def foreach[U](f: (String) => U) {
f("1")
f("2")
f("3")
throw new RuntimeException("Not lazy!")
}
override def toStream: Stream[String] = {
"1" #::
"2" #::
"3" #::
Stream[String](throw new RuntimeException("Not lazy!"))
}
}
Another alternative would be to define an Iterable instead of a Traversable, and then you'd get the iterator method directly. Could you explain a bit more what your Traversable is doing in your real use case?
I've problems understanding where to put type informations in scala, and how to put it. Here I create several sequences of Actors and I don't type them. Even if I had to, I wouldn't know which type of sequence map produces to give them the proper type.
Then later when the compiler yells at me because I'm trying to sum Anys, I've no idea where to begin filling in the gaps.
Here is my code, I tried to minimize it while still letting the necessary info available.
object Actors {
def main(args: Array[String]) {
val array = randomArray(5)
val master = new Master(array, 5)
master.start
}
def randomArray(length: Int): Array[Int] = {
val generator = new Random
new Array[Int](length) map((_:Int) => generator nextInt)
}
}
class Master(array: Array[Int], slavesNumber: Int) extends Actor {
def act () {
val slaves = (1 to slavesNumber).map(_ => new Slave)
slaves.foreach(s => s.start)
val futures = slaves.map(s => s !! Work(array))
val results = awaitAll(3000, futures:_*)
val res2 = results.flatMap(x => x)
println((0 /: res2)(_+_))
}
}
class Slave() extends Actor {
def act () {
Actor.loop {
receive {
case Work(slice) =>
reply((slice :\ 0)(_+_))
}
}
}
}
I'd appreciate too some good pointers towards comprehensive doc on the matter.
The object that are passed between actors are not typed, actors have to filter the object themselves -- as you already do in the Slave actor. As you can see, !! is defined as
def !!(msg: Any): Future[Any]
so there is no type information in the returned Future. Probably the easiest solution is to replace the line var res2 .. with
val res2 = results collect {case Some(y:Int) => y}
this filters out just those Some results that are of type Int.
We have some code which needs to run faster. Its already profiled so we would like to make use of multiple threads. Usually I would setup an in memory queue, and have a number of threads taking jobs of the queue and calculating the results. For the shared data I would use a ConcurrentHashMap or similar.
I don't really want to go down that route again. From what I have read using actors will result in cleaner code and if I use akka migrating to more than 1 jvm should be easier. Is that true?
However, I don't know how to think in actors so I am not sure where to start.
To give a better idea of the problem here is some sample code:
case class Trade(price:Double, volume:Int, stock:String) {
def value(priceCalculator:PriceCalculator) =
(priceCalculator.priceFor(stock)-> price)*volume
}
class PriceCalculator {
def priceFor(stock:String) = {
Thread.sleep(20)//a slow operation which can be cached
50.0
}
}
object ValueTrades {
def valueAll(trades:List[Trade],
priceCalculator:PriceCalculator):List[(Trade,Double)] = {
trades.map { trade => (trade,trade.value(priceCalculator)) }
}
def main(args:Array[String]) {
val trades = List(
Trade(30.5, 10, "Foo"),
Trade(30.5, 20, "Foo")
//usually much longer
)
val priceCalculator = new PriceCalculator
val values = valueAll(trades, priceCalculator)
}
}
I'd appreciate it if someone with experience using actors could suggest how this would map on to actors.
This is a complement to my comment on shared results for expensive calculations. Here it is:
import scala.actors._
import Actor._
import Futures._
case class PriceFor(stock: String) // Ask for result
// The following could be an "object" as well, if it's supposed to be singleton
class PriceCalculator extends Actor {
val map = new scala.collection.mutable.HashMap[String, Future[Double]]()
def act = loop {
react {
case PriceFor(stock) => reply(map getOrElseUpdate (stock, future {
Thread.sleep(2000) // a slow operation
50.0
}))
}
}
}
Here's an usage example:
scala> val pc = new PriceCalculator; pc.start
pc: PriceCalculator = PriceCalculator#141fe06
scala> class Test(stock: String) extends Actor {
| def act = {
| println(System.currentTimeMillis().toString+": Asking for stock "+stock)
| val f = (pc !? PriceFor(stock)).asInstanceOf[Future[Double]]
| println(System.currentTimeMillis().toString+": Got the future back")
| val res = f.apply() // this blocks until the result is ready
| println(System.currentTimeMillis().toString+": Value: "+res)
| }
| }
defined class Test
scala> List("abc", "def", "abc").map(new Test(_)).map(_.start)
1269310737461: Asking for stock abc
res37: List[scala.actors.Actor] = List(Test#6d888e, Test#1203c7f, Test#163d118)
1269310737461: Asking for stock abc
1269310737461: Asking for stock def
1269310737464: Got the future back
scala> 1269310737462: Got the future back
1269310737465: Got the future back
1269310739462: Value: 50.0
1269310739462: Value: 50.0
1269310739465: Value: 50.0
scala> new Test("abc").start // Should return instantly
1269310755364: Asking for stock abc
res38: scala.actors.Actor = Test#15b5b68
1269310755365: Got the future back
scala> 1269310755367: Value: 50.0
For simple parallelization, where I throw a bunch of work out to process and then wait for it all to come back, I tend to like to use a Futures pattern.
class ActorExample {
import actors._
import Actor._
class Worker(val id: Int) extends Actor {
def busywork(i0: Int, i1: Int) = {
var sum,i = i0
while (i < i1) {
i += 1
sum += 42*i
}
sum
}
def act() { loop { react {
case (i0:Int,i1:Int) => sender ! busywork(i0,i1)
case None => exit()
}}}
}
val workforce = (1 to 4).map(i => new Worker(i)).toList
def parallelFourSums = {
workforce.foreach(_.start())
val futures = workforce.map(w => w !! ((w.id,1000000000)) );
val computed = futures.map(f => f() match {
case i:Int => i
case _ => throw new IllegalArgumentException("I wanted an int!")
})
workforce.foreach(_ ! None)
computed
}
def serialFourSums = {
val solo = workforce.head
workforce.map(w => solo.busywork(w.id,1000000000))
}
def timed(f: => List[Int]) = {
val t0 = System.nanoTime
val result = f
val t1 = System.nanoTime
(result, t1-t0)
}
def go {
val serial = timed( serialFourSums )
val parallel = timed( parallelFourSums )
println("Serial result: " + serial._1)
println("Parallel result:" + parallel._1)
printf("Serial took %.3f seconds\n",serial._2*1e-9)
printf("Parallel took %.3f seconds\n",parallel._2*1e-9)
}
}
Basically, the idea is to create a collection of workers--one per workload--and then throw all the data at them with !! which immediately gives back a future. When you try to read the future, the sender blocks until the worker's actually done with the data.
You could rewrite the above so that PriceCalculator extended Actor instead, and valueAll coordinated the return of the data.
Note that you have to be careful passing non-immutable data around.
Anyway, on the machine I'm typing this from, if you run the above you get:
scala> (new ActorExample).go
Serial result: List(-1629056553, -1629056636, -1629056761, -1629056928)
Parallel result:List(-1629056553, -1629056636, -1629056761, -1629056928)
Serial took 1.532 seconds
Parallel took 0.443 seconds
(Obviously I have at least four cores; the parallel timing varies rather a bit depending on which worker gets what processor and what else is going on on the machine.)