I need to build a sequence of objects that are loaded from an external resource. This loading being an expensive operation needs to be delayed until the time the objects are needed. After the collection is built, I need an indexed access to the contained objects. Does Scala standard library provide a collection suited to this use case? If not, what will be the best way to implement it?
Edit:
The indexed lookup should preferably be an O(1) operation.
Strange, Miles recently tweeted about this. A reply then points out to Need at the end of Name.scala in scalaz and another one points to specs' LazyParameter.
Little testing with Need:
import scalaz._
import Scalaz._
val longOp = { var x = 0; () => {println("longOp"); x += 1; x }}
val seq = Seq(Need(longOp()), Need(longOp()))
val sum = seq.map(_.value).sum
val sum = seq.map(_.value).sum
val v = Need(longOp())
val s = v.map("got " + _)
println(s.value)
println(s.value)
prints:
longOp: () => Int = <function0>
seq: Seq[scalaz.Name[Int]] = List(
scalaz.Name$$anon$2#1066d88, scalaz.Name$$anon$2#1011f1f)
longOp
longOp
sum: Int = 3
sum: Int = 3
v: scalaz.Name[Int] = scalaz.Name$$anon$2#133ef6a
s: scalaz.Name[java.lang.String] = scalaz.Name$$anon$2#11696ec
longOp
got 3
got 3
So longOp is only called once on first access of value.
To my knowledge, there's nothing that fit in the standard library. A solution might be to use a kind of Lazy wrapper for your objects:
class Lazy[A](operation: => A) {
lazy val get = operation
}
And then you can build your Collection[Lazy[A] with any kind of collection you want to use.
It sounds like you want a Stream.
Related
I have an coordinates RDD[(Int,Int)] and I want to create a new RDD[(Int,(Int,Int))] what is the best practice?
object GlobalVariables{
private var pointId : Int = 0
def newPointId(): Long ={
pointId += 1
pointId
}
}
points = coordinates.map(x=> (GlobalVariables.newPointID,x._1, x._2))
Is this code executed on workers or should I use a combination of broadcast Variables and accumulators?
If the code is executed on workers how can I be sure that I will not have any concurrency error?
You can try another solution without the need to have mutable counter, The transformation zipWithIndex provides a stable indexing, numbering each element in its original order.
example :
val myRdd = RDD(1,2,3)
val zippedWithIndex = myRdd.zipWithIndex // ((1,0),(2,1),(3,2))
After this first transormation you can flip the index and the value
val result = zippedWithIndex.map{case (index,value) => (value,index)} // ((0,1),(1,2),(2,3))
I want to reduce a RDD[Array[Double]] in order to each element of the array will be add with the same element of the next array.
I use this code for the moment :
var rdd1 = RDD[Array[Double]]
var coord = rdd1.reduce( (x,y) => { (x, y).zipped.map(_+_) })
Is there a better way to make this more efficiently because it cost a harm.
Using zipped.map is very inefficient, because it creates a lot of temporary objects and boxes the doubles.
If you use spire, you can just do this
> import spire.implicits._
> val rdd1 = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
> var coord = rdd1.reduce( _ + _)
res1: Array[Double] = Array(4.0, 6.0)
This is much nicer to look at, and should also be much more efficient.
Spire is a dependency of spark, so you should be able to do the above without any extra dependencies. At least it worked with a spark-shell for spark 1.3.1 here.
This will work for any array where there is an AdditiveSemigroup typeclass instance available for the element type. In this case, the element type is Double. Spire typeclasses are #specialized for double, so there will be no boxing going on anywhere.
If you really want to know what is going on to make this work, you have to use reify:
> import scala.reflect.runtime.{universe => u}
> val a = Array(1.0, 2.0)
> val b = Array(3.0, 4.0)
> u.reify { a + b }
res5: reflect.runtime.universe.Expr[Array[Double]] = Expr[scala.Array[Double]](
implicits.additiveSemigroupOps(a)(
implicits.ArrayNormedVectorSpace(
implicits.DoubleAlgebra,
implicits.DoubleAlgebra,
Predef.this.implicitly)).$plus(b))
So the addition works because there is an instance of AdditiveSemigroup for Array[Double].
I assume the concern is that you have very large Array[Double] and the transformation as written does not distribute the addition of them. If so, you could do something like (untested):
// map Array[Double] to (index, double)
val rdd2 = rdd1.flatMap(a => a.zipWithIndex.map(t => (t._2,t._1))
// get the sum for each index
val reduced = rdd2.reduceByKey(_ + _)
// key everything the same to get a single iterable in groubByKey
val groupAll = reduced.map(t => ("constKey", (t._1, t._2)
// get the doubles back together into an array
val coord = groupAll.groupByKey { (k,vs) =>
vs.toList.sortBy(_._1).toArray.map(_._2) }
I often need to do something like
coll.groupBy(f(_)).mapValues(_.foldLeft(x)(g(_,_)))
What is the best way to achieve the same effect, but avoid explicitly constructing the intermediate collections with groupBy?
You could fold the initial collection over a map holding your intermediate results:
def groupFold[A,B,X](as: Iterable[A], f: A => B, init: X, g: (X,A) => X): Map[B,X] =
as.foldLeft(Map[B,X]().withDefaultValue(init)){
case (m,a) => {
val key = f(a)
m.updated(key, g(m(key),a))
}
}
You said collection and I wrote Iterable, but you have to think whether order matters in the fold in your question.
If you want efficient code, you will probably use a mutable map as in Rex' answer.
You can't really do it as a one-liner, so you should be sure you need it before writing something more elaborate like this (written from a somewhat performance-minded view since you asked for "efficient"):
final case class Var[A](var value: A) { }
def multifold[A,B,C](xs: Traversable[A])(f: A => B)(zero: C)(g: (C,A) => C) = {
import scala.collection.JavaConverters._
val m = new java.util.HashMap[B, Var[C]]
xs.foreach{ x =>
val v = {
val fx = f(x)
val op = m.get(fx)
if (op != null) op
else { val nv = Var(zero); m.put(fx, nv); nv }
}
v.value = g(v.value, x)
}
m.asScala.mapValues(_.value)
}
(Depending on your use case you may wish to pack into an immutable map instead in the last step.) Here's an example of it in action:
scala> multifold(List("salmon","herring","haddock"))(_(0))(0)(_ + _.length)
res1: scala.collection.mutable.HashMap[Char,Int] = Map(h -> 14, s -> 6)
Now, you might notice something weird here: I'm using a Java HashMap. This is because Java's HashMaps are 2-3x faster than Scala's. (You can write the equivalent thing with a Scala HashMap, but it doesn't actually make things any faster than your original.) Consequently, this operation is 2-3x faster than what you posted. But unless you're under severe memory pressure, creating the transient collections doesn't actually hurt you much.
Given a very large instance of collection.parallel.mutable.ParHashMap (or any other parallel collection), how can one abort a filtering parallel scan once a given, say 50, number of matches has been found ?
Attempting to accumulate intermediate matches in a thread-safe "external" data structure or keeping an external AtomicInteger with result count seems to be 2 to 3 times slower on 4 cores than using a regular collection.mutable.HashMap and pegging a single core at 100%.
I am aware that find or exists on Par* collections do abort "on the inside". Is there a way to generalize this to find more than one result ?
Here's the code which still seems to be 2 to 3 times slower on the ParHashMap with ~ 79,000 entries and also has a problem of stuffing more than maxResults results into the results CHM (Which is probably due to thread being preempted after incrementAndGet but before break which allows other threads to add more elements in). Update: it seems the slow down is due to worker threads contending on the counter.incrementAndGet() which of course defeats the purpose of the whole parallel scan :-(
def find(filter: Node => Boolean, maxResults: Int): Iterable[Node] =
{
val counter = new AtomicInteger(0)
val results = new ConcurrentHashMap[Key, Node](maxResults)
import util.control.Breaks._
breakable
{
for ((key, node) <- parHashMap if filter(node))
{
results.put(key, node)
val total = counter.incrementAndGet()
if (total > maxResults) break
}
}
results.values.toArray(new Array[Node](results.size))
}
I would first do parallel scan in which variable maxResults would be threadlocal. This would find up to (maxResults * numberOfThreads) results.
Then I would do single threaded scan to reduce it to maxResults.
I had performed an interesting investigation about your case.
Investigation reasoning
I suspected the problem is with the mutability of the input Map and I will try to explain you why: HashMap implementation organizes the data in different buckets, as one can see on Wikipedia.
The first thread-safe collections in Java, the synchronized collections were based on synchronizing all the methods around the underlying implementation and resulted in poor performance. Further research and thinking brought to the more performant Concurrent Collection, such as the ConcurrentHashMap which approach was smarter : why don't we protect each bucket with a specific lock?
According to my feeling the performance problem occurs because:
when you run in parallel your filter, some threads will conflict on accessing the same bucket at once and will hit the same lock, because your map is mutable.
You hold a counter to see how many results you have while you can actually check the size of your
result. If you have a thread-safe way to build a collection, you don't need a thread-safe counter too.
Investigation result
I have developed a test case and I find out I was wrong. The problem is with the concurrent nature of the output map. In fact, that is where the collision occurs, when you are putting elements in the map, rather then when you are iterating on it. Additionally, since you want only the result on values, you don't need the keys and the hashing and all the map features. It might be interesting to test if you remove the AtomicCounter and you use only the result map to check if you collected enough elements how your version performs.
Please be careful with the following code in Scala 2.9.2. I am explaining in another post why I need two different functions for the parallel and the non parallel version: Calling map on a parallel collection via a reference to an ancestor type
object MapPerformance {
val size = 100000
val items = Seq.tabulate(size)( x => (x,x*2))
val concurrentParallelMap = ImmutableParHashMap(items:_*)
val concurrentMutableParallelMap = MutableParHashMap(items:_*)
val unparallelMap = Map(items:_*)
class ThreadSafeIndexedSeqBuilder[T](maxSize:Int) {
val underlyingBuilder = new VectorBuilder[T]()
var counter = 0
def sizeHint(hint:Int) { underlyingBuilder.sizeHint(hint) }
def +=(item:T):Boolean ={
synchronized{
if(counter>=maxSize)
false
else{
underlyingBuilder+=item
counter+=1
true
}
}
}
def result():Vector[T] = underlyingBuilder.result()
}
def find(map:ParMap[Int,Int],filter: Int => Boolean, maxResults: Int): Iterable[Int] =
{
// we already know the maximum size
val resultsBuilder = new ThreadSafeIndexedSeqBuilder[Int](maxResults)
resultsBuilder.sizeHint(maxResults)
import util.control.Breaks._
breakable
{
for ((key, node) <- map if filter(node))
{
val newItemAdded = resultsBuilder+=node
if (!newItemAdded)
break()
}
}
resultsBuilder.result().seq
}
def findUnParallel(map:Map[Int,Int],filter: Int => Boolean, maxResults: Int): Iterable[Int] =
{
// we already know the maximum size
val resultsBuilder = Array.newBuilder[Int]
resultsBuilder.sizeHint(maxResults)
var counter = 0
for {
(key, node) <- map if filter(node)
if counter < maxResults
}{
resultsBuilder+=node
counter+=1
}
resultsBuilder.result()
}
def measureTime[K](f: => K):(Long,K) = {
val startMutable = System.currentTimeMillis()
val result = f
val endMutable = System.currentTimeMillis()
(endMutable-startMutable,result)
}
def main(args:Array[String]) = {
val maxResultSetting=10
(1 to 10).foreach{
tryNumber =>
println("Try number " +tryNumber)
val (mutableTime, mutableResult) = measureTime(find(concurrentMutableParallelMap,_%2==0,maxResultSetting))
val (immutableTime, immutableResult) = measureTime(find(concurrentMutableParallelMap,_%2==0,maxResultSetting))
val (unparallelTime, unparallelResult) = measureTime(findUnParallel(unparallelMap,_%2==0,maxResultSetting))
assert(mutableResult.size==maxResultSetting)
assert(immutableResult.size==maxResultSetting)
assert(unparallelResult.size==maxResultSetting)
println(" The mutable version has taken " + mutableTime + " milliseconds")
println(" The immutable version has taken " + immutableTime + " milliseconds")
println(" The unparallel version has taken " + unparallelTime + " milliseconds")
}
}
}
With this code, I have systematically the parallel (both mutable and immutable version of the input map) about 3,5 time faster then the unparallel on my machine.
You could try to get an iterator and then create a lazy list (a Stream) where you filter (with your predicate) and take the number of elements you want. Because it is a non strict, this 'taking' of elements is not evaluated.
Afterwards you can force the execution by adding ".par" to the whole thing and achieve parallelization.
Example code:
A parallelized map with random values (simulating your parallel hash map):
scala> myMap
res14: scala.collection.parallel.immutable.ParMap[Int,Int] = ParMap(66978401 -> -1331298976, 256964068 -> 126442706, 1698061835 -> 1622679396, -1556333580 -> -1737927220, 791194343 -> -591951714, -1907806173 -> 365922424, 1970481797 -> 162004380, -475841243 -> -445098544, -33856724 -> -1418863050, 1851826878 -> 64176692, 1797820893 -> 405915272, -1838192182 -> 1152824098, 1028423518 -> -2124589278, -670924872 -> 1056679706, 1530917115 -> 1265988738, -808655189 -> -1742792788, 873935965 -> 733748120, -1026980400 -> -163182914, 576661388 -> 900607992, -1950678599 -> -731236098)
Get an iterator and create a Stream from the iterator and filter it.
In this case my predicate is only accepting pairs (of the value member of the map).
I want to get 10 even elements, so I take 10 elements which will only get evaluated when I force it to:
scala> val mapIterator = myMap.toIterator
mapIterator: Iterator[(Int, Int)] = HashTrieIterator(20)
scala> val r = Stream.continually(mapIterator.next()).filter(_._2 % 2 == 0).take(10)
r: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), ?)
Finally, I force the evaluation which only gets 10 elements as planned
scala> r.force
res16: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), (256964068,126442706), (1698061835,1622679396), (-1556333580,-1737927220), (791194343,-591951714), (-1907806173,365922424), (1970481797,162004380), (-475841243,-445098544), (-33856724,-1418863050), (1851826878,64176692))
This way you only get the number of elements you want (without needing to process the remaining elements) and you parallelize the process without locks, atomics or breaks.
Please compare this to your solutions to see if it is any good.
Is there any reason for Scala not support the ++ operator to increment primitive types by default?
For example, you can not write:
var i=0
i++
Thanks
My guess is this was omitted because it would only work for mutable variables, and it would not make sense for immutable values. Perhaps it was decided that the ++ operator doesn't scream assignment, so including it may lead to mistakes with regard to whether or not you are mutating the variable.
I feel that something like this is safe to do (on one line):
i++
but this would be a bad practice (in any language):
var x = i++
You don't want to mix assignment statements and side effects/mutation.
I like Craig's answer, but I think the point has to be more strongly made.
There are no "primitives" -- if Int can do it, then so can a user-made Complex (for example).
Basic usage of ++ would be like this:
var x = 1 // or Complex(1, 0)
x++
How do you implement ++ in class Complex? Assuming that, like Int, the object is immutable, then the ++ method needs to return a new object, but that new object has to be assigned.
It would require a new language feature. For instance, let's say we create an assign keyword. The type signature would need to be changed as well, to indicate that ++ is not returning a Complex, but assigning it to whatever field is holding the present object. In Scala spirit of not intruding in the programmers namespace, let's say we do that by prefixing the type with #.
Then it could be like this:
case class Complex(real: Double = 0, imaginary: Double = 0) {
def ++: #Complex = {
assign copy(real = real + 1)
// instead of return copy(real = real + 1)
}
The next problem is that postfix operators suck with Scala rules. For instance:
def inc(x: Int) = {
x++
x
}
Because of Scala rules, that is the same thing as:
def inc(x: Int) = { x ++ x }
Which wasn't the intent. Now, Scala privileges a flowing style: obj method param method param method param .... That mixes well C++/Java traditional syntax of object method parameter with functional programming concept of pipelining an input through multiple functions to get the end result. This style has been recently called "fluent interfaces" as well.
The problem is that, by privileging that style, it cripples postfix operators (and prefix ones, but Scala barely has them anyway). So, in the end, Scala would have to make big changes, and it would be able to measure up to the elegance of C/Java's increment and decrement operators anyway -- unless it really departed from the kind of thing it does support.
In Scala, ++ is a valid method, and no method implies assignment. Only = can do that.
A longer answer is that languages like C++ and Java treat ++ specially, and Scala treats = specially, and in an inconsistent way.
In Scala when you write i += 1 the compiler first looks for a method called += on the Int. It's not there so next it does it's magic on = and tries to compile the line as if it read i = i + 1. If you write i++ then Scala will call the method ++ on i and assign the result to... nothing. Because only = means assignment. You could write i ++= 1 but that kind of defeats the purpose.
The fact that Scala supports method names like += is already controversial and some people think it's operator overloading. They could have added special behavior for ++ but then it would no longer be a valid method name (like =) and it would be one more thing to remember.
I think the reasoning in part is that +=1 is only one more character, and ++ is used pretty heavily in the collections code for concatenation. So it keeps the code cleaner.
Also, Scala encourages immutable variables, and ++ is intrinsically a mutating operation. If you require +=, at least you can force all your mutations to go through a common assignment procedure (e.g. def a_=).
The primary reason is that there is not the need in Scala, as in C. In C you are constantly:
for(i = 0, i < 10; i++)
{
//Do stuff
}
C++ has added higher level methods for avoiding for explicit loops, but Scala has much gone further providing foreach, map, flatMap foldLeft etc. Even if you actually want to operate on a sequence of Integers rather than just cycling though a collection of non integer objects, you can use Scala range.
(1 to 5) map (_ * 3) //Vector(3, 6, 9, 12, 15)
(1 to 10 by 3) map (_ + 5)//Vector(6, 9, 12, 15)
Because the ++ operator is used by the collection library, I feel its better to avoid its use in non collection classes. I used to use ++ as a value returning method in my Util package package object as so:
implicit class RichInt2(n: Int)
{
def isOdd: Boolean = if (n % 2 == 1) true else false
def isEven: Boolean = if (n % 2 == 0) true else false
def ++ : Int = n + 1
def -- : Int = n - 1
}
But I removed it. Most of the times when I have used ++ or + 1 on an integer, I have later found a better way, which doesn't require it.
It is possible if you define you own class which can simulate the desired output however it may be a pain if you want to use normal "Int" methods as well since you would have to always use *()
import scala.language.postfixOps //otherwise it will throw warning when trying to do num++
/*
* my custom int class which can do ++ and --
*/
class int(value: Int) {
var mValue = value
//Post-increment
def ++(): int = {
val toReturn = new int(mValue)
mValue += 1
return toReturn
}
//Post-decrement
def --(): int = {
val toReturn = new int(mValue)
mValue -= 1
return toReturn
}
//a readable toString
override def toString(): String = {
return mValue.toString
}
}
//Pre-increment
def ++(n: int): int = {
n.mValue += 1
return n;
}
//Pre-decrement
def --(n: int): int = {
n.mValue -= 1
return n;
}
//Something to get normal Int
def *(n: int): Int = {
return n.mValue
}
Some possible test cases
scala>var num = new int(4)
num: int = 4
scala>num++
res0: int = 4
scala>num
res1: int = 5 // it works although scala always makes new resources
scala>++(num) //parentheses are required
res2: int = 6
scala>num
res3: int = 6
scala>++(num)++ //complex function
res4: int = 7
scala>num
res5: int = 8
scala>*(num) + *(num) //testing operator_*
res6: Int = 16
Of course you can have that in Scala, if you really want:
import scalaz._
import Scalaz._
case class IncLens[S,N](lens: Lens[S,N], num : Numeric[N]) {
def ++ = lens.mods(num.plus(_, num.one))
}
implicit def incLens[S,N:Numeric](lens: Lens[S,N]) =
IncLens[S,N](lens, implicitly[Numeric[N]])
val i = Lens[Int,Int](identity, (x, y) => y)
val imperativeProgram = for {
_ <- i := 0;
_ <- i++;
_ <- i++;
x <- i++
} yield x
def runProgram = imperativeProgram ! 0
And here you go:
scala> runProgram
runProgram: Int = 3
It isn't included because Scala developers thought it make the specification more complex while achieving only negligible benefits and because Scala doesn't have operators at all.
You could write your own one like this:
class PlusPlusInt(i: Int){
def ++ = i+1
}
implicit def int2PlusPlusInt(i: Int) = new PlusPlusInt(i)
val a = 5++
// a is 6
But I'm sure you will get into some trouble with precedence not working as you expect. Additionally if i++ would be added, people would ask for ++i too, which doesn't really fit into Scala's syntax.
Lets define a var:
var i = 0
++i is already short enough:
{i+=1;i}
Now i++ can look like this:
i(i+=1)
To use above syntax, define somewhere inside a package object, and then import:
class IntPostOp(val i: Int) { def apply(op: Unit) = { op; i } }
implicit def int2IntPostOp(i: Int): IntPostOp = new IntPostOp(i)
Operators chaining is also possible:
i(i+=1)(i%=array.size)(i&=3)
The above example is similar to this Java (C++?) code:
i=(i=i++ %array.length)&3;
The style could depend, of course.