How to append a string to a list or array in a for loop in Scala?

var RetainKeyList: mutable.Seq[String] = new scala.collection.mutable.ListBuffer[String]()
for (element <- ts_rdd) {
  var elem1 = element._1
  var kpssSignificance: Double = 0.05
  var dOpt: Option[Int] = (0 to 2).find { diff =>
    var testTs = differencesOfOrderD(element._2, diff)
    var (stat, criticalValues) = kpsstest(testTs, "c")
    stat < criticalValues(kpssSignificance)
  }
  var d = dOpt match {
    case Some(v) => v
    case None => 300000
  }
  if (d.equals(300000)) {
    println("Bad Key: " + elem1)
    RetainKeyList += elem1
  }
}
Hi all,
I created an empty mutable ListBuffer, var RetainKeyList: mutable.Seq[String] = new scala.collection.mutable.ListBuffer[String](), and I am trying to add a string elem1 to it in a for loop.
When I try to compile the code it hangs with no error message, but if I remove the line RetainKeyList += elem1 I am able to print all of the elem1 strings properly.
What am I doing wrong here? Is there a cleaner way to collect all of the elem1 strings generated in the for loop?

Long story short, your code is running in a distributed environment, so the local collection is never modified. Every week someone asks this question; please, if you do not understand the implications of distributed computing, do not use a distributed framework like Spark.
Also, you are abusing mutability everywhere, and mutability and a distributed environment don't play nicely together.
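To illustrate with a minimal, hypothetical sketch (someRdd and buf are made-up names): the closure passed to the RDD is serialized and run on the executors, so each executor mutates its own deserialized copy of the buffer and the driver-side collection stays empty.
val buf = scala.collection.mutable.ListBuffer[String]()
someRdd.foreach { s => buf += s } // each executor mutates its own copy of buf
println(buf.size)                 // still 0 on the driver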
Anyway, here is a better way to solve your problem.
val retainKeysRdd = ts_rdd.map {
  case (elem1, elem2) =>
    val kpssSignificance = 0.05d
    val dOpt = (0 to 2).find { diff =>
      val testTs = differencesOfOrderD(elem2, diff)
      val (stat, criticalValues) = kpsstest(testTs, "c")
      stat < criticalValues(kpssSignificance)
    }
    elem1 -> dOpt
}.collect {
  case (key, None) => key
}
This returns an RDD with the keys to retain. If you are really sure you need this as a local collection and that it won't blow up your memory, you can do this:
val retainKeysList = retainKeysRdd.collect().toList

Related

Is it faster to create a new Map or clear it and use again?

I need to use many Maps in my project so I wonder which way is more efficient:
val map: mutable.Map[Int, Int] = mutable.Map.empty
for (_ <- 0 until big_number) {
  // do something with map
  map.clear()
}
or
for (_ <- 0 until big_number) {
  val map: mutable.Map[Int, Int] = mutable.Map.empty
  // do something with map
}
to use in terms of time and memory?
Well, my formal answer would always be: it depends. You need to benchmark your own scenario and see what fits it better. I'll provide an example of how you can benchmark your own code. Let's start by writing a measuring method:
def measure(name: String, f: () => Unit): Unit = {
  val start = System.currentTimeMillis()
  f()
  println(name + ": " + (System.currentTimeMillis() - start) + " ms")
}
Let's assume that in each iteration we need to insert one key-value pair into the map and then print it:
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

Await.result(Future.sequence(Seq(
  Future {
    measure("inner", () => {
      for (i <- 0 until 10) {
        val map2 = mutable.Map.empty[Int, Int]
        map2(i) = i
        println(map2)
      }
    })
  },
  Future {
    measure("outer", () => {
      val map1 = mutable.Map.empty[Int, Int]
      for (i <- 0 until 10) {
        map1(i) = i
        println(map1)
        map1.clear()
      }
    })
  })), 10.seconds)
The output in this case is almost always equal between the inner and the outer. Please note that I ran the two options in parallel; if I hadn't, the first one always took significantly more time, no matter which of them ran first.
Therefore, we can conclude that in this case they are almost the same.
But, if for example I add an immutable option:
Future {
  measure("immutable", () => {
    for (i <- 0 until 10) {
      val map1 = Map[Int, Int](i -> i)
      println(map1)
    }
  })
}
it always ends up first. This makes sense because immutable collections are much more performant than the mutable ones.
For better performance tests you probably want to use a third-party library such as ScalaMeter, or others that exist.
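For instance, a rough sketch based on ScalaMeter's inline benchmarking snippet (check the exact API against the ScalaMeter version you use; the measured body below is just the map-reuse loop from this question):
import org.scalameter._
import scala.collection.mutable

val time = config(
  Key.exec.benchRuns -> 20
) withWarmer {
  new Warmer.Default
} withMeasurer {
  new Measurer.Default
} measure {
  val map = mutable.Map.empty[Int, Int]
  for (i <- 0 until 1000) {
    map(i) = i
  }
  map.clear()
}
println(s"clear-and-reuse: $time")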

scala regex multiple integers

I have the following string that I would like to match on: 1-10 employees.
Here is my regex statement: val regex = ("\\d+").r
The problem is that I'm trying to find a way to extract the matched data and determine which returned value is bigger.
Here is what I'm doing to process it:
def setMinAndMaxValue(currentCompany: CurrentCompany, matchIterator: Iterator[Regex.Match]): CurrentCompany = {
  var max = 0
  println(s"matchIterator - $matchIterator")
  matchIterator.collect {
    case regex(s: String) => println("found string")
    case regex(IntConv(x)) =>
      println("regex case")
      if (x > max) max = x
  }
  val (minVal, maxVal) = rangesForMaxValue(max)
  val newDetails = currentCompany.details.copy(minSize = Some(minVal), maxSize = Some(maxVal))
  currentCompany.copy(details = newDetails)
}
object IntConv {
  def unapply(s: String): Option[Int] = Try {
    Some(s.toInt)
  }.toOption.flatten
}
I thought I was confused by your original question, then you clarified it with code and now I have no idea what you're trying to do.
To extract numbers from a string, try this.
val re = """(\d+)""".r
val nums = re.findAllIn(string_with_numbers).map(_.toInt).toList
Then you can just take nums.min and nums.max, and do whatever number processing you need.
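For the string from the question, a quick usage sketch:
val re = """(\d+)""".r
val nums = re.findAllIn("1-10 employees").map(_.toInt).toList
// nums == List(1, 10)
val (minVal, maxVal) = (nums.min, nums.max) // minVal == 1, maxVal == 10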

Spark job slow on cluster than standalone

I have this piece of code which works fine in standalone mode but runs slowly on a cluster of 4 slaves (8 cores, 30 GB memory each) on AWS.
For a file of 10,000 entries:
Standalone: 257 s
AWS (4 slaves): 369 s
def tabHash(nb: Int, dim: Int) = {
  var tabHash0 = Array(Array(0.0)).tail
  for (ind <- 0 to nb - 1) {
    var vechash1 = Array(0.0).tail
    for (ind <- 0 to dim - 1) {
      val nG = Random.nextGaussian
      vechash1 = vechash1 :+ nG
    }
    tabHash0 = tabHash0 :+ vechash1
  }
  tabHash0
}
def hashmin3(x: Vector, w: Double, b: Double, tabHash1: Array[Array[Double]]) = {
  var tabHash0 = Array(0.0).tail
  val x1 = x.toArray
  for (ind <- 0 to tabHash1.size - 1) {
    var sum = 0.0
    for (ind2 <- 0 to x1.size - 1) {
      sum = sum + (x1(ind2) * tabHash1(ind)(ind2))
    }
    tabHash0 = tabHash0 :+ (sum + b) / w
  }
  tabHash0
}
def pow2(tab1: Array[Double], tab2: Array[Double]) = {
  var sum = 0.0
  for (ind <- 0 to tab1.size - 1) {
    sum = sum - Math.pow(tab1(ind) - tab2(ind), 2)
  }
  sum
}
val w = ww
val b = Random.nextDouble * w
val tabHash2 = tabHash(nbseg, dim)
var rdd_0 = parsedData.map(x => (x.get_id, (x.get_vector, hashmin3(x.get_vector, w, b, tabHash2)))).cache
var rdd_Yet = rdd_0
for (ind <- 1 to maxIterForYstar) {
  var rdd_dist = rdd_Yet.cartesian(rdd_0).flatMap { case (x, y) => Some((x._1, (y._2._1, pow2(x._2._2, y._2._2)))) } //.coalesce(64)
  var rdd_knn = rdd_dist.topByKey(k)(Ordering[Double].on(x => x._2))
  var rdd_bary = rdd_knn.map(x => (x._1, Vectors.dense(bary(x._2, k))))
  rdd_Yet = rdd_bary.map(x => (x._1, (x._2, hashmin3(x._2, w, b, tabHash2))))
}
I tried to broadcast some variables
val w = sc.broadcast(ww.toDouble)
val b = sc.broadcast(Random.nextDouble * ww)
val tabHash2 = sc.broadcast(tabHash(nbseg,dim))
Without any effect.
I know the problem is not the bary function, because I tried another version of this code without hashmin3 which works fine with 4 slaves and worse with 8 slaves, but that is a topic for another question.
This is bad code, especially for distributed and large computations. I can't quickly tell what the root of the problem is, but you must rewrite this code anyway.
Array is a poor choice for general-purpose, shareable data. It is mutable and requires contiguous memory allocation (the latter can be a problem even when you have enough memory). Better to use Vector (or sometimes List). Never use arrays for this, really.
var vechash1 = Array(0.0).tail — you create a collection with one element, then call a function just to get an empty collection. If it's rare, don't worry about performance, but it's ugly! Use var vechash1: Array[Double] = Array(), or var vechash1: Vector[Double] = Vector(), or var vechash1 = Vector.empty[Double].
def tabHash(nb: Int, dim: Int) = — always set the return type of a function when it's unclear. The power of Scala is its rich type system, and it's very helpful to have compile-time checks (about what you actually get as a result, not what you imagine you get!). This is very important when dealing with huge data, because compile-time checks save you time and money. It also makes such code easier to read later: def tabHash(nb: Int, dim: Int): Vector[Vector[Double]] =
def hashmin3(x: Vector, ... — is that a typo? It doesn't compile without a type parameter.
The first function, more compactly:
def tabHash(nb: Int, dim: Int): Vector[Vector[Double]] = {
  (0 to nb - 1).map { _ =>
    (0 to dim - 1).map(_ => Random.nextGaussian()).toVector
  }.toVector
}
The second function is ((x * M) + scalar_b) / scalar_w. It may be more efficient to use a library that is specifically optimized for working with matrices.
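In the same spirit, a minimal sketch of hashmin3 without mutation (assuming x is MLlib's Vector, as the question implies; plain Scala only, no matrix library):
def hashmin3(x: org.apache.spark.mllib.linalg.Vector, w: Double, b: Double,
             tabHash1: Vector[Vector[Double]]): Vector[Double] = {
  val x1 = x.toArray
  tabHash1.map { row =>
    // dot product of one hash row with the input vector
    val dot = row.iterator.zip(x1.iterator).map { case (h, xi) => h * xi }.sum
    (dot + b) / w
  }
}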
Third (I suppose there is a mistake with the sign of the computation here, if you are computing squared error):
def pow2(tab1: Vector[Double], tab2: Vector[Double]): Double =
  tab1.zip(tab2).map { case (t1, t2) => Math.pow(t1 - t2, 2) }.sum
var rdd_Yet = rdd_0 — the cached RDD reference is overwritten inside the loop, so that storage is wasted.
The last loop is hard to analyse. I think it must be simplified.
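A rough sketch of the loop with explicit cache management (same names as in the question; persisting each iteration's result and releasing the previous one is an assumption about what the next iteration actually reuses):
var rdd_Yet = rdd_0
for (ind <- 1 to maxIterForYstar) {
  val rdd_dist = rdd_Yet.cartesian(rdd_0).map { case (x, y) =>
    (x._1, (y._2._1, pow2(x._2._2, y._2._2)))
  }
  val rdd_knn  = rdd_dist.topByKey(k)(Ordering[Double].on(x => x._2))
  val rdd_bary = rdd_knn.map { case (id, nn) => (id, Vectors.dense(bary(nn, k))) }
  val next     = rdd_bary.map { case (id, v) => (id, (v, hashmin3(v, w, b, tabHash2))) }.cache()
  next.count()                           // materialize before releasing the previous iteration
  if (rdd_Yet != rdd_0) rdd_Yet.unpersist() // keep rdd_0 cached, it is reused by cartesian
  rdd_Yet = next
}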

workaround for prepending to a LinkedHashMap in Scala?

I have a LinkedHashMap which I've been using in a typical way: adding new key-value
pairs to the end, and accessing them in order of insertion. However, now I have a
special case where I need to add pairs to the "head" of the map. I think there's
some functionality inside the LinkedHashMap source for doing this, but it has private
accessibility.
I have a solution where I create a new map, add the pair, then add all the old mappings.
In Java syntax:
newMap.put(newKey, newValue)
newMap.putAll(this.map)
this.map = newMap
It works. But the problem here is that I then need to make my main data structure
(this.map) a var rather than a val.
Can anyone think of a nicer solution? Note that I definitely need the fast lookup
functionality provided by a Map collection. The performance of prepending is not
such a big deal.
More generally, as a Scala developer how hard would you fight to avoid a var
in a case like this, assuming there's no foreseeable need for concurrency?
Would you create your own version of LinkedHashMap? Looks like a hassle frankly.
This will work but is not especially nice either:
import scala.collection.mutable.LinkedHashMap

def prepend[K, V](map: LinkedHashMap[K, V], kv: (K, V)) = {
  val copy = map.toList // toList preserves the insertion order (toMap would not)
  map.clear()
  map += kv
  map ++= copy
}
val map = LinkedHashMap('b -> 2)
prepend(map, 'a -> 1)
map == LinkedHashMap('a -> 1, 'b -> 2)
Have you taken a look at the code of LinkedHashMap? The class has a field firstEntry, and just by taking a quick peek at updateLinkedEntries, it should be relatively easy to create a subclass of LinkedHashMap which adds a new method prepend and a helper updateLinkedEntriesPrepend, resulting in the behavior you need, e.g. (not tested):
private def updateLinkedEntriesPrepend(e: Entry) {
  if (firstEntry == null) { firstEntry = e; lastEntry = e }
  else {
    val oldFirstEntry = firstEntry
    firstEntry = e
    firstEntry.later = oldFirstEntry
    oldFirstEntry.earlier = e
  }
}
Here is a sample implementation I threw together real quick (that is, not thoroughly tested!):
class MyLinkedHashMap[A, B] extends LinkedHashMap[A, B] {

  def prepend(key: A, value: B): Option[B] = {
    val e = findEntry(key)
    if (e == null) {
      val e = new Entry(key, value)
      addEntry(e)
      updateLinkedEntriesPrepend(e)
      None
    } else {
      // The key already exists, so we might as well call LinkedHashMap#put
      put(key, value)
    }
  }

  private def updateLinkedEntriesPrepend(e: Entry) {
    if (firstEntry == null) { firstEntry = e; lastEntry = e }
    else {
      val oldFirstEntry = firstEntry
      firstEntry = e
      firstEntry.later = oldFirstEntry
      oldFirstEntry.earlier = firstEntry
    }
  }
}
Tested like this:
object Main {
  def main(args: Array[String]) {
    val x = new MyLinkedHashMap[String, Int]()
    x.prepend("foo", 5)
    x.prepend("bar", 6)
    x.prepend("olol", 12)
    x.foreach(x => println("x:" + x._1 + " y: " + x._2))
  }
}
Which, on Scala 2.9.0 (yeah, need to update) results in
x:olol y: 12
x:bar y: 6
x:foo y: 5
A quick benchmark shows an order-of-magnitude performance difference between the extended built-in class and the "map rewrite" approach (I used the code from Debilski's answer in "ExternalMethod" and mine in "BuiltIn"):
benchmark length us linear runtime
ExternalMethod 10 1218.44 =
ExternalMethod 100 1250.28 =
ExternalMethod 1000 19453.59 =
ExternalMethod 10000 349297.25 ==============================
BuiltIn 10 3.10 =
BuiltIn 100 2.48 =
BuiltIn 1000 2.38 =
BuiltIn 10000 3.28 =
The benchmark code:
def timeExternalMethod(reps: Int) = {
  var r = reps
  while (r > 0) {
    for (i <- 1 to 100) prepend(map, (i, i))
    r -= 1
  }
}

def timeBuiltIn(reps: Int) = {
  var r = reps
  while (r > 0) {
    for (i <- 1 to 100) map.prepend(i, i)
    r -= 1
  }
}
Using a Scala benchmarking template.

Tune Nested Loop in Scala

I was wondering if I can tune the following Scala code :
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] = {
  var listNoDuplicates: List[(Class1, Class2)] = Nil
  for (outerIndex <- 0 until listOfTuple.size) {
    if (outerIndex != listOfTuple.size - 1)
      for (innerIndex <- outerIndex + 1 until listOfTuple.size) {
        if (listOfTuple(outerIndex)._1.flag.equals(listOfTuple(innerIndex)._1.flag))
          listNoDuplicates = listOfTuple(outerIndex) :: listNoDuplicates
      }
  }
  listNoDuplicates
}
Usually, if you have something looking like:
var accumulator: A = new A
for (b <- collection) {
  accumulator = update(accumulator, b)
}
val result = accumulator
it can be converted into something like:
val result = collection.foldLeft(new A) { (acc, b) => update(acc, b) }
So here we can first use a map to enforce the uniqueness of flags. Supposing the flag has type F:
val result = listOfTuples.foldLeft(Map[F, (ClassA, ClassB)]()) {
  (map, tuple) => map + (tuple._1.flag -> tuple)
}
Then the remaining tuples can be extracted from the map and converted to a list:
val uniqList = result.values.toList
It will keep the last tuple encountered; if you want to keep the first one, replace foldLeft with foldRight and swap the arguments of the lambda.
Example:
case class ClassA(flag: Int)
case class ClassB(value: Int)

val listOfTuples =
  List((ClassA(1), ClassB(2)), (ClassA(3), ClassB(4)), (ClassA(1), ClassB(-1)))

val result = listOfTuples.foldRight(Map[Int, (ClassA, ClassB)]()) {
  (tuple, map) => map + (tuple._1.flag -> tuple)
}

val uniqList = result.values.toList
// uniqList: List((ClassA(1),ClassB(2)), (ClassA(3),ClassB(4)))
Edit: If you need to retain the order of the initial list, use instead:
val uniqList = listOfTuples.filter( result.values.toSet )
This compiles, but as I can't test it, it's hard to say if it does "The Right Thing" (tm):
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  (for {
    outerIndex <- 0 until listOfTuple.size
    if outerIndex != listOfTuple.size - 1
    innerIndex <- outerIndex + 1 until listOfTuple.size
    if listOfTuple(outerIndex)._1.flag == listOfTuple(innerIndex)._1.flag
  } yield listOfTuple(outerIndex)).reverse.toList
Note that you can use == instead of equals (use eq if you need reference equality).
BTW: https://codereview.stackexchange.com/ is better suited for this type of question.
Do not index into lists (like listOfTuple(i)); indexed access on lists has very poor performance. So, here are some alternatives...
The easiest:
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  SortedSet(listOfTuple: _*)(Ordering by (_._1.flag)).toList
This will preserve the last element of the list. If you want it to preserve the first element, pass listOfTuple.reverse instead. Because of the sorting, performance is, at best, O(n log n). So, here's a faster way, using a mutable HashSet:
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] = {
  // Produce a hash set to find the duplicates
  import scala.collection.mutable.HashSet
  val seen = HashSet[Flag]()
  // now fold
  listOfTuple.foldLeft(Nil: List[(Class1, Class2)]) {
    case (acc, el) =>
      val result = if (seen(el._1.flag)) acc else el :: acc
      seen += el._1.flag
      result
  }.reverse
}
One can avoid using a mutable HashSet in two ways:
Make seen a var, so that it can be updated.
Pass the set along with the list being created in the fold (a full version is sketched below). The case then becomes:
case ((seen, acc), el) =>
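A minimal sketch of that second option, assuming the same Flag type for the flag field as above, threading an immutable Set through the fold so no mutable state is needed:
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  listOfTuple.foldLeft((Set.empty[Flag], List.empty[(Class1, Class2)])) {
    case ((seen, acc), el) =>
      if (seen(el._1.flag)) (seen, acc)
      else (seen + el._1.flag, el :: acc)
  }._2.reverse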