In which order will the elements of Scala Buffer be accessed in a for loop?

for (elt <- bufferObject) // bufferObject: scala.collection.mutable.Buffer
  // do something with the element of the collection
In which order will the elements in the for loop be accessed? Randomly?
From the Scala API one can see that Buffer is a subclass of Seq, in which the elements are ordered. Does this also hold for the loop above?

Here's a selection of the super-types of mutable.Buffer[A], and the traversal guarantees they provide:
Seq[A] -- All elements have a position, with an index associated; they are always traversed one by one, from lowest index to highest index.
GenSeq[A] -- All elements have a position, with an index associated; they may be traversed one by one or in parallel; if a new collection is generated, the position of the elements in the new collection will correspond to the old collection, even if the traversal is in parallel.
Iterable[A] -- Elements may be traversed in any order, but will always be traversed in the same order (that is, it can't change from one iteration to another); they'll be traversed one by one.
GenIterable[A] -- Elements may be traversed in any order, but will always be traversed in the same order (that is, it can't change from one iteration to another); traversal may happen one by one or in parallel; if a new collection is generated, the position of the elements in the new collection will correspond to the old collection, even if the traversal is in parallel.
Traversable[A] -- Same guarantees as Iterable[A], with the additional limitation that you can interrupt traversal, but you cannot determine when the next element will be chosen (which you can in Iterable[A] and descendants, by producing an Iterator).
GenTraversable[A] -- Same guarantees as GenIterable[A], with the additional limitation that you can interrupt traversal, but you cannot determine when the next element will be chosen (which you can in GenIterable[A] and descendants, by producing an Iterator).
TraversableOnce[A] -- Same guarantees as in Traversable[A], with the additional limitation that you might not be able to traverse the elements more than once.
GenTraversableOnce[A] -- Same guarantees as in GenTraversable[A], with the additional limitation that you might not be able to traverse the elements more than once.
All of these guarantees apply at once, and the least-restrictive ones win, which effectively means that everything said about Seq[A] holds true for mutable.Buffer[A].
Now, to the for loop:
for (elt <- bufferObject)
  doSomething(elt)
is the same thing as:
bufferObject.foreach(elt => doSomething(elt))
And
for (elt <- bufferObject)
  yield makeSomething(elt)
is the same thing as:
bufferObject.map(elt => makeSomething(elt))
In fact, all for variants will be translated into methods available on Buffer (or whatever other type you have there), so the guarantees provided by the collections all apply. Note, for instance, that a GenSeq used with map (for-yield) may process all elements in parallel, but will produce a collection where newCollection(i) == makeSomething(bufferObject(i)), that is, the indices are preserved.
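For a concrete feel of that last point, here's a small sketch (assuming Scala 2.12, where parallel collections still ship with the standard library; .par and the ParArray result are version-dependent):
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3, 4, 5)
// map over the parallel view may process elements concurrently...
val doubled = buf.par.map(_ * 2)
// ...but positions line up with the source: doubled(i) == buf(i) * 2
println(doubled) // ParArray(2, 4, 6, 8, 10)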

Yes, the for-comprehension will desugar to some combination of map, flatMap, and foreach, and these all follow the Seq's defined order.

Unless you're using parallel collections (via the par method), the order of operations (for comprehensions, map, foreach, and other sequential methods) on a mutable Buffer is guaranteed.

It should be ordered. Buffer defaults to ArrayBuffer, I believe.
scala> import scala.collection.mutable.Buffer
import scala.collection.mutable.Buffer
scala> val x = Buffer(1, 2, 3, 4, 5)
x: scala.collection.mutable.Buffer[Int] = ArrayBuffer(1, 2, 3, 4, 5)
scala> for (y <- x) println(y)
1
2
3
4
5

Related

Is there any way to speed up Map.getOrElse(val, 0) on big maps?

I have a simple immutable Map in Scala:
// "..." stands for more entries of the same shape
val myLibrary = Map("qwe" -> 1.2, "qasd" -> -0.59, ...)
And for that map I call a MyFind method, which calls getOrElse(str, 0):
def MyFind(srcMap: Map[String, Double], str: String): Double =
  srcMap.getOrElse(str, 0.0)
val res = MyFind(myLibrary, "qwe")
The problem is that this method is called several times for different input strings. As I understand it, for a map of length 100 and one input string it will compare that string 100 times (once per map entry). For 10,000 entries that would mean 10,000 comparisons.
Because of that, with a huge map of over 10,000 entries, my method that looks up string keys in that map significantly slows down the work.
What can you advise to speed up this code?
Maybe use another type of Map?
Maybe another collection?
Map does not have linear-time lookup. The default concrete implementation of Map is HashMap:
Map is the interface for immutable maps while scala.collection.immutable.HashMap is a concrete implementation.
which has effectively constant lookup time, as per the collections performance characteristics:
                 lookup  add  remove  min
HashSet/HashMap  eC      eC   eC      L
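You can check the default implementation yourself (a quick sketch; the exact runtime class name varies across Scala versions):
val m = Map("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "e" -> 5) // more than 4 entries
println(m.getClass.getName) // a scala.collection.immutable.HashMap variant, not a linear structure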
As I understand it, for a map of length 100 and one input string it will compare that string 100 times (once per map entry). For 10,000 entries that would mean 10,000 comparisons.
No, it won't. That's rather the point of Map in the first place. While it allows implementations that do require checking each value one by one (such as ListMap), those are very rarely used, and by default, calling Map(...) gives you a HashMap, which doesn't. Its lookup is logarithmic time (with a large base), so going from 100 to 10,000 entries roughly doubles the lookup cost instead of increasing it 100 times.
Because of that, with a huge map of over 10,000 entries, my method that looks up string keys in that map significantly slows down the work.
10000 is quite small.
Actually, look at http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html#performance. You can also see there that mutable maps are much faster than immutable ones. Note that this benchmark predates the collection changes in Scala 2.13, so the numbers may have changed.
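As a rough illustration (a sketch, not a rigorous benchmark; JIT warm-up dominates single measurements), lookup cost barely moves as the map grows:
def time[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - t0) / 1000} us")
  result
}

val small = (1 to 100).map(i => s"key$i" -> i).toMap
val big   = (1 to 10000).map(i => s"key$i" -> i).toMap

time("lookup in 100-entry map")    { small.getOrElse("key50", 0) }
time("lookup in 10000-entry map")  { big.getOrElse("key5000", 0) }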

Multiple types in a list?

Rephrasing of my question:
I am writing a program that implements a data mining algorithm. In this program I want to save the input data which is to be mined. Imagine the input data as a table with rows and columns. Each row is going to be represented by an instance of my Scala class (the one in question). The columns of the input data can be of different types (Integer, Double, String, whatnot), and which type each column has depends on the input data. I need a way to store a row inside my Scala class instance. Thus I need an ordered collection (like a special List) that can hold (many) different types as elements, where the types may be determined only at runtime. How can I do this? A Vector or a List requires that all elements be of the same type. A Tuple can hold different types (which can be determined at runtime, if I am not mistaken), but only up to 22 elements, which is too few.
Bonus (not sure if I am asking too much now):
I would also like the rows' columns to be named and accessible by name. However, I think this problem can easily be solved by using two lists. (Although I just read about this issue somewhere - but I forgot where - and I think it was solved more elegantly.)
It might be good for my collection to be random access (so Vector rather than List).
Having linear algebra (matrix multiplication etc.) capabilities would be nice.
Even more bonus: if I could save matrices.
Old phrasing of my question:
I would like to have something like a data.frame as we know it from R in Scala, but I am only going to need one row. This row is going to be a member of a class. The reason for this construct is that I want methods related to each row to be close to the data itself. Each data row is also supposed to have metadata about itself, and it will be possible to supply functions so that different rows are manipulated differently. However, I need to store rows somehow within the class. A List or Vector comes to mind, but they require all elements to be of the same type (all Integer, all String, etc.), whereas, as we know from data.frame, different columns (here: elements in the Vector or List) can be of different types. I would also like to save the name of each column to be able to access the row values by column name. That seems the smallest issue, though. I hope it is clear what I mean. How can I implement this?
DataFrames in R are heterogeneous lists of homogeneous column vectors:
> df <- data.frame(c1=c(r1=1,r2=2), c2=c('a', 'b')); df
c1 c2
r1 1 a
r2 2 b
You could think of each row as a heterogeneous list of scalar values:
> as.list(df['r1',])
$c1
[1] 1
$c2
[1] a
An analogous implementation in Scala would be a tuple of lists:
scala> val df = (List(1, 2), List('a', 'b'))
df: (List[Int], List[Char]) = (List(1, 2),List(a, b))
Each row could then just be a tuple:
scala> val r1 = (1, 'a')
r1: (Int, Char) = (1,a)
If you want to name all your variables, another possibility is a case class:
scala> case class Row (col1:Int, col2:Char)
defined class Row
scala> val r1 = Row(col1=1, col2='a')
r1: Row = Row(1,a)
Hope that helps bridge the R-to-Scala divide.
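If the column types really are only known at runtime, one further option (a sketch that trades away static type safety; all names below are made up for illustration) is to hold each row as an ordered sequence of column-name/value pairs typed as Any and recover the types by pattern matching:
val row: Vector[(String, Any)] = Vector("id" -> 1, "name" -> "a", "score" -> 1.2)

// Recover the runtime type of each cell by pattern matching.
def describe(value: Any): String = value match {
  case i: Int    => s"Int: $i"
  case d: Double => s"Double: $d"
  case s: String => s"String: $s"
  case other     => s"unknown: $other"
}

row.foreach { case (name, value) => println(s"$name -> ${describe(value)}") }
// Access by column name:
val score = row.collectFirst { case ("score", v) => v } // Some(1.2)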

Remove all items in PriorityQueue with value less than x

Is there a way to remove all items in a Scala PriorityQueue with a value less than a specified value?
E.g.
val queue = scala.collection.mutable.PriorityQueue[Int]()
queue.enqueue(3)
queue.enqueue(5)
queue.enqueue(10)
queue.enqueue(8)
queue.removeAllLessThan(6)
println(queue) // PriorityQueue(10, 8)
I know you could do this using filter, but it seems like there should be a very efficient way of doing this on a heap.
The reason I want to do this is to keep the memory footprint low for an A* algorithm.
In a PriorityQueue, ordering is maintained on insertion: a new element is inserted at a position where the queue's ordering is preserved. In this case, using the default ordering on Int (largest first),
queue.takeWhile(_ > 6)
iterates over the queue only while the predicate holds, in contrast with filter, which scrutinises each and every item.
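One caveat worth hedging: the Scaladoc for mutable.PriorityQueue notes that only dequeue and dequeueAll are guaranteed to return elements in priority order; plain iteration (which takeWhile relies on) may traverse the underlying heap in whatever order is convenient. Here is a sketch of the question's hypothetical removeAllLessThan built on dequeue instead (it returns a new queue rather than mutating in place):
import scala.collection.mutable.PriorityQueue

def removeAllLessThan(queue: PriorityQueue[Int], bound: Int): PriorityQueue[Int] = {
  val kept = PriorityQueue.empty[Int]
  // dequeue yields elements largest-first under the default Int ordering,
  // so we can stop as soon as we reach a value below the bound
  while (queue.nonEmpty && queue.head >= bound)
    kept.enqueue(queue.dequeue())
  kept
}

val queue = PriorityQueue(3, 5, 10, 8)
println(removeAllLessThan(queue, 6)) // PriorityQueue(10, 8)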

Looping through a LinkedHashSet using indexes in Scala

I have a LinkedHashSet which was created from a Seq. I used a LinkedHashSet because I need to keep the order of the Seq, but also ensure uniqueness, like a Set. I need to check this LinkedHashSet against another sequence to verify that various properties within them are the same. I assumed that I could loop through using an index, i, but it appears not. Here is an example of what I would like to accomplish.
val s: Seq[Int] = 1 to mySeq.size
return s.forall { i =>
  myLHS.indexOf(i).something == mySeq.indexOf(i).something &&
  myLHS.indexOf(i).somethingelse == mySeq.indexOf(i).somethingelse
}
So how do I access individual elements of the LHS?
Consider using the zip method on collections to create a collection of pairs (tuples). The right orientation depends on your use case: mySeq.zip(myLHS) and myLHS.zip(mySeq) create different structures. You probably want mySeq.zip(myLHS), but I'm guessing. Also, if the collections are very large, you may want to take a view first, e.g. mySeq.view.zip(myLHS), so that the pair collection is also non-strict.
Once you have this combined collection, you can use a for comprehension (or directly, myZip.foreach) to traverse it.
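For concreteness, here is a sketch of the zip approach; the Item type and its fields something/somethingelse are made up to stand in for the asker's data:
import scala.collection.mutable.LinkedHashSet

case class Item(something: Int, somethingelse: String)

val mySeq = Seq(Item(1, "a"), Item(2, "b"))
val myLHS = LinkedHashSet.empty[Item] ++= mySeq // iterates in insertion order

// zip pairs elements positionally, so no index arithmetic is needed
val allMatch = mySeq.zip(myLHS).forall { case (s, l) =>
  s.something == l.something && s.somethingelse == l.somethingelse
}
println(allMatch) // true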
A LinkedHashSet is not necessary in this situation. Since I made it from a Seq, it is already ordered. I do not have to convert it to a LHS in order to also make it unique. Apparently, Seq has the distinct method which will remove duplicates from the sequence. From there, I can access the items via their indexes.
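And the self-answer's approach in miniature: distinct keeps the first occurrence of each element and preserves order, so the de-duplicated Seq stays index-accessible.
val mySeq = Seq(3, 1, 3, 2, 1)
val unique = mySeq.distinct // Seq(3, 1, 2): first occurrences, order preserved
println(unique(0))          // 3, ordinary index access still works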

Getting the number of values within a reduceByKey RDD

When the reduceByKey operation is called, it receives the list of values for a particular key. My questions are:
1. Are the values it receives in sorted order?
2. Is it possible to know how many values it receives?
3. I'm trying to calculate the first quartile of the list of values for a key within reduceByKey. Is this possible to do within reduceByKey?
1. No, that would totally go against the whole point of a reduce operation, i.e. to parallelize an operation into an arbitrary tree of sub-operations by taking advantage of associativity and commutativity.
2. You'll need to define a new monoid by composing the integer monoid with whatever it is you're doing. Assuming your operation is some placeholder op, then
yourRdd.map(kv => (kv._1, (kv._2, 1)))
  .reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))], where the Int is the number of values the reduce received for each key.
3. You'll have to be more specific about what you mean by first quartile. Given that the answer to 1 is no, you would need a bound that defines the first quartile; then you won't need the data to be sorted, because you could filter the values by that bound.
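For concreteness, here is the pattern from point 2 instantiated with op = + (a sketch assuming an existing SparkContext named sc; the sample data is made up):
val yourRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 4), ("b", 5)))

val sumsWithCounts = yourRdd
  .map { case (k, v) => (k, (v, 1)) } // pair each value with a count of 1
  .reduceByKey((left, right) => (left._1 + right._1, left._2 + right._2)) // sum values, sum counts

sumsWithCounts.collect().foreach(println) // (a,(7,3)) and (b,(5,1)), in some order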