Scala vector splicing (maintaining sorted sequence)

I need to maintain a sorted sequence (mutable or immutable — I don't care), dynamically inserting elements into the middle of it (to keep it sorted) and removing them likewise (so, random access by index is crucial).
The best thing I came up with is using a Vector and scala.collection.Searching from 2.11, and then:
var vector: Vector[Ordered]
...
val ip = vector.search(element)
Inserting:
vector = (vector.take(ip.insertionPoint) :+ element) ++ vector.drop(ip.insertionPoint)
Deleting:
vector.patch(from = ip.insertionPoint, patch = Nil, replaced = 1)
Doesn't look elegant to me, and I suspect performance issues. Is there a better way? Splicing sequences seems like a very basic operation to me, but I can't find an elegant solution.

You should use SortedSet. The default implementation of SortedSet is an immutable red-black tree; there is also a mutable implementation.
SortedSet[Int]() + 5 + 3 + 4 + 7 + 1
// SortedSet[Int] = TreeSet(1, 3, 4, 5, 7)
A Set contains no duplicate elements. If you want to count duplicate elements, you could use a SortedMap[Key, Int] with elements as keys and counts as values. See this answer for MultiSet emulation using a Map.
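For example, a minimal sketch of that counting approach (the add/remove helpers here are just illustrative, not a library API):
import scala.collection.immutable.SortedMap

var bag = SortedMap.empty[Int, Int]              // element -> occurrence count, keys stay sorted

def add(x: Int): Unit = bag = bag.updated(x, bag.getOrElse(x, 0) + 1)

def remove(x: Int): Unit = bag.get(x) match {
  case Some(1) => bag -= x                       // last copy: drop the key entirely
  case Some(n) => bag = bag.updated(x, n - 1)    // decrement the count
  case None    => ()                             // nothing to remove
}

add(5); add(3); add(3); add(7)
// bag == SortedMap(3 -> 2, 5 -> 1, 7 -> 1)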

Related

Scala: For loop that matches ints in a List

New to Scala. I'm iterating a for loop 100 times. 10 times I want condition 'a' to be met and 90 times condition 'b'. However I want the 10 a's to occur at random.
The best way I can think of is to create a val of 10 random integers, then loop through the ints 1 to 100.
For example:
val z = List.fill(10)(100).map(scala.util.Random.nextInt)
z: List[Int] = List(71, 5, 2, 9, 26, 96, 69, 26, 92, 4)
Then something like:
for (i <- 1 to 100) {
  whenever i == to a number in z: 'Condition a met: do something'
  else {
    'condition b met: do something else'
  }
}
I tried using contains and == and =! but nothing seemed to work. How else can I do this?
Your generation of random numbers could yield duplicates... is that OK? Here's how you can easily generate 10 unique numbers from 1-100 (by generating a randomly shuffled sequence of 1 to 100 and taking the first ten):
val r = scala.util.Random.shuffle(1 to 100).toList.take(10)
Now you can simply partition the range 1-100 into those that are contained in your randomly generated list and those that are not:
val (listOfA, listOfB) = (1 to 100).partition(r.contains(_))
Now do whatever you want with those two lists, e.g.:
println(listOfA.mkString(","))
println(listOfB.mkString(","))
Of course, you can always simply go through the list one by one:
(1 to 100).map {
  case i if (r.contains(i)) => println("yes: " + i) // or whatever
  case i => println("no: " + i)
}
What you consider to be a simple for-loop actually isn't one. It's a for-comprehension, syntactic sugar that desugars into chained calls to map, flatMap and withFilter. Yes, it can be used in the same way as you would use the classical for-loop, but this is only because List is in fact a monad. Without going into too much detail, if you want to do things the idiomatic Scala way (the "functional" way), you should avoid trying to write classical iterative for-loops and prefer getting a collection of your data and then mapping over its elements to perform whatever it is that you need. Note that collections have a really rich library behind them which allows you to invoke cool methods such as partition.
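As a rough sketch of that desugaring (illustrative, not the exact compiler output), a for-comprehension with a guard and a yield becomes a withFilter followed by a map:
val squaresOfEvens = for (i <- 1 to 10 if i % 2 == 0) yield i * i
// is roughly equivalent to
val squaresOfEvens2 = (1 to 10).withFilter(_ % 2 == 0).map(i => i * i)
// both: Vector(4, 16, 36, 64, 100)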
EDIT (for completeness):
Also, you should avoid side effects, or at least push them as far down the road as possible. I'm talking about the second example from my answer. Let's say you really need to log that stuff (you would be using a logger, but println is good enough for this example). Performing the logging inside the mapping itself is bad. By the way, note that you could use foreach instead of map in that case, because you're not collecting results, just performing the side effects.
A better way is to compute what you need by mapping each element into an appropriate string, accumulating the strings into results:
val results = (1 to 100).map {
  case i if (r.contains(i)) => ("yes: " + i) // or whatever
  case i => ("no: " + i)
}
// do whatever with results, e.g. print them
Now results contains a hundred "yes x" and "no x" strings, but you didn't do the ugly thing and perform logging as a side effect in the mapping process. Instead, you mapped each element of the collection into a corresponding string (note that the original collection remains intact, so if (1 to 100) was stored in some value, it's still there; mapping creates a new collection) and now you can do whatever you want with it, e.g. pass it on to the logger. Yes, at some point you need to do "the ugly side effect thing" and log the stuff, but at least you will have a dedicated piece of code for doing that, and you will not be mixing it into the mapping logic that checks whether a number is contained in the random sequence.
(1 to 100).foreach { x =>
  if (z.contains(x)) {
    // do something
  } else {
    // do something else
  }
}
or you can use a partial function, like so:
(1 to 100).foreach {
  case x if (z.contains(x)) => // do something
  case _ => // do something else
}

filling a matrix with Scala library breeze

I'm new to Scala and I'm having a mental block on a seemingly easy problem. I'm using the Scala library breeze and need to take an array buffer (mutable) and put the results into a matrix. This should be simple, but Scala is so strictly typed that breeze seems really picky about what data types it will take when making a DenseVector. This is just some prototype code, but can anyone help me come up with a solution?
Right now I have something like...
//9 elements that need to go into a 3x3 matrix, 1-3 as top row, 4-6 as middle row, etc)
val numbersForMatrix: ArrayBuffer[Double] = ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)
//the empty 3x3 matrix
var M: breeze.linalg.DenseMatrix[Double] = DenseMatrix.zeros(3,3)
In breeze you can do stuff like
M(0, 0) = 100, which sets the first element to 100.
You can also do stuff like:
M(0, 0 to 2) := DenseVector(1, 2, 3)
which sets the first row to 1, 2, 3
But I cannot get it to do something like...
var dummyList: List[Double] = List(1, 2, 3) //this works
var dummyVec = DenseVector[Double](dummyList) //this works
M(0, 0 to 2) := dummyVec //this does not work
and successfully change the first row to 1, 2, 3.
And that's with a List, not even an ArrayBuffer.
Am willing to change datatypes from ArrayBuffer but just not sure how to approach this at all... could try updating the matrix values one by one but that seems like it would be VERY hacky to code up(?).
Note: I'm a Python programmer who is used to using numpy and just giving it arrays. The breeze documentation doesn't provide enough examples with other datatypes for me to have been able to figure this out yet.
Thanks!
Breeze is, in addition to pickiness over types, pretty picky about vector shape: DenseVectors are column vectors, but you are trying to assign to a subset of a row, which expects a transposed DenseVector:
M(0, 0 to 2) := dummyVec.t
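Beyond the original answer, a hedged sketch of filling the whole matrix from the buffer at once: this assumes Breeze's DenseMatrix(rows, cols, data) constructor, which takes a flat Array and fills it column-major, so a row-by-row buffer needs a transpose (verify against your Breeze version):
import breeze.linalg.DenseMatrix
import scala.collection.mutable.ArrayBuffer

val numbersForMatrix = ArrayBuffer[Double](1, 2, 3, 4, 5, 6, 7, 8, 9)
// the data array is consumed column-major, so transpose to get 1, 2, 3 as the top row
val M = new DenseMatrix(3, 3, numbersForMatrix.toArray).t
// 1.0  2.0  3.0
// 4.0  5.0  6.0
// 7.0  8.0  9.0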

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
For each field value in RDD 2, I want the row from RDD 1 with the max field value that is <= the RDD 2 value.
I am trying to approach this using Merge by:
Sorting the RDD by a key (the sort is within a group, and a group will have no more than 100 records; in the above example everything falls within one group)
Performing the merge operation similar to mergesort. Here I need to keep track of the previous value as well to find the max; still I traverse the list only once.
Since there are too many variables here I am getting a "Task not serializable" exception. Is this implementation approach correct? I am trying to avoid the Cartesian product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
  val miorec = itmorg.toList.sortBy(_(1).toString)
  for( r <- 0 to miorec.length) {
    for ( q <- 0 to rdd2.value.length) {
      if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
        org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
    }
  }
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal Scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling and to keep the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 and then aggregateByKey, but less efficient).
As for your exception, you would need to show the code; "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)
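To make the suggested mapPartitions + reduce approach concrete, here is a rough, untested sketch. It simplifies the data to (name, value) pairs, ignores the extra grouping columns, and the names rdd1, smallTable and bestPerKey are assumptions rather than anything from the question:
import org.apache.spark.rdd.RDD

// smallTable plays the role of the broadcast RDD 2: (key, limit) pairs.
def bestPerKey(rdd1: RDD[(String, Int)], smallTable: Seq[(String, Int)]): Map[String, (String, Int)] = {
  val perPartition = rdd1.mapPartitions { rows =>
    val acc = scala.collection.mutable.Map.empty[String, (String, Int)]
    rows.foreach { case (name, value) =>
      smallTable.foreach { case (key, limit) =>
        // keep the RDD 1 row with the largest value that is still <= limit
        if (value <= limit && acc.get(key).forall(_._2 < value)) acc(key) = (name, value)
      }
    }
    Iterator(acc.toMap)
  }
  // merge the per-partition maps, keeping the larger candidate for each key
  perPartition.reduce { (a, b) =>
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.get(k).toSeq ++ b.get(k).toSeq).maxBy(_._2)
    }.toMap
  }
}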

Scala mutable list - insert element in the middle

How can I implement inserting an element in the middle of a DoubleLinkedList without copying it?
In general I want to find some element in the collection and then insert a new one after it.
There is a method DoubleLinkedList.insert, but I'm not quite sure how it works. In the documentation it's described this way:
Insert linked list that at current position of this linked list
But what is the current position in the linked list? How can I set it?
I want to have O(1) insertion time.
To insert an element in the middle of a DoubleLinkedList you first need to find your 'middle' and then make the insertion:
val list = mutable.DoubleLinkedList(...)
list.next.next....insert(insertion)
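For instance, a small concrete sketch (Scala 2.11, where DoubleLinkedList is still available but deprecated), assuming insert splices the given list right after the node it is called on:
import scala.collection.mutable.DoubleLinkedList

val list = DoubleLinkedList(1, 2, 4, 5)
// walk to the node holding 2, then splice a one-element list after it
var node = list
while (node.nonEmpty && node.elem != 2) node = node.next
if (node.nonEmpty) node.insert(DoubleLinkedList(3))
// list is now DoubleLinkedList(1, 2, 3, 4, 5)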
First of all, you can't use insert to insert a single element, because it takes a DoubleLinkedList as a parameter:
def insert(that: DoubleLinkedList[A]): Unit
Instead, just use the indexWhere and update methods, for example replacing 30 with 3:
yourlist(yourlist.indexWhere(_ == 30)) = 3
Use span to separate the collection into the initial elements that satisfy a condition and the rest, starting from the first element that does not; e.g.
val (l,r) = DoubleLinkedList(1,2,3,4,5).span(_ != 3)
l = DoubleLinkedList(1, 2)
r = DoubleLinkedList(3, 4, 5)
Then replace value 3 with 33 like this
l ++ DoubleLinkedList(33) ++ r.tail
Note: in Scala 2.11.6, run the REPL with -deprecation to see a deprecation warning such as
warning: object DoubleLinkedList in package mutable is deprecated:
Low-level linked lists are deprecated.

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (computes prefix sums for an associative operator) function for Hadoop (scalding in particular; see below for how this is done).
Given a sequence of numbers in an HDFS file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
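Just to pin down the transformation itself (ignoring Hadoop for a moment), the pairwise sums can be written locally as:
val input = Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val pairSums = input.grouped(2).map(_.sum).toVector
// Vector(1, 5, 9, 13, 17, 10)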
I think in order to do this, I need to write InputFormat and InputSplit classes for Hadoop, but I don't know how to do this.
See section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    //calculate a new collapsed sequence which is the sum of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    //recursively scan the collapsed values
    val scancollapse = scanadd(collapsed)
    //now we can use the scan of the collapsed seq to calculate the full sequence
    val output = IndexedSeq.tabulate(input.length) { i =>
      //if an index is even then we can just look into the collapsed sequence and get the value
      //otherwise we can look just before it and add the value at the current index
      if (i % 2 == 0) scancollapse(i / 2)
      else scancollapse((i - 1) / 2) + input(i)
    }
    output
  }
I understand that this might need a fair bit of optimization for it to work nicely with Hadoop. Translating this directly I think would lead to pretty inefficient Hadoop code. For example, obviously in Hadoop you can't use an IndexedSeq. I would appreciate any specific problems you see. I think it can probably be made to work well, though.
Superfluous. You meant this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
  if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.