Scala: k-fold cross validation

I want to do k-fold cross validation. We are given a collection of data, allData. Suppose we partition the input into "k" groups and store them in groups.
The desired output is a trainAndTestDataList: List[(Iterable[T], Iterable[T])], where the List is of size "k". The "i"th element of trainAndTestDataList is a tuple (A, B), where A is the "i"th element of groups and B is all elements of groups except the "i"th one, concatenated.
Any ideas on implementing this efficiently?
val allData: Iterable[T] = ... // we get the data from somewhere
val groupSize = Math.ceil(allData.size.toDouble / k).toInt // toDouble avoids integer division before ceil
val groups = allData.grouped(groupSize).toList
val trainAndTestDataList = ... // fill out this part
One thing to keep in mind is that allData can be very long, whereas "k" is very small (say 5). So it is crucial to keep all the data vectors as Iterator (and not List, Seq, etc.).
Update: Here is how I did it (and I am not happy with it):
val trainAndTestDataList = {
(0 until k).map{ fold =>
val (a,b) = groups.zipWithIndex.partition{case (g, idx) => idx == fold}
(a.unzip._1.flatten.toIterable, b.unzip._1.flatten.toIterable)
}
}
Reasons I don't like it:
It is convoluted, especially after partition, where I do an unzip, then ._1, then flatten. I think one should be able to do better.
Although a is an Iterable[T], the output of a.unzip._1.flatten is a List[T], I think. This is no good, since the number of elements in this list might be very large.

You could try this operation:
implicit class TeeSplitOp[T](data: Iterable[T]) {
  def teeSplit(count: Int): Stream[(Iterable[T], Iterable[T])] = {
    val size = data.size
    // Start index of the i-th piece; piece(count) == size.
    def piece(i: Int) = i * size / count
    // One (test, train) pair per fold, so range over the folds rather than the elements.
    Stream.range(0, count) map { i =>
      val (prefix, rest) = data.splitAt(piece(i))
      val (test, postfix) = rest.splitAt(piece(i + 1) - piece(i))
      val train = prefix ++ postfix
      (test, train)
    }
  }
}
This split will be as lazy as splitAt and ++ are within your collection type.
You can try it with
(1 to 10).teeSplit(3).force
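To see where the fold boundaries fall, the index arithmetic from piece can be pulled out on its own. This is only a minimal sketch and not part of the answer above: foldBounds and kFold are hypothetical names, and the input is assumed to be an IndexedSeq so that slice, take and drop are cheap. Unlike teeSplit, it builds every fold eagerly.

```scala
// Hypothetical helper: (start, stop) index range of each fold, stop exclusive.
def foldBounds(size: Int, count: Int): Seq[(Int, Int)] =
  (0 until count).map(i => (i * size / count, (i + 1) * size / count))

// Hypothetical eager variant: one (test, train) pair per fold.
def kFold[T](data: IndexedSeq[T], count: Int): Seq[(Seq[T], Seq[T])] =
  foldBounds(data.size, count).map { case (start, stop) =>
    val test = data.slice(start, stop)
    val train = data.take(start) ++ data.drop(stop)
    (test, train)
  }

val bounds = foldBounds(10, 3) // (0,3), (3,6), (6,10)
val folds = kFold(1 to 10, 3)  // test sets: 1..3, 4..6, 7..10
```

With 10 elements and 3 folds the last fold absorbs the remainder, which is the same rounding behaviour as piece.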

I believe this should work. It also takes care of the randomization (don't neglect this!) in a reasonably efficient manner, i.e. O(n) instead of O(n log(n)) required for the more naive approach using a random shuffle/permutation of the data.
import scala.util.Random
def testTrainDataList[T](
  data: Seq[T],
  k: Int,
  seed: Long = System.currentTimeMillis()
): Seq[(Iterable[T], Iterable[T])] = {
  // Assigns each of the n elements a random fold index in [0, k), keeping fold sizes balanced.
  def createKeys(n: Int, k: Int) = {
    val groupSize = n / k
    val rem = n % k
    // Cumulative number of free slots up to and including each fold.
    val cumCounts = Array.tabulate(k) { i =>
      if (i < rem) (i + 1) * (groupSize + 1) else (i + 1) * groupSize + rem
    }
    val rng = new Random(seed)
    for (count <- n to 1 by -1) yield {
      val j = rng.nextInt(count)
      val i = cumCounts.iterator.zipWithIndex.find(_._1 > j).map(_._2).get
      for (s <- i until k) cumCounts(s) -= 1
      i // the fold index is the key for this element
    }
  }
  val keys = createKeys(data.length, k)
  for (i <- 0 until k) yield {
    val testIterable = new Iterable[T] {
      def iterator = (keys.iterator zip data.iterator).filter(_._1 == i).map(_._2)
    }
    val trainIterable = new Iterable[T] {
      def iterator = (keys.iterator zip data.iterator).filter(_._1 != i).map(_._2)
    }
    (testIterable, trainIterable)
  }
}
Note the way I define testIterable and trainIterable. This makes your test/train sets lazy and non-memoized, which I gathered is what you wanted.
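To illustrate what "lazy and non-memoized" means here, the following minimal sketch defines an Iterable through def iterator and counts how often a new iterator is requested. The names traversals and evens are made up for the illustration. Each traversal builds a fresh iterator, so nothing is cached between passes.

```scala
var traversals = 0 // counts how many times a new iterator was requested

val evens: Iterable[Int] = new Iterable[Int] {
  def iterator: Iterator[Int] = {
    traversals += 1
    Iterator.range(0, 10).filter(_ % 2 == 0)
  }
}

val firstPass = evens.toList
val afterFirst = traversals
val secondPass = evens.toList
// secondPass has the same elements as firstPass, but traversals has grown again:
// the filter was re-run instead of being read back from a stored collection.
```

The trade-off is that every pass over a fold repeats the filtering work, in exchange for never holding a second copy of the data in memory.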
Example usage:
val data = 'a' to 'z'
for (((testData, trainData), index) <- testTrainDataList(data, 4).zipWithIndex) {
println(s"index = $index")
println("test: " + testData.mkString(", "))
println("train: " + trainData.mkString(", "))
}
//index = 0
//test: i, l, o, q, v, w, y
//train: a, b, c, d, e, f, g, h, j, k, m, n, p, r, s, t, u, x, z
//
//index = 1
//test: a, d, e, h, n, r, z
//train: b, c, f, g, i, j, k, l, m, o, p, q, s, t, u, v, w, x, y
//
//index = 2
//test: b, c, m, t, u, x
//train: a, d, e, f, g, h, i, j, k, l, n, o, p, q, r, s, v, w, y, z
//
//index = 3
//test: f, g, j, k, p, s
//train: a, b, c, d, e, h, i, l, m, n, o, q, r, t, u, v, w, x, y, z
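For comparison, the shuffle-based approach mentioned at the start of this answer could look roughly like the sketch below. shuffledFolds is a hypothetical name, and this version is eager: it materializes the shuffled data and every fold, so it is only reasonable when the data fits comfortably in memory.

```scala
import scala.util.Random

// Hypothetical eager alternative: shuffle once, then deal elements round-robin into k folds.
def shuffledFolds[T](data: Seq[T], k: Int, seed: Long): Seq[(Seq[T], Seq[T])] = {
  val indexed = new Random(seed).shuffle(data).zipWithIndex
  (0 until k).map { fold =>
    val (test, train) = indexed.partition { case (_, idx) => idx % k == fold }
    (test.map(_._1), train.map(_._1))
  }
}

val shuffled = shuffledFolds('a' to 'z', 4, seed = 42L)
// Test folds have sizes 7, 7, 6, 6 and together cover every letter exactly once.
```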

Related

Matrix Vector multiplication in Scala

I have a matrix of size D by D (implemented as List[List[Int]]) and a vector of size D (implemented as List[Int]). Assuming D = 3, I can create the matrix and vector in the following way.
val Vector = List(1,2,3)
val Matrix = List(List(4,5,6) , List(7,8,9) , List(10,11,12))
I can multiply both these as
(Matrix,Vector).zipped.map((x,y) => (x,Vector).zipped.map(_*_).sum )
This code multiplies the matrix by the vector and returns a vector, as needed. Is there a faster or more optimal way to get the same result in Scala's functional style? In my scenario D is much bigger.
What about something like this?
def vectorDotProduct[N : Numeric](v1: List[N], v2: List[N]): N = {
  import Numeric.Implicits._
  // You may replace this with a while loop over two iterators if you require even more speed.
  @annotation.tailrec
  def loop(remaining1: List[N], remaining2: List[N], acc: N): N =
    (remaining1, remaining2) match {
      case (x :: tail1, y :: tail2) =>
        loop(
          remaining1 = tail1,
          remaining2 = tail2,
          acc + (x * y)
        )
      case (Nil, _) | (_, Nil) =>
        acc
    }
  loop(
    remaining1 = v1,
    remaining2 = v2,
    acc = implicitly[Numeric[N]].zero
  )
}
def matrixVectorProduct[N : Numeric](matrix: List[List[N]], vector: List[N]): List[N] =
  matrix.map(row => vectorDotProduct(vector, row))
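The while-loop variant hinted at in the comment could be sketched as follows. This is an Int-only illustration under my own naming (dotProduct and matVec are hypothetical), not a measured optimization; whether it is actually faster depends on the collection types and the JVM.

```scala
// Hypothetical Int-only variant using a while loop over two iterators.
def dotProduct(v1: Iterable[Int], v2: Iterable[Int]): Int = {
  val it1 = v1.iterator
  val it2 = v2.iterator
  var acc = 0
  // Stops at the end of the shorter input, like the pattern match above.
  while (it1.hasNext && it2.hasNext) acc += it1.next() * it2.next()
  acc
}

def matVec(matrix: Seq[Seq[Int]], vector: Seq[Int]): Seq[Int] =
  matrix.map(row => dotProduct(row, vector))

val result = matVec(List(List(4, 5, 6), List(7, 8, 9), List(10, 11, 12)), List(1, 2, 3))
// result == List(32, 50, 68)
```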

Extracting data out of a recursive function using tuples

I have a Scala function that does 2-3 recursive calls through its lifetime. I want to save the variable inside the second tuple in a list. Is there a smart way to do this?
Just passing the variable around would mean that I would have a List[String], when in actuality what I want is a List[List[String]].
Would I need a variable inside the function that is updated with each iteration?
def someRecursiveFunction(listOfWords: List[String]): List[List[String]] = {
val textSplitter = listOfWords.lastIndexOf("Some Word")
if (textSplitter != -1) {
val someTuple = listOfWords.splitAt(textSplitter)
val valueIwant = someTuple._2
someRecursiveFunction(someTuple._1)
}
List(someTuple._2,someTuple._2(1),someTuple._2(2)) // What I want back
}
Is there a way to extract the second tuple out of the recursive function so that I can use it further on in my program?
If the return type is fixed to be List[List[String]], the following changes need to be made to the code:
Because someTuple._2 is accessed as someTuple._2(2), there must be at least 3 strings in the someTuple._2 list.
The last expression must have the return type, i.e. List[List[String]]. Because someTuple._2(1) and someTuple._2(2) are just strings and not List[String], they have to be wrapped: List(someTuple._2, List(someTuple._2(1), someTuple._2(2))) has type List[List[String]].
The value of "Some Word" changes during the recursion, while making sure that someTuple._2.size is always >= 3.
As someTuple._2 is read after the recursion and changes during each recursive call, someTuple is declared as a var. It is declared outside the function so that all recursive calls share it; the REPL output below relies on this.
With this understanding of your requirement, the following code may be what you are looking for:
// Shared by all recursive calls, so the result reflects the last (deepest) split.
var someTuple: (List[String], List[String]) = (List(), List())
def someRecursiveFunction(listOfWords: List[String], sw: String): List[List[String]] = {
  val textSplitter = listOfWords.lastIndexOf(sw)
  if (textSplitter != -1 && listOfWords.size - 3 >= textSplitter) {
    someTuple = listOfWords.splitAt(textSplitter)
    println((someTuple._1, someTuple._2)) // for checking recursion
    if (someTuple._1.size >= 3)
      someRecursiveFunction(someTuple._1, someTuple._1(textSplitter - 3))
  }
  List(someTuple._2, List(someTuple._2(1), someTuple._2(2))) // What I want back
}
In Scala REPL:
scala> val list = List("a","b","c","x","y","z","k","j","g","Some Word","d","e","f","u","m","p")
list: List[String] = List(a, b, c, x, y, z, k, j, g, Some Word, d, e, f, u, m, p)
scala> someRecursiveFunction(list,"d")
(List(a, b, c, x, y, z, k, j, g, Some Word),List(d, e, f, u, m, p))
(List(a, b, c, x, y, z, k),List(j, g, Some Word))
(List(a, b, c, x),List(y, z, k))
(List(a),List(b, c, x))
res70: List[List[String]] = List(List(b, c, x), List(c, x))
scala> someRecursiveFunction(list,"Some Word")
(List(a, b, c, x, y, z, k, j, g),List(Some Word, d, e, f, u, m, p))
(List(a, b, c, x, y, z),List(k, j, g))
(List(a, b, c),List(x, y, z))
(List(),List(a, b, c))
res71: List[List[String]] = List(List(a, b, c), List(b, c))
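If the goal is simply to collect the second half of every split into a List[List[String]], one possible reading of the question, an accumulator parameter avoids the var altogether. This is a minimal sketch under that assumption; collectTails, loop and the marker "X" are hypothetical names and data, not taken from the original code.

```scala
// Hypothetical sketch: split at the last occurrence of marker, keep the tail,
// then repeat on what is left in front of it.
def collectTails(words: List[String], marker: String): List[List[String]] = {
  @annotation.tailrec
  def loop(remaining: List[String], acc: List[List[String]]): List[List[String]] = {
    val idx = remaining.lastIndexOf(marker)
    if (idx == -1) acc
    else {
      val (before, fromMarker) = remaining.splitAt(idx)
      loop(before, fromMarker :: acc) // the accumulator carries the collected tails
    }
  }
  loop(words, Nil)
}

val tails = collectTails(List("a", "X", "b", "c", "X", "d"), "X")
// tails == List(List("X", "b", "c"), List("X", "d"))
```

Because each step prepends to the accumulator while walking from the back of the list, the collected tails end up in their original left-to-right order.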

Scala loops with multiple conditions - what gets returned?

I'm going through Scala for the Impatient and came across an example of multi-condition loops that I can't understand.
Coming from a Java background, I read these loops as nested for loops. But why does the first return a collection and the second a String?
scala> for (i <- 0 to 1; c <- "Hello") yield (i + c).toChar
res11: scala.collection.immutable.IndexedSeq[Char] = Vector(H, e, l, l, o, I, f, m, m, p)
scala> for (c <- "Hello"; i <- 0 to 1) yield (i + c).toChar
res12: String = HIeflmlmop
For comprehensions are just syntactic sugar and are translated into invocations of map, flatMap and withFilter (and foreach if you don't use yield).
for {
i <- 0 to 1
c <- "Hello"
} yield (i + c).toChar
is equivalent to
(0 to 1).flatMap(i => "Hello".map(c => (i + c).toChar))
These transformers are defined so that they return the same type of collection they were called on, or the closest one. For example, here the Range becomes a Vector in the end, as you can't have a Range that contains arbitrary characters. Starting from a String, you can still get a String back.
In general you can think of it like this: the result type produced by a for comprehension will be the same as the type of the first generator (or the closest possible).
For example if you convert string into a Set
for {
c <- "Hello".toSet[Char]
i <- 0 to 1
} yield (i + c).toChar
you will get a Set back, and because it is a set it will not contain duplicates, so the result is different: Set(e, f, m, I, l, p, H, o)
How the type is determined involves the CanBuildFrom trait. You can read more about how it works here
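The translation described above can be checked directly. The sketch below is my own illustration (the value names are made up): it writes a comprehension with a guard next to its hand-desugared form, and then shows a single String generator producing a String.

```scala
// A comprehension with a guard and two generators ...
val viaFor: List[Int] =
  for {
    i <- List(1, 2, 3) if i % 2 == 1
    j <- List(10, 20)
  } yield i * j

// ... and the same thing written with withFilter, flatMap and map.
val viaCalls: List[Int] =
  List(1, 2, 3)
    .withFilter(i => i % 2 == 1)
    .flatMap(i => List(10, 20).map(j => i * j))
// both are List(10, 20, 30, 60)

// The first generator decides the result type: starting from a String gives a String.
val shifted: String = for (c <- "Hello") yield (c + 1).toChar // "Ifmmp"
```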
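Use the Scala 2.11.8 REPL to see the desugared form: type //print after the expression and press TAB. The <pressed TAB here> marker below shows where TAB was pressed and is not part of the input: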
scala> for (i <- 0 to 1; c <- "Hello") yield (i + c).toChar //print<pressed TAB here>
scala.Predef.intWrapper(0).to(1).flatMap[Char, scala.collection.immutable.IndexedSeq[Char]](((i: Int) =>
scala.Predef.augmentString(scala.Predef.augmentString("Hello").
map[Char, String](((c: Char) => i.+(c).toChar))(scala.Predef.StringCanBuildFrom))))(scala.collection.immutable.IndexedSeq.canBuildFrom[Char]) // : scala.collection.immutable.IndexedSeq[Char]
scala> for (i <- 0 to 1; c <- "Hello") yield (i + c).toChar //print
res4: scala.collection.immutable.IndexedSeq[Char] = Vector(H, e, l, l, o, I, f, m, m, p)
scala> for (c <- "Hello"; i <- 0 to 1) yield (i + c).toChar //print<pressed TAB here>
scala.Predef.augmentString("Hello").flatMap[Char, String](((c: Char) => scala.Predef.intWrapper(0).to(1).
map[Char, scala.collection.immutable.IndexedSeq[Char]](((i: Int) => i.+(c).toChar))(scala.collection.immutable.IndexedSeq.canBuildFrom[Char])))(scala.Predef.StringCanBuildFrom) // : String
scala> for (c <- "Hello"; i <- 0 to 1) yield (i + c).toChar //print
res5: String = HIeflmlmop
More readable output:
scala> (0 to 1).flatMap(i => "Hello".map(c => (i+c).toChar))
res14: scala.collection.immutable.IndexedSeq[Char] = Vector(H, e, l, l, o, I, f, m, m, p)
scala> "Hello".flatMap(c => (0 to 1).map(i => (i + c).toChar))
res15: String = HIeflmlmop

Scala - Combining two sequences to consecutively increasing triples

What is a nice and efficient functional way of solving the following problem? In imperative style, this can be done in linear time.
Given two sorted sequences p and q, f returns a sequence r (or any collection) of triples where for every triple (a,b,c) in r, the following hold:
(a < b < c)
One of the following two holds:
a,c are two consecutive elements of p, and b is in q
a,c are two consecutive elements of q, and b is in p
Example: Consider the following two sequences.
val p = Seq(1,4,5,7,8,9)
val q = Seq(2,3,6,7,8,10)
Then f(p,q) computes the following sequence:
Seq((1,2,4), (1,3,4), (5,6,7), (3,4,6), (3,5,6), (8,9,10))
Current solution: I do not find this one very elegant. I am looking for a better one.
def consecutiveTriplesOneWay(s1: Seq[Int], s2:Seq[Int]) = {
for {
i <- 0 until s1.size - 1 if s1(i) < s1(i+1)
j <- 0 until s2.size if s1(i) < s2(j) && s2(j) < s1(i+1)
} yield (s1(i), s2(j), s1(i+1))
}
def consecutiveTriples(s1: Seq[Int], s2:Seq[Int]) =
consecutiveTriplesOneWay(s1, s2) ++ consecutiveTriplesOneWay(s2, s1)
def main(args: Array[String]): Unit = {
val p = Seq(1,4,5,7,8,9)
val q = Seq(2,3,6,7,8,10)
consecutiveTriples(p, q).foreach(println(_))
}
Edit: My imperative solution
import scala.collection.mutable
def consecutiveTriplesOneWayImperative(s1: Seq[Int], s2:Seq[Int]) = {
var i = 0
var j = 0
val triples = mutable.MutableList.empty[(Int,Int,Int)]
while (i < s1.size - 1 && j < s2.size) {
if (s1(i) < s2(j) && s2(j) < s1(i + 1)) {
triples += ((s1(i), s2(j), s1(i + 1)))
j += 1
} else if (s1(i) >= s2(j))
j += 1
else
i += 1
}
triples.toSeq
}
def consecutiveTriples(s1: Seq[Int], s2:Seq[Int]) =
consecutiveTriplesOneWayImperative(s1,s2) ++
consecutiveTriplesOneWayImperative(s2,s1)
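The imperative solution translated to a tail-recursive function. A bit verbose, but it works (note that the :: patterns only match when the inputs are Lists):
import scala.annotation.tailrec
def consecutiveTriplesRec(s1: Seq[Int], s2: Seq[Int]) = {
@tailrec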
def consTriplesOneWay(left: Seq[Int], right: Seq[Int],
triples: Seq[(Int, Int, Int)]): Seq[(Int, Int, Int)] = {
(left, right) match {
case (l1 :: l2 :: ls, r :: rs) =>
if (l1 < r && r < l2) consTriplesOneWay(left, rs, (l1, r, l2) +: triples)
else if (l1 >= r) consTriplesOneWay(left, rs, triples)
else consTriplesOneWay(l2 :: ls, right, triples)
case _ => triples
}
}
consTriplesOneWay(s1, s2, Nil) ++ consTriplesOneWay(s2, s1, Nil)
}
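Another functional formulation, offered only as a sketch and assuming both inputs are sorted Lists, walks the consecutive pairs of one sequence with a fold while carrying the not-yet-consumed rest of the other. The names triplesOneWay and triples are hypothetical. Each element of s2 is dropped or consumed once, so the traversal itself is roughly linear, although appending to the Vector adds some overhead.

```scala
// Hypothetical sketch: for each consecutive pair (a, c) of s1, take the elements
// of s2 strictly between a and c, then continue with what is left of s2.
def triplesOneWay(s1: List[Int], s2: List[Int]): Vector[(Int, Int, Int)] = {
  val pairs = s1.zip(s1.drop(1))
  val (_, result) = pairs.foldLeft((s2, Vector.empty[(Int, Int, Int)])) {
    case ((rest, acc), (a, c)) =>
      val (inside, remaining) = rest.dropWhile(_ <= a).span(_ < c)
      (remaining, acc ++ inside.map(b => (a, b, c)))
  }
  result
}

def triples(p: List[Int], q: List[Int]): Vector[(Int, Int, Int)] =
  triplesOneWay(p, q) ++ triplesOneWay(q, p)

val r = triples(List(1, 4, 5, 7, 8, 9), List(2, 3, 6, 7, 8, 10))
// r == Vector((1,2,4), (1,3,4), (5,6,7), (3,4,6), (3,5,6), (8,9,10))
```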

How to compute inverse of a multi-map

I have a Scala Map:
x: [b,c]
y: [b,d,e]
z: [d,f,g,h]
I want inverse of this map for look-up.
b: [x,y]
c: [x]
d: [y,z] and so on.
Is there a way to do it without using intermediate mutable maps?
If it's not a multi-map, then the following works:
typeMap.flatMap { case (k, v) => v.map(vv => (vv, k))}
EDIT: fixed the answer to include what Marth rightfully pointed out. My answer is a bit more lengthy than his, as for educational purposes I try to go through each step and not use the magic provided by flatMap; his is more straightforward :)
I'm unsure about your notation. I assume that what you have is something like:
val myMap = Map[T, Set[T]] (
x -> Set(b, c),
y -> Set(b, d, e),
z -> Set(d, f, g, h)
)
You can achieve the reverse lookup as follows:
val instances = for {
keyValue <- myMap.toList
value <- keyValue._2
}
yield (value, keyValue._1)
At this point, your instances variable is a List of the type:
(b, x), (c, x), (b, y) ...
If you now do:
val groupedLookups = instances.groupBy(_._1)
You get:
b -> ((b, x), (b, y)),
c -> ((c, x)),
d -> ((d, y), (d, z)) ...
Now we want to reduce the values so that they only contain the second part of each pair. Therefore we do:
val reverseLookup = groupedLookups.map(kv => kv._1 -> kv._2.map(_._2))
Which means that for every pair we maintain the original key, but we map the list of arguments to something that only has the second value of the pair.
And there you have your result.
(You can also avoid assigning to an intermediate result, but I thought it was clearer like this)
Here is my simplification as a function:
def reverseMultimap[T1, T2](map: Map[T1, Seq[T2]]): Map[T2, Seq[T1]] =
map.toSeq
.flatMap { case (k, vs) => vs.map((_, k)) }
.groupBy(_._1)
.mapValues(_.map(_._2))
The above was derived from @Diego Martinoia's answer, corrected and reproduced below in function form:
def reverseMultimap[T1, T2](myMap: Map[T1, Seq[T2]]): Map[T2, Seq[T1]] = {
val instances = for {
keyValue <- myMap.toList
value <- keyValue._2
} yield (value, keyValue._1)
val groupedLookups = instances.groupBy(_._1)
val reverseLookup = groupedLookups.map(kv => kv._1 -> kv._2.map(_._2))
reverseLookup
}
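As a concrete check, here is a minimal self-contained sketch that applies the same flatMap/groupBy idea to the map from the question, using String keys and values as an assumption about the element types. The names multimap and inverse are made up, and the values are sorted only to make the result deterministic.

```scala
val multimap: Map[String, Seq[String]] = Map(
  "x" -> Seq("b", "c"),
  "y" -> Seq("b", "d", "e"),
  "z" -> Seq("d", "f", "g", "h")
)

val inverse: Map[String, Seq[String]] =
  multimap.toSeq
    .flatMap { case (k, vs) => vs.map(v => v -> k) } // (value, key) pairs
    .groupBy(_._1)                                   // value -> pairs sharing that value
    .map { case (v, pairs) => v -> pairs.map(_._2).sorted }

// inverse("b") == Seq("x", "y"), inverse("c") == Seq("x"), inverse("d") == Seq("y", "z")
```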