Transforming from native matrix format, Scalding - Scala

This question is related to the question Transforming matrix format, scalding. But now I want to perform the reverse operation. I can do it like this:
Tsv(in, ('row, 'col, 'v))
  .read
  .groupBy('row) { _.sortBy('col).mkString('v, "\t") }
  .mapTo(('row, 'v) -> ('c)) { res: (Long, String) =>
    val (row, v) = res
    v
  }
  .write(Tsv(out))
But there we have a problem with zeros: as we know, Scalding skips fields with zero values. So, for example, given the matrix:
1 0 8
4 5 6
0 8 9
In Scalding's sparse format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we can only get:
1 8
4 5 6
8 9
And that's incorrect. So how can I deal with it? I see two possible options:
Find a way to add the zeros back (though I don't know how to insert the data).
Write my own operations on my own matrix format (undesirable, since I'm interested in Scalding's matrix operations and don't want to write them all myself).
Maybe there are some methods that let me avoid skipping zeros in the matrix?

Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type

val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)

rows.crossWithTiny(cols)
  .leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
  .map('val1 -> 'val1) { v: V =>
    if (v == null) zero // null from the left join means a missing cell, i.e. 0 in your type
    else v
  }
  .groupBy('row) {
    _.toList[(Int, V)](('col, 'val1) -> 'cols)
  }
  .map('cols -> 'cols) { cols: List[(Int, V)] =>
    cols.sortBy(_._1).map(_._2).mkString("\t")
  }
  .write(TypedTsv[(Int, String)]("output"))
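For intuition, here is a minimal plain-Scala sketch of the same densify idea (a hypothetical helper, not Scalding API, and it only works when the dense matrix fits in memory): enumerate every (row, col) cell and fall back to zero when the sparse data has no entry.

```scala
object Densify {
  // Build tab-separated dense rows from sparse (row, col) -> value triples.
  // Cells absent from the sparse map default to 0, mirroring the left join
  // plus null-to-zero map in the Scalding job above.
  def densify(sparse: Map[(Int, Int), Int], rows: Int, cols: Int): Seq[String] =
    (1 to rows).map { r =>
      (1 to cols).map(c => sparse.getOrElse((r, c), 0)).mkString("\t")
    }

  def main(args: Array[String]): Unit = {
    // the sparse triples from the question's example matrix
    val sparse = Map((1, 1) -> 1, (1, 3) -> 8, (2, 1) -> 4, (2, 2) -> 5,
                     (2, 3) -> 6, (3, 2) -> 8, (3, 3) -> 9)
    densify(sparse, 3, 3).foreach(println)
    // 1	0	8
    // 4	5	6
    // 0	8	9
  }
}
```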


How to make stateful API's pure

I'm learning functional programming, partially by reading the book Functional Programming in Scala (a.k.a. The Red Book), and I've run into my first real blocker. In Chapter 6, the book uses the example of a random number generator to illustrate how to change state by using side effects. Then the book goes on to demonstrate the patterns you typically encounter, as well as some of the tangents you might take, in making a functional stateful API. My problem comes when trying to understand the following code:
type Rand[+A] = RNG => (A, RNG)

def map[A, B](s: Rand[A])(f: A => B): Rand[B] =
  rng => {
    val (nxt, nxtrng) = s(rng)
    (f(nxt), nxtrng)
  }

def nonNegativeLessThan(n: Int): Rand[Int] =
  map(nonNegativeIntv2) { i =>
    val mod = i % n
    if (i + (n - 1) - mod >= 0) mod else nonNegativeLessThan(n)(???)
  }
I hope this is enough context to get an idea of what the code does; it comes directly from the book, page 86. How does the if statement in the method nonNegativeLessThan filter out values of i that are greater than the largest multiple of n in a 32-bit integer? Doesn't the recursive call in the else clause return a value of type Rand[Int]? In general, I'm confused about what's going on in the code in bold. This is a pretty good book, so I'm happy with how things are going so far, and I feel I've learned a lot about functional programming. My apologies if this question is ill-formed or if there is an issue with the formatting; this is my first post on Stack Overflow! Thank you to those who take a look at this, and I hope it helps someone who has the same problem.
How does the if statement in the method nonNegativeLessThan filter out values of i that are greater than the largest multiple of n in a 32-bit integer?
If i is greater than the largest multiple of n, then i + (n - 1) - mod will overflow and yield a negative number. The subsequent >= 0 is then false.
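A quick plain-Scala sketch of that overflow check (n = 10 is an arbitrary choice for illustration):

```scala
object OverflowCheck {
  def main(args: Array[String]): Unit = {
    val n = 10
    // i close to Int.MaxValue: i + (n - 1) wraps around to a negative Int
    val i = Int.MaxValue - 3
    println(i + (n - 1) - (i % n) >= 0) // false: the sum overflowed
    // an i well below the largest multiple of n passes the check
    val j = 12345
    println(j + (n - 1) - (j % n) >= 0) // true
  }
}
```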
Doesn't the recursive call in the else clause return a value of type Rand[Int]?
Well, nonNegativeLessThan(n) is indeed of type Rand[Int]. However, it says nonNegativeLessThan(n)(???): it applies nonNegativeLessThan to n, and then applies the resulting value (of type Rand[Int], which is a function type) to ???, and that yields an Int. Or rather, it would do that if ??? ever yielded a real value, which it doesn't.
The problem here is that you would have to pass the state of the RNG instead of ???, but the map function doesn't let you access that state. You'll need flatMap to solve this – presumably that's what the book is going to discuss next.
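For reference, here is a self-contained sketch of the flatMap-based version; RNG, SimpleRNG, nonNegativeInt and unit follow the book's Chapter 6 definitions, reproduced here so the example runs on its own:

```scala
object RandDemo {
  trait RNG { def nextInt: (Int, RNG) }

  // the book's linear congruential generator
  case class SimpleRNG(seed: Long) extends RNG {
    def nextInt: (Int, RNG) = {
      val newSeed = (seed * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL
      ((newSeed >>> 16).toInt, SimpleRNG(newSeed))
    }
  }

  type Rand[+A] = RNG => (A, RNG)

  def unit[A](a: A): Rand[A] = rng => (a, rng)

  def nonNegativeInt: Rand[Int] = rng => {
    val (i, r) = rng.nextInt
    (if (i < 0) -(i + 1) else i, r)
  }

  // flatMap threads the RNG state into the continuation --
  // exactly what map could not do
  def flatMap[A, B](f: Rand[A])(g: A => Rand[B]): Rand[B] =
    rng => {
      val (a, rng2) = f(rng)
      g(a)(rng2)
    }

  def nonNegativeLessThan(n: Int): Rand[Int] =
    flatMap(nonNegativeInt) { i =>
      val mod = i % n
      // on overflow, retry with the *new* RNG state instead of ???
      if (i + (n - 1) - mod >= 0) unit(mod) else nonNegativeLessThan(n)
    }

  def main(args: Array[String]): Unit = {
    val (v, _) = nonNegativeLessThan(6)(SimpleRNG(42))
    println(v) // some value in [0, 6)
    assert(v >= 0 && v < 6)
  }
}
```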
Imagine that you have a new domain, a set of integers, for example 0 to 11. The Rand type represents a random number in that domain, but this type must not be skewed. That means there must not be numbers that have a greater probability of being generated than others.
If you want to generate a number that is non-negative and less than four, you can use the map function of the Rand type, which allows you to transform the result of a state action. First, it generates a non-negative number and transforms the result, via map, by applying the anonymous function _ % n to obtain a number less than n.
def nonNegativeLessThan(n: Int): Rand[Int] =
  map(nonNegativeInt) { _ % n }
It uses the modulo operation:
scala> 1 % 4
res0: Int = 1
scala> 2 % 4
res1: Int = 2
scala> 3 % 4
res2: Int = 3
scala> 4 % 4
res3: Int = 0
scala> 5 % 4
res4: Int = 1
scala> 6 % 4
res5: Int = 2
scala> 7 % 4
res6: Int = 3
scala> 8 % 4
res7: Int = 0
scala> 9 % 4
res8: Int = 1
scala> 10 % 4
res9: Int = 2
As you can see, the largest multiple of 4 in this domain is 8, so if the non-negative number that is generated is 9 or 10, we have a problem: the probability of getting a 1 or 2 is greater than that of getting a 3 or a 0. For that reason, the other implementation checks that the number first generated by nonNegativeInt is not greater than the largest multiple of n in the specific domain (the Int32 domain in the book), in order to have an unbiased generator.
And yes, that book is amazing.

Scala: mixing view and strict collection in a for expression

This piece of Scala code mixes a view with a strict List in a for expression:
val list = List.range(1, 4)

def compute(n: Int) = {
  println("Computing " + n)
  n * 2
}

val view = for (n <- list.view; k <- List(1, 2)) yield compute(n)
val x = view(0)
The output is:
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
Computing 1
Computing 1
I expected that it should just produce the last 2 lines of "Computing 1" in the output. Why would it compute all the values eagerly? And why did it then recompute the values again?
Arguably, access by index forces the view to be computed. Also, notice that you flatMap the view with something which is not lazy (the inner List(1,2) is not a view).
Compare the following:
// 0) Your example
val v0 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v0(0) // Computing 1
// Computing 1
// Computing 2
// Computing 2
// Computing 3
// Computing 3
// Computing 1
// Computing 1
v0(0) // Computing 1
// Computing 1
// 1) Your example, but access by head and not by index
val v1 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v1.head // Computing 1
// Computing 1
// 2) Do not mix views and strict lists
val v2 = List.range(1, 4).view.flatMap(n => List(1,2).view.map(k => compute(n)))
v2(0) // Computing 1
Regarding example 0, notice that views are not like streams; while streams do cache their results, lazy views do not (they just compute lazily, i.e., by-need, on access). It seems that indexed-access requires computing the entire list, and then another computation is needed to actually access the element by index.
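The no-caching behavior of views can be demonstrated in a few lines (Scala 2.13, using LazyList, the memoizing replacement for Stream, as the contrast):

```scala
object ViewVsLazyList {
  // force the first element of a view twice; the mapped function runs twice
  def viewCount(): Int = {
    var count = 0
    val v = List(1, 2, 3).view.map { n => count += 1; n * 2 }
    v.head; v.head
    count
  }

  // LazyList memoizes: the second head hits the cached element
  def lazyListCount(): Int = {
    var count = 0
    val ll = LazyList(1, 2, 3).map { n => count += 1; n * 2 }
    ll.head; ll.head
    count
  }

  def main(args: Array[String]): Unit = {
    println(viewCount())     // 2: the view recomputes on every access
    println(lazyListCount()) // 1: LazyList caches its results
  }
}
```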
You may ask why indexed access in example 2 does not compute the entire list. This requires an understanding of how things work underneath; in particular, we can see the difference in the method calls between example 0 and example 2 in the following excerpts:
Example 0
java.lang.Exception scala.collection.SeqViewLike$FlatMapped.$anonfun$index$1(SeqViewLike.scala:75)
at scala.collection.SeqViewLike$FlatMapped.index(SeqViewLike.scala:74)
at scala.collection.SeqViewLike$FlatMapped.index$(SeqViewLike.scala:71)
at scala.collection.SeqViewLike$$anon$5.index$lzycompute(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$$anon$5.index(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.length(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$FlatMapped.length$(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$$anon$5.length(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:86)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
Example 2
java.lang.Exception scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:12)
at scala.collection.SeqViewLike$Mapped.apply(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$Mapped.apply$(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$$anon$4.apply(SeqViewLike.scala:196)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:88)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
In particular, you see that example 0 results in a call of FlatMapped.length (which needs to evaluate the entire list).
view here is a SeqView[Int,Seq[_]], which is immutable and recomputes every item when iterated over.
You could access just the first by explicitly using the .iterator:
# view.iterator.next
Computing 1
Computing 1
res11: Int = 2
Or explicitly make it a List (eg. if you need to reuse many entries):
# val view2: List[Int] = view.toList
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
view2: List[Int] = List(2, 2, 4, 4, 6, 6)
# view2(0)
res13: Int = 2

Unexpected behavior inside the foreachPartition method of a RDD

I evaluated the following lines of Scala code in the spark-shell:
val a = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val b = a.coalesce(1)
b.foreachPartition { p =>
  p.map(_ + 1).foreach(println)
  p.map(_ * 2).foreach(println)
}
The output is the following:
2
3
4
5
6
7
8
9
10
11
Why does the partition p become empty after the first map?
It does not look strange to me: since p is an Iterator, once you walk through it with map, it has no more values. And taking into account that length is a shortcut for size, which is implemented like this:
def size: Int = {
  var result = 0
  for (x <- self) result += 1
  result
}
you get 0.
The answer is in the Scala doc: http://www.scala-lang.org/api/2.11.8/#scala.collection.Iterator. It explicitly states that an iterator (p is an iterator) must be discarded after calling the map method on it.
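The same single-pass behavior can be reproduced with a plain Scala Iterator, without Spark; the usual fix is to materialize the partition once (e.g. with toList) before traversing it twice:

```scala
object IteratorOnce {
  def main(args: Array[String]): Unit = {
    val p = Iterator(1, 2, 3)
    println(p.map(_ + 1).toList) // List(2, 3, 4)
    println(p.map(_ * 2).toList) // List(): the iterator is already exhausted

    // Fix: materialize once, then reuse the strict collection
    val items = Iterator(1, 2, 3).toList
    println(items.map(_ + 1)) // List(2, 3, 4)
    println(items.map(_ * 2)) // List(2, 4, 6)
  }
}
```

Inside foreachPartition this would mean calling p.toList first, at the cost of holding the whole partition in memory.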

Scala comprehension from input

I am new to Scala and I am having trouble constructing a Map from input.
Here is my problem:
I am getting input with elevator information. It consists of n lines, each containing the elevatorFloor number and the elevatorPosition on the floor.
Example:
0 5
1 3
4 5
So here I have 3 elevators: the first one is on floor 0 at position 5, the second one on floor 1 at position 3, etc.
Is there a way in Scala to put it in a Map without using var ?
What I get so far is a Vector of all the elevators' information :
val elevators = {
  for {
    i <- 0 until n
    j <- readLine split " "
  } yield j.toInt
}
I would like to be able to split each line into two variables, elevatorFloor and elevatorPos, and group them in a data structure (my guess is that a Map would be the appropriate choice). I would like to get something looking like:
elevators: SomeDataStructure[Int,Int] = ( 0->5, 1 -> 3, 4 -> 5)
I would like to clarify that I know I could write Java-ish code, initialise a Map and then add the values to it, but I am trying to stay as close to functional programming as possible.
Thanks for the help or comments
You can do:
import scala.io.Source

val res: Map[Int, Int] =
  Source.fromFile("myfile.txt")
    .getLines
    .map { line =>
      val Array(floor, position) = line.split(' ')
      floor.toInt -> position.toInt
    }.toMap
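A self-contained variant (Scala 2.13) that parses from a string instead of a file, so it is easy to run; linesIterator stands in for the readLine loop in the question:

```scala
object ElevatorMap {
  // parse "floor position" lines into a Map[floor -> position]
  def parse(input: String): Map[Int, Int] =
    input.linesIterator.map { line =>
      // pattern-match the two whitespace-separated fields
      val Array(floor, pos) = line.split(' ')
      floor.toInt -> pos.toInt
    }.toMap

  def main(args: Array[String]): Unit = {
    println(parse("0 5\n1 3\n4 5")) // Map(0 -> 5, 1 -> 3, 4 -> 5)
  }
}
```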

Using Scala Breeze to do numPy style broadcasting

Is there a generic way using Breeze to achieve what you can do using broadcasting in NumPy?
Specifically, if I have an operator I'd like to apply to two 3x4 matrices, I can apply that operation element-wise. However, what I have is a 3x4 matrix and a 3-element column vector. I'd like a function which produces a 3x4 matrix created from applying the operator to each element of the matrix with the element from the vector for the corresponding row.
So for a division:
2 4 6   /   2   =   1 2 3
3 6 9       3       1 2 3
If this isn't available, I'd be willing to look at implementing it.
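As a sketch of what such an implementation would do, here is row-wise broadcasting in plain Scala collections (a hypothetical helper, not Breeze API):

```scala
object Broadcast {
  // Apply op element-wise across each row, pairing row i of the matrix
  // with element i of the vector (NumPy-style broadcasting along rows).
  def broadcastRows(m: Vector[Vector[Double]], v: Vector[Double])
                   (op: (Double, Double) => Double): Vector[Vector[Double]] =
    m.zip(v).map { case (row, x) => row.map(op(_, x)) }

  def main(args: Array[String]): Unit = {
    val m = Vector(Vector(2.0, 4.0, 6.0), Vector(3.0, 6.0, 9.0))
    val v = Vector(2.0, 3.0)
    println(broadcastRows(m, v)(_ / _))
    // Vector(Vector(1.0, 2.0, 3.0), Vector(1.0, 2.0, 3.0))
  }
}
```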
You can use mapPairs to achieve what I 'think' you're looking for:
val adder = DenseVector(1, 2, 3, 4)
val result = DenseMatrix.zeros[Int](3, 4).mapPairs {
  case ((row, col), value) => value + adder(col)
}
println(result)
1 2 3 4
1 2 3 4
1 2 3 4
I'm sure you can adapt what you want from the simple 'adder' above.
Breeze now supports broadcasting of this sort:
scala> val dm = DenseMatrix( (2, 4, 6), (3, 6, 9) )
dm: breeze.linalg.DenseMatrix[Int] =
2 4 6
3 6 9
scala> val dv = DenseVector(2,3)
dv: breeze.linalg.DenseVector[Int] = DenseVector(2, 3)
scala> dm(::, *) :/ dv
res4: breeze.linalg.DenseMatrix[Int] =
1 2 3
1 2 3
The * operator says which axis to broadcast along. Breeze doesn't allow implicit broadcasting, except for scalar types.