Why do mapped pairs get obliterated? - scala

I'm trying to understand the example here which computes Jaccard similarity between pairs of vectors in a matrix.
val aBinary = adjacencyMatrix.binarizeAs[Double]
// intersectMat holds the size of the intersection of row(a)_i ∩ row(b)_j
val intersectMat = aBinary * aBinary.transpose
val aSumVct = aBinary.sumColVectors
val bSumVct = aBinary.sumRowVectors
//Using zip to repeat the row and column vectors values on the right hand
//for all non-zeroes on the left hand matrix
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
val yMat = intersectMat.zip(bSumVct).mapValues( pair => pair._2 )
Why does the last comment mention non-zero values? As far as I'm aware, ._2 selects the second element of a pair independently of the first element. At what point are (0, x) pairs obliterated?

Yeah, I don't know anything about scalding, but this seems odd. If you look at the zip implementation, it specifically mentions that it does an outer join to preserve zeros on either side. So the comment does not seem to describe how zeroes are actually treated in matrix.zip.
Besides, looking at the dimensions returned by zip, it really seems this line just replicates the aSumVct column vector across each column:
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
Also I find the val bSumVct = aBinary.sumRowVectors suspicious, because it sums the matrix along the wrong dimension. It feels like something like this would be better:
val bSumVct = aBinary.transpose.sumRowVectors
which would conceptually be the same as aSumVct.transpose. At the end of the day, cell (i, j) of xMat + yMat then holds the sum of the elements of row(i) plus the sum of the elements of row(j), and we subtract intersectMat to adjust for the double counting.
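For intuition only, here is a minimal plain-Scala sketch (not scalding; the jaccard helper name is just illustrative) of the quantity those matrices assemble: with binary row vectors, the similarity is intersection / (|row(i)| + |row(j)| - intersection), which is where xMat + yMat - intersectMat comes in.
// Plain-Scala sketch of Jaccard similarity between two binary vectors.
def jaccard(a: Seq[Int], b: Seq[Int]): Double = {
  val intersect = a.zip(b).count { case (x, y) => x == 1 && y == 1 }
  val union = a.count(_ == 1) + b.count(_ == 1) - intersect
  if (union == 0) 0.0 else intersect.toDouble / union
}
jaccard(Seq(1, 1, 0, 1), Seq(1, 0, 0, 1)) // 2.0 / (3 + 2 - 2) = 0.666...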
Edit: a little bit of googling unearthed this blog post: http://www.flavianv.me/post-15.htm. It seems the comments were related to that version where the vectors to compare are in two separate matrices that don't necessarily have the same size.

Related

How to accumulate 2D array elements in Scala?

I have a 2D array:
Array(Array(1,1,0), Array(1,0,1))
and I would like to accumulate values over columns, so my final output looks like
Array(Array(1,1,0), Array(2,1,1))
If this were a 1D array, I could simply use scan, but I'm having trouble using scan with a 2D array.
Can anyone help with this issue?
Here's one way to do it:
val t = Array(Array(1,1,0), Array(1,0,1))
val result = t.scanLeft(Array.fill(t(0).length)(0))((x, y) =>
  x.zip(y).map(e => e._1 + e._2)).drop(1)
//to see the results
result.foreach(e => println(e.toList))
gives:
List(1, 1, 0)
List(2, 1, 1)
The idea is to create an array filled with zeros (using Array.fill) and then scan the 2D array using that as an accumulator. In the end, drop(1) gets rid of the zero-filled array.
EDIT:
In response to the comment, this solution works for a matrix of any size. The combination of zip and map takes care of the element-wise addition.
EDIT 2 (Step by step explanation):
You already know about scan for a one-dimensional array. The idea here is essentially the same.
We initialize the accumulator with zero; in this case, "zero" means an array of zeros, which Array.fill creates.
Instead of a single addition, we need to add arrays element-wise. This is what the combination of zip and map does. There are a lot of examples available on the Internet about how these methods work.
Finally, we drop the zero element using Scala's drop(1). The result is an array of arrays containing accumulated values.
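If it helps, here is a hedged generalization of the same idea wrapped in a small helper (the name columnScan is just illustrative); it works for any number of columns as long as all rows have the same length:
// Column-wise running sums for an arbitrary-width matrix.
def columnScan(m: Array[Array[Int]]): Array[Array[Int]] =
  m.scanLeft(Array.fill(m.head.length)(0)) { (acc, row) =>
    acc.zip(row).map { case (a, b) => a + b } // element-wise addition
  }.drop(1)                                   // drop the all-zeros seed
columnScan(Array(Array(1, 1, 0), Array(1, 0, 1))).foreach(r => println(r.toList))
// List(1, 1, 0)
// List(2, 1, 1)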
I would solve it like this: for a given row r, sum all rows up to and including r.
val accumulatedMatrix =
  for (row <- array.indices)
    yield array.take(row + 1).foldLeft(Array(0, 0, 0)) {
      case (a, b) => Array(a(0) + b(0), a(1) + b(1), a(2) + b(2))
    }
input:
val array = Array(
Array(1,1,0),
Array(1,0,1)
)
output:
1,1,0
2,1,1
Instead of summing all the previous rows repeatedly, you could also improve it by memoizing the running sums.
Pretty much the same approach as a 1D array:
a.tail.scan(a.head)((acc, value) =>
Array(acc(0) + value(0), acc(1) + value(1), acc(2) + value(2))
)
For an arbitrary number of columns (as long as all rows have the same length):
a.tail.scan(a.head)((acc, value) =>
acc zip value map {case (a,b) => a+b}
)

How do I chain the custom logic using Scala?

I am trying to calculate the area of a polygon using Scala.
Reference for the area of a polygon: https://www.mathopenref.com/coordpolygonarea.html
I was able to write working code, but my requirement is that the last for loop be chained onto the points variable like the other map calls. A plain map does not work, because the output list has N-1 elements.
val lines = io.Source.stdin.getLines()
val nPoints = lines.next.toInt
var s = 0
val points = lines.take(nPoints).toList
  .map(_.split(" "))
  .map { case Array(e1, e2) => (e1.toInt, e2.toInt) }
for (i <- 0 until points.length - 1) { // want this for loop to be chained within the points variable
  s = s + (points(i)._1 * points(i + 1)._2) - (points(i)._2 * points(i + 1)._1)
}
println(scala.math.abs(s / 2.0))
You're looking for a sliding and a foldLeft, I think. sliding(2) will give you lists List(i, i+1), List(i+1, i+2),... and then you can use foldLeft to do the computation:
val s = points.sliding(2).foldLeft(0) {
  case (acc, (p1, p2) :: (q1, q2) :: Nil) => acc + p1 * q2 - p2 * q1
}
If I read the formula correctly, there's one additional term coming from the first and last vertices (which looks like it's missing from your implementation, by the way). You could either add that separately or repeat the first vertex at the end of your list. As a result, combining everything into the same line that defines points doesn't work so easily. In any case, I think it is more readable split into separate statements: one defining points, another repeating the first element of points at the end of the list, and then the fold.
Edited to account for the fact that your points are tuples, not lists. I'm also tacitly assuming points is a List.
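For what it's worth, a hedged sketch of the approach described above (untested, names illustrative):
// Close the polygon by repeating the first vertex, then fold over sliding pairs.
val closed = points :+ points.head
val area = math.abs(
  closed.sliding(2).foldLeft(0) {
    case (acc, (p1, p2) :: (q1, q2) :: Nil) => acc + p1 * q2 - p2 * q1
    case (acc, _)                           => acc // defensive catch-all
  }
) / 2.0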
How about something like this (untested):
val points2 = points.tail :+ points.head // shift the list one to the left
val area = ((points zip points2)
  .map { case (p1, p2) => (p1._1 * p2._2) - (p2._1 * p1._2) }
  .fold(0)(_ + _)) / 2.0 // take math.abs of the numerator if the vertex order is unknown
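A quick sanity check of this version with an axis-aligned square (just an illustrative input):
val square = List((0, 0), (4, 0), (4, 4), (0, 4)) // counter-clockwise
val squareShifted = square.tail :+ square.head
val squareArea = ((square zip squareShifted)
  .map { case (p1, p2) => (p1._1 * p2._2) - (p2._1 * p1._2) }
  .fold(0)(_ + _)) / 2.0
// squareArea == 16.0 (take math.abs for clockwise vertex order)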

Spark approach for equation involving pairwise distance

I have a piece of pseudocode that I've been trying to implement in Spark (currently using Scala, but happy to use another language if needed) for about a week, and I'm utterly stuck. The section of pseudocode that's causing me issues is shown in the image below (apologies for the image, but without MathOverflow's LaTeX option it seemed the clearest): [pseudocode image]
Each row contains an id1, id2, ob, x, and y.
I'm using Window to partition by (id1, id2), with each window containing multiple (x: Integer, y: Integer, ob: Double) rows, which constitute a cell, or c.
The loop is for k = 1 ... m with m being the number of rows in the window.
Order of the rows does not matter for my purposes (the st value will be affected, but past work suggests it doesn't have an observable difference in the final result).
All previously calculated rows with st > 0 are part of K. Since st >= 0, it seems safe to include all previously calculated rows.
alpha is a fixed parameter.
Dis_grid is currently a Euclidean distance UDF between the x and y coordinates by row, but it can be a different distance measure if that would make implementation easier.
I'm not able to figure out how to:
Assign the first row in a window a distinct value.
Use the st of previously calculated rows to calculate the st for the next row.
Calculate individual pairwise distances between rows to be part of the st formula.
Any assistance is greatly appreciated.
I am having a hard time following precisely what you're trying to do (I think order does matter within a group/window), so this may not be right, but these are my thoughts. First off, I think this will be a lot easier to do thinking in terms of Dataset operations rather than DataFrame operations. (If you want to do this in the DataFrame world, I think you want to look at User Defined Aggregation Functions (UDAFs).) Since a DataFrame is nothing but Dataset[Row], it's all equivalent, but I'm going to define a Cell case class and work with a Dataset[Cell] to simplify things, so please indulge me.
case class Cell(id1: Int, id2: Int, ob: Double, x: Int, y: Int)
Imagine we have a Dataset[Cell], call it ds. We want to group it by pairs (id1, id2), then apply some function (call it f) on each group, and end up with a Dataset[(Cell, Double)]. (I think this is what you want. I'm having trouble seeing how the order of rows/cells within a group does not matter in this case.) This amounts to
ds.groupByKey(c => (c.id1, c.id2)).flatMapGroups(f)
So what's f? It needs to compute st for each cell in the group. If I'm understanding properly, the summation works like this: for cell i, sum st(j) * alpha^dist(i, j) for all j < i. Subtract that sum from the cell's observation and max with 0 to get st. Assuming you have a function dist to compute the distance between two cells, then, as a function of all the cells, the previously computed sts, and the cell c, the summation term works out to be:
def stSum(cells: List[Cell], sts: List[Double], c: Cell): Double =
  sts.zipWithIndex.map { case (st, l) =>
    st * math.pow(alpha, dist(c, cells(l)))
  }.sum
Then for a cell c, st is math.max(0, c.ob - stSum(cells, sts, c)), where cells is all the cells in the group, and sts is a list of the st values for the "earlier" cells. We can then compute the (ordered) list of sts with a foldLeft:
cells.foldLeft(List.empty[Double]) { case (sts, c) =>
  sts :+ math.max(0, c.ob - stSum(cells, sts, c))
}
We can then assemble f:
def f(key: (Int, Int), cellIterator: Iterator[Cell]): List[(Cell, Double)] = {
  val cells = cellIterator.toList
  val sts = cells.foldLeft(List.empty[Double]) { case (stAcc, c) =>
    stAcc :+ math.max(0, c.ob - stSum(cells, stAcc, c))
  }
  cells.zip(sts)
}
We need to convert the cell iterator to a list because we need to access its elements by index, so we need a type that supports that. It's generally faster in Scala to prepend to a list rather than append, and it would be possible to build up the list of sts in reverse, but it'd require some index juggling. One final note: since groupByKey keys each group by the tuple (id1, id2), the f above already has the shape flatMapGroups expects, so you'd end up with
ds.groupByKey(c => (c.id1, c.id2)).flatMapGroups(f _)
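To sanity-check the fold logic locally (outside Spark), here's a hedged sketch; alpha and dist are assumptions here, with alpha an arbitrary parameter value and dist plain Euclidean distance on (x, y):
// Local, non-Spark check of the foldLeft above, reusing the same Cell and stSum.
val alpha = 0.5 // illustrative value only
def dist(a: Cell, b: Cell): Double =
  math.sqrt(math.pow(a.x - b.x, 2) + math.pow(a.y - b.y, 2))
val group = List(
  Cell(1, 1, ob = 2.0, x = 0, y = 0),
  Cell(1, 1, ob = 1.0, x = 1, y = 0),
  Cell(1, 1, ob = 3.0, x = 0, y = 1)
)
val sts = group.foldLeft(List.empty[Double]) { case (acc, c) =>
  acc :+ math.max(0, c.ob - stSum(group, acc, c))
}
// sts.head == 2.0: the first cell has no earlier cells, so nothing is subtracted.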

identify sequence of numbers in Scala

I was wondering what the logic would be, in Scala, to identify runs of consecutive numbers in a list and keep only the max of each run. For example, if
val x = List(1,2,3,8,15,26)
Then the output of the function should be
val y = List(3,8,15,26), where 3 is the max of 1,2,3, which is a run of consecutive numbers. 8, 15, 26 are not consecutive and hence those numbers are unaltered.
The position does not matter, meaning, I can sort the list and then identify the sequences.
How to approach this problem?
After x is sorted you could do something like this.
(x :+ x.last).sliding(2).filter(p => p(0) != p(1)-1).map(_(0)).toList
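In case the one-liner is cryptic, here's a hedged walkthrough with the example input (assuming x is already sorted); appending x.last once more is what keeps the final value from being dropped by the filter:
val x = List(1, 2, 3, 8, 15, 26)
// x :+ x.last   -> List(1, 2, 3, 8, 15, 26, 26)
// .sliding(2)   -> List(1,2) List(2,3) List(3,8) List(8,15) List(15,26) List(26,26)
// keep windows where the second element is NOT the successor of the first,
// then take each window's first element
val y = (x :+ x.last).sliding(2).filter(p => p(0) != p(1) - 1).map(_(0)).toList
// y == List(3, 8, 15, 26)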

How to calculate median over RDD[org.apache.spark.mllib.linalg.Vector] in Spark efficiently?

What I want to do is something like this:
http://cn.mathworks.com/help/matlab/ref/median.html?requestedDomain=www.mathworks.com
Find the median value of each column.
It could be done by collecting the RDD to the driver, but for big data that becomes impossible.
I know Statistics.colStats() can calculate mean, variance... but median is not included.
Additionally, the vector is high-dimensional and sparse.
Well, I didn't understand the vector part; however, this is my approach (I bet there are better ones):
val a = sc.parallelize(Seq(1, 2, -1, 12, 3, 0, 3))
val n = a.count() / 2
println(n) // outputs 3
val b = a.sortBy(x => x).zipWithIndex()
val median = b.filter(x => x._2 == n).collect()(0)._1 // this part doesn't look nice, I hope someone tells me how to improve it, maybe zero?
println(median) // outputs 2
b.collect().foreach(println) // (-1,0) (0,1) (1,2) (2,3) (3,4) (3,5) (12,6)
The trick is to sort your dataset with sortBy, zip the entries with their index using zipWithIndex, and then get the middle entry. Note that I used an odd number of samples for simplicity, but the essence is there. Besides, you would have to do this for every column of your dataset.
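As a hedged follow-up (the median helper below is mine, not a library function): the same idea wrapped in a function that also handles an even number of elements by averaging the two middle values; for the original question you would still have to apply it to each column separately.
import org.apache.spark.rdd.RDD
// Median of a numeric RDD via sort + zipWithIndex, as in the snippet above.
def median(rdd: RDD[Double]): Double = {
  val indexed = rdd.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
  val n = indexed.count()
  if (n % 2 == 1) indexed.lookup(n / 2).head
  else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0
}
median(a.map(_.toDouble)) // 2.0 for the sample data above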