Spark approach for equation involving pairwise distance - scala

I have a piece of pseudocode that I've been trying to implement in Spark (currently using Scala, but happy to use another language if needed) for about a week, and I am utterly stuck. The section of pseudocode that's causing me issues is (apologies for the image, but without a LaTeX option like MathOverflow's, it seemed the clearest way to show it): [pseudocode image]
Each row contains an id1, id2, ob, x, and y.
I'm using Window to partition by (id1, id2), with each window containing multiple rows of (x: Integer, y: Integer, ob: Double), each of which constitutes a cell c.
The loop is for k = 1 ... m with m being the number of rows in the window.
Order of the rows does not matter for my purposes (the st value will be affected, but past work suggests it doesn't have an observable difference in the final result).
All previously calculated rows with st > 0 are part of K. Since st >= 0, it seems safe to include all previously calculated rows.
alpha is a fixed parameter.
Dis_grid is currently a Euclidean distance UDF between the x and y coordinates by row, but it can be a different distance measure if that would make implementation easier.
I'm not able to figure out how to:
Assign the first row in a window a distinct value.
Use the st of previously calculated rows to calculate the st for the next row.
Calculate individual pairwise distances between rows to be part of the st formula.
Any assistance is greatly appreciated.

I am having a hard time following precisely what you're trying to do (I think order does matter within a group/window), so this may not be right, but these are my thoughts. First off, I think this will be a lot easier to do thinking in terms of Dataset operations rather than DataFrame operations. (If you want to do this in the DataFrame world, I think you want to look at User Defined Aggregation Functions (UDAFs).) Since a DataFrame is nothing but Dataset[Row], it's all equivalent, but I'm going to define a Cell case class and work with a Dataset[Cell] to simplify things, so please indulge me.
case class Cell(id1: Int, id2: Int, ob: Double, x: Int, y: Int)
Imagine we have a Dataset[Cell], call it ds. We want to group it by the pairs (id1, id2), apply some function (call it f) to each group, and end up with a Dataset[(Cell, Double)]. (I think this is what you want. I'm having trouble seeing how the order of rows/cells within a group does not matter in this case.) This amounts to
ds.groupByKey(c => (c.id1, c.id2)).flatMapGroups(f)
So what's f? It needs to compute st for each cell in the group. If I'm understanding properly, the summation works like this: for cell i, sum st(j) * alpha^dist(i, j) over all j < i. Subtract that sum from the cell's observation and take the max with 0 to get st. Assuming you have a function dist to compute the distance between two cells, the summation term, written as a function of all the cells, the previously computed sts, and the cell c, works out to be:
def stSum(cells: List[Cell], sts: List[Double], c: Cell): Double =
  sts.zipWithIndex.map { case (st, j) =>
    st * math.pow(alpha, dist(c, cells(j)))
  }.sum
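Since dist is left abstract above, here is a minimal sketch of a Euclidean version over the cell coordinates, matching the Dis_grid the question describes (alpha is assumed to be defined elsewhere as the fixed parameter):

// Plain Euclidean distance between two cells' (x, y) positions;
// substitute any other measure here without changing stSum.
def dist(a: Cell, b: Cell): Double =
  math.hypot((a.x - b.x).toDouble, (a.y - b.y).toDouble)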
Then for a cell c, st is math.max(0, c.ob - stSum(cells, sts, c)), where cells is all the cells in the group, and sts is a list of the st values for the "earlier" cells. We can then compute the (ordered) list of sts with a foldLeft:
cells.foldLeft(List.empty[Double]) { case (sts, c) =>
  sts :+ math.max(0, c.ob - stSum(cells, sts, c))
}
We can then assemble f:
def f(key: (Int, Int), cellIterator: Iterator[Cell]): List[(Cell, Double)] = {
  val cells = cellIterator.toList
  val sts = cells.foldLeft(List.empty[Double]) { case (stAcc, c) =>
    stAcc :+ math.max(0, c.ob - stSum(cells, stAcc, c))
  }
  cells.zip(sts)
}
We need to convert the group's iterator to a list because we need to access its elements by index, so we need a type that supports that. It's generally faster in Scala to prepend to a list rather than append, and it would be possible to build up the list of sts in reverse, but it'd require some index juggling. One final note is that f above is a method, so you may need to eta-expand it (f _) when passing it to flatMapGroups, ending up with
ds.groupByKey(c => (c.id1, c.id2)).flatMapGroups(f _)
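Putting the pieces together, here is a self-contained sketch under the assumptions above (the alpha value is a placeholder, and the Euclidean dist stands in for your Dis_grid):

import org.apache.spark.sql.Dataset

case class Cell(id1: Int, id2: Int, ob: Double, x: Int, y: Int)

object StPerGroup {
  val alpha = 0.5 // placeholder; substitute your fixed parameter

  def dist(a: Cell, b: Cell): Double =
    math.hypot((a.x - b.x).toDouble, (a.y - b.y).toDouble)

  // Sum of st(j) * alpha^dist(c, cells(j)) over the already-computed sts.
  def stSum(cells: List[Cell], sts: List[Double], c: Cell): Double =
    sts.zipWithIndex.map { case (st, j) =>
      st * math.pow(alpha, dist(c, cells(j)))
    }.sum

  // Compute the (cell, st) pairs for one (id1, id2) group.
  def f(key: (Int, Int), group: Iterator[Cell]): List[(Cell, Double)] = {
    val cells = group.toList
    val sts = cells.foldLeft(List.empty[Double]) { case (acc, c) =>
      acc :+ math.max(0, c.ob - stSum(cells, acc, c))
    }
    cells.zip(sts)
  }

  def compute(ds: Dataset[Cell]): Dataset[(Cell, Double)] = {
    import ds.sparkSession.implicits._
    ds.groupByKey(c => (c.id1, c.id2)).flatMapGroups(f _)
  }
}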

Related

Removing duplicates (ints) in an array and replacing them with chars

So I'm trying to make a basic Hitori solver, but I am not sure where I should start. I'm still new to Scala.
My first issue is that I'm trying to take an array of some ints (1,2,3,4,2)
and make the program output them like this: (1,2,3,4,B)
Notice that the duplicate has become the char B.
Where do I start? Here is what I already did, but it didn't do exactly what I need.
val s = lines.split(" ").toSet
var jetSet = s
for (i <- jetSet) {
  print(i)
}
One way is to fold over the numbers, left to right, building the Set[Int] for the uniqueness test and the list of output as you go along.
val arr = Array(1, 2, 3, 4, 2)
arr.foldLeft((Set[Int](), List[String]())) { case ((s, l), n) =>
  if (s(n)) (s, "B" :: l)
  else (s + n, n.toString :: l)
}._2.reverse // res0: List[String] = List(1, 2, 3, 4, B)
From here you can use mkString() to format the output as desired.
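For example, applying mkString to the list produced by the fold above gives the exact shape from the question:

val formatted = List("1", "2", "3", "4", "B").mkString("(", ",", ")")
// formatted: String = (1,2,3,4,B)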
What I'd suggest is to break your program into a number of steps and try to solve those.
As a first step you could transform the list into tuples of the numbers and the number of times each has appeared so far ...
(1,2,3,4,2) becomes ((1,1),(2,1),(3,1),(4,1),(2,2))
Next step it's easy to map over this list returning the number if the count is 1 or the letter if it is greater.
That first step is a little bit tricky because as you walk through the list you need to keep track of how many you've seen so far of each letter.
When you want to process a sequence and maintain some changing state as you do, you should use a fold. If you're not familiar with fold, it has the following signature:
def foldLeft[B](z: B)(op: (B, A) => B): B
Note that the type of z (the initial value) has to match the type of the return value from the fold (B).
So one way to do this would be for type B to be a tuple of (outputList, seenSoFarCounts).
outputList would accumulate the (number, count) pairs step by step, while seenSoFarCounts would be a map from each number to how many times it has been seen so far.
So what you get out of the foldLeft is a tuple like (List((1,1),(2,1),(3,1),(4,1),(2,2)), Map(1 -> 1, 2 -> 2, ...)).
Now you can map over that first element of the tuple as described above.
Once it's working you could avoid the last step by updating the numbers to letters as you work through the fold.
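A sketch of those two steps (the names are hypothetical; the fold builds the (number, countSoFar) pairs together with the counts map, then a map produces the output):

val nums = List(1, 2, 3, 4, 2)

// Step 1: pair each number with how many times it has occurred so far.
val (pairs, _) = nums.foldLeft((List.empty[(Int, Int)], Map.empty[Int, Int])) {
  case ((acc, counts), n) =>
    val c = counts.getOrElse(n, 0) + 1
    ((n, c) :: acc, counts.updated(n, c))
}

// Step 2: first occurrences stay numbers, repeats become "B".
val out = pairs.reverse.map { case (n, c) => if (c == 1) n.toString else "B" }
// out: List(1, 2, 3, 4, B)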
Usually this technique of breaking things into steps makes it simple to reason about, then when it's working you may see that some steps trivially collapse into each other.
Hope this helps.

Scala: Sort a ListBuffer of (Int, Int, Int) by the third value most efficiently

I am looking to sort a ListBuffer[(Int, Int, Int)] by the third value most efficiently. This is what I currently have, using sortBy. Note y and z are both ListBuffer[(Int, Int, Int)], and I take the difference of the two first. My goal is to perform this operation (taking the difference between the two lists and sorting by the third element) as efficiently as possible. I am assuming the diff part cannot be optimized but the sortBy can, so I am looking for an efficient way to do the sorting part. I found posts on sorting Arrays, but I am working with a ListBuffer, and converting to an Array adds overhead, so I'd rather not convert my ListBuffer.
val x = (y diff z).sortBy(i => i._3)
1) If you want to use Scala libraries then you can't do much better than that. Scala already tries to sort your collection in the most efficient way possible.
SeqLike defines def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f) which calls this implementation:
def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
  val len = this.length
  val arr = new ArraySeq[A](len)
  var i = 0
  for (x <- this.seq) {
    arr(i) = x
    i += 1
  }
  java.util.Arrays.sort(arr.array, ord.asInstanceOf[Ordering[Object]])
  val b = newBuilder
  b.sizeHint(len)
  for (x <- arr) b += x
  b.result
}
This is what your code will be calling. As you can see it already uses arrays to sort data in place. According to the javadoc of public static void sort(Object[] a):
Implementation note: This implementation is a stable, adaptive, iterative mergesort that requires far fewer than n lg(n) comparisons when the input array is partially sorted, while offering the performance of a traditional mergesort when the input array is randomly ordered. If the input array is nearly sorted, the implementation requires approximately n comparisons.
2) If you try to optimize by inserting the results of your diff directly into a sorted structure like a binary tree as you produce them element by element, you'll still be paying the same price: an average insertion cost of log(n), times n elements, gives n log(n), the same as any fast sorting algorithm like merge sort.
3) Thus you can't optimize this case generically unless you optimize to your particular use-case.
3a) For instance, ListBuffer might be replaced with a Set and diff should be much faster. In fact it's implemented as:
def diff(that: GenSet[A]): This = this -- that
which uses -, which in turn should be faster than diff on Seq, which has to build a map first:
def diff[B >: A](that: GenSeq[B]): Repr = {
  val occ = occCounts(that.seq)
  val b = newBuilder
  for (x <- this)
    if (occ(x) == 0) b += x
    else occ(x) -= 1
  b.result
}
3b) You can also avoid sorting by using _3 as an index into an array. If you insert each tuple at that index, the array is sorted by construction (see the sketch below). This only works if your data is dense enough, or if you are happy to deal with a sparse array afterwards. Multiple values may also map to the same index, and you'll have to deal with that as well. Effectively you are building a sorted map. You could use a Map for that too, but a HashMap won't be sorted and a TreeMap requires log(n) per add operation again.
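A rough sketch of that bucketing idea, assuming the _3 values are non-negative and bounded by some known maxKey (both are assumptions, not part of the question):

// buckets(k) collects every tuple whose third element is k;
// reading the buckets in index order yields a sorted result with no sort call.
def bucketByThird(xs: Seq[(Int, Int, Int)], maxKey: Int): Seq[(Int, Int, Int)] = {
  val buckets = Array.fill(maxKey + 1)(List.empty[(Int, Int, Int)])
  for (t <- xs) buckets(t._3) = t :: buckets(t._3)
  buckets.toSeq.flatMap(_.reverse) // reverse preserves insertion order within a bucket
}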
Consult Scala Collections Performance Characteristics to understand what you can gain based on your case.
4) Anyhow, sort is really fast on modern computers. Do some benchmarking to make sure you are not prematurely optimizing it.
To summarize the complexity for different scenarios...
Your current case:
diff for SeqLike: n to create a map from that + n to iterate over this (map lookup is effectively constant time (C)) = 2n, or O(n)
sort: O(n log(n))
total: O(n) + O(n log(n)) = O(n log(n)), more precisely 2n + n log(n)
If you use Set instead of SeqLike:
diff for Set: n to iterate (lookup is C) = O(n)
sort: same
total: same, O(n) + O(n log(n)) = O(n log(n)), more precisely n + n log(n)
If you use Set and an array to insert:
diff: same as for Set
sort: 0, the array is sorted by construction
total: O(n) + O(0) = O(n), more precisely n. Might not be very practical for sparse data.
Looks like in the grand scheme of things it does not matter that much unless you have a unique case that benefits from the last option (the array).
If you had a ListBuffer[Int] rather than a ListBuffer[(Int, Int, Int)], I would suggest sorting both collections first and then doing the diff in a single pass through both of them at the same time, which would be O(n log(n)) overall. In your case a sort by _3 alone is not sufficient to guarantee the exact order in both collections; you can sort by all three fields of the tuple, but that changes the original ordering. If you are fine with that and with writing your own diff, it might be the fastest option.
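To illustrate that single-pass idea (a sketch only, assuming both inputs are already sorted ascending and that diff has multiset semantics, i.e. each element of zs cancels at most one matching element of ys):

import scala.collection.mutable.ListBuffer

// One pass over two sorted sequences: keep the elements of ys
// that are not cancelled by a matching element of zs.
def sortedDiff(ys: Seq[Int], zs: Seq[Int]): List[Int] = {
  val yi = ys.iterator.buffered
  val zi = zs.iterator.buffered
  val out = ListBuffer.empty[Int]
  while (yi.hasNext) {
    if (!zi.hasNext || yi.head < zi.head) out += yi.next()
    else if (yi.head == zi.head) { yi.next(); zi.next() } // cancel one occurrence
    else zi.next() // zi.head < yi.head: no counterpart in ys, skip it
  }
  out.toList
}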

Tail recursion with List + .toVector or Vector?

val dimensionality = 10
val zeros = DenseVector.zeros[Int](dimensionality)

@tailrec private def specials(list: List[DenseVector[Int]], i: Int): List[DenseVector[Int]] = {
  if (i >= dimensionality) list
  else {
    val vec = zeros.copy
    vec(i to i) := 1
    specials(vec :: list, i + 1)
  }
}
val specialList = specials(Nil, 0).toVector
specialList.map(...doing my thing...)
Should I write my tail recursive function using a List as accumulator above and then write
specials(Nil, 0).toVector
or should I write my tail recursion with a Vector in the first place? Which is computationally more efficient?
By the way: specialList is a list that contains DenseVectors where every entry is 0 except for a single entry, which is 1. There are as many DenseVectors as each one is long.
I'm not sure what you're trying to do here, but you could rewrite your code like so:
type Mat = List[Vector[Int]]
val zeroVec = Vector.fill(dimensionality)(0) // a plain Scala Vector rather than a DenseVector

@tailrec
private def specials(mat: Mat, i: Int): Mat = i match {
  case `dimensionality` => mat
  case _ =>
    specials(zeroVec.updated(i, 1) :: mat, i + 1)
}
As you are dealing with a matrix, Vector is probably a better choice.
Let's compare the performance characteristics of both variants:
List: prepending takes constant time, conversion to Vector takes linear time.
Vector: prepending takes "effectively" constant time (eC), no subsequent conversion needed.
If you compare the implementations of List and Vector, then you'll find out that prepending to a List is a simpler and cheaper operation than prepending to a Vector. Instead of just adding another element at the front as it is done by List, Vector potentially has to replace a whole branch/subtree internally. On average, this still happens in constant time ("effectively" constant, because the subtrees can differ in their size), but is more expensive than prepending to List. On the plus side, you can avoid the call to toVector.
Eventually, the crucial point of interest is the size of the collection you want to create (or in other words, the number of recursive prepend steps you are doing). It's totally possible that there is no clear winner, and one of the two variants is faster for <= n steps whereas the other variant is faster for > n steps. In my naive toy benchmark, List/toVector seemed to be faster for fewer than 8k elements, but you should perform a set of well-chosen benchmarks that represent your scenario adequately.
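For what it's worth, a crude harness along the lines of that toy benchmark might look like this (a sketch only; for numbers you can trust, use a proper microbenchmarking tool such as JMH, with warm-up and many iterations):

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.2f ms")
  result
}

val n = 8000
time("List + toVector") {
  (0 until n).foldLeft(List.empty[Int])((acc, i) => i :: acc).toVector
}
time("Vector prepend") {
  (0 until n).foldLeft(Vector.empty[Int])((acc, i) => i +: acc)
}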

Why mapped pairs get obliterated?

I'm trying to understand the example here which computes Jaccard similarity between pairs of vectors in a matrix.
val aBinary = adjacencyMatrix.binarizeAs[Double]
// intersectMat holds the size of the intersection of row(a)_i n row (b)_j
val intersectMat = aBinary * aBinary.transpose
val aSumVct = aBinary.sumColVectors
val bSumVct = aBinary.sumRowVectors
//Using zip to repeat the row and column vectors values on the right hand
//for all non-zeroes on the left hand matrix
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
val yMat = intersectMat.zip(bSumVct).mapValues( pair => pair._2 )
Why does the last comment mention non-zero values? As far as I'm aware, the ._2 function selects the second element of a pair independent of the first element. At what point are (0, x) pairs obliterated?
Yeah, I don't know anything about Scalding, but this seems odd. If you look at the zip implementation, it specifically mentions that it does an outer join to preserve zeros on either side. So the comment does not seem to apply to how zeroes are actually treated in matrix.zip.
Besides looking at the dimension returned by zip, it really seems this line just replicates the aSumVct column vector for each column:
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
Also, I find the val bSumVct = aBinary.sumRowVectors suspicious, because it sums the matrix along the wrong dimension. It feels like something like this would be better:
val bSumVct = aBinary.transpose.sumRowVectors
This would conceptually be the same as aSumVct.transpose, so that at the end of the day, in cell (i, j) of xMat + yMat we find the sum of the elements of row(i) plus the sum of the elements of row(j); we then subtract intersectMat to adjust for the double counting, which is exactly the Jaccard denominator |A ∪ B| = |A| + |B| - |A ∩ B|.
Edit: a little bit of googling unearthed this blog post: http://www.flavianv.me/post-15.htm. It seems the comments were related to that version where the vectors to compare are in two separate matrices that don't necessarily have the same size.

Having hard time to understand why this function is called MapReduce

In my Scala course an example was given. It was about finding a more general function that can be used to define both an arithmetic summation function and an arithmetic product function. Here are the functions that should be generalized.
def sum(f: Int => Int)(a: Int, b: Int): Int = {
  if (a > b) 0
  else f(a) + sum(f)(a + 1, b)
}

def product(f: Int => Int)(a: Int, b: Int): Int = {
  if (a > b) 1
  else f(a) * product(f)(a + 1, b)
}
To generalize these functions the teacher gave the following function:
def mapReduce(f: Int => Int, combine: (Int, Int) => Int, zero: Int)(a: Int, b: Int): Int = {
  if (a > b) zero
  else combine(f(a), mapReduce(f, combine, zero)(a + 1, b))
}
So mapReduce function can be used to generalize sum and product functions as follows:
def sumGN(f: Int => Int)(a: Int, b: Int) = mapReduce(f, (x, y) => x + y, 0)(a, b)
def productGN(f: Int => Int)(a: Int, b: Int) = mapReduce(f, (x, y) => x * y, 1)(a, b)
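For example, instantiating both with the identity function:

sumGN(identity)(1, 5)     // 1 + 2 + 3 + 4 + 5 = 15
productGN(identity)(1, 5) // 1 * 2 * 3 * 4 * 5 = 120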
I took a look at the definition of map-reduce in functional programming, but I am having a hard time seeing why the generalized function above has been named mapReduce. I cannot grasp the relation. Any help would make me very happy.
Regards
Functional programming usually has three central operators: map, reduce (sometimes called fold), and filter.
Map takes a list and an operation and produces a list containing the operation applied to everything in the first list.
Filter takes a list and a test and produces another list containing only the elements that pass the test.
Reduce (or fold) takes a list, an operation, and an initial value, and applies the operation to the initial value and the first element of the list, passing the output into itself along with the next list item, folding the whole list down to a single value.
If, for example, your list is [2,3,4,5,6,7], your initial value is 1, and your operation is addition, reduction will behave in the following way:
Reduce([2,3,4,5,6,7], +, 1) = ((((((initial + 2) + 3) + 4) + 5) + 6) + 7)
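In Scala, that reduction is just a foldLeft over the list:

List(2, 3, 4, 5, 6, 7).foldLeft(1)(_ + _) // ((((((1 + 2) + 3) + 4) + 5) + 6) + 7 = 28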
Your instructor may be calling it mapReduce because this is the paradigm's name, though simply reduce would be sufficient as well.
If you're curious as to the significance of his name, you should ask him. He is your instructor and all.
This is by no means an exact explanation (names are fuzzy anyway) but here’s an alternative definition:
def mapReduce(f: Int => Int, combine: (Int, Int) => Int, zero: Int)(a: Int, b: Int): Int = {
  if (a > b) zero
  else (a to b).map(f).reduce(combine)
}
Do you see the link?
mapReduce's mapping function is f in the question, though there's never an example of its definition. For sum and product it would be the identity function, but if you were summing squares, the mapping function would be the squaring function.
mapReduce's reducer function is combine, which folds each newly mapped value into the accumulator for the next recursion.
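For instance, summing squares with the teacher's mapReduce looks like this:

def sumSquares(a: Int, b: Int): Int = mapReduce(x => x * x, _ + _, 0)(a, b)

sumSquares(1, 3) // 1 + 4 + 9 = 14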
I think the missing link, besides the code not being very clear, is to treat numbers as collections (e.g., 3 is a collection of three 1s). This is quite unusual and I don't know what it buys you, but it could be that your teacher will use the analogy between numbers and collections for something more profound later.
Is this from Odersky's coursera course?