Using Scala Breeze to do NumPy-style broadcasting - scala

Is there a generic way using Breeze to achieve what you can do using broadcasting in NumPy?
Specifically, if I have an operator I'd like to apply to two 3x4 matrices, I can apply that operation element-wise. However, what I actually have is a 3x4 matrix and a 3-element column vector. I'd like a function that produces a 3x4 matrix by applying the operator to each element of the matrix together with the vector element for the corresponding row.
So for a division:
2 4 6   /   2   =   1 2 3
3 6 9       3       1 2 3
If this isn't available, I'd be willing to look at implementing it.

You can use mapPairs to achieve what I 'think' you're looking for:
import breeze.linalg.{DenseMatrix, DenseVector}

val adder = DenseVector(1, 2, 3, 4)
val result = DenseMatrix.zeros[Int](3, 4).mapPairs {
  case ((row, col), value) => value + adder(col)
}
println(result)
1 2 3 4
1 2 3 4
1 2 3 4
I'm sure you can adapt what you need from the simple 'adder' example above.
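For the division example in the question, the same mapPairs approach might look like this (a minimal sketch; the variable names are mine):

import breeze.linalg.{DenseMatrix, DenseVector}

// Divide every element by the vector entry for its row.
val m = DenseMatrix((2, 4, 6), (3, 6, 9))
val divisor = DenseVector(2, 3)
val divided = m.mapPairs { case ((row, _), value) => value / divisor(row) }
// divided:
// 1  2  3
// 1  2  3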

Breeze now supports broadcasting of this sort:
scala> val dm = DenseMatrix( (2, 4, 6), (3, 6, 9) )
dm: breeze.linalg.DenseMatrix[Int] =
2 4 6
3 6 9
scala> val dv = DenseVector(2,3)
dv: breeze.linalg.DenseVector[Int] = DenseVector(2, 3)
scala> dm(::, *) :/ dv
res4: breeze.linalg.DenseMatrix[Int] =
1 2 3
1 2 3
The * operator says which axis to broadcast along. Breeze doesn't allow implicit broadcasting, except for scalar types.
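Broadcasting down the other axis works the same way; here is a minimal sketch (my own example, not from the original answer), where the vector needs one entry per column:

import breeze.linalg._

val dm = DenseMatrix((2, 4, 6), (3, 6, 9))
val rowVec = DenseVector(1, 10, 100) // one entry per column

// dm(*, ::) treats each row as a vector, so rowVec is applied to every row.
val shifted = dm(*, ::) + rowVec
// 3   14  106
// 4   16  109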

Related

Find a chain that fills a range without gaps

Given a range, e.g.:
range : 1 4
there are 3 lists of pairs that 'fill' this range without gaps. Call each of them a 'chain':
chain1: 1 2, 2 3, 3 4
chain2: 1 2, 2 4
chain3: 1 3, 3 4
The following lists of pairs fail to fill the range, so they are not chains:
1 2 <= missing 2 3, 3 4 OR 2 4
1 3 <= missing 3 4
1 2, 3 4 <= missing 2 3
1 2, 2 3 <= missing 3 4
2 3, 3 4 <= missing 1 2
Now the question: given an arbitrary list of pairs and a range, how can I find out whether there is a combination of pairs that makes a CHAIN?
Pairs do not repeat, are always (lower-value, higher-value), and never overlap, i.e. this is never an input: [1 3, 2 4].
In addition, I can pre-filter to only the pairs that are within the range, so for example you don't have to worry about pairs like 4 6, 7 9, 0 1 ...
Update: these are also valid inputs:
chain: 1 2, 1 3, 3 4
chain: 1 2, 1 3, 1 4
chain: 1 2, 2 4, 3 4
i.e. a pair can subsume another; this breaks the sort-and-loop idea.
I implemented the Union-Find algorithm from here in Python:
q = QFUF(10)
In [131]: q.is_cycle([(0,1),(1,3),(2,3),(0,3)])
Out[131]: True
In [132]: q.is_cycle([0,1,4,3,1])
Out[132]: True
is_cycle only cares about the first two elements of each tuple:
In [134]: q.is_cycle([(0,1,0.3),(1,3,0.7),(2,3,0.4),(0,3,0.2)])
Out[134]: True
Here is the code:
import numpy as np

class QFUF:
    """ Detect cycles using the Union-Find algorithm """

    def __init__(self, n):
        self.n = n
        self.reset()

    def reset(self): self.ids = np.arange(self.n)

    def find(self, a): return self.ids[a]

    def union(self, a, b):
        # grab the ids up front, because self.ids is updated in the loop below
        aid = self.ids[a]
        bid = self.ids[b]
        if aid == bid: return
        for x in range(self.n):
            if self.ids[x] == aid: self.ids[x] = bid

    # given the next ~link/pair, check whether it forms a cycle
    def step(self, a, b):
        # print(f'{a} => {b}')
        if self.find(a) == self.find(b): return True
        self.union(a, b)
        return False

    def is_cycle(self, seq):
        self.reset()
        # if seq is not a sequence of pairs, make pairs from consecutive elements
        if not isinstance(seq[0], tuple):
            seq = zip(seq[:-1], seq[1:])
        for tpl in seq:
            a, b = tpl[0], tpl[1]
            if self.step(a, b): return True
        return False
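For the original chain question, the check boils down to connectivity: union every (pre-filtered) pair, then test whether the two range endpoints ended up in the same component; since pairs never partially overlap, a connected chain of pairs covers the whole range. A minimal Scala sketch of that idea (the class and method names are mine, not from the post):

// Quick-find union-find over the values 0..n-1.
class UnionFind(n: Int) {
  private val ids = Array.tabulate(n)(identity)
  def find(a: Int): Int = ids(a)
  def union(a: Int, b: Int): Unit = {
    val (aid, bid) = (ids(a), ids(b))
    if (aid != bid) ids.indices.foreach(i => if (ids(i) == aid) ids(i) = bid)
  }
}

// A chain covering [lo, hi] exists iff lo and hi become connected.
def hasChain(lo: Int, hi: Int, pairs: Seq[(Int, Int)]): Boolean = {
  val uf = new UnionFind(hi + 1)
  pairs.foreach { case (a, b) => uf.union(a, b) }
  uf.find(lo) == uf.find(hi)
}

// hasChain(1, 4, Seq((1, 2), (2, 4)))         // true
// hasChain(1, 4, Seq((1, 2), (3, 4)))         // false, 2 3 is missing
// hasChain(1, 4, Seq((1, 2), (1, 3), (3, 4))) // true, a pair may subsume another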

Spark Scala: How to work with each 3 elements of rdd?

Hi everyone.
I have the following problem:
I have a very big RDD with billions of elements, like:
Array[((Int, Int), Double)] = Array(((0,0),729.0), ((0,1),169.0), ((0,2),1.0), ((0,3),5.0), ...... ((34,45),34.0), .....)
I need to do the following operation:
take the value of each element with key (i, j) and add to it
min(rdd_value[(i-1, j)], rdd_value[(i, j-1)], rdd_value[(i-1, j-1)])
How can I do this without using collect()? After collect() I get a Java memory error because my RDD is very big.
Thank you very much!
I am trying to port this algorithm from Python, where the time series are RDDs.
from math import sqrt

def DTWDistance(s1, s2):
    DTW = {}
    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)], DTW[(i, j-1)], DTW[(i-1, j-1)])
    return sqrt(DTW[len(s1)-1, len(s2)-1])
And now I need to perform the last operation, the nested for loop; dist is already calculated.
Example:
Input (like matrix):
4 5 1
7 2 3
9 0 1
Rdd looks like
rdd.take(10)
Array(((1,1), 4), ((1,2), 5), ((1,3), 1), ((2,1), 7), ((2,2), 2), ((2,3), 3), ((3,1), 9), ((3,2), 0), ((3,3), 1))
I want to do this operation
rdd_value[(i, j)] = rdd_value[(i, j)] + min(rdd_value[(i-1, j)],rdd_value[(i, j-1)], rdd_value[(i-1, j-1)])
For example:
((1, 1), 4) = 4 + min(infinity, infinity, 0) = 4 + 0 = 4
4 5 1
7 2 3
9 0 1
Then
((1, 2), 5) = 5 + min(infinity, 4, infinity) = 5 + 4 = 9
4 9 1
7 2 3
9 0 1
Then
....
Then
((2, 2), 2) = 2 + min(7, 9, 4) = 2 + 4 = 6
4 9 1
7 6 3
9 0 1
Then
.....
((3, 3), 1) = 1 + min(3, 0, 2) = 1 + 0 = 1
A short answer is that the problem you are trying to solve cannot be expressed efficiently and concisely using Spark. It doesn't really matter if you choose plain RDDs or distributed matrices.
To understand why, you'll have to think about the Spark programming model. A fundamental Spark concept is a graph of dependencies where each RDD depends on one or more parent RDDs. If your problem were defined as follows:
given an initial matrix M_0
for i <- 1..n
  find matrix M_i where M_i(m,n) = M_{i-1}(m,n) + min(M_{i-1}(m-1,n), M_{i-1}(m-1,n-1), M_{i-1}(m,n-1))
then it would be trivial to express using Spark API (pseudocode):
rdd
  .flatMap(lambda ((i, j), v):
    [((i + 1, j), v), ((i, j + 1), v), ((i + 1, j + 1), v)])
  .reduceByKey(min)
  .union(rdd)
  .reduceByKey(add)
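As a rough Scala sketch of that single step (a direct translation of the pseudocode above; like the pseudocode, it also emits keys just outside the matrix border, which a real implementation would filter out or join back against the original keys):

// One iteration M_{i-1} -> M_i of the hypothetical recurrence:
// push every value to the three cells that depend on it, keep the minimum
// per target cell, then add it to that cell's original value.
def step(rdd: org.apache.spark.rdd.RDD[((Int, Int), Double)]) =
  rdd
    .flatMap { case ((i, j), v) =>
      Seq(((i + 1, j), v), ((i, j + 1), v), ((i + 1, j + 1), v))
    }
    .reduceByKey((a, b) => math.min(a, b))
    .union(rdd)
    .reduceByKey(_ + _)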
Unfortunately, you are trying to express dependencies between individual values inside the same data structure. Spark aside, this is a problem which is much harder to parallelize, not to mention distribute.
This type of dynamic programming is hard to parallelize because at different points it is completely or almost completely sequential. When you try to compute, for example, M_i(0,0) or M_i(m,n), there is nothing to parallelize. It is hard to distribute because it can generate complex dependencies between blocks.
There are non-trivial ways to handle this in Spark, by computing individual blocks and expressing dependencies between these blocks, or by using iterative algorithms and propagating messages over an explicit graph (GraphX), but this is far from easy to get right.
At the end of the day there are tools which can be a much better choice for this type of computation than Spark.

Is it possible to join two rdds' values to avoid expensive shuffling?

I have two RDDs, both having two columns as (K, V). In the source files for those RDDs, keys appear one under the other, and each row assigns a different, distinct value to its key. The text files used to create the RDDs are given at the bottom of this post.
Keys are totally different in both RDDs and I would like to join two RDDs based on their values and try to find how many common values exist for each pair. e.g. I am trying to reach a result such as (1-5, 10) meaning that a key value of "1" from RDD1 and a key value of "5" from RDD2 share 10 values in common.
I work on a single machine with 256 GB ram and 72 cores. One text file is 500 MB while the other is 3 MB.
Here is my code:
val conf = new SparkConf().setAppName("app").setMaster("local[*]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4")
.set("spark.executor.memory","128g")
.set("spark.driver.maxResultSize", "0")
val RDD1 = sc.textFile("\\t1.txt",1000).map{line => val s = line.split("\t"); (s(0),s(1))}
val RDD2 = sc.textFile("\\t2.txt",1000).map{line => val s = line.split("\t"); (s(1),s(0))}
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(v1, Iterable())
} yield (s"$k-$v2", 1))
joined.foreach(println)
val result = joined.reduceByKey((a,b) => a+b)
I try to manage this issue by using a broadcast variable, as seen in the script. If I join RDD2 (which has 250K rows) with itself, pairs show up in the same partitions, so less shuffling takes place and it takes 3 minutes to get the results. However, when joining RDD1 with RDD2, the pairs are scattered across partitions, resulting in a very expensive shuffle, and it always ends up giving
ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 168591 ms error.
Based on my results:
Should I try to partition the text file used to create RDD1 into smaller chunks and join those chunks separately with RDD2?
Is there another way of joining two RDDs based on their value fields? If I use the original values as keys and join them with the join function, the value pairs are again scattered over the partitions, which results in another very expensive reduceByKey operation, e.g.:
val RDD1 = sc.textFile("\\t1.txt",1000).map{line => val s = line.split("\t"); (s(1),s(0))}
val RDD2 = sc.textFile("\\t2.txt",1000).map{line => val s = line.split("\t"); (s(1),s(0))}
RDD1.join(RDD2).map(line => (line._2,1)).reduceByKey((a,b) => (a+b))
PSEUDO DATA SAMPLE:
KEY VALUE
1 13894
1 17376
1 15688
1 22434
1 2282
1 14970
1 11549
1 26027
1 2895
1 15052
1 20815
2 9782
2 3393
2 11783
2 22737
2 12102
2 10947
2 24343
2 28620
2 2486
2 249
2 3271
2 30963
2 30532
2 2895
2 13894
2 874
2 2021
3 6720
3 3402
3 25894
3 1290
3 21395
3 21137
3 18739
...
A SMALL EXAMPLE
RDD1
2 1
2 2
2 3
2 4
2 5
2 6
3 1
3 6
3 7
3 8
3 9
4 3
4 4
4 5
4 6
RDD2
21 1
21 2
21 5
21 11
21 12
21 10
22 7
22 8
22 13
22 9
22 11
BASED ON THIS DATA JOIN RESULTS:
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(3-22,1)
(4-21,1)
(2-21,1)
(3-21,1)
(3-22,1)
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(4-21,1)
(2-21,1)
(3-21,1)
REDUCEBYKEY RESULTS:
(4-21,1)
(3-21,1)
(2-21,3)
(3-22,3)
Have you looked at using a cartesian join? You could maybe try something like below:
val rdd1 = sc.parallelize(for { x <- 1 to 3; y <- 1 to 5 } yield (x, y)) // sample RDD
val rdd2 = sc.parallelize(for { x <- 1 to 3; y <- 3 to 7 } yield (x, y)) // sample RDD with slightly displaced values from the first
val g1 = rdd1.groupByKey()
val g2 = rdd2.groupByKey()
val cart = g1.cartesian(g2).map { case ((key1, values1), (key2, values2)) =>
  ((key1, key2), (values1.toSet & values2.toSet).size)
}
When I try running the above example in a cluster, I see the following:
scala> rdd1.take(5).foreach(println)
...
(1,1)
(1,2)
(1,3)
(1,4)
(1,5)
scala> rdd2.take(5).foreach(println)
...
(1,3)
(1,4)
(1,5)
(1,6)
(1,7)
scala> cart.take(5).foreach(println)
...
((1,1),3)
((1,2),3)
((1,3),3)
((2,1),3)
((2,2),3)
The result indicates that for each (key1, key2) pair there are 3 matching elements between the sets. Note that the result is always 3 here because the ranges of the initialized input tuples overlap by 3 elements.
The cartesian transformation does not cause a shuffle either, since it just iterates over the elements of each RDD and produces a cartesian product. You can see this by calling toDebugString on an example.
scala> val carts = rdd1.cartesian(rdd2)
carts: org.apache.spark.rdd.RDD[((Int, Int), (Int, Int))] = CartesianRDD[9] at cartesian at <console>:25
scala> carts.toDebugString
res11: String =
(64) CartesianRDD[9] at cartesian at <console>:25 []
| ParallelCollectionRDD[1] at parallelize at <console>:21 []
| ParallelCollectionRDD[2] at parallelize at <console>:21 []
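Applied to the small example from the question, the same cartesian approach gives matching counts; a quick sketch (the variable names are mine), which also emits explicit zero-count pairs that the join-based version never produces:

val rdd1 = sc.parallelize(Seq(
  (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
  (3, 1), (3, 6), (3, 7), (3, 8), (3, 9),
  (4, 3), (4, 4), (4, 5), (4, 6)))
val rdd2 = sc.parallelize(Seq(
  (21, 1), (21, 2), (21, 5), (21, 11), (21, 12), (21, 10),
  (22, 7), (22, 8), (22, 13), (22, 9), (22, 11)))

// Group each RDD by key, cross the groups, and count common values per key pair.
val counts = rdd1.groupByKey().cartesian(rdd2.groupByKey()).map {
  case ((k1, vs1), (k2, vs2)) => (s"$k1-$k2", (vs1.toSet & vs2.toSet).size)
}
// counts.collect() contains (2-21,3), (3-21,1), (3-22,3), (4-21,1),
// plus zero-count pairs such as (2-22,0) and (4-22,0).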

understanding aggregate in Scala

I am trying to understand aggregate in Scala. With one example I understood the logic, but the result of the second one I tried confused me.
Please let me know where I went wrong.
Code:
val list1 = List("This", "is", "an", "example");
val b = list1.aggregate(1)(_ * _.length(), _ * _)
1 * "This".length = 4
1 * "is".length = 2
1 * "an".length = 2
1 * "example".length = 7
4 * 2 = 8 , 2 * 7 = 14
8 * 14 = 112
The output indeed came out as 112.
But for the one below,
val c = list1.aggregate(1)(_ * _.length(), _ + _)
I thought it would be like this:
4, 2, 2, 7
4 + 2 = 6
2 + 7 = 9
6 + 9 = 15,
but the output still came out as 112.
It seems to only be doing the operation I specified as seqop, here _ * _.length.
Could you please explain where I went wrong?
aggregate should only be used to compute associative and commutative operations. Let's look at the signature of the function:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
B can be seen as an accumulator (and will be your output). You give an initial output value, then the first function tells how to add a value A to this accumulator, and the second how to merge two accumulators. Scala "chooses" a way to aggregate your collection, but if your aggregation is not associative and commutative the output is not deterministic because the order matters. Look at this example:
val l = List(1, 2, 3, 4)
l.aggregate(0)(_ + _, _ * _)
If we create one accumulator and then aggregate all the values, we get 1 + 2 + 3 + 4 = 10, but if we decide to parallelize the process by splitting the list into halves, we could get (1 + 2) * (3 + 4) = 21.
So what happens in reality is that for a List, aggregate is the same as foldLeft, which explains why changing your second function didn't change the output. Where aggregate can be useful is, for example, in Spark or other distributed environments where it makes sense to fold each partition independently and then combine the results with the second function.
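To actually see combop being used, you can run the same expression on a parallel collection. A minimal sketch, assuming a Scala version where aggregate is still available on both standard and parallel collections:

val list1 = List("This", "is", "an", "example")

// Sequential: combop is never called, so this is just foldLeft with seqop.
list1.aggregate(1)(_ * _.length, _ + _)   // 112, as in the question

// Parallel: the list is split into chunks, each chunk is folded with seqop,
// and the partial results are merged with combop. Because this seqop/combop
// pair is neither associative nor commutative, the result depends on the
// split, e.g. 15 if every element ends up in its own chunk.
list1.par.aggregate(1)(_ * _.length, _ + _)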

transforming from native matrix format, scalding

So this question is related to the question Transforming matrix format, scalding.
But now I want to do the reverse operation, which I can do like this:
Tsv(in, ('row, 'col, 'v))
  .read
  .groupBy('row) { _.sortBy('col).mkString('v, "\t") }
  .mapTo(('row, 'v) -> 'c) { res: (Long, String) =>
    val (row, v) = res
    v
  }
  .write(Tsv(out))
But there we have a problem with zeros. As we know, Scalding skips zero-valued fields. So for example, given the matrix:
1 0 8
4 5 6
0 8 9
in Scalding's sparse format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we can only get:
1 8
4 5 6
8 9
And that's incorrect. So how can I deal with it? I see two possible options:
Find a way to add the zeros back (I don't actually know how to insert that data).
Write my own operations on my own matrix format (not preferable, because I'm interested in Scalding's matrix operations and don't want to rewrite all of them myself).
Or maybe there are methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1

def zero: V = // the zero of your value type

val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)

rows.crossWithTiny(cols)
  .leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
  .map('val1 -> 'val1) { v: V =>
    if (v == null) zero // this value should be 0 in your type
    else v
  }
  .groupBy('row) {
    _.toList[(Int, V)](('col, 'val1) -> 'cols)
  }
  .map('cols -> 'cols) { cols: List[(Int, V)] =>
    cols.sortBy(_._1).map(_._2).mkString("\t")
  }
  .write(TypedTsv[(Int, String)]("output"))