Element-wise addition in Spark RDD - Scala

I have an RDD of type RDD[Array[Array[Double]]], so essentially each element is a matrix, and I need to do element-wise addition.
So if the first element of the RDD is
1 2
3 4
and second element is
5 6
7 8
at the end I need to have
6 8
10 12
I have looked into zip but I'm not sure whether I can use it for this case.

Yes, you can use zip, but you'll have to use it twice, once for the rows and once for the columns:
val rdd = sc.parallelize(List(
  Array(Array(1.0, 2.0), Array(3.0, 4.0)),
  Array(Array(5.0, 6.0), Array(7.0, 8.0))))
rdd.reduce((a, b) => a.zip(b).map { case (c, d) =>
  c.zip(d).map { case (x, y) => x + y }
})
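A quick check of the result in the shell, reusing the rdd defined above:
val summed = rdd.reduce((a, b) => a.zip(b).map { case (c, d) =>
  c.zip(d).map { case (x, y) => x + y }
})
summed.map(_.mkString(" ")).foreach(println)  // print each summed row
// 6.0 8.0
// 10.0 12.0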

Count the number of keys for each value of a set tagged to that key

I have a pair RDD like this:
id    value
id1   set(1232, 3, 1, 93, 35)
id2   set(321, 42, 5, 13)
id3   set(1233, 3, 5)
id4   set(1232, 56, 3, 35, 5)
Now, I want to get the total count of ids per value contained in the set. So the output for the above table should be something like this:
set value   count
1232        2
1           1
93          1
35          2
3           3
5           3
321         1
42          1
13          1
1233        1
56          1
Is there a way to achieve this?
I would recommend using the DataFrame API since it is easier and more readable. With this API, the problem can be solved using explode and groupBy as follows:
df.withColumn("value", explode($"value"))
.groupBy("value")
.count()
Using an RDD instead, one possible solution is using flatMap and aggregateByKey:
rdd.flatMap(x => x._2.map(s => (s, x._1)))
.aggregateByKey(0)((n, str) => n + 1, (p1, p2) => p1 + p2)
The result is the same in both cases.
yourrdd.toDF().withColumn("_2", explode(col("_2"))).groupBy("_2").count.show
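For reference, here is a minimal spark-shell sketch that wires sample data into both approaches; the ids and sets are taken from the question, and Seq is used instead of Set only to keep the DataFrame encoder simple:
import spark.implicits._
import org.apache.spark.sql.functions.{col, explode}

// Sample pair RDD mirroring the question's data
val rdd = sc.parallelize(Seq(
  ("id1", Seq(1232, 3, 1, 93, 35)),
  ("id2", Seq(321, 42, 5, 13)),
  ("id3", Seq(1233, 3, 5)),
  ("id4", Seq(1232, 56, 3, 35, 5))))

// DataFrame route: explode the collection, then count per value
rdd.toDF("id", "value")
   .withColumn("value", explode(col("value")))
   .groupBy("value")
   .count()
   .show()

// RDD route: the aggregateByKey version from above
rdd.flatMap(x => x._2.map(s => (s, x._1)))
   .aggregateByKey(0)((n, str) => n + 1, (p1, p2) => p1 + p2)
   .collect()
// e.g. Array((1232,2), (3,3), (5,3), (35,2), ...)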

How to use split in Spark Scala?

I have an input file that looks something like this:
1: 3 5 7
3: 6 9
2: 5
......
I want to get two lists:
the first list is made up of the numbers before the ":", and the second list is made up of the numbers after the ":".
For the above example, the two lists are:
1 3 2
3 5 7 6 9 5
I have just written the following code:
val rdd = sc.textFile("input.txt");
val s = rdd.map(_.split(":"));
But I do not know how to implement the rest. Thanks.
I would use flatMap!
So,
val rdd = sc.textFile("input.txt")
val s = rdd.map(_.split(": ")) // I recommend adding a space after the colon
val before_colon = s.map(x => x(0))
val after_colon = s.flatMap(x => x(1).split(" "))
Now you have two RDDs, one with the items from before the colon, and one with the items after the colon!
If it is possible for the part of the text before the colon to have multiple numbers, as in 1 2 3: 4 5 6, I would write val before_colon = s.flatMap(x => x(0).split(" ")) instead.
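A quick way to inspect both RDDs, assuming the input.txt from the question:
before_colon.collect()                 // Array(1, 3, 2) -- still Strings at this point
after_colon.collect()                  // Array(3, 5, 7, 6, 9, 5)
before_colon.map(_.toInt).collect()    // convert to Int if you need numbers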

On Spark's RDD's take and takeOrdered methods

I'm a bit confused about how Spark's rdd.take(n) and rdd.takeOrdered(n) work. Can someone explain these two methods to me with some examples? Thanks.
In order to explain how ordering works we create an RDD with integers from 0 to 99:
val myRdd = sc.parallelize(Seq.range(0, 100))
We can now perform:
myRdd.take(5)
This extracts the first 5 elements of the RDD, and we obtain an Array[Int] containing the first 5 integers of myRdd: '0 1 2 3 4' (with no ordering function, just the first 5 elements in their first 5 positions).
The takeOrdered(5) operation works in a similar way: it also extracts the first 5 elements of the RDD as an Array[Int], but we have the opportunity to specify the ordering criterion:
myRdd.takeOrdered(5)( Ordering[Int].reverse)
This extracts the first 5 elements according to the specified ordering. In our case the result will be: '99 98 97 96 95'
If you have a more complex data structure in your RDD, you may want to supply your own ordering function with the operation:
myRdd.takeOrdered(5)( Ordering[Int].reverse.on { x => ??? })
This extracts the first 5 elements of your RDD as an Array of that element type, according to your custom ordering function.
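For example, here is a small sketch with a custom ordering on a more complex element type; the Person case class is made up purely for illustration:
case class Person(name: String, age: Int)

val people = sc.parallelize(Seq(Person("Ann", 41), Person("Bob", 29), Person("Cat", 35)))

// Take the 2 youngest people by ordering on the age field
people.takeOrdered(2)(Ordering[Int].on((p: Person) => p.age))
// Array(Person(Bob,29), Person(Cat,35))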

How to transpose an RDD in Spark

I have an RDD like this:
1 2 3
4 5 6
7 8 9
It is a matrix. Now I want to transpose the RDD like this:
1 4 7
2 5 8
3 6 9
How can I do this?
Say you have an N×M matrix.
If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)
If N or M is so large that you cannot hold N or M entries in memory, then you cannot have an RDD row of this size. In this case either the original or the transposed matrix is impossible to represent.
N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to blow up the matrix and put it together again:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one number per line.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
  case (row, rowIndex) => row.zipWithIndex.map {
    case (number, columnIndex) => columnIndex -> (rowIndex, number)
  }
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
  indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}
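To check the result on this small example:
transposed.collect().map(_.mkString(" ")).foreach(println)
// 1 4 7
// 2 5 8
// 3 6 9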
A first draft without using collect(), so everything runs on the workers and nothing is done on the driver:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
rdd.flatMap(row => row.zipWithIndex) // pair each element with its column position (zipWithIndex avoids the indexOf pitfall with duplicate values)
   .map(v => (v._2, v._1))           // key by column position
   .groupByKey.sortByKey()           // regroup on column position, so all elements of the first column end up in the first row
   .map(_._2)                        // discard the key, keep only the values
The problem with this solution is that the elements within each row of the transposed matrix (i.e. its columns) will end up shuffled if the operation is performed on a distributed system. I will think about an improved version.
My idea is that, in addition to attaching the 'column number' to each element of the matrix, we also attach the 'row number'. We could then key by column position and regroup by key as in the example, reorder each row by the row number, and finally strip the row/column numbers from the result.
I just don't have a way to know the row number when importing a file into an RDD.
You might think it's heavy to attach a column and a row number to each matrix element, but I guess that's the price to pay for the possibility of processing your input in chunks in a distributed fashion, and thus handling huge matrices.
I will update the answer when I find a solution to the ordering problem.
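For what it's worth, a minimal sketch of attaching a row number at load time with RDD.zipWithIndex (the same trick used in the answer above); the file name here is made up:
val lines = sc.textFile("matrix.txt")          // hypothetical input path
val withRowNumbers = lines.zipWithIndex.map {
  case (line, rowIndex) => (rowIndex, line.split(" ").map(_.toInt))
}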
As of Spark 1.6 you can use the pivot operation on DataFrames. Depending on the actual shape of your data, if you put it into a DataFrame you can pivot columns to rows; the Databricks blog post on pivoting is very useful, as it describes a number of pivoting use cases in detail with code examples.
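A minimal sketch of that idea; the column names and the use of first as the aggregate are assumptions, not taken from the question:
import spark.implicits._
import org.apache.spark.sql.functions.first

// The matrix as (row, col, value) triples
val df = Seq(
  (0, 0, 1), (0, 1, 2), (0, 2, 3),
  (1, 0, 4), (1, 1, 5), (1, 2, 6),
  (2, 0, 7), (2, 1, 8), (2, 2, 9)).toDF("row", "col", "value")

// Transpose by grouping on the column index and pivoting on the row index
df.groupBy("col").pivot("row").agg(first("value")).orderBy("col").show()
// col | 0 | 1 | 2
//   0 | 1 | 4 | 7
//   1 | 2 | 5 | 8
//   2 | 3 | 6 | 9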

Transforming from native matrix format, scalding

This question is related to the question Transforming matrix format, scalding.
But now I want to perform the reverse operation. I can do it this way:
Tsv(in, ('row, 'col, 'v))
  .read
  .groupBy('row) { _.sortBy('col).mkString('v, "\t") }
  .mapTo(('row, 'v) -> ('c)) { res: (Long, String) =>
    val (row, v) = res
    v
  }
  .write(Tsv(out))
But there we have a problem with zeros. As we know, Scalding skips fields with zero values. So, for example, for the matrix:
1 0 8
4 5 6
0 8 9
In Scalding's sparse format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we can only get:
1 8
4 5 6
8 9
And that's incorrect. So, how can I deal with it? I see two possible options:
Find a way to add the zeros back (though I don't actually know how to insert that data).
Write my own operations on my own matrix format (which I'd rather not do, because I'm interested in Scalding's matrix operations and don't want to write all of them myself).
Or maybe there are some methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type

val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)

rows.crossWithTiny(cols)
  .leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
  .map('val1 -> 'val1) { v: V =>
    if (v == null) // this value should be 0 in your type:
      zero
    else
      v
  }
  .groupBy('row) {
    _.toList[(Int, V)](('col, 'val1) -> 'cols)
  }
  .map('cols -> 'cols) { cols: List[(Int, V)] =>
    cols.sortBy(_._1).map(_._2).mkString("\t")
  }
  .write(TypedTsv[(Int, String)]("output"))
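As a mental model of what the job above does, here is the same densification expressed with plain Scala collections; the sizes and tab formatting mirror the question's 3×3 example:
// Sparse entries keyed by (row, col), as in the question
val sparse = Map((1, 1) -> 1, (1, 3) -> 8, (2, 1) -> 4, (2, 2) -> 5,
                 (2, 3) -> 6, (3, 2) -> 8, (3, 3) -> 9)

// Cross every row index with every column index and fall back to zero
val dense = (1 to 3).map { row =>
  (1 to 3).map(col => sparse.getOrElse((row, col), 0)).mkString("\t")
}
dense.foreach(println)
// 1   0   8
// 4   5   6
// 0   8   9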