In Scala, how do we aggregate an Array to determine the count per key and the percentage vs. total?

I am trying to find an efficient way to compute the following. Each element has Int1 = 1 or 0, Int2 = 1..k (where k = 3), and a Double = 1.0.
I want to find how many 1s and 0s there are in every cluster k.
I also need to find the percentage each count represents of the total size of the Array.
Input is:
val clusterAndLabel = sc.parallelize(Array((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))
So in this example:
I have: (0,0) = 2, (0,1) = 0
I have: (1,0) = 1, (1,1) = 1
I have: (2,1) = 2, (2,0) = 1
Total is 7 instances
I was thinking of doing some aggregation, but I am stuck on the thought that the two fields together form a composite 2-part key, as in a join.

If you want to find how many 1s and 0s there are, you can do:
val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)
this will give you an RDD[((Int,Int),Int)] containing exactly what you described, meaning: [((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]. If you really want them grouped by their first key, you can add this line:
val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
this will yield an RDD[(Int, Iterable[(Int,Int)])] which will look like what you described, i.e.: [(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])].
If you need the number of instances, it looks like (at least in your example) clusterAndLabel.count() should do the work.
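For the percentage of each (cluster, label) count versus that total, a minimal sketch building on the rdd above could be (the names here are only illustrative):
val total = clusterAndLabel.count().toDouble
val withPercent = rdd.mapValues(c => (c, 100.0 * c / total)) // (count, percentage of total)
// e.g. ((0,0),(2,28.57...)), ((1,0),(1,14.28...)), ...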
I don't really understand question 3; I can see two possible readings:
you want to know how many keys have 3 occurrences. To do so, you can start from the object I called rdd (no need for the groupByKey line) and do this:
val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)
this will yield an RDD[(Int,Int)] which is kind of a frequency RDD: the key is the number of occurrences and the value is how many times that count appears. Here it would look like: [(1,3),(2,2)]. So if you want to know how many pairs occur 3 times, you just do rdd3.filter(_._1==3).collect() (which will be an array of size 0 here, but if it's not empty then it'll have one value and that will be your answer).
you want to know how many times the first key 3 occurs (once again 0 times in your example). Then you start from rdd2 and do:
val rdd3 = rdd2.map(x=>(x._1,x._2.size)).filter(_._1==3).collect()
once again it will yield either an empty array or an array of size 1 containing how many elements have 3 as their first key. Note that if you don't need to display rdd2, you can do it directly from rdd:
val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()
(for performance you might also want to apply the filter before the reduceByKey!)
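For instance, it could look something like this (the same computation as rdd4, just with the filter pushed in front; the name rdd5 is only illustrative):
val rdd5 = rdd.filter(_._1._1 == 3).map(x => (x._1._1, 1)).reduceByKey(_+_).collect()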

Related

Scala Shuffle A List Randomly And repeat it

I want to shuffle a Scala list randomly.
I know I can do this by using scala.util.Random.shuffle.
But calling this will always give me a new ordering of the list. What I really want is that in some cases the shuffle gives me the same output every time. How can I achieve that?
Basically, I want to shuffle a list randomly at first and then repeat it in the same order: generate the shuffled list randomly the first time, and then, based on some parameter, repeat the same shuffling.
Use setSeed() to seed the generator before shuffling. Then if you want to repeat a shuffle reuse the seed.
For example:
scala> util.Random.setSeed(41L) // some seed chosen for no particular reason
scala> util.Random.shuffle(Seq(1,2,3,4))
res0: Seq[Int] = List(2, 4, 1, 3)
That shuffled: 1st -> 3rd, 2nd -> 1st, 3rd -> 4th, 4th -> 2nd
Now we can repeat that same shuffle pattern.
scala> util.Random.setSeed(41L) // same seed
scala> util.Random.shuffle(Seq(2,4,1,3)) // result of previous shuffle
res1: Seq[Int] = List(4, 3, 2, 1)
Let a be the seed parameter.
Let b be how many times you want to shuffle.
There are two ways to roughly do this:
You can use scala.util.Random.setSeed(a), where a can be any integer. After you finish your b shuffles you can set the seed a again, and your shuffling will repeat in the same order for that parameter a.
The other way is to shuffle a list of indices (0 until arr.size) b times, save the results as a nested list or vector, and then map them onto your iterable:
val arr = List("Bob", "Knight", "John")
val randomer = (1 to b).map(_ => scala.util.Random.shuffle((0 until arr.size).toList))
randomer.map(perm => perm.map(i => arr(i)))
You can reuse the same randomer for any other list you want to shuffle the same way, by mapping it:
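For example, a small sketch (the second list here is made up and just needs the same length as arr):
val other = List("Alice", "Carol", "Dave") // hypothetical second list, same size as arr
randomer.map(perm => perm.map(i => other(i))) // replays exactly the same shuffles on other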

Sliding window based on values (not index)

I have a sparse list of timestamps, let's dumb it down to:
val stamps = List(1,2,3,7,10,11)
Imagine I have a window size of three; what would be a Scala/functional way to get the following output?
valueWindowed (3, stamps) == List(
// Starting at timestamp 1 of stamps, we include the next values which are no bigger than the current value + the window size
List(1, 2, 3),
// Starting at timestamp 2 include the next values in this window
List(2, 3),
List(3), // ...
// This one is empty as the window starts at timestamp 4 and ends at 6 (inclusive)
List(),
// This one _does_ include 7, as the windows starts at 5 and ends at 7 (inclusive)
List(7),
List(7),
List(7),
List(10),
List(10,11),
List(10,11),
List(11)
)
Update
I have the following implementation, but it looks very procedural, jammed into functional constructs. Also, the complexity is O(max(stamps) * stamps.size):
def valueWindowed(step: Int, times: List[Int]) = {
for(j <- (1 to times.max).toIterator) yield{
times.dropWhile(_ < j) takeWhile(_ < j+step)
}
}
Here's a functional one that is O(N) - where N is the range of the numbers in times, not the length of it. But it's not possible to do better than that since that's the size of your output.
def valueWindowed(step:Int, times:List[Int]) = {
(times :+ times.last+step)
.sliding(2)
.flatMap{case List(a,b) => Some(a)::List.fill(b-a-1)(None)}
.sliding(step)
.map(_.flatten)
}
The first sliding and flatMap expand the list so it has Some(x) for all x in times, and None for all intermediate values (adding a sentinel value to get the last element included correctly). Then we take step-sized windows and use flatten to remove the Nones and unwrap the Somes back to plain values.
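A quick usage sketch against the example input (assuming Scala 2.12, where Iterator#sliding yields Lists so the pattern match above succeeds):
valueWindowed(3, stamps).map(_.toList).toList
// List(List(1, 2, 3), List(2, 3), List(3), List(), List(7), List(7), List(7),
//      List(10), List(10, 11), List(10, 11), List(11))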

How to calculate median over RDD[org.apache.spark.mllib.linalg.Vector] in Spark efficiently?

What I want to do is something like this:
http://cn.mathworks.com/help/matlab/ref/median.html?requestedDomain=www.mathworks.com
Find the median value of each column.
It can be done by collecting the RDD to the driver, but for big data that becomes impossible.
I know Statistics.colStats() can calculate mean, variance... but median is not included.
Additionally, the vector is high-dimensional and sparse.
Well, I didn't understand the vector part; however, this is my approach (I bet there are better ones):
val a = sc.parallelize(Seq(1, 2, -1, 12, 3, 0, 3))
val n = a.count() / 2
println(n) // outputs 3
val b = a.sortBy(x => x).zipWithIndex()
val median = b.filter(x => x._2 == n).collect()(0)._1 // this part doesn't look nice, I hope someone tells me how to improve it, maybe zero?
println(median) // outputs 2
b.collect().foreach(println) // (-1,0) (0,1) (1,2) (2,3) (3,4) (3,5) (12,6)
The trick is to sort your dataset using sortBy, then zip the entries with their index using zipWithIndex, and then get the middle entry. Note that I set an odd number of samples for simplicity, but the essence is there. Besides, you would have to do this for every column of your dataset.
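A rough per-column extension of the same idea, if the input is an RDD of vectors (only a sketch: it densifies each vector and pulls all values of a column onto one executor, so it will not be efficient for very wide or sparse data, and the helper name is made up):
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def columnMedians(vectors: RDD[Vector]): Map[Int, Double] = {
  // explode each vector into (columnIndex, value) pairs
  val byColumn = vectors.flatMap(v => v.toArray.zipWithIndex.map { case (value, i) => (i, value) })
  byColumn
    .groupByKey()
    .mapValues { values =>
      val sorted = values.toArray.sorted
      sorted(sorted.length / 2) // lower median, same trick as above
    }
    .collect()
    .toMap
}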

How to declare a sparse Vector in Spark with Scala?

I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use Seq. I have a small test file with three numbers per line, which I convert to an RDD, split the text into doubles, and then group the lines by their first column.
Test file
1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10
Code
val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy( _(0) )
This results in grouped having these values
(2.0,CompactBuffer([2.0,7.0,5.0], [2.0,8.0,4.0], [2.0,9.0,10.0]))
(1.0,CompactBuffer([1.0,2.0,4.0], [1.0,3.0,5.0], [1.0,4.0,8.0]))
But I can't seem to figure out the next step. I need to take each line of grouped and create a vector for it, so that each line of the new RDD has a vector with the third value of the CompactBuffer at the index specified by the second value. In short, I want my example data to look like this:
[0, 0, 0, 0, 0, 0, 5.0, 4.0, 10.0, 0]
[0, 4.0, 5.0, 8.0, 0, 0, 0, 0, 0, 0]
I know I need to use a sparse vector, and that there are three ways to construct it. I've tried using a Seq of (index, value) tuples, but I cannot understand how to create such a Seq.
One possible solution is something like below. First let's convert the data to the expected types:
import org.apache.spark.rdd.RDD
val pairs: RDD[(Double, (Int, Double))] = data.map(_.split(" ") match {
case Array(label, idx, value) => (label.toDouble, (idx.toInt, value.toDouble))
})
next find the maximum index (the size of the vectors):
val nCols = pairs.map{case (_, (i, _)) => i}.max + 1
group and convert:
import org.apache.spark.mllib.linalg.SparseVector
def makeVector(xs: Iterable[(Int, Double)]) = {
val (indices, values) = xs.toArray.sortBy(_._1).unzip
new SparseVector(nCols, indices.toArray, values.toArray)
}
val transformed: RDD[(Double, SparseVector)] = pairs
.groupByKey
.mapValues(makeVector)
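For the test file above, collecting the result should print something roughly like this (the SparseVector toString shows size, indices and values):
transformed.collect().foreach(println)
// (1.0,(10,[2,3,4],[4.0,5.0,8.0]))
// (2.0,(10,[7,8,9],[5.0,4.0,10.0]))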
Another way you can handle this, assuming that the first elements can be safely converted to and from integer, is to use CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = data.map(_.split(" ") match {
case Array(label, idx, value) =>
MatrixEntry(label.toInt, idx.toInt, value.toDouble)
})
val transformed: RDD[(Double, org.apache.spark.mllib.linalg.Vector)] = new CoordinateMatrix(entries)
.toIndexedRowMatrix
.rows
.map(row => (row.index.toDouble, row.vector))

How to transpose an RDD in Spark

I have an RDD like this:
1 2 3
4 5 6
7 8 9
It is a matrix. Now I want to transpose the RDD like this:
1 4 7
2 5 8
3 6 9
How can I do this?
Say you have an N×M matrix.
If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)
If N or M is so large that you cannot hold N or M entries in memory, then you cannot have an RDD line of this size. Either the original or the transposed matrix is impossible to represent in this case.
N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to blow up the matrix and put it together again:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one number per line.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
case (row, rowIndex) => row.zipWithIndex.map {
case (number, columnIndex) => columnIndex -> (rowIndex, number)
}
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}
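For the small example you can sanity-check the result on the driver, e.g.:
transposed.collect().foreach(row => println(row.mkString(" ")))
// 1 4 7
// 2 5 8
// 3 6 9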
A first draft without using collect(), so everything runs on the worker side and nothing is done on the driver:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
rdd.flatMap(row => (row.map(col => (col, row.indexOf(col))))) // flatMap by keeping the column position
.map(v => (v._2, v._1)) // key by column position
.groupByKey.sortByKey // regroup on column position, thus all elements from the first column will be in the first row
.map(_._2) // discard the key, keep only value
The problem with this solution is that the columns in the transposed matrix will end up shuffled if the operation is performed in a distributed system. I will think of an improved version.
My idea is that, in addition to attaching the 'column number' to each element of the matrix, we also attach the 'row number'. Then we could key by column position and regroup by key as in the example, but reorder each row by the row number and then strip the row/column numbers from the result.
I just don't have a way to know the row number when importing a file into an RDD.
You might think it's heavy to attach a column and a row number to each matrix element, but I guess that's the price to pay to have the possibility of processing your input as chunks in a distributed fashion and thus handle huge matrices.
I will update the answer when I find a solution to the ordering problem.
As of Spark 1.6 you can use the pivot operation on DataFrames. Depending on the actual shape of your data, if you put it into a DataFrame you could pivot columns to rows. The Databricks blog post on pivoting is very useful, as it describes in detail a number of pivoting use cases with code examples.
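A hedged sketch of that pivot idea, in Spark 2.x syntax (spark is a SparkSession and the column names are only illustrative):
import org.apache.spark.sql.functions.first

val matrix = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
// explode into (rowIndex, colIndex, value) cells
val cells = spark.sparkContext.parallelize(matrix)
  .zipWithIndex
  .flatMap { case (row, r) => row.zipWithIndex.map { case (v, c) => (r, c, v) } }

val df = spark.createDataFrame(cells).toDF("row", "col", "value")
// one output row per original column; pivot the original row index into columns
val transposedDf = df.groupBy("col").pivot("row").agg(first("value")).orderBy("col")
transposedDf.show()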