Transform RDD into RowMatrix for PCA - scala

The original data I have looks like this:
RDD data:
key -> index
1 -> 2
1 -> 3
1 -> 5
2 -> 1
2 -> 3
2 -> 4
How can I convert the RDD to the following format?
key -> index1, index2, index3, index4, index5
1 -> 0,1,1,0,1
2 -> 1,0,1,1,0
My current method is:
val vectors = filtered_data_by_key.map( x => {
  var temp = Array[AnyVal]()
  x._2.copyToArray(temp)
  (x._1, Vectors.sparse(filtered_key_size, temp.map(_.asInstanceOf[Int]), Array.fill(filtered_key_size)(1)))
})
I got a strange error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 54.0 failed 1 times, most recent failure: Lost task 3.0 in stage 54.0 (TID 75, localhost): java.lang.IllegalArgumentException: requirement failed
When I try to debug this program using the following code:
val vectors = filtered_data_by_key.map( x => {
  val temp = Array[AnyVal]()
  val t = x._2.copyToArray(temp)
  (x._1, temp)
})
I found temp is empty, so the problem is in copyToArray().
I am not sure how to solve this.

I don't understand the question completely. Why are your keys important? And what is the maximum index value? In your code you are using the distinct number of keys as the maximum index value, but I believe that is a mistake.
But I will assume the maximum index value is 5. In that case, I believe this would be what you're looking for:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val vectors = data_by_key.map { case (k, it) =>
  Vectors.sparse(5, it.map(x => x - 1).toArray, Array.fill(it.size)(1.0))
}
val rm = new RowMatrix(vectors)
I decreased the indices by one because they should start at 0.
The 'requirement failed' error is because your indices and values arrays do not have the same size: copyToArray copies at most as many elements as the destination array can hold, and since temp is created with length 0, nothing is copied, leaving the indices array empty while the values array has filtered_key_size elements.
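Since the goal is PCA, here is a minimal follow-up sketch (not part of the original answer) of what you might do with the resulting RowMatrix; the number of components, 2 here, is just an illustrative choice:
val pc = rm.computePrincipalComponents(2)   // local matrix, nCols x k
val projected = rm.multiply(pc)             // rows projected onto the principal components
projected.rows.take(5).foreach(println)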

Related

How to find the total average of filtered values in scala

The following code gives me the sum per filter key. How do I sum and average all the values together, i.e. combine the results across all filtered values?
val f = p.groupBy(d => d.Id)
  .mapValues(totavg =>
    totavg.groupBy(_.Day).filterKeys(Set(2, 3, 4)).mapValues(_.map(_.Amount)))
Sample output:
Map(A9 -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
I would like to add all the values together and compute the total average, i.e. (473.3 + 676.48 + 685.45 + 812.73) / 4.
For the given Map, you can apply flatMap twice to flatten the nested values into a single sequence, then calculate the average:
val m = Map("A9" -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
val s = m.flatMap(_._2.flatMap(_._2))
// s: scala.collection.immutable.Iterable[Double] = List(473.3, 676.48, 685.45, 812.73)
s.sum/s.size
// res14: Double = 661.99
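The same flatten-then-average pattern applies to the grouped structure f from the question (a sketch, assuming f is an in-memory Map[Id, Map[Day, List[Amount]]] as in the sample output):
val allAmounts = f.values.flatMap(_.values.flatten)
val totalAverage = allAmounts.sum / allAmounts.size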

Calculate occurrences per group of events - Spark

Given the following RDD:
BBBBBBAAAAAAABAABBBBBBBB
AAAAABBAAAABBBAAABAAAAAB
I need to calculate the number of groups of consecutive occurrences per event, so for this example the expected output should be:
BBBBBBAAAAAAABAABBBBBBBB A -> 2 B -> 3
AAAAABBAAAABBBAAABBCCCCC A -> 3 B -> 4 C-> 1
Final Output -> A -> 5 B -> 7 C-> 1
I have implemented the splitting and then a sliding window over the characters to try to obtain the values, but I cannot get the expected result.
Thanks,
val baseRDD = sc.parallelize(Seq("BBBBBBAAAAAAABAABBBBBBBB", "AAAAABBAAAABBBAAABBCCCC"))
baseRDD
  .flatMap(x => "(\\w)\\1*".r.findAllMatchIn(x).map(m => (m.matched.charAt(0), 1)).toList)
  .reduceByKey((accum, current) => accum + current)
  .foreach(println(_))
Result
(C,1)
(B,6)
(A,5)
Hope this is what you wanted.
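If you also want the per-line counts shown in the question (A -> 2, B -> 3 for the first line, and so on), here is a small sketch along the same lines, reusing the run-matching regex:
baseRDD.map { line =>
  // Count runs per line before aggregating globally.
  val runCounts = "(\\w)\\1*".r.findAllMatchIn(line)
    .map(_.matched.charAt(0))
    .toList
    .groupBy(identity)
    .map { case (c, runs) => c -> runs.size }
  (line, runCounts)
}.collect().foreach(println)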

Spark SparkPi example

In the SparkPi example that comes with the Spark distribution, is the reduce on the RDD executed in parallel (each slice calculating its own total), or not?
val count: Int = spark.sparkContext.parallelize(1 until n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
Yes, it is.
By default, this example will operate on 2 slices. As a result, your collection will be split into 2 parts. Then Spark will execute the map transformation and reduce action on each partition in parallel. Finally, Spark will merge the individual results into the final value.
You can observe 2 tasks in the console output if the example is executed using the default config.
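A small hedged sketch (not part of the SparkPi source) that makes the per-partition work visible: each slice produces its own partial count, and reduce simply merges those partials into the final value.
import scala.math.random

val slices = 2
val n = 100000 * slices
val points = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}
// One task per partition: each slice computes its own partial hit count in parallel.
points.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.sum)) }
  .collect()
  .foreach { case (idx, hits) => println(s"slice $idx -> $hits hits") }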

Assigning indexes to keys in Spark

This is my first post here and I hope I followed the guidelines properly.
I'm currently working with an RDD of spark.mllib.recommendation.Rating (key1, key2, value) and I'd like to apply Spark MLlib's SVD to it as in this example. To do so, I need to create a (sparse) RowMatrix. I'm able to do it by applying
val inputData = data.map{ case Rating(key1, key2, ecpm) => (key1, key2, ecpm)}
// Number of columns
val nCol = inputData.map(_._2).distinct().count().toInt
// Construct rows of the RowMatrix
val dataRows = inputData.groupBy(_._1).map[(Long, Vector)]{ row =>
val (indices, values) = row._2.map(e => (e._2, e._3)).unzip
(row._1, new SparseVector(nCol, indices.toArray, values.toArray))
}
// Compute 20 largest singular values and corresponding singular vectors
val svd = new RowMatrix(dataRows.map(_._2).persist()).computeSVD(20, computeU = true)
My problem is, when I try to run this code, I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 72 in stage 12.0 failed 4 times, most recent failure: Lost task 72.3 in stage 12.0 (TID 2329, spark7): java.lang.ArrayIndexOutOfBoundsException: 1085194
I guess this ArrayIndexOutOfBoundsException comes from the fact that my key1 and key2 keys are integers that can be very large (that is, too large to be used directly as RowMatrix indices). So what I have tried to do is assign new indices to key1 and key2 which lie in [1, n_key1] and [1, n_key2] respectively. I have seen a few related topics (like this one or this one) using methods like zipWithIndex or zipWithUniqueId, but I don't think they really help in my case. I was thinking of applying something like
inputData.map { case (key1, key2, value) => key1 }.distinct().zipWithIndex()
and the same for key2. This would give me indices for both keys, but then I don't know how to recover an RDD of the same shape as inputData. I'm quite new to Scala/Spark and I can't think of a way to do it. Any thoughts on how I could solve my problem, that is, how to replace the key1 and key2 keys with such indices in my RDD? Note that key1 and key2 are not unique across samples; there may be repetitions.
EDIT: My data looks like this:
scala> data.take(5)
res3: Array[org.apache.spark.mllib.recommendation.Rating] = Array(Rating(39150941,1020026,0.0), Rating(33640847,1029671,0.0), Rating(7447392,988161,0.0), Rating(41696301,1130435,0.0), Rating(42941712,461150,0.0))
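A hedged sketch (not from the original thread) of the re-indexing idea described above: build a dense index for each distinct raw key with zipWithIndex, then join the mappings back onto inputData to recover an RDD of the same shape, now carrying small indices.
// Dense index per distinct key, e.g. one entry per distinct key1 value.
val key1Index = inputData.map(_._1).distinct().zipWithIndex()   // RDD[(Int, Long)]
val key2Index = inputData.map(_._2).distinct().zipWithIndex()   // RDD[(Int, Long)]

// Join the mappings back to replace the raw keys with their dense indices.
val reindexed = inputData
  .map { case (k1, k2, v) => (k1, (k2, v)) }
  .join(key1Index)                                    // (k1, ((k2, v), i1))
  .map { case (_, ((k2, v), i1)) => (k2, (i1, v)) }
  .join(key2Index)                                    // (k2, ((i1, v), i2))
  .map { case (_, ((i1, v), i2)) => (i1.toInt, i2.toInt, v) }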

transforming from native matrix format, scalding

This question is related to Transforming matrix format, scalding.
But now I want to perform the reverse operation. I can do it this way:
Tsv(in, ('row, 'col, 'v))
  .read
  .groupBy('row) { _.sortBy('col).mkString('v, "\t") }
  .mapTo(('row, 'v) -> ('c)) { res: (Long, String) =>
    val (row, v) = res
    v
  }
  .write(Tsv(out))
But there is a problem with zeros: Scalding skips fields with zero values. So, for example, given the matrix:
1 0 8
4 5 6
0 8 9
In Scalding's format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we only get:
1 8
4 5 6
8 9
And that's incorrect. So how can I deal with it? I see two possible options:
Find a way to add the zeros back (though I don't know how to insert that data).
Write my own operations on my own matrix format (not preferable, since I'm interested in Scalding's matrix operations and don't want to reimplement them all myself).
Maybe there are methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because at some point the rows will be too big to fit in memory), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type
val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)
rows.crossWithTiny(cols)
.leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
.map('val1 -> 'val1) { v: V =>
if(v == null) // this value should be 0 in your type:
zero
else
v
}
.groupBy('row) {
_.toList[(Int, V)](('col, 'val1) -> 'cols)
}
.map('cols -> 'cols) { cols: List[(Int, V)] =>
cols.sortBy(_._1).map(_._2).mkString("\t")
}
.write(TypedTsv[(Int, String)]("output"))
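As a conceptual, in-memory illustration of the densify idea above (plain Scala rather than Scalding, using the 3x3 example from the question): enumerate every (row, col) cell and fall back to zero wherever the sparse representation has no entry.
// Sparse entries from the question's example: (row, col) -> value.
val sparse = Map((1, 1) -> 1, (1, 3) -> 8, (2, 1) -> 4, (2, 2) -> 5,
                 (2, 3) -> 6, (3, 2) -> 8, (3, 3) -> 9)
val zero = 0
// Cross every row with every column and look up the sparse value, defaulting to zero.
val dense = (1 to 3).map { r =>
  (1 to 3).map { c => sparse.getOrElse((r, c), zero) }.mkString("\t")
}
dense.foreach(println)
// 1    0    8
// 4    5    6
// 0    8    9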