How to transform a dense matrix to an RDD in Scala Spark?

I have a dense matrix:
-0.1503191229976037 -0.17794560268465542 0.3372516173766848
-0.6265768782935162 -0.6986084179343495 -1.6553741696973772
How do I convert it to an RDD in the following format:
0, 0, -0.1503191229976037
0, 1, -0.17794560268465542
0, 2, 0.3372516173766848
1, 0, -0.6265768782935162
1, 1, -0.6986084179343495
1, 2, -1.6553741696973772
The first two values are indices.
The type of my input matrix is:
org.apache.spark.mllib.linalg.DenseMatrix
The expected output type is: org.apache.spark.rdd.RDD[scala.Tuple2[scala.Tuple2[Int, Int], Double]]
How do I do it on Spark using Scala?

Assuming that your vectors are represented by the Scala Vector data structure:
val vectors: List[Vector[Double]] = ???
val vecsWithIndices = for {
  (vIdx, vec) <- Stream.from(0).zip(vectors)
  i <- 0 until 3
} yield (vIdx, i, vec(i))
val rdd = sc.parallelize(vecsWithIndices)
Stream.from(0) generates the index of each vector, while i runs over the components of the vector.
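Since the question asks for RDD[((Int, Int), Double)], the triples above can be reshaped into key/value pairs; a minimal sketch building on the rdd just defined:
val pairRdd = rdd.map { case (vIdx, i, value) => ((vIdx, i), value) }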

Solved it like this:
val entries = for {
  i <- 0 until z.numRows
  j <- 0 until z.numCols
} yield ((i, j), z(i, j))
val rdd = sc.parallelize(entries)
Please let me know if there is a better way. Thank you.
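One possible alternative, as a sketch (assuming z is the mllib DenseMatrix and sc the SparkContext): DenseMatrix.toArray returns the values in column-major order, so the row and column indices can be recovered from the flat index.
val flatEntries = z.toArray.zipWithIndex.map { case (value, idx) =>
  ((idx % z.numRows, idx / z.numRows), value) // column-major: row = idx % numRows, col = idx / numRows
}
val entryRdd = sc.parallelize(flatEntries)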

Related

Checking multiple array elements and returning true if all match

I wanted to find out if all elements of the arrays below match each other:
val a = Array(1,1,1)
val b = Array(1,0,1)
val c = Array(0,1,1)
Here the output should be
Array(0, 0, 1)
since a(2), b(2) and c(2) are all 1, while in the other positions at least one value is 0. Is there any functional way of solving this in Scala?
If the arrays are all the same size, one approach is to transpose them and then map-and-reduce each column with the bitwise AND operator &:
val a = Array(1, 1, 1)
val b = Array(1, 0, 1)
val c = Array(0, 1, 1)
val result = Array(a, b, c).transpose.map(_.reduce(_ & _)) // Array(0, 0, 1)
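An equivalent formulation that reads as a predicate rather than bit arithmetic, a sketch under the same assumption that the arrays contain only 0s and 1s:
val result2 = Array(a, b, c).transpose.map(col => if (col.forall(_ == 1)) 1 else 0) // Array(0, 0, 1)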

Spark Scala: generating a random RDD with 1's and 0's?

How does one create an RDD filled with values from an array, say (0, 1), with a random 1000 of the values set to 1 and the remaining set to 0?
I know I can filter and do this, but it won't be random. I want it to be as random as possible.
var populationMatrix = new IndexedRowMatrix(RandomRDDs.uniformVectorRDD(sc, populationSize, chromosomeLength))
I was exploring random RDDs in Spark but could not find anything that meets my needs.
Not really sure if this is what you are looking for, but with this code you are able to create an RDD with a random arrangement of 1s and 0s:
import scala.util.Random
val arraySize = 15 // Total number of elements that you want
val numberOfOnes = 10 // From that total, how many do you want to be ones
val listOfOnes = List.fill(numberOfOnes)(1) // List of 1s
val listOfZeros = List.fill(arraySize - numberOfOnes)(0) // Rest list of 0s
val listOfOnesAndZeros = listOfOnes ::: listOfZeros // Merge lists
val randomList = Random.shuffle(listOfOnesAndZeros) // Random shuffle
val randomRDD = sc.parallelize(randomList) // RDD creation
randomRDD.collect() // Array[Int] = Array(1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1)
Or, if you want to use only RDDs:
val arraySize = 15
val numberOfOnes = 10
val rddOfOnes = spark.range(numberOfOnes).map(_ => 1).rdd
val rddOfZeros = spark.range(arraySize - numberOfOnes).map(_ => 0).rdd
val rddOfOnesAndZeros = rddOfOnes ++ rddOfZeros
val shuffleResult = rddOfOnesAndZeros.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}).partitionBy(new org.apache.spark.HashPartitioner(rddOfOnesAndZeros.partitions.size)).values
shuffleResult.collect() // Array[Int] = Array(0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)
Let me know if this is what you needed.
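For reference, if an approximate rather than exact number of 1s is acceptable, the RandomRDDs API mentioned in the question can be thresholded directly; a minimal sketch, assuming sc is the SparkContext and the sizes below are illustrative:
import org.apache.spark.mllib.random.RandomRDDs

val totalSize = 10000L      // total number of elements (assumption)
val fractionOfOnes = 0.1    // roughly 1000 ones out of 10000
// uniformRDD draws from U(0, 1); thresholding yields approximately fractionOfOnes ones
val approxBits = RandomRDDs.uniformRDD(sc, totalSize).map(u => if (u < fractionOfOnes) 1 else 0)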

Convert bipartite graph to adjacency matrix (Spark Scala)

I'm trying to convert an edge list which is in the following format
data = [('a', 'developer'),
('b', 'tester'),
('b', 'developer'),
('c','developer'),
('c', 'architect')]
where the adjacency matrix will be in the form of
developer tester architect
a 1 0 0
b 1 1 0
c 1 0 1
I want to store the matrix in the following format
1 0 0
1 1 0
1 0 1
I've tried it using GraphX
def pageHash(title: String) = title.toLowerCase.replace(" ", "").hashCode.toLong

val edges: RDD[Edge[String]] = sc.textFile("/user/query.csv").map { line =>
  val row = line.split(",")
  Edge(pageHash(row(0)), pageHash(row(1)), "1")
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)
I'm able to create the graph but not able to convert it to an adjacency matrix representation.
One possible way to approach this is as follows:
Convert the RDD to a DataFrame:
import sqlContext.implicits._ // Spark 1.6: provides toDF, $ and case class encoders

val rdd = sc.parallelize(Seq(
  ("a", "developer"), ("b", "tester"), ("b", "developer"),
  ("c", "developer"), ("c", "architect")))
val df = rdd.toDF("row", "col")
Index columns:
import org.apache.spark.ml.feature.StringIndexer
val indexers = Seq("row", "col").map(x =>
  new StringIndexer().setInputCol(x).setOutputCol(s"${x}_idx").fit(df)
)
Transform data and create RDD[MatrixEntry]:
import org.apache.spark.sql.functions.lit
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}

val entries = indexers.foldLeft(df)((df, idx) => idx.transform(df))
  .select(
    $"row_idx".cast("long").alias("i"),
    $"col_idx".cast("long").alias("j"),
    lit(1.0).alias("value"))
  .as[MatrixEntry] // Spark 1.6 Dataset API; for earlier versions map the rows manually
  .rdd
Create the matrix:
new CoordinateMatrix(entries)
This matrix can be further converted to any other type of distributed matrix including RowMatrix and IndexedRowMatrix.
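To get back the dense 0/1 rows shown in the question, the coordinate matrix can be converted to an IndexedRowMatrix and its rows densified; a sketch suitable for small matrices only (note that the column order follows the StringIndexer's frequency-based indices rather than the original label order):
val mat = new CoordinateMatrix(entries)
mat.toIndexedRowMatrix.rows
  .sortBy(_.index)
  .map(_.vector.toArray.map(_.toInt).mkString(" "))
  .collect()
  .foreach(println) // one line of 0/1 values per row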

Transform a Scala Stream to a new Stream which is the sum of the current element and the previous element

How do you transform a Scala Stream of integers so that we get a new Stream where each element is the sum of that element and the previous element?
For example, if the input stream is 1, 2, 3, 4, ... then the output stream is 1, 3, 5, 7, ...
Also, a second question: how would you make the sum use the previous element of the output stream, so that the output would be 1, (2+(1)), (3+(2+1)), (4+(3+(2+1)))?
Just zip your stream with a shifted version of itself and sum the two elements.
val s1 = Stream.from(0) // 0, 1, 2, 3, ...
val s2 = Stream.from(1) // 1, 2, 3, 4, ...
val sumOfTwo = s1.zip(s2).map{ case (a,b) => a+b } // 1, 3, 5, 7, ...
To compute the running total, just use the scan function, which acts like a fold but emits the intermediate result at each step.
val totalSum = s1.scan(0)((ctr, el) => ctr + el) // 0, 0, 1, 3, 6, 10, ...
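Applied to an arbitrary input stream rather than the two counters above, the same zip-with-a-shifted-self idea looks like this (a sketch; the first element is kept as-is since it has no predecessor):
val input = Stream(1, 2, 3, 4)
val pairwiseSums = input.head #:: input.zip(input.tail).map { case (prev, cur) => prev + cur }
// pairwiseSums: 1, 3, 5, 7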
This answer computes the cumulative sum by threading an accumulator through a recursive function instead of using scan(). Example program:
import scala.collection.immutable.Stream
object Main extends App {
  // 1, 2, 3, ...
  val naturals = Stream.from(1)

  // cumulative sum (see https://stackoverflow.com/a/8567134/1071311)
  def sumUp(s: Stream[Int], acc: Int = 0): Stream[Int] =
    Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))

  val firstFive = sumUp(naturals, 0).take(5)
  firstFive.foreach(println)
}
Output:
1
3
6
10
15

Sum array of vectors by one of the fields

I have an array of vectors in Scala:
import scala.collection.mutable.ArrayBuffer
import org.apache.mahout.math.{ VectorWritable, Vector, DenseVector }
import org.apache.mahout.clustering.dirichlet.UncommonDistributions

val data = new ArrayBuffer[Vector]()
for (i <- 100 to num) {
  data += new DenseVector(Array[Double](
    i % 30,
    UncommonDistributions.rNorm(100, 100),
    UncommonDistributions.rNorm(100, 100)
  ))
}
Let's say I want to sum the second and third fields, grouping by the first field.
What is the best way to do that?
I would suggest using the groupBy method available on Scala collections:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Vector#groupBy[K](f:A=>K):scala.collection.immutable.Map[K,Repr]
This will create a Map of Vectors based on a discriminator you specify.
Edit: Some code example:
// I created a different Array of Vector as I don't have Mahout dependencies
// But the output is similar
// A List of Vectors with 3 values inside
val num = 100
val data = (0 to num).toList.map(n => {
  Vector(n % 30, n / 100, n * 100)
})
// The groupBy will create a Map of Vectors where the Key is the result of the function
// And here, the function return the first value of the Vector
val group = data.groupBy(v => { v.apply(0) })
// Also a subset of the result:
// group:
// scala.collection.immutable.Map[Int,List[scala.collection.immutable.Vector[Int]]] = Map(0 -> List(Vector(0, 0, 0), Vector(0, 0, 3000), Vector(0, 0, 6000), Vector(0, 0, 9000)), 5 -> List(Vector(5, 0, 500), Vector(5, 0, 3500), Vector(5, 0, 6500), Vector(5, 0, 9500)))
Use the groupBy function on the list and then map each group, simply in one line of code:
data.groupBy(_(0)).map { case (k, v) => k -> v.map(_(2)).sum }
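The one-liner above sums only the third field; a sketch extending it to sum both the second and third fields per group, as the question asks (using the same data list as above):
val sums = data.groupBy(_(0)).map { case (key, vs) =>
  key -> (vs.map(_(1)).sum, vs.map(_(2)).sum)
}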