Spark Scala: Generating a random RDD with 1's and 0's? - scala

How does one create an RDD filled with values from an array, say (0, 1), with 1000 randomly chosen entries set to 1 and the rest set to 0?
I know I can filter and do this, but it won't be random. I want it to be as random as possible.
var populationMatrix = new IndexedRowMatrix(RandomRDDs.uniformVectorRDD(sc, populationSize, chromosomeLength))
I was exploring random RDDs in Spark but couldn't find anything that meets my needs.

Not really sure if this is what you are looking for, but with this code you can create an RDD with a given number of 1s placed at random positions and the rest 0s:
import scala.util.Random
val arraySize = 15 // Total number of elements that you want
val numberOfOnes = 10 // From that total, how many do you want to be ones
val listOfOnes = List.fill(numberOfOnes)(1) // List of 1s
val listOfZeros = List.fill(arraySize - numberOfOnes)(0) // Rest list of 0s
val listOfOnesAndZeros = listOfOnes ::: listOfZeros // Merge lists
val randomList = Random.shuffle(listOfOnesAndZeros) // Random shuffle
val randomRDD = sc.parallelize(randomList) // RDD creation
randomRDD.collect() // Array[Int] = Array(1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1)
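If the population is large, parallelize also accepts a partition count as its second argument (the value here is only illustrative):
val randomRDD = sc.parallelize(randomList, 8) // spread the shuffled list across 8 partitions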
Or, if you want to use only RDDs:
val arraySize = 15
val numberOfOnes = 10
val rddOfOnes = spark.range(numberOfOnes).map(_ => 1).rdd
val rddOfZeros = spark.range(arraySize - numberOfOnes).map(_ => 0).rdd
val rddOfOnesAndZeros = rddOfOnes ++ rddOfZeros
val shuffleResult = rddOfOnesAndZeros.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}).partitionBy(new org.apache.spark.HashPartitioner(rddOfOnesAndZeros.partitions.size)).values
shuffleResult.collect() // Array[Int] = Array(0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)
Let me know if this is what you needed.
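Another way to randomize the order (a sketch, assuming rddOfOnesAndZeros is the RDD built above) is to attach a random key to each element and sort by it; this performs a full sort instead of a hash-based shuffle:
import scala.util.Random
val shuffledByKey = rddOfOnesAndZeros
  .map(x => (Random.nextDouble(), x)) // pair each element with a random sort key
  .sortByKey()
  .values
shuffledByKey.collect() // e.g. Array(1, 0, 1, 1, 0, ...)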

Related

How to transform a dense matrix to an RDD in Scala Spark?

I have a dense matrix:
-0.1503191229976037 -0.17794560268465542 0.3372516173766848
-0.6265768782935162 -0.6986084179343495 -1.6553741696973772
How do I convert it to an RDD in this format:
0, 0, -0.1503191229976037
0, 1, -0.17794560268465542
0, 2, 0.3372516173766848
1, 0, -0.6265768782935162
1, 1, -0.6986084179343495
1, 2, -1.6553741696973772
The first two values are indices.
The type of my input matrix is:
org.apache.spark.mllib.linalg.DenseMatrix
The expected output type is: org.apache.spark.rdd.RDD[scala.Tuple2[scala.Tuple2[Int, Int], Double]]
How do I do it on Spark using Scala?
Assuming that your rows are represented by the standard Scala Vector data structure:
val vectors: List[Vector[Double]] = ???
val vecsWithIndices = for {
  (vIdx, vec) <- Stream.from(0).zip(vectors)
  i <- 0 until 3
} yield ((vIdx, i), vec(i))
val rdd = sc.parallelize(vecsWithIndices)
Stream.from(0) generates the index of the vector, and i runs over the components of the vector.
Solved it like this:
val denseMatrix = for {
  i <- 0 to 2
  j <- 0 to 2
} yield ((i, j), z(i, j))
val rdd = sc.parallelize(denseMatrix)
Please let me know if there is a better way. Thank you.
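A slightly more general sketch (assuming z is the org.apache.spark.mllib.linalg.DenseMatrix from the question) reads the bounds from the matrix itself instead of hard-coding them:
val entries = for {
  i <- 0 until z.numRows
  j <- 0 until z.numCols
} yield ((i, j), z(i, j))
val rdd: org.apache.spark.rdd.RDD[((Int, Int), Double)] = sc.parallelize(entries)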

Operate on neighbor elements in RDD in Spark

Say I have a collection:
List(1, 3, -1, 0, 2, -4, 6)
It's easy to sort it:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want to subtract each element from the one that follows it.
But how to implement this in RDD on Spark?
I know for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
There will be several partitions of the collection, each of them ordered. Can I do a similar operation on each partition and then collect the per-partition results together?
The most efficient solution is to use the sliding method:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)
  .sliding(2)
  .map { case Array(x, y) => y - x }
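Collecting the result should give the expected differences:
rdd.collect() // Array[Int] = Array(3, 1, 1, 1, 1, 3)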
Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's first create a collection with the index as the key, as Ton Torres suggested:
val original = seq.zipWithIndex.map(_.swap)
Now we can build a collection shifted by one element:
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next we can calculate the needed differences, ordered by index descending:
val diffs = original.join(shifted)
  .sortBy(_._1, ascending = false)
  .map { case (idx, (v1, v2)) => v2 - v1 }
So
println(diffs.collect.toSeq)
shows
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for a local collection this could be computed much more simply:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But in the case of an RDD, the zip method requires that both collections contain an equal number of elements in each partition. So if you implemented tail like
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
then tail.zip(seq) would not work, since the two RDDs would need the same element count per partition, and here one element in each partition would have to travel to the previous partition.

Specify subset of elements in Spark RDD (Scala)

My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers (e.g. .map(x => (x(0), x(3), x(6), ...)))?
This is what I've tried so far (with success):
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
However, I need more than a few columns, and would like to avoid hard-coding them.
This is what I've tried so far (that I think would be better, but has failed):
// Attempt 1
val colIndices = List(0, 3, 6, 10, 13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found this question and tried it, but because I'm looking at an RDD instead of an array, it didn't work: How can I select a non-sequential subset elements from an array using Scala and Spark?
You should map over the RDD instead of the indices.
val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))
val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)
val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]
selectedColumns.collect()
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
What about this?
val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices = List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect
This should give
res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))
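Applied to the people RDD from the question, the same pattern would look like this (a sketch; the index list just mirrors the one in the question):
val colIndices = List(0, 3, 6, 10, 13) // the example indices from the question
val selected = people.map(_.split(",")).map(cols => colIndices.map(cols(_))) // RDD[List[String]]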

Transform a Scala Stream to a new Stream which is the sum of the current element and the previous element

How do you transform a Scala Stream of integers so that you get a new Stream where each element is the sum of that element and the previous element?
For example, if the input stream is 1, 2, 3, 4, ... then the output stream is 1, 3, 5, 7, ...
As a second question, how would you make the sum use the previous element of the output stream, so that the output would be 1, (2+(1)), (3+(2+1)), (4+(3+(2+1)))?
Just zip your stream with a shifted version of itself and sum the two elements.
val s1 = Stream.from(0) // 0, 1, 2, 3, ...
val s2 = Stream.from(1) // 1, 2, 3, 4, ...
val sumOfTwo = s1.zip(s2).map{ case (a,b) => a+b } // 1, 3, 5, 7, ...
To compute the running sum, use the scan function, which acts like a fold but emits the intermediate result at each step. Applied to the input stream (s2 here stands in for 1, 2, 3, 4, ...):
val totalSum = s2.scan(0)((acc, el) => acc + el) // 0, 1, 3, 6, 10, ...
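To match the sequence asked for in the question exactly (1, 3, 6, 10, ...), the leading seed element can simply be dropped:
val runningSums = s2.scan(0)(_ + _).tail // 1, 3, 6, 10, 15, ...
runningSums.take(4).toList // List(1, 3, 6, 10)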
This answer computes the cumulative sum by threading an accumulator argument through a recursive function instead of using scan(). Example program:
import scala.collection.immutable.Stream
object Main extends App {
  // 1, 2, 3, ...
  val naturals = Stream.from(1)
  // cumulative sum (see https://stackoverflow.com/a/8567134/1071311)
  def sumUp(s: Stream[Int], acc: Int = 0): Stream[Int] =
    Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))
  val firstFive = sumUp(naturals, 0).take(5)
  firstFive.foreach(println _)
}
Output:
1
3
6
10
15

Sum array of vectors by one of fields - scala

I have an array of vectors in Scala:
import scala.collection.mutable.ArrayBuffer
import org.apache.mahout.math.{ VectorWritable, Vector, DenseVector }
import org.apache.mahout.clustering.dirichlet.UncommonDistributions
val data = new ArrayBuffer[Vector]()
for (i <- 100 to num) {
  data += new DenseVector(Array[Double](
    i % 30,
    UncommonDistributions.rNorm(100, 100),
    UncommonDistributions.rNorm(100, 100)
  ))
}
Let's say I want to sum the second and third fields, grouping by the first field.
What is the best way to do that?
I would suggest using the groupBy method available on Scala collections:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Vector#groupBy[K](f:A=>K):scala.collection.immutable.Map[K,Repr]
This will create a Map of Vectors based on a discriminator you specify.
Edit: some example code:
// I created a different Array of Vector as I don't have Mahout dependencies
// But the output is similar
// A List of Vectors with 3 values inside
val num = 100
val data = (0 to num).toList.map(n => {
  Vector(n % 30, n / 100, n * 100)
})
// The groupBy will create a Map of Vectors where the Key is the result of the function
// And here, the function return the first value of the Vector
val group = data.groupBy(v => { v.apply(0) })
// Also a subset of the result:
// group:
// scala.collection.immutable.Map[Int,List[scala.collection.immutable.Vector[Int]]] = Map(0 -> List(Vector(0, 0, 0), Vector(0, 0, 3000), Vector(0, 0, 6000), Vector(0, 0, 9000)), 5 -> List(Vector(5, 0, 500), Vector(5, 0, 3500), Vector(5, 0, 6500), Vector(5, 0, 9500)))
Use the groupBy function on the list, and then map each group, all in one line of code:
data groupBy (_(0)) map { case (k,v) => k -> (v map (_(2)) sum) }
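If both the second and third fields should be summed, as the question asks, the same one-liner can return a pair of sums per key (a sketch against the local data list defined above):
data groupBy (_(0)) map { case (k, vs) => k -> (vs.map(_(1)).sum, vs.map(_(2)).sum) }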