I have an array of vectors in Scala:
import org.apache.mahout.math.{ VectorWritable, Vector, DenseVector }
import org.apache.mahout.clustering.dirichlet.UncommonDistributions
val data = new ArrayBuffer[Vector]()
for (i <- 100 to num) {
  data += new DenseVector(Array[Double](
    i % 30,
    UncommonDistributions.rNorm(100, 100),
    UncommonDistributions.rNorm(100, 100)
  ))
}
Let's say I want to sum the second and third fields, grouping by the first field.
What is the best way to do that?
I would suggest using the groupBy method available on Scala collections:
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Vector#groupBy[K](f:A=>K):scala.collection.immutable.Map[K,Repr]
This will create a Map of Vectors based on a discriminator you specify.
Edit: Some code example:
// I created a different Array of Vector as I don't have Mahout dependencies
// But the output is similar
// A List of Vectors with 3 values inside
val num = 100
val data = (0 to num).toList.map(n => {
  Vector(n % 30, n / 100, n * 100)
})
// The groupBy will create a Map of Vectors where the Key is the result of the function
// And here, the function returns the first value of the Vector
val group = data.groupBy(v => { v.apply(0) })
// Also a subset of the result:
// group:
// scala.collection.immutable.Map[Int,List[scala.collection.immutable.Vector[Int]]] = Map(0 -> List(Vector(0, 0, 0), Vector(0, 0, 3000), Vector(0, 0, 6000), Vector(0, 0, 9000)), 5 -> List(Vector(5, 0, 500), Vector(5, 0, 3500), Vector(5, 0, 6500), Vector(5, 0, 9500)))
Use the groupBy function on the list, and then map each group, all in one line of code:
data groupBy (_(0)) map { case (k,v) => k -> (v map (_(2)) sum) }
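If the second field needs to be summed as well (the question asks for both the second and the third field), here is a minimal sketch in the same style, assuming the local scala.collection.immutable.Vector stand-in from the example above rather than Mahout vectors:
// Map from the first field to (sum of second field, sum of third field)
data groupBy (_(0)) map { case (k, v) => k -> (v.map(_(1)).sum, v.map(_(2)).sum) }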
Related
Say I have a list of tuples:
val ranges= List((1,4), (5,8), (9,10))
and a list of numbers
val nums = List(2,2,3,7,8,9)
I want to make a map from each tuple in ranges to how many numbers from nums fall into the interval of that tuple.
Output:
Map ((1,4) -> 3, (5,8) -> 2, (9,10) -> 1)
What is the best way to go about it in Scala?
I have been trying to use for loops and keeping a counter, but I am falling short.
Something like this:
val ranges = List((1, 4), (5, 8), (9, 10))
val nums = List(2, 2, 3, 7, 8, 9)
val occurrences = ranges.map { case (l, r) => nums.count((l to r) contains _) }
val map = (ranges zip occurrences).toMap
println(map) // Map((1,4) -> 3, (5,8) -> 2, (9,10) -> 1)
Basically it first calculates the number of occurrences, [3, 2, 1]. From there it's easy to construct a map. The occurrences are calculated like this:
go through the list of ranges
transform each range into the number of occurrences for that range, which is done by counting how many numbers from the list nums are contained in that range
Here is an efficient single-pass solution:
ranges
  .map(r => r -> nums.count(n => n >= r._1 && n <= r._2))
  .toMap
This avoids the overhead of creating a list of numbers and then zipping them with the ranges in a separate step.
This is a version that uses more Scala features but is a bit too fancy:
(for {
  r <- ranges
  range = r._1 to r._2
} yield r -> nums.count(range.contains)
).toMap
This is also less efficient because contains has to allow for ranges with a step value and is therefore more complicated.
And here is an even more efficient version that avoids any temporary data structures:
val result: Map[(Int, Int), Int] =
ranges.map(r => r -> nums.count(n => n >= r._1 && n <= r._2))(collection.breakOut)
See this explanation of breakOut if you are not familiar with it. Using breakOut means that the map call will build the Map directly rather than creating a List that has to be converted to a Map using toMap.
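As a side note, collection.breakOut is no longer available on Scala 2.13 and later; a hedged sketch of an equivalent that also avoids building an intermediate List is to map over a view:
// Lazily map over the ranges and let toMap build the Map directly, with no intermediate List
val result: Map[(Int, Int), Int] =
  ranges.view.map(r => r -> nums.count(n => n >= r._1 && n <= r._2)).toMap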
How does one create an RDD filled with values from an array, say (0, 1), with 1000 randomly placed 1s and the rest 0s?
I know I can filter and do this, but it won't be random. I want it to be as random as possible.
var populationMatrix = new IndexedRowMatrix(RandomRDDs.uniformVectorRDD(sc, populationSize, chromosomeLength))
I was exploring random RDDs in Spark but couldn't find anything that meets my needs.
Not really sure if this is what you are looking for, but with this code you are able to create an RDD with a randomly shuffled mix of 0s and 1s:
import scala.util.Random
val arraySize = 15 // Total number of elements that you want
val numberOfOnes = 10 // From that total, how many do you want to be ones
val listOfOnes = List.fill(numberOfOnes)(1) // List of 1s
val listOfZeros = List.fill(arraySize - numberOfOnes)(0) // Rest list of 0s
val listOfOnesAndZeros = listOfOnes ::: listOfZeros // Merge lists
val randomList = Random.shuffle(listOfOnesAndZeros) // Random shuffle
val randomRDD = sc.parallelize(randomList) // RDD creation
randomRDD.collect() // Array[Int] = Array(1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1)
Or, if you want to use only RDDs:
val arraySize = 15
val numberOfOnes = 10
val rddOfOnes = spark.range(numberOfOnes).map(_ => 1).rdd
val rddOfZeros = spark.range(arraySize - numberOfOnes).map(_ => 0).rdd
val rddOfOnesAndZeros = rddOfOnes ++ rddOfZeros
// Tag each element with a random key, repartition on those keys to shuffle,
// then drop the keys again
val shuffleResult = rddOfOnesAndZeros.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}).partitionBy(new org.apache.spark.HashPartitioner(rddOfOnesAndZeros.partitions.size)).values
shuffleResult.collect() // Array[Int] = Array(0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)
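A simpler alternative sketch, assuming a full shuffle via sortBy on random keys is acceptable (note that the resulting order is not deterministic if the stage is recomputed):
import scala.util.Random
// Shuffle by sorting the elements on freshly drawn random keys
val shuffled = rddOfOnesAndZeros.sortBy(_ => Random.nextDouble())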
Let me know if this is what you needed.
I have a dense matrix:
-0.1503191229976037 -0.17794560268465542 0.3372516173766848
-0.6265768782935162 -0.6986084179343495 -1.6553741696973772
How do I convert it to RDD of format:
0, 0, -0.1503191229976037
0, 1, -0.17794560268465542
0, 2, 0.3372516173766848
1, 0, -0.6265768782935162
1, 1, -0.6986084179343495
1, 2, -1.6553741696973772
The first two values are indices.
The type of my input matrix is:
org.apache.spark.mllib.linalg.DenseMatrix
The expected output type is: org.apache.spark.rdd.RDD[scala.Tuple2[scala.Tuple2[Int, Int], Double]]
How do I do it on Spark using Scala?
Assuming that your vectors are represented by the actual data structure Vector:
val vectors: List[Vector[Double]] = ???
val vecsWithIndices = for {
  (vIdx, vec) <- Stream.from(0).zip(vectors)
  i <- 0 until 3
} yield (vIdx, i, vec(i))
val rdd = sc.parallelize(vecsWithIndices)
The Stream.from(0) generates the index of the vector, the i runs over components of the vector.
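Since the question asks for an RDD[((Int, Int), Double)], a small follow-up sketch that reshapes the (index, component, value) triples into that exact type:
// Turn (rowIdx, colIdx, value) triples into ((rowIdx, colIdx), value) pairs
val asRequested: org.apache.spark.rdd.RDD[((Int, Int), Double)] =
  rdd.map { case (i, j, v) => ((i, j), v) }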
Solved it like this:
// z is the input org.apache.spark.mllib.linalg.DenseMatrix
val denseMatrix = for (
  i <- 0 until z.numRows;
  j <- 0 until z.numCols
) yield ((i, j), z.apply(i, j))
val rdd = sc.parallelize(denseMatrix)
Please let me know if there is a better way. Thank you.
I have the following simple code in Java. This code creates and fills the Map with 0 values.
Map<Integer,Integer> myMap = new HashMap<Integer,Integer>();
for (int i=0; i<=20; i++) { myMap.put(i, 0); }
I want to create a similar RDD using Spark and Scala. I tried this approach, but it returns an RDD[(Any) => (Any, Int)] instead of an RDD[Map[Int, Int]]. What am I doing wrong?
val data = (0 to 20).map(_ => (_,0))
val myMapRDD = sparkContext.parallelize(data)
What you are creating is a sequence of tuples. Instead, you need to create Maps and parallelize them as below:
val data = (0 to 20).map(x => Map(x -> 0)) //data: scala.collection.immutable.IndexedSeq[scala.collection.immutable.Map[Int,Int]] = Vector(Map(0 -> 0), Map(1 -> 0), Map(2 -> 0), Map(3 -> 0), Map(4 -> 0), Map(5 -> 0), Map(6 -> 0), Map(7 -> 0), Map(8 -> 0), Map(9 -> 0), Map(10 -> 0), Map(11 -> 0), Map(12 -> 0), Map(13 -> 0), Map(14 -> 0), Map(15 -> 0), Map(16 -> 0), Map(17 -> 0), Map(18 -> 0), Map(19 -> 0), Map(20 -> 0))
val myMapRDD = sparkContext.parallelize(data) //myMapRDD: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = ParallelCollectionRDD[0] at parallelize at test.sc:19
In Scala, (0 to 20).map(_ => (_, 0)) would not compile, as it has invalid placeholder syntax. I believe you might be looking for something like below instead:
val data = (0 to 20).map( _->0 )
which would generate a list of key-value pairs, and is really just a placeholder shorthand for:
val data = (0 to 20).map( n => n->0 )
// data: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector(
// (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0), (9,0), (10,0),
// (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0), (19,0), (20,0)
// )
An RDD is an immutable collection (e.g. Seq, Array) of data. To create an RDD of Map[Int,Int], you would expand data inside a Map, which in turn gets placed inside a Seq collection:
val rdd = sc.parallelize(Seq(Map(data: _*)))
rdd.collect
// res1: Array[scala.collection.immutable.Map[Int,Int]] = Array(
// Map(0 -> 0, 5 -> 0, 10 -> 0, 14 -> 0, 20 -> 0, 1 -> 0, 6 -> 0, 9 -> 0, 13 -> 0, 2 -> 0, 17 -> 0,
// 12 -> 0, 7 -> 0, 3 -> 0, 18 -> 0, 16 -> 0, 11 -> 0, 8 -> 0, 19 -> 0, 4 -> 0, 15 -> 0)
// )
Note that, as is, this RDD consists of only a single Map; you can certainly assemble as many Maps as you wish in an RDD:
val rdd2 = sc.parallelize(Seq(
  Map((0 to 4).map( _->0 ): _*),
  Map((5 to 9).map( _->0 ): _*),
  Map((10 to 14).map( _->0 ): _*),
  Map((15 to 19).map( _->0 ): _*)
))
You can't parallelize a Map, as parallelize takes a Seq. What you can achieve is creating an RDD[(Int, Int)], which however does not enforce the uniqueness of keys. To perform operations by key, you can leverage PairRDDFunctions, which, despite this limitation, can end up being useful for your use case.
Let's try at least to get an RDD[(Int, Int)].
You used a slightly "wrong" syntax when applying the map to your range.
The _ placeholder can have different meanings depending on the context. The two meanings that got mixed up in your snippet of code are:
a placeholder for an anonymous function parameter that is not going to be used (as in (_ => 42), a function which ignores its input and always returns 42)
a positional placeholder for arguments in anonymous functions (as in (_, 42), a function that takes one argument and returns a tuple where the first element is the input and the second is the number 42)
The above examples are simplified and do not account for type inference as they only wish to point out two of the meanings of the _ placeholder that got mixed up in your snippet of code.
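To make the distinction concrete, here is a minimal sketch with explicit types (the value names are purely illustrative):
// First meaning: the parameter is ignored entirely
val alwaysFortyTwo: Int => Int = _ => 42
// Second meaning: the parameter is used positionally inside a tuple
val pairWithInput: Int => (Int, Int) = (_, 42)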
The first step is to use one of the two following functions to create the pairs that are going to be part of the map, either
a => (a, 0)
or
(_, 0)
and after parallelizing it you can get the RDD[(Int, Int)], as follows:
val pairRdd = sc.parallelize((0 to 20).map((_, 0)))
I believe it's worth noting here that mapping over the local collection is executed eagerly and bound to your driver's resources, while you can obtain the same final result by parallelizing the collection first and then mapping the pair-creating function on the RDD.
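For illustration, a sketch of that alternative, where the pair-creating function runs on the RDD (and thus on the executors) instead of on the driver:
val pairRdd2 = sc.parallelize(0 to 20).map((_, 0))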
Now, as mentioned, you don't have a distributed map, but rather a collection of key-value pairs where key uniqueness is not enforced. But you can work pretty seamlessly with those values using PairRDDFunctions, which you obtain automatically by importing org.apache.spark.rdd.RDD.rddToPairRDDFunctions (or without having to do anything in the spark-shell, as the import has already been done for you), and which will decorate your RDD leveraging Scala's implicit conversions.
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
pairRdd.mapValues(_ + 1).foreach(println)
will print the following:
(10,1)
(11,1)
(12,1)
(13,1)
(14,1)
(15,1)
(16,1)
(17,1)
(18,1)
(19,1)
(20,1)
(0,1)
(1,1)
(2,1)
(3,1)
(4,1)
(5,1)
(6,1)
(7,1)
(8,1)
(9,1)
You can learn more about working with key-value pairs with the RDD API on the official documentation.
Say I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to sort it:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want to make the next element minus the current element.
But how to implement this in RDD on Spark?
I know for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
There will be some partitions of the collection, and each partition is ordered. Can I do a similar operation on each partition and then combine the results from all the partitions?
The most efficient solution is to use the sliding method:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)
  .sliding(2)
  .map { case Array(x, y) => y - x }
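Collecting this RDD should reproduce the pairwise differences of the sorted values, matching the local Vector shown in the question:
rdd.collect()
// Array(3, 1, 1, 1, 1, 3)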
Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's first create a collection with the index as key, as Ton Torres suggested:
val original = seq.zipWithIndex.map(_.swap)
Now we can build a collection shifted by one element:
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next we can calculate the needed differences, ordered by descending index:
val diffs = original.join(shifted)
  .sortBy(_._1, ascending = false)
  .map { case (idx, (v1, v2)) => v2 - v1 }
So
println(diffs.collect.toSeq)
shows
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for a local collection this could be computed much more simply:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But in the case of an RDD, the zip method requires that both collections contain an equal number of elements in each partition. So if you implemented tail like
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
then tail.zip(seq) would not work, since both collections need an equal number of elements in each partition, and here each partition has one element that would need to travel to the previous partition.