How to do a vector (vertical) sum in Scala with Spark 1.6 - scala

I have an RDD[(Long, Vector)]. I want to sum over all the vectors. How can I achieve this in Spark 1.6?
For example, input data is like
(1,[0.1,0.2,0.7])
(2,[0.2,0.4,0.4])
It then produces results like
[0.3,0.6,1.1]
regardless of the first value (the Long key).

If you have an RDD[(Long, Vector)] like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)), (2L, Vectors.dense(0.2, 0.4, 0.4))))
You can reduce the values (the vectors) in order to get the sum:
val res = myRdd
  .values
  .reduce { case (a: Vector, b: Vector) =>
    Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _))
  }
I get the following result (with a small floating-point rounding error):
[0.30000000000000004,0.6000000000000001,1.1]
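As an alternative sketch (assuming breeze, which spark-mllib already depends on, is on the classpath), you can let breeze do the element-wise addition; res2 is a hypothetical name:
import breeze.linalg.{DenseVector => BDV}
// Convert each MLlib vector to a breeze vector, add them element-wise,
// and wrap the result back into an MLlib Vector.
val res2 = Vectors.dense(
  myRdd.values
    .map(v => BDV(v.toArray))
    .reduce(_ + _)
    .toArray)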

You can also refer to the Spark examples; this snippet builds an indexed RDD[(Long, Vector)] of the same shape:
val model = pipeline.fit(df)
val documents = model.transform(df)
.select("features")
.rdd
.map { case Row(features: MLVector) => Vectors.fromML(features) }
.zipWithIndex()
.map(_.swap)
(documents,
 model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary, // vocabulary
 documents.map(_._2.numActives).sum().toLong)                   // total token count
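Assuming documents : RDD[(Long, Vector)] as built above, the vertical sum asked for in the question can then be taken exactly as in the previous answer (a sketch; vectorSum is a hypothetical name):
val vectorSum = documents
  .values
  .reduce((a, b) => Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _)))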

Related

Spark - how to get top N of rdd as a new rdd (without collecting at the driver)

I am wondering how to filter an RDD down to the rows that have one of the top N values. Usually I would sort the RDD and take the top N items as an array in the driver to find the Nth value, which can then be broadcast to filter the rdd, like so:
val topNvalues = sc.broadcast(rdd.map(_.fieldToThreshold).distinct().sortBy(identity, ascending = false).take(N))
val threshold = topNvalues.value.last
val rddWithTopNValues = rdd.filter(_.fieldToThreshold >= threshold)
but in this case my N is too large, so how can I do this purely with RDDs like so?:
def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
val sortedPrices = itemPrices.sortBy(-_._2).map(_._1).distinct
// How to do this without collecting results to driver??
val highPrices = itemPrices.getTopNValuesWithoutCollect(count)
itemPrices.join(highPrices.keyBy(x => x)).map(_._2._1)
}
Use zipWithIndex on the sorted rdd and then filter by the index up to n items. To illustrate, consider this rdd sorted in descending order:
val rdd = sc.parallelize((1 to 10).map( _ => math.random)).sortBy(-_)
Then
rdd.zipWithIndex.filter(_._2 < 4)
delivers the first top four items without collecting the rdd to the driver.
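Applied to the function from the question, a sketch using only sortBy and zipWithIndex (the signature is the asker's own):
import org.apache.spark.rdd.RDD

def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
  itemPrices
    .sortBy(-_._2)        // sort by price, descending
    .zipWithIndex()       // attach a 0-based rank to each row
    .filter(_._2 < count) // keep only the top `count` rows
    .map(_._1)            // drop the rank again
}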

How can I find adjacent elements whose difference is greater than a threshold in a Spark RDD

I have a Spark Scala problem where I want to find the elements for which the difference between each pair of adjacent elements is greater than a threshold. I have an RDD like this:
[2,3,5,8,19,3,5,89,20,17]
I want to subtract each pair of adjacent elements, like this:
a.apply(1)-a.apply(0), a.apply(2)-a.apply(1), ..., a.apply(a.length-1)-a.apply(a.length-2)
If the result is greater than the threshold of 10, then output those elements, like this:
[19,89]
How can I do this in Scala with an RDD?
If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create an rdd as
val rdd = sc.parallelize(data)
What you desire can be achieved by doing the following
import org.apache.spark.mllib.rdd.RDDFunctions._
val finalrdd = rdd
.sliding(2)
.map(x => (x(1), x(1)-x(0)))
.filter(y => y._2 > 10)
.map(z => z._1)
Doing
finalrdd.foreach(println)
should print
19
89
You can create another RDD from the original RDD with its first element removed, zip the two RDDs (which creates tuples like (2,3), (3,5), (5,8)), and then filter on whether the subtracted result is greater than 10:
val rdd = spark.sparkContext.parallelize(Seq(2,3,5,8,19,3,5,89,20,17))
val first = rdd.first()
rdd.zip(rdd.filter(r => r != first))
.map( k => ((k._2 - k._1), k._2))
.filter(k => k._1 > 10 )
.map(t => t._2).foreach(println)
Hope this helps!

Spark Dataframe Scala: groupby doesn't work after UnionAll

I used unionAll to combine the source DF (with negative weights) and the target DF (with positive weights) into a node DF. Then I perform a groupBy to sum all the weights of the same nodes, but I don't know why groupBy didn't work on the unioned DF at all. Did anyone face the same problem?
val src = file.map(_.split("\t")).map(p => node(p(0), (0-p(2).trim.toInt))).toDF()
val target = file.map(_.split("\t")).map(p => node(p(1), p(2).trim.toInt)).toDF()
val srcfl = src.filter(src("weight") != -1)
val targetfl = target.filter(target("weight") != 1)
val nodes = srcfl.unionAll(targetfl)
nodes.groupBy("name").sum()
nodes.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))
You're simply ignoring the result of the groupBy operation: just like all DataFrame transformations, .groupBy(...).sum() doesn't mutate the original DataFrame (nodes), it produces a new one. I suspect that if you actually use the return value from sum() you'll see the result you're looking for:
val result = nodes.groupBy("name").sum()
result.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))
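As a small follow-up sketch (assuming the Spark 1.6 default naming, where the aggregated column comes out as sum(weight)), you can also rename that column before saving:
val result = nodes.groupBy("name")
  .sum("weight")                              // produces a column named "sum(weight)"
  .withColumnRenamed("sum(weight)", "weight") // restore the original column name
result.map(x => x.mkString("\t")).saveAsTextFile("hdfs://localhost:8020" + args(1))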

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
61,139,75
63,140,77
64,129,82
68,128,56
71,140,47
73,141,38
75,128,59
64,129,61
64,129,80
64,129,99
I create an RDD from it and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line, and rdd2 with the index and the third element of each row. I am new to Scala; can you please help me with how to do this?
The following does not work, since y is not a Scala Vector but an org.apache.spark.mllib.linalg.Vector, which has no take method:
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically, how do I get the first two elements of such a vector?
Thanks.
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern-matching, and then call take/drop on the Array, re-wrapping it with a Vector:
val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
As you can see, this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as an RDD[(Long, Array[Double])] in the first place:
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to create rdd1 and rdd2, otherwise the file will be loaded and parsed twice.
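For example, that caching tip applied to the snippet above (a sketch):
// .cache() keeps the parsed rows in memory, so building rdd1 and rdd2
// does not read and parse the text file from HDFS twice.
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap).cache()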
You can achieve the desired output with the following steps:
Original Data:
indexedData.foreach(println)
(0,[61.0,139.0,75.0])
(1,[63.0,140.0,77.0])
(2,[64.0,129.0,82.0])
(3,[68.0,128.0,56.0])
(4,[71.0,140.0,47.0])
(5,[73.0,141.0,38.0])
(6,[75.0,128.0,59.0])
(7,[64.0,129.0,61.0])
(8,[64.0,129.0,80.0])
(9,[64.0,129.0,99.0])
RDD1 Data:
The index along with the first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
rdd1.foreach(println)
(0,(61.0,139.0))
(1,(63.0,140.0))
(2,(64.0,129.0))
(3,(68.0,128.0))
(4,(71.0,140.0))
(5,(73.0,141.0))
(6,(75.0,128.0))
(7,(64.0,129.0))
(8,(64.0,129.0))
(9,(64.0,129.0))
RDD2 Data:
The index along with the third element of each row.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}
rdd2.foreach(println)
(0,75.0)
(1,77.0)
(2,82.0)
(3,56.0)
(4,47.0)
(5,38.0)
(6,59.0)
(7,61.0)
(8,80.0)
(9,99.0)

Spark: Efficient mass lookup in pair RDD's

In Apache Spark I have two RDDs. The first, data : RDD[(K,V)], contains data in key-value form. The second, pairs : RDD[(K,K)], contains a set of interesting key pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key tuple and their corresponding values (from data) as the value tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of pairs is only a constant times the size of data: |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
keyPairs map {case (k1,k2) =>
val v1 : String = data lookup k1 head;
val v2 : String = data lookup k2 head;
((k1, k2), (v1,v2))
}
}
// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
// Construct all possible pairs of values
val cartesianData = data cartesian data map {case((k1,v1),(k2,v2)) => ((k1,k2),(v1,v2))}
// Select only the values whose keys are in keyPairs
keyPairs map {(_,0)} join cartesianData mapValues {_._2}
}
// Example function that find pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
val keys = data map (_._1)
keys cartesian keys filter {case (x,y) => x*y == 12 && x < y}
}
// Example run
val data = sc parallelize(1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)
// Print:
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but that throws a runtime error when executed. It seems like self is null in the PairRDDFunctions trait.
In addition I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (inside another transformation).
In the lookup 2, I don't think it's necessary to perform full cartesian...
You can do it like this:
val firstjoin = pairs.map({case (k1,k2) => (k1, (k1,k2))})
.join(data)
.map({case (_, ((k1, k2), v1)) => ((k1, k2), v1)})
val result = firstjoin.map({case ((k1,k2),v1) => (k2, ((k1,k2),v1))})
.join(data)
.map({case(_, (((k1,k2), v1), v2))=>((k1, k2), (v1, v2))})
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({case (x,y) => (x._2, (x,y))})
.join(data).map({case(x, (y, z))=>(y._1, (y._2, z))})
I don't think you can do it more efficiently, but I might be wrong...
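For completeness, here is the two-join approach wrapped into the question's signature (massPairLookup3 is a hypothetical name, not part of the answer):
def massPairLookup3(keyPairs: RDD[(Int, Int)], data: RDD[(Int, String)]): RDD[((Int, Int), (String, String))] = {
  keyPairs
    .map { case (k1, k2) => (k1, (k1, k2)) }
    .join(data)                                                    // look up v1 by k1
    .map { case (_, ((k1, k2), v1)) => (k2, ((k1, k2), v1)) }
    .join(data)                                                    // look up v2 by k2
    .map { case (_, (((k1, k2), v1), v2)) => ((k1, k2), (v1, v2)) }
}
// Usage: val pairsWithData = massPairLookup3(pairs, data)  // same output as massPairLookup2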