How can I find elements where the difference between adjacent elements is greater than a threshold in a Spark RDD - Scala

I have a problem in Spark with Scala: I need to get the elements where the difference between two adjacent elements is greater than a threshold. I create an RDD like this:
[2,3,5,8,19,3,5,89,20,17]
I want to subtract each pair of adjacent elements like this:
a.apply(1)-a.apply(0), a.apply(2)-a.apply(1), …, a.apply(a.length-1)-a.apply(a.length-2)
If the result is greater than the threshold of 10, then output the second element of the pair, like this:
[19,89]
How can I do this in Scala on an RDD?

If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create an rdd as
val rdd = sc.parallelize(data)
What you want can then be achieved with sliding:
import org.apache.spark.mllib.rdd.RDDFunctions._
val finalrdd = rdd
  .sliding(2)                     // windows over adjacent pairs
  .map(x => (x(1), x(1) - x(0)))  // (second element, difference)
  .filter(y => y._2 > 10)         // keep pairs above the threshold
  .map(z => z._1)                 // emit the second element
Doing
finalrdd.foreach(println)
should print
19
89
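Note that sliding on RDDs comes from the spark-mllib module (hence the RDDFunctions import above), so spark-mllib has to be on the classpath for this to resolve.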

You can create another RDD from the original RDD and zip the two, which creates tuples like (2,3), (3,5), (5,8), and then filter where the subtracted result is greater than 10:
val rdd = spark.sparkContext.parallelize(Seq(2,3,5,8,19,3,5,89,20,17))
val first = rdd.first()
rdd.zip(rdd.filter(r => r != first))
  .map(k => (k._2 - k._1, k._2))
  .filter(k => k._1 > 10)
  .map(t => t._2)
  .foreach(println)
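Caveat: Spark's zip requires both RDDs to have the same number of partitions and the same number of elements per partition, so zipping an rdd with a filtered copy of itself can fail at runtime; filtering on r != first also drops every occurrence of that value, not just the first position. A sketch of the same adjacent-pairing idea via zipWithIndex and join avoids both issues:
val indexed = rdd.zipWithIndex().map(_.swap)             // (index, value)
val shifted = indexed.map { case (i, v) => (i - 1, v) }  // value at i, keyed by i-1
indexed.join(shifted)                                    // pairs (value(i), value(i+1))
  .filter { case (_, (cur, next)) => next - cur > 10 }
  .map { case (_, (_, next)) => next }
  .foreach(println)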
Hope this helps!

Related

Spark - how to get top N of rdd as a new rdd (without collecting at the driver)

I am wondering how to filter an RDD down to the rows holding one of its top N values. Usually I would sort, take the top N items as an array at the driver, and broadcast the Nth value as a threshold to filter the rdd, like so:
val topNvalues = sc.broadcast(rdd.map(_.fieldToThreshold).distinct.sortBy(v => -v).take(N))
val threshold = topNvalues.value.last
val rddWithTopNValues = rdd.filter(_.fieldToThreshold >= threshold)
but in this case my N is too large, so how can I do this purely with RDDs, like so:
def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
  val sortedPrices = itemPrices.sortBy(-_._2).map(_._1).distinct
  // How to do this without collecting results to driver??
  val highPrices = itemPrices.getTopNValuesWithoutCollect(count)
  itemPrices.join(highPrices.keyBy(x => x)).map(_._2._1)
}
Use zipWithIndex on the sorted rdd and then filter by the index, keeping up to n items. To illustrate, consider this rdd sorted in descending order:
val rdd = sc.parallelize((1 to 10).map( _ => math.random)).sortBy(-_)
Then
rdd.zipWithIndex.filter(_._2 < 4)
delivers the top four items (still paired with their indices) without collecting the rdd to the driver.
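Applied to the question's function, a minimal sketch along the same lines (untested; it keys items by price for the join, mirroring the keyBy in the question):
def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
  val topPrices = itemPrices
    .map(_._2)
    .distinct()
    .sortBy(p => -p)        // descending by price
    .zipWithIndex()
    .filter(_._2 < count)   // keep the top `count` prices
    .map(_._1)
  itemPrices.map(_.swap)                // (price, id)
    .join(topPrices.keyBy(identity))    // keep items whose price is in the top N
    .map { case (price, (id, _)) => (id, price) }
}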

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be the day and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A occurs 20 times on day 2 and B 10 times on day 3.)
I am splitting the original dataset:
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What would be the best implementation for the count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD

val rdd = sqlContext.sparkContext.makeRDD(Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
// per line: ("DayN", Map(char -> count))
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map { str =>
  val split = str.split(":")
  (s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
// for each char, pick the day with the highest count
keys.map { key =>
  key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}.toMap
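For the sample data this should yield Map(A -> Day2, B -> Day3).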
The processing that needs to be done consists in transforming the data into an RDD shape that makes it easy to apply functions like: find the maximum of a list.
I will try to explain it step by step.
I've used dummy data for the "A" and "B" chars.
The foo rdd from the question is the first step; it gives you an RDD[(String, Array[String])].
Let's count the occurrences of each char in the Array[String]:
val res3 = foo.map { case (d, s) => (d, s.toList.groupBy(c => c).map { case (x, xs) => (x, xs.size) }.toList) }
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we flatMap over the values to expand the rdd to one row per char:
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the tuples so the char comes first:
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char:
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the maximum count for each char:
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
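To get exactly the Map shape asked for, one final map over that last result does it (a sketch, with res11 standing for the previous result):
res11.map { case (s, (_, _, d)) => (s, s"Day$d") }.collect().toMap
// Map(A -> Day2, B -> Day3)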
Hope this helps.
Previous answers are good, but I prefer a solution like this:
val data = Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
  val parts = s.split(":")
  val charCount = parts(1).groupBy(i => i).mapValues(_.length)
  charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take the max count from each grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)

How to split a spark dataframe with equal records

I am using df.randomSplit() but it is not splitting into equal numbers of rows. Is there any other way I can achieve it?
In my case I needed balanced (equal-sized) partitions in order to perform a specific cross-validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; AFAIK there is still no transformation to separate a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:
val npartitions = 3
val seed = 42L        // assumption: seed and m_classIndex were fields in the original code; a fixed seed is used here
val foldedRDD = data  // `data` stands for the input RDD, which the original snippet omitted
  // Map each instance with a random number
  .zipWithIndex
  .map(t => (t._1, t._2, new scala.util.Random(t._2 * seed).nextInt()))
  // Random ordering
  .sortBy(t => (t._1(m_classIndex), t._3))
  // Assign each instance to a fold
  .zipWithIndex
  .map(t => (t._1, t._2 % npartitions))
val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter(_._2 == f)
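Since the question mentions a DataFrame, the same recipe (randomize, then assign by modulus) can also be sketched with the DataFrame API - here df and the fold count n are assumptions, and the global Window.orderBy pushes the sort through a single partition:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val n = 3
val withFold = df
  .withColumn("rand", rand(42))  // randomize with a fixed seed
  .withColumn("fold", (row_number().over(Window.orderBy(col("rand"))) - 1) % n)
val parts = (0 until n).map(f => withFold.filter(col("fold") === f).drop("rand").drop("fold"))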

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
61,139,75
63,140,77
64,129,82
68,128,56
71,140,47
73,141,38
75,128,59
64,129,61
64,129,80
64,129,99
I create an RDD from it and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line, and rdd2 with the index and the third element of each row. I am new to Scala; can you please help me with how to do this?
This does not work, since y is not a Scala collection Vector but an org.apache.spark.mllib.linalg.Vector, which has no take method:
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically, how do I get the first two elements of such a vector?
Thanks.
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern matching, then call take/drop on the Array and re-wrap the result in a Vector:
import org.apache.spark.mllib.linalg.{DenseVector, Vectors}

val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
As you can see - this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as an RDD[(Long, Array[Double])] in the first place:
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to create rdd1 and rdd2 - otherwise the file will be loaded and parsed twice.
You can achieve the above output by following the steps below:
Original Data:
indexedData.foreach(println)
(0,[61.0,139.0,75.0])
(1,[63.0,140.0,77.0])
(2,[64.0,129.0,82.0])
(3,[68.0,128.0,56.0])
(4,[71.0,140.0,47.0])
(5,[73.0,141.0,38.0])
(6,[75.0,128.0,59.0])
(7,[64.0,129.0,61.0])
(8,[64.0,129.0,80.0])
(9,[64.0,129.0,99.0])
RDD1 Data:
Having the index along with the first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
rdd1.foreach(println)
(0,(61.0,139.0))
(1,(63.0,140.0))
(2,(64.0,129.0))
(3,(68.0,128.0))
(4,(71.0,140.0))
(5,(73.0,141.0))
(6,(75.0,128.0))
(7,(64.0,129.0))
(8,(64.0,129.0))
(9,(64.0,129.0))
RDD2 Data:
Having the index along with the third element of each row.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}
rdd2.foreach(println)
(0,75.0)
(1,77.0)
(2,82.0)
(3,56.0)
(4,47.0)
(5,38.0)
(6,59.0)
(7,61.0)
(8,80.0)
(9,99.0)

How to do a vector (vertical) sum in Scala with Spark 1.6

I have an RDD[(Long, Vector)]. I want to sum over all the vectors. How can I achieve this in Spark 1.6?
For example, the input data is like:
(1,[0.1,0.2,0.7])
(2,[0.2,0.4,0.4])
It should then produce a result like:
[0.3,0.6,1.1]
regardless of the Long keys.
If you have an RDD[(Long, Vector)] like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)), (2L, Vectors.dense(0.2, 0.4, 0.4))))
You can reduce the values (the vectors) in order to get the sum:
val res = myRdd
  .values
  .reduce { case (a: Vector, b: Vector) =>
    Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _)) }
I get the following result with a floating point error:
[0.30000000000000004,0.6000000000000001,1.1]
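If the rdd might be empty, folding with an explicit zero vector is a safer variant - a sketch that assumes the dimension (here 3) is known up front:
val dim = 3
val summed = myRdd
  .values
  .map(_.toArray)
  .fold(Array.fill(dim)(0.0))((a, b) => (a, b).zipped.map(_ + _))  // element-wise sum
val total = Vectors.dense(summed)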
You can also refer to the Spark examples, which do something similar:
val model = pipeline.fit(df)
val documents = model.transform(df)
  .select("features")
  .rdd
  .map { case Row(features: MLVector) => Vectors.fromML(features) }
  .zipWithIndex()
  .map(_.swap)
(documents,
  model.stages(2).asInstanceOf[CountVectorizerModel].vocabulary, // vocabulary
  documents.map(_._2.numActives).sum().toLong)                   // total token count