How can I find adjacent elements whose difference is greater than a threshold in a Spark RDD - Scala

I have a problem in Spark with Scala: I want to get every element whose difference from the preceding element is greater than a threshold. I create a new RDD like this:
I want to subtract each pair of adjacent elements like this:
a.apply(1)-a.apply(0), a.apply(2)-a.apply(1), ..., a.apply(a.length-1)-a.apply(a.length-2)
If the result is greater than the threshold of 10, then output the collection, like this:
How can I do this in Scala from an RDD?

If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create rdd as
val rdd = sc.parallelize(data)
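What you want can be achieved with sliding from MLlib's RDDFunctions:
import org.apache.spark.mllib.rdd.RDDFunctions._
val finalrdd = rdd
  .sliding(2)                      // windows of two adjacent elements
  .map(x => (x(1), x(1)-x(0)))     // (second element, difference)
  .filter(y => y._2 > 10)
  .map(z => z._1)
Printing finalrdd (for example with finalrdd.collect().foreach(println)) should print
19
89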

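You can build a second RDD that is shifted by one position and pair it with the original, which creates tuples like (2,3), (3,5), (5,8); then keep the second element wherever the subtracted result is greater than 10. RDD.zip requires both RDDs to have the same number of elements in each partition, so the pairing below joins on the index from zipWithIndex instead.
val rdd = spark.sparkContext.parallelize(Seq(2,3,5,8,19,3,5,89,20,17))
// Index every element, then shift a copy by one so that element i+1 lines up with element i
val indexed = rdd.zipWithIndex().map(_.swap)
val shifted = indexed.map { case (i, v) => (i - 1, v) }
indexed.join(shifted)
  .map { case (_, (prev, next)) => ((next - prev), next) }
  .filter(k => k._1 > 10 )
  .map(t => t._2).foreach(println)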
Hope this helps!


Spark - how to get top N of rdd as a new rdd (without collecting at the driver)

I am wondering how to filter an RDD down to the rows that hold one of the top N values. Usually I would sort the RDD and take the top N items as an array in the driver, to find the Nth value, which can then be broadcast and used to filter the RDD like so:
val topNvalues = sc.broadcast(
val threshold = topNvalues.value.last
val rddWithTopNValues = rdd.filter(_.fieldToThreshold >= threshold)
but in this case my N is too large, so how can I do this purely with RDDs like so?:
def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] = {
val sortedPrices = itemPrices.sortBy(-_._2).map(_._1).distinct
// How to do this without collecting results to driver??
val highPrices = itemPrices.getTopNValuesWithoutCollect(count)
itemPrices.join(highPrices.keyBy(x => x)).map(_._2._1)
}
Use zipWithIndex on the sorted RDD and then filter by index to keep the first n items. To illustrate, consider this RDD sorted in descending order:
val rdd = sc.parallelize((1 to 10).map( _ => math.random)).sortBy(-_)
rdd.zipWithIndex.filter(_._2 < 4)
This delivers the top four items without collecting the RDD to the driver.
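As a rough sketch, the same idea could stand in for the `getTopNValuesWithoutCollect` placeholder from the question. It assumes the goal is simply to keep the `count` rows with the highest price. The object name `TopNSketch` and the local `SparkContext` setup are illustrative assumptions, not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TopNSketch {
  // Keep the `count` entries with the highest price using only RDD transformations.
  def getExpensiveItems(itemPrices: RDD[(Int, Float)], count: Int): RDD[(Int, Float)] =
    itemPrices
      .sortBy(_._2, ascending = false)             // highest price first
      .zipWithIndex()                              // attach a 0-based rank
      .filter { case (_, rank) => rank < count }   // keep ranks 0 until count
      .map { case (item, _) => item }              // drop the rank again

  def main(args: Array[String]): Unit = {
    // Local context for demonstration only (assumption: Spark is on the classpath)
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("top-n-sketch"))
    val prices = sc.parallelize(Seq((1, 9.5f), (2, 120.0f), (3, 42.0f), (4, 77.0f)))
    // collect() is used here only to print the small demo result
    getExpensiveItems(prices, 2).collect().foreach(println)
    sc.stop()
  }
}
```

Note that zipWithIndex has to run a job to count the elements of each partition when the RDD has more than one partition, but only those counts go back to the driver, not the data.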

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A occurs 10 times on day 2 and B 10 times on day 3)
I am splitting the original dataset:
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What would be the best implementation for the count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
val rdd = sqlContext.sparkContext.makeRDD(Seq(
val keys = Seq("A", "B")
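val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map { str =>
  val split = str.split(":")
  // (day label, occurrences of each char on that day)
  (s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map { key =>
  // sort the days by this key's count, descending, and take the first day
  key -> seqOfMaps.mapValues(_.getOrElse(key, 0)).sortBy(a => -a._2).first._1
}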
The processing that needs to be done consists of transforming the data into an RDD that is easy to apply functions to, such as finding the maximum of a list.
I will try to explain step by step.
I've used dummy data for the "A" and "B" chars.
The foo RDD is the first step; it gives you an RDD[(String, Array[String])].
Let's count each char in the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we flatMap over the values to expand the RDD by char:
val expanded = res3.flatMapValues(list => list)
Rearrange each record so that the char comes first:
val rearranged = expanded.map{case (d, (s, c)) => (s, c, d)}
Now we group by char:
val grouped = rearranged.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the record with the maximum count for each char:
val result = grouped.map{case (s, list) => (s, list.maxBy(_._2))}
Hope this helps.
The previous answers are good, but I prefer this solution:
val data = Seq(
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
  val parts = s.split(":")
  val charCount = parts(1).groupBy(i => i).mapValues(_.length)
  charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
Result is:
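For comparison, here is a minimal sketch of the same computation with reduceByKey instead of groupBy, so only one (day, count) pair per event is kept during the shuffle. It assumes each line looks like "day:events" (inferred from the split(":") calls above) with one line per day. The sample lines, the object name `MaxOccurrenceSketch`, and the function name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaxOccurrenceSketch {
  // For each event character, find the day on which it occurs most often.
  // Assumes lines of the form "day:events", e.g. "2:AAAB", one line per day.
  def busiestDayPerEvent(lines: RDD[String]): RDD[(Char, (String, Int))] =
    lines
      .flatMap { line =>
        val Array(day, events) = line.split(":", 2)
        // Count each character within this line: (event, (day, count))
        events.groupBy(identity).toSeq.map { case (event, occ) => (event, (day, occ.length)) }
      }
      // Per event, keep the (day, count) pair with the larger count
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)

  def main(args: Array[String]): Unit = {
    // Local context for demonstration only (assumption: Spark is on the classpath)
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("max-occurrence-sketch"))
    val lines = sc.parallelize(Seq("1:AAB", "2:AAAB", "3:ABBB")) // made-up sample data
    busiestDayPerEvent(lines).collect().foreach(println)
    sc.stop()
  }
}
```

If two days tie for an event, this sketch keeps whichever pair the reduction happens to see first, so add an explicit tie-break if that matters.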

How to split a Spark DataFrame into parts with equal numbers of records

I am using df.randomSplit() but it does not split the data into parts with equal numbers of rows. Is there any other way I can achieve this?
In my case I needed balanced (equal sized) partitions in order to perform a specific cross validation experiment.
For that you usually:
1. Randomize the dataset
2. Apply a modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; afaik there is still no transformation to separate a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:
val npartitions = 3
// indexedRDD is a placeholder name for your data as (instance, index) pairs, e.g. the result of zipWithIndex;
// seed and m_classIndex are assumed to be defined elsewhere
val foldedRDD = indexedRDD
  // Map each instance with random number
  .map ( t => (t._1, t._2, new scala.util.Random(t._2*seed).nextInt()) )
  // Random ordering
  .sortBy( t => (t._1(m_classIndex), t._3) )
  // Assign each instance to fold
  .map( t => (t._1, t._2 % npartitions) )
val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter( _._2 == f )
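The two steps above (randomize, then assign folds by modulus) can be sketched as one generic helper. This is a simplified illustration, not the answer's exact code: it leaves out the per-class ordering, and it takes the index after the random sort so that the fold really depends on the shuffled position. The names `BalancedSplitSketch` and `balancedFolds` are assumptions. For a DataFrame, `df.rdd` could be passed in.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BalancedSplitSketch {
  // Split `rdd` into `k` folds whose sizes differ by at most one element.
  def balancedFolds[T: ClassTag](rdd: RDD[T], k: Int, seed: Long): Seq[RDD[T]] = {
    val folded = rdd
      .zipWithIndex()                                   // (element, original position)
      .map { case (x, i) => (x, new scala.util.Random(seed + i).nextInt()) }
      .sortBy(_._2)                                     // random ordering
      .zipWithIndex()                                   // position after shuffling
      .map { case ((x, _), pos) => (x, pos % k) }       // fold id in [0, k)
      .cache()                                          // reused by every filter below
    (0 until k).map(f => folded.filter(_._2 == f).map(_._1))
  }

  def main(args: Array[String]): Unit = {
    // Local context for demonstration only (assumption: Spark is on the classpath)
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("balanced-split-sketch"))
    val folds = balancedFolds(sc.parallelize(1 to 10), k = 3, seed = 42L)
    folds.zipWithIndex.foreach { case (fold, i) =>
      println(s"fold $i: ${fold.collect().mkString(", ")}")
    }
    sc.stop()
  }
}
```

Because consecutive shuffled positions cycle through the fold ids, the fold sizes can differ by at most one, which randomSplit does not guarantee.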

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
I create an RDD from it and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line, and then rdd2 with the index and the third element of each row. I am new to Scala; can you please help me with how to do this?
This does not work since y is not a Scala Vector but an org.apache.spark.mllib.linalg.Vector:
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically, how do I get the first two elements of such a vector?
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern-matching, and then call take/drop on the Array, re-wrapping it with a Vector:
val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
As you can see - this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as a RDD[(Long, Array[Double])] in the first place:
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to create rdd1 and rdd2; otherwise the file will be loaded and parsed twice.
You can achieve the above output by following the below steps:
Original Data:
RDD1 Data:
The index along with the first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
RDD2 Data:
The index along with the third element of each row.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}

How to do a vector (vertical) sum in Scala with Spark 1.6

I have an RDD[(Long, Vector)]. I want to sum over all the vectors. How can I achieve this in Spark 1.6?
For example, input data is like
It then produces results like
regardless of the first value in long
If you have an RDD[(Long, Vector)] like this:
val myRdd = sc.parallelize(List((1L, Vectors.dense(0.1, 0.2, 0.7)),(2L, Vectors.dense(0.2, 0.4, 0.4))))
You can drop the keys and reduce the values (vectors) in order to get the sum:
val res = myRdd
  .map(_._2)
  .reduce {case (a:(Vector), b:(Vector)) =>
    Vectors.dense((a.toArray, b.toArray).zipped.map(_ + _))}
I get the following result with a floating point error:
source: this
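A self-contained sketch of the same element-wise reduction is shown below, using SparkContext directly so that it also fits Spark 1.6. It assumes a non-empty RDD whose vectors all have the same length. The names `VectorSumSketch` and `sumVectors` are illustrative, and the sample values are chosen only for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object VectorSumSketch {
  // Element-wise sum of all vectors, ignoring the Long key.
  // Assumes the RDD is non-empty and every vector has the same length.
  def sumVectors(rdd: RDD[(Long, Vector)]): Vector = {
    val summed: Array[Double] = rdd
      .values                                                    // drop the Long key
      .map(_.toArray)
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })   // add position by position
    Vectors.dense(summed)
  }

  def main(args: Array[String]): Unit = {
    // Local context for demonstration only (assumption: spark-mllib is on the classpath)
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[*]").setAppName("vector-sum-sketch"))
    val myRdd = sc.parallelize(List(
      (1L, Vectors.dense(0.5, 0.25, 1.0)),
      (2L, Vectors.dense(0.5, 0.75, 2.0))))
    println(sumVectors(myRdd))
    sc.stop()
  }
}
```

reduce throws on an empty RDD, so a guard or a fold with a zero vector would be needed if that case can occur.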
You can also refer to the Spark example, for instance:
val model =
val documents = model.transform(df)
.map { case Row(features: MLVector) => Vectors.fromML(features) }
//total token count