How to split a spark dataframe with equal records - pyspark

I am using df.randomSplit() but it is not splitting into equal rows. Is there any other way I can achieve it?

In my case I needed balanced (equal sized) partitions in order to perform a specific cross validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter, afaik there is still no transformation to separate a single RDD into many.
Here is some code in scala, it only uses standard spark operations so it should be easy to adapt to python:
val npartitions = 3
val foldedRDD =
// Map each instance with random number
.zipWithIndex
.map ( t => (t._1, t._2, new scala.util.Random(t._2*seed).nextInt()) )
// Random ordering
.sortBy( t => (t._1(m_classIndex), t._3) )
// Assign each instance to fold
.zipWithIndex
.map( t => (t._1, t._2 % npartitions) )
val balancedRDDList =
for (f <- 0 until npartitions)
yield foldedRDD.filter( _._2 == f )

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
.map( _.sortBy(x => {
val index = ordering_br.value.indexOf(x)
if(index == -1) Int.MaxValue else index
}))
Note that indexOf returns -1 if the element is not found in the list. If we leave it as is, all non-found elements would end up at the beginning. I understand that you want them at the end so I relpace -1 by some big number.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)

Spark find previous value on each iteration of RDD

I've following code :-
val rdd = sc.cassandraTable("db", "table").select("id", "date", "gpsdt").where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2) , entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
{
val previousRow = (records - 1)th row
val currentRow = records
// Some calculation based on both rows
}
}
So, Idea is to get just previous \ next row on each iteration of RDD. I want to calculate some field on current row based on the value present on previous row. Thanks,
EDIT II: Misunderstood question below is how to get tumbling window semantics but sliding window is needed. considering this is a sorted RDD
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note however that this is using a DeveloperAPI.
alternatively you can
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
rdd joins should be inner joins (IIRC) thus dropping the edge cases where the tuples would be partially null
OLD STUFF:
how do you do identify the previous row? RDDs do not have any sort of stable ordering by themselves. if you have an incrementing dense key you could add a new column that get's calculated the following way if (k % 2 == 0) k / 2 else (k-1)/2 this should give you a key that has the same value for two successive keys. Then you could just group by.
But to reiterate there is no really sensible notion of previous in most cases for RDDs (depending on partitioning, datasource etc.)
EDIT: so now that you have a zipWithIndex and an ordering in your set you can do what I mentioned above. So now you have an RDD[(Int, YourData)] and can do
rdd.map( kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ( (kv._1 -1) /2, kv._2 ) ).groupByKey.foreach (/* your stuff here /*)
if you reduce at any point consider using reduceByKey rather than groupByKey().reduce

How can I deal with each adjoin two element difference greater than threshold from Spark RDD

I have a problem with Spark Scala which get the value of each adjoin two element difference greater than threshold,I create a new RDD like this:
[2,3,5,8,19,3,5,89,20,17]
I want to subtract each two adjoin element like this:
a.apply(1)-a.apply(0) ,a.apply(2)-a.apply(1),…… a.apply(a.lenght)-a.apply(a.lenght-1)
If the result greater than the threshold of 10,than output the collection,like this:
[19,89]
How can I do this with scala from RDD?
If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create rdd as
val rdd = sc.parallelize(data)
What you desire can be achieved by doing the following
import org.apache.spark.mllib.rdd.RDDFunctions._
val finalrdd = rdd
.sliding(2)
.map(x => (x(1), x(1)-x(0)))
.filter(y => y._2 > 10)
.map(z => z._1)
Doing
finalrdd.foreach(println)
should print
19
89
You can create another RDD from the original dataframe and zip those two RDD which creates a tuple like (2,3)(3,5)(5,8) and filter the subtracted result if it is greater than 10
val rdd = spark.sparkContext.parallelize(Seq(2,3,5,8,19,3,5,89,20,17))
val first = rdd.first()
rdd.zip(rdd.filter(r => r != first))
.map( k => ((k._2 - k._1), k._2))
.filter(k => k._1 > 10 )
.map(t => t._2).foreach(println)
Hope this helps!

RDD split and do aggregation on new RDDs

I have an RDD of (String,String,Int).
I want to reduce it based on the first two strings
And Then based on the first String I want to group the (String,Int) and sort them
After sorting I need to group them into small groups each containing n elements.
I have done the code below. The problem is the number of elements in the step 2 is very large for a single key
and the reduceByKey(x++y) takes a lot of time.
//Input
val data = Array(
("c1","a1",1), ("c1","b1",1), ("c2","a1",1),("c1","a2",1), ("c1","b2",1),
("c2","a2",1), ("c1","a1",1), ("c1","b1",1), ("c2","a1",1))
val rdd = sc.parallelize(data)
val r1 = rdd.map(x => ((x._1, x._2), (x._3)))
val r2 = r1.reduceByKey((x, y) => x + y ).map(x => ((x._1._1), (x._1._2, x._2)))
// This is taking long time.
val r3 = r2.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) => x ++ y)
// from the list I will be doing grouping.
val r4 = r3.map(x => (x._1 , x._2.toList.sorted.grouped(2).toList))
Problem is the "c1" has lot of unique entries like b1 ,b2....million and reduceByKey is killing time because all the values are going to single node.
Is there a way to achieve this more efficiently?
// output
Array((c1,List(List((a1,2), (a2,1)), List((b1,2), (b2,1)))), (c2,List(List((a1,2), (a2,1)))))
There at least few problems with a way you group your data. The first problem is introduced by
mapValues(x => ArrayBuffer(x))
It creates a large amount of mutable objects which provide no additional value since you cannot leverage their mutability in the subsequent reduceByKey
reduceByKey((x, y) => x ++ y)
where each ++ creates a new collection and neither argument can be safely mutated. Since reduceByKey applies map side aggregation situation is even worse and pretty much creates GC hell.
Is there a way to achieve this more efficiently?
Unless you have some deeper knowledge about data distribution which can be used to define smarter partitioner the simplest improvement is to replace mapValues + reduceByKey with simple groupByKey:
val r3 = r2.groupByKey
It should be also possible to use a custom partitioner for both reduceByKey calls and mapPartitions with preservesPartitioning instead of map.
class FirsElementPartitioner(partitions: Int)
extends org.apache.spark.Partitioner {
def numPartitions = partitions
def getPartition(key: Any): Int = {
key.asInstanceOf[(Any, Any)]._1.## % numPartitions
}
}
val r2 = r1
.reduceByKey(new FirsElementPartitioner(8), (x, y) => x + y)
.mapPartitions(iter => iter.map(x => ((x._1._1), (x._1._2, x._2))), true)
// No shuffle required here.
val r3 = r2.groupByKey
It requires only a single shuffle and groupByKey is simply a local operations:
r3.toDebugString
// (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
// | MapPartitionsRDD[40] at mapPartitions at <console>:35 []
// | ShuffledRDD[39] at reduceByKey at <console>:34 []
// +-(8) MapPartitionsRDD[1] at map at <console>:28 []
// | ParallelCollectionRDD[0] at parallelize at <console>:26 []

word count(frequency) spark rdd scala

if I have an rdd accross cluster and I want to do the word count
not only count the appear times,
I want to get the frequency, which is defined as count/total count
What is the best and efficient way to do so in scala?
How can I do reduction job and calculate total number at the same time within one workflow?
BTW I know purely word count can be done in this way.
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
but what is the difference if I use aggregate? in terms of spark job workflow
val result = pairs
.aggregate(Map[String, Int]())((acc, pair) =>
if(acc.contains(pair._1))
acc ++ Map[String, Int]((pair._1, acc(pair._1)+1))
else
acc ++ Map[String, Int]((pair._1, pair._2))
,
(a, b) =>
(a.toSeq ++ b.toSeq)
.groupBy(_._1)
.mapValues(_.map(_._2).reduce(_ + _))
)
You can use this
val total = counts.map(x => x._2).sum()
val freq = counts.map(x => (x._1, x._2/total))
There exists also the concept of Accumulator which is a write-only variable and you could use it to avoid using the sum() action, but your code would need a lot of change.