taking N values from each partition in Spark - scala

Assuming I am having the following data:
val DataSort = Seq(("a",5),("b",13),("b",2),("b",1),("c",4),("a",1),("b",15),("c",3),("c",2),("c",1))
val DataSortRDD = sc.parallelize(DataSort,2)
And now there are two partitions with:
scala>DataSortRDD.glom().take(2).head
res53: Array[(String,Int)] = Array(("a",5),("b",13),("b",2),("b",1),("c",4))
scala>DataSortRDD.glom().take(2).tail
res54: Array[Array[(String,Int)]] = Array(Array(("a",1),("b",15),("c",3),("c",2),("c",1)))
It is assumed that within every partition the data is already sorted, using something like sortWithinPartitions(col("src").desc, col("rank").desc) (that's for a DataFrame, but it is just to illustrate).
What I want is, from each partition, to get the first two values for each letter (if there are more than two values). So in this example the result in each partition should be:
scala>HypotheticalRDD.glom().take(2).head
Array(("a",5),("b",13),("b",2),("c",4))
scala>HypotheticalRDD.glom().take(2).tail
Array(Array(("a",1),("b",15),("c",3),("c",2)))
I know that I have to use the mapPartitions function, but it is not clear in my mind how I can iterate through the values in each partition and take the first two. Any tip?
Edit: More precisely: I know that in each partition the data is already sorted by 'letter' first and then by 'count'. So my main idea is that the input function to mapPartitions should iterate through the partition and yield the first two values of each letter, which could be done by checking the .next() value on every iteration. This is how I could write it in Python:
def limit_on_sorted(iterator):
    oldKey = None
    cnt = 0
    for elem in iterator:
        curKey = elem[0]
        if curKey == oldKey:
            cnt += 1
            if cnt >= 2:
                # over the per-key limit: emit a marker that gets filtered out later
                yield None
                continue
        else:
            oldKey = curKey
            cnt = 0
        yield elem

DataSortRDDpython.mapPartitions(limit_on_sorted, preservesPartitioning=True) \
                 .filter(lambda x: x is not None)

Assuming you don't really care about the partitioning of the result, you can use mapPartitionsWithIndex to incorporate the partition ID into the key by which you groupBy; then you can easily take the first two items for each such key:
import org.apache.spark.rdd.RDD

val result: RDD[(String, Int)] = DataSortRDD
  .mapPartitionsWithIndex {
    // add the partition ID into the "key" of every record:
    case (partitionId, itr) => itr.map { case (k, v) => ((k, partitionId), v) }
  }
  .groupByKey() // groups by letter and partition id
  // take only the first two records, and drop the partition id
  .flatMap { case ((k, _), itr) => itr.take(2).toArray.map((k, _)) }
println(result.collect().toList)
// prints:
// List((a,5), (b,15), (b,13), (b,2), (a,1), (c,4), (c,3))
Do note that the end result is not partitioned in the same way (groupByKey changes the partitioning); I'm assuming this isn't critical to what you're trying to do (which, frankly, escapes me).
EDIT: if you want to avoid shuffling and perform all operations within each partition:
val result: RDD[(String, Int)] = DataSortRDD
  .mapPartitions(_.toList.groupBy(_._1).mapValues(_.take(2)).values.flatten.iterator, preservesPartitioning = true)
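If you also need to keep the original sort order within each partition (the toList.groupBy above does not guarantee the order across keys), a streaming variant is possible. This is only a sketch, under the assumption that records with the same letter are adjacent within each partition (i.e. the partition is sorted by key):
val resultOrdered: RDD[(String, Int)] = DataSortRDD.mapPartitions(iter => {
  // per-partition state: the key currently being scanned and how many of its records we kept
  var currentKey: Option[String] = None
  var count = 0
  iter.filter { case (k, _) =>
    if (currentKey.contains(k)) {
      count += 1
      count <= 2 // keep only the first two records of each key
    } else {
      currentKey = Some(k)
      count = 1
      true
    }
  }
}, preservesPartitioning = true)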

Related

Spark find previous value on each iteration of RDD

I have the following code:
val rdd = sc.cassandraTable("db", "table")
  .select("id", "date", "gpsdt")
  .where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
  val previousRow = ??? // pseudocode: the (records - 1)-th row
  val currentRow = records
  // Some calculation based on both rows
}
So the idea is to get the previous/next row on each iteration over the RDD. I want to calculate some field on the current row based on the value present on the previous row. Thanks.
EDIT II: I misunderstood the question; the OLD STUFF below gives tumbling-window semantics, whereas a sliding window is needed. Considering that this is a sorted RDD,
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note, however, that this uses a DeveloperApi.
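For example, a minimal usage sketch (my assumption, not from the answer: sortedRDD here is an RDD[Long] of the sorted gpsdt values, and pairing each value with its gap stands in for your calculation):
// relies on the RDDFunctions import above; each window is Array(previous, current)
val deltas = sortedRDD.sliding(2).map { case Array(prev, curr) =>
  (curr, curr - prev) // e.g. the current value together with its distance from the previous one
}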
Alternatively you can:
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
RDD joins should be inner joins (IIRC), thus dropping the edge cases where the tuples would be partially null.
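To make the shape of the join result explicit, here is a sketch of how it could be consumed (calculate is a hypothetical placeholder for the per-row computation, not something from the answer):
// After l.join(r), the value under index i is (element at i, element at i + 1),
// i.e. (previous row, current row) seen from the later row's point of view.
val computed = sliding.map { case (i, (previousRow, currentRow)) =>
  (i, calculate(previousRow, currentRow))
}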
OLD STUFF:
How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated the following way: if (k % 2 == 0) k / 2 else (k - 1) / 2. This gives you a key that has the same value for two successive keys. Then you could just group by it.
But to reiterate: in most cases there is no really sensible notion of a 'previous' row for RDDs (it depends on partitioning, the data source, etc.).
EDIT: So now that you have zipWithIndex and an ordering on your set, you can do what I mentioned above. You now have an RDD[(Long, YourData)] (after swapping the index to the front) and can do
rdd.map(kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2)).groupByKey.foreach(/* your stuff here */)
If you reduce at any point, consider using reduceByKey rather than groupByKey().reduce.
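For instance, a small sketch with a hypothetical paired: RDD[(Long, Int)] (not from the answer): both lines compute the same per-key sums, but reduceByKey combines values map-side before the shuffle:
val viaGroupByKey = paired.groupByKey().mapValues(_.sum) // ships every value across the network
val viaReduceByKey = paired.reduceByKey(_ + _)           // combines partial sums before shuffling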

How do I read the value of a continuous index from spark RDD

I have a problem in Spark Scala: I want to get the first value of each consecutive run of the same key. I create a new RDD like this:
[(a,1),(a,2),(a,3),(a,4),(b,1),(b,2),(a,3),(a,4),(a,5),(b,8),(b,9)]
I want to fetch the result like this:
[(a,1),(b,1),(a,3),(b,8)]
How can I do this with Scala from the RDD?
As mentioned in the comments, in order to be able to use the order of the elements in an RDD, you have to somehow represent this order in the data itself. For exactly that purpose, zipWithIndex was created: the index is added to the data. Then, with some manipulation (a join on an RDD with modified indices), we can get what you need:
// add index to RDD:
val withIndex = rdd.zipWithIndex().map(_.swap)
// create another RDD with indices increased by one, to later join each element with the previous one
val previous = withIndex.map { case (index, v) => (index + 1, v) }
// join RDDs, filter out those where previous "key" is identical
val result = withIndex.leftOuterJoin(previous).collect {
  case (i, (left, None)) => (i, left) // keep the first element in the RDD
  case (i, (left, Some((key, _)))) if left._1 != key => (i, left) // keep only elements where the previous key is different
}.sortByKey().values // if you want to preserve the original order...
result.collect().foreach(println)
// (a,1)
// (b,1)
// (a,3)
// (b,8)

How to split a spark dataframe with equal records

I am using df.randomSplit(), but it is not splitting into equal numbers of rows. Is there any other way I can achieve this?
In my case I needed balanced (equal sized) partitions in order to perform a specific cross validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; AFAIK there is still no transformation that separates a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:
val npartitions = 3

// `data` stands in here for the source RDD, which is not shown in the snippet;
// `seed` and `m_classIndex` also come from the surrounding (omitted) context.
val foldedRDD = data
  // Pair each instance with a random number
  .zipWithIndex
  .map(t => (t._1, t._2, new scala.util.Random(t._2 * seed).nextInt()))
  // Random ordering
  .sortBy(t => (t._1(m_classIndex), t._3))
  // Assign each instance to a fold
  .zipWithIndex
  .map(t => (t._1, t._2 % npartitions))

val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter(_._2 == f)
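Since the question starts from a DataFrame, here is a rough sketch of the same recipe with the DataFrame API (my assumptions, not part of the answer above: Spark 2.x, df is the input, and the column names rand, rn and part are made up). Note that a window without partitionBy pulls all rows through a single partition, so this is only reasonable for moderately sized data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val n = 3
val withFold = df
  .withColumn("rand", rand(42))                                      // randomize
  .withColumn("rn", row_number().over(Window.orderBy(col("rand"))))  // stable position after shuffling
  .withColumn("part", col("rn") % n)                                 // assign each row to a fold

val folds = (0 until n).map(f => withFold.filter(col("part") === f).drop("rand", "rn", "part"))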

How can I parse this raw transactional log data in a Spark DataFrame?

I need to process the raw data with Spark 1.6 such that the first row shows up as a separate column along with the rest of its corresponding rows. The first row will always have a 3-digit number, followed by its corresponding rows, which will always have an 8-digit number. However, the number of corresponding rows can vary. Here is a dummy example of the raw data and the desired output. How can I code this?
Raw Data
765
11111111
22222222
33333333
456
66666666
88888888
Desired output
765 11111111
765 22222222
765 33333333
456 66666666
456 88888888
The idea is to try to solve this problem per partition. This works great as long as each partition starts with a 3-digit number. If a partition doesn't have a 3-digit number we need to find the last 3-digit number of the previous partition.
First we need some utility functions:
The first (last3dig) finds the last 3-digit number of each partition. This will help us have an initial 3-digit number for those partitions that don't start with one. If we apply it to each partition we get a list of such 3-digit numbers: each element is the last 3-digit number of its corresponding partition if it exists, else None.
The second (fillGaps) takes care of properly filling the gaps in the list of last 3-digit numbers (from the previous step). So if we have Some(1), None, None, Some(4), it will turn it into Some(1), Some(1), Some(1), Some(4).
The third (transformRow) will go over the initial RDD and use the utility functions we created to populate the resulting RDD.
// Finds the last 3-digit number in a Stream (RDD partition)
def last3dig(x: Stream[String]): Option[String] = {
  def help(y: Stream[String], sofar: Option[String]): Option[String] = {
    y match {
      case h #:: tl     => if (h.length == 3) help(tl, Some(h)) else help(tl, sofar)
      case Stream.Empty => sofar
    }
  }
  help(x, None)
}

// Fills the gaps in the per-partition list of last 3-digit numbers
def fillGaps(data: Vector[(Int, Option[String])]): Vector[(Int, Option[String])] =
  data.foldLeft(Vector.empty[(Int, Option[String])]) {
    case (col, n) => if (n._2.isEmpty) (n._1, col.head._2) +: col else n +: col
  }

// Rewrites a partition, attaching the current 3-digit filler to every 8-digit row
def transformRow(x: Stream[String], filler: String): Stream[String] = x match {
  case h #:: tl =>
    if (h.length != 3)
      s"$h $filler" #:: transformRow(tl, filler)
    else
      transformRow(tl, h) // update the filler
  case Stream.Empty => x
}
// toy data
val d = Seq(765, 11111111, 22222222, 33333333, 456, 66666666, 88888888).map(_.toString)
val rdd = sc.makeRDD(d, 2)

// last 3-digit number of each partition (or None)
val mappings = rdd.mapPartitionsWithIndex {
  case (i, iter) => Iterator((i, last3dig(iter.toStream)))
}.collect().toVector

val filledMappings = fillGaps(mappings)
val mm = sc.broadcast(filledMappings.toMap)

val finalResult = rdd.mapPartitionsWithIndex {
  case (i, iter) => transformRow(iter.toStream, if (i > 0) mm.value(i - 1).get else "").toIterator
}.collect() // remove collect() for a large dataset

Spark: Efficient mass lookup in pair RDD's

In Apache Spark I have two RDDs. The first, data : RDD[(K,V)], contains data in key-value form. The second, pairs : RDD[(K,K)], contains a set of interesting key-pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key-tuple and their corresponding values (from data) as the value-tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of 'pairs' is only a constant factor times the size of data: |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  keyPairs map { case (k1, k2) =>
    val v1 : String = data lookup k1 head;
    val v2 : String = data lookup k2 head;
    ((k1, k2), (v1, v2))
  }
}

// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  // Construct all possible pairs of values
  val cartesianData = data cartesian data map { case ((k1, v1), (k2, v2)) => ((k1, k2), (v1, v2)) }
  // Select only the values whose keys are in keyPairs
  keyPairs map { (_, 0) } join cartesianData mapValues { _._2 }
}

// Example function that finds pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
  val keys = data map (_._1)
  keys cartesian keys filter { case (x, y) => x * y == 12 && x < y }
}

// Example run
val data = sc parallelize (1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)

// Prints:
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but it throws a runtime error when executed. It seems like self is null in the PairRDDFunctions trait.
In addition, I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n * |partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but it creates |data|^2 pairs, which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (inside another transformation).
In lookup 2, I don't think it's necessary to perform the full cartesian product...
You can do it like this:
val firstjoin = pairs.map({ case (k1, k2) => (k1, (k1, k2)) })
  .join(data)
  .map({ case (_, ((k1, k2), v1)) => ((k1, k2), v1) })

val result = firstjoin.map({ case ((k1, k2), v1) => (k2, ((k1, k2), v1)) })
  .join(data)
  .map({ case (_, (((k1, k2), v1), v2)) => ((k1, k2), (v1, v2)) })
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)
val result = firstjoin.map({ case (x, y) => (x._2, (x, y)) })
  .join(data).map({ case (x, (y, z)) => (y._1, (y._2, z)) })
I don't think you can do it more efficiently, but I might be wrong...
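One tweak worth trying (my own suggestion, not part of the answer above): since data is joined twice, pre-partitioning and caching it lets both joins reuse the same partitioner instead of shuffling data again. A sketch:
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(data.partitions.length)
val partitionedData = data.partitionBy(partitioner).cache()

val pairsWithData = pairs.map { case (k1, k2) => (k1, (k1, k2)) }
  .join(partitionedData)                                         // attach v1 via k1
  .map { case (_, ((k1, k2), v1)) => (k2, ((k1, k2), v1)) }
  .join(partitionedData)                                         // attach v2 via k2
  .map { case (_, (((k1, k2), v1), v2)) => ((k1, k2), (v1, v2)) }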