How to merge 2 different RDDs in Spark using Scala

I am trying to merge 2 RDDs into one. Say rdd1 consists of 2 records, each with 2 string elements, e.g.:
key_A:value_A and Key_B:value_B
rdd2 consists of 1 record, also with 2 string elements:
key_C:value_c
My final RDD should look like this:
key_A:value_A , Key_B:value_B , key_C:value_c
I know we can use the union method of RDD, but it is not working for me. Please help.
Also, when using union of 2 RDDs, must the rows of the 2 different RDDs contain the same number of elements, or can their sizes differ?

Try with join:
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
See the associated section of the docs
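For a concrete picture, a minimal sketch (the sample pair RDDs below are illustrative, not taken from the question):
val left  = sc.parallelize(Seq(("key_A", "value_A"), ("Key_B", "value_B")))
val right = sc.parallelize(Seq(("key_A", "other_A"), ("key_C", "value_c")))

// join keeps only keys present on both sides: (key_A,(value_A,other_A))
left.join(right).foreach(println)

// fullOuterJoin keeps every key, wrapping missing sides in Option:
// (key_A,(Some(value_A),Some(other_A))), (Key_B,(Some(value_B),None)), (key_C,(None,Some(value_c)))
left.fullOuterJoin(right).foreach(println)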

union is working.
Sample code is:
// Two pair RDDs with the same element type (Int, Int)
val rdd = sparkContext.parallelize(1 to 10, 3)
val pairRDD = rdd.map { x => (x, x) }

val rdd1 = sparkContext.parallelize(11 to 20, 3)
val pairRDD1 = rdd1.map { x => (x, x) }

// union simply concatenates the two RDDs; no shuffle, no deduplication
pairRDD.union(pairRDD1).foreach(tuple => {
  println(tuple._1)
  println(tuple._2)
})
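Applied to the question's data, a sketch assuming the records are held as (String, String) tuples. Note that the two RDDs may contain different numbers of records; they only need the same element type:
val rddA = sparkContext.parallelize(Seq(("key_A", "value_A"), ("Key_B", "value_B")))
val rddB = sparkContext.parallelize(Seq(("key_C", "value_c")))

// union concatenates the records of the two RDDs
val merged = rddA.union(rddB)
merged.collect().foreach(println) // (key_A,value_A), (Key_B,value_B), (key_C,value_c)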

Related

How can I find, in a Spark RDD, adjacent elements whose difference is greater than a threshold?

I have a problem in Spark with Scala: I need the values where the difference between two adjacent elements is greater than a threshold. I create an RDD like this:
[2,3,5,8,19,3,5,89,20,17]
I want to subtract each pair of adjacent elements like this:
a.apply(1)-a.apply(0), a.apply(2)-a.apply(1), ..., a.apply(a.length-1)-a.apply(a.length-2)
If the result is greater than the threshold of 10, then output the element, like this:
[19,89]
How can I do this in Scala from an RDD?
If you have data as
val data = Seq(2,3,5,8,19,3,5,89,20,17)
you can create an RDD as
val rdd = sc.parallelize(data)
and what you want can then be achieved as follows:
// sliding over adjacent elements comes from spark-mllib's RDDFunctions
import org.apache.spark.mllib.rdd.RDDFunctions._

val finalrdd = rdd
  .sliding(2)                    // windows of two adjacent elements
  .map(x => (x(1), x(1) - x(0))) // (second element, difference from the previous one)
  .filter(y => y._2 > 10)        // keep only differences above the threshold
  .map(z => z._1)                // keep just the element itself
Doing
finalrdd.foreach(println)
should print
19
89
Alternatively, you can pair each element with its successor, giving tuples like (2,3), (3,5), (5,8), ..., and keep the second element whenever the subtracted result is greater than 10. Note that a plain rdd.zip of the RDD with a copy that drops its first element is not reliable, because zip requires both RDDs to have the same number of elements in every partition; keying both copies on an index and joining avoids that:
val rdd = spark.sparkContext.parallelize(Seq(2, 3, 5, 8, 19, 3, 5, 89, 20, 17))

// Key every element by its position, and a second copy by position - 1,
// so that joining on the key pairs each element with its successor
val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
val shifted = indexed.map { case (i, v) => (i - 1, v) }

indexed.join(shifted)
  .map { case (_, (current, next)) => (next - current, next) } // (difference, second element)
  .filter { case (diff, _) => diff > 10 }
  .map { case (_, element) => element }
  .foreach(println)
Hope this helps!

How to split a spark dataframe with equal records

I am using df.randomSplit(), but it is not splitting into equal numbers of rows. Is there any other way I can achieve this?
In my case I needed balanced (equal-sized) partitions in order to perform a specific cross-validation experiment.
For that you usually:
Randomize the dataset
Apply a modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; as far as I know there is still no transformation that separates a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:
val npartitions = 3

// rdd, seed and m_classIndex are assumed to be defined elsewhere; each instance is
// indexable (e.g. an Array) and carries its class label at position m_classIndex
val foldedRDD = rdd
  // Pair each instance with a random key (seeded by its position)
  .zipWithIndex
  .map(t => (t._1, t._2, new scala.util.Random(t._2 * seed).nextInt()))
  // Random ordering within each class
  .sortBy(t => (t._1(m_classIndex), t._3))
  // Assign each instance to a fold
  .zipWithIndex
  .map(t => (t._1, t._2 % npartitions))

val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter(_._2 == f)
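If you only need the instances themselves (each record above has the shape ((instance, originalIndex, randomKey), foldId)), a possible follow-up:
// Drop the bookkeeping and keep only the instances of each fold (illustrative)
val foldInstances = balancedRDDList.map(_.map { case ((instance, _, _), _) => instance })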

Spark Split RDD into chunks and concatenate

I have a relatively simple problem.
I have a large Spark RDD[String] (containing JSON). In my use case I want to group (concatenate) N strings together into a new RDD[String], so that it will have a size of oldRDD.size/N.
pseudo example:
val oldRDD : RDD[String] = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
val newRDD : RDD[String] = someTransformation(oldRDD, ",", 2)
newRDD = ['{"id": 1},{"id": 2}','{"id": 3},{"id": 4}']
val anotherRDD : RDD[String] = someTransformation(oldRDD, ",", 3)
anotherRDD = ['{"id": 1},{"id": 2},{"id": 3}','{"id": 4}']
I already looked for a similar case, but couldn't find anything.
Thanks!
Here you have to use the zipWithIndex function and then compute a group id for each element.
For example, with a group size of n = 2, the element at index 3 lands in group 3 / 2 = 1 (integer division), i.e. the second group when counting from 0. The code below uses a group size of 3.
val n = 3

val newRDD1 = oldRDD
  .zipWithIndex()             // creates tuples (element, index)
  .map(x => (x._2 / n, x._1)) // map to tuple (group, content)
  .reduceByKey(_ + ", " + _)  // merge the contents of each group
  .map(x => x._2)             // remove the key
One note: zipWithIndex uses the RDD's internal (partition) order, which may not correspond to any business-logic order. Check that this order is acceptable in your case; if not, sort the RDD first and then apply zipWithIndex.
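For instance, a minimal sketch that imposes an explicit order before indexing; the lexicographic sort on the strings is just an illustration, and extractSomeField stands for whatever business-logic key you might have:
val newRDD2 = oldRDD
  .sortBy(identity)           // or .sortBy(extractSomeField) for a business-logic order
  .zipWithIndex()             // (element, index) in the sorted order
  .map(x => (x._2 / n, x._1))
  .reduceByKey(_ + ", " + _)
  .map(x => x._2)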

Apache Spark Scala: How to maintain order of values while grouping an RDD by key

Maybe I am asking a very basic question, apologies for that, but I didn't find the answer on the internet. I have a paired RDD and want to use something like aggregateByKey to concatenate all the values by key. The value which occurs first in the input RDD should come first in the aggregated RDD.
Input RDD [Int, Int]
2 20
1 10
2 8
2 25
Output RDD (Aggregated RDD)
2 20 8 25
1 10
I tried aggregateByKey and groupByKey; both give me output, but the order of values is not maintained. Please suggest something.
Since groupByKey and aggregateByKey indeed cannot preserve order, you'll have to artificially add a "hint" to each record so that you can order by that hint yourself after the grouping:
import org.apache.spark.rdd.RDD

val input = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))

val withIndex: RDD[(Int, (Long, Int))] = input
  .zipWithIndex()                          // adds an index to each record, used to order the result
  .map { case ((k, v), i) => (k, (i, v)) } // restructure into (key, (index, value))

val result: RDD[(Int, List[Int])] = withIndex
  .groupByKey()
  .map { case (k, it) => (k, it.toList.sortBy(_._1).map(_._2)) } // order values by index and drop it
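A quick check against the question's sample input (the key order in the collected output may vary):
result.collect().foreach { case (k, vs) => println(s"$k ${vs.mkString(" ")}") }
// Expected:
// 2 20 8 25
// 1 10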

Spark: Efficient mass lookup in pair RDDs

In Apache Spark I have two RDDs. The first, data : RDD[(K,V)], contains data in key-value form. The second, pairs : RDD[(K,K)], contains a set of interesting key pairs of this data.
How can I efficiently construct an RDD pairsWithData : RDD[((K,K),(V,V))], such that it contains all the elements from pairs as the key tuple and their corresponding values (from data) as the value tuple?
Some properties of the data:
The keys in data are unique
All entries in pairs are unique
For all pairs (k1,k2) in pairs it is guaranteed that k1 <= k2
The size of pairs is only a constant factor times the size of data: |pairs| = O(|data|)
Current data sizes (expected to grow): |data| ~ 10^8, |pairs| ~ 10^10
Current attempts
Here is some example code in Scala:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._

// This kind of shows the idea, but fails at runtime.
def massPairLookup1(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  keyPairs map { case (k1, k2) =>
    val v1 : String = data lookup k1 head;
    val v2 : String = data lookup k2 head;
    ((k1, k2), (v1, v2))
  }
}

// Works but is O(|data|^2)
def massPairLookup2(keyPairs : RDD[(Int, Int)], data : RDD[(Int, String)]) = {
  // Construct all possible pairs of values
  val cartesianData = data cartesian data map { case ((k1, v1), (k2, v2)) => ((k1, k2), (v1, v2)) }
  // Select only the values whose keys are in keyPairs
  keyPairs map { (_, 0) } join cartesianData mapValues { _._2 }
}

// Example function that finds pairs of keys
// Runs in O(|data|) in real life, but cannot maintain the values
def relevantPairs(data : RDD[(Int, String)]) = {
  val keys = data map (_._1)
  keys cartesian keys filter { case (x, y) => x * y == 12 && x < y }
}

// Example run
val data = sc parallelize (1 to 12) map (x => (x, "Number " + x))
val pairs = relevantPairs(data)
val pairsWithData = massPairLookup2(pairs, data)

// Prints:
// ((1,12),(Number 1,Number 12))
// ((2,6),(Number 2,Number 6))
// ((3,4),(Number 3,Number 4))
pairsWithData.foreach(println)
Attempt 1
First I tried just using the lookup function on data, but that throws a runtime error when executed. It seems like self is null in the PairRDDFunctions trait.
In addition I am not sure about the performance of lookup. The documentation says: "This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to." This sounds like n lookups take O(n*|partition|) time at best, which I suspect could be optimized.
Attempt 2
This attempt works, but I create |data|^2 pairs which will kill performance. I do not expect Spark to be able to optimize that away.
Your lookup 1 doesn't work because you cannot perform RDD transformations inside workers (inside another transformation).
In the lookup 2, I don't think it's necessary to perform a full cartesian product...
You can do it like this:
val firstjoin = pairs.map({ case (k1, k2) => (k1, (k1, k2)) })
  .join(data)
  .map({ case (_, ((k1, k2), v1)) => ((k1, k2), v1) })

val result = firstjoin.map({ case ((k1, k2), v1) => (k2, ((k1, k2), v1)) })
  .join(data)
  .map({ case (_, (((k1, k2), v1), v2)) => ((k1, k2), (v1, v2)) })
Or in a more dense form:
val firstjoin = pairs.map(x => (x._1, x)).join(data).map(_._2)

val result = firstjoin.map({ case (x, y) => (x._2, (x, y)) })
  .join(data).map({ case (x, (y, z)) => (y._1, (y._2, z)) })
I don't think you can do it more efficiently, but I might be wrong...