Join two ordinary RDDs with/without Spark SQL - scala

I need to join two ordinary RDDs on one or more columns. Logically this operation is equivalent to a database join of two tables. I wonder whether this is possible only through Spark SQL or whether there are other ways of doing it.
As a concrete example, consider
RDD r1 with primary key ITEM_ID:
(ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)
and RDD r2 with primary key COMPANY_ID:
(COMPANY_ID, COMPANY_NAME, COMPANY_CITY)
I want to join r1 and r2.
How can this be done?

Soumya Simanta gave a good answer. However, the values in the joined RDD are Iterable, so the result does not look much like an ordinary table join.
Alternatively, you can key each RDD by the join column and join the pairs directly:
val mappedItems = items.map(item => (item.companyId, item))
val mappedComp = companies.map(comp => (comp.companyId, comp))
mappedItems.join(mappedComp).take(10).foreach(println)
The output would be:
(c1,(Item(1,first,2,c1),Company(c1,company-1,city-1)))
(c1,(Item(2,second,2,c1),Company(c1,company-1,city-1)))
(c2,(Item(3,third,2,c2),Company(c2,company-2,city-2)))

(Using Scala)
Let's say you have two RDDs:
emp: (empid, ename, dept)
dept: (dname, dept)
One way is to move the join key into the first position of each tuple and then join:
//val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)))
val emp = sc.parallelize(Seq(("jordan",10), ("ricky",20), ("matt",30), ("mince",35), ("rhonda",30)))
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)))
//val shifted_fields_emp = emp.map(t => (t._3, t._1, t._2))
val shifted_fields_emp = emp.map(t => (t._2, t._1))
val shifted_fields_dept = dept.map(t => (t._2,t._1))
shifted_fields_emp.join(shifted_fields_dept)
// Alternatively, use keyBy so that the whole record is kept as the value.
// Create emp RDD
val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)))
// Create dept RDD
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)))
// Establishing that the third field is to be considered as the Key for the emp RDD
val manipulated_emp = emp.keyBy(t => t._3)
// Establishing that the second field need to be considered as the Key for dept RDD
val manipulated_dept = dept.keyBy(t => t._2)
// Inner Join
val join_data = manipulated_emp.join(manipulated_dept)
// Left Outer Join
val left_outer_join_data = manipulated_emp.leftOuterJoin(manipulated_dept)
// Right Outer Join
val right_outer_join_data = manipulated_emp.rightOuterJoin(manipulated_dept)
// Full Outer Join
val full_outer_join_data = manipulated_emp.fullOuterJoin(manipulated_dept)
// Format the joined data into flat tuples (using map) so it is easier to read
val cleaned_joined_data = join_data.map(t => (t._2._1._1, t._2._1._2, t._1, t._2._2._1))
This gives the following output:
// Print cleaned_joined_data on the console
scala> cleaned_joined_data.collect()
res13: Array[(Int, String, Int, String)] = Array((3,matt,30,hive), (5,rhonda,30,hive), (2,ricky,20,spark), (1,jordan,10,hadoop))
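The outer joins above return Option values on the side that may be missing, which the inner-join formatting step does not have to handle. A minimal sketch of one way to flatten a left outer join, assuming Spark is on the classpath; the object name, the reduced sample data, and the "unknown" placeholder are illustrative choices, not part of the answer:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LeftOuterJoinSketch {
  // Returns (empid, ename, dept, dname); employees whose department
  // has no match get the placeholder name "unknown".
  def run(sc: SparkContext): List[(Int, String, Int, String)] = {
    val emp  = sc.parallelize(Seq((1, "jordan", 10), (4, "mince", 35)))
    val dept = sc.parallelize(Seq(("hadoop", 10), ("sqoop", 40)))

    // Value type after the join: ((Int, String, Int), Option[(String, Int)])
    val joined = emp.keyBy(_._3).leftOuterJoin(dept.keyBy(_._2))

    joined.map { case (deptId, ((empId, name, _), deptOpt)) =>
      (empId, name, deptId, deptOpt.map(_._1).getOrElse("unknown"))
    }.collect().sortBy(_._1).toList
  }

  def main(args: Array[String]): Unit = {
    // local[1] is only for trying the sketch out on one machine.
    val conf = new SparkConf().setAppName("left-outer-join-sketch").setMaster("local[1]")
    val sc = SparkContext.getOrCreate(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

In spark-shell, where sc already exists, calling LeftOuterJoinSketch.run(sc) should be enough.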

Something like this should work.
scala> case class Item(id:String, name:String, unit:Int, companyId:String)
scala> case class Company(companyId:String, name:String, city:String)
scala> val i1 = Item("1", "first", 2, "c1")
scala> val i2 = i1.copy(id="2", name="second")
scala> val i3 = i1.copy(id="3", name="third", companyId="c2")
scala> val items = sc.parallelize(List(i1,i2,i3))
items: org.apache.spark.rdd.RDD[Item] = ParallelCollectionRDD[14] at parallelize at <console>:20
scala> val c1 = Company("c1", "company-1", "city-1")
scala> val c2 = Company("c2", "company-2", "city-2")
scala> val companies = sc.parallelize(List(c1,c2))
scala> val groupedItems = items.groupBy( x => x.companyId)
groupedItems: org.apache.spark.rdd.RDD[(String, Iterable[Item])] = ShuffledRDD[16] at groupBy at <console>:22
scala> val groupedComp = companies.groupBy(x => x.companyId)
groupedComp: org.apache.spark.rdd.RDD[(String, Iterable[Company])] = ShuffledRDD[18] at groupBy at <console>:20
scala> groupedItems.join(groupedComp).take(10).foreach(println)
14/12/12 00:52:32 INFO DAGScheduler: Job 5 finished: take at <console>:35, took 0.021870 s
(c1,(CompactBuffer(Item(1,first,2,c1), Item(2,second,2,c1)),CompactBuffer(Company(c1,company-1,city-1))))
(c2,(CompactBuffer(Item(3,third,2,c2)),CompactBuffer(Company(c2,company-2,city-2))))

Spark SQL can perform joins on Spark RDDs.
The code below performs a SQL join on the Company and Item RDDs:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSQLJoin {
  case class Item(id:String, name:String, unit:Int, companyId:String)
  case class Company(companyId:String, name:String, city:String)
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    // Implicit conversion from an RDD of case classes to a SchemaRDD
    // (SchemaRDD API from early Spark 1.x releases)
    import sqlContext.createSchemaRDD
    val i1 = Item("1", "first", 1, "c1")
    val i2 = Item("2", "second", 2, "c2")
    val i3 = Item("3", "third", 3, "c3")
    val c1 = Company("c1", "company-1", "city-1")
    val c2 = Company("c2", "company-2", "city-2")
    val companies = sc.parallelize(List(c1,c2))
    companies.registerAsTable("companies")
    val items = sc.parallelize(List(i1,i2,i3))
    items.registerAsTable("items")
    val result = sqlContext.sql("SELECT * FROM companies C JOIN items I ON C.companyId= I.companyId").collect
    result.foreach(println)
  }
}
The output is:
[c1,company-1,city-1,1,first,1,c1]
[c2,company-2,city-2,2,second,2,c2]

Related

Separate RDD Based on Membership of Id in Another RDD

I have two case classes and one RDD of each.
case class Thing1(Id: String, a: String, b: String, c: java.util.Date, d: Double)
case class Thing2(Id: String, e: java.util.Date, f: Double)
val rdd1 = // Loads an rdd of type RDD[Thing1]
val rdd2 = // Loads an rdd of type RDD[Thing2]
I want to create two new RDD[Thing1]s: one that contains the elements of rdd1 whose Id is present in rdd2, and another that contains the elements of rdd1 whose Id is not present in rdd2.
Here's what I have tried (I looked at this, "Scala Spark contains vs. does not contain", and other Stack Overflow posts, but none have worked):
val rdd2_ids = rdd2.map(r => r.Id)
val rdd1_present = rdd1.filter{case r => rdd2 contains r.Id}
val rdd1_absent = rdd1.filter{case r => !(rdd2 contains r.Id)}
But this gives me the error: value contains is not a member of org.apache.spark.rdd.RDD[String]
I have seen many questions on SO asking how to do similar things, but none of the answers have worked for me. I often get the error: value _____ is not a member of org.apache.spark.rdd.RDD[String].
Why are these other answers not working for me, and how can I achieve what I am trying to do?
I created two simple RDDs
val rdd1 = sc.parallelize(Array(
| Thing1(1,2),
| Thing1(2,3),
| Thing1(3,4) ))
rdd1: org.apache.spark.rdd.RDD[Thing1] = ParallelCollectionRDD[174] at parallelize
val rdd2 = sc.parallelize(Array(
| Thing2(1, "Two"),
| Thing2(2, "Three" ) ))
rdd2: org.apache.spark.rdd.RDD[Thing2] = ParallelCollectionRDD[175] at parallelize
Now you can key both RDDs by the field they share and join them to find the elements common to both:
val rdd1_present = rdd1.keyBy(_.a).join(rdd2.keyBy(_.a) ).map{ case(a, (t1, t2) ) => t1 }
//rdd1_present.collect
//Array[Thing1] = Array(Thing1(2,3), Thing1(1,2))
val rdd1_absent = rdd1.keyBy(_.a).subtractByKey(rdd1_present.keyBy(_.a) ).map{ case(a,t1) => t1 }
//rdd1_absent.collect
//Array[Thing1] = Array(Thing1(3,4))
Try a full outer join:
val joined = rdd1.map(s=>(s.Id,s)).fullOuterJoin(rdd2.map(s=>(s.Id,s))).cache()
//only in left
joined.filter(s=> s._2._2.isEmpty).foreach(println)
//only in right
joined.filter(s=>s._2._1.isEmpty).foreach(println)
//in both
joined.filter(s=> !s._2._1.isEmpty && !s._2._2.isEmpty).foreach(println)
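The original filter failed because an RDD has no contains method and cannot be referenced inside another RDD's closure. When the ids of rdd2 fit in driver memory, a different technique from the joins above is to collect them into a Set and broadcast it. The sketch below assumes exactly that; the simplified case classes and the object and method names are hypothetical stand-ins:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MembershipSplitSketch {
  // Simplified stand-ins for the question's case classes.
  case class Thing1(Id: String, a: String)
  case class Thing2(Id: String, f: Double)

  // Returns (elements of rdd1 whose Id is in rdd2, elements whose Id is not).
  def split(sc: SparkContext,
            rdd1: RDD[Thing1],
            rdd2: RDD[Thing2]): (RDD[Thing1], RDD[Thing1]) = {
    // Assumption: the distinct ids of rdd2 are small enough to collect to the driver.
    val ids = sc.broadcast(rdd2.map(_.Id).distinct().collect().toSet)
    val present = rdd1.filter(t => ids.value.contains(t.Id))
    val absent  = rdd1.filter(t => !ids.value.contains(t.Id))
    (present, absent)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("membership-split-sketch").setMaster("local[1]")
    val sc = SparkContext.getOrCreate(conf)
    try {
      val rdd1 = sc.parallelize(Seq(Thing1("1", "x"), Thing1("2", "y"), Thing1("3", "z")))
      val rdd2 = sc.parallelize(Seq(Thing2("1", 1.0), Thing2("2", 2.0)))
      val (present, absent) = split(sc, rdd1, rdd2)
      present.collect().foreach(println)
      absent.collect().foreach(println)
    } finally sc.stop()
  }
}
```

If rdd2 is too large to collect, the join and subtractByKey approach shown earlier avoids bringing the ids to the driver.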

spark: join rdd based on sequence of another rdd

I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with three columns: id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output for id1, after an outer join with the lookup table, should be:
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should contain the id followed by all of its counts:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item,count)) }.map(row=>(row._1,row._2)).groupByKey()
get(line).map(l=>(l._1,l._2)).mapValues(item_count=>lookup_rdd.leftOuterJoin(item_count))
def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}
Try the code below. Run each step in your local console to see what is happening in detail.
The idea is to use zipWithIndex to build index sequences from lookup_rdd and from the distinct ids:
(i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Index of the final result = [delta (the length of lookup_rdd) * index of id1..id2] + index of i1...i5
So the base sequence generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2))
and the counts are then reduced by the key (i1,id1) and attached to it.
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map( x => {
x._2._2 match {
case Some(x1) => (x._2._1._1, (x._1,x._2._1._2+x1))
case None => (x._2._1._1, (x._1,x._2._1._2))
}
})
val res97 = res91.sortByKey(true).map( x => {
(x._2._1._2,List(x._2._2))}).reduceByKey(_++_)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))

how to join two datasets by key in scala spark

I have two datasets, and each dataset has two elements.
Examples are below.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
Above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse:String => (String, String) = s => {
val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
s match {
case regex(k,v) => (k,v)
case _ => ("","")
}
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs:
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
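The parse function above returns ("", "") for lines that do not match, so all malformed lines would end up sharing one empty join key. A hedged variant, using the same regular expression but returning an Option, is sketched below; the object and method names are made up for illustration, and the naiveSplit helper only shows why split(",") is not enough:

```scala
object PairParser {
  // Same pattern as in the answer: ('key', 'value'), where the key may contain commas.
  private val pairPattern = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r

  // Returns None for lines that do not match instead of an ("", "") placeholder.
  def parseLine(s: String): Option[(String, String)] = s match {
    case pairPattern(k, v) => Some((k, v))
    case _                 => None
  }

  // Naive alternative, kept only for comparison: it cuts the key at its comma.
  def naiveSplit(s: String): List[String] = s.split(",").toList

  def main(args: Array[String]): Unit = {
    val sample = "('abc,def', 'monkey(1)')"
    println(naiveSplit(sample)) // three fragments, the key is broken apart
    println(parseLine(sample))  // the key stays intact
    println(parseLine("not a pair"))
  }
}
```

With an RDD of lines, something like rdd.flatMap(line => PairParser.parseLine(line)) should then drop unparseable lines before the join rather than joining them on an empty key.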
You have to create pair RDDs from your datasets first and then apply the join transformation. The sample data in your question is hard to parse as shown,
so please consider the simpler example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your Scala code should look like this:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

Make .csv files from arrays in Scala

I have two tuples in Scala of the following form:
val array1 = (bucket1, Seq((dateA, Amount11), (dateB, Amount12), (dateC, Amount13)))
val array2 = (bucket2, Seq((dateA, Amount21), (dateB, Amount22), (dateC, Amount23)))
What is the quickest way to make a .csv file in Scala such that:
date* is pivot.
bucket* is column name.
Amount* fill the table.
It needs to look something like this:
Dates______________bucket1__________bucket2
dateA______________Amount11________Amount21
dateB______________Amount12________Amount22
dateC______________Amount13________Amount23
You can make it shorter by chaining some operations, but here it is step by step:
scala> val array1 = ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13")))
array1: (String, Seq[(String, String)]) =
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)))
scala> val array2 = ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
array2: (String, Seq[(String, String)]) =
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
// Single array to work with
scala> val arrays = List(array1, array2)
arrays: List[(String, Seq[(String, String)])] = List(
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13))),
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
)
// Split between buckets and the values
scala> val (buckets, values) = arrays.unzip
buckets: List[String] = List(bucket1, bucket2)
values: List[Seq[(String, String)]] = List(
List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)),
List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23))
)
// Format the data
// Note that this does not keep the 'dateX' order
scala> val grouped = values.flatten
.groupBy(_._1)
.map { case (date, list) => date::(list.map(_._2)) }
grouped: scala.collection.immutable.Iterable[List[String]] = List(
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join everything, and add the "Dates" column in front of the buckets
scala> val table = ("Dates"::buckets)::grouped.toList
table: List[List[String]] = List(
List(Dates, bucket1, bucket2),
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join the rows by ',' and the lines by "\n"
scala> val string = table.map(_.mkString(",")).mkString("\n")
string: String =
Dates,bucket1,bucket2
dateC,Amount13,Amount23
dateB,Amount12,Amount22
dateA,Amount11,Amount21
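As noted in the comments above, groupBy does not keep the dateX order, and the session stops at a String rather than a file. The sketch below is one possible variation that keeps dates in order of first appearance and writes the result to disk; the object and method names are illustrative, and a missing amount is rendered as an empty cell by assumption:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object CsvPivotSketch {
  type Bucket = (String, Seq[(String, String)])

  // Builds the CSV text: one row per date, one column per bucket.
  def toCsv(arrays: List[Bucket]): String = {
    // Dates in order of first appearance, so rows follow the input order.
    val dates = arrays.flatMap(_._2.map(_._1)).distinct
    // One lookup table per bucket: date -> amount.
    val lookups = arrays.map { case (_, rows) => rows.toMap }
    val header = ("Dates" :: arrays.map(_._1)).mkString(",")
    val lines = dates.map { d =>
      // Assumption: a bucket without a value for this date gets an empty cell.
      (d :: lookups.map(_.getOrElse(d, ""))).mkString(",")
    }
    (header :: lines).mkString("\n")
  }

  // Writes the CSV text to the given path and returns that path.
  def write(arrays: List[Bucket], out: Path): Path =
    Files.write(out, toCsv(arrays).getBytes(StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    val array1 = ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13")))
    val array2 = ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
    // A temporary file is used here; any writable path would do.
    val out = write(List(array1, array2), Files.createTempFile("pivot", ".csv"))
    println(s"wrote $out")
  }
}
```

Values containing commas or quotes would need CSV escaping, which this sketch does not attempt.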

spark join operation based on two columns

I'm trying to join two datasets based on two columns. It works when I join on one column, but with two columns it fails with the error below:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code :
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect
val ranges = partitionSizes.foldLeft(List((0, 0))) { case(accList, count) =>
val start = accList.head._2
val end = start + count
(start, end) :: accList
}.reverse.tail.toArray
rdd.mapPartitionsWithIndex( (index, partition) => {
val start = ranges(index)._1
val end = ranges(index)._2
val indexes = Iterator.range(start, end)
partition.zip(indexes)
})
}
val dimension = sc.
textFile("dimension.txt").
map{ line =>
val parts = line.split("\t")
(parts(0),parts(1),parts(2),parts(3),parts(4),parts(5))
}
val dimensionWithSK =
zipWithIndex(dimension).map { case((nk1,nk2,prop3,prop4,prop5,prop6), idx) => (nk1,nk2,(prop3,prop4,prop5,prop6,idx + nextSurrogateKey)) }
val fact = sc.
textFile("fact.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
(parts(0),parts(1), (parts(2),parts(3), parts(4),parts(5),parts(6).toDouble))
}
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Request someone's help here..
Thanks
Sridhar
If you look at the signature of join it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first two elements of the tuple, so you need to map your triple to a pair whose first element is itself a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
case (key1, key2, value) => ((key1, key2), value)
}
.join(
right.map {
case (key1, key2, value) => ((key1, key2), value)
})
This will give you an RDD of the form RDD[((String, String), (V1, V2))]
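A small runnable sketch of the same composite-key idea, assuming Spark is on the classpath. The sample rows, the concrete value types (Double and Long), and the object name are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CompositeKeyJoinSketch {
  // Joins two RDDs of triples on their first two fields.
  def run(sc: SparkContext): List[((String, String), (Double, Long))] = {
    // Hypothetical sample rows: (key1, key2, value)
    val left  = sc.parallelize(Seq(("a", "x", 1.0), ("b", "y", 2.0)))
    val right = sc.parallelize(Seq(("a", "x", 100L), ("b", "z", 200L)))

    // Re-shape each triple into ((key1, key2), value) so that join is available.
    val keyedLeft  = left.map { case (k1, k2, v) => ((k1, k2), v) }
    val keyedRight = right.map { case (k1, k2, w) => ((k1, k2), w) }

    // Only ("a", "x") appears on both sides, so the inner join keeps one row.
    keyedLeft.join(keyedRight).collect().toList
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("composite-key-join-sketch").setMaster("local[1]")
    val sc = SparkContext.getOrCreate(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

The ("b", "y") and ("b", "z") rows share only their first key, so they are not matched; that is the behaviour a two-column join is meant to give.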
rdd1 Schema :
field1,field2, field3, fieldX,.....
rdd2 Schema :
field1, field2, field3, fieldY,.....
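// Note: join(other, Seq(columns), joinType) is the DataFrame/Dataset API,
// not RDD.join, so rdd1 and rdd2 must be DataFrames here.
val joinResult = rdd1.join(rdd2,
Seq("field1", "field2", "field3"), "outer")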
joinResult schema :
field1, field2, field3, fieldX, fieldY, ......
val emp = sc.
textFile("emp.txt").
map { line =>
val parts = line.split("\t")
// key each record by (column 0, column 2) so that the join
// matches on both columns; column 1 is kept as the value.
((parts(0), parts(2)),parts(1))
}
val emp_new = sc.
textFile("emp_new.txt").
map { line =>
val parts = line.split("\t")
// same composite key (column 0, column 2) as for emp above,
// with column 1 as the value.
((parts(0), parts(2)),parts(1))
}
val finalemp =
emp_new.join(emp).
map { case((nk1,nk2) ,((parts1), (val1))) => (nk1,parts1,val1) }