How to join two datasets by key in Scala Spark - scala

I have two datasets, and each dataset has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have been trying to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The above code is not working!

join is defined on RDDs of pairs, that is, RDDs of type RDD[(K, V)].
The first step is to transform the input data into the right type: we need to turn the original records of type String into (key, value) pairs:
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _           => ("", "")
  }
}
(Note that we can't use a simple split(",") expression because the key contains commas)
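For instance, applying parse to one of the sample records keeps the comma inside the key intact (a quick check of the regex above):
parse("('abc,def', 'monkey(1)')")
// (String, String) = (abc,def,monkey(1))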
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs:
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
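If you want the result in the (name, animal, fruit) shape from the question, a small follow-up map over the joined RDD does it (a minimal sketch reusing the values above):
val result = joined.map { case (name, (animal, fruit)) => (name, animal, fruit) }
result.collect
// Array[(String, String, String)] = Array((abc,def,monkey(1),banana(1)))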

You have to create pair RDDs from your datasets first, and then apply the join transformation. Your datasets do not look accurate.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your Scala code should look like this:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

Related

Separate RDD Based on Membership of Id in Another RDD

I have two case classes and one RDD of each.
case class Thing1(Id: String, a: String, b: String, c: java.util.Date, d: Double)
case class Thing2(Id: String, e: java.util.Date, f: Double)
val rdd1 = // Loads an rdd of type RDD[Thing1]
val rdd2 = // Loads an rdd of type RDD[Thing2]
I want to create two new RDD[Thing1]s: one that contains the elements of rdd1 whose Id is present in rdd2, and another that contains the elements of rdd1 whose Id is not present in rdd2.
Here's what I have tried (I looked at this, Scala Spark contains vs. does not contain, and other Stack Overflow posts, but none have worked):
val rdd2_ids = rdd2.map(r => r.Id)
val rdd1_present = rdd1.filter{case r => rdd2 contains r.Id}
val rdd1_absent = rdd1.filter{case r => !(rdd2 contains r.Id)}
But this gets me the error: value contains is not a member of org.apache.spark.rdd.RDD[String]
I have seen many questions on SO asking how to do similar things to what I am trying to do, but none have worked for me. I get the value _____ is not a member of org.apache.spark.rdd.RDD[String] error a lot.
Why are these other answers not working for me, and how can I achieve what I am trying to do?
I created two simplified RDDs (with the case classes cut down to just the fields needed here):
val rdd1 = sc.parallelize(Array(
| Thing1(1,2),
| Thing1(2,3),
| Thing1(3,4) ))
rdd1: org.apache.spark.rdd.RDD[Thing1] = ParallelCollectionRDD[174] at parallelize
val rdd2 = sc.parallelize(Array(
| Thing2(1, "Two"),
| Thing2(2, "Three" ) ))
rdd2: org.apache.spark.rdd.RDD[Thing2] = ParallelCollectionRDD[175] at parallelize
Now you can join them on the field whose values you want to match in both:
val rdd1_present = rdd1.keyBy(_.a).join(rdd2.keyBy(_.a)).map { case (a, (t1, t2)) => t1 }
// rdd1_present.collect
// Array[Thing1] = Array(Thing1(2,3), Thing1(1,2))
val rdd1_absent = rdd1.keyBy(_.a).subtractByKey(rdd1_present.keyBy(_.a)).map { case (a, t1) => t1 }
// rdd1_absent.collect
// Array[Thing1] = Array(Thing1(3,4))
Try a full outer join:
val joined = rdd1.map(s => (s.Id, s)).fullOuterJoin(rdd2.map(s => (s.Id, s))).cache()
//only in left
joined.filter(s=> s._2._2.isEmpty).foreach(println)
//only in right
joined.filter(s=>s._2._1.isEmpty).foreach(println)
//in both
joined.filter(s=> !s._2._1.isEmpty && !s._2._2.isEmpty).foreach(println)
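To get back the two RDD[Thing1]s from the original question, the same joined RDD can then be split on the optional values (a sketch, assuming the Thing1/Thing2 case classes from the question):
// Id present in both rdd1 and rdd2
val rdd1_present = joined.filter(s => s._2._1.isDefined && s._2._2.isDefined).map(_._2._1.get)
// Id present in rdd1 but not in rdd2
val rdd1_absent = joined.filter(s => s._2._2.isEmpty).flatMap(_._2._1)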

how to merge RDD tuples

I want to use reduceByKey to merge many tuples with the same key.
Here is the code:
import scala.util.Random
import breeze.linalg.DenseMatrix   // assuming Breeze's DenseMatrix is the one used here

val data = Array(DenseMatrix((2.0, 1.0, 5.0), (4.0, 3.0, 6.0)),
                 DenseMatrix((7.0, 8.0, 9.0), (10.0, 12.0, 11.0)))
val init = sc.parallelize(data, 2)

// getColumn
def getColumn(v: DenseMatrix[Double]): Map[Int, IndexedSeq[(Int, Double)]] = {
  val r = Random
  val index = 0 to v.size - 1
  def func(x: Int, y: DenseMatrix[Double]): (Int, (Int, Double)) =
    (x, (r.nextInt(10), y.valueAt(x)))
  val rest = index.map { x => func(x, v) }.groupBy(x => x._1).mapValues(x => x.map(_._2))
  rest
}

val out = init.flatMap { v => getColumn(v) }
val reduceOutput = out.reduceByKey(_ ++ _)   // was tmp.reduceByKey, but tmp is not defined; out is the RDD built above
val out2 = out.map { case (k, v) => k }.collect() // these keys are not what I want
I attached two screenshots: the first shows the [key, value] pairs I thought I would get, and the second shows the actual keys; they are not what I want, so the output is not right.
What should I do?
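For reference, reduceByKey(_ ++ _) concatenates the per-key sequences pairwise; a minimal, self-contained sketch with made-up values, independent of the DenseMatrix code above:
val pairs = sc.parallelize(Seq(
  (0, IndexedSeq((1, 2.0))),
  (1, IndexedSeq((3, 4.0))),
  (0, IndexedSeq((5, 6.0)))))
// all sequences sharing a key end up concatenated into one
val merged = pairs.reduceByKey(_ ++ _)
// merged.collect gives e.g. (0,Vector((1,2.0), (5,6.0))) and (1,Vector((3,4.0)))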

How to filter the data in spark-shell using scala?

I have the data below, which needs to be filtered using Spark (Scala) in such a way that I only get the id of a person who visited "Walmart" but not "Bestbuy". Stores can repeat because a person can visit a store any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I got the output using DataFrames and running SQL queries on the Spark context, but is there any way to do this using groupByKey/reduceByKey etc. without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD is formed, and I am having difficulty filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(id: Int, store: String)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1).trim))
people.registerTempTable("people")
val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in ('Walmart','Bestbuy') group by id having count(distinct store) = 1) sample on people.id = sample.id and people.store = 'Walmart'")
The code which I am trying now is this, but I am stuck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .map(x => (x.split(",")(0), x.split(",")(1)))
  .filter(_._1 != "id")   // drop the header row
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map { case (x, y) =>
  val url = y.flatMap(x => x.split(",")).toList
  if (!url.contains("Bestbuy") && url.contains("Walmart")) {
    x.map(x => (x, y))
  }
}
If I do dataFiltered.collect(), I get
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
// keep only lists that contain Walmart but do not contain Bestbuy:
case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))
// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3, Walmart)
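Since the question specifically asks about groupByKey/reduceByKey, here is an alternative sketch (my variation, not from the answer above) that reduces to a Set of stores per id instead of grouping:
// data is the pair RDD of (id, store) built earlier
val storesById = data.mapValues(Set(_)).reduceByKey(_ ++ _)
val result2 = storesById
  .filter { case (_, stores) => stores.contains("Walmart") && !stores.contains("Bestbuy") }
  .flatMap { case (id, stores) => stores.map(store => (id, store)) }
result2.foreach(println) // expected: (3,Walmart)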
I also tried it another way, and it worked out:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.contains("id"))   // drop the header row
  .map(x => (x.split(",")(0), x.split(",")(1)))
data.cache()
val dataWalmart = data.filter { case (x, y) => y.contains("Walmart") }.distinct()
val dataBestbuy = data.filter { case (x, y) => y.contains("Bestbuy") }.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)
data.unpersist()

How to join two RDDs by key to get RDD of (String, String)?

I have two paired RDDs of the form RDD[(String, mutable.HashSet[String])]:
For example:
rdd1: 332101231222, "320758, 320762, 320760, 320759, 320757, 320761"
rdd2: 332101231222, "220758, 220762, 220760, 220759, 220757, 220761"
I want to combine rdd1 and rdd2 based on common keys, so the output should look like:
332101231222 320758, 320762, 320760, 320759, 320757, 320761 220758, 220762, 220760, 220759, 220757, 220761
Here is my code:
import scala.collection.mutable
import org.apache.spark.rdd.RDD

def cogroupTest(rdd1: RDD[(String, mutable.HashSet[String])], rdd2: RDD[(String, mutable.HashSet[String])]): Unit = {
  val prods_per_user_co_grouped = rdd1.cogroup(rdd2)
  prods_per_user_co_grouped.map { case (key: String, (value1: mutable.HashSet[String], value2: mutable.HashSet[String])) =>
    val combinedhs = value1 ++ value2
    val sstr = combinedhs.mkString("\t")
    val keypadded = key + "\t"
    s"$keypadded$sstr"
  }.saveAsTextFile("/scratch/rdds_joined/")
}
Here is the error that I get when I run my program:
scala.MatchError: (32101231222,(CompactBuffer(Set(320758, 320762, 320760, 320759, 320757, 320761)),CompactBuffer(Set(220758, 220762, 220760, 220759, 220757, 220761)))) (of class scala.Tuple2)
Any help with this will be great!
As you might guess from the name, cogroup groups observations by key. That means in your case you get:
(String, (Iterable[mutable.HashSet[String]], Iterable[mutable.HashSet[String]]))
not
(String, (mutable.HashSet[String], mutable.HashSet[String]))
This is pretty clear when you take a look at the error you get. If you want to combine pairs, you should use the join method. If not, you should adjust the pattern to match the structure you actually get and then use something like this:
val combinedhs = value1.reduce(_ ++ _) ++ value2.reduce(_ ++ _)
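For example, the map inside cogroupTest could be adjusted like this (a sketch; foldLeft is used instead of reduce so that a key present on only one side does not fail on an empty Iterable):
val prods_per_user_co_grouped = rdd1.cogroup(rdd2)
prods_per_user_co_grouped.map { case (key, (values1, values2)) =>
  // each side is an Iterable[mutable.HashSet[String]]; fold the sets together
  val combinedhs = (values1 ++ values2).foldLeft(Set.empty[String])(_ ++ _)
  s"$key\t${combinedhs.mkString("\t")}"
}.saveAsTextFile("/scratch/rdds_joined/")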

Join two ordinary RDDs with/without Spark SQL

I need to join two ordinary RDDs on one or more columns. Logically, this operation is equivalent to a database join of two tables. I wonder whether this is possible only through Spark SQL, or whether there are other ways of doing it.
As a concrete example, consider
RDD r1 with primary key ITEM_ID:
(ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)
and RDD r2 with primary key COMPANY_ID:
(COMPANY_ID, COMPANY_NAME, COMPANY_CITY)
I want to join r1 and r2.
How can this be done?
Soumya Simanta gave a good answer. However, the values in the joined RDD are Iterables, so the results may not look much like an ordinary table join.
Alternatively, you can:
val mappedItems = items.map(item => (item.companyId, item))
val mappedComp = companies.map(comp => (comp.companyId, comp))
mappedItems.join(mappedComp).take(10).foreach(println)
The output would be:
(c1,(Item(1,first,2,c1),Company(c1,company-1,city-1)))
(c1,(Item(2,second,2,c1),Company(c1,company-1,city-1)))
(c2,(Item(3,third,2,c2),Company(c2,company-2,city-2)))
(Using Scala)
Let's say you have two RDDs:
emp: (empid, ename, dept)
dept: (dname, dept)
The following is another way:
//val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)))
val emp = sc.parallelize(Seq(("jordan",10), ("ricky",20), ("matt",30), ("mince",35), ("rhonda",30)))
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)))
//val shifted_fields_emp = emp.map(t => (t._3, t._1, t._2))
val shifted_fields_emp = emp.map(t => (t._2, t._1))
val shifted_fields_dept = dept.map(t => (t._2,t._1))
shifted_fields_emp.join(shifted_fields_dept)
// Create emp RDD
val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)))
// Create dept RDD
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)))
// Establishing that the third field is to be considered as the Key for the emp RDD
val manipulated_emp = emp.keyBy(t => t._3)
// Establishing that the second field needs to be considered as the Key for the dept RDD
val manipulated_dept = dept.keyBy(t => t._2)
// Inner Join
val join_data = manipulated_emp.join(manipulated_dept)
// Left Outer Join
val left_outer_join_data = manipulated_emp.leftOuterJoin(manipulated_dept)
// Right Outer Join
val right_outer_join_data = manipulated_emp.rightOuterJoin(manipulated_dept)
// Full Outer Join
val full_outer_join_data = manipulated_emp.fullOuterJoin(manipulated_dept)
// Formatting the joined data for better readability (using map)
val cleaned_joined_data = join_data.map(t => (t._2._1._1, t._2._1._2, t._1, t._2._2._1))
This will give the output as:
// Print the output cleaned_joined_data on the console
scala> cleaned_joined_data.collect()
res13: Array[(Int, String, Int, String)] = Array((3,matt,30,hive), (5,rhonda,30,hive), (2,ricky,20,spark), (1,jordan,10,hadoop))
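The same reshaping can also be written with a pattern match instead of positional accessors, which some find easier to read (a sketch over the same join_data):
val cleaned_joined_data2 = join_data.map {
  case (deptId, ((empId, ename, _), (dname, _))) => (empId, ename, deptId, dname)
}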
Something like this should work.
scala> case class Item(id:String, name:String, unit:Int, companyId:String)
scala> case class Company(companyId:String, name:String, city:String)
scala> val i1 = Item("1", "first", 2, "c1")
scala> val i2 = i1.copy(id="2", name="second")
scala> val i3 = i1.copy(id="3", name="third", companyId="c2")
scala> val items = sc.parallelize(List(i1,i2,i3))
items: org.apache.spark.rdd.RDD[Item] = ParallelCollectionRDD[14] at parallelize at <console>:20
scala> val c1 = Company("c1", "company-1", "city-1")
scala> val c2 = Company("c2", "company-2", "city-2")
scala> val companies = sc.parallelize(List(c1,c2))
scala> val groupedItems = items.groupBy( x => x.companyId)
groupedItems: org.apache.spark.rdd.RDD[(String, Iterable[Item])] = ShuffledRDD[16] at groupBy at <console>:22
scala> val groupedComp = companies.groupBy(x => x.companyId)
groupedComp: org.apache.spark.rdd.RDD[(String, Iterable[Company])] = ShuffledRDD[18] at groupBy at <console>:20
scala> groupedItems.join(groupedComp).take(10).foreach(println)
14/12/12 00:52:32 INFO DAGScheduler: Job 5 finished: take at <console>:35, took 0.021870 s
(c1,(CompactBuffer(Item(1,first,2,c1), Item(2,second,2,c1)),CompactBuffer(Company(c1,company-1,city-1))))
(c2,(CompactBuffer(Item(3,third,2,c2)),CompactBuffer(Company(c2,company-2,city-2))))
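Because groupBy puts Iterables on both sides, the join above returns nested CompactBuffers; if flat (Item, Company) pairs are wanted instead, one option (my addition, not part of the original answer) is to flatten the per-key cross product:
val itemCompanyPairs = groupedItems.join(groupedComp).flatMap { case (_, (is, cs)) =>
  for (i <- is; c <- cs) yield (i, c)
}
// each element is then an (Item, Company) pair, e.g. (Item(1,first,2,c1),Company(c1,company-1,city-1))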
Spark SQL can perform joins on Spark RDDs.
The code below performs a SQL join on the Company and Item RDDs.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSQLJoin {

  case class Item(id: String, name: String, unit: Int, companyId: String)
  case class Company(companyId: String, name: String, city: String)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val i1 = Item("1", "first", 1, "c1")
    val i2 = Item("2", "second", 2, "c2")
    val i3 = Item("3", "third", 3, "c3")
    val c1 = Company("c1", "company-1", "city-1")
    val c2 = Company("c2", "company-2", "city-2")

    val companies = sc.parallelize(List(c1, c2))
    companies.registerAsTable("companies")

    val items = sc.parallelize(List(i1, i2, i3))
    items.registerAsTable("items")

    val result = sqlContext.sql("SELECT * FROM companies C JOIN items I ON C.companyId = I.companyId").collect
    result.foreach(println)
  }
}
Output is displayed as
[c1,company-1,city-1,1,first,1,c1]
[c2,company-2,city-2,2,second,2,c2]