spark join operation based on two columns - scala

I'm trying to join two datasets based on two columns. It works when I join on one column, but with two columns it fails with the error below:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code :
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect
val ranges = partitionSizes.foldLeft(List((0, 0))) { case(accList, count) =>
val start = accList.head._2
val end = start + count
(start, end) :: accList
}.reverse.tail.toArray
rdd.mapPartitionsWithIndex( (index, partition) => {
val start = ranges(index)._1
val end = ranges(index)._2
val indexes = Iterator.range(start, end)
partition.zip(indexes)
})
}
val dimension = sc.
textFile("dimension.txt").
map{ line =>
val parts = line.split("\t")
(parts(0),parts(1),parts(2),parts(3),parts(4),parts(5))
}
val dimensionWithSK =
zipWithIndex(dimension).map { case((nk1,nk2,prop3,prop4,prop5,prop6), idx) => (nk1,nk2,(prop3,prop4,prop5,prop6,idx + nextSurrogateKey)) }
val fact = sc.
textFile("fact.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
(parts(0),parts(1), (parts(2),parts(3), parts(4),parts(5),parts(6).toDouble))
}
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Could someone please help here?
Thanks
Sridhar

If you look at the signature of join it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first two elements of the tuple, so you need to map your triple to a pair whose first element is itself a pair containing the first two elements of the triple, e.g. for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
case (key1, key2, value) => ((key1, key2), value)
}
.join(
right.map {
case (key1, key2, value) => ((key1, key2), value)
})
This will give you an RDD of the form RDD[((String, String), (V1, V2))]
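To see the re-keying step in isolation, here is a minimal sketch on plain Scala collections, so it runs without a Spark cluster. The helper joinByKey is hypothetical (it only mimics the inner-join behaviour of the RDD join), and the sample rows are made up for illustration.

```scala
// Plain-Scala stand-in for an inner join on the key.
// `joinByKey` is a hypothetical helper for illustration, not a Spark API.
def joinByKey[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
  val rightByKey: Map[K, Seq[W]] =
    right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
  for {
    (k, v) <- left
    w      <- rightByKey.getOrElse(k, Seq.empty)
  } yield (k, (v, w))
}

// Triples shaped like the question's data: (nk1, nk2, payload). Values are invented.
val factRows = Seq(("a", "x", 10.0), ("b", "y", 20.0), ("c", "z", 30.0))
val dimRows  = Seq(("a", "x", 100L), ("b", "y", 101L))

// Re-key each triple as ((nk1, nk2), payload) so the join sees pairs.
val factByKey = factRows.map { case (nk1, nk2, amount) => ((nk1, nk2), amount) }
val dimByKey  = dimRows.map { case (nk1, nk2, sk) => ((nk1, nk2), sk) }

// Keep only (surrogate key, amount), as the question's final map intends.
val skAmount = joinByKey(factByKey, dimByKey).map { case (_, (amount, sk)) => (sk, amount) }
```

With real RDDs the two map calls stay the same and joinByKey is replaced by the built-in join.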

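This form of join takes column names, so it applies when both sides are DataFrames (the RDD join has no such overload); rdd1 and rdd2 below are DataFrames despite their names.
rdd1 schema:
field1, field2, field3, fieldX, ...
rdd2 schema:
field1, field2, field3, fieldY, ...
val joinResult = rdd1.join(rdd2,
Seq("field1", "field2", "field3"), "outer")
joinResult schema:
field1, field2, field3, fieldX, fieldY, ...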

val emp = sc.
textFile("emp.txt").
map { line =>
val parts = line.split("\t")
// key each row on (column 0, column 2) so that
// the join matches on both columns.
((parts(0), parts(2)),parts(1))
}
val emp_new = sc.
textFile("emp_new.txt").
map { line =>
val parts = line.split("\t")
// key each row on (column 0, column 2) so that
// the join matches on both columns.
((parts(0), parts(2)),parts(1))
}
val finalemp =
emp_new.join(emp).
map { case((nk1,nk2) ,((parts1), (val1))) => (nk1,parts1,val1) }

Related

spark: join rdd based on sequence of another rdd

I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns: id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me following for id1, outerjoin with lookuptable:
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly for id2 i should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally output for each id should have all counts with id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what i have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item,count)) }.map(row=>(row._1,row._2)).groupByKey()
get(line).map(l=>(l._1,l._2)).mapValues(item_count=>lookup_rdd.leftOuterJoin(item_count))
def get (line: RDD[(String, Iterable[(String, Int)])]) = {
for { (id, item_cnt) <- line
i = item_cnt.map(tuple => (tuple._1,tuple._2)) } yield (id,i)
}
Try the code below. Run each step in your local console to see what's happening in detail.
The idea is to use zipWithIndex on both lookup_rdd and the distinct ids, which gives
(i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Position in the final result = delta * (index of the id) + (index of the item), where delta is the length of lookup_rdd.
So the base sequence generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2))
Then reduce on the key (item, id) to calculate the count, join it in, and sort by position.
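// sample data from the question
val arr = Seq(("id1","item1",1),("id1","item2",3),("id1","item3",4),("id2","item1",3),("id2","item4",2))
val cart = Seq(("item1",0),("item2",0),("item3",0),("item4",0),("item5",0))
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1,x._2._1),((delta * x._2._2) + x._1._2, 0)))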
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map( x => {
x._2._2 match {
case Some(x1) => (x._2._1._1, (x._1,x._2._1._2+x1))
case None => (x._2._1._1, (x._1,x._2._1._2))
}
})
val res97 = res91.sortByKey(true).map( x => {
(x._2._1._2,List(x._2._2))}).reduceByKey(_++_)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
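For comparison, here is a small sketch of the same result computed on plain Scala collections. It assumes the lookup table is small enough to hold in memory (with Spark one would typically collect or broadcast it); the names sample, lookupOrder, countsByIdItem and orderedCounts are mine, not from the question.

```scala
// Data from the question, as plain collections.
val sample = Seq(
  ("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
  ("id2", "item1", 3), ("id2", "item4", 2))
val lookupOrder = Seq("item1", "item2", "item3", "item4", "item5")

// Sum the counts per (id, item).
val countsByIdItem: Map[(String, String), Int] =
  sample.groupBy { case (id, item, _) => (id, item) }
        .map { case (key, rows) => (key, rows.map(_._3).sum) }

// For each id, read the counts back in lookup order,
// defaulting to 0 for items that id never had.
val orderedCounts: Seq[(String, Seq[Int])] =
  sample.map(_._1).distinct.map { id =>
    (id, lookupOrder.map(item => countsByIdItem.getOrElse((id, item), 0)))
  }
```

Because each output list is built by walking lookupOrder, the ordering requirement holds by construction rather than depending on a sort.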

how to join two datasets by key in scala spark

I have two datasets, and each dataset has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets on the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
Above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse:String => (String, String) = s => {
val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
s match {
case regex(k,v) => (k,v)
case _ => ("","")
}
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sc.parallelize(s1)
val rdd2 = sc.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
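The parsing step can be checked without Spark. The sketch below reuses the regex from above on plain Scala collections; returning an Option instead of ("", "") for malformed lines is my own variation, and parseLine, animals, fruits and joinedRows are illustrative names.

```scala
// Same regex as in the parse function above.
val pairPattern = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r

// Variation: return None for malformed lines instead of ("", "").
def parseLine(s: String): Option[(String, String)] = s match {
  case pairPattern(k, v) => Some((k, v))
  case _                 => None
}

// flatMap drops the lines that failed to parse ("garbage" here).
val animals = Seq("('abc,def', 'monkey(1)')", "('df,gh', 'zebra')", "garbage")
  .flatMap(s => parseLine(s))
val fruits = Seq("('a,efg', 'apple')", "('abc,def', 'banana(1)')")
  .flatMap(s => parseLine(s))

// Inner join by name on plain collections (assumes names are unique in fruits).
val fruitByName = fruits.toMap
val joinedRows = animals.collect {
  case (name, animal) if fruitByName.contains(name) => (name, animal, fruitByName(name))
}
```

Dropping unparseable lines avoids accidentally joining all malformed rows together on the empty key.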
You have to create pair RDDs from your data sets first and then apply the join transformation. Your data sets as posted do not look well-formed.
Please consider the below example.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
Your Scala code should look like this:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map { line => val parts = line.split(" "); (parts(0), parts(1)) }
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map { line => val parts = line.split(" "); (parts(0), parts(1)) }
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from scala shell
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))

build inverted index in spark application using scala

I am new to Spark and the Scala programming language. My input is a CSV file. I need to build an inverted index on the values in the CSV file, as explained below with an example.
Input: file.csv
attr1, attr2, attr3
1, AAA, 23
2, BBB, 23
3, AAA, 27
output format: value -> (rowid, columnid) pairs
for example: AAA -> ((1,2),(3,2))
27 -> (3,3)
I have started with the following code. I am stuck after that. Kindly help.
object Main {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Invert Me!").setMaster("local[2]")
val sc = new SparkContext(conf)
val txtFilePath = "/home/person/Desktop/sample.csv"
val txtFile = sc.textFile(txtFilePath)
val nRows = txtFile.count()
val data = txtFile.map(line => line.split(",").map(elem => elem.trim()))
val nCols = data.collect()(0).length
}
}
Code preserving your style could look like this:
val header = sc.broadcast(data.first())
val cells = data.zipWithIndex().filter(_._2 > 0).flatMap { case (row, index) =>
row.zip(header.value).map { case (value, column) => value ->(column, index) }
}
val index: RDD[(String, Vector[(String, Long)])] =
cells.aggregateByKey(Vector.empty[(String, Long)])(_ :+ _, _ ++ _)
Here the index value should contain the desired mapping from cell value to (ColumnName, RowIndex) pairs.
The underscores in the methods above are just shorthand lambdas; the same thing can be written more verbosely as
val cellsVerbose = data.zipWithIndex().flatMap {
case (_, 0L) => Seq.empty[(String, (String, Long))] // skip the header row (index 0)
case (row, index) => row.zip(header.value).map {
case (value, column) => value -> (column, index)
}.toSeq
}
val indexVerbose: RDD[(String, Vector[(String, Long)])] =
cellsVerbose.aggregateByKey(zeroValue = Vector.empty[(String, Long)])(
seqOp = (keys, key) => keys :+ key,
combOp = (keysA, keysB) => keysA ++ keysB)
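As a quick way to check the expected shape, here is a sketch of the same inverted index on plain Scala collections. Unlike the code above, it emits numeric (rowId, columnId) pairs, both 1-based, to match the format asked for in the question; that numbering and the names csvLines, cellPositions and invertedIndex are my assumptions.

```scala
val csvLines = Seq("attr1, attr2, attr3", "1, AAA, 23", "2, BBB, 23", "3, AAA, 27")

// Split and trim each line; drop the header so data rows are numbered from 1.
val rows: Seq[Array[String]] = csvLines.tail.map(_.split(",").map(_.trim))

// Emit (value, (rowId, columnId)) for every cell, with 1-based ids.
val cellPositions: Seq[(String, (Int, Int))] =
  for {
    (row, rowIdx)   <- rows.zipWithIndex
    (value, colIdx) <- row.toSeq.zipWithIndex
  } yield (value, (rowIdx + 1, colIdx + 1))

// Group the positions by cell value.
val invertedIndex: Map[String, Seq[(Int, Int)]] =
  cellPositions.groupBy(_._1).map { case (value, hits) => (value, hits.map(_._2)) }
```

Here invertedIndex("AAA") holds the positions (1,2) and (3,2), which is the example given in the question.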

Scala spark reduce by key and find common value

I have a file of CSV data stored as a sequenceFile on HDFS, in the format name, zip, country, fav_food1, fav_food2, fav_food3, fav_colour. There could be many entries with the same name, and I need to find out what each person's favourite food is (i.e. count all the food entries in all the records with that name and return the most popular one). I am new to Scala and Spark and have gone through multiple tutorials and scoured the forums, but I am stuck as to how to proceed. So far I have converted the sequence file's Text values into Strings and then filtered out the entries.
Here are sample data entries, one per line in the file:
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
So the output should be the tuple (Bob, Soda), since Soda appears most often in Bob's entries.
import org.apache.hadoop.io._
var lines = sc.sequenceFile("path",classOf[LongWritable],classOf[Text]).values.map(x => x.toString())
// converted to string since I could not get filter to run on Text and removing the longwritable
var filtered = lines.filter(_.split(",")(0) == "Bob");
// removed entries with all other users
var f_tuples = filtered.map(line => line.split(","))
// split all the values
var f_simple = f_tuples.map(line => (line(0), (line(3), line(4), line(5))))
// removed unnecessary fields
The issue I have now is that I think I have this [<name,[f,f,f]>] structure and don't really know how to flatten it out and get the most popular food. I need to combine all the entries so that I have one entry per name, and then get the most common element in the value. Any help would be appreciated. Thanks
I tried this to flatten it out, but it seems the more I try, the more convoluted the data structure becomes.
var f_trial = fpairs.groupBy(_._1).mapValues(_.map(_._2))
// the resulting structure was of type org.apache.spark.rdd.RDD[(String, Iterable[(String, String, String)])]
Here is what a println of a record looks like after f_trial:
("Bob", List((Pizza, Soda,), (Chocolate, Cheese, Soda), (Chocolate, Pizza, Soda)))
Parenthesis Breakdown
("Bob",
List(
(Pizza, Soda, <missing value>),
(Chocolate, Cheese, Soda),
(Chocolate, Pizza, Soda)
) // ends List paren
) // ends first paren
I found time. Setup:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val data = """
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
""".trim
val records = sc.parallelize(data.split('\n'))
Extract the food choices, and for each make a tuple of ((name, food), 1)
val r2 = records.flatMap { r =>
val Array(name, id, country, food1, food2, food3, color) = r.split(',');
List(((name, food1), 1), ((name, food2), 1), ((name, food3), 1))
}
Total up each name/food combination:
val r3 = r2.reduceByKey((x, y) => x + y)
Remap so that the name (only) is the key
val r4 = r3.map { case ((name, food), total) => (name, (food, total)) }
Pick the food with the largest count at each step
val res = r4.reduceByKey((x, y) => if (y._2 > x._2) y else x)
And we're done
println(res.collect().mkString)
//(Mary,(Chips,1))(Bob,(Soda,3))
EDIT: To collect all the food items that have the same top count for a person, we just change the last two lines:
Start with a List of items with total:
val r5 = r3.map { case ((name, food), total) => (name, (List(food), total)) }
In the equal case, concatenate the list of food items with that score
val res2 = r5.reduceByKey((x, y) => if (y._2 > x._2) y
else if (y._2 < x._2) x
else (y._1:::x._1, y._2))
//(Mary,(List(Chocolate, Pasta, Chips),1))
//(Bob,(List(Soda),3))
If you want the top-3, say, then use aggregateByKey to assemble a list of the favorite foods per person instead of the second reduceByKey
Solutions provided by Paul and mattinbits shuffle your data twice - once to perform reduce-by-name-and-food and once to reduce-by-name. It is possible to solve this problem with only one shuffle.
/**Generate key-food_count pairs from a split line**/
def bitsToKeyMapPair(xs: Array[String]): (String, Map[String, Long]) = {
val key = xs(0)
val map = xs
.drop(3) // Drop name..country
.take(3) // Take food
.filter(_.trim.size !=0) // Ignore empty
.map((_, 1L)) // Generate k-v pairs
.toMap // Convert to Map
.withDefaultValue(0L) // Set default
(key, map)
}
/**Combine two count maps**/
def combine(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
(m1.keys ++ m2.keys).map(k => (k, m1(k) + m2(k))).toMap.withDefaultValue(0L)
}
val n: Int = ??? // Number of favorite per user
val records = lines.map(line => bitsToKeyMapPair(line.split(",")))
records.reduceByKey(combine).mapValues(_.toSeq.sortBy(-_._2).take(n))
If you're not a purist you can replace scala.collection.immutable.Map with scala.collection.mutable.Map to further improve performance.
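The merge-the-maps idea can be tried out on plain Scala collections. In this sketch groupBy followed by reduce stands in for reduceByKey, so it says nothing about shuffle behaviour; toCounts, mergeCounts, countsByName and topN are illustrative names rather than the functions defined above.

```scala
val foodLines = Seq(
  "Bob,123,USA,Pizza,Soda,,Blue",
  "Bob,456,UK,Chocolate,Cheese,Soda,Green",
  "Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
  "Mary,68,USA,Chips,Pasta,Chocolate,Blue")

// One record -> (name, Map(food -> 1)) for its non-empty food fields.
def toCounts(line: String): (String, Map[String, Long]) = {
  val fields = line.split(",")
  (fields(0), fields.slice(3, 6).filter(_.trim.nonEmpty).map(f => (f, 1L)).toMap)
}

// Merge two count maps by adding counts key by key.
def mergeCounts(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
  b.foldLeft(a) { case (acc, (food, n)) => acc.updated(food, acc.getOrElse(food, 0L) + n) }

// Stand-in for reduceByKey(mergeCounts): group by name, then fold the maps together.
val countsByName: Map[String, Map[String, Long]] =
  foodLines.map(toCounts).groupBy(_._1).map { case (name, recs) =>
    (name, recs.map(_._2).reduce(mergeCounts))
  }

// Top n foods per name; n = 1 gives the single favourite.
def topN(n: Int): Map[String, Seq[(String, Long)]] =
  countsByName.map { case (name, counts) => (name, counts.toSeq.sortBy(-_._2).take(n)) }
```

For Bob, topN(1) yields Soda with a count of 3; ties (as in Mary's single record) are broken arbitrarily by the sort.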
Here's a complete example:
import org.apache.spark.{SparkContext, SparkConf}
object Main extends App {
val data = List(
"Bob,123,USA,Pizza,Soda,,Blue",
"Bob,456,UK,Chocolate,Cheese,Soda,Green",
"Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
"Mary,68,USA,Chips,Pasta,Chocolate,Blue")
val sparkConf = new SparkConf().setMaster("local").setAppName("example")
val sc = new SparkContext(sparkConf)
val lineRDD = sc.parallelize(data)
val pairedRDD = lineRDD.map { line =>
val fields = line.split(",")
(fields(0), List(fields(3), fields(4), fields(5)).filter(_ != ""))
}.filter(_._1 == "Bob")
/*pairedRDD.collect().foreach(println)
(Bob,List(Pizza, Soda))
(Bob,List(Chocolate, Cheese, Soda))
(Bob,List(Chocolate, Pizza, Soda))
*/
val flatPairsRDD = pairedRDD.flatMap {
case (name, foodList) => foodList.map(food => ((name, food), 1))
}
/*flatPairsRDD.collect().foreach(println)
((Bob,Pizza),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Cheese),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Pizza),1)
((Bob,Soda),1)
*/
val nameFoodSumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)
/*nameFoodSumRDD.collect().foreach(println)
((Bob,Cheese),1)
((Bob,Soda),3)
((Bob,Pizza),2)
((Bob,Chocolate),2)
*/
val resultsRDD = nameFoodSumRDD.map{
case ((name, food), count) => (name, (food,count))
}.groupByKey.map{
case (name, foodCountList) => (name, foodCountList.toList.sortBy(_._2).reverse.head)
}
resultsRDD.collect().foreach(println)
/*
(Bob,(Soda,3))
*/
sc.stop()
}

Split RDD into RDD's with no repeating values

I have an RDD of pairs as below:
(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)
I would like to split it into two RDD's such that the first has only the pairs with non-repeating values (for both key and value) and the second will have the rest of the omitted pairs.
So from the above we could get two RDD's
1) (105,918)
(502,516)
2) (105,757) - Omitted as 105 is already included in 1st RDD
(105,137) - Omitted as 105 is already included in 1st RDD
(516,816) - Omitted as 516 is already included in 1st RDD
(350,502) - Omitted as 502 is already included in 1st RDD
Currently I am using a mutable Set variable to track the elements already selected after coalescing the original RDD to a single partition :
val evalCombinations = collection.mutable.Set.empty[String]
val currentValidCombinations = allCombinations
.filter(p => {
if(!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
evalCombinations += p._1;evalCombinations += p._2; true
} else
false
})
This approach is limited by memory of the executor on which the operations run. Is there a better scalable solution for this ?
Here is a version that avoids the single partition but needs more memory on the driver.
import org.apache.spark.rdd._
import org.apache.spark._
def getUniq(rdd: RDD[(Int, Int)], sc: SparkContext): RDD[(Int, Int)] = {
val keys = rdd.keys.distinct
val values = rdd.values.distinct
// these are the keys which appear in value part also.
val both = keys.intersection(values)
val bBoth = sc.broadcast(both.collect.toSet)
// remove those key-value pairs which have value which is also a key.
val uKeys = rdd.filter(x => !bBoth.value.contains(x._2))
.reduceByKey{ case (v1, v2) => v1 } // keep uniq keys
uKeys.map{ case (k, v) => (v, k) } // swap key, value
.reduceByKey{ case (v1, v2) => v1 } // keep uniq value
.map{ case (k, v) => (v, k) } // correct placement
}
def getPartitionedRDDs(rdd: RDD[(Int, Int)], sc: SparkContext) = {
val r = getUniq(rdd, sc)
val remaining = rdd subtract r
val set = r.flatMap { case (k, v) => Array(k, v) }.collect.toSet
val a = remaining.filter{ case (x, y) => !set.contains(x) &&
!set.contains(y) }
val b = getUniq(a, sc)
val part1 = r union b
val part2 = rdd subtract part1
(part1, part2)
}
val rdd = sc.parallelize(Array((105,918),(105,757),(502,516),
(105,137),(516,816),(350,502)))
val (first, second) = getPartitionedRDDs(rdd, sc)
// first.collect: ((516,816), (105,918), (350,502))
// second.collect: ((105,137), (502,516), (105,757))
val rdd1 = sc.parallelize(Array((839,841),(842,843),(840,843),
(839,840),(1,2),(1,3),(4,3)))
val (f, s) = getPartitionedRDDs(rdd1, sc)
//f.collect: ((839,841), (1,2), (840,843), (4,3))
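For checking results on small inputs, the rule from the question (keep a pair only if neither of its numbers has been kept already) can be written as a sequential fold on plain Scala collections. This is only a reference sketch, not a distributed solution, and splitGreedy is a name I made up. Its result depends on input order, so it can differ from the output of getPartitionedRDDs while both satisfy the no-repeats condition.

```scala
// Greedy single-pass split: keep a pair only if neither number was kept before.
def splitGreedy(pairs: Seq[(Int, Int)]): (Seq[(Int, Int)], Seq[(Int, Int)]) = {
  val (_, kept, omitted) =
    pairs.foldLeft((Set.empty[Int], Vector.empty[(Int, Int)], Vector.empty[(Int, Int)])) {
      case ((seen, keep, omit), pair @ (a, b)) =>
        if (seen(a) || seen(b)) (seen, keep, omit :+ pair) // a or b already used
        else (seen + a + b, keep :+ pair, omit)            // first use of both
    }
  (kept, omitted)
}

val input = Seq((105, 918), (105, 757), (502, 516), (105, 137), (516, 816), (350, 502))
val (keptPairs, omittedPairs) = splitGreedy(input)
```

On the question's sample this reproduces the split listed there: (105,918) and (502,516) are kept and the other four pairs are omitted.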