spark: join rdd based on sequence of another rdd - scala

I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns: id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
The output should give me the following for id1 (an outer join with the lookup table):
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should have all counts with the id:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = rdd_sample.map { case (id, item, count) => (id, (item,count)) }.map(row=>(row._1,row._2)).groupByKey()
get(line).map(l => (l._1, l._2)).mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))
def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}

Try the steps below. Run each one in your local console to understand in detail what's happening.
The idea is to zipWithIndex and form a sequence based on lookup_rdd.
(i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Index wanted in the final result = [delta (the length of the lookup_rdd seq) * index of id1..id2] + index of i1..i5
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2))
and then, based on the key (i1,id1), reduce and calculate the count.
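The snippet below uses arr and cart; here they are assumed to hold the question's sample data, for example:
// Sample inputs assumed from the question's data.
val arr  = Seq(("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
               ("id2", "item1", 3), ("id2", "item4", 2))   // rows of sample_rdd
val cart = Seq(("item1", 0), ("item2", 0), ("item3", 0),
               ("item4", 0), ("item5", 0))                  // rows of lookup_rdd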
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1, x._2._1), ((delta * x._2._2) + x._1._2, 0)))
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
val res91 = res88.map { x =>
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1, x._2._1._2 + x1))
    case None     => (x._2._1._1, (x._1, x._2._1._2))
  }
}
val res97 = res91.sortByKey(true).map(x => (x._2._1._2, List(x._2._2))).reduceByKey(_ ++ _)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))
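For comparison, a simpler (though shuffle-heavier) sketch of the same idea, assuming res2/res3 as above: build every (id, item) cell via cartesian, left-outer-join the actual counts, and order each id's values by the lookup position.
val lookupIdx = res3.map(_._1).zipWithIndex                        // (item, position in lookup)
val counts    = res2.map { case (id, item, cnt) => ((id, item), cnt) }.reduceByKey(_ + _)
val allCells  = res2.map(_._1).distinct.cartesian(lookupIdx)
  .map { case (id, (item, idx)) => ((id, item), idx) }             // every (id, item) cell
val ordered   = allCells.leftOuterJoin(counts)
  .map { case ((id, _), (idx, cntOpt)) => (id, (idx, cntOpt.getOrElse(0))) }
val result    = ordered.groupByKey().mapValues(_.toList.sortBy(_._1).map(_._2))
// result.collect() should give Array((id1,List(1, 3, 4, 0, 0)), (id2,List(3, 0, 0, 2, 0)))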

Trouble with Scala pattern matching with map - required String

I'm trying to turn my RDD into a pair RDD, but I'm having trouble with the pattern matching and I have no idea what I'm doing wrong.
val test = sc.textFile("neighborhood_test.csv");
val nhead0 = test.first;
val test_split = test.map(line => line.split("\t"));
val nhead = test_split.first;
val test_neigh0 = test.filter(line => line!= nhead0);
//test_neigh0.first = 3335 Dunlap Seattle
val test_neigh1 = test_neigh0.map(line => line.split("\t"));
//test_neigh1.first = Array[String] = Array(3335, Dunlap, Seattle)
val test_neigh = test_neigh1.map({case (id, neigh, city) => (id, (neigh, city))});
This gives the error:
found : (T1, T2, T3)
required: String
val test_neigh = test_neigh0.map({case (id, neigh, city) => (id, (neigh, city))});
EDIT:
The input file is tab-separated and looks like this:
id neighbourhood city
3335 Dunlap Seattle
4291 Roosevelt Seattle
5682 South Delridge Seattle
As output I want a pair RDD with id as the key and (neigh, city) as the value.
Neither test_neigh0.first nor test_neigh1.first is a triple, so you cannot pattern match it as such.
The elements in test_neigh1 are Array[String]. Under the assumption that these arrays are all of length 3, you can pattern match against them as { case Array(id, neigh, city) => ...}.
To make sure that you won't get a matching error if one of the lines has more or fewer than 3 elements, you may collect on this pattern matching instead of mapping on it.
val test_neigh: RDD[(String, (String, String))] = test_neigh1.collect{
case Array(id, neigh, city) => (id, (neigh, city))
}
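With the sample file above, a quick check of the result (output assumed; I have not run it against your exact file) would be:
test_neigh.collect().foreach(println)
// (3335,(Dunlap,Seattle))
// (4291,(Roosevelt,Seattle))
// (5682,(South Delridge,Seattle))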
EDIT
The issues you experienced as described in your comment are related to RDD[_] not being a usual collection (such as List, Array or Set). To avoid those, you might need to fetch elements in the array without pattern matching:
val test_neigh: RDD[(String, (String, String))] = test_neigh0.map(line => {
val arr = line.split("\t")
(arr(0), (arr(1), arr(2)))
})
val baseRDD = sc.textFile("neighborhood_test.csv").filter { x => !x.contains("city") }
baseRDD.map { x =>
val split = x.split("\t")
(split(0), (split(1), split(2)))
}.groupByKey().foreach(println(_))
Result:
(3335,CompactBuffer((Dunlap,Seattle)))
(4291,CompactBuffer((Roosevelt,Seattle)))
(5682,CompactBuffer((South Delridge,Seattle)))

build inverted index in spark application using scala

I am new to Spark and the Scala programming language. My input is a CSV file. I need to build an inverted index on the values in the CSV file, as explained below with an example.
Input: file.csv
attr1, attr2, attr3
1, AAA, 23
2, BBB, 23
3, AAA, 27
output format: value -> (rowid, columnid) pairs
for example: AAA -> ((1,2),(3,2))
27 -> (3,3)
I have started with the following code. I am stuck after that. Kindly help.
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Invert Me!").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val txtFilePath = "/home/person/Desktop/sample.csv"
    val txtFile = sc.textFile(txtFilePath)
    val nRows = txtFile.count()
    val data = txtFile.map(line => line.split(",").map(elem => elem.trim()))
    val nCols = data.collect()(0).length
  }
}
Code preserving your style could look like this:
val header = sc.broadcast(data.first())
val cells = data.zipWithIndex().filter(_._2 > 0).flatMap { case (row, index) =>
row.zip(header.value).map { case (value, column) => value ->(column, index) }
}
val index: RDD[(String, Vector[(String, Long)])] =
cells.aggregateByKey(Vector.empty[(String, Long)])(_ :+ _, _ ++ _)
Here the index value should contain the desired mapping from cell value to (ColumnName, RowIndex) pairs.
The underscores in the methods above are just shorthand lambdas; the same thing could be written another way as:
val cellsVerbose = data.zipWithIndex().flatMap {
  case (row, 0) => IndexedSeq.empty // skipping the header row (index 0)
  case (row, index) => row.zip(header.value).map {
    case (value, column) => value -> (column, index)
  }
}
val indexVerbose: RDD[(String, Vector[(String, Long)])] =
cellsVerbose.aggregateByKey(zeroValue = Vector.empty[(String, Long)])(
seqOp = (keys, key) => keys :+ key,
combOp = (keysA, keysB) => keysA ++ keysB)
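As a sanity check against the sample file (a sketch; ordering inside the vectors may vary, and the row index is the zipWithIndex position, with the header at index 0):
indexVerbose.collect().foreach(println)
// e.g. (AAA,Vector((attr2,1), (attr2,3)))
//      (23,Vector((attr3,1), (attr3,2)))
//      (27,Vector((attr3,3)))
//      ... one entry per distinct cell value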

Scala spark reduce by key and find common value

I have a file of CSV data stored as a sequenceFile on HDFS, in the format name, zip, country, fav_food1, fav_food2, fav_food3, fav_colour. There can be many entries with the same name, and I need to find out what their favourite food is (i.e. count all the food entries in all the records with that name and return the most popular one). I am new to Scala and Spark, have gone through multiple tutorials and scoured the forums, but I am stuck on how to proceed. So far I have converted the sequence file's Text values into String format and then filtered out the entries.
Here are the sample data entries, one per line in the file:
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
So the output should be the tuple (Bob, Soda), since Soda appears the most times in Bob's entries.
import org.apache.hadoop.io._
var lines = sc.sequenceFile("path",classOf[LongWritable],classOf[Text]).values.map(x => x.toString())
// converted to string since I could not get filter to run on Text and removing the longwritable
var filtered = lines.filter(_.split(",")(0) == "Bob");
// removed entries with all other users
var f_tuples = filtered.map(line => line.split(","));
// split all the values
var f_simple = f_tuples.map(line => (line(0), (line(3), line(4), line(5))));
// removed unnecessary fields
The issue I have now is that I think I have this [<name, [f,f,f]>] structure and don't really know how to proceed to flatten it out and get the most popular food. I need to combine all the entries so I have one entry per name, and then get the most common element in the value. Any help would be appreciated. Thanks.
I tried this to get it to flatten out, but it seems the more I try, the more convoluted the data structure becomes.
var f_trial = fpairs.groupBy(_._1).mapValues(_.map(_._2))
// the resulting structure was of type org.apache.spark.rdd.RDD[(String, Iterable[(String, String, String)])]
Here is what a println of a record looks like after f_trial:
("Bob", List((Pizza, Soda,), (Chocolate, Cheese, Soda), (Chocolate, Pizza, Soda)))
Parenthesis Breakdown
("Bob",
List(
(Pizza, Soda, <missing value>),
(Chocolate, Cheese, Soda),
(Chocolate, Pizza, Soda)
) // ends List paren
) // ends first paren
I found time. Setup:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val data = """
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
""".trim
val records = sc.parallelize(data.split('\n'))
Extract the food choices, and for each make a tuple of ((name, food), 1)
val r2 = records.flatMap { r =>
val Array(name, id, country, food1, food2, food3, color) = r.split(',');
List(((name, food1), 1), ((name, food2), 1), ((name, food3), 1))
}
Total up each name/food combination:
val r3 = r2.reduceByKey((x, y) => x + y)
Remap so that the name (only) is the key
val r4 = r3.map { case ((name, food), total) => (name, (food, total)) }
Pick the food with the largest count at each step
val res = r4.reduceByKey((x, y) => if (y._2 > x._2) y else x)
And we're done
println(res.collect().mkString)
//(Mary,(Chips,1))(Bob,(Soda,3))
EDIT: To collect all the food items that have the same top count for a person, we just change the last two lines:
Start with a List of items with total:
val r5 = r3.map { case ((name, food), total) => (name, (List(food), total)) }
In the equal case, concatenate the list of food items with that score
val res2 = r5.reduceByKey((x, y) => if (y._2 > x._2) y
else if (y._2 < x._2) x
else (y._1:::x._1, y._2))
//(Mary,(List(Chocolate, Pasta, Chips),1))
//(Bob,(List(Soda),3))
If you want the top-3, say, then use aggregateByKey to assemble a list of the favorite foods per person instead of the second reduceByKey
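A minimal sketch of that aggregateByKey variant, assuming r3 from above and that each person's food list stays small enough to sort inside the combiner:
val perName = r3.map { case ((name, food), total) => (name, (food, total)) }
val top3 = perName.aggregateByKey(List.empty[(String, Int)])(
  (acc, x) => (x :: acc).sortBy(-_._2).take(3),    // fold a value into a partition's list
  (a, b)   => (a ++ b).sortBy(-_._2).take(3))      // merge two partitions' lists
// top3.collect() ~ Array((Bob,List((Soda,3), (Pizza,2), (Chocolate,2))),
//                        (Mary,List((Chips,1), (Pasta,1), (Chocolate,1))))  // tie order may vary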
Solutions provided by Paul and mattinbits shuffle your data twice - once to perform reduce-by-name-and-food and once to reduce-by-name. It is possible to solve this problem with only one shuffle.
/** Generate key-food_count pairs from a split line **/
def bitsToKeyMapPair(xs: Array[String]): (String, Map[String, Long]) = {
val key = xs(0)
val map = xs
.drop(3) // Drop name..country
.take(3) // Take food
.filter(_.trim.size !=0) // Ignore empty
.map((_, 1L)) // Generate k-v pairs
.toMap // Convert to Map
.withDefaultValue(0L) // Set default
(key, map)
}
/**Combine two count maps**/
def combine(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
(m1.keys ++ m2.keys).map(k => (k, m1(k) + m2(k))).toMap.withDefaultValue(0L)
}
val n: Int = ??? // Number of favorite per user
val records = lines.map(line => bitsToKeyMapPair(line.split(",")))
records.reduceByKey(combine).mapValues(_.toSeq.sortBy(-_._2).take(n))
If you're not a purist you can replace scala.collection.immutable.Map with scala.collection.mutable.Map to further improve performance.
Here's a complete example:
import org.apache.spark.{SparkContext, SparkConf}
object Main extends App {
val data = List(
"Bob,123,USA,Pizza,Soda,,Blue",
"Bob,456,UK,Chocolate,Cheese,Soda,Green",
"Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
"Mary,68,USA,Chips,Pasta,Chocolate,Blue")
val sparkConf = new SparkConf().setMaster("local").setAppName("example")
val sc = new SparkContext(sparkConf)
val lineRDD = sc.parallelize(data)
val pairedRDD = lineRDD.map { line =>
val fields = line.split(",")
(fields(0), List(fields(3), fields(4), fields(5)).filter(_ != ""))
}.filter(_._1 == "Bob")
/*pairedRDD.collect().foreach(println)
(Bob,List(Pizza, Soda))
(Bob,List(Chocolate, Cheese, Soda))
(Bob,List(Chocolate, Pizza, Soda))
*/
val flatPairsRDD = pairedRDD.flatMap {
case (name, foodList) => foodList.map(food => ((name, food), 1))
}
/*flatPairsRDD.collect().foreach(println)
((Bob,Pizza),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Cheese),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Pizza),1)
((Bob,Soda),1)
*/
val nameFoodSumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)
/*nameFoodSumRDD.collect().foreach(println)
((Bob,Cheese),1)
((Bob,Soda),3)
((Bob,Pizza),2)
((Bob,Chocolate),2)
*/
val resultsRDD = nameFoodSumRDD.map{
case ((name, food), count) => (name, (food,count))
}.groupByKey.map{
case (name, foodCountList) => (name, foodCountList.toList.sortBy(_._2).reverse.head)
}
resultsRDD.collect().foreach(println)
/*
(Bob,(Soda,3))
*/
sc.stop()
}

Spark column wise word count

We are trying to generate column-wise statistics of our dataset in Spark. In addition to using the summary function from the statistics library, we are using the following procedure:
We determine the columns with string values
Generate key value pair for the whole dataset, using the column number as key and value of column as value
Generate a new map of the format
(K,V) -> ((K,V),1)
Then we use reduceByKey to find the sum of all unique values in all the columns. We cache this output to reduce further computation time.
In the next step we cycle through the columns using a for loop to find the statistics for all the columns.
We are trying to replace the for loop by again utilizing the map-reduce approach, but we are unable to find a way to achieve it. Doing so would allow us to generate statistics for all columns in one execution. The for-loop method runs sequentially, making it very slow.
Code:
//drops the header
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
def retAtrTuple(x: String) = {
val newX = x.split(",")
for (h <- 0 until newX.length)
yield (h,newX(h))
}
val line = sc.textFile("hdfs://.../myfile.csv")
val withoutHeader: RDD[String] = dropHeader(line)
val kvPairs = withoutHeader.flatMap(retAtrTuple) //generates a key-value pair where key is the column number and value is column's value
var bool_numeric_col = kvPairs.map{case (x,y) => (x,isNumeric(y))}.reduceByKey(_&&_).sortByKey() //this contains column indexes as key and boolean as value (true for numeric and false for string type)
var str_cols = bool_numeric_col.filter{case (x,y) => y == false}.map{case (x,y) => x}
var num_cols = bool_numeric_col.filter{case (x,y) => y == true}.map{case (x,y) => x}
var str_col = str_cols.toArray //array consisting the string col
var num_col = num_cols.toArray //array consisting numeric col
val colCount = kvPairs.map((_,1)).reduceByKey(_+_)
val e1 = colCount.map{case ((x,y),z) => (x,(y,z))}
var numPairs = e1.filter{case (x,(y,z)) => str_col.contains(x) }
//running for loops which needs to be parallelized/optimized as it sequentially operates on each column. Idea is to find the top10, bottom10 and number of distinct elements column wise
for(i <- str_col){
var total = numPairs.filter{case (x,(y,z)) => x==i}.sortBy(_._2._2)
var leastOnes = total.take(10)
println("leastOnes for Col" + i)
leastOnes.foreach(println)
var maxOnes = total.sortBy(-_._2._2).take(10)
println("maxOnes for Col" + i)
maxOnes.foreach(println)
println("distinct for Col" + i + " is " + total.count)
}
Let me simplify your question a bit. (A lot actually.) We have an RDD[(Int, String)] and we want to find the top 10 most common Strings for each Int (which are all in the 0–100 range).
Instead of sorting, as in your example, it is more efficient to use the Spark built-in RDD.top(n) method. Its run-time is linear in the size of the data, and requires moving much less data around than a sort.
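For a single column this is a one-liner; a sketch, assuming counts: RDD[(String, Long)] holds value -> count for that column:
val top10: Array[(String, Long)] = counts.top(10)(Ordering.by[(String, Long), Long](_._2))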
Consider the implementation of top in RDD.scala. You want to do the same, but with one priority queue (heap) per Int key. The code becomes fairly complex:
import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private.
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[[(Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for (((k, v), count) <- items) {
heaps(k) += count -> v
}
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
heaps(k) += cv
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case(count, value) => value })
}
This is efficient, but made painful because we need to work with multiple columns at the same time. It would be way easier to have one RDD per column and just call rdd.top(10) on each.
Unfortunately the naive way to split up the RDD into N smaller RDDs does N passes:
def split(together: RDD[(Int, String)], columns: Int): Seq[RDD[String]] = {
together.cache // We will make N passes over this RDD.
(0 until columns).map {
i => together.filter { case (key, value) => key == i }.values
}
}
A more efficient solution could be to write out the data into separate files by key, then load it back into separate RDDs. This is discussed in Write to multiple outputs by key Spark - one Spark job.
Thanks to @Daniel Darabos for the answer, but there are some mistakes:
mixed use of Map and collection.mutable.Map
withDefault((i: Int) => new Heap(n)) does not create a new Heap when you do heaps(k) += count -> v
mixed usage of parentheses
Here is the modified code:
//import org.apache.spark.util.BoundedPriorityQueue // it's private to Spark, so copy it into your own package and import it from there
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object BoundedPriorityQueueTest {
// https://stackoverflow.com/questions/28166190/spark-column-wise-word-count
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[((Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[collection.mutable.Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap]() // .withDefault((i: Int) => new Heap(n))
for (((k, v), count) <- items) {
println("\n---")
println("before add " + ((k, v), count) + ", the map is: ")
println(heaps)
if (!heaps.contains(k)) {
println("not contains key " + k)
heaps(k) = new Heap(n)
println(heaps)
}
heaps(k) += count -> v
println("after add " + ((k, v), count) + ", the map is: ")
println(heaps)
}
println(heaps)
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: collection.mutable.Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap]() //.withDefault((i: Int) => new Heap(n))
println(heaps)
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
heaps(k) += cv
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case (count, value) => value }).toMap
}
def main(args: Array[String]): Unit = {
Logger.getRootLogger().setLevel(Level.FATAL) //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val conf = new SparkConf().setAppName("word count").setMaster("local[1]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN") //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val words = sc.parallelize(List((1, "s11"), (1, "s11"), (1, "s12"), (1, "s13"), (2, "s21"), (2, "s22"), (2, "s22"), (2, "s23")))
println("# words:" + words.count())
val result = top(1, words)
println("\n--result:")
println(result)
sc.stop()
print("DONE")
}
}

spark join operation based on two columns

I'm trying to join two datasets based on two columns. It works when I use one column, but it fails with the error below:
:29: error: value join is not a member of org.apache.spark.rdd.RDD[(String, String, (String, String, String, String, Double))]
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Code :
import org.apache.spark.rdd.RDD
def zipWithIndex[T](rdd: RDD[T]) = {
val partitionSizes = rdd.mapPartitions(p => Iterator(p.length)).collect
val ranges = partitionSizes.foldLeft(List((0, 0))) { case(accList, count) =>
val start = accList.head._2
val end = start + count
(start, end) :: accList
}.reverse.tail.toArray
rdd.mapPartitionsWithIndex( (index, partition) => {
val start = ranges(index)._1
val end = ranges(index)._2
val indexes = Iterator.range(start, end)
partition.zip(indexes)
})
}
val dimension = sc.
textFile("dimension.txt").
map{ line =>
val parts = line.split("\t")
(parts(0),parts(1),parts(2),parts(3),parts(4),parts(5))
}
val dimensionWithSK =
zipWithIndex(dimension).map { case((nk1,nk2,prop3,prop4,prop5,prop6), idx) => (nk1,nk2,(prop3,prop4,prop5,prop6,idx + nextSurrogateKey)) }
val fact = sc.
textFile("fact.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
(parts(0),parts(1), (parts(2),parts(3), parts(4),parts(5),parts(6).toDouble))
}
val finalFact = fact.join(dimensionWithSK).map { case(nk1,nk2, ((parts1,parts2,parts3,parts4,amount), (sk, prop1,prop2,prop3,prop4))) => (sk,amount) }
Could someone help here?
Thanks,
Sridhar
If you look at the signature of join, it works on an RDD of pairs:
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
You have a triple. I guess you're trying to join on the first 2 elements of the tuple, so you need to map your triple to a pair, where the first element of the pair is a pair containing the first two elements of the triple. E.g., for any types V1 and V2:
val left: RDD[(String, String, V1)] = ??? // some rdd
val right: RDD[(String, String, V2)] = ??? // some rdd
left.map {
case (key1, key2, value) => ((key1, key2), value)
}
.join(
right.map {
case (key1, key2, value) => ((key1, key2), value)
})
This will give you an RDD of the form RDD[((String, String), (V1, V2))].
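Applied to the fact and dimensionWithSK RDDs from the question, a sketch could look like the following (untested; note that in dimensionWithSK as defined above the surrogate key ends up as the last element of the value tuple):
val factPairs = fact.map { case (nk1, nk2, payload) => ((nk1, nk2), payload) }
val dimPairs  = dimensionWithSK.map { case (nk1, nk2, props) => ((nk1, nk2), props) }
val finalFact = factPairs.join(dimPairs).map {
  case ((nk1, nk2), ((parts1, parts2, parts3, parts4, amount),
                     (prop3, prop4, prop5, prop6, sk))) => (sk, amount)
}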
rdd1 Schema :
field1,field2, field3, fieldX,.....
rdd2 Schema :
field1, field2, field3, fieldY,.....
val joinResult = rdd1.join(rdd2,
Seq("field1", "field2", "field3"), "outer")
joinResult schema :
field1, field2, field3, fieldX, fieldY, ......
val emp = sc.
textFile("emp.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
((parts(0), parts(2)),parts(1))
}
val emp_new = sc.
textFile("emp_new.txt").
map { line =>
val parts = line.split("\t")
// we need to output (Naturalkey, (FactId, Amount)) in
// order to be able to join with the dimension data.
((parts(0), parts(2)),parts(1))
}
val finalemp =
emp_new.join(emp).
map { case((nk1,nk2) ,((parts1), (val1))) => (nk1,parts1,val1) }