How do I join two RDDs based on a common field? - Scala

I am very new to Scala and learning to work with RDDs. I have two csv files which have the following headers and data:
csv1.txt
id,"location", "zipcode"
1, "a", "12345"
2, "b", "67890"
3, "c" "54321"
csv2.txt
"location_x", "location_y", trip_hrs
"a", "b", 1
"a", "c", 3
"b", "c", 2
"a", "b", 1
"c", "b", 2
Basically, csv1 data is a distinct set of locations and zip codes, whereas csv2 data has the trip duration between location_x and location_y.
The common piece of information in these two data sets is location in csv1 and location_x in csv2 even though they have different header names.
I would like to create two RDDs with one containing the data from csv1 and the other from csv2.
Then I would like to join these two RDDs and return the location, zipcode, and sum of all trip times from that location as shown below:
("a", "zipcode", 5)
("b", "zipcode", 2)
("c", "zipcode", 2)
I was wondering if one of you can assist me with this problem. Thanks.

I will give you the code (a complete app you can run in IntelliJ) with some explanations; I hope it is helpful.
Please read the Spark documentation for the full details, in particular the section on working with key-value pairs:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#working-with-key-value-pairs
This problem can also be solved with Spark DataFrames; you can try that yourself.
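For reference, here is a minimal DataFrame sketch of the same join and aggregation (an illustration only; it assumes the CSV files parse cleanly, and in practice the quoted, space-padded values and header names in the sample data may need extra cleanup). It reuses the path and file names from the RDD app below:
import org.apache.spark.sql.functions._

// read both CSV files with their headers
val locationsDF = spark.read.option("header", "true").csv(s"${path}join1.csv")
val tripsDF     = spark.read.option("header", "true").csv(s"${path}join2.csv")

val totals = tripsDF
  .groupBy(trim(col("location_x")).as("location"))             // normalize the join key
  .agg(sum(trim(col("trip_hrs")).cast("int")).as("total_hrs")) // sum all trip hours per location

locationsDF
  .select(trim(col("location")).as("location"), trim(col("zipcode")).as("zipcode"))
  .join(totals, Seq("location"))
  .show()
// expected, per the question: (a, 12345, 5), (b, 67890, 2), (c, 54321, 2)
The RDD-based version follows.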
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Joining {

  val spark = SparkSession
    .builder()
    .appName("Joining")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "Joining")           // to silence the Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  val path = "/home/cloudera/files/tests/"

  def main(args: Array[String]): Unit = {

    Logger.getRootLogger.setLevel(Level.ERROR)

    try {
      // read the files
      val file1 = sc.textFile(s"${path}join1.csv")
      val header1 = file1.first // extract the header of the file
      val file2 = sc.textFile(s"${path}join2.csv")
      val header2 = file2.first // extract the header of the file

      val rdd1 = file1
        .filter(line => line != header1)         // leave out the header
        .map(line => line.split(","))            // split the lines => Array[String]
        .map(arr => (arr(1).trim, arr(2).trim))  // make up a pair RDD with arr(1) (location) as key and the zipcode as value

      val rdd2 = file2
        .filter(line => line != header2)
        .map(line => line.split(","))                 // split the lines => Array[String]
        .map(arr => (arr(0).trim, arr(2).trim.toInt)) // make up a pair RDD with arr(0) (location_x) as key and trip_hrs as value

      val joined = rdd1 // join the pair RDDs by their keys
        .join(rdd2)
        .cache()        // cache joined in memory

      joined.foreach(println) // checking data
      println("**************")
      // ("c",("54321",2))
      // ("b",("67890",2))
      // ("a",("12345",1))
      // ("a",("12345",3))
      // ("a",("12345",1))

      val result = joined.reduceByKey { case ((zip, time), (_, time1)) => (zip, time + time1) }

      result.map { case (id, (zip, time)) => (id, zip, time) }.foreach(println) // checking output
      // ("b","67890",2)
      // ("c","54321",2)
      // ("a","12345",5)

      // To have the opportunity to view the Spark web console: http://localhost:4041/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}

If you can already read the CSV files into RDDs, the trips can be summarized and then joined with the locations:
val tripsSummarized = trips
  .map { case (location, _, hours) => (location, hours) }
  .reduceByKey((hoursTotal, hoursIncrement) => hoursTotal + hoursIncrement)

val result = locations
  .map { case (_, location, zipCode) => (location, zipCode) }
  .join(tripsSummarized)
  .map { case (location, (zipCode, hoursTotal)) => (location, zipCode, hoursTotal) }
If locations without trips are required, "leftOuterJoin" can be used.
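For example, a minimal sketch of that variant, reusing the locations and tripsSummarized RDDs from above; locations with no trips get a total of 0:
val resultWithAllLocations = locations
  .map { case (_, location, zipCode) => (location, zipCode) }
  .leftOuterJoin(tripsSummarized)                // values become (zipCode, Option[hoursTotal])
  .map { case (location, (zipCode, maybeHours)) =>
    (location, zipCode, maybeHours.getOrElse(0)) // default to 0 when a location has no trips
  }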

Related

Spark 2.x: parallel processing of a large number of records with mapPartitions

I am trying to use Spark mapPartitions with Datasets [Spark 2.x] to copy a large list of files [1 million records] from one location to another in parallel.
However, at times I see that one record gets copied multiple times.
The idea is to split the 1 million files across a number of partitions (here, 24), then perform the copy operation in parallel for each partition, and finally collect the result from each partition for further actions.
Can someone please tell me what I am doing wrong?
import scala.collection.mutable.ListBuffer
import org.apache.spark.TaskContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

def process(spark: SparkSession): DataFrame = {
  import spark.implicits._

  // Get the source and target list for 1 million records
  val sourceAndTargetList =
    List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))

  // convert the list to a dataframe with 24 partitions
  val SourceTargetDataSet =
    sourceAndTargetList.toDF.repartition(24).as[(String, String)]

  var dfBuffer = new ListBuffer[DataFrame]()

  dfBuffer += SourceTargetDataSet
    .mapPartitions(partition => {
      println("partition id: " + TaskContext.getPartitionId)
      // for each partition
      val result = partition
        .map(row => {
          val source = row._1
          val target = row._2
          val copyStatus = copyFiles(source, target) // function to copy files that returns a boolean
          val dataframeRow = (target, copyStatus)
          dataframeRow
        })
        .toList
      result.toIterator
    })
    .toDF()

  val dfList = dfBuffer.toList
  val newDF = dfList.tail.foldLeft(dfList.head)(
    (accDF, newDF) => accDF.join(newDF, Seq("_1"))
  )
  println("newDF Count " + newDF.count)
  newDF
}
Update 2: I changed the function as shown below, and so far it is giving me consistent results as expected. May I know what I was doing wrong, and whether I am getting the required parallelization with the function below? If not, how can this be optimized?
def process(spark: SparkSession): DataFrame = {
  import spark.implicits._

  // Get the source and target list for 1 million records
  val sourceAndTargetList =
    List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))

  // convert the list to a dataframe with 24 partitions
  val SourceTargetDataSet =
    sourceAndTargetList.toDF.repartition(24).as[(String, String)]

  val iterator = SourceTargetDataSet.toDF
    .mapPartitions(
      (it: Iterator[Row]) =>
        it.toList
          .map(row => {
            println(row)
            val source = row.toString.split(",")(0).drop(1)
            val target = row.toString.split(",")(1).dropRight(1)
            println("source : " + source)
            println("target: " + target)
            val copyStatus = copyFiles() // function to copy files that returns a boolean
            val dataframeRow = (target, copyStatus)
            dataframeRow
          })
          .iterator
    )
    .toLocalIterator

  val df = iterator.toList.toDF("targetKey", "copyStatus")
  df
}
One should avoid performing side-effecting write operations inside map transformations, because they can be replayed when an executor dies and the same partition has to be processed by another executor.
I'd choose foreach instead.
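For illustration, a rough sketch of that idea (not the poster's code): do the copies inside foreachPartition, which is an action, and track failures with an accumulator instead of building a DataFrame inside map. copyFiles and SourceTargetDataSet are the names from the question:
val failedCopies = spark.sparkContext.longAccumulator("failedCopies")

SourceTargetDataSet.foreachPartition { partition: Iterator[(String, String)] =>
  partition.foreach { case (source, target) =>
    val ok = copyFiles(source, target) // the poster's copy helper, assumed to return Boolean
    if (!ok) failedCopies.add(1)       // count failures via the accumulator
  }
}

println("failed copies: " + failedCopies.value)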

How to split one row into several rows according to the number of map keys

I am trying to read a dataset and process it. The dataset row type is (String, String, String, Map[String, String]), and the number of map keys ranges from 1 to 3, so one row will become 1-3 rows like (String, String, String, k, v).
I currently do it with the following code:
import scala.collection.mutable.ArrayBuffer

var arr = new ArrayBuffer[Array[String]]()
myDataset.collect.foreach {
  f: (String, String, String, Map[String, String]) =>
    val ma = f._4
    for ((k, v) <- ma) {
      arr += Array(f._1, f._2, f._3, k, v)
    }
}
The original data looks like this (my dataset has hundreds of millions of rows):
val a = ("111","222","333",Map("k1"->"v1","k2"->"v2"))
expected output:
("111","222","333","k1","v1")
("111","222","333","k2","v2")
But with big data this causes an OOM problem, so is there another way to accomplish this, or how can I optimize my code to avoid the OOM?
You can just explode the map column and then select the exploded columns:
import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = sc.parallelize(Array(
  ("111", "222", "333", Map("k1" -> "v1", "k2" -> "v2"))
)).toDF("a", "b", "c", "d")

df.select($"*", explode($"d"))
  .select("a", "b", "c", "key", "value")
  .as[(String, String, String, String, String)]
  .first
// (String, String, String, String, String) = (111,222,333,k1,v1)
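If you would rather stay in the typed Dataset API and avoid collect (the likely cause of the OOM in the original code), a flatMap works too. A minimal sketch, assuming myDataset has the (String, String, String, Map[String, String]) row type from the question:
import spark.implicits._ // tuple encoder for the result

val exploded = myDataset.flatMap { case (a, b, c, m) =>
  m.map { case (k, v) => (a, b, c, k, v) } // one output row per map entry
}
// exploded: Dataset[(String, String, String, String, String)]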

How to perform a join on two files within the same RDD loaded using wholeTextFiles()

I am fairly new to spark-scala so please don't mind if this is a beginner question.
I have a directory test which contains two files, input1.txt and input2.txt.
Now, let's say I create an RDD called inputRDD using
val inputRDD = sc.wholeTextFiles("/home/hduser/test")
which loads both files into the pair RDD (inputRDD).
Based on my understanding, inputRDD contains the file name as the key and the contents as the value,
something like this:
(input1.txt,contents of input1.txt)
(input2.txt,contents of input2.txt)
Now, let's say I have to perform a join on both files this way (they are in the same RDD), based on the first column.
contents of input1.txt
----------------------
1 a
1 b
2 c
2 d
contents of input2.txt
----------------------
1 e
2 f
3 g
How can I do that?
You first need to split your content and then do a reduceByKey to build your join. Something like below:
val outputRDD = inputRDD.mapPartitions(iter => {
  iter.flatMap(path_content => {
    // split the file content into lines, then each line into (key, value)
    path_content._2.split("\n").map { line =>
      val splittedStr = line.split(" ")
      // outputs (1, a) (1, b) (2, c) ...
      (splittedStr(0), splittedStr(1))
    }
  })
}).reduceByKey(_ + _) // this outputs e.g. (1, abe)
If you have only two files in your test directory and the filenames are known, then you can separate the text of the two files into two RDDs and use join as below:
val rdd1 = inputRDD.filter(tuple => tuple._1.contains("input1.txt"))
  .flatMap(tuple => tuple._2.split("\n"))
  .map(line => line.split(" "))
  .map(array => (array(0), array(1)))

val rdd2 = inputRDD.filter(tuple => tuple._1.contains("input2.txt"))
  .flatMap(tuple => tuple._2.split("\n"))
  .map(line => line.split(" "))
  .map(array => (array(0), array(1)))

rdd1.join(rdd2).foreach(println)
You should have output as
(2,(c,f))
(2,(d,f))
(1,(a,e))
(1,(b,e))
I hope this is what you desire
Updated
If there are two files in the test directory whose names are unknown, then you can avoid the wholeTextFiles API and use the textFile API to read them as separate RDDs and join them as above. But for that you will have to write a function to list the files.
import java.io.File

def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}

val fileList = getListOfFiles("/home/hduser/test")

val rdd1 = sc.textFile(fileList(0).getPath)
  .map(line => line.split(" "))
  .map(array => (array(0), array(1)))

val rdd2 = sc.textFile(fileList(1).getPath)
  .map(line => line.split(" "))
  .map(array => (array(0), array(1)))

rdd1.join(rdd2).foreach(println)

How to filter the data in spark-shell using scala?

I have the data below, which needs to be filtered using Spark (Scala) so that I only get the id of a person who visited "Walmart" but not "Bestbuy". A store might appear repeatedly, because a person can visit a store any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I have got the output using DataFrames and running SQL queries on the SQL context. But is there any way to do this using groupByKey/reduceByKey etc. without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD is formed and I am facing difficulty in filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(id: Int, store: String)

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1)))

people.registerTempTable("people")

val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.store='Walmart'")
The code which I am trying now is this, but I am stuck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .map(x => (x.split(",")(0), x.split(",")(1)))
  .filter { case (id, _) => id != "id" } // drop the header row

val dataGroup = data.groupByKey()

val dataFiltered = dataGroup.map { case (x, y) =>
  val url = y.flatMap(x => x.split(",")).toList
  if (!url.contains("Bestbuy") && url.contains("Walmart")) {
    x.map(x => (x, y))
  }
}
If I do dataFiltered.collect(), I get
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
  // keep only lists that contain Walmart but do not contain Bestbuy:
  case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))

// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3,Walmart)
I also tried it another way and it worked out
val data = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.contains("id")) // drop the header row
  .map(x => (x.split(",")(0), x.split(",")(1)))

data.cache()

val dataWalmart = data.filter { case (x, y) => y.contains("Walmart") }.distinct()
val dataBestbuy = data.filter { case (x, y) => y.contains("Bestbuy") }.distinct()

val result = dataWalmart.subtractByKey(dataBestbuy)

data.unpersist()

Scala spark reduce by key and find common value

I have a file of CSV data stored as a sequenceFile on HDFS, in the format name, zip, country, fav_food1, fav_food2, fav_food3, fav_colour. There could be many entries with the same name, and I need to find out what their favourite food is (i.e. count all the food entries in all the records with that name and return the most popular one). I am new to Scala and Spark, have gone through multiple tutorials and scoured the forums, but am stuck as to how to proceed. So far I have converted the sequence file's Text values into Strings and then filtered out the entries.
Here is the sample data entries one to a line in the file
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
So the output should be the tuple (Bob, Soda), since Soda appears the most times across Bob's entries.
import org.apache.hadoop.io._

var lines = sc.sequenceFile("path", classOf[LongWritable], classOf[Text]).values.map(x => x.toString())
// converted to String since I could not get filter to run on Text, and dropped the LongWritable

var filtered = lines.filter(_.split(",")(0) == "Bob");
// removed entries for all other users

var f_tuples = filtered.map(line => line.split(","));
// split all the values

var f_simple = f_tuples.map(line => (line(0), (line(3), line(4), line(5))));
// removed unnecessary fields
The issue I have now is that I think I have this [<name, [f, f, f]>] structure and don't really know how to proceed to flatten it out and get the most popular food. I need to combine all the entries so that I have one entry per name, and then get the most common element in the value. Any help would be appreciated. Thanks.
I tried this to flatten it out, but it seems the more I try, the more convoluted the data structure becomes.
var f_trial = fpairs.groupBy(_._1).mapValues(_.map(_._2))
// the resulting structure was of type org.apache.spark.rdd.RDD[(String, Iterable[(String, String, String)])]
Here is what a println of a record looks like after f_trial:
("Bob", List((Pizza, Soda,), (Chocolate, Cheese, Soda), (Chocolate, Pizza, Soda)))
Parenthesis Breakdown
("Bob",
List(
(Pizza, Soda, <missing value>),
(Chocolate, Cheese, Soda),
(Chocolate, Pizza, Soda)
) // ends List paren
) // ends first paren
I found time. Setup:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val data = """
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
""".trim
val records = sc.parallelize(data.split('\n'))
Extract the food choices, and for each make a tuple of ((name, food), 1)
val r2 = records.flatMap { r =>
  val Array(name, id, country, food1, food2, food3, color) = r.split(',');
  List(((name, food1), 1), ((name, food2), 1), ((name, food3), 1))
}
Total up each name/food combination:
val r3 = r2.reduceByKey((x, y) => x + y)
Remap so that the name (only) is the key
val r4 = r3.map { case ((name, food), total) => (name, (food, total)) }
Pick the food with the largest count at each step
val res = r4.reduceByKey((x, y) => if (y._2 > x._2) y else x)
And we're done
println(res.collect().mkString)
//(Mary,(Chips,1))(Bob,(Soda,3))
EDIT: To collect all the food items that have the same top count for a person, we just change the last two lines:
Start with a List of items with total:
val r5 = r3.map { case ((name, food), total) => (name, (List(food), total)) }
In the equal case, concatenate the list of food items with that score
val res2 = r5.reduceByKey((x, y) =>
  if (y._2 > x._2) y
  else if (y._2 < x._2) x
  else (y._1 ::: x._1, y._2))
//(Mary,(List(Chocolate, Pasta, Chips),1))
//(Bob,(List(Soda),3))
If you want the top 3, say, then use aggregateByKey to assemble a list of the favorite foods per person instead of the second reduceByKey, as sketched below.
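A rough sketch of that suggestion, reusing r3 (the RDD of ((name, food), total) pairs built above); it gathers all (food, total) pairs per name and keeps the three with the highest counts:
val topN = 3
val topFoods = r3
  .map { case ((name, food), total) => (name, (food, total)) }
  .aggregateByKey(List.empty[(String, Int)])(
    (acc, foodTotal) => foodTotal :: acc, // within a partition: prepend each (food, total) pair
    (acc1, acc2) => acc1 ::: acc2         // across partitions: concatenate the partial lists
  )
  .mapValues(_.sortBy(-_._2).take(topN))  // keep the top-N foods by count, per person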
Solutions provided by Paul and mattinbits shuffle your data twice - once to perform reduce-by-name-and-food and once to reduce-by-name. It is possible to solve this problem with only one shuffle.
/** Generate key-food_count pairs from a split line **/
def bitsToKeyMapPair(xs: Array[String]): (String, Map[String, Long]) = {
  val key = xs(0)
  val map = xs
    .drop(3)                  // drop name..country
    .take(3)                  // take the foods
    .filter(_.trim.size != 0) // ignore empty entries
    .map((_, 1L))             // generate k-v pairs
    .toMap                    // convert to Map
    .withDefaultValue(0L)     // set default
  (key, map)
}

/** Combine two count maps **/
def combine(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
  (m1.keys ++ m2.keys).map(k => (k, m1(k) + m2(k))).toMap.withDefaultValue(0L)
}

val n: Int = ??? // number of favorites per user

val records = lines.map(line => bitsToKeyMapPair(line.split(",")))
records.reduceByKey(combine).mapValues(_.toSeq.sortBy(-_._2).take(n))
If you're not a purist you can replace scala.collection.immutable.Map with scala.collection.mutable.Map to further improve performance.
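A minimal sketch of what that mutable variant of combine could look like (an illustration, not the author's code; bitsToKeyMapPair would then need to build a mutable.Map as well):
import scala.collection.mutable

/** Merge m2 into m1 in place, avoiding a fresh map allocation on every merge **/
def combineMutable(m1: mutable.Map[String, Long],
                   m2: mutable.Map[String, Long]): mutable.Map[String, Long] = {
  m2.foreach { case (k, v) => m1(k) = m1.getOrElse(k, 0L) + v }
  m1
}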
Here's a complete example:
import org.apache.spark.{SparkContext, SparkConf}

object Main extends App {

  val data = List(
    "Bob,123,USA,Pizza,Soda,,Blue",
    "Bob,456,UK,Chocolate,Cheese,Soda,Green",
    "Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
    "Mary,68,USA,Chips,Pasta,Chocolate,Blue")

  val sparkConf = new SparkConf().setMaster("local").setAppName("example")
  val sc = new SparkContext(sparkConf)

  val lineRDD = sc.parallelize(data)

  val pairedRDD = lineRDD.map { line =>
    val fields = line.split(",")
    (fields(0), List(fields(3), fields(4), fields(5)).filter(_ != ""))
  }.filter(_._1 == "Bob")
  /* pairedRDD.collect().foreach(println)
  (Bob,List(Pizza, Soda))
  (Bob,List(Chocolate, Cheese, Soda))
  (Bob,List(Chocolate, Pizza, Soda))
  */

  val flatPairsRDD = pairedRDD.flatMap {
    case (name, foodList) => foodList.map(food => ((name, food), 1))
  }
  /* flatPairsRDD.collect().foreach(println)
  ((Bob,Pizza),1)
  ((Bob,Soda),1)
  ((Bob,Chocolate),1)
  ((Bob,Cheese),1)
  ((Bob,Soda),1)
  ((Bob,Chocolate),1)
  ((Bob,Pizza),1)
  ((Bob,Soda),1)
  */

  val nameFoodSumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)
  /* nameFoodSumRDD.collect().foreach(println)
  ((Bob,Cheese),1)
  ((Bob,Soda),3)
  ((Bob,Pizza),2)
  ((Bob,Chocolate),2)
  */

  val resultsRDD = nameFoodSumRDD.map {
    case ((name, food), count) => (name, (food, count))
  }.groupByKey.map {
    case (name, foodCountList) => (name, foodCountList.toList.sortBy(_._2).reverse.head)
  }

  resultsRDD.collect().foreach(println)
  /*
  (Bob,(Soda,3))
  */

  sc.stop()
}
}