Sorting an RDD in Spark - scala

I have a dataset listing general items bought by customers. Each record in the CSV lists the items purchased by one customer, from left to right. For example (shortened sample):
Bicycle, Helmet, Gloves
Shoes, Jumper, Gloves
Television, Hat, Jumper, Playstation 5
I want to load this into an RDD in Scala and count the items.
case class SalesItemSummary(SalesItemDesc: String, SalesItemCount: String)
val rdd_1 = sc.textFile("Data/SalesItems.csv")
val rdd_2 = rdd_1.flatMap(line => line.split(",")).countByValue();
Above is a short code sample. The first line is the case class (not used yet).
Line two reads the data from the CSV into rdd_1. Easy enough.
Line three applies flatMap, splitting each line on the comma, and then counts each value. So, for example, "Gloves" and "Jumper" above would each have the number 2 beside them, and the others 1, in what looks like a collection of tuples.
So far so good.
Next, I want to sort rdd_2 to list the top 3 most purchased items.
Can I do this with an RDD? Or do I need to convert the RDD into a DataFrame to sort it?
If so, how do I do it?
How do I apply the case class in line 1 for example to rdd_2, which seems to be a list of tuples? Should I take this approach?
Thanks in advance

The count in the case class should be an integer. If you want to keep the results as an RDD, I'd suggest using reduceByKey rather than countByValue, which returns a Map[String, Long] to the driver rather than an RDD.
Also, I'd suggest splitting by ", " (a comma followed by a space) rather than "," to avoid leading spaces in the item names.
case class SalesItemSummary(SalesItemDesc: String, SalesItemCount: Int)
val rdd_1 = sc.textFile("Data/SalesItems.csv")
val rdd_2 = rdd_1.flatMap(_.split(", "))
.map((_, 1))
.reduceByKey(_ + _)
.map(line => SalesItemSummary(line._1, line._2))
rdd_2.collect()
// Array[SalesItemSummary] = Array(SalesItemSummary(Gloves,2), SalesItemSummary(Shoes,1), SalesItemSummary(Television,1), SalesItemSummary(Bicycle,1), SalesItemSummary(Helmet,1), SalesItemSummary(Hat,1), SalesItemSummary(Jumper,2), SalesItemSummary(Playstation 5,1))
To sort the RDD, you can use sortBy:
val top3 = rdd_2.sortBy(_.SalesItemCount, false).take(3)
top3
// Array[SalesItemSummary] = Array(SalesItemSummary(Gloves,2), SalesItemSummary(Jumper,2), SalesItemSummary(Shoes,1))
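If only the top few items are needed, `top` with an explicit ordering avoids sorting the whole RDD. The following is a minimal sketch, not the answer's own method. It assumes a local SparkSession and uses inline sample rows in place of Data/SalesItems.csv; the names `lines` and `counts` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell the provided `sc` can be used instead.
val spark = SparkSession.builder.master("local[*]").appName("top3-sketch").getOrCreate()
val sc = spark.sparkContext

// Inline stand-in for the CSV file
val lines = sc.parallelize(Seq(
  "Bicycle, Helmet, Gloves",
  "Shoes, Jumper, Gloves",
  "Television, Hat, Jumper, Playstation 5"
))

val counts = lines
  .flatMap(_.split(",").map(_.trim)) // trim removes any spaces around the commas
  .map((_, 1))
  .reduceByKey(_ + _)

// top(n) returns the n largest elements under the given ordering as a local Array
val top3 = counts.top(3)(Ordering.by[(String, Int), Int](_._2))
top3.foreach(println)
```

Items with equal counts may come back in any relative order, so only the two items with count 2 are guaranteed to appear first.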

Related

Combine two different RDDs with different key in Scala

I have two text files that I have already loaded as RDDs with the SparkContext.
One of them (rdd1) holds related words:
apple,apples
car,cars
computer,computers
The other one (rdd2) holds the number of items:
(apple,12)
(apples, 50)
(car,5)
(cars,40)
(computer,77)
(computers,11)
I want to combine those two RDDs.
Desired output:
(apple, 62)
(car,45)
(computer,88)
How do I code this?
The meat of the work is to pick a key for the related words. Here I just select the first word, but you could do something smarter than picking an arbitrary word.
Explanation:
Create the data
Pick a key for related words
Flatmap the tuples to enable us to join on the key we picked.
Join the RDDs
Map the RDD back into a tuple
Reduce by Key
val s = Seq(("apple","apples"),("car","cars")) // create data
val rdd = sc.parallelize(s)
val t = Seq(("apple",12),("apples", 50),("car",5),("cars",40)) // create data
val rdd2 = sc.parallelize(t)
val keyed = rdd.flatMap { case (a, b) => Seq((a, a), (b, a)) } // could be replaced with any function that selects the key to use for all of the related words
  .join(rdd2) // complete the join
  .map { case (_, (key, count)) => (key, count) } // recreate a tuple and throw away the related word
  .reduceByKey(_ + _)
keyed.foreach(println) // to show it works
Even though this solves your problem, there are more elegant solutions using DataFrames that you may wish to look into. You could also use reduce directly on the RDD and skip the step of mapping back to a tuple. I think that would be a better solution, but I wanted to keep it simple so that it better illustrates what I did.
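As a rough illustration of the DataFrame route mentioned above, the same join can be written with `join`, `groupBy` and `sum`. This is a sketch under assumptions: a local SparkSession, and the column names `word`, `key`, `n` and `total`, none of which come from the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Assumption: a local session for illustration
val spark = SparkSession.builder.master("local[*]").appName("related-words-sketch").getOrCreate()
import spark.implicits._

// Map every related word to its chosen key (here, the first word of each pair)
val related = Seq(("apple", "apples"), ("car", "cars"))
  .flatMap { case (a, b) => Seq((a, a), (b, a)) }
  .toDF("word", "key")

val counts = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40)).toDF("word", "n")

// Join on the word, then add up the counts per key
val totals = related.join(counts, "word")
  .groupBy("key")
  .agg(sum("n").as("total"))

totals.show()
```

Summing an integer column yields a long column, so `total` is read back with `getLong`.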

Counting occurrences of key while keeping several values

I'm having some trouble counting the number of occurrences of a key, while also keeping several values.
Usually I will just do:
val a = file1.map(x => (x, 1)).reduceByKey(_ + _)
which gives the number of occurrences for each key.
However, I also want to keep the values for each occurrence of a key, at the same time as counting the number of occurrences of the key. Something like this:
val a = file1.map(x => (x(1), (x(2), 1)).reduceByKey{case (x,y) => (x._1, y._1, x._2+y._2)}
For example: if the key x(1) is a country and x(2) is a city, I want to keep all the cities in a country, as well as knowing how many cities there are in a country.
It's complicated and redundant to keep the count of the cities together with its list. You can just collect all the cities, and add the size at the end:
It is of course easier if you use the dataframe interface (assuming a dataframe (key:Int, city:String))
import org.apache.spark.sql.{ functions => f}
import spark.implicits._
df.groupBy($"key").
agg(f.collect_set($"city").as("cities")).
withColumn("ncities", f.size($"cities"))
but you can do something similar with a raw RDD (I am assuming input tuples of (id, city)):
rdd.map { case (id, city) => (id, Set(city)) }.
reduceByKey { case (x, y) => x ++ y }.
map { case (id, cities) => (id, cities, cities.size) }
In this case, I would recommend using a DataFrame instead of an RDD, with the groupBy and agg methods.
You can easily convert an RDD to a DataFrame using the toDF function; just make sure you import the implicits first. Example assuming the RDD has two columns:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = rdd.toDF("country", "city")
Then use the groupBy and aggregate the values you want.
import org.apache.spark.sql.functions.{collect_set, count}
df.groupBy("country").agg(collect_set($"city").as("cities"), count($"city").as("count"))
I would also suggest going with DataFrames, as they are optimized and easier to use than RDDs.
But if you want to learn how reduceByKey works (i.e. while keeping other information, such as the city in your case), you can do something like the following.
Let's say you have an RDD such as
val rdd = sc.parallelize(Seq(
("country1", "city1"),
("country1", "city2"),
("country1", "city3"),
("country1", "city3"),
("country2", "city1"),
("country2", "city2")
))
Your attempted reduceByKey would need some modification:
rdd.map(x => (x._1, (Set(x._2), 1))) // Set keeps only distinct cities (you could use a List, Array, or any other collection)
.reduceByKey((x, y) => (x._1 ++ y._1, x._2 + y._2)) // city sets are merged and counts are summed
which should give you
(country2,(Set(city1, city2),2))
(country1,(Set(city1, city2, city3),4))
I hope the answer is helpful
If you want to learn reduceByKey in detail you can check my detailed answer

How to divide a RDD into multiple RDDs according to every father RDD's element

I want to find a way to divide a fatherRDD into multiple RDDs, one for each element of the fatherRDD.
For example, each element of the fatherRDD is a list. I want to split this fatherRDD into many small RDDs, one per element. In other words, if there are n elements in the fatherRDD, I want to get n RDDs.
Two days ago, I wrote a function like this:
def splitRDD(rdd1:RDD[List[(String, String)]]):List[RDD[(String, String)]] ={
var list = List[RDD[(String, String)]] ()
//println(rdd1.take(1).apply(0).apply(0)._1)
rdd1.foreach(x =>{
list = sc.makeRDD(x)::list
})
list
}
I think the problem is that I cannot use sc.makeRDD(x) inside foreach. So how do I divide an RDD into multiple RDDs, one for each element of the father RDD?
Based on your description, it should look like this:
def splitRDD(rdd1:RDD[List[(String, String)]]):List[RDD[(String, String)]] = rdd1.collect().toList.map(x => makeRdd(x))
def makeRdd(ls:List[(String,String)]): RDD[(String, String)] = sc.parallelize(ls)
Try this out on your data. Is that what you want?

Apache Spark's RDD splitting according to the particular size

I am trying to read strings from a text file, but I want to limit each line to a particular size. For example, here is a representation of the file:
aaaaa\nbbb\nccccc
When I read this file with sc.textFile, the RDD looks like this:
scala> val rdd = sc.textFile("textFile")
scala> rdd.collect
res1: Array[String] = Array(aaaaa, bbb, ccccc)
But I want to limit the size of each element of this RDD. For example, if the limit is 3, I should get this:
Array[String] = Array(aaa, aab, bbc, ccc, c)
What is the best-performing way to do that?
Not a particularly efficient solution (not terrible either) but you can do something like this:
import org.apache.spark.RangePartitioner
val pairs = rdd
.flatMap(x => x) // Flatten
.zipWithIndex // Add indices
.keyBy(_._2 / 3) // Key by index / n
// We'll use a range partitioner to minimize the shuffle
val partitioner = new RangePartitioner(pairs.partitions.size, pairs)
pairs
.groupByKey(partitioner) // group
// Sort, drop index, concat
.mapValues(_.toSeq.sortBy(_._2).map(_._1).mkString(""))
.sortByKey()
.values
It is possible to avoid the shuffle by explicitly passing the data required to fill the partitions, but it takes some effort to code. See my answer to Partition RDD into tuples of length n.
If you can accept some misaligned records on partition boundaries, then a simple mapPartitions with grouped should do the trick at much lower cost:
rdd.mapPartitions(_.flatMap(x => x).grouped(3).map(_.mkString("")))
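To see what the function passed to mapPartitions does, it can be run on a plain Scala iterator that stands in for a single partition. This is only a sketch without Spark; the name `partition` is illustrative.

```scala
// Plain Scala, no Spark: one "partition" holding the three sample lines
val partition = Iterator("aaaaa", "bbb", "ccccc")

val chunks = partition
  .flatMap(_.toSeq)   // stream of characters, ignoring line boundaries
  .grouped(3)         // fixed-size groups; the last group may be shorter
  .map(_.mkString)
  .toList

println(chunks) // List(aaa, aab, bbc, ccc, c)
```

With several partitions, each one is chunked independently, so a short chunk can appear at the end of every partition rather than only at the very end. That is the misalignment mentioned above.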
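It is also possible to use a sliding RDD (sliding comes from MLlib's RDDFunctions):
import org.apache.spark.mllib.rdd.RDDFunctions._
rdd.flatMap(x => x).sliding(3, 3).map(_.mkString(""))
You will need to read all the data anyhow. There is not much you can do apart from mapping over each line and trimming it. Note that this keeps only the first three characters of each line and discards the rest:
rdd.map(line => line.take(3)).collect()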

transform rdd into pairRDD

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super easy, but I cannot figure out how to do it.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT: Enlightened by the answer, I will explain my use case in more depth. I have N (unknown at compile time) different pairRDDs (key, value) that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking of?
First of all, if you just want to sum all the integers except the first, the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.first()
val result = rdd.sum() - first
On the other hand, if you want access to the index of each element, you can use the RDD's zipWithIndex method like this:
val indexed = rdd.zipWithIndex() // pairs of (element, index)
indexed.cache()
val result = (indexed.first()._1, indexed.filter(_._2 != 0)) // (first element, all remaining elements)
But in your case this feels like overkill.
One more thing I would add: putting the key as the first element of your RDD looks like a questionable design. Why not just use pairs (key, rdd) in your driver program instead? It is quite hard to reason about the order of elements in an RDD, and I cannot think of a natural situation in which the key is computed as the first element of an RDD (of course, I don't know your use case, so I can only guess).
EDIT
If you have one RDD of key-value pairs and you want to sum them by key, just do:
val result = rdd.reduceByKey(_ + _)
If you have many RDDs of key-value pairs, you can union them before summing:
val list = List(pairRDD0, pairRDD1, pairRDD2)
// another pairRDD arrives at runtime
val newList = anotherPairRDD0::list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
// another pairRDD arrives at runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited the example. As you can see, you can add an additional RDD whenever it comes up at runtime. This works because reduceByKey returns an RDD of the same type, so you can repeat the operation (of course, you will have to consider performance).
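When all N pair RDDs are available at once, `SparkContext.union` takes the whole sequence, so a single reduceByKey is enough. This is a minimal sketch rather than the answer's own code. It assumes a local SparkSession, and the name `pairRDDs` and its sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell the provided `sc` can be used instead.
val spark = SparkSession.builder.master("local[*]").appName("union-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical inputs: a sequence of pair RDDs whose length is only known at runtime
val pairRDDs = Seq(
  sc.parallelize(Seq(("a", 1), ("b", 2))),
  sc.parallelize(Seq(("a", 10), ("c", 3))),
  sc.parallelize(Seq(("b", 20), ("c", 30)))
)

// One union over the whole sequence, then a single reduceByKey
val summed = sc.union(pairRDDs).reduceByKey(_ + _)
summed.collect().sorted.foreach(println)
```

Compared with folding the list with a pairwise union, this builds one union over all inputs instead of a nested chain of them.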