RDD with (key, (key2, value)) - pyspark

I have an RDD in pyspark of the form (key, other things), where "other things" is a list of fields. I would like to get another RDD that uses a second key from the list of fields. For example, if my initial RDD is:
(User1, 1990 4 2 green...)
(User1, 1990 2 2 green...)
(User2, 1994 3 8 blue...)
(User1, 1987 3 4 blue...)
I would like to get (User1, [(1990, x), (1987, y)]), (User2, [(1994, z)])
where x, y, z would be an aggregation on the other fields, e.g. x is the count of how many rows I have with User1 and 1990 (two in this case), and I get a list with one tuple per year.
I am looking at the key value functions from:
https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
But I don't seem to find anything that will give an aggregation twice: once for user and once for year. My initial attempt was with combineByKey(), but I got stuck getting a list out of the values.
Any help would be appreciated!

You can do the following using groupByKey:
# sample rdd
l = [("User1", "1990"),
     ("User1", "1990"),
     ("User2", "1994"),
     ("User1", "1987")]
rd = sc.parallelize(l)

# returns a list of (year, count) tuples
def f(l):
    dd = {}
    for i in l:
        if i not in dd:
            dd[i] = 1
        else:
            dd[i] += 1
    return list(dd.items())

# using groupByKey and applying the function on x[1] (an iterable of years)
rd1 = rd.groupByKey().map(lambda x: (x[0], f(x[1]))).collect()
[('User1', [('1990', 2), ('1987', 1)]), ('User2', [('1994', 1)])]

Related

How to get count of year using spark scala

I have the following movies data, and I need to get the count of movies in each year, like (2002,2) and (2004,1):
Littlefield, John (I) x House 2002
Houdyshell, Jayne demon State 2004
Houdyshell, Jayne mall in Manhattan 2002
val data=sc.textFile("..line to file")
val dataSplit=data.map(line=>{var d=line.split("\t");(d(0),d(1),d(2))})
What I am unable to understand is this: when I use dataSplit.take(2).foreach(println), I see that d(0) is the first two columns, Littlefield, John (I), which are first name and last name; d(1) is the movie name, such as "x House"; and d(2) is the year. How can I get the count of movies in each year?
Use reduceByKey with the mapped tuple in this way.
val dataSplit = data
.map(line => {var d = line.split("\t"); (d(2), 1)}) // (2002, 1)
.reduceByKey((a, b) => a + b)
// .collect() gives the result: Array((2004,1), (2002,2))
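For reference, here is a self-contained sketch of the same pipeline, using the three sample lines above in place of the text file (the parallelize call is just a stand-in for sc.textFile):
val data = sc.parallelize(Seq(
  "Littlefield, John (I)\tx House\t2002",
  "Houdyshell, Jayne\tdemon State\t2004",
  "Houdyshell, Jayne\tmall in Manhattan\t2002"
))
val counts = data
  .map(line => { val d = line.split("\t"); (d(2), 1) }) // keep only (year, 1)
  .reduceByKey((a, b) => a + b)                         // sum the 1s per year
counts.collect().foreach(println) // prints (2004,1) and (2002,2)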

pairing each RDD value with all the other values in RDD in scala

I am trying to pair each value in the RDD with all the other values of the same RDD. But I am not able to come up with a proper solution.
RDD: pairs of the form (UserId, MovieName::Rating) (the original post showed the RDD contents as an image).
I want to pair the movie names and ratings of each user as below. From that data:
user 1 rated Edison Kinetoscopic.. as 10 and La sortie... as 10
user 2 rated The Arrival .. as 8, Le manoir.. as 7, Edison Kinetoscopic.. as 7 etc...
So, the output should be:
Key: (Edison Kinetoscopic, La sortie des)
Value: (10,10), (7,8) -> since users 1 and 2 both rated these two movies
Key: (The Arrival, Le manoir)
Value: (8,7) -> only user 2 rated these two movies.
Any help appreciated.
If you're trying to build a recommender system, or compute movie-movie similarity, there are much better ways to do this.
However, to solve your problem, you can do the following:
val rdd = sc.parallelize(List(
  (1, "Edison", 10),
  (1, "La sortie", 10),
  (2, "The Arrival", 8),
  (2, "Le manoir", 7),
  (2, "Edison", 7),
  (2, "La sortie", 8),
  (2, "Le voyage", 8),
  (2, "The Great", 7)
))
// first group each user's movies
val pairings = rdd.map { case (user, movie, rating) => (user, List((movie, rating))) }.reduceByKey(_ ++ _)
// then get all pairs for each user
val allPairs = pairings.flatMap { case (user, movieRatings) => (1 until movieRatings.length).flatMap(i => movieRatings.zip(movieRatings drop i)) }
// re-structure each pair so the two movies are in a canonical (alphabetical) order
val finalPairing = allPairs.map { case ((m1, r1), (m2, r2)) => if (m1 < m2) ((m1, m2), List((r1, r2))) else ((m2, m1), List((r2, r1))) }
// group by movie pair
val groupByPair = finalPairing.reduceByKey(_ ++ _)
// look at our pairings
groupByPair.take(100).foreach(println)
ordering the two movie names is needed to guarantee that a given pair of movies always appears in the same order in the tuple, and thus can be grouped. (The original used m1.compareTo(m2) match { case -1 => ... }, but compareTo returns an arbitrary negative or positive number rather than exactly -1, so a plain < comparison is more robust.)
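As an aside, if each user's movie list is sorted by name first, Scala's built-in combinations can replace both the manual index arithmetic and the ordering fix-up; a rough sketch on the same pairings RDD as above:
// sort each user's movies by name, then emit every 2-element combination;
// sorting first means each pair already comes out in canonical order
val pairsViaCombinations = pairings.flatMap { case (_, movieRatings) =>
  movieRatings.sortBy(_._1).combinations(2).map {
    case List((m1, r1), (m2, r2)) => ((m1, m2), List((r1, r2)))
  }
}.reduceByKey(_ ++ _)
pairsViaCombinations.take(100).foreach(println)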

how to merge 2 different rdd in spark using scala

I'm trying to merge 2 RDDs into one. My rdd1 consists of 2 records of 2 elements each, both strings, e.g.:
key_A:value_A and Key_B:value_B
rdd2 consists of 1 record of 2 elements, both of which are strings:
key_C:value_c
My final RDD would look like this:
key_A:value_A, Key_B:value_B, key_C:value_c
We can use the union method of RDD, but it's not working. Please help.
When using union of 2 RDDs, must the rows of the 2 different RDDs contain the same number of elements, or can their sizes differ?
Try with join:
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
See the associated section of the docs
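For illustration, a minimal sketch of what join produces on small pair RDDs (value_C is made-up data here):
val left = sparkContext.parallelize(Seq(("key_A", "value_A"), ("Key_B", "value_B")))
val right = sparkContext.parallelize(Seq(("key_A", "value_C")))
left.join(right).collect() // Array((key_A,(value_A,value_C)))
Note that an inner join only keeps keys present in both RDDs; if your two RDDs have completely disjoint keys (as in the question), the join is empty, and union, as described below, is likely what you actually want.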
union works, as long as both RDDs have the same element type.
Sample code is:
val rdd = sparkContext.parallelize(1 to 10, 3)
val pairRDD = rdd.map { x => (x, x) }
val rdd1 = sparkContext.parallelize(11 to 20, 3)
val pairRDD1 = rdd1.map { x => (x, x) }
pairRDD.union(pairRDD1).foreach(tuple => {
  println(tuple._1)
  println(tuple._2)
})
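To the follow-up question: union does not require the two RDDs to have the same number of records, but they must have the same element type. A sketch with the question's data:
val a = sparkContext.parallelize(Seq(("key_A", "value_A"), ("Key_B", "value_B"))) // 2 records
val b = sparkContext.parallelize(Seq(("key_C", "value_c")))                      // 1 record: sizes may differ
a.union(b).collect() // Array((key_A,value_A), (Key_B,value_B), (key_C,value_c))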

Apache Spark Scala : How to maintain order of values while grouping rdd by key

Maybe I am asking a very basic question, apologies for that, but I didn't find the answer on the internet. I have a paired RDD and want to use something like aggregateByKey to concatenate all the values by key. The value which occurs first in the input RDD should come first in the aggregated RDD.
Input RDD [Int, Int]
2 20
1 10
2 8
2 25
Output RDD (Aggregated RDD)
2 20 8 25
1 10
I tried aggregateByKey and groupByKey; both give me output, but the order of values is not maintained. Please suggest something for this.
Since groupByKey and aggregateByKey indeed cannot preserve order - you'll have to artificially add a "hint" to each record so that you can order by that hint yourself after the grouping:
import org.apache.spark.rdd.RDD

val input = sc.parallelize(Seq((2, 20), (1, 10), (2, 8), (2, 25)))
val withIndex: RDD[(Int, (Long, Int))] = input
.zipWithIndex() // adds index to each record, will be used to order result
.map { case ((k, v), i) => (k, (i, v)) } // restructure into (key, (index, value))
val result: RDD[(Int, List[Int])] = withIndex
.groupByKey()
.map { case (k, it) => (k, it.toList.sortBy(_._1).map(_._2)) } // order values and remove index
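On the sample input, collecting result gives (a quick sanity check, not part of the original answer):
result.collect().foreach(println)
// (1,List(10))
// (2,List(20, 8, 25))  -- the order of the printed keys may vary across runs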

in scala how do we aggregate an Array to determine the count per Key and the percentage vs total

I am trying to find an efficient way to compute the following:
Int1 = 1 or 0, Int2 = 1..k (where k = 3), and Double = 1.0
I want to find how many 1s or 0s there are for every k.
I need to find the percentage that the result for 3 represents of the total size of the Array.
Input is :
val clusterAndLabel = sc.parallelize(Array((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))
So in this example:
I have: (0,0) = 2, (0,1) = 0
I have: (1,0) = 1, (1,1) = 1
I have: (2,1) = 2, (2,0) = 1
Total is 7 instances
I was thinking of doing some aggregation, but I am stuck on the thought that both are considered a 2-key join.
If you want to find how many 1s and 0s there are, you can do:
val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)
this will give you an RDD[((Int,Int),Int)] containing exactly what you described, meaning: [((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]. If you really want them gathered by their first key, you can add this line:
val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
this will yield an RDD[(Int, Iterable[(Int,Int)])] which will look like what you described, i.e.: [(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])].
If you need the number of instances, it looks like (at least in your example) clusterAndLabel.count() should do the work.
I don't really understand question 3; I can see two possible readings:
you want to know how many keys have 3 occurrences. To do so, you can start from the object I called rdd (no need for the groupByKey line) and do:
val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)
this will yield an RDD[(Int,Int)] which is kind of a frequency RDD: the key is the number of occurrences and the value is how many times this key is hit. Here it would look like: [(1,3),(2,2)]. So if you want to know how many pairs occur 3 times, you just do rdd3.filter(_._1==3).collect() (which will be an array of size 0 here, but if it's not empty then it'll have one value, and that will be your answer).
you want to know how many times the first key 3 occurs (once again 0 in your example). Then you start from rdd2 and do:
val rdd3 = rdd2.map(x=>(x._1,x._2.size)).filter(_._1==3).collect()
once again, it will yield either an empty array or an array of size 1 containing how many elements have a 3 for their first key. Note that if you don't need to display rdd2, you can do it directly:
val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()
(for performance, you might also want to do the filter before the reduceByKey!)
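The percentage part of the question is not addressed above; here is a minimal sketch, assuming you want each (cluster, label) count as a percentage of the total number of instances:
val total = clusterAndLabel.count().toDouble // 7 in the example
val percentages = clusterAndLabel
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .mapValues(count => 100.0 * count / total) // share of all instances
percentages.collect().foreach(println) // e.g. ((0,0),28.57...), ((1,1),14.28...)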