How to do 2 distinct groupBy conditions on the same data frame in Scala?

I have a data frame on which I need to do two different groupBys.
+----+------+-------+-------+----------------------+
| id | type | item  | value | timestamp            |
+----+------+-------+-------+----------------------+
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | rent | dvd   | 12    | 2016-09-19T00:00:00Z |
| 1  | buy  | tv    | 12    | 2016-09-20T00:00:00Z |
| 1  | rent | movie | 12    | 2016-09-20T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
| 1  | buy  | movie | 12    | 2016-09-18T00:00:00Z |
+----+------+-------+-------+----------------------+
I would like to get the result as:
id : 1
totalValue : 72 --- grouped by id
typeCount : {"rent" : 3, "buy" : 3} --- grouped by id
itemCount : {"dvd" : 2, "tv" : 1, "movie" : 3} --- grouped by id
typeForDay : {"rent" : 2, "buy" : 2} --- grouped by id and dayofmonth(col("timestamp")), at most 1 type per day
I tried:
val count_by_value = udf { (listValues: scala.collection.mutable.WrappedArray[String]) =>
  if (listValues == null) null else listValues.groupBy(identity).mapValues(_.size)
}

val group1 = df.groupBy("id")
  .agg(collect_list("type"), sum("value") as "totalValue", collect_list("item"))
val group1Result = group1
  .withColumn("typeCount", count_by_value($"collect_list(type)"))
  .drop("collect_list(type)")
  .withColumn("itemCount", count_by_value($"collect_list(item)"))
  .drop("collect_list(item)")

val group2 = df.groupBy(col("id"), dayofmonth(col("timestamp"))).agg(collect_set("type"))
val group2Result = group2
  .withColumn("typeForDay", count_by_value($"collect_set(type)"))
  .drop("collect_set(type)")

val groupedResult = group1Result.join(group2Result, "id").show()
But it takes a long time. Is there a more efficient way of doing this?

A better approach is to add each grouping field to the key and reduce with reduceByKey() instead of using groupBy(). You can use these:
df1.map(rec => (rec(0), rec(3).toString.toInt)).
  reduceByKey(_ + _).take(5).foreach(println)
=> (1,72)

df1.map(rec => ((rec(0), rec(1)), 1)).
  reduceByKey(_ + _).
  map(x => (x._1._1, x._1._2, x._2)).take(5).foreach(println)
=> (1,rent,3)
   (1,buy,3)

df1.map(rec => ((rec(0), rec(2)), 1)).
  reduceByKey(_ + _).
  map(x => (x._1._1, x._1._2, x._2)).take(5).foreach(println)
=> (1,dvd,2)
   (1,tv,1)
   (1,movie,3)

df1.map(rec => ((rec(0), rec(1), rec(4).toString.substring(8, 10)), 1)).
  reduceByKey(_ + _).
  map(x => (x._1._1, x._1._2, x._1._3, x._2)).take(5).foreach(println)
=> (1,rent,19,2)
   (1,buy,20,1)
   (1,buy,18,2)
   (1,rent,20,1)
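If you also need the per-id maps from the question (e.g. typeCount), one way to finish this off is a second reduceByKey that merges one-entry maps. This is only a sketch under the same assumptions as above (df1.map yielding an RDD of rows), not part of the original answer:
// collapse (id, type) counts into one Map per id, e.g. (1, Map(rent -> 3, buy -> 3))
val typeCountPerId = df1.map(rec => ((rec(0), rec(1).toString), 1)).
  reduceByKey(_ + _).                                  // count per (id, type)
  map { case ((id, typ), n) => (id, Map(typ -> n)) }.  // wrap each count in a one-entry map
  reduceByKey(_ ++ _)                                  // merge the maps per id
typeCountPerId.take(5).foreach(println)
// expected for the sample data: (1,Map(rent -> 3, buy -> 3))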

Related

Spark merge sets of common elements

I have a DataFrame that looks like this:
+-----------+-----------+
| Package | Addresses |
+-----------+-----------+
| Package 1 | address1 |
| Package 1 | address2 |
| Package 1 | address3 |
| Package 2 | address3 |
| Package 2 | address4 |
| Package 2 | address5 |
| Package 2 | address6 |
| Package 3 | address7 |
| Package 3 | address8 |
| Package 4 | address9 |
| Package 5 | address9 |
| Package 5 | address1 |
| Package 6 | address10 |
| Package 7 | address8 |
+-----------+-----------+
I need to find all the addresses that were seen together across different packages. Example output:
+----+------------------------------------------------------------------------+
| Id | Addresses |
+----+------------------------------------------------------------------------+
| 1 | [address1, address2, address3, address4, address5, address6, address9] |
| 2 | [address7, address8] |
| 3 | [address10] |
+----+------------------------------------------------------------------------+
So, I have a DataFrame. I'm grouping it by package (using combineByKey instead of groupBy):
val rdd = packages.select($"package", $"address").
  map { x =>
    (x(0).toString, x(1).toString)
  }.rdd.combineByKey(
    (source) => Set[String](source),
    (acc: Set[String], v) => acc + v,
    (acc1: Set[String], acc2: Set[String]) => acc1 ++ acc2
  )
Then, I'm merging rows that have common addresses:
val result = rdd.treeAggregate(
  Set.empty[Set[String]]
)(
  (map: Set[Set[String]], row) => {
    val vals = row._2
    val sets = map + vals
    // copy-paste from here https://stackoverflow.com/a/25623014/772249
    sets.foldLeft(Set.empty[Set[String]])((cum, cur) => {
      val (hasCommon, rest) = cum.partition(s => (s & cur).nonEmpty)
      rest + (cur ++ hasCommon.flatten)
    })
  },
  (map1, map2) => {
    val sets = map1 ++ map2
    // copy-paste from here https://stackoverflow.com/a/25623014/772249
    sets.foldLeft(Set.empty[Set[String]])((cum, cur) => {
      val (hasCommon, rest) = cum.partition(s => (s & cur).nonEmpty)
      rest + (cur ++ hasCommon.flatten)
    })
  },
  10
)
But, no matter what I do, treeAggregate takes very long and I can't finish a single task. The raw data size is about 250 GB. I've tried different clusters, but treeAggregate still takes too long.
Everything before the treeAggregate works fine, but it gets stuck after that.
I've tried different values of spark.sql.shuffle.partitions (default, 2000, 10000), but it doesn't seem to matter.
I've tried different depths for treeAggregate, but didn't notice any difference.
Related questions:
Merge Sets of Sets that contain common elements in Scala
Spark complex grouping
Take a look at your data as if it were a graph where addresses are vertices, and two addresses are connected if there is a package containing both of them. Then the solution to your problem is the connected components of that graph.
Spark's GraphX library has an optimized connectedComponents function. For each vertex it returns the id of the connected component it belongs to; think of that id as the id of the group.
Then, having that id, you can collect all other addresses connected to it if needed.
Have a look at this article for how they use graphs to achieve the same kind of grouping.
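A minimal GraphX sketch of that idea (my own illustration, not from the answer; it assumes the packages DataFrame from the question and that the set of distinct addresses fits in driver memory):
import org.apache.spark.graphx.{Edge, Graph}

// give every distinct address a Long vertex id, since GraphX requires numeric vertex ids
val addrIds = packages.select($"address").distinct.rdd.map(_.getString(0)).zipWithUniqueId()
val idByAddr = addrIds.collectAsMap()   // assumption: the distinct address set fits in memory

// inside each package, link every address to the package's first address
// (the self-loop on the first address keeps single-address packages in the graph)
val edges = packages.select($"package", $"address").rdd
  .map(r => (r.getString(0), r.getString(1)))
  .groupByKey()
  .flatMap { case (_, addrs) =>
    val ids = addrs.map(idByAddr).toSeq
    ids.map(dst => Edge(ids.head, dst, ()))
  }

val components = Graph.fromEdges(edges, defaultValue = ()).connectedComponents().vertices

// map component ids back to address strings and group them
val grouped = addrIds.map(_.swap).join(components)
  .map { case (_, (addr, comp)) => (comp, addr) }
  .groupByKey()
grouped.take(3).foreach(println)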

Spark - Iterating through all rows in dataframe comparing multiple columns for each row against another

| match_id | player_id | team | win |
| 0 | 1 | A | A |
| 0 | 2 | A | A |
| 0 | 3 | B | A |
| 0 | 4 | B | A |
| 1 | 1 | A | B |
| 1 | 4 | A | B |
| 1 | 8 | B | B |
| 1 | 9 | B | B |
| 2 | 8 | A | A |
| 2 | 4 | A | A |
| 2 | 3 | B | A |
| 2 | 2 | B | A |
I have a dataframe that looks like the above.
I need to create (key, value) pairs such that for every match:
(k => (player_id_1, player_id_2), v => 1), if player_id_1 wins against player_id_2 in a match
and
(k => (player_id_1, player_id_2), v => 0), if player_id_1 loses against player_id_2 in a match
I will thus have to iterate through the entire data frame, comparing each player id to the others based on the other 3 columns.
I am planning to achieve this as follows:
Group by match_id.
In each group, for a player_id, check the following against the other player_ids:
a. If match_id is the same and team is different, then
   if team = win:
   (k => (player_id_1, player_id_2), v => 1)
   else (team != win):
   (k => (player_id_1, player_id_2), v => 0)
For example, after partitioning by matches, consider the first match (match_id 0):
player_id 1 needs to be compared to player_ids 2, 3 and 4.
While iterating, the record for player_id 2 will be skipped because the team is the same;
for player_id 3, since the team is different, team and win will be compared.
As player_id 1 was in team A, player_id 3 was in team B and team A won, the key-value pair formed would be
((1,3),1)
I have a fair idea of how to achieve this in imperative programming, but I am really new to Scala and functional programming and can't work out how, while iterating through every row, to create a (key, value) pair based on checks against the other fields.
I tried my best to explain the problem. Please let me know if any part of my question is unclear; I would be happy to explain it further. Thank you.
P.S: I am using Spark 1.6
This can be achieved using the DataFrame API as shown below.
Dataframe API version:
val df = Seq(
  (0,1,"A","A"), (0,2,"A","A"), (0,3,"B","A"), (0,4,"B","A"),
  (1,1,"A","B"), (1,4,"A","B"), (1,8,"B","B"), (1,9,"B","B"),
  (2,8,"A","A"), (2,4,"A","A"), (2,3,"B","A"), (2,2,"B","A")
).toDF("match_id", "player_id", "team", "win")

val result = df.alias("left")
  .join(df.alias("right"), $"left.match_id" === $"right.match_id" && not($"right.team" === $"left.team"))
  .select($"left.player_id", $"right.player_id", when($"left.team" === $"left.win", 1).otherwise(0).alias("flag"))
scala> result.collect().map(x => (x.getInt(0),x.getInt(1)) -> x.getInt(2)).toMap
res4: scala.collection.immutable.Map[(Int, Int),Int] = Map((1,8) -> 0, (3,4) -> 0, (3,1) -> 0, (9,1) -> 1, (4,1) -> 0, (8,1) -> 1, (2,8) -> 0, (8,3) -> 1, (1,9) -> 0, (1,4) -> 1, (8,2) -> 1, (4,9) -> 0, (3,2) -> 0, (1,3) -> 1, (4,8) -> 0, (4,2) -> 1, (2,4) -> 1, (8,4) -> 1, (2,3) -> 1, (4,3) -> 1, (9,4) -> 1, (3,8) -> 0)
SPARK SQL version:
df.registerTempTable("data_table")
val result = sqlContext.sql("""
SELECT DISTINCT t0.player_id, t1.player_id, CASE WHEN t0.team == t0.win THEN 1 ELSE 0 END AS flag FROM data_table t0
INNER JOIN data_table t1
ON t0.match_id = t1.match_id
AND t0.team != t1.team
""")

SPARK : groupByKey vs reduceByKey which is better and efficient to combine the Maps?

I have a data frame [df] :
+----+----------+-------+
| id | itemName | Value |
+----+----------+-------+
| 1  | TV       | 4     |
| 1  | Movie    | 5     |
| 2  | TV       | 6     |
+----+----------+-------+
I am trying to transform it to :
{id : 1, itemMap : { "TV" : 4, "Movie" : 5}}
{id : 2, itemMap : {"TV" : 6}}
I want the final result to be in RDD[(String, String)] with itemMap as the value's name
So I am doing :
case class Data(itemMap: Map[String, Int]) extends Serializable

df.map { r =>
  val id = r.getAs[String]("id")
  val itemName = r.getAs[String]("itemName")
  val value = r.getAs[Int]("Value")
  (id, Map(itemName -> value))
}.reduceByKey((x, y) => x ++ y).map {
  case (k, v) => (k, JacksonUtil.toJson(Data(v)))
}
But it takes forever to run. Is it efficient to use reduceByKey here? Or can I use groupByKey? Is there any other efficient way to do the transformation?
My config:
I have 10 slaves and a master of type r3.8xlarge
spark.driver.cores 30
spark.driver.memory 200g
spark.executor.cores 16
spark.executor.instances 40
spark.executor.memory 60g
spark.memory.fraction 0.95
spark.yarn.executor.memoryOverhead 2000
Is this the correct type of machine for this task ?
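No answer is included here, but one common variant worth noting (my own sketch, not from the original thread) is aggregateByKey, which folds each record into a per-key map directly instead of allocating a one-entry Map per record and merging with ++; groupByKey, by contrast, would ship every record across the shuffle before any map is built:
// mirrors the question's df.map usage (Spark 1.6-style RDD[Row]); Data and JacksonUtil as defined above
val itemMapsById = df.map { r =>
  (r.getAs[String]("id"), (r.getAs[String]("itemName"), r.getAs[Int]("Value")))
}.aggregateByKey(Map.empty[String, Int])(
  (acc, kv) => acc + kv,   // fold each (itemName, value) pair into the partial map
  (m1, m2)  => m1 ++ m2    // merge partial maps from different partitions
).map { case (id, itemMap) => (id, JacksonUtil.toJson(Data(itemMap))) }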

SPARK: How to get day difference between a data frame column and timestamp in SCALA

How do I get the date difference (number of days in between) in a data frame in Scala?
I have a df: [id: string, itemName: string, eventTimeStamp: timestamp] and a startTime (a timestamp string). How do I get a column "Daydifference" with the number of days between startTime and eventTimeStamp?
My Code :
Initial df :
+----+----------+----------------------+
| id | itemName | eventTimeStamp       |
+----+----------+----------------------+
| 1  | TV       | 2016-09-19T00:00:00Z |
| 1  | Movie    | 2016-09-19T00:00:00Z |
| 1  | TV       | 2016-09-26T00:00:00Z |
| 2  | TV       | 2016-09-18T00:00:00Z |
+----+----------+----------------------+
I need to get most recent eventTimeStamp based on id and itemName, so I did:
val result = df.groupBy("id", "itemName").agg(max("eventTimeStamp") as "mostRecent")
+----+----------+----------------------+
| id | itemName | mostRecent           |
+----+----------+----------------------+
| 1  | TV       | 2016-09-26T00:00:00Z |
| 1  | Movie    | 2016-09-19T00:00:00Z |
| 2  | TV       | 2016-09-26T00:00:00Z |
+----+----------+----------------------+
Now I need to get the date difference between mostRecent and startTime (2016-09-29T00:00:00Z) , so that I can get :
{ id : 1, {"itemMap" : {"TV" : 3, "Movie" : 10 }} }
{ id : 2, {"itemMap" : {"TV" : 3}} }
I tried like this :
val startTime = "2016-09-26T00:00:00Z"
val result = df.groupBy("id", "itemName").agg(datediff(startTime, max("eventTimeStamp")) as Daydifference)
case class Data(itemMap: Map[String, Long]) extends Serializable

result.map { r =>
  val id = r.getAs[String]("id")
  val itemName = r.getAs[String]("itemName")
  val dayDifference = r.getAs[Long]("Daydifference")
  (id, Map(itemName -> dayDifference))
}.reduceByKey((x, y) => x ++ y).map {
  case (k, v) => (k, JacksonUtil.toJson(Data(v)))
}
But I am getting an error on datediff. Can someone tell me how I can achieve this?
When you want to use a constant ("literal") value as a Column in a DataFrame, you should use the lit(...) function. The other error here is using a plain String as the start date; to compare it to a timestamp column you can use a java.sql.Date:
val startTime = java.sql.Date.valueOf("2016-09-26")   // avoids the deprecated Date(year, month, day) constructor
val result = df.groupBy("id", "itemName")
  .agg(datediff(lit(startTime), max("eventTimeStamp")) as "Daydifference")
result.show()
// +---+--------+-------------+
// | id|itemName|Daydifference|
// +---+--------+-------------+
// | 1| Movie| 7|
// | 1| TV| 0|
// | 2| TV| 0|
// +---+--------+-------------+

How to rank the data set having multiple columns in Scala?

I have a data set like this, which I am fetching from a CSV file, but how do I store it in Scala to do the processing?
+--------+------+---------+
| recent | Freq | Monitor |
+--------+------+---------+
| 1      | 1234 | 199090  |
| 4      | 2553 | 198613  |
| 6      | 3232 | 199090  |
| 1      | 8823 | 498831  |
| 7      | 2902 | 890000  |
| 8      | 7991 | 081097  |
| 9      | 7391 | 432370  |
| 12     | 6138 | 864981  |
| 7      | 6812 | 749821  |
+--------+------+---------+
Actually I need to sort the data and rank it.
I am new to Scala programming.
Thanks
Answering your question, here is the solution: this code reads a CSV and orders it by the third column.
object CSVDemo extends App {
  println("recent, freq, monitor")
  val bufferedSource = io.Source.fromFile("./data.csv")
  val list: Array[Array[String]] = bufferedSource.getLines.map(line => line.split(",").map(_.trim)).toArray
  val newList = list.sortBy(_(2))
  newList foreach { line => println(line.mkString(" ")) }
  bufferedSource.close
}
You read the file and parse it into an Array[Array[String]], then you order by the third column and print it.
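The question also asks about ranking; one way to attach a rank after sorting (my own sketch, not part of the original answer; it assumes the CSV has no header row and that a numeric sort on the third column is wanted) is zipWithIndex:
// sort numerically by the Monitor column and attach a 1-based rank to each row
val ranked = list.sortBy(_(2).toInt)
  .zipWithIndex
  .map { case (row, i) => (i + 1, row) }
ranked.foreach { case (rank, row) => println(s"$rank: ${row.mkString(" ")}") }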
Here I am using the list and trying to normalize one column at a time and then concatenate them. Is there any other way to iterate column-wise and normalize them? Sorry, my coding is very basic.
val col1 = newList.map(line => line.head.toDouble)
val mi = col1.min
val ma = col1.max
println("minimum value of first column is " + mi)
println("maximum value of first column is " + ma)
// min-max normalize the first column to the [0, 1] range
val scale = col1.map(x => (x - mi) / (ma - mi))
println("Here is the normalized range of the first column of the data")
scale.foreach(println)
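To normalize every column without repeating that block per column, one possible approach (a sketch, assuming all fields are numeric) is to transpose the rows, normalize each resulting column, and transpose back:
// min-max normalize all columns at once by working column-wise via transpose
val numeric = newList.map(_.map(_.toDouble))
val normalizedCols = numeric.transpose.map { col =>
  val mi = col.min
  val ma = col.max
  if (ma == mi) col.map(_ => 0.0)          // avoid division by zero for constant columns
  else col.map(x => (x - mi) / (ma - mi))
}
// back to row-major order
val normalizedRows = normalizedCols.transpose
normalizedRows.foreach(row => println(row.mkString(" ")))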