Spark Scala sum of values by unique key

If I have key/value pairs that comprise an item (key) and its sales (value):
bolt 45
bolt 5
drill 1
drill 1
screw 1
screw 2
screw 3
So I want to obtain an RDD where each element is the sum of the values for every unique key:
bolt 50
drill 2
screw 6
My current code looks like this:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")
val pairs = salesRDD.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect().foreach(println)
But my results come out like this:
(bolt 5,1)
(drill 1,2)
(bolt 45,1)
(screw 2,1)
(screw 3,1)
(screw 1,1)
How should I edit my code to get the above result?

Java way, hope you can convert this to Scala. Looks like you just need a groupBy and a sum (a plain count would only give the number of rows per item, not the total sales). Assuming the data is loaded as a DataFrame with columns name and sales:
salesRDD.groupBy(salesRDD.col("name")).sum("sales");
+-----+----------+
| name|sum(sales)|
+-----+----------+
| bolt|        50|
|drill|         2|
|screw|         6|
+-----+----------+
Also, please use Datasets and DataFrames rather than RDDs. You will find them a lot handier.
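If you do want to stay with the RDD code from the question, a minimal Scala sketch of the fix (assuming each line of sales.txt is "<item> <sales>" separated by whitespace) is to key by the item alone and keep the numeric sales as the value, so that reduceByKey sums per item:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")
val pairs = salesRDD.map { line =>
  val fields = line.split("\\s+")
  (fields(0), fields(1).toInt) // key = item, value = sales amount
}
val counts = pairs.reduceByKey((a, b) => a + b) // sums the sales per unique item
counts.collect().foreach(println) // (bolt,50), (drill,2), (screw,6)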

Related

How to apply custom logic inside an aggregate function

I'm currently learning Spark and let's say we have the following DataFrame
+-------+--------+
|user_id|activity|
+-------+--------+
|      1|   liked|
|      2| comment|
|      1|   liked|
|      1|   liked|
|      1| comment|
|      2|   liked|
+-------+--------+
Each type of activity has its own weight, which is used to calculate the score:
+--------+------+
|activity|weight|
+--------+------+
|   liked|     1|
| comment|     3|
+--------+------+
And this is the desired output:
+-------+-----+
|user_id|score|
+-------+-----+
|      1|    6|
|      2|    4|
+-------+-----+
The calculation of the score involves counting how many times each activity occurred and multiplying by its weight. For instance, user 1 performs 3 likes and a comment, so the score is given by
(3 * 1) + (1 * 3)
How do we do this calculation in Spark?
My initial attempt is below
val df1 = evidenceDF
  .groupBy("user_id")
  .agg(collect_set("event") as "event_ids")
but I got stuck on the mapping portion. What I want to achieve is: after I aggregate the events into the event_ids field, I'll split them and do the calculation in a map function, but I'm having difficulty moving further.
I looked into using a custom aggregator function, but it sounds complicated. Is there a straightforward way to do this?
You can join with the weights dataframe, then group by user_id and sum the weights:
val df1 = evidenceDF.join(df_weight, Seq("activity"))
  .groupBy("user_id")
  .agg(
    sum(col("weight")).as("score")
  )
df1.show
//+-------+-----+
//|user_id|score|
//+-------+-----+
//|      1|    6|
//|      2|    4|
//+-------+-----+
Or, if you really only have these 2 categories, use a when expression directly inside the sum:
val df1 = evidenceDF.groupBy("user_id")
  .agg(
    sum(
      when(col("activity") === "liked", 1)
        .when(col("activity") === "comment", 3)
    ).as("score")
  )
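If you still want to finish your original collect-then-map idea, here is a sketch (using the activity column as above and a hard-coded weight map as an assumption); note that collect_set deduplicates repeated activities, so collect_list is needed to keep them:
import org.apache.spark.sql.functions.{collect_list, col, udf}
val weights = Map("liked" -> 1, "comment" -> 3) // assumed weights
val score = udf((activities: Seq[String]) => activities.map(a => weights.getOrElse(a, 0)).sum)
val df2 = evidenceDF
  .groupBy("user_id")
  .agg(collect_list(col("activity")) as "activities") // keeps duplicates, unlike collect_set
  .select(col("user_id"), score(col("activities")).as("score"))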

How to merge two or more columns into one?

I have a streaming Dataframe that I want to calculate min and avg over some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe looks like this:
+-----+-----+
|    1|    2|
+-----+-----+
|   24|   55|
|   20|   51|
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output, after applying the min and avg mathematical operations, to be:
+-----------+-----------+
|  result(1)|  result(2)|
+-----------+-----------+
|     20, 22|     51, 53|
+-----------+-----------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct shows when you want to reference one field in the struct: you can use its name (not an index).
q.select("structCol.name")
What you want to do is merge the values of multiple columns into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you :
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+
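Plugging the array version back into the aggregation from the question would look like this (a sketch reusing processedDf, xyz and timestamp from the question, and assuming a SparkSession named spark for the implicits):
import spark.implicits._
import org.apache.spark.sql.functions.{array, avg, col, min, window}
val res = List("1", "2").map(name => array(min(col(name)), avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf
  .groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)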

Spark Scala GroupBy column and sum values

I am a newbie in Apache Spark and recently started coding in Scala.
I have an RDD with 4 columns that looks like this:
(Columns 1 - name, 2- title, 3- views, 4 - size)
aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa.b User:5.34.97.97 1 4749
aa.b User:80.63.79.2 1 4751
af Blowback 2 16896
af Bluff 2 21442
en Huntingtown,_Maryland 1 0
I want to group based on Column Name and get the sum of Column views.
It should be like this:
aa 3
aa.b 2
af 2
en 1
I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.
This should work: you read the text file, split each line by the separator, map to key/value pairs with the appropriate fields, and use countByKey:
sc.textFile("path to the text file")
  .map(x => x.split(" ", -1))
  .map(x => (x(0), x(3)))
  .countByKey
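Note that countByKey only counts the rows per key; if what you actually want is the sum of the views column (the third field, x(2) after the split), a small variation with reduceByKey does it, assuming the file is space-separated:
sc.textFile("path to the text file")
  .map(x => x.split(" ", -1))
  .map(x => (x(0), x(2).toInt)) // x(2) is the views column
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)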
To complete my answer, you can also approach the problem using the DataFrame API (if that is possible for you, depending on your Spark version). Example:
val result = df.groupBy("column to Group on").agg(count("column to count on"))
Another possibility is to use the SQL approach:
val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = spark.sql("select <col to group on>, count(<col to count on>) from temp_table group by <col to group on>")
I assume that you already have your RDD populated.
// For simplicity, I build the data this way
val data = Seq(("aa", "File:Sleeping_lion.jpg", 1, 8030),
  ("aa", "Main_Page", 1, 78261),
  ("aa", "Special:Statistics", 1, 20493),
  ("aa.b", "User:5.34.97.97", 1, 4749),
  ("aa.b", "User:80.63.79.2", 1, 4751),
  ("af", "Blowback", 2, 16896),
  ("af", "Bluff", 2, 21442),
  ("en", "Huntingtown,_Maryland", 1, 0))
val rdd = sc.parallelize(data) // the RDD used in the RDD approaches below
Dataframe approach
val sql = new SQLContext(sc)
import sql.implicits._
import org.apache.spark.sql.functions._
val df = data.toDF("name", "title", "views", "size")
df.groupBy($"name").agg(count($"name") as "") show
Result:
+----+-----+
|name|count|
+----+-----+
|  aa|    3|
|  af|    2|
|aa.b|    2|
|  en|    1|
+----+-----+
RDD Approach (countByKey(...))
rdd.keyBy(f => f._1).countByKey().foreach(println(_))
RDD Approach (reduceByKey(...))
rdd.map(f => (f._1, 1)).reduceByKey((accum, curr) => accum + curr).foreach(println(_))
If any of this does not solve your problem, please share exactly where you are stuck.

Traversing RDD Key Value pairs when having several Values

I'm currently new to Spark and I'm using Scala.
I'm having some trouble traversing an RDD of key/value pairs.
I've got a TSV file, file1, with (among other things) country name, latitude and longitude, and so far I have:
val a = file1.map(_.split("\t")).map(rec => (rec(1), (rec(11).toDouble, rec(12).toDouble)))
Here rec(1) is the country name, rec(11) is the longitude, and rec(12) is the latitude.
As far as I understand, a is now an RDD of key-value pairs, with rec(1) as the key and (rec(11), rec(12)) as the value.
I have managed to test that a.first._1 gives the first Key
a.first._2._1 gives the first value for the key.
a.first._2._2 gives the second value for the key.
My goal is to at least manage to get the average of all the rec(11) values with the same key, and the same for rec(12). So my thought was to sum them all and then divide by the number of key-value pairs with that key.
Could someone help me with what I should do next? I tried map, flatMapValues, mapValues, groupByKey and so on, but I can't seem to sum the rec(11)'s and rec(12)'s at the same time.
You can do it with a DataFrame, using groupBy and then the agg operation with avg.
Here is a quick example:
Original DF:
+------------+-----+
|country code|pairs|
+------------+-----+
| ES|[1,2]|
| UK|[2,3]|
| ES|[4,5]|
+------------+-----+
Performing the operation:
df.groupBy($"country code").agg(avg($"pairs._1"), avg($"pairs._2"))
Result:
+------------+-------------+-------------+
|country code|avg(pairs._1)|avg(pairs._2)|
+------------+-------------+-------------+
| ES| 2.5| 3.5|
| UK| 2.0| 3.0|
+------------+-------------+-------------+
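To get from the RDD a in the question to a dataframe like the one above, a minimal sketch (assuming a SparkSession named spark) would be:
import spark.implicits._
import org.apache.spark.sql.functions.avg
val df = a.map { case (country, (lon, lat)) => (country, lon, lat) }
  .toDF("country code", "longitude", "latitude")
df.groupBy($"country code")
  .agg(avg($"longitude"), avg($"latitude"))
  .show()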
My goal is to at least manage to get the average of all the rec(11) with the same key, and the same with rec(12)
You can proceed as below (commented for clarity)
a.mapValues(x => (x, 1)) // add a counter to the values: (k, (v1, v2)) becomes (k, ((v1, v2), 1))
  .reduceByKey{ case (x, y) => ((x._1._1 + y._1._1, x._1._2 + y._1._2), x._2 + y._2) } // per key, sum all the v1's, all the v2's, and the counters
  .map{ case (k, v) => (k, (v._1._1 / v._2, v._1._2 / v._2)) } // averages: divide the sum of v1 and the sum of v2 by the counter
this is all explained in https://stackoverflow.com/a/49166009/5880706

Summing the column values of all the rows in a Dataframe - scala/spark

I am new to Scala/Spark. I am working on a Scala/Java application on Spark, trying to read some data from a Hive table and then sum up all the column values for each row. For example, consider the following DF:
+--------+-+-+-+-+-+-+
| address|a|b|c|d|e|f|
+--------+-+-+-+-+-+-+
|Newyork |1|0|1|0|1|1|
| LA     |0|1|1|1|0|1|
|Chicago |1|1|0|0|1|1|
+--------+-+-+-+-+-+-+
I want to sum up all the 1's in all the rows and get the total, i.e. the sum over all the columns of the above dataframe should be 12 (since there are twelve 1's across all the rows combined).
I tried doing this:
var count = 0
DF.foreach( x => {
  count = count + Integer.parseInt(x.getAs[String]("a")) + Integer.parseInt(x.getAs[String]("b")) +
    Integer.parseInt(x.getAs[String]("c")) + Integer.parseInt(x.getAs[String]("d")) +
    Integer.parseInt(x.getAs[String]("e")) + Integer.parseInt(x.getAs[String]("f"))
})
When I run the above code, the count value is still zero. I think this has something to do with running the application on a cluster; declaring a variable and adding to it doesn't work, since I have to run my application on a cluster. I also tried declaring a static variable in a separate Java class and adding to it, which gives me the same result.
As far as I know, there should be an easy way of achieving this using the built-in functions available in the Spark/Scala libraries.
What would be an efficient way of achieving this? Any help would be appreciated.
Thank you.
P.S: I am using Spark 1.6.
You can first sum the column values, which gives back a single-row dataframe of sums; then you can convert that Row to a Seq and sum the values up:
val sum_cols = df.columns.tail.map(x => sum(col(x)))
df.agg(sum_cols.head, sum_cols.tail: _*).first.toSeq.asInstanceOf[Seq[Long]].sum
// res9: Long = 12
df.agg(sum_cols.head, sum_cols.tail: _*).show
+------+------+------+------+------+------+
|sum(a)|sum(b)|sum(c)|sum(d)|sum(e)|sum(f)|
+------+------+------+------+------+------+
| 2| 2| 2| 1| 2| 3|
+------+------+------+------+------+------+
Here is an alternative approach:
first let's prepare a column expression that adds up all the value columns:
scala> val f = df.drop("address").columns.map(col).reduce((c1, c2) => c1 + c2)
f: org.apache.spark.sql.Column = (((((a + b) + c) + d) + e) + f)
get sum as a DataFrame:
scala> df.agg(sum(f).alias("total")).show
+-----+
|total|
+-----+
| 12|
+-----+
get sum as a Long number:
scala> df.agg(sum(f)).first.getLong(0)
res39: Long = 12
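For reference, the original foreach attempt always sees count == 0 because the closure runs on the executors with their own copies of the count variable, so the driver's copy is never updated. If you want something close in shape to that attempt, you can compute the per-row totals on the executors and reduce them back to the driver; a sketch, assuming the value columns are stored as strings as in the parseInt calls:
val total = df.drop("address")
  .rdd
  .map(row => (0 until row.length).map(i => row.getString(i).toInt).sum) // per-row sum
  .reduce(_ + _) // grand total, collected on the driver
// total: Int = 12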