Traversing RDD key-value pairs with several values - Scala

I'm currently new to Spark and I'm using Scala.
I'm having some trouble traversing an RDD of key-value pairs.
I've got a TSV file, file1, with (among other things) country name, latitude and longitude, and so far I have:
val a = file1.map(_.split("\t")).map(rec => (rec(1), (rec(11).toDouble, rec(12).toDouble)))
Here rec(1) is the country name, rec(11) is the longitude, and rec(12) is the latitude.
As far as I understand, a is now a key-value pair RDD with rec(1) as the key and (rec(11), rec(12)) as the value.
I have managed to test that:
a.first._1 gives the first key,
a.first._2._1 gives the first value for that key,
a.first._2._2 gives the second value for that key.
My goal is to at least manage to get the average of all the rec(11) values with the same key, and the same for rec(12). So my thought was to sum them all and then divide by the number of key-value pairs with that key.
Could someone help me with what I should do next? I tried map, flatMapValues, mapValues, groupByKey and so on, but I can't seem to manage to sum the rec(11)'s and rec(12)'s at the same time.

You can do it with a DataFrame using groupBy and then the agg operation with avg.
Here is a quick example:
Original DF:
+------------+-----+
|country code|pairs|
+------------+-----+
|          ES|[1,2]|
|          UK|[2,3]|
|          ES|[4,5]|
+------------+-----+
Performing the operation:
df.groupBy($"country code").agg(avg($"pairs._1"), avg($"pairs._2"))
Result:
+------------+-------------+-------------+
|country code|avg(pairs._1)|avg(pairs._2)|
+------------+-------------+-------------+
|          ES|          2.5|          3.5|
|          UK|          2.0|          3.0|
+------------+-------------+-------------+
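If you're starting from the pair RDD a built in the question, a minimal sketch (assuming a SparkSession named spark is in scope, and using assumed column names) would be to convert the RDD to a DataFrame first and then apply the same groupBy/agg:
import spark.implicits._
import org.apache.spark.sql.functions.avg

// a: RDD[(String, (Double, Double))] from the question
val df = a.toDF("country code", "pairs")
val averages = df
  .groupBy($"country code")
  .agg(avg($"pairs._1") as "avg_longitude", avg($"pairs._2") as "avg_latitude")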

My goal is to at least manage to get the average of all the rec(11) with the same key, and the same with rec(12)
You can proceed as below (commented for clarity):
a.mapValues(x => (x, 1)) // add a counter to the values, turning (k, (v1, v2)) into (k, ((v1, v2), 1))
 .reduceByKey { case (x, y) => ((x._1._1 + y._1._1, x._1._2 + y._1._2), x._2 + y._2) } // for each key, sum all the v1's, all the v2's and the counter
 .map { case (x, y) => (x, (y._1._1 / y._2, y._1._2 / y._2)) } // compute the average, i.e. divide the sums of v1 and v2 by the counter
This is all explained in https://stackoverflow.com/a/49166009/5880706.
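For clarity, here is the same chain with named fields instead of the _1/_2 accessors, starting from the pair RDD a built in the question (a sketch; the logic is identical to the snippet above):
val averages = a // RDD[(String, (Double, Double))]
  .mapValues { case (lon, lat) => ((lon, lat), 1) }
  .reduceByKey { case (((lon1, lat1), n1), ((lon2, lat2), n2)) =>
    ((lon1 + lon2, lat1 + lat2), n1 + n2)
  }
  .mapValues { case ((lonSum, latSum), n) => (lonSum / n, latSum / n) }
// averages: RDD[(String, (Double, Double))] of (country, (average longitude, average latitude))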

Related

Spark assigning data according to a percentage

Let's say I have multiple customer service tiers (Premium, Basic and Free), each of which has several dedicated support teams:
Premium => Purple, Blue, Green & Yellow
Basic => Red, White & Black
Free => Orange & Pink
Every client has a specific customer service tier.
What I'm trying to achieve is to dedicate a specific support team to each of my clients given their customer service tier. Also, I want the teams of a particular customer service tier to have more or less the same number of clients to handle.
When there are fewer clients than teams assigned to a particular customer support tier, I don't really care which teams get assigned. I just want each team to handle approximately the same number of clients.
In this example, my base Dataset looks like:
Some possible outputs:
I can't really figure out a way to do this with Spark. Can anyone help me out?
Okay, so let's solve this step by step. If I were to do this, I would first create a Map which maps each group to its possible number of values, to avoid re-computation for each row, something like this:
// groupsDF is the single-column df in your question
val groupsAvailableCount: Map[Int, Long] =
  groupsDF
    .groupBy("group")
    .count
    .as[(Int, Long)]
    .collect.toMap
// result would be: Map(1 -> 2, 2 -> 3, 3 -> 4)
Now the second part is a bit tricky, because as far as you explained it, the probability of each value in your groups is the same (like in group 1, where all values have a probability of 0.25), while the probabilities might not be the same in your actual problem. In any case, this is a permutation with probabilities, and you can easily decide how to sort things in your problem. The good thing about this second part is that it abstracts all these permutation-with-probability concerns away into a single function, which you can change as you wish, and the rest of your code will be immune to the changes:
def getWithProbabilities(groupId: Int, probs: Map[String, Double]): List[String] = {
  def getValues(groupId: Int, probabilities: List[(String, Double)]): List[String] = {
    // here is the logic; you can change the sorting and other details as you want
    val take = groupsAvailableCount.getOrElse(groupId, 0L).toInt
    if (take > probabilities.length) getValues(groupId, probabilities ::: probabilities)
    else probabilities.sortBy(_._2).take(take).map(_._1)
  }
  getValues(groupId, probs.toList)
}
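For reference, spec here (and further down) is the group-to-probabilities mapping that comes from the question; a purely hypothetical shape, consistent with the output shown below, could be:
// Hypothetical example only; the real probabilities come from the question's spec
val spec: Map[Int, Map[String, Double]] = Map(
  1 -> Map("value_one_x" -> 0.25, "value_two_x" -> 0.25),
  2 -> Map("value_one_y" -> 0.3, "value_two_y" -> 0.3, "value_three_y" -> 0.4),
  3 -> Map("value_one_z" -> 0.5, "value_two_z" -> 0.5)
)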
So now, thanks to that function abstraction, you're able to get the values you want from each group based on the spec! Almost done: now you only need the distinct groupIds, and given those groups, you can fetch the values for them and create your rows:
groupsDF.distinct.as[Int] // distinct group ids
  .collect.toList // collect as a Scala list, to do the mappings
  .map(groupId => groupId -> getWithProbabilities(groupId, spec(groupId))) // calculate the values for each group, based on the spec map
  .flatMap { case (groupId, values) => values.map(groupId -> _) } // flatten the results to create one row per value
  .toDF("group", "value") // create your dataframe out of it
And the result would be:
+-----+-------------+
|group|        value|
+-----+-------------+
|    1|  value_one_x|
|    1|  value_two_x|
|    2|  value_one_y|
|    2|  value_two_y|
|    2|value_three_y|
|    3|  value_one_z|
|    3|  value_two_z|
|    3|  value_one_z|
|    3|  value_two_z|
+-----+-------------+
Update:
To use the Spark API and avoid collect, you can use a UDF which basically calls the previous function we had:
val getWithProbabilitiesUDF = udf { (groupId: Int) =>
  getWithProbabilities(groupId, spec(groupId))
}
At the end, just call it on the groups dataframe you have:
groupsDF
  .distinct
  .select(
    col("group"),
    explode(getWithProbabilitiesUDF(col("group"))) as "value"
  )

Group by and save the max value with overlapping columns in scala spark

I have data that looks like this:
id,start,expiration,customerid,content
1,13494,17358,0001,whateveriwanthere
2,14830,28432,0001,somethingelsewoo
3,11943,19435,0001,yes
4,39271,40231,0002,makingfakedata
5,01321,02143,0002,morefakedata
In the data above, I want to group by customerid for overlapping start and expiration (essentially just merging intervals). I am doing this successfully by grouping by the customer id, then aggregating with first("start") and max("expiration").
df.groupBy("customerid").agg(first("start"), max("expiration"))
However, this drops the id column entirely. I want to keep the id of the row that had the max expiration. For instance, I want my output to look like this:
id,start,expiration,customerid
2,11943,28432,0001
4,39271,40231,0002
5,01321,02143,0002
I am not sure how to add that id column for whichever row had the maximum expiration.
You can use a cumulative conditional sum along with the lag function to define a group column that flags rows that overlap. Then, simply group by customerid + group and get the min start and max expiration. To get the id value associated with the max expiration, you can use this trick with struct ordering:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("customerid").orderBy("start")

val result = df.withColumn(
  "group",
  sum(
    when(
      col("start").between(lag("start", 1).over(w), lag("expiration", 1).over(w)),
      0
    ).otherwise(1)
  ).over(w)
).groupBy("customerid", "group").agg(
  min(col("start")).as("start"),
  max(struct(col("expiration"), col("id"))).as("max")
).select("max.id", "customerid", "start", "max.expiration")
result.show
//+---+----------+-----+----------+
//| id|customerid|start|expiration|
//+---+----------+-----+----------+
//|  5|      0002|01321|     02143|
//|  4|      0002|39271|     40231|
//|  2|      0001|11943|     28432|
//+---+----------+-----+----------+
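The max(struct(...)) part works because Spark orders structs field by field, so the row with the greatest expiration wins and its id is carried along. A tiny standalone illustration of the trick (toy data, assuming spark.implicits._ is in scope):
import org.apache.spark.sql.functions.{col, max, struct}

Seq((1, 10), (2, 30), (3, 20)).toDF("id", "expiration")
  .agg(max(struct(col("expiration"), col("id"))).as("m"))
  .select("m.id", "m.expiration")
  .show
//+---+----------+
//| id|expiration|
//+---+----------+
//|  2|        30|
//+---+----------+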

Spark Scala sum of values by unique key

If I have key-value pairs that comprise an item (key) and its sales (value):
bolt 45
bolt 5
drill 1
drill 1
screw 1
screw 2
screw 3
So I want to obtain an RDD where each element is the sum of the values for every unique key:
bolt 50
drill 2
screw 6
My current code looks like this:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")
val pairs = salesRDD.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect().foreach(println)
But my results look like this:
(bolt 5,1)
(drill 1,2)
(bolt 45,1)
(screw 2,1)
(screw 3,1)
(screw 1,1)
How should I edit my code to get the above result?
Here's the Java way; hopefully you can convert this to Scala. It looks like you just need a groupBy and a count:
salesRDD.groupBy(salesRDD.col("name")).count();
+-----+-----+
| name|count|
+-----+-----+
| bolt|   50|
|drill|    2|
|screw|    6|
+-----+-----+
Also, please use Datasets and DataFrames rather than RDDs. You will find them a lot handier.
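For completeness, if you want to stay with the RDD API, a minimal sketch of the fix (assuming the lines are whitespace-delimited item/value pairs as shown) is to split each line into the item and its numeric sales value before reducing:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")

val totals = salesRDD
  .map { line =>
    val Array(item, sales) = line.trim.split("\\s+") // "bolt 45" -> ("bolt", 45)
    (item, sales.toInt)
  }
  .reduceByKey(_ + _) // sum the sales for each unique item

totals.collect().foreach(println)
// (bolt,50)
// (drill,2)
// (screw,6)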

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
|     Place|Population|    IndexA|
+----------+----------+----------+
|         A|       Int|       X_A|
|         B|       Int|       X_B|
|         C|       Int|       X_C|
+----------+----------+----------+
DF2:
+----------+----------+
|      City|    IndexB|
+----------+----------+
|         D|       X_D|
|         E|       X_E|
|         F|       X_F|
|      ....|      ....|
|        ZZ|      X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
I want to determine which City (DF2) is the shortest distance away from every Place in DF1. The distance can be calculated based on the index. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distance based on a calculation with the indexes. For the distance calculation, there is a function defined:
val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I'm doing something wrong when iterating over each row in DF2 while taking one row from DF1, but I couldn't find a solution.
What am I doing wrong? And am I heading in the right direction?
You are getting this error because the index column you are using only exists in DF2, not in DF1 where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
Cross join DF1 and DF2 to have every index of DF1 matching every index of DF2
Determine the distance using your udf
Find the min on this new cross-joined dataframe with the distances
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}

val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })

val resultDF = DF1.crossJoin(DF2)
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  // Instead of using a groupBy and then matching the min distance of the aggregation with the
  // initial df, I've chosen to use the window function min to determine the min_distance of each
  // group (determined by Place) and filter for the city with the min distance to each place.
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")
This will result in a dataframe with columns from both dataframes and an additional distance column.
NB. Your current approach, which compares every item in one df to every item in another df, is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns which may indicate that a place is closer to a city), this is recommended.
Let me know if this works for you.
If you have only a few cities (fewer than or around 1000), you can avoid the crossJoin and Window shuffles by collecting the cities into an array and then performing the distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}

val citiesIndexes = df2.select("City", "IndexB")
  .collect()
  .map(row => (row.getString(0), row.getLong(1)))

val result = df1.withColumn(
  "City",
  array_min(
    transform(
      typedLit(citiesIndexes),
      x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
    )
  ).getItem("col2")
)
This piece of code works for Spark 3 and greater. If you are on a Spark version earlier than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function.
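For example, a minimal sketch of such a UDF (hypothetical name, reusing the citiesIndexes array collected above and the h3Distance call from the question) could be:
import org.apache.spark.sql.functions.{col, udf}

// Pick the closest city using the driver-collected array; works on Spark < 3.0 as well
val closestCity = udf { (indexA: Long) =>
  citiesIndexes.minBy { case (_, indexB) => h3.instance.h3Distance(indexA, indexB) }._1
}

val result = df1.withColumn("City", closestCity(col("IndexA")))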

Converting a DataFrame to a Scala mutable Map doesn't produce an equal number of records

I am new to Scala/Spark. I am working on a Scala/Spark application that selects a couple of columns from a Hive table and then converts them into a mutable Map, with the first column being the keys and the second column being the values. For example:
+--------+--+
|      c1|c2|
+--------+--+
| Newyork| 1|
|      LA| 0|
| Chicago| 1|
+--------+--+
will be converted to scala.collection.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the Dataframes to collections and perform some comparisons. I am also picking up data from other tables, but those are just single columns, and I have no problem converting them to ArrayBuffer[String]; the counts match.
I don't understand why I am having a problem with testMap. Generally, the row count of the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records in the DF into the Map?
Any help would be appreciated. Thank you.
I believe the mismatch in counts is caused by the elimination of duplicate keys (i.e. city names) in the Map. By design, a Map maintains unique keys by removing all duplicates. For example:
val testDF = Seq(
  ("Newyork", 1),
  ("LA", 0),
  ("Chicago", 1),
  ("Newyork", 99)
).toDF("city", "value")

val testMap = scala.collection.mutable.Map(
  testDF.rdd.map(r => (r(0).toString, r(1).toString)).
    collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
//   Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field in your Map key to make it unique. Depending on your data processing needs, you can also aggregate data into a Map-like dataframe via groupBy, like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the total of valueCount should match the original row count.
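As an aside, if the goal is simply to keep every row on the driver, collecting into a sequence of pairs instead of a Map avoids the key deduplication entirely (a minimal sketch):
// Duplicate keys are preserved, so testPairs.size matches testDF.count
val testPairs: Seq[(String, String)] =
  testDF.rdd.map(r => (r(0).toString, r(1).toString)).collect().toSeq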
If you add entries with duplicate keys to your map, the duplicates are automatically removed. So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)