I'm fairly new to Spark with Scala and have just started learning it on my own, so please bear with me. I am running it in an Oracle virtual machine.
Here is my code:
val dataLines = sc.textFile("Data/client_jobs.csv")
val data = dataLines.map(_.split(";"))
val values = data.map(array => (array(0), array(1)))
I can fetch the data using spark-shell and get it in an array-of-arrays format like this:
data: Array[Array[String]] = Array(Array("c1", 20), Array("c2", 102), Array("c3", 50),
  Array("c4", 80), Array("c5", 140), Array("c6", 2036), Array("c7", 568))
As you can see from the code, I have also mapped it, but instead of giving me output like this:
Array(("c1",20), ("c2",102), ("c3",50)...)
it gives me:
MapPartitionsRDD[3] at map at code1.scala:14
From the array of arrays (or from the plain array) I need to fetch the data and produce output like the one shown below:
(below100, 3)
(100 to 150, 2)
(above150, 2)
Basically, it just counts the jobs that fall within each range.
I know very little about Spark-Scala, so any kind of help is appreciated.
Thanks in advance.
If you have an RDD of tuples, you can map each value to its category and then reduce by that category:
val data = Array(("c1", 20),
("c2", 102),
("c3", 50),
("c4", 80),
("c5", 140),
("c6", 2036),
("c7", 568))
val rdd = sparkContext.parallelize(data)
rdd.map(tuple =>
  (
    if (tuple._2 < 100) "below100"
    else if (tuple._2 >= 100 && tuple._2 <= 150) "100 to 150"
    else "above150",
    1
  )
).reduceByKey(_ + _)
Result is:
(below100,3)
(above150,2)
(100 to 150,2)
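Since reduceByKey only defines another RDD, the pairs above are printed by running an action on the result. A minimal sketch, assuming the expression above is bound to a val (here called counts):

val counts = rdd.map(tuple =>
  (
    if (tuple._2 < 100) "below100"
    else if (tuple._2 >= 100 && tuple._2 <= 150) "100 to 150"
    else "above150",
    1
  )
).reduceByKey(_ + _)

// collect() is an action: it triggers the computation and brings the
// (range, count) pairs back to the driver so they can be printed
counts.collect().foreach(println)

This is also why the mapped RDD in the question shows up as MapPartitionsRDD[3] at map: transformations such as map are lazy, and the data is only materialized when an action like collect or take is called.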
This is in continuation of the question: Combine value part of Tuple2 which is a map, into single map grouping by the key of Tuple2
I am now able to reduce the rows by using reduceByKey.
But now, in the final DataFrame...
e.g.
(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900, 6-> 111, 7-> 222, 8-> 333, 12-> 444, 13->555, 19->666})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
...some rows have a very large map in the map column, e.g. row B above.
I am trying to write the DataFrame to Azure Cosmos DB (Core SQL API). Each row of the above DataFrame becomes one Cosmos DB document. The issue is that if a row is larger than 2 MB, Cosmos DB rejects the request.
Question: I want to split rows with huge map columns into multiple rows (so that each becomes less than 2 MB in size). Duplicate values in the key column are not an issue.
The final result could be (if I split the map whenever it has more than 5 elements):
(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900})
(B, {6-> 111, 7-> 222, 8-> 333, 12-> 444, 13->555})
(B, {19->666})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
You may ask: in the previous question the data was already split, so why did I merge it? The reason is that in the previous question, without reduceByKey, B might have had 1000 rows, whereas in the end I only need about 20 rows, as shown above. One row would have been ideal, but due to the Cosmos DB limit I have to create multiple documents (each less than 2 MB).
I hope this is clear. Please let me know if any clarification is required.
I was able to solve this by writing my own custom code as below:
import scala.collection.mutable.ListBuffer

originalDF.rdd.reduceByKey((a, b) => a ++ b).map(row => {
  // Index every map entry so the entries can be sliced into fixed-size chunks
  val indexedMapEntries: Map[Int, (String, String)] =
    row._2.zipWithIndex.map(mapWithIndex => (mapWithIndex._2, mapWithIndex._1))
  var min = 0
  var max = Math.min(indexedMapEntries.size - 1, 9999)
  var proceed = true
  val rowKeyIdToAttributesMapList = new ListBuffer[(String, Map[String, String])]()
  while (proceed) {
    // Build the map holding the entries of the current chunk
    var tempMapToHoldEntries = Map[String, String]()
    var i = min
    while (i <= max) {
      val entry: (String, String) = indexedMapEntries(i)
      tempMapToHoldEntries += entry
      i = i + 1
    }
    // Emit one (rowKeyId, chunk) pair per chunk, keeping the same rowKeyId
    rowKeyIdToAttributesMapList += ((row._1, tempMapToHoldEntries))
    // Advance the window by the chunk size of 10000 entries
    min = max + 1
    max = Math.min(indexedMapEntries.size - 1, max + 10000)
    if (min > (indexedMapEntries.size - 1))
      proceed = false
  }
  rowKeyIdToAttributesMapList.toList
}).flatMap(x => x).toDF("rowKeyId", "attributes")
Here, originalDF is the one from my previous question (see the OP). 10000 is the maximum number of map entries per row for a rowKeyId. If the map size exceeds 10000, I create a new row with the same rowKeyId and the remaining attributes, in a loop.
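For reference, a shorter sketch of the same splitting step using Scala's grouped on the map entries; it assumes, as above, that originalDF.rdd yields (rowKeyId, Map[String, String]) pairs and that spark.implicits._ is in scope for toDF:

// Sketch only: maxEntriesPerRow is the assumed chunk size (10000, as in the code above)
val maxEntriesPerRow = 10000

val splitDF = originalDF.rdd
  .reduceByKey(_ ++ _)
  .flatMap { case (rowKeyId, attributes) =>
    // grouped slices the entries into chunks of at most maxEntriesPerRow each
    attributes.toSeq.grouped(maxEntriesPerRow).map(chunk => (rowKeyId, chunk.toMap))
  }
  .toDF("rowKeyId", "attributes")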
I have an RDD that I am trying to filter so that it contains only float values. Do Spark RDDs provide any way of doing this?
I have a CSV from which I need only the float values greater than 40 in a new RDD. To achieve this, I am checking whether each value is an instance of Float and filtering on that. When I filter with a !, all the strings are still there in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try/catch block.
Since you have a plain string and you are trying to get float values out of it, you are not actually filtering by type, but by whether the values can be parsed as floats.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either perform another filter afterwards, or filter the inner Option.
(Both should perform more or less the same thanks to Spark's laziness, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
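Applied to the airports RDD from the question, a sketch along the same lines (it assumes, as in the original code, that the value to parse is in column index 6; the result name is made up here):

import scala.util.Try

// Keep only the rows whose 7th field parses as a Float greater than 40;
// the result is an RDD[Float] containing just those values
val airportValuesOver40 = airports.flatMap(line =>
  Try(line.split(",")(6).toFloat).toOption.filter(_ > 40)
)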
Let me know if you have any questions.
I am learning Spark and do not have experience with Hadoop.
Problem
I am trying to calculate the sum and average in the same call to aggregateByKey.
Let me share what I have tried so far.
Setup the data
val categoryPrices = List((1, 20), (1, 25), (1, 10), (1, 45))
val categoryPricesRdd = sc.parallelize(categoryPrices)
Attempt to calculate the average in the same call to aggregateByKey. This does not work.
val zeroValue1 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
  aggregateByKey(zeroValue1)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      val newAverage = newSum / newCount
      (newCount, newSum, newAverage)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      // TRYING TO CALCULATE THE RUNNING AVERAGE HERE
      val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
      (newCount1, newSum1, newAverage1)
    }
  ).
  collect.
  foreach(println)
Result: Prints a different average each time
First time: (1,(4,100,70.0))
Second time: (1,(4,100,52.0))
Just do the sum first, and then calculate the average in a separate operation. This works.
val zeroValue2 = (0, 0) // (count, sum)
categoryPricesRdd.
  aggregateByKey(zeroValue2)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      (newCount, newSum)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      (newCount1, newSum1)
    }
  ).
  map(rec => {
    val category = rec._1
    val count = rec._2._1
    val sum = rec._2._2
    (category, count, sum, sum / count)
  }).
  collect.
  foreach(println)
Prints the same result every time:
(1,4,100,25)
I think I understand the difference between seqOp and combOp. Given that an operation can split data across multiple partitions on different servers, my understanding is that seqOp operates on the data within a single partition, and then combOp combines the results received from different partitions. Please correct me if this is wrong.
However, there is something very basic that I am not understanding. It looks like we can't calculate both the sum and the average in the same call. If this is true, please help me understand why.
The computation related to your average aggregation in seqOp:
val newAverage = newSum/newCount
and in combOp:
val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
is incorrect.
Let's say the first three elements are in one partition and the last element in another. Your seqOp would generate the (count, sum, average) tuples as follows:
Partition #1: [20, 25, 10]
--> (1, 20, 20/1)
--> (2, 45, 45/2)
--> (3, 55, 55/3)
Partition #2: [45]
--> (1, 45, 45/1)
Next, the cross-partition combOp would combine the 2 tuples from the two partitions to give:
((55 * 3) + (45 * 1)) / 4
// Result: 52
As you can see from the above steps, the average value could be different if the ordering of the RDD elements or the partitioning is different.
Your 2nd approach works because the average is, by definition, the total sum divided by the total count, and is therefore best calculated after the sum and count values have been computed.
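For completeness, if you do want both in a single aggregateByKey call, a minimal sketch (using the same categoryPricesRdd) is to always derive the average from the running sum and count, rather than combining previously computed averages:

// Sketch: carry (count, sum, average), recomputing the average from sum and count at each step
val zeroValue = (0, 0, 0.0)

categoryPricesRdd.
  aggregateByKey(zeroValue)(
    (acc, price) => {
      val count = acc._1 + 1
      val sum = acc._2 + price
      (count, sum, sum.toDouble / count) // seqOp: fold one price into the partition's partial result
    },
    (acc1, acc2) => {
      val count = acc1._1 + acc2._1
      val sum = acc1._2 + acc2._2
      (count, sum, sum.toDouble / count) // combOp: merge partial results from two partitions
    }
  ).
  collect.
  foreach(println)

Because the average is always recomputed from the combined sum and count, the final value no longer depends on partitioning or ordering; with the sample data it prints (1,(4,100,25.0)).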
I'm rather new to Spark and Scala and have a Java background. I have done some programming in Haskell, so I'm not completely new to functional programming.
I'm trying to accomplish some form of a nested for-loop. I have an RDD which I want to manipulate based on every two elements in the RDD. The pseudo code (Java-like) would look like this:
// some RDD named rdd is available before this
List list = new ArrayList();
for(int i = 0; i < rdd.length; i++){
    list.add(rdd.get(i)._1);
    for(int j = 0; j < rdd.length; j++){
        if(rdd.get(i)._1 == rdd.get(j)._1){
            list.add(rdd.get(j)._1);
        }
    }
}
// Then now let ._1 of the rdd be this list
My Scala solution (which does not work) looks like this:
val aggregatedTransactions = joinedTransactions.map( f => {
  var list = List[Any](f._2._1)
  val filtered = joinedTransactions.filter(t => f._1 == t._1)
  for(i <- filtered){
    list ::= i._2._1
  }
  (f._1, list, f._2._2)
})
I'm trying to put item ._2._1 into a list if ._1 of both items is equal.
I am aware that I cannot use a filter or map function within another map function. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list, or into any structure that can be used as a list.
How do you achieve an effect like this with RDDs?
Assuming your input has the form RDD[(A, (A, B))] for some types A, B, and that the expected result should have the form RDD[A] - not a List (because we want to keep data distributed) - this would seem to do what you need:
rdd.join(rdd.values).keys
Details:
It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope that it will help with your specific case.
For the full example, I'll assume:
Input RDD has type RDD[(Int, (Int, Int))]
Expected output has the form RDD[Int], and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key
If that's the case we're trying to solve - this join would solve it:
// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
(1, (1, 5)),
(1, (2, 5)),
(2, (1, 5)),
(3, (4, 5))
))
// joining the original RDD with an RDD of the "values" -
// so the joined RDD will have "._2._1" as key
// then we get the keys only, because they equal the values anyway
val result: RDD[Int] = rdd.join(rdd.values).keys
// result contains, for each occurrence of a key, one copy per matching ._2._1
// (the joined values equal the keys, so keeping the keys is enough)
println(result.collect.toList) // List(1, 1, 1, 1, 2)
I want to find countByValue for each column in my data. I can find countByValue() for each column (e.g. 2 columns for now) on a basic batch RDD as follows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
double.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
}))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to do the same with DStreams. I have a windowedDStream and want to run countByValue on each column. My data has 50 columns. I have done it as follows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index=> {
val counts = windowedDStream.map(x=> { val token = x.split(",")
(math.round(token(index).toDouble))
}).countByValue()
counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this; the only issue is that I want to apply more operations on counts, and it is not available outside the map.
You misunderstand what parallelize does. You think that when you give it a Seq of two elements, those two elements will be calculated in parallel. That is not the case, and it would be impossible for it to be the case.
What parallelize actually does is it creates an RDD from the Seq that you provided.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
})
Is equal to this:
val counts = (0 to 1).map { index =>
double.map { x =>
val token = x.split(",")
math.round(token(index).toDouble)
}.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed. parallelize cannot retroactively make it so that the calculation happened in parallel.
The solution to your problem is to not use parallelize. It is entirely pointless.
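As a sketch of what the DStream version could look like without parallelize (assuming, as in the question, that windowedDStream contains comma-separated lines with 50 numeric columns), you can emit a ((columnIndex, roundedValue), 1) pair per cell and count everything in one pass:

// For each line, emit one key per (column index, rounded value) and sum the 1s per key
val columnValueCounts = windowedDStream
  .flatMap { line =>
    line.split(",").zipWithIndex.map { case (token, index) =>
      ((index, math.round(token.toDouble)), 1L)
    }
  }
  .reduceByKey(_ + _)

// columnValueCounts is a DStream of ((columnIndex, value), count), so further
// transformations (e.g. filtering or sorting for top counts) can be chained on it
columnValueCounts.print()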