How to update a global variable inside RDD map operation - scala

I have an RDD[(Int, Array[Double])], and after creating it I call a method on a class:
val rdd = spark.sparkContext.parallelize(Seq(
(1, Array(2.0,5.0,6.3)),
(5, Array(1.0,3.3,9.5)),
(1, Array(5.0,4.2,3.1)),
(2, Array(9.6,6.3,2.3)),
(1, Array(8.5,2.5,1.2)),
(5, Array(6.0,2.4,7.8)),
(2, Array(7.8,9.1,4.2))
)
)
val new_class = new ABC
new_class.demo(rdd)
Inside the class, a field value = 0 is declared. Inside demo(), a local variable new_value = 0 is declared. Inside the map operation new_value gets updated, and the updated value is printed from within the map:
class ABC extends Serializable {
  var value = 0
  def demo(data_new: RDD[(Int, Array[Double])]): Unit = {
    var new_value = 0
    data_new.coalesce(1).map(x => {
      if (x._1 == 1)
        new_value = new_value + 1
      println(new_value)
      value = new_value
    }).count()
    println("Outside-->" + value)
  }
}
Output:
1
1
2
2
3
3
3
Outside-->0
How can I update the global variable value after the map operation?

I'm not sure exactly what you are trying to do, but you need to use Accumulators to perform the kind of operation where you add up values like this.
Here is an example:
scala> val rdd = spark.sparkContext.parallelize(Seq(
| (1, Array(2.0,5.0,6.3)),
| (5, Array(1.0,3.3,9.5)),
| (1, Array(5.0,4.2,3.1)),
| (2, Array(9.6,6.3,2.3)),
| (1, Array(8.5,2.5,1.2)),
| (5, Array(6.0,2.4,7.8)),
| (2, Array(7.8,9.1,4.2))
| )
| )
rdd: org.apache.spark.rdd.RDD[(Int, Array[Double])] = ParallelCollectionRDD[83] at parallelize at <console>:24
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 46181, name: Some(My Accumulator), value: 0)
scala> rdd.foreach { x => if(x._1 == 1) accum.add(1) }
scala> accum.value
res38: Long = 3
And as mentioned by @philantrovert, if you wish to count the number of occurrences of each key, you can do the following:
scala> rdd.mapValues(_ => 1L).reduceByKey(_ + _).take(3)
res41: Array[(Int, Long)] = Array((1,3), (2,2), (5,2))
You can also use countByKey, but it should be avoided with big datasets, since it returns all the counts to the driver as a local Map.
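For completeness, countByKey on the same rdd would look like the sketch below; note that it brings the entire result back to the driver:
// counts per key, collected to the driver as a local Map
val counts: scala.collection.Map[Int, Long] = rdd.countByKey()
// Map(1 -> 3, 2 -> 2, 5 -> 2) here, but risky when there are many distinct keys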

No, you can't change a driver-side variable from inside map: the closure is serialized and shipped to the executors, so each task updates its own copy.
If you are trying to count the number of records with key 1, then you can use filter:
val value = data_new.filter(x => (x._1 == 1)).count
println("Outside-->" +value)
Output:
Outside-->3
Also, it is not recommended to use mutable variables (var); prefer immutable values (val) wherever possible.
I hope this helps!

Alternatively, you can achieve the same result this way:
import org.apache.spark.rdd.RDD

class ABC extends Serializable {
  def demo(data_new: RDD[(Int, Array[Double])]): Array[(Int, Int)] = {
    data_new
      .filter(_._1 == 1)     // keep only the records whose key is 1
      .map(x => (x._1, 1))   // pair each matching key with a 1
      .reduceByKey(_ + _)    // sum the 1s per key
      .collect()
  }
}

val new_class = new ABC
println("Outside-->" + new_class.demo(rdd).mkString(", "))
// Outside-->(1,3)

Related

How to convert Flink DataSet tuple to one column

I have graph data like
1 2
1 4
4 1
4 2
4 3
3 2
2 3
But I couldn't find a way to convert it to a one-column dataset like
1
2
1
4
4
1
...
Here is my code. I used a Scala ListBuffer, but I couldn't find a way to do this with a Flink DataSet:
val params: ParameterTool = ParameterTool.fromArgs(args)
val env = ExecutionEnvironment.getExecutionEnvironment
env.getConfig.setGlobalJobParameters(params)

val text = env.readTextFile(params.get("input"))
val tupleText = text.map { line =>
  val arr = line.split(" ")
  (arr(0), arr(1))
}

var x: Seq[(String, String)] = tupleText.collect()
var tempList = new ListBuffer[String]
x.foreach(line => {
  tempList += line._1
  tempList += line._2
})
tempList.foreach(println)
You can do that with flatMap:
// get some input
val input: DataSet[(Int, Int)] = env.fromElements((1, 2), (2, 3), (3, 4))
// emit every tuple element as own record
val output: DataSet[Int] = input.flatMap( (t, out) => {
  out.collect(t._1)
  out.collect(t._2)
})
// print result
output.print()
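If you prefer returning a collection instead of using the Collector, an equivalent flatMap is shown below (a sketch, assuming the usual import org.apache.flink.api.scala._ is in place):
// emit both tuple fields; flatMap flattens the Seq into individual records
val flattened: DataSet[Int] = input.flatMap(t => Seq(t._1, t._2))
flattened.print()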

How to write nested for loops in efficient way in scala?

I am working on a scenario where I need nested for loops. I am able to get the desired output, but I thought there might be a better way to achieve it.
I have the sample DataFrame below and want the output in the following format:
List(/id=1/state=CA/, /id=2/state=MA/, /id=3/state=CT/)
The snippet below does the job, but any suggestion to improve it is welcome.
Example:
val stateDF = Seq(
  (1, "CA"),
  (2, "MA"),
  (3, "CT")
).toDF("id", "state")

var cond = ""
val columnsLst = List("id", "state")
var pathList = List.empty[String]

for (row <- stateDF.collect) {
  cond = "/"
  val dataRow = row.mkString(",").split(",")
  for (colPosition <- columnsLst.indices) {
    cond = cond + columnsLst(colPosition) + "=" + dataRow(colPosition) + "/"
  }
  pathList = pathList ::: List(cond)
}
println(pathList)
You can build the value directly in the DataFrame and do a collect later if needed. Here is sample code:
scala> stateDF.select(concat(lit("/id="), col("id"),lit("/state="), col("state"), lit("/")).as("value")).show
+---------------+
| value|
+---------------+
|/id=1/state=CA/|
|/id=2/state=MA/|
|/id=3/state=CT/|
+---------------+
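If you then need the List from the question, you can collect that same expression; a sketch, assuming the spark-shell defaults (spark.implicits._ and org.apache.spark.sql.functions._) are in scope:
val paths: List[String] = stateDF
  .select(concat(lit("/id="), col("id"), lit("/state="), col("state"), lit("/")).as("value"))
  .as[String]        // String encoder comes from spark.implicits._
  .collect()
  .toList
// List(/id=1/state=CA/, /id=2/state=MA/, /id=3/state=CT/)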
Thank you for all the suggestions. I came up with the following for my requirement:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, lit}
import spark.implicits._   // already in scope in spark-shell; needed for toDF and the map below

val stateDF = Seq(
  (1, "CA"),
  (2, "MA"),
  (3, "CT")
).toDF("id", "state")

val allStates = stateDF.columns.foldLeft(stateDF) { (acc: DataFrame, colName: String) =>
  acc.withColumn(colName, concat(lit("/" + colName + "="), col(colName)))
}

val dfResults = allStates.select(concat(allStates.columns.map(c => col(c)): _*))
val columnList: List[String] = dfResults.map(row => row.getString(0) + "/").collect.toList
println(columnList)

Getting all Keys associated with the max value in Spark

I have an RDD and I want to find all keys which have the max value.
So if I have
( ((A), 5), ((B), 4), ((C), 5)) )
then i want to return
( ((A), 5), ((C), 5)) )
Edit: maxBy only gives out one key, so I don't think that will work.
I have tried
newRDD = oldRDD.sortBy(._2, false).filter{._2 == _.first}
and
newRDD = oldRDD.filter{_._2 == _.maxBy}
I know _.first and _.maxBy won't work, but they are supposed to get the max value from the oldRDD. My problem in every solution I try is that I can't access the max value inside a filter. I also believe the second "solution" I tried would be much quicker than the first, since the sortBy is not really necessary.
Here is an answer. The logic is pretty simple:
val rdd = sc.parallelize(Seq(("a", 5), ("b", 4), ("c", 5)))
// first get maximum value
val maxVal = rdd.values.max
// now filter to those elements with value==max value
val rddMax = rdd.filter { case (_, v) => v == maxVal }
rddMax.take(10)
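Note that this takes two passes over the data (one to find the max, one to filter). If that matters, the same result can be computed in a single pass with aggregate; here is a sketch, assuming the same rdd of (String, Int) pairs:
// track (current max, keys seen with that max) in one pass
val zero = (Int.MinValue, List.empty[String])

val (topVal, topKeys) = rdd.aggregate(zero)(
  (acc, kv) => {
    if (kv._2 > acc._1) (kv._2, List(kv._1))             // new maximum: restart the key list
    else if (kv._2 == acc._1) (acc._1, kv._1 :: acc._2)  // tie: keep this key too
    else acc
  },
  (a, b) =>
    if (a._1 > b._1) a
    else if (b._1 > a._1) b
    else (a._1, a._2 ::: b._2)                           // equal maxima: merge key lists
)
// topVal == 5, topKeys contains "a" and "c" (order not guaranteed)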
I'm not familiar with spark/RDD. In plain Scala, I would do:
scala> val max = ds.maxBy (_._2)._2
max: Int = 5
scala> ds.filter (_._2 == max)
res207: List[(String, Int)] = List((A,5), (C,5))
Setup was:
scala> val (a, b, c) = ("A", "B", "C")
a: String = A
b: String = B
c: String = C
scala> val ds = List ( ((a), 5), ((b), 4), ((c), 5))
ds: List[(String, Int)] = List((A,5), (B,4), (C,5))
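For completeness, the same single-pass idea in plain Scala can be written with foldLeft over the ds above (a sketch; it traverses the list only once instead of the maxBy-plus-filter double pass):
val (topValue, topPairs) = ds.foldLeft((Int.MinValue, List.empty[(String, Int)])) {
  case ((m, acc), kv @ (_, v)) =>
    if (v > m) (v, List(kv))          // new maximum: start a fresh list
    else if (v == m) (m, kv :: acc)   // tie: keep this pair as well
    else (m, acc)
}
// topValue == 5, topPairs == List((C,5), (A,5))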

SubtractByKey and keep rejected values

I was playing around with Spark and I am getting stuck on something that seems foolish.
Let's say we have two RDD:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
If I do rdd1.subtractByKey(rdd2), I will get {(1, 2)}, which is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD. Is there a prebuilt function in Spark or an elegant way to do this?
Please keep in mind that I am new to Spark; any help will be appreciated, thanks.
As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want to do can be expressed as Union - Intersection.
Here is how you can do this in Spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect
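If the set of surviving keys is small, a variant of that last step broadcasts it as a Set rather than capturing an Array in the closure (a sketch; contains on a Set is also cheaper than on an Array):
val keysToKeep = sc.broadcast(diff.collect().toSet)   // ship the small key set once per executor
r1.union(r2).filter(x => keysToKeep.value.contains(x._1)).collect()
// Array((1,2))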
I think I claim this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x=>(x._1, x._2._1))
val discarded = r3.filter(_._2._2.nonEmpty).map(x=>(x._1, x._2._1))
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is noticing that leftOuterJoin produces both the discarded (== records with a matching key in r2) and remaining (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala collection sense of splitting a collection into two depending on a predicate), or we could calculate subtracted and discarded in one pass.
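One practical note: both filters re-evaluate r3, so it can be worth caching the joined RDD so the join is only computed once; a minor variation on the code above:
val r3 = r1.leftOuterJoin(r2).cache()   // cache so the join is computed only once across both filters
val subtracted = r3.filter(_._2._2.isEmpty).map(x => (x._1, x._2._1))
val discarded  = r3.filter(_._2._2.nonEmpty).map(x => (x._1, x._2._1))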
You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
But you won't be keeping the values, just running another subtraction.
Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). I think one way to get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)
def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))
First use subtractByKey() and then subtract:
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)

In Scala, how do I keep track of running totals without using var?

For example, suppose I wish to read in fat, carbs and protein and wish to print the running total of each variable. An imperative style would look like the following:
var totalFat = 0.0
var totalCarbs = 0.0
var totalProtein = 0.0
var lineNumber = 0
for (lineData <- allData) {
  totalFat += lineData...
  totalCarbs += lineData...
  totalProtein += lineData...
  lineNumber += 1
  printCSV(lineNumber, totalFat, totalCarbs, totalProtein)
}
How would I write the above using only vals?
Use scanLeft.
val zs = allData.scanLeft((0, 0.0, 0.0, 0.0)) { case (r, c) =>
  val lineNr = r._1 + 1
  val fat = r._2 + c...
  val carbs = r._3 + c...
  val protein = r._4 + c...
  (lineNr, fat, carbs, protein)
}
zs foreach Function.tupled(printCSV)
Recursion. Pass the sums from the previous row to a function that adds them to the values of the current row, prints them to CSV, and then passes them to itself...
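A minimal sketch of that idea, assuming a hypothetical LineData case class with fat/carbs/protein fields and a stand-in for the question's printCSV helper:
import scala.annotation.tailrec

case class LineData(fat: Double, carbs: Double, protein: Double)   // assumed record shape

def printCSV(line: Int, fat: Double, carbs: Double, protein: Double): Unit =
  println(s"$line,$fat,$carbs,$protein")                           // stand-in for the question's printCSV

@tailrec
def report(rows: List[LineData], line: Int = 0,
           fat: Double = 0.0, carbs: Double = 0.0, protein: Double = 0.0): Unit =
  rows match {
    case Nil => ()
    case r :: rest =>
      val (f, c, p) = (fat + r.fat, carbs + r.carbs, protein + r.protein)
      printCSV(line + 1, f, c, p)
      report(rest, line + 1, f, c, p)                              // tail call carries the running totals
  }

report(List(LineData(3, 4, 5), LineData(4, 2, 3)))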
You can transform your data with map and get the total result with sum:
val total = allData map { ... } sum
With scanLeft you get the particular sums of each step:
val steps = allData.scanLeft(0) { case (sum,lineData) => sum+lineData}
val result = steps.last
If you want to create several new values in one iteration step, I would prefer a class which holds the values:
case class X(i: Int, str: String)
object X {
def empty = X(0, "")
}
(1 to 10).scanLeft(X.empty) { case (sum, data) => X(sum.i+data, sum.str+data) }
It's just a jump to the left,
and then a fold to the right /:
class Data(val a: Int, val b: Int, val c: Int)

val list = List(new Data(3, 4, 5), new Data(4, 2, 3),
                new Data(0, 6, 2), new Data(2, 4, 8))

val res = (new Data(0, 0, 0) /: list) ((acc, x) =>
  new Data(acc.a + x.a, acc.b + x.b, acc.c + x.c))