Compute an average in a RDD and then filter this RDD based on the average in Spark Streaming - scala

I want to do something in Spark Streaming that feels a bit strange, and I would like some feedback on it.
I have a DStream of tuples (String, Int). Let's say the string is an id and the integer is a value.
For each micro-batch, I want to compute the average of the Int field and then filter the same micro-batch based on that average, for example keeping the elements where field2 > average. So I wrote this code:
lineStreams
.foreachRDD(
rdd => {
val totalElement = rdd.count()
if(totalElement > 0) {
val totalSum = rdd.map(elem => elem.apply(1).toInt).reduce(_ + _)
val average = totalSum / totalElement
rdd.foreach(
elem => {
if(elem.apply(1).toInt > average){
println("Element is higher than average")
}
}
)
}
})
But this code does not actually work: the first part of the computation looks OK, but the test does not.
I know there are some dirty things in this code, but I just want to know whether the logic is sound.
Thanks for your advice!

Try:
lineStreams.transform { rdd => {
val mean = rdd.values.map(_.toDouble).mean
rdd.filter(_._2.toDouble > mean)
}}
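The same compute-then-filter logic can be checked on a plain collection before wiring it into a DStream. This is a minimal sketch, assuming a `Seq[(String, Int)]` stands in for one micro-batch; `MeanFilter` and `aboveMean` are hypothetical names, not part of Spark. Note that it computes the mean as a Double, whereas `totalSum / totalElement` in the question divides an Int by a Long and therefore truncates.

```scala
object MeanFilter {
  // Keep only the (id, value) pairs whose value is strictly above the batch mean.
  // `batch` stands in for the contents of one micro-batch (an assumption).
  def aboveMean(batch: Seq[(String, Int)]): Seq[(String, Int)] =
    if (batch.isEmpty) Seq.empty
    else {
      val mean = batch.map(_._2.toDouble).sum / batch.size
      batch.filter { case (_, value) => value > mean }
    }
}
```

For example, `MeanFilter.aboveMean(Seq(("a", 1), ("b", 2), ("c", 6)))` has a mean of 3.0 and keeps only `("c", 6)`.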

Related

Cleaner way to find all indices of same value in Scala

I have a textfile like so
NameOne,2,3,3
NameTwo,1,0,2
I want to find the indices of the max value in each line in Scala. So the output of this would be
NameOne,1,2
NameTwo,2
I'm currently using the function below to do this but I can't seem to find a simple way to do this without a for loop and I'm wondering if there is a better method out there.
def findIndices(movieRatings: String): (String) = {
val tokens = movieRatings.split(",", -1)
val movie = tokens(0)
val ratings = tokens.slice(1, tokens.size)
val max = ratings.max
var indices = ArrayBuffer[Int]()
for (i<-0 until ratings.length) {
if (ratings(i) == max) {
indices += (i+1)
}
}
return movie + "," + indices.mkString(",")
}
This function is called as so:
val output = textFile.map(findIndices).saveAsTextFile(args(1))
Just starting to learn Scala so any advice would help!
You can zipWithIndex and use filter:
ratings.zipWithIndex
.filter { case (value, _) => value == max }
.map { case (_, index) => index }
I noticed that your code doesn't actually produce the expected result from your example input. I'm going to assume that the example given is the correct result.
def findIndices(movieRatings :String) :String = {
val Array(movie, ratings @_*) = movieRatings.split(",", -1)
val mx = ratings.maxOption //Scala 2.13.x
ratings.indices
.filter(x => mx.contains(ratings(x)))
.mkString(s"$movie,",",","")
}
Note that this doesn't address some of the shortcomings of your algorithm:
No comma allowed in movie name.
Only works for ratings from 0 to 9. No spaces allowed.
testing:
List("AA"
,"BB,"
,"CC,5"
,"DD,2,5"
,"EE,2,5, 9,11,5"
,"a,b,2,7").map(findIndices)
//res0: List[String] = List(AA, <-no ratings
// , BB,0 <-comma, no ratings
// , CC,0 <-one rating
// , DD,1 <-two ratings
// , EE,1,4 <-" 9" and "11" under valued
// , a,0 <-comma in name error
// )
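One way around the "ratings from 0 to 9" limitation is to parse the ratings as numbers before taking the maximum. The following is only a sketch under that assumption: `MaxIndices` and `findMaxIndices` are hypothetical names, indices are 0-based to match the expected output in the question, and non-numeric ratings make it throw. It needs Scala 2.13 or later for `maxOption`.

```scala
object MaxIndices {
  // Returns "name,i1,i2,..." where i1, i2, ... are the 0-based positions of the
  // largest rating. Ratings are trimmed and parsed as Int, so "11" beats " 9".
  def findMaxIndices(line: String): String = {
    val tokens  = line.split(",", -1)
    val name    = tokens.head
    val ratings = tokens.tail.toSeq.map(_.trim.toInt) // throws on non-numeric input
    val best    = ratings.maxOption                    // None when there are no ratings
    val hits    = ratings.zipWithIndex.collect { case (r, i) if best.contains(r) => i }
    (name +: hits.map(_.toString)).mkString(",")
  }
}
```

With this version, `"EE,2,5, 9,11,5"` yields `"EE,3"` because 11 is compared as a number rather than as a string. The comma-in-name problem is still not addressed.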

Scala broadcast join with "one to many" relationship

I am fairly new to Scala and RDDs.
I have a very simple scenario yet it seems very hard to implement with RDDs.
Scenario:
I have two tables. One large and one small. I broadcast the smaller table.
I then want to join the table and finally aggregate the values after the join to a final total.
Here is an example of the code:
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1,a._2) } // turn to pair RDD
.collectAsMap() // collect as Map
)
//first join
val joinedRDD = bigRDD.map( accs => {
//get list of groups
val groups = List("Fruit", "ZipCode")
val i = "Fruit"
//for each group
//for(i <- groups) {
if (broadcastVar.value.get(accs._1, i) != None) {
( broadcastVar.value.get(accs._1, i).get._1,
broadcastVar.value.get(accs._1, i).get._2,
accs._2, accs._3)
} else {
None
}
//}
}
)
//expected after this
//("A","Fruit","Apples",1, "1Jan2000"),("B","Fruit","Apples",2, "1Jan2000"),
//("A","ZipCode","1234", 1,"1Jan2000"),("B","ZipCode","456", 2,"1Jan2000")
//then group and sum
//cannot do anything with the joinedRDD!!!
//error == value copy is not a member of Product with Serializable
// Final Expected Result
//("Fruit","Apples",3, "1Jan2000"),("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
My questions:
Is this the best approach first of all with RDDs?
Disclaimer - I have done this whole task using dataframes successfully. The idea is to create another version using only RDDs to compare performance.
Why is the type of my joinedRDD not recognised after it was created so that I can continue to use functions like copy on it?
How can I get away with not doing a .collectAsMap() when broadcasting the variable? I currently have to include the first two items in the key to enforce uniqueness and avoid dropping any values.
Thanks for the help in advance!
Final solution for anyone interested
case class dt (group:String, group_key:String, count:Long, date:String)
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1) } // turn to pair RDD
.groupByKey() // so that no data is lost
.collectAsMap() // collect as Map
)
//first join
val joinedRDD = bigRDD.flatMap( accs => {
if (broadcastVar.value.get(accs._1) != None) {
val bc = broadcastVar.value.get(accs._1).get
bc.map(p => {
dt(p._2, p._3,accs._2, accs._3)
})
} else {
None
}
}
)
//expected after this
//("Fruit","Apples",1, "1Jan2000"),("Fruit","Apples",2, "1Jan2000"),
//("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
//then group and sum
var finalRDD = joinedRDD.map(s => {
(s.copy(count=0),s.count) //trick to keep code to minimum (count = 0)
})
.reduceByKey(_ + _)
.map(pair => {
pair._1.copy(count=pair._2)
})
In your map statement you return either a tuple or None depending on the if condition. These types do not match, so the compiler falls back to a common supertype and joinedRDD becomes an RDD[Product with Serializable], which is not what you want at all (it's basically RDD[Any]). You need to make sure all paths return the same type. In this case, you probably want an Option[(String, String, Int, String)]. All you need to do is wrap the tuple result in a Some:
if (broadcastVar.value.get(accs._1, i) != None) {
Some(( broadcastVar.value.get(accs._1, i).get.group_key,
broadcastVar.value.get(accs._1, i).get.group,
accs._2, accs._3))
} else {
None
}
And now your types will match up. This makes joinedRDD an RDD[Option[(String, String, Int, String)]]. Now that the type is correct the data is usable; however, you will need to map over the Option to work with the tuples. If you don't need the None values in the final result, you can use flatMap instead of map to create joinedRDD, which unwraps the Options for you and filters out all the Nones.
collectAsMap is the correct way to turn an RDD into a HashMap, but you need multiple values for a single key. After mapping smallRDD into key-value pairs but before calling collectAsMap, use groupByKey to group all of the values for a single key together. Then, when you look up a key in your HashMap, you can map over the values, creating a new record for each one.
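The grouping and flatMap ideas can be tried on ordinary collections first. This is a sketch only, assuming the sample rows from the question; `LookupJoin`, `lookup`, `joined` and `totals` are hypothetical names, and the local `groupBy` stands in for `groupByKey` followed by `collectAsMap`. Because every branch produces a `Seq` of tuples, the element type is inferred precisely instead of widening to `Product with Serializable`.

```scala
object LookupJoin {
  // Small side: (id, group, groupKey). Several rows may share one id.
  val small = Seq(("A", "Fruit", "Apples"), ("A", "ZipCode", "1234"),
                  ("B", "Fruit", "Apples"), ("B", "ZipCode", "456"))
  // Big side: (id, count, date).
  val big = Seq(("A", 1, "1Jan2000"), ("B", 2, "1Jan2000"), ("C", 3, "1Jan2000"))

  // Group by id so that no row is dropped when ids repeat.
  val lookup: Map[String, Seq[(String, String, String)]] = small.groupBy(_._1)

  // One output row per matching small-side row; ids without a match yield nothing.
  val joined: Seq[(String, String, Int, String)] = big.flatMap { case (id, count, date) =>
    lookup.getOrElse(id, Seq.empty).map { case (_, group, key) => (group, key, count, date) }
  }

  // Sum the counts per (group, groupKey, date).
  val totals: Map[(String, String, String), Int] =
    joined.groupBy { case (group, key, _, date) => (group, key, date) }
      .map { case (k, rows) => k -> rows.map(_._3).sum }
}
```

On the sample data, `totals` maps `("Fruit", "Apples", "1Jan2000")` to 3, `("ZipCode", "1234", "1Jan2000")` to 1 and `("ZipCode", "456", "1Jan2000")` to 2, which is the final result the question asks for.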

too many map keys causing out of memory exception in spark

I have an RDD 'inRDD' of the form RDD[(Vector[(Int, Byte)], Vector[(Int, Byte)])] which is a PairRDD(key,value) where key is Vector[(Int, Byte)] and value is Vector[(Int, Byte)].
For each element (Int, Byte) in the vector of key field, and each element (Int, Byte) in the vector of value field I would like to get a new (key,value) pair in the output RDD as (Int, Int), (Byte, Byte).
That should give me an RDD of the form RDD[((Int, Int), (Byte, Byte))].
For example, inRDD contents could be like,
(Vector((3,2)),Vector((4,2))), (Vector((2,3), (3,3)),Vector((3,1))), (Vector((1,3)),Vector((2,1))), (Vector((1,2)),Vector((2,2), (1,2)))
which would become
((3,4),(2,2)), ((2,3),(3,1)), ((3,3),(3,1)), ((1,2),(3,1)), ((1,2),(2,2)), ((1,1),(2,2))
I have the following code for that.
val outRDD = inRDD.flatMap {
case (left, right) =>
for ((ll, li) <- left; (rl, ri) <- right) yield {
(ll,rl) -> (li,ri)
}
}
It works when the vectors in inRDD are small. But when the vectors contain a lot of elements, I get an out-of-memory exception. Increasing the memory available to Spark only helps for smaller inputs, and the error appears again for even larger inputs.
It looks like I am trying to assemble a huge structure in memory, and I have not been able to rewrite this code in any other way.
I have implemented a similar logic with java in hadoop as follows.
for (String fromValue : fromAssetVals) {
fromEntity = fromValue.split(":")[0];
fromAttr = fromValue.split(":")[1];
for (String toValue : toAssetVals) {
toEntity = toValue.split(":")[0];
toAttr = toValue.split(":")[1];
oKey = new Text(fromEntity.trim() + ":" + toEntity.trim());
oValue = new Text(fromAttr + ":" + toAttr);
outputCollector.collect(oKey, oValue);
}
}
But when I try something similar in spark, I get nested rdd exceptions.
How do I do this efficiently with spark using scala?
Well, if Cartesian product is the only option you can at least make it a little bit more lazy:
inRDD.flatMap { case (xs, ys) =>
xs.toIterator.flatMap(x => ys.toIterator.map(y => (x, y)))
}
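The lazy variant can also produce the ((Int, Int), (Byte, Byte)) shape the question asks for. A minimal sketch, assuming the vector layout from the question; `LazyPairs` and `pairs` are hypothetical names. It returns an Iterator, so pairs are generated one at a time as they are consumed instead of being collected into one large Vector per record.

```scala
object LazyPairs {
  // Lazily enumerate every (key element, value element) combination and reshape
  // it into ((Int, Int), (Byte, Byte)). Nothing is materialised until consumed.
  def pairs(left: Vector[(Int, Byte)],
            right: Vector[(Int, Byte)]): Iterator[((Int, Int), (Byte, Byte))] =
    for {
      (ll, li) <- left.iterator
      (rl, ri) <- right.iterator
    } yield ((ll, rl), (li, ri))
}
```

In Spark this would presumably be used as `inRDD.flatMap { case (l, r) => LazyPairs.pairs(l, r) }`. The total number of output records is unchanged, so this only helps when the problem is building each record's product in memory at once.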
You can also handle this at the Spark level
import org.apache.spark.RangePartitioner
val indexed = inRDD.zipWithUniqueId.map(_.swap)
val partitioner = new RangePartitioner(indexed.partitions.size, indexed)
val partitioned = indexed.partitionBy(partitioner)
val lefts = partitioned.flatMapValues(_._1)
val rights = partitioned.flatMapValues(_._2)
lefts.join(rights).values

Distributing a loop to different machines of a cluster in Spark

Here is a for loop that I'm running in my code:
for(x<-0 to vertexArray.length-1)
{
for(y<-0 to vertexArray.length-1)
{
breakable {
if (x.equals(y)) {
break
}
else {
var d1 = vertexArray(x)._2._2
var d2 = vertexArray(y)._2._2
val ps = new Period(d1, d2)
if (ps.getMonths() == 0 && ps.getYears() == 0 && Math.abs(ps.toStandardHours().getHours()) <= 5) {
edgeArray += Edge(vertexArray(x)._1, vertexArray(y)._1, Math.abs(ps.toStandardHours().getHours()))
}
}
}
}
}
I want to speed up the running time of this code by distributing it across multiple machines in a cluster. I'm using Scala on intelliJ-idea with Spark. How would I implement this type of code to work on multiple machines?
As already stated by Mariano Kamp, Spark is probably not a good choice here and there are much better options out there. On top of that, any approach which has to work on relatively large data and requires O(N^2) time is simply unacceptable. So the first thing you should do is focus on choosing a suitable algorithm, not a platform.
Still it is possible to translate it to Spark. A naive approach which directly reflects your code would be to use Cartesian product:
def check(v1: T, v2: T): Option[U] = {
if (v1 == v2) {
None
} else {
// rest of your logic, Some[U] if all tests passed
// None otherwise
???
}
}
val vertexRDD = sc.parallelize(vertexArray)
val matches = vertexRDD.cartesian(vertexRDD) // all ordered pairs of vertices
.map{case (v1, v2) => check(v1, v2)}
.filter(_.isDefined)
.map(_.get)
If vertexArray is small you could use flatMap with broadcast variable
val vertexBd = sc.broadcast(vertexArray)
vertexRDD.flatMap(v1 =>
vertexBd.value.map(v2 => check(v1, v2)).filter(_.isDefined).map(_.get)
)
Another improvement is to perform proper join. The obvious condition is year and month:
def toPair(v: T): ((Int, Int), T) = ??? // Return ((year, month), vertex)
val vertexPairs = vertexRDD.map(toPair)
vertexPairs.join(vertexPairs)
.map{case ((_, _), (v1, v2)) => check(v1, v2)} // Check should be simplified
.filter(_.isDefined)
.map(_.get)
Of course this can be achieved with a broadcast variable as well. You simply have to group vertexArray by (year, month) pair and broadcast Map[(Int, Int), T].
From here you can improve further by avoiding naive checks by partition and traversing data sorted by timestamp:
def sortPartitionByDatetime(iter: Iterator[U]): Iterator[U] = ???
def yieldMatching(iter: Iterator[U]): Iterator[V] = {
// flatmap keeping track of values in open window
???
}
vertexPairs
.partitionBy(new HashPartitioner(n))
.mapPartitions(sortPartitionByDatetime)
.mapPartitions(yieldMatching)
or using a DataFrame with window function and range clause.
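The "sort, then keep an open window" idea can be sketched on a local collection. This is an illustration under stated assumptions, not the answer's implementation: events are simplified to `(id, hour)` pairs of Long values, and `WindowMatch` and `closePairs` are hypothetical names. Each unordered pair is emitted once, whereas the loop in the question adds an edge in both directions.

```scala
object WindowMatch {
  // Emit (id1, id2, hoursApart) for every pair of events at most `maxHours` apart.
  // Sorting first means each event is compared only with the events still inside
  // its window rather than with the whole data set.
  def closePairs(events: Seq[(Long, Long)], maxHours: Long): Seq[(Long, Long, Long)] = {
    val sorted = events.sortBy(_._2).toVector
    var start = 0 // left edge of the open window
    val out = Vector.newBuilder[(Long, Long, Long)]
    for (i <- sorted.indices) {
      val (id, t) = sorted(i)
      // Drop events that are too old to match the current one.
      while (t - sorted(start)._2 > maxHours) start += 1
      for (j <- start until i) {
        val (otherId, otherT) = sorted(j)
        out += ((otherId, id, t - otherT))
      }
    }
    out.result()
  }
}
```

The cost is proportional to the number of events plus the number of pairs that actually fall inside a window, so it is still quadratic when most events lie within a few hours of each other.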
Note:
All types are simply placeholders. In the future please try to provide type information. Right now all I can tell is there are some tuples and dates involved
Welcome to Stack Overflow. Unfortunately this is not the right approach ;(
Spark is not a tool to parallelize tasks, but to parallelize data.
So you need to think how you can distribute/parallelize/partition your data, then compute the individual partitions, then consolidate the results as a last step.
Also you need to read up on Spark in general. A simple answer here cannot get you started. This is just the wrong format.
Start here: http://spark.apache.org/docs/latest/programming-guide.html

applying a function to graph data using mapReduceTriplets in spark and graphx

I'm having some problems applying the mapReduceTriplets to my graph network in spark using graphx.
I've been following the tutorials and read in my own data which is put together as [Array[String],Int], so for example my vertices are:
org.apache.spark.graphx.VertexRDD[Array[String]] e.g. (3999,Array(17, Low, 9))
And my edges are:
org.apache.spark.graphx.EdgeRDD[Int]
e.g. Edge(3999,4500,1)
I'm trying to apply an aggregate-type function using mapReduceTriplets that counts how often the last integer in a vertex's array (9 in the example above) is the same as, or different from, the first integer (17 in the example above) of each connected vertex.
So you would end up with a list of counts for the number of matches or non-matches.
The problem I am having is applying any function using mapReduceTriplets, I am quite new to scala so this may be really obvious, but in the graphx tutorials it has an example which is using a graph with the format Graph[Double, Int], however my graph is in the format of Graph[Array[String],Int], so i'm just trying as a first step to figure out how I can use my graph in the example and then work from there.
The example on the graphx website is as follows:
val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
triplet => { // Map Function
if (triplet.srcAttr > triplet.dstAttr) {
// Send message to destination vertex containing counter and age
Iterator((triplet.dstId, (1, triplet.srcAttr)))
} else {
// Don't send a message for this triplet
Iterator.empty
}
},
// Add counter and age
(a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
Any advice would be most appreciated, or if you think there is a better way than using mapreducetriplets I would be happy to hear it.
Edited new code
val nodes = (sc.textFile("C~nodeData.csv")
.map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
val edges = GraphLoader.edgeListFile(sc, "C:~edges.txt")
val graph = edges.outerJoinVertices(nodes) {
case (uid, deg, Some(attrList)) => attrList
case (uid, deg, None) => Array.empty[String]
}
val countsRdd = graph.collectNeighbors(EdgeDirection.Either).leftOuterJoin(graph.vertices).map {
case (id, t) => {
val neighbors: Array[(VertexId, Array[String])] = t._1
val nodeAttr = (t._2)
neighbors.map(_._2).count( x => x.apply(x.size - 1) == nodeAttr(0))
}
}
I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages.
collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Just reduce the Array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
.join(graph.vertices)
.map{ case (vid,t) => {
val neighbors = t._1
val nodeAttr = t._2
neighbors.map(_._2).filter( <add logic here> ).size
}}
If this doesn't get you going in the right direction, or you get stuck, let me know (the "<add logic here>" part, for example).
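As a starting point for the placeholder logic, here is a sketch of the per-vertex count on plain arrays. It assumes the attribute layout from the question, e.g. Array("17", "Low", "9"), and follows the question's description by comparing the vertex's last attribute with each neighbour's first attribute; `NeighborCount` and `matches` are hypothetical names.

```scala
object NeighborCount {
  // For one vertex: how many neighbours have a first attribute equal to this
  // vertex's last attribute. Empty attribute arrays never match.
  def matches(nodeAttr: Array[String], neighbors: Seq[Array[String]]): Int =
    nodeAttr.lastOption match {
      case Some(last) => neighbors.count(_.headOption.contains(last))
      case None       => 0 // vertex without attributes
    }
}
```

The number of non-matches is then the neighbour count minus this value. If the comparison should run the other way round (neighbour's last attribute against the vertex's first), swap `lastOption` and `headOption`.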