I have Scala code like this:
def avgCalc(buffer: Iterable[Array[String]], list: Array[String]) = {
val currentTimeStamp = list(1).toLong // loads the timestamp column
var sum = 0.0
var count = 0
var check = false
import scala.util.control.Breaks._
breakable {
for (array <- buffer) {
val toCheckTimeStamp = array(1).toLong // timestamp column
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
sum += array(5).toDouble // RSSI weightage values
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
}
list :+ sum
}
I will call the above function like this
import spark.implicits._
val averageDF =
filterop.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
.sortBy(array => array(1), false) // Sort by timestamp
.groupBy(array => (array(0), array(2))) // group by tag and listner
.mapValues(buffer => {
buffer.map(list => {
avgCalc(buffer, list) // calling the average function
})
})
.flatMap(x => x._2)
.map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble)) // defining the schema through case class
.toDF // converting to data frame
The above code is working fine, but I need to get rid of the list parameter. My senior asked me to remove it because the list reduces the execution speed. Any suggestions on how to proceed without the list?
Any help will be appreciated.
The following solution should work, I guess; I have tried to avoid passing both the iterable and a single array.
def avgCalc(buffer: Iterable[Array[String]]) = {
var finalArray = Array.empty[Array[String]]
import scala.util.control.Breaks._
breakable {
for (outerArray <- buffer) {
val currentTimeStamp = outerArray(1).toLong
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- buffer) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list :+ (sum / count).toString
else list = list :+ list(5).toDouble.toString
finalArray ++= Array(list)
}
}
finalArray
}
and you can call it like this:
import sqlContext.implicits._
val averageDF =
filter_op.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
.sortBy(array => array(1), false)
.groupBy(array => (array(0), array(2)))
.mapValues(buffer => {
avgCalc(buffer)
})
.flatMap(x => x._2)
.map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble))
.toDF
I hope this is the desired answer
I can see that you have accepted an answer, but I have to say that you have a lot of unnecessary code. As far as I can see, you have no reason to do the initial conversion to Array type in the first place and the sortBy is also unnecessary at this point. I would suggest you work directly on the Row.
Also, you have a number of unused variables that should be removed, and converting to a case class only to be followed by a toDF seems excessive IMHO.
I would do something like this:
import org.apache.spark.sql.Row
def avgCalc(sortedList: List[Row]) = {
sortedList.indices.map(i => {
var sum = 0.0
val row = sortedList(i)
val currentTimeStamp = row.getString(1).toLong // loads the timestamp column
import scala.util.control.Breaks._
breakable {
for (j <- 0 until sortedList.length) {
if (j != i) {
val anotherRow = sortedList(j)
val toCheckTimeStamp = anotherRow.getString(1).toLong // timestamp column
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
sum += anotherRow.getString(5).toDouble // RSSI weightage values
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
break
}
}
}
}
(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4), row.getString(5), sum.toString)
})
}
val averageDF = filterop.rdd
.groupBy(row => (row(0), row(2)))
.flatMap{case(_,buffer) => avgCalc(buffer.toList.sortBy(_.getString(1).toLong))}
.toDF("Tag", "Timestamp", "Listner", "X", "Y", "RSSI", "AvgCalc")
And as a final comment, I'm pretty sure it's possible to come up with a nicer/cleaner implementation of the avgCalc function, but I'll leave it to you to play around with that :)
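For instance, one direction (just a sketch of my own, dropping the break optimization and simply filtering the whole group for every row, so it trades a bit of extra work for readability; the name avgCalcFunctional is hypothetical):

import org.apache.spark.sql.Row

def avgCalcFunctional(sortedList: List[Row]) = {
  sortedList.zipWithIndex.map { case (row, i) =>
    val currentTimeStamp = row.getString(1).toLong
    // sum the RSSI values of every other row whose timestamp lies in the
    // 10-second window ending at this row's timestamp
    val sum = sortedList.zipWithIndex.collect {
      case (other, j) if j != i &&
        other.getString(1).toLong >= currentTimeStamp - 10L &&
        other.getString(1).toLong <= currentTimeStamp =>
        other.getString(5).toDouble
    }.sum
    (row.getString(0), row.getString(1), row.getString(2),
      row.getString(3), row.getString(4), row.getString(5), sum.toString)
  }
}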
Related
I want to remove the last line from an RDD using the .mapPartitionsWithIndex function.
I have tried the below code
val withoutFooter = rdd.mapPartitionsWithIndex { (idx, iter) =>
if (idx == noOfTotalPartitions) {
iter.drop(size - 1)
}
else iter
}
But I am not able to get the correct result.
drop will drop the first n elements and return the remaining elements
Read more here https://stackoverflow.com/a/51792161/6556191
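For example, a quick illustration of the drop semantics (plain Scala, nothing Spark-specific):

// drop removes elements from the front of the iterator, not the back
Iterator(1, 2, 3, 4).drop(3).toList // List(4) -- only the last element is kept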
The code below works for me
val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),4)
val lastPartitionIndex = rdd.getNumPartitions - 1
rdd.mapPartitionsWithIndex { (idx, iter) =>
var reti = iter
if (idx == lastPartitionIndex) {
var lastPart = iter.toArray
reti = lastPart.slice(0, lastPart.length-1).toIterator
}
reti
}
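A quick sanity check (assuming the mapPartitionsWithIndex expression above is assigned to a val, here hypothetically named withoutLast): with 4 partitions the data is split as [1,2], [3,4], [5,6], [7,8,9], so only the trailing 9 should disappear.

withoutLast.collect() // Array(1, 2, 3, 4, 5, 6, 7, 8)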
Hi, I am trying a UDAF with Spark Scala. I am getting the following exception.
Description Resource Path Location Type type mismatch; found : scala.collection.immutable.IndexedSeq[Any] required: String SumCalc.scala /filter line 63 Scala Problem
This is my code.
override def evaluate(buffer: Row): Any = {
val in_array = buffer.getAs[WrappedArray[String]](0);
var finalArray = Array.empty[Array[String]]
import scala.util.control.Breaks._
breakable {
for (outerArray <- in_array) {
val currentTimeStamp = outerArray(1).toLong
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- in_array) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list :+ (sum).toString // getting error on this line.
else list = list :+ list(5).toDouble.toString
finalArray ++= Array(list)
}
finalArray
}
}
Any help will be appreciated.
There are a couple of mistakes in the evaluate function of your UDAF.
The list variable is a String, but you are treating it as an array
finalArray is initialized as Array.empty[Array[String]], but later on you are adding Array(list) to it
You are not returning finalArray from the evaluate method, as it is inside the for loop
So the correct way should be as below
override def evaluate(buffer: Row): Any = {
val in_array = buffer.getAs[WrappedArray[String]](0);
var finalArray = Array.empty[String]
import scala.util.control.Breaks._
breakable {
for (outerArray <- in_array) {
val currentTimeStamp = outerArray(1).toLong // timestamp values
var sum = 0.0
var count = 0
var check = false
var list = outerArray
for (array <- in_array) {
val toCheckTimeStamp = array(1).toLong
if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
sum += array(5).toDouble // RSSI weightage values
count += 1
}
if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
check = true
break
}
}
if (sum != 0.0 && check) list = list + (sum).toString // calculate sum for the 10 secs difference
else list = list + (sum).toString // If 10 secs difference is not there take rssi weightage value
finalArray ++= Array(list)
}
}
finalArray // Final results for this function
}
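One more thing worth double-checking (an assumption on my side, since the rest of the UDAF is not shown): the dataType declared in the UDAF has to match what evaluate returns, so for the Array[String] built above it would look something like this.

import org.apache.spark.sql.types._

// hypothetical declaration elsewhere in the same UDAF class
override def dataType: DataType = ArrayType(StringType)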
Hope the answer is helpful
I have the RDD below and I need to perform a series of filters on the same dataset to derive different counters and aggregates.
Is there a way I can apply these filters and compute the aggregates in a single pass, avoiding Spark going over the same dataset multiple times?
val res = df.rdd.map(row => {
// ............... Generate data here for each row.......
})
res.persist(StorageLevel.MEMORY_AND_DISK)
val all = res.count()
val stats1 = res.filter(row => row.getInt(1) > 0)
val stats1Count = stats1.count()
val stats1Agg = stats1.map(r => r.getInt(1)).mean()
val stats2 = res.filter(row => row.getInt(2) > 0)
val stats2Count = stats2.count()
val stats2Agg = stats2.map(r => r.getInt(2)).mean()
You can use aggregate:
case class Stats(count: Int = 0, sum: Int = 0) {
def mean = if (count == 0) 0.0 else sum.toDouble / count // Double division, so the mean is not truncated
def +(s: Stats): Stats = Stats(count + s.count, sum + s.sum)
def add(n: Int): Stats = if (n > 0) copy(count + 1, sum + n) else this // <- is a reserved word in Scala, so the accumulator is called add
}
val (stats1, stats2) = res.aggregate(Stats() -> Stats()) (
{ (s, row) => (s._1 add row.getInt(1), s._2 add row.getInt(2)) },
{ case ((a1, a2), (b1, b2)) => (a1 + b1, a2 + b2) } // combine the two pairs element-wise
)
val (stat1Count, stats1Agg, stats2Count, stats2Agg) = (stats1.count, stats1.mean, stats2.count, stats2.mean)
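For example, a small sanity check on a toy RDD of Rows (a sketch only, assuming a SparkContext named sc is in scope):

import org.apache.spark.sql.Row

val sample = sc.parallelize(Seq(Row(0, 3, 0), Row(1, 5, 2), Row(2, 0, 4)))
val (s1, s2) = sample.aggregate(Stats() -> Stats())(
  { (s, row) => (s._1 add row.getInt(1), s._2 add row.getInt(2)) },
  { case ((a1, a2), (b1, b2)) => (a1 + b1, a2 + b2) }
)
// s1 == Stats(2, 8), so s1.mean == 4.0 (rows where column 1 > 0)
// s2 == Stats(2, 6), so s2.mean == 3.0 (rows where column 2 > 0)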
I am generating a graph using the following code.
NOTE: row(1) = name, row(2) = type
val Nodes: RDD[(VertexId, (String, String))] = sc.textFile(nodesFile).flatMap {
line =>
if (!line.isEmpty && line(0) != '#') {
val row = line.split(";,;,;")
if (row.length == 3) {
if (row(0).length > 0 && row(1).length > 0 && row(2).length > 0 && row(0).forall(_.isDigit) && row(2).toString.toUpperCase != "AB" && row(2).toString.toUpperCase != "XYZ") {
List((row(0).toLong, (row(1).toString.toUpperCase, row(2).toString.toUpperCase)))
} else { None }
} else { None }
} else {
None
}
}
So it's generating a map like this.
(11,(SAMSUNG_PHONE,Item))
(0,null)
(1,(Flying,PC))
(6,null)
This means vertices 0 and 6 have the value 'AB' or 'XYZ'; that's why it's inserting null. But I want to filter out and remove these null-value nodes. I tried but didn't get it to work.
Please give me a hint or a reference.
A Solution
Assuming an input file with content
0;foo;AB
1;cool,stuff
2;other;things
6;foo;XYZ
3;a;b
your code is nearly working.
After adapting the split pattern (see below) and polishing the return value (List() instead of None) the code works:
configuredUnitTest("Test SO") { sc =>
import org.apache.spark.rdd.RDD
val sqlContext = new SQLContext(sc)
val nodesFile = "/home/martin/input.txt"
val nodes: RDD[(Long, (String, String))] = sc.textFile(nodesFile).flatMap {
line =>
if (!line.isEmpty && line(0) != '#') {
val row = line.split("[,;]")
if (row.length == 3) {
if (row(0).length > 0 && row(1).length > 0 && row(2).length > 0 && row(0).forall(_.isDigit) && row(2).toString.toUpperCase != "AB" && row(2).toString.toUpperCase != "XYZ") {
List((row(0).toLong, (row(1).toString.toUpperCase, row(2).toString.toUpperCase)))
} else {
List()
}
} else {
List()
}
} else {
List()
}
}
println( nodes.count() )
val result = nodes.collect()
println( result.size )
println( result.mkString("\n") )
}
Result is
3
3
(1,(COOL,STUFF))
(2,(OTHER,THINGS))
(3,(A,B))
Code deficiencies
Return type of the function: String => List[(Long, (String, String))]
Your code for rejecting non-matching lines is verbose and hard to read. Why don't you return a List() instead of None? An example that proves the point:
scala> val a = List(1,2,3)
a: List[Int] = List(1, 2, 3)
scala> a.flatMap( x => { if (x==1) List() else List(x)} )
res0: List[Int] = List(2, 3)
Having said that, when you use flatMap this way you don't need an extra filter for null values.
Split pattern
Your regex split pattern is wrong.
Your pattern ";,;,;" says: split when encountering the sequence ";,;,;", thus "a;,;,;b" is split into a and b. This is most probably not what you want. Instead, you want to split at either ";" or ",", so the regex saying ";" or "," is "[;,]".
scala> val x ="a;b,c"
x: String = a;b,c
scala> x.split(";").mkString("|")
res2: String = a|b,c
scala> x.split(";,").mkString("|")
res3: String = a;b,c
scala> x.split("[;,]").mkString("|")
res4: String = a|b|c
A much better approach
With filter and some helper functions, your code can be rewritten as
configuredUnitTest("Test SO") { sc =>
import org.apache.spark.rdd.RDD
val sqlContext = new SQLContext(sc)
val nodesFile = "/home/martin/input.txt"
def lengthOfAtLeastOne(x: String) : Boolean = x.length > 0
def toUpperCase(x: String) = x.toString.toUpperCase
val nodes = sc.textFile(nodesFile)
.map( _.split("[;,]") )
.filter( _.size == 3 )
.filter( xs => ( lengthOfAtLeastOne(xs(0)) && lengthOfAtLeastOne(xs(1)) && lengthOfAtLeastOne(xs(2)) ) )
.filter( xs => (toUpperCase(xs(2)) != "AB") && (toUpperCase(xs(2)) != "XYZ"))
.map( xs => (xs(0).toLong, ( toUpperCase(xs(1)), toUpperCase(xs(2))) ))
println( nodes.count() )
val result = nodes.collect()
println( result.size )
println( result.mkString("\n") )
println("END OF")
}
Much better to read, eh?
I've tested your code and it seems to work just fine.
I've used the input from Martin Senne, and also used his regex.
Because I don't know exactly what your textFile contains, I just read some lines from a file:
val lines = Source.fromFile("so/resources/f").getLines.toList
val x = lines.flatMap {
line =>
if (!line.isEmpty && line(0) != '#') {
val row = line.split("[;,]")
if (row.length == 3) {
if (row(0).length > 0 && row(1).length > 0 && row(2).length > 0 && row(0).forall(_.isDigit) && row(2).toString.toUpperCase != "AB" && row(2).toString.toUpperCase != "XYZ") {
List((row(0).toLong, (row(1).toString.toUpperCase, row(2).toString.toUpperCase)))
} else {
None
}
} else {
None
}
} else {
None
}
}
println(x.size)
x foreach println
So for this input:
0;foo;AB
1;cool,stuff
2;other;things
6;foo;XYZ
3;a;b
I got this output:
3
(1,(COOL,STUFF))
(2,(OTHER,THINGS))
(3,(A,B))
Of course, many modifications can be made to your code:
.filterNot(_.startsWith("#")) // lines cannot start with #
.map(_.split("[;,]")) // split the lines
.filter(_.size == 3) // each line must have 3 items in it
.filter(line => line.filter(_.length > 0).size == 3) // and none of those items can be empty ***
.filter(line => line(2).toUpperCase != "AB" ) // also those that have AB
.filter(line => line(2).toUpperCase != "XYZ" ) // also those that have XYZ
.map(line => (line(0).toLong, (line(1).toUpperCase, line(2).toUpperCase))) // the final format
*** this:
.filter(line => line.filter(_.length > 0).size == 3)
can also be written like this:
.map(_.filter(_.length > 0)) // each item must not be empty
.filter(_.size == 3) // so the lines with less than 3 items will be removed; ***
Also, as an observation, you don't have to call toString on something that is already a String.
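For example (the row value here is just an illustration):

val row = "1;cool;stuff".split("[;,]")
row(2).toUpperCase          // "STUFF" -- row(2) is already a String
row(2).toString.toUpperCase // same result, the toString is redundant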
I have used the cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation the joined object would look like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
for(tupleB <- listB) {
val localResults = ListBuffer[Double]()
val itemToTest = Set(tupleB._1)
val tempList = ListBuffer[(Int, Double)]()
for(tupleA <- listA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
val i = (tupleA._1, tValue)
tempList += i
}
val sortList = tempList.sortWith(_._2 > _._2).slice(0,20).map(i => i._1)
val intersect = sortList.toSet.intersect(itemToTest)
if (intersect.size > 0)
localResults += 1.0
else localResults += 0.0
val normalized = sum(localResults.toList)/localResults.size
globalBuffer += normalized
}
})
//method sum
def sum(xs: List[Double]): Double = {//do the sum}
At the end of this I was expecting joined to be a list of double values, but when I looked at it, it was Unit. Also, I feel this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?
Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
val listA = x._2._1.toList
val listB = x._2._2.toList
val localResults = listB.map {
case (intBValue, longBValue) =>
val itemToTest = intBValue // it's always one element
val tempList = listA.map {
case (intAValue, longAValue) =>
(intAValue, someFunctionReturnDouble(longBValue, longAValue))
}
val sortList = tempList.sortBy(-_._2).slice(0, 20).map(i => i._1) // sortWith needs a two-argument predicate, so sortBy is used for the descending order
if (sortList.toSet.contains(itemToTest)) { 1.0 } else {0.0}
// no real need to convert to a set for 20 elements, by the way
}
sum(localResults)/localResults.size
})
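With this shape, joined becomes an RDD[Double] holding one proportion per cogrouped key, so the values can be brought back to the driver with a plain collect instead of a shared buffer:

val proportions: Array[Double] = joined.collect()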
Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD.) Here's what I do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
val iterA = x._2._1
val iterB = x._2._2
var count, positiveCount = 0
val tempList = ListBuffer[(Int, Double)]()
for (tupleB <- iterB) {
tempList.clear
for(tupleA <- iterA) {
val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
tempList += ((tupleA._1, tValue))
}
val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
count += 1
}
positiveCount.toDouble/count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
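To see the closure-copy behaviour in isolation, here is a minimal sketch (assuming a SparkContext named sc):

import scala.collection.mutable.ListBuffer

val buf = ListBuffer[Double]()
sc.parallelize(1 to 5).foreach(x => buf += x.toDouble)
println(buf.isEmpty) // true -- each task only mutates its own deserialized copy of buf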