How to remove last line from RDD Spark Scala

I want to remove the last line from an RDD using the .mapPartitionsWithIndex function.
I have tried the code below:
val withoutFooter = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == noOfTotalPartitions) {
    iter.drop(size - 1)
  }
  else iter
}
But I am not able to get the correct result.

drop drops the first n elements of an iterator and returns the remaining elements, so iter.drop(size - 1) keeps only the last element instead of removing it. Note also that if noOfTotalPartitions is the partition count, partition indexes run from 0 to getNumPartitions - 1, so idx == noOfTotalPartitions never matches the last partition.
Read more here https://stackoverflow.com/a/51792161/6556191
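For example, a quick check in the Scala REPL shows the behavior of drop:
scala> Iterator(1, 2, 3, 4, 5).drop(2).toList
res0: List[Int] = List(3, 4, 5)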
The code below works for me:
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 4)
val lastPartitionIndex = rdd.getNumPartitions - 1
rdd.mapPartitionsWithIndex { (idx, iter) =>
  var reti = iter
  if (idx == lastPartitionIndex) {
    // materialize the last partition and drop its final element
    val lastPart = iter.toArray
    reti = lastPart.slice(0, lastPart.length - 1).toIterator
  }
  reti
}
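An alternative that does not depend on which partition holds the final element is to index every element and drop the one with the highest index. This is not from the answer above, just a minimal sketch using standard RDD operations:
val n = rdd.count() // total number of elements
val withoutFooter = rdd
  .zipWithIndex() // pair each element with its global index
  .filter { case (_, idx) => idx < n - 1 } // keep everything except the last index
  .map { case (value, _) => value }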

Related

Do I need to persist a continuously updated RDD?

I'm working with a Spark program which needs to continuously update some RDDs in a loop:
var totalRandomPath: RDD[String] = null
for (iter <- 0 until config.numWalks) {
  var randomPath: RDD[String] = examples.map { case (nodeId, clickNode) =>
    clickNode.path.mkString("\t")
  }
  for (walkCount <- 0 until config.walkLength) {
    randomPath = edge2attr.join(randomPath.mapPartitions { iter =>
      iter.map { pathBuffer =>
        val paths: Array[String] = pathBuffer.split("\t")
        (paths.slice(paths.size - 2, paths.size).mkString(""), pathBuffer)
      }
    }).mapPartitions { iter =>
      iter.map { case (edge, (attr, pathBuffer)) =>
        try {
          if (pathBuffer != null && pathBuffer.nonEmpty && attr.dstNeighbors != null && attr.dstNeighbors.nonEmpty) {
            val nextNodeIndex: PartitionID = GraphOps.drawAlias(attr.J, attr.q)
            val nextNodeId: VertexId = attr.dstNeighbors(nextNodeIndex)
            s"$pathBuffer\t$nextNodeId"
          } else {
            pathBuffer //add
          }
        } catch {
          case e: Exception => throw new RuntimeException(e.getMessage)
        }
      }.filter(_ != null)
    }
  }
  if (totalRandomPath != null) {
    totalRandomPath = totalRandomPath.union(randomPath)
  } else {
    totalRandomPath = randomPath
  }
}
In this program, the RDDs totalRandomPath and randomPath are constantly updated with many transformation operations: join and mapPartitions. The program ends with a collect action.
So do I need to persist those continuously updated RDDs (totalRandomPath, randomPath) to speed up my Spark program?
I also notice that this program runs fast on a single-node machine but slows down when run on a three-node cluster. Why does this happen?
Yes, you need to persist the updated RDD and also unpersist the older RDD:
var totalRandomPath: RDD[String] = spark.sparkContext.parallelize(List.empty[String]).cache()
for (iter <- 0 until config.numWalks) {
  // existing logic
  val tempRDD = totalRandomPath.union(randomPath).cache()
  tempRDD foreach { _ => }    // this will trigger cache operation for tempRDD immediately
  totalRandomPath.unpersist() // unpersist old RDD which is no longer needed
  totalRandomPath = tempRDD   // point totalRandomPath to updated RDD
}
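The same pattern can also be applied to randomPath, which the question reassigns on every inner iteration. Below is a minimal sketch of my own, assuming an existing SparkContext sc and with the actual walk transformation replaced by a hypothetical extendWalk placeholder:
import org.apache.spark.rdd.RDD

// hypothetical stand-in for the question's join + mapPartitions step, just so the sketch compiles
def extendWalk(path: RDD[String]): RDD[String] = path.map(_ + "\tnextNode")

var randomPath: RDD[String] = sc.parallelize(Seq("n1\tn2")).cache()
for (walkCount <- 0 until 10) {
  val updated = extendWalk(randomPath).cache()
  updated.foreach { _ => } // materialize the new RDD before releasing the old one
  randomPath.unpersist()   // drop the previous iteration's cached blocks
  randomPath = updated
}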

Multiple filter condition in Spark Filter method

How do I write multiple cases in the filter() method in Spark using Scala? I have an RDD from a cogroup, for example:
(1,(CompactBuffer(1,john,23),CompactBuffer(1,john,24))).filter(x => (x._2._1 != x._2._2)) // values not equal
(2,(CompactBuffer(),CompactBuffer(2,Arun,24))).filter(x => (x._2._1 == null)) // second tuple's first value is null
(3,(CompactBuffer(3,kumar,25),CompactBuffer())).filter(x => (x._2._2 == null)) // second tuple's second value is null
val a = source_primary_key.cogroup(destination_primary_key).filter(x => (x._2._1 != x._2._2))
val c = a.map { y =>
  val key = y._1
  val value = y._2
  srcs = value._1.mkString(",")
  destt = value._2.mkString(",")
  if (srcs.equalsIgnoreCase(destt) == false) {
    srcmis :+= srcs
    destmis :+= destt
  }
  if (srcs == "") {
    extraindest :+= destt.mkString("")
  }
  if (destt == "") {
    extrainsrc :+= srcs.mkString("")
  }
}
How can I store the result of each condition in 3 different Array[String]s?
I tried it as above, but it seems naive. Is there any way we can do it more efficiently?
For testing purposes, I created the following RDDs:
val source_primary_key = sc.parallelize(Seq((1,(1,"john",23)),(3,(3,"kumar",25))))
val destination_primary_key = sc.parallelize(Seq((1,(1,"john",24)),(2,(2,"arun",24))))
Then I cogrouped them as you did:
val coGrouped = source_primary_key.cogroup(destination_primary_key)
Now filter the cogrouped RDD into three separate RDDs:
val a = coGrouped.filter(x => !x._2._1.isEmpty && !x._2._2.isEmpty)
val b = coGrouped.filter(x => x._2._1.isEmpty && !x._2._2.isEmpty)
val c = coGrouped.filter(x => !x._2._1.isEmpty && x._2._2.isEmpty)
I hope the answer is helpful
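If you also need the three results as plain Array[String]s on the driver (as asked above), one possible sketch is to render each group with mkString and collect it; the variable names and the exact formatting here are my own assumptions:
// rows present on both sides, rendered side by side for comparison
val presentInBoth: Array[String] = a.map { case (k, (src, dest)) => k + ": " + src.mkString(",") + " | " + dest.mkString(",") }.collect()
// rows that exist only in the destination (source side is empty)
val extraInDest: Array[String] = b.map { case (_, (_, dest)) => dest.mkString(",") }.collect()
// rows that exist only in the source (destination side is empty)
val extraInSrc: Array[String] = c.map { case (_, (src, _)) => src.mkString(",") }.collect()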
You can use collect on your RDD and then toList.
Example:
(1,(CompactBuffer(1,john,23),CompactBuffer(1,john,24))).filter(x => (x._2._1 != x._2._2)).collect().toList

Alternate way to proceed without list in Scala

I have Scala code like this:
def avgCalc(buffer: Iterable[Array[String]], list: Array[String]) = {
  val currentTimeStamp = list(1).toLong // loads the timestamp column
  var sum = 0.0
  var count = 0
  var check = false
  import scala.util.control.Breaks._
  breakable {
    for (array <- buffer) {
      val toCheckTimeStamp = array(1).toLong // timestamp column
      if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
        sum += array(5).toDouble // RSSI weightage values
        count += 1
      }
      if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
        check = true
        break
      }
    }
  }
  list :+ sum
}
I call the above function like this:
import spark.implicits._
val averageDF =
  filterop.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
    .sortBy(array => array(1), false) // sort by timestamp
    .groupBy(array => (array(0), array(2))) // group by tag and listner
    .mapValues(buffer => {
      buffer.map(list => {
        avgCalc(buffer, list) // calling the average function
      })
    })
    .flatMap(x => x._2)
    .map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble)) // defining the schema through the case class
    .toDF // converting to a data frame
The above code is working fine, but I need to get rid of the list. My senior asked me to remove the list, because the list reduces the execution speed. Any suggestions on how to proceed without the list?
Any help will be appreciated.
The following solution should work, I guess; I have tried to avoid passing both the iterable and a single array.
def avgCalc(buffer: Iterable[Array[String]]) = {
  var finalArray = Array.empty[Array[String]]
  import scala.util.control.Breaks._
  breakable {
    for (outerArray <- buffer) {
      val currentTimeStamp = outerArray(1).toLong
      var sum = 0.0
      var count = 0
      var check = false
      var list = outerArray
      for (array <- buffer) {
        val toCheckTimeStamp = array(1).toLong
        if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) {
          sum += array(5).toDouble
          count += 1
        }
        if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
          check = true
          break
        }
      }
      if (sum != 0.0 && check) list = list :+ (sum / count).toString
      else list = list :+ list(5).toDouble.toString
      finalArray ++= Array(list)
    }
  }
  finalArray
}
and you can call it like this:
import sqlContext.implicits._
val averageDF =
  filter_op.rdd.map(_.mkString(",")).map(line => line.split(",").map(_.trim))
    .sortBy(array => array(1), false)
    .groupBy(array => (array(0), array(2)))
    .mapValues(buffer => {
      avgCalc(buffer)
    })
    .flatMap(x => x._2)
    .map(x => findingavg(x(0).toString, x(1).toString.toLong, x(2).toString, x(3).toString, x(4).toString, x(5).toString.toDouble, x(6).toString.toDouble))
    .toDF
I hope this is the desired answer
I can see that you have accepted an answer, but I have to say that you have a lot of unnecessary code. As far as I can see, you have no reason to do the initial conversion to Array type in the first place and the sortBy is also unnecessary at this point. I would suggest you work directly on the Row.
Also, you have a number of unused variables that should be removed, and converting to a case class only to follow it with a toDF seems excessive, IMHO.
I would do something like this:
import org.apache.spark.sql.Row

def avgCalc(sortedList: List[Row]) = {
  sortedList.indices.map(i => {
    var sum = 0.0
    val row = sortedList(i)
    val currentTimeStamp = row.getString(1).toLong // loads the timestamp column
    import scala.util.control.Breaks._
    breakable {
      for (j <- 0 until sortedList.length) {
        if (j != i) {
          val anotherRow = sortedList(j)
          val toCheckTimeStamp = anotherRow.getString(1).toLong // timestamp column
          if (((currentTimeStamp - 10L) <= toCheckTimeStamp) && (currentTimeStamp >= toCheckTimeStamp)) { // to check the timestamp for 10 seconds difference
            sum += anotherRow.getString(5).toDouble // RSSI weightage values
          }
          if ((currentTimeStamp - 10L) > toCheckTimeStamp) {
            break
          }
        }
      }
    }
    (row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4), row.getString(5), sum.toString)
  })
}
val averageDF = filterop.rdd
  .groupBy(row => (row(0), row(2)))
  .flatMap { case (_, buffer) => avgCalc(buffer.toList.sortBy(_.getString(1).toLong)) }
  .toDF("Tag", "Timestamp", "Listner", "X", "Y", "RSSI", "AvgCalc")
And as a final comment, I'm pretty sure it's possible to come up with a nicer/cleaner implementation of the avgCalc function, but I'll leave it to you to play around with that :)
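For example, here is a minimal sketch of my own (not the answer's code) that expresses the 10-second window as a plain filter; it drops the early break and simply scans the whole group, summing the RSSI values of the other rows in the window:
import org.apache.spark.sql.Row

def avgCalcClean(sortedList: List[Row]) =
  sortedList.map { row =>
    val current = row.getString(1).toLong
    // sum the RSSI values (column 5) of the other rows whose timestamp
    // lies in the 10-second window [current - 10, current]
    val sum = sortedList
      .filter { other =>
        val ts = other.getString(1).toLong
        (other ne row) && ts >= current - 10L && ts <= current
      }
      .map(_.getString(5).toDouble)
      .sum
    (row.getString(0), row.getString(1), row.getString(2),
     row.getString(3), row.getString(4), row.getString(5), sum.toString)
  }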

How to filter FlatMap based on null value

I am generating a graph using the following code.
NOTE: row(1) = name, row(2) = type
val Nodes: RDD[(VertexId, (String, String))] = sc.textFile(nodesFile).flatMap {
  line =>
    if (!line.isEmpty && line(0) != '#') {
      val row = line.split(";,;,;")
      if (row.length == 3) {
        if (row(0).length > 0 && row(1).length > 0 && row(2).length > 0 && row(0).forall(_.isDigit) && row(2).toString.toUpperCase != "AB" && row(2).toString.toUpperCase != "XYZ") {
          List((row(0).toLong, (row(1).toString.toUpperCase, row(2).toString.toUpperCase)))
        } else { None }
      } else { None }
    } else {
      None
    }
}
So it's generating output like this:
(11,(SAMSUNG_PHONE,Item))
(0,null)
(1,(Flying,PC))
(6,null)
This means vertices 0 and 6 have the value 'AB' or 'XYZ'; that's why it's inserting null. But I want to filter out and remove these null-valued nodes. I tried but didn't get it to work.
Please give me a hint or a reference.
A Solution
Assuming an input file with content
0;foo;AB
1;cool,stuff
2;other;things
6;foo;XYZ
3;a;b
your code is nearly working.
After adapting the split pattern (see below) and polishing the return value (List() instead of None) the code works:
configuredUnitTest("Test SO") { sc =>
import org.apache.spark.rdd.RDD
val sqlContext = new SQLContext(sc)
val nodesFile = "/home/martin/input.txt"
val nodes: RDD[(Long, (String, String))] = sc.textFile(nodesFile).flatMap {
line =>
if (!line.isEmpty && line(0) != '#') {
val row = line.split("[,;]")
if (row.length == 3) {
if (row(0).length > 0 && row(1).length > 0 && row(2).length > 0 && row(0).forall(_.isDigit) && row(2).toString.toUpperCase != "AB" && row(2).toString.toUpperCase != "XYZ") {
List((row(0).toLong, (row(1).toString.toUpperCase, row(2).toString.toUpperCase)))
} else {
List()
}
} else {
List()
}
} else {
List()
}
}
println( nodes.count() )
val result = nodes.collect()
println( result.size )
println( result.mkString("\n") )
}
Result is
3
3
(1,(COOL,STUFF))
(2,(OTHER,THINGS))
(3,(A,B))
Code deficiencies
Return type of the function: String => List[(Long, (String, String))]
Your code for rejecting non-matching lines is verbose and hard to read. Why don't you return a List() instead of None? An example that proves the point:
scala> val a = List(1,2,3)
a: List[Int] = List(1, 2, 3)
scala> a.flatMap( x => { if (x==1) List() else List(x)} )
res0: List[Int] = List(2, 3)
Having said that, with this approach you don't need to use flatMap and then filter for null values.
Split pattern
Your regex split pattern is wrong.
Your pattern ";,;,;" says: Split when encountering the sequence ";,;,;", thus "a;,;,;b" is split into a and b. This is most probably not what you want. Instead, you want to split at either ";" or ",", so the rexex saying ";" or "," is "[;,]".
scala> val x ="a;b,c"
x: String = a;b,c
scala> x.split(";").mkString("|")
res2: String = a|b,c
scala> x.split(";,").mkString("|")
res3: String = a;b,c
scala> x.split("[;,]").mkString("|")
res4: String = a|b|c
A much better approach
With filter and some helper functions, your code can be rewritten as
configuredUnitTest("Test SO") { sc =>
import org.apache.spark.rdd.RDD
val sqlContext = new SQLContext(sc)
val nodesFile = "/home/martin/input.txt"
def lengthOfAtLeastOne(x: String) : Boolean = x.length > 0
def toUpperCase(x: String) = x.toString.toUpperCase
val nodes = sc.textFile(nodesFile)
.map( _.split("[;,]") )
.filter( _.size == 3 )
.filter( xs => ( lengthOfAtLeastOne(xs(0)) && lengthOfAtLeastOne(xs(1)) && lengthOfAtLeastOne(xs(2)) ) )
.filter( xs => (toUpperCase(xs(2)) != "AB") && (toUpperCase(xs(2)) != "XYZ"))
.map( xs => (xs(0).toLong, ( toUpperCase(xs(1)), toUpperCase(xs(2))) ))
println( nodes.count() )
val result = nodes.collect()
println( result.size )
println( result.mkString("\n") )
println("END OF")
}
Much better to read, eh?
I've tested your code and it seems to work just fine.
I've used the input from Martin Senne, and also used his regex.
Because I don't know exactly what textFile is, I just read some lines from a file with scala.io.Source:
import scala.io.Source

val lines = Source.fromFile("so/resources/f").getLines.toList
val x = lines.flatMap {
  line =>
    if (!line.isEmpty && line(0) != '#') {
      val row = line.split("[;,]")
      if (row.length == 3) {
        if (row(0).length > 0 && row(1).length > 0 && row(2).length > 0 && row(0).forall(_.isDigit) && row(2).toString.toUpperCase != "AB" && row(2).toString.toUpperCase != "XYZ") {
          List((row(0).toLong, (row(1).toString.toUpperCase, row(2).toString.toUpperCase)))
        } else {
          None
        }
      } else {
        None
      }
    } else {
      None
    }
}
println(x.size)
x foreach println
So for this input:
0;foo;AB
1;cool,stuff
2;other;things
6;foo;XYZ
3;a;b
I got this output:
3
(1,(COOL,STUFF))
(2,(OTHER,THINGS))
(3,(A,B))
Of course, many modifications can be made to your code:
.filterNot(_.startsWith("#")) // lines cannot start with #
.map(_.split("[;,]")) // split the lines
.filter(_.size == 3) // each line must have 3 items in it
.filter(line => line.filter(_.length > 0).size == 3) // and none of those items can be empty ***
.filter(line => line(2).toUpperCase != "AB" ) // also those that have AB
.filter(line => line(2).toUpperCase != "XYZ" ) // also those that have XYZ
.map(line => (line(0).toLong, (line(1).toUpperCase, line(2).toUpperCase))) // the final format
*** this:
.filter(line => line.filter(_.length > 0).size == 3)
can also be written like this:
.map(_.filter(_.length > 0)) // each item must not be empty
.filter(_.size == 3) // so the lines with less than 3 items will be removed; ***
Also, as an observation, you don't have to call toString on something that is already a String.
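Putting those modifications together, a minimal sketch of the full pipeline on the RDD (assuming the same sample file and an existing SparkContext sc) might look like:
val nodes = sc.textFile(nodesFile)
  .filter(line => line.nonEmpty && !line.startsWith("#")) // skip empty and comment lines
  .map(_.split("[;,]")) // split on ";" or ","
  .map(_.filter(_.nonEmpty)) // drop empty items
  .filter(_.size == 3) // keep only complete lines
  .filter(row => row(0).forall(_.isDigit)) // the id must be numeric
  .filter(row => row(2).toUpperCase != "AB" && row(2).toUpperCase != "XYZ")
  .map(row => (row(0).toLong, (row(1).toUpperCase, row(2).toUpperCase)))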

Task not serializable Flink

I am trying to run the PageRank basic example in Flink with a little bit of modification (only in reading the input file; everything else is the same), and I am getting a "Task not serializable" error. Part of the error output is below:
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code:
object hpdb {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val maxIterations = 10000
    val DAMPENING_FACTOR: Double = 0.85
    val EPSILON: Double = 0.0001
    val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"

    val links = env.readCsvFile[Tuple2[Long, Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1, 4)).as('sourceId, 'targetId).toDataSet[Link] // source and target

    val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id] // pageId

    val noOfPages = pages.count()

    val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))

    val adjacencyLists = links
      // initialize lists: ._1 is the source id and ._2 is the target id
      .map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
      // concatenate lists
      .groupBy("sourceId").reduce {
        (l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
      }

    // start iteration
    val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
      // ** the output shows the error here **
      currentRanks =>
        val newRanks = currentRanks
          // distribute ranks to target pages
          .join(adjacencyLists).where("pageId").equalTo("sourceId") {
            (page, adjacent, out: Collector[Page]) =>
              for (targetId <- adjacent.targetIds) {
                out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
              }
          }
          // collect ranks and sum them up
          .groupBy("pageId").aggregate(SUM, "rank")
          // apply dampening factor
          // ** the output shows the error here **
          .map { p =>
            Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
          }

        // terminate if no rank update was significant
        val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
          (current, next, out: Collector[Int]) =>
            // check for significant update
            if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
        }

        (newRanks, termination)
    }

    val result = finalRanks

    // emit result
    result.writeAsCsv(outpath, "\n", " ")

    env.execute()
  }
}
Any help in the right direction is highly appreciated. Thank you.
The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to assign the result of pages.count to a variable, e.g. val pagesCount = pages.count, and refer to this variable in your MapFunction.
What pages.count actually does, is to trigger the execution of the data flow graph, so that the number of elements in pages can be counted. The result is then returned to your program.
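A minimal sketch of that change, based on the code in the question (only the offending map is shown; noOfPages is the Long the question already computes via pages.count() before the iteration):
// noOfPages is a plain Long living on the driver, so closing over it is fine;
// closing over the DataSet `pages` itself is what makes the closure unserializable
.map { p =>
  Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / noOfPages))
}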