Task not serializable Flink - scala

I am trying to do the pagerank Basic example in flink with little bit of modification(only in reading the input file, everything else is the same) i am getting the error as Task not serializable and below is the part of the output error
atorg.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code
object hpdb {
def main(args: Array[String]) {
val env = ExecutionEnvironment.getExecutionEnvironment
val maxIterations = 10000
val DAMPENING_FACTOR: Double = 0.85
val EPSILON: Double = 0.0001
val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"
val links = env.readCsvFile[Tuple2[Long,Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
fieldDelimiter = "\t", includedFields = Array(1,4)).as('sourceId,'targetId).toDataSet[Link]//source and target
val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id]//Pageid
val noOfPages = pages.count()
val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))
val adjacencyLists = links
// initialize lists ._1 is the source id and ._2 is the traget id
.map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
// concatenate lists
.groupBy("sourceId").reduce {
(l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
}
// start iteration
val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
// **//the output shows error here**
currentRanks =>
val newRanks = currentRanks
// distribute ranks to target pages
.join(adjacencyLists).where("pageId").equalTo("sourceId") {
(page, adjacent, out: Collector[Page]) =>
for (targetId <- adjacent.targetIds) {
out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
}
}
// collect ranks and sum them up
.groupBy("pageId").aggregate(SUM, "rank")
// apply dampening factor
//**//the output shows error here**
.map { p =>
Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
}
// terminate if no rank update was significant
val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
(current, next, out: Collector[Int]) =>
// check for significant update
if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
}
(newRanks, termination)
}
val result = finalRanks
// emit result
result.writeAsCsv(outpath, "\n", " ")
env.execute()
}
}
Any help in the right direction is highly appreciated? Thank you.

The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to assign the val pagesCount = pages.count value to a variable pagesCount and refer to this variable in your MapFunction.
What pages.count actually does, is to trigger the execution of the data flow graph, so that the number of elements in pages can be counted. The result is then returned to your program.

Related

How to implement Levenshtein Distance Algorithm in Scala

I've a text file which contains the information about the sender and messages and the format is sender,messages. I want to use Levenshtein Distance Algorithm with threshold of 70% and want to store the similar messages to the Map. In the Map, My key is String and value is List[String]
For example my requirement is: If my messages are abc, bcd, cdf.
step1: First I should add the message 'abc' to the List. map.put("Group1",abc.toList)
step2: Next, I should compare the 'bcd'(2nd message) with 'abc'(1st message). If they meets the threshold of 70% then I should add the 'bcd' to List. Now, 'abc' and 'bcd' are added under the same key called 'Group1'.
step3: Now, I should get all the elements from Map. Currently G1 only with 2 values(abc,bcd), next compare the current message 'cdf' with 'abc' or 'bcd' (As 'abc' and 'bcd' is similar comparing with any one of them would be enough)
step4: If did not meet the threshold, I should create a new key "Group2" and add that message to the List and so on.
The 70% threshold means, For example:
message1: Dear customer! your mobile number 9032412236 has been successfully recharged with INR 500.00
message2: Dear customer! your mobile number 7999610201 has been successfully recharged with INR 500.00
Here, the Levenshtein Distance between these two is 8. We can check this here: https://planetcalc.com/1721/
8 edits needs to be done, 8 characters did not match out of (message1.length+message2.length)/2
If I assume the first message is of 100 characters and second message is of 100 characters then the average length is 100, out of 100, 8 characters did not match which means the accuracy level of this is 92%, so here, I should keep threshold 70%.
If Levenshtein distance matching at least 70%, then take them as similar.
I'm using the below library:
libraryDependencies += "info.debatty" % "java-string-similarity" % "2.0.0"
My code:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object Demo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
val inputFile = "D:\\MyData.txt"
val data = sc.textFile(inputFile)
val data2 = data.map(line => {
val arr = line.split(","); (arr(0), arr(1))
})
val grpData = data2.groupByKey()
val myMap = scala.collection.mutable.Map.empty[String, List[String]]
for (values <- grpData.values.collect) {
val list = ListBuffer[String]()
for (value <- values) {
println(values)
if (myMap.isEmpty) {
list += value
myMap.put("G1", list.toList)
} else {
val currentMsg = value
val valuePartOnly = myMap.valuesIterator.toString()
for (messages <- valuePartOnly) {
def levenshteinDistance(currentMsg: String, messages: String) = {
???//TODO: Implement distance
}
}
}
}
}
}
}
After the else part, I'm not sure how do I start with this algorithm.
I do not have any output sample. So, I've explained it step by step.
Please check from step1 to step4.
Thanks.
I'm not really certain about next code I did not tried it, but I hope it demonstrates the idea:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object Demo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("My App")
val distance: Levenshtein = new Levenshtein(); //Create object for calculation distance
def levenshteinDistance(left: String, right: String): Double = {
// I'm not really certain about this, how you would like to calculate relative distance?
// Relatevly to string with max size, min size, left or right?
l.distance(left, right) / Math.max(left.size, right.size)
}
val sc = new SparkContext(conf)
val inputFile = "D:\\MyData.txt"
val data = sc.textFile(inputFile)
val data2 = data.map(line => {
val arr = line.split(","); (arr(0), arr(1))
})
val grpData = data2.groupByKey()
val messages = scala.collection.mutable.Map.empty[String, List[String]]
var group = 1
for (values <- grpData.values.collect) {
val list = ListBuffer[String]()
for (value <- values) {
println(values)
if (messages.isEmpty) {
list += value
messages.put("G$group", list.toList)
} else {
val currentMsg = value
val group = messages.values.find {
case(key, messages) => messages.forall(message => levenshteinDistance(currentMsg, message) <= 0.7)
}._1.getOrElse {
group += 1
"G$group"
}
val groupMessages = messages.getOrEse(group, ListBuffer.empty[String])
groupMessages.append(currentMsg)
messages.put(currentMsg, groupMessages)
}
}
}
}
}

not able to store result in hdfs when code runs for second iteration

Well I am new to spark and scala and have been trying to implement cleaning of data in spark. below code checks for the missing value for one column and stores it in outputrdd and runs loops for calculating missing value. code works well when there is only one missing value in file. Since hdfs does not allow writing again on the same location it fails if there are more than one missing value. can you please assist in writing finalrdd to particular location once calculating missing values for all occurrences is done.
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}``
It is not clear how (Coddition check) works in your example.
In any case function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
.map(row => row.mkString)
.collect() // perhaps '.collect()' is redundant
val finalrdd = strings
.filter(str => Coddition check str) //don't know how this Coddition works
.map (x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

How to time scala map functions in an aggregative fashion?

I am having a hard to come up with a solution to time individual functions in a Scala map operation. Let's say I have the following line in my code:
val foo = data.map(d => func1(d).func2())
where data is a Seq of N elements. How would I go about timing how long my program has executed func1 in total and func2 in total? Since it is a map operation, these functions will be called N times, so each time record should be added to a cumulative time record.
How can I do this without breaking the Scala syntax?
NB: I want to end up with totalTime_inFunc1 and totalTime_inFunc2.
Let's say, func2() returns YourType. Then, you need to return tuple (YourType, Long, Long) from function inside map, where second tuple element is execution time of func1 and third element is exec time of func2. After that, you can easily get execution time from seq of tuples using sum:
val fooWithTime = {
data.map(d => {
def now = System.currentTimeMillis
val beforeFunc1 = now
val func1Result = func1(d)
val func1Time = now - beforeFunc1
val beforeFun2 = now
val result = func1Result.func2()
(result, func1Time, now - beforeFun2)
}
}
val foo = fooWithTime.map(_._1)
val totalTimeFunc1 = fooWithTime.map(_._2).sum
val totalTimeFunc2 = fooWithTime.map(_._3).sum
Also, you can easily use your preferred method of calculating execution time instead of System.currentTimeMillis().
Look at the Closure. You will declare your functions and make them refer to variable in scope, then pass them to map, and make them increment variable from scope.
Edit
code with closures
object closure {
var time1 = 0L
var time2 = 0L
def time[R](block: => R)(time: Int): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
if (time==1)
time1 += t1-t0
else
time2 += t1-t0
result
}
def fun1(i: Int): Int = {
time{i+1}(1)
}
def fun2(i: Int): Int = {
time{i+2}(2)
}
}
val data = List(1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7)
val foo = data.map(d => closure.fun2(closure.fun1(d)))
closure.time1 // res4: Long = 22976
closure.time2 // res5: Long = 25438
Edit 2
object closure {
var time1 = 0L
var time2 = 0L
def time[R](block: => R)(time: Int): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
if (time==1)
time1 += t1-t0
else
time2 += t1-t0
result
}
val data = List(1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7)
val test = new test;
val foo = data.map(d => {
val fun1 = time{test.fun1(d)}(1)
time{fun1.fun2(d)}(2)
})
}
val s = Seq(1, 2, 3)
val (mappedSeq, totalTime) = s.map(x => {
// call your map methods here and time
// x is the mapped element
val timing = 5.5
// then return the tuple with mapped element and time taken for the map function
(x, timing)
}).foldLeft((Seq.empty[Int], 0d))((accumulator, pair) => (accumulator._1 :+ pair._1, accumulator._2 + pair._2))
println(totalTime)
println(mappedSeq.mkString(", "))

Compilation error not found: value x but I declared it already, what's wrong

result.map { res =>
val totaldocs: Int = res.value
// do something with this number
}
//val totaldocs = 60
val totalpages:Int = (totaldocs/ipp)+1
Compilation error not found: value x but I declared it already, what is wrong with my implementation, sorry I am new to play framework and scala programming language.
I would say this line is the problem:
val totalpages:Int = (totaldocs/ipp)+1
because totaldocs is only defined inside the map scope
maybe you want something like:
private def getTotalPages(query:BSONDocument, ipp:Int) (implicit ec: ExecutionContext) = {
val key = collectionName + ":" + BSONDocument.pretty(query)
Logger.debug("Query key = "+key)
val command = Count(query)
val result: Future[CountResult] = collection.runCommand(command)
result.map { res =>
val totaldocs: Int = res.value
// do something with this number
val totalpages:Int = (totaldocs/ipp)+1
Logger.debug(s"Total docs $totaldocs, Total pages $totalpages, Items per page, $ipp")
totalpages
}
}
but now it will return a Future[Int] and you will have to deal with the future on the caller.
Note: this is just one solution, depending on your code it may not be the most adequate one

spark scala get uncommon map elements

I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome+"/ratings.csv").map { line=>
val fields = line.split(",")
Rating(fields(0).toInt,fields(1).toInt,fields(2).toDouble)
}
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way,
I am trying this but fails:
val test = ratings.filter(!_.equals(train.map(_)))
val test = ratings.subtract(train)
Take a look here. http://markmail.org/message/qi6srcyka6lcxe7o
Here is the code
def split[T : ClassManifest](data: RDD[T], p: Double, seed: Long =
System.currentTimeMillis): (RDD[T], RDD[T]) = {
val rand = new java.util.Random(seed)
val partitionSeeds = data.partitions.map(partition => rand.nextLong)
val temp = data.mapPartitionsWithIndex((index, iter) => {
val partitionRand = new java.util.Random(partitionSeeds(index))
iter.map(x => (x, partitionRand.nextDouble))
})
(temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}
Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for a more efficient execution:
val probabilisticSegment:(RDD[Double,Rating],Double=>Boolean) => RDD[Rating] =
(rdd,prob) => rdd.filter{case (k,v) => prob(k)}.map {case (k,v) => v}
val ranRating = rating.map( x=> (Random.nextDouble(), x)).cache
val train = probabilisticSegment(ranRating, _ < 0.8)
val test = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD sothat the next two operations can be performed from that point on without incurring in the execution of the complete lineage.
(*) Note the use of val to define a function instead of def. vals are serializer-friendly