How to implement Levenshtein Distance Algorithm in Scala - scala

I've a text file which contains the information about the sender and messages and the format is sender,messages. I want to use Levenshtein Distance Algorithm with threshold of 70% and want to store the similar messages to the Map. In the Map, My key is String and value is List[String]
For example my requirement is: If my messages are abc, bcd, cdf.
step1: First I should add the message 'abc' to the List. map.put("Group1",abc.toList)
step2: Next, I should compare the 'bcd'(2nd message) with 'abc'(1st message). If they meets the threshold of 70% then I should add the 'bcd' to List. Now, 'abc' and 'bcd' are added under the same key called 'Group1'.
step3: Now, I should get all the elements from Map. Currently G1 only with 2 values(abc,bcd), next compare the current message 'cdf' with 'abc' or 'bcd' (As 'abc' and 'bcd' is similar comparing with any one of them would be enough)
step4: If did not meet the threshold, I should create a new key "Group2" and add that message to the List and so on.
The 70% threshold means, For example:
message1: Dear customer! your mobile number 9032412236 has been successfully recharged with INR 500.00
message2: Dear customer! your mobile number 7999610201 has been successfully recharged with INR 500.00
Here, the Levenshtein Distance between these two is 8. We can check this here: https://planetcalc.com/1721/
8 edits needs to be done, 8 characters did not match out of (message1.length+message2.length)/2
If I assume the first message is of 100 characters and second message is of 100 characters then the average length is 100, out of 100, 8 characters did not match which means the accuracy level of this is 92%, so here, I should keep threshold 70%.
If Levenshtein distance matching at least 70%, then take them as similar.
I'm using the below library:
libraryDependencies += "info.debatty" % "java-string-similarity" % "2.0.0"
My code:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object Demo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)
val inputFile = "D:\\MyData.txt"
val data = sc.textFile(inputFile)
val data2 = data.map(line => {
val arr = line.split(","); (arr(0), arr(1))
})
val grpData = data2.groupByKey()
val myMap = scala.collection.mutable.Map.empty[String, List[String]]
for (values <- grpData.values.collect) {
val list = ListBuffer[String]()
for (value <- values) {
println(values)
if (myMap.isEmpty) {
list += value
myMap.put("G1", list.toList)
} else {
val currentMsg = value
val valuePartOnly = myMap.valuesIterator.toString()
for (messages <- valuePartOnly) {
def levenshteinDistance(currentMsg: String, messages: String) = {
???//TODO: Implement distance
}
}
}
}
}
}
}
After the else part, I'm not sure how do I start with this algorithm.
I do not have any output sample. So, I've explained it step by step.
Please check from step1 to step4.
Thanks.

I'm not really certain about next code I did not tried it, but I hope it demonstrates the idea:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer
object Demo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("My App")
val distance: Levenshtein = new Levenshtein(); //Create object for calculation distance
def levenshteinDistance(left: String, right: String): Double = {
// I'm not really certain about this, how you would like to calculate relative distance?
// Relatevly to string with max size, min size, left or right?
l.distance(left, right) / Math.max(left.size, right.size)
}
val sc = new SparkContext(conf)
val inputFile = "D:\\MyData.txt"
val data = sc.textFile(inputFile)
val data2 = data.map(line => {
val arr = line.split(","); (arr(0), arr(1))
})
val grpData = data2.groupByKey()
val messages = scala.collection.mutable.Map.empty[String, List[String]]
var group = 1
for (values <- grpData.values.collect) {
val list = ListBuffer[String]()
for (value <- values) {
println(values)
if (messages.isEmpty) {
list += value
messages.put("G$group", list.toList)
} else {
val currentMsg = value
val group = messages.values.find {
case(key, messages) => messages.forall(message => levenshteinDistance(currentMsg, message) <= 0.7)
}._1.getOrElse {
group += 1
"G$group"
}
val groupMessages = messages.getOrEse(group, ListBuffer.empty[String])
groupMessages.append(currentMsg)
messages.put(currentMsg, groupMessages)
}
}
}
}
}

Related

Checking if elements of a tweets array contain one of the elements of positive words array and count

We are building sentiment analysis application and we converted our tweets dataframe to an array. We created another array consisting of positive words. But we cannot count the number of tweets containing one of those positive words. We tried these and we get 1 as result. It must be more than 1. Apparently it did not count:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
var tweetDF = sqlContext.read.json("hdfs:///sandbox/tutorial-files/770/tweets_staging/*")
tweetDF.show()
var messages = tweetDF.select("msg").collect.map(_.toSeq)
println("Total messages: " + messages.size)
val positive = Source.fromFile("/home/teslavm/positive.txt").getLines.toArray
var happyCount=0
for (e <- 0 until messages.size) {
for (f <- 0 until positive.size) {
if (messages(e).contains(positive(f))){
happyCount=happyCount+1
}
}
}
print("\nNumber of happy messages: " +happyCount)
This should work.
It has the advantage that you do not have to collect the result, as well as being more functional.
val messages = tweetDF.select("msg").as[String]
val positiveWords =
Source
.fromFile("/home/teslavm/positive.txt")
.getLines
.toList
.map(word => word.toLowerCase)
def hasPositiveWords(message: String): Boolean = {
val _message = message.toLowerCase
positiveWords.exists(word => _message.contains(word))
}
val positiveMessages = messages.filter(hasPositiveWords _)
println(positiveMessages.count())
I tested this code locally with:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val tweetDF = List(
(1, "Yes I am happy"),
(2, "Sadness is a way of life"),
(3, "No, no, no, no, yes")
).toDF("id", "msg")
val positiveWords = List("yes", "happy")
And it worked.

not able to store result in hdfs when code runs for second iteration

Well I am new to spark and scala and have been trying to implement cleaning of data in spark. below code checks for the missing value for one column and stores it in outputrdd and runs loops for calculating missing value. code works well when there is only one missing value in file. Since hdfs does not allow writing again on the same location it fails if there are more than one missing value. can you please assist in writing finalrdd to particular location once calculating missing values for all occurrences is done.
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}``
It is not clear how (Coddition check) works in your example.
In any case function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
.map(row => row.mkString)
.collect() // perhaps '.collect()' is redundant
val finalrdd = strings
.filter(str => Coddition check str) //don't know how this Coddition works
.map (x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

Looping through Map Spark Scala

Within this code we have two files: athletes.csv that contains names, and twitter.test that contains the tweet message. We want to find name for every single line in the twitter.test that match the name in athletes.csv We applied map function to store the name from athletes.csv and want to iterate all of the name to all of the line in the test file.
object twitterAthlete {
def loadAthleteNames() : Map[String, String] = {
// Handle character encoding issues:
implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)
// Create a Map of Ints to Strings, and populate it from u.item.
var athleteInfo:Map[String, String] = Map()
//var movieNames:Map[Int, String] = Map()
val lines = Source.fromFile("../athletes.csv").getLines()
for (line <- lines) {
var fields = line.split(',')
if (fields.length > 1) {
athleteInfo += (fields(1) -> fields(7))
}
}
return athleteInfo
}
def parseLine(line:String): (String)= {
var athleteInfo = loadAthleteNames()
var hello = new String
for((k,v) <- athleteInfo){
if(line.toString().contains(k)){
hello = k
}
}
return (hello)
}
def main(args: Array[String]){
Logger.getLogger("org").setLevel(Level.ERROR)
val sc = new SparkContext("local[*]", "twitterAthlete")
val lines = sc.textFile("../twitter.test")
var athleteInfo = loadAthleteNames()
val splitting = lines.map(x => x.split(";")).map(x => if(x.length == 4 && x(2).length <= 140)x(2))
var hello = new String()
val container = splitting.map(x => for((key,value) <- athleteInfo)if(x.toString().contains(key)){key}).cache
container.collect().foreach(println)
// val mapping = container.map(x => (x,1)).reduceByKey(_+_)
//mapping.collect().foreach(println)
}
}
the first file look like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
the second file look likes:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
some thinks like this ...
but i got empty result after running this code, any suggestions is much appreciated
result i got is empty :
()....
()...
()...
but the result that i expected something like:
(name,1)
(other name,1)
You need to use yield to return value to your map
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending how big/small the data with athlete names is you'll end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()

Implementing topological sort in Spark GraphX

I am trying to implement topological sort using Spark's GraphX library.
This is the code I've written so far:
MyObject.scala
import java.util.ArrayList
import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Edge
import org.apache.spark.graphx.EdgeDirection
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.Graph.graphToGraphOps
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
object MyObject {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Spark-App").setMaster("local[2]")
val sc = new SparkContext(conf)
val resources: RDD[Resource] = makeResources(sc)
val relations: RDD[Relation] = makeRelations(sc)
println("Building graph ...")
var graph = buildGraph(resources, relations, sc)
println("Graph built!!")
println("Testing topo sort ...")
val topoSortResult = topoSort(graph, sc);
println("topoSortResult = " + topoSortResult)
println("Testing topo sort done!")
}
def buildGraph(resources: RDD[Resource], relations: RDD[Relation], sc: SparkContext): Graph[Resource, Relation] =
{
val vertices: RDD[(Long, Resource)] = resources.map(resource => (resource.id, resource))
val edges: RDD[Edge[Relation]] = relations.map(relation => Edge(relation.srcId, relation.dstId, relation))
var graph = Graph[Resource, Relation](vertices, edges)
graph
}
def makeResources(sc: SparkContext): RDD[Resource] =
{
var list: List[Resource] = List()
list = list :+ new Resource(1L)
list = list :+ new Resource(2L)
list = list :+ new Resource(3L)
list = list :+ new Resource(4L)
list = list :+ new Resource(5L)
sc.parallelize(list)
}
def makeRelations(sc: SparkContext): RDD[Relation] =
{
var list: List[Relation] = List()
list = list :+ new Relation(1L, "depends_on", 2L)
list = list :+ new Relation(3L, "depends_on", 2L)
list = list :+ new Relation(4L, "depends_on", 2L)
list = list :+ new Relation(5L, "depends_on", 2L)
sc.parallelize(list)
}
def topoSort(graph: Graph[Resource, Relation], sc: SparkContext): java.util.List[(VertexId, Resource)] =
{
// Will contain the result
val sortedResources: java.util.List[(VertexId, Resource)] = new ArrayList()
// Contains all the vertices
val vertices = graph.vertices
// Contains all the vertices whose in-degree > 0
val inDegrees = graph.inDegrees;
val inDegreesKeys_array = inDegrees.keys.collect();
// Contains all the vertices whose in-degree == 0
val inDegreeZeroList = vertices.filter(vertex => !inDegreesKeys_array.contains(vertex._1))
// A map of vertexID vs its in-degree
val inDegreeMapRDD = inDegreeZeroList.map(vertex => (vertex._1, 0)).union(inDegrees);
// Insert all the resources whose in-degree == 0 into a queue
val queue = new Queue[(VertexId, Resource)]
for (vertex <- inDegreeZeroList.toLocalIterator) { queue.enqueue(vertex) }
// Get an RDD containing the outgoing edges of every vertex
val neighbours = graph.collectNeighbors(EdgeDirection.Out)
// Initiate the algorithm
while (!queue.isEmpty) {
val vertex_top = queue.dequeue()
// Add the topmost element of the queue to the result
sortedResources.add(vertex_top)
// Get the neigbours (from outgoing edges) of this vertex
// This will be an RDD containing just 1 element which will be an array of neighbour vertices
val vertex_neighbours = neighbours.filter(vertex => vertex._1.equals(vertex_top._1))
// For each vertex, decrease its in-degree by 1
vertex_neighbours.foreach(arr => {
val neighbour_array = arr._2
neighbour_array.foreach(vertex => {
val oldInDegree = inDegreeMapRDD.filter(vertex_iter => (vertex_iter._1 == vertex._1)).first()._2
val newInDegree = oldInDegree - 1
// Reflect the new in-degree in the in-degree map RDD
inDegreeMapRDD.map(vertex_iter => {
if (vertex_iter._1 == vertex._1) {
(vertex._1, newInDegree)
}
else{
vertex_iter
}
});
// Add this vertex to the result if its in-degree has become zero
if (newInDegree == 0) {
queue.enqueue(vertex)
}
})
})
}
return sortedResources
}
}
Resource.scala
class Resource(val id: Long) extends Serializable {
override def toString(): String = {
"id = " + id
}
}
Relation.scala
class Relation(val srcId: Long, val name: String, val dstId: Long) extends Serializable {
override def toString(): String = {
srcId + " " + name + " " + dstId
}
}
I am getting the error :
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
for the line val oldInDegree = inDegreeMapRDD.filter(vertex_iter => (vertex_iter._1 == vertex._1)).first()._2.
I guess this is because it is illegal to modify an RDD inside the for-each loop of some other RDD.
Also, I fear that queue.enqueue(vertex) will not work, since it is not possible to modify a local collection inside a for-each loop.
How do I correctly implement this topological sort algorithm ?
The full stack trace of the exception is uploaded here (Had to upload it externally to prevent exceeding the body size limit of StackOverflow).
vertex_neighbours.foreach(arr => {
val neighbour_array = arr._2
neighbour_array.foreach(vertex => {
. . .
The outer foreach could be replaced by a for loop.
val vertex_neighbours = neighbours.filter(vertex => vertex._1.equals(vertex_top._1)).collect()
You need to get the RDD before doing for loop over it.

Task not serializable Flink

I am trying to do the pagerank Basic example in flink with little bit of modification(only in reading the input file, everything else is the same) i am getting the error as Task not serializable and below is the part of the output error
atorg.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code
object hpdb {
def main(args: Array[String]) {
val env = ExecutionEnvironment.getExecutionEnvironment
val maxIterations = 10000
val DAMPENING_FACTOR: Double = 0.85
val EPSILON: Double = 0.0001
val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"
val links = env.readCsvFile[Tuple2[Long,Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
fieldDelimiter = "\t", includedFields = Array(1,4)).as('sourceId,'targetId).toDataSet[Link]//source and target
val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id]//Pageid
val noOfPages = pages.count()
val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))
val adjacencyLists = links
// initialize lists ._1 is the source id and ._2 is the traget id
.map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
// concatenate lists
.groupBy("sourceId").reduce {
(l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
}
// start iteration
val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
// **//the output shows error here**
currentRanks =>
val newRanks = currentRanks
// distribute ranks to target pages
.join(adjacencyLists).where("pageId").equalTo("sourceId") {
(page, adjacent, out: Collector[Page]) =>
for (targetId <- adjacent.targetIds) {
out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
}
}
// collect ranks and sum them up
.groupBy("pageId").aggregate(SUM, "rank")
// apply dampening factor
//**//the output shows error here**
.map { p =>
Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
}
// terminate if no rank update was significant
val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
(current, next, out: Collector[Int]) =>
// check for significant update
if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
}
(newRanks, termination)
}
val result = finalRanks
// emit result
result.writeAsCsv(outpath, "\n", " ")
env.execute()
}
}
Any help in the right direction is highly appreciated? Thank you.
The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to assign the val pagesCount = pages.count value to a variable pagesCount and refer to this variable in your MapFunction.
What pages.count actually does, is to trigger the execution of the data flow graph, so that the number of elements in pages can be counted. The result is then returned to your program.