Set similarity join using Spark - scala

I have two text files where each line is of the form (id, sequence of numbers).
I have a threshold value as well.
File 1 looks like the following, where, in the first line, 0 is the id and the rest is
a sequence of numbers.
0 1 4 5 6
1 2 3 6
2 4 5 6
Similarly, I have file 2 with the following contents.
0 1 4 6
1 2 5 6
2 3 5
I have to find those pairs of lines whose similarity is greater than or equal to the threshold. The similarity is the size of the intersection of the two lines divided by the size of their union. For example, line id 0 of file1 has sequence 1, 4, 5, 6 and line id 0 of file2 has sequence 1, 4, 6. Their intersection size is 3 and their union size is 4, so their similarity is 3/4 = 0.75, which is greater than the threshold.
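In Scala terms, that check on two token sequences boils down to something like the small helper below (my own illustration, not part of the original code):

// Jaccard similarity: |intersection| / |union| of two token sequences
def similarity(s: Seq[String], t: Seq[String]): Double =
  s.intersect(t).distinct.size.toDouble / s.union(t).distinct.size

// similarity(Seq("1", "4", "5", "6"), Seq("1", "4", "6")) == 0.75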
I have written Python code to do this task and am trying to convert it to Scala.
with open("file1.txt") as f1:
content1 = f1.readlines()
content1 = [x.strip() for x in content1]
with open("file2.txt") as f2:
content2 = f2.readlines()
content2 = [x.strip() for x in content2]
threshold = 0.5
final_index_list_with_similarity = []
for i in range(len(content1)):
for j in range(len(content2)):
index_content1 = content1[i][0]
index_content2 = content2[j][0]
s = set(content1[i][1:])
t = set(content2[j][1:])
intersect = s & t
intersect_size = len(intersect) - 1
union_size = len(s) + len(t) - intersect_size - 2 #substracting two because I am getting two extra space.
similarity = intersect_size/union_size
if similarity >= threshold:
final_index_list_with_similarity.append((index_content1, index_content2, similarity))
print(final_index_list_with_similarity)
Output : [('0', '0', 0.75), ('1', '1', 0.5), ('2', '0', 0.5), ('2', '1', 0.5)]
What I have tried so far in Scala looks like this.
val inputFile1 = args(0)
val inputFile2 = args(1)
val threshold = args(2).toDouble
val outputFolder = args(3)

val conf = new SparkConf().setAppName("SetSimJoin").setMaster("local")
val sc = new SparkContext(conf)

val lines1 = sc.textFile(inputFile1).flatMap(line => line.split("\n"))
val lines2 = sc.textFile(inputFile2).flatMap(line => line.split("\n"))

// drop the id, split the remaining sequence into tokens, and count each token's frequency
val elements1 = lines1.map { x => x.drop(x.split(" ")(0).length + 1) }
  .flatMap { x => x.split(" ") }
  .map { x => (x, 1) }
  .reduceByKey(_ + _)
val elements2 = lines2.map { x => x.drop(x.split(" ")(0).length + 1) }
  .flatMap { x => x.split(" ") }
  .map { x => (x, 1) }
  .reduceByKey(_ + _)
This gives me the frequency of every number in the entire file.
Any help or guidance will be much appreciated.

Both files can be joined as RDDs keyed by id, and then the formula "intersection size / union size" applied:
val lines1 = sparkContext.textFile("inputFile1")
val lines2 = sparkContext.textFile("inputFile2")

val rdd1 = lines1.map(_.split(" ")).map(v => (v(0), v.tail))
val rdd2 = lines2.map(_.split(" ")).map(v => (v(0), v.tail))

val result = rdd1.join(rdd2).map(r => (
  r._1,
  r._2._1.intersect(r._2._2).size * 1.0 /
    r._2._1.union(r._2._2).distinct.size
))
result.foreach(println)
Output is:
(1,0.5)
(0,0.75)
(2,0.25)
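Note that the join above only compares lines that share the same id. If, as in the Python version, every pair of lines across the two files should be compared and filtered by the threshold, a cartesian product can be used instead. A minimal sketch, reusing rdd1 and rdd2 from above and the threshold value from the question:

val allPairs = rdd1.cartesian(rdd2).map { case ((id1, seq1), (id2, seq2)) =>
  // Jaccard similarity of the two token sequences
  val sim = seq1.intersect(seq2).size * 1.0 / seq1.union(seq2).distinct.size
  (id1, id2, sim)
}
val aboveThreshold = allPairs.filter(_._3 >= threshold)
aboveThreshold.foreach(println)

Keep in mind that cartesian compares every line with every other line, so it only scales to inputs where that quadratic blow-up is acceptable.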

Related

Spark PageRank Tuning

I am running PageRank using Scala + Spark 2.4 on YARN, but it always fails after running for several hours/jobs.
--driver-memory 60G --driver-cores 4
--num-executors 250 --executor-cores 2 --executor-memory 32g
input data:
weightFile has 1000 .gz files, each 500MB, total 500GB
linkFile has 1000 .gz files, each 500MB, total 500GB
How should I change my code or spark configs?
sc.setCheckpointDir(checkpointFile)

val weightData = sc.textFile(weightFile).repartition(20000)
// (hashed url, url, weight)
val weightUrlData = weightData.map { line =>
  val lines = line.split("\t")
  (hash(lines(0)), lines(0), lines(1).toFloat)
}
weightUrlData.persist(StorageLevel.DISK_ONLY)

var dataWeight = weightUrlData.map { x => (x._1, x._3) }
// keep the largest weight per key
dataWeight = dataWeight.reduceByKey { (a, b) => if (a > b) a else b }
val dataUrl = weightUrlData.map { x => (x._1, x._2) }

val totalZ = dataWeight.count.toFloat
val sum1 = dataWeight.map(x => x._2).sum().toFloat
dataWeight = dataWeight.map { x => (x._1, x._2 / sum1) }

val linkData = sc.textFile(linkFile).repartition(20000)
val links = linkData.map { line =>
  val lines = line.split("\t")
  (hash(lines(0)), (hash(lines(1)), lines(2).toFloat))
}.groupByKey()
links.persist(StorageLevel.DISK_ONLY)
links.count()

var ranks = links.mapValues(v => 1.0)

for (i <- 1 to iters) {
  // contribution of each page's rank to its out-links, weighted by edge weight
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url._1, url._2 * rank))
  }
  contribs.persist(StorageLevel.DISK_ONLY)
  contribs.count()

  val ranksTmp = contribs.reduceByKey(_ + _).mapValues(0.85 * _)
  val Zranks = ranksTmp.map(x => x._2).sum()
  // Z: remaining rank mass after damping, redistributed in proportion to the normalised weights
  val Z = totalZ - Zranks
  println("total Z: " + totalZ + " Z: " + Z)

  val randnZ = dataWeight.map { x => (x._1, x._2 * Z) }
  val rankResult = ranksTmp.rightOuterJoin(randnZ).map { case (a, (b, c)) => (a, b.getOrElse(0.0) + c) }
  ranks = ranks.join(rankResult).map { case (a, (b, c)) => (a, c) }

  if (i % 2 == 0) {
    ranks.persist(StorageLevel.MEMORY_AND_DISK)
    ranks.checkpoint()
    ranks.count()
  } else {
    ranks.count()
  }

  if (i == iters) {
    rankResult.map { case (a, b) => a.toString + "\t" + b.toString }.saveAsTextFile(outputFile)
    dataUrl.join(rankResult).values.map { case (a, b) => a + "\t" + b.toString }.saveAsTextFile(outputFile + "UrlAndWeight")
  }
}
It is really hard to guess why your code is not working properly just by looking at it. A few years ago I implemented PageRank for ranking users in a social graph and it worked without a hitch for me - link. Maybe it'd be helpful for you. Spark's Pregel interface runs PageRank until convergence, or you can set a fixed number of iterations.
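For reference, the GraphX route hinted at above can look roughly like the sketch below. It reuses linkData from the question, assumes hash returns a Long vertex id, and ignores the weight handling; it is an illustration only, not a drop-in replacement:

import org.apache.spark.graphx.{Edge, Graph}

// build a graph from the (src, dst, weight) link file; GraphX vertex ids must be Longs
val edges = linkData.map { line =>
  val fields = line.split("\t")
  Edge(hash(fields(0)), hash(fields(1)), fields(2).toFloat)
}
val graph = Graph.fromEdges(edges, 1.0f)

// fixed number of iterations ...
val staticRanks = graph.staticPageRank(20).vertices
// ... or run until the ranks change by less than the tolerance
val convergedRanks = graph.pageRank(0.0001).vertices

Using the built-in implementation also removes the hand-rolled checkpointing and persisting logic, which is often where long iterative jobs accumulate lineage and memory problems.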

SPARK: sum of elements with the same indexes from RDD[Array[Int]] in spark-rdd

I have three files like:
file1: 1,2,3,4,5
6,7,8,9,10
file2: 11,12,13,14,15
16,17,18,19,20
file3: 21,22,23,24,25
26,27,28,29,30
I have to find the sum of rows from each file:
1+2+3+4+5 + 11+12+13+14+15 + 21+22+23+24+25
6+7+8+9+10 + 16+17+18+19+20 + 26+27+28+29+30
I have written the following code in Spark-Scala to get the array of sums of all the rows:
val filesRDD = sc.wholeTextFiles("path to folder\\numbers\\*")
// creating RDD[Array[String]]
val linesRDD = filesRDD.map(elem => elem._2.split("\\n"))
// creating RDD[Array[Array[Int]]]
val rdd1 = linesRDD.map(line => line.map(str => str.split(",").map(_.trim.toInt)))
// creating RDD[Array[Int]]
val rdd2 = rdd1.map(elem => elem.map(e => e.sum))
rdd2.collect.foreach(elem => println(elem.mkString(",")))
the output I am getting is:
15,40
65,90
115,140
What I want is to sum 15+65+115 and 40+90+140
Any help is appreciated!
PS:
The files can have a different number of lines, e.g. some with 3 lines, others with 4, and there can be any number of files.
I want to do this using RDDs only, not DataFrames.
You can use reduce to sum up the arrays:
val result = rdd2.reduce((x,y) => (x,y).zipped.map(_ + _))
// result: Array[Int] = Array(195, 270)
and if the files are of different lengths (e.g. file 3 has only one line, 21,22,23,24,25):
val result = rdd2.reduce((x,y) => x.zipAll(y,0,0).map{case (a, b) => a + b})
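A quick local check of the zipAll-based reduce on plain Scala collections (no Spark needed), assuming the last file contributed only one row:

val rows = Seq(Array(15, 40), Array(65, 90), Array(115))
val total = rows.reduce((x, y) => x.zipAll(y, 0, 0).map { case (a, b) => a + b })
// total: Array(195, 130) -- the missing second column of the short row counts as 0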

Join per line two different RDDs in just one - Scala

I'm programming a K-means algorithm in Spark-Scala.
My model predicts in which cluster is each point.
Data
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
....
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data into ten clusters using KMeans
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Predict
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the following format:
[no title, numberOfCluster (1,2,3,...,10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This is the input for a Python executable that prints the result nicely.
But my best effort has only got me this
(you can see that the first numbers are wrong: 68, 384, ...):
var i = 0
val c = sc.parallelize(data.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
val result = c.join(c2)
result.take(5)
Result:
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
  val cluster = clusters.predict(v)
  s"$cluster ${v(0)} ${v(1)}"
}
result.saveAsTextFile("/some/output/path")
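Alternatively, if you already have the prediction RDD computed as in the question, the two RDDs can be zipped positionally. A sketch only: zip requires both RDDs to have the same partitioning and element counts, which holds here because predict is a per-element transformation of parsedData; the output path is just a placeholder:

// pair each prediction with its original point, line by line
val labelled = clusters.predict(parsedData).zip(parsedData).map { case (cluster, v) =>
  s"$cluster ${v(0)} ${v(1)}"
}
labelled.saveAsTextFile("/some/output/path-zip")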

How to sort multiple columns (more than ten columns) in Scala?

How do I sort multiple columns (more than ten columns) in Scala?
For example:
1 2 3 4
4 5 6 3
1 2 1 1
2 3 5 10
desired output
1 2 1 1
1 2 3 3
2 3 5 4
4 5 6 10
Not much to it.
val input = io.Source.fromFile("junk.txt") // open file
.getLines // load all contents
.map(_.split("\\W+")) // turn rows into Arrays
.map(_.map(_.toInt)) // Arrays of Ints
val output = input.toList // from Iterator to List
.transpose // swap rows/columns
.map(_.sorted) // sort rows
.transpose // swap back
output.foreach(x => println(x.mkString(" "))) // report results
Note: This allows any whitespace between the numbers but it will fail to create the expected Array[Int] if it encounters other separators (commas, etc.) or if the line begins with a space.
Also, transpose will throw if the rows aren't all the same size.
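For the data in the question, the same transpose/sort/transpose step can be checked directly on plain collections (a small standalone example, no file involved):

val rows = List(
  List(1, 2, 3, 4),
  List(4, 5, 6, 3),
  List(1, 2, 1, 1),
  List(2, 3, 5, 10)
)
// sort each column independently, then restore the row/column layout
rows.transpose.map(_.sorted).transpose.foreach(r => println(r.mkString(" ")))
// 1 2 1 1
// 1 2 3 3
// 2 3 5 4
// 4 5 6 10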
I followed the following algorithm: first alter the dimension of the rows and columns (i.e. transpose), then sort the rows, then alter the dimension again to bring back the original row-column configuration. Here is a sample proof of concept.
object SO_42720909 extends App {

  // generating dummy data
  val unsortedData = getDummyData(2, 3)
  prettyPrint(unsortedData)
  println("----------------- ")

  // altering the dimension
  val unsortedAlteredData = alterDimension(unsortedData)
  // prettyPrint(unsortedAlteredData)

  // sorting the altered data
  val sortedAlteredData = sort(unsortedAlteredData)
  // prettyPrint(sortedAlteredData)

  // bringing back original dimension
  val sortedData = alterDimension(sortedAlteredData)
  prettyPrint(sortedData)

  def alterDimension(data: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    val col = data.size
    val row = data.head.size // make it safe
    for (i <- (0 until row))
      yield for (j <- (0 until col)) yield data(j)(i)
  }

  def sort(data: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    for (row <- data) yield row.sorted
  }

  def getDummyData(row: Int, col: Int): Seq[Seq[Int]] = {
    val r = scala.util.Random
    for (i <- (1 to row))
      yield for (j <- (1 to col)) yield r.nextInt(100)
  }

  def prettyPrint(data: Seq[Seq[Int]]): Unit = {
    data.foreach(row => {
      println(row.mkString(", "))
    })
  }
}

Want to parse a file and reformat it to create a pairRDD in Spark through Scala

I have a dataset in a file in the form:
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698
4: 145
5: 8 57544 58089 60048 65880 284186 313376
6: 8
I need to transform this to something like the following, using Spark and Scala, as a part of preprocessing the data:
1 1664968
2 3
2 747213
2 1664968
2 4095634
2 5535664
3 9
3 77935
3 79583
3 84707
And so on....
Can anyone provide input on how this can be done?
The length of the original rows in the file varies, as shown in the dataset example above.
I am not sure how to go about doing this transformation.
I tried something like the below, which gives me a pair of the key and the first element after the colon.
But I am not sure how to iterate over the entire data and generate the pairs as needed.
def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("Graphx").setMaster("local"))
  val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")
  rawLinks.take(5).foreach(println)

  val formattedLinks = rawLinks.map { rows =>
    val fields = rows.split(":")
    val fromVertex = fields(0)
    val toVerticesArray = fields(1).split(" ")
    (fromVertex, toVerticesArray(1))
  }

  val topFive = formattedLinks.take(5)
  topFive.foreach(println)
}
val rdd = sc.parallelize(List("1: 1664968", "2: 3 747213 1664968 1691047 4095634 5535664"))

val keyValues = rdd.flatMap(line => {
  val Array(key, values) = line.split(":", 2)
  for (value <- values.trim.split("""\s+"""))
    yield (key, value.trim)
})

keyValues.collect
Split each row into 2 parts and map over the variable number of columns:
def transform(s: String): Array[String] = {
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => s"$head $x")
}

> transform("2: 3 747213 1664968 1691047 4095634 5535664")
// Array(2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)
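To apply this to the whole file and write the reformatted pairs back out, the function can be combined with flatMap. A minimal sketch reusing rawLinks from the question; the output path is just a placeholder:

val reformatted = rawLinks.flatMap(transform)   // one "id value" line per pair
reformatted.saveAsTextFile("src/main/resources/links-reformatted")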