Spark Scala: how to count values in rows

I am new to both Spark and Scala... I have to read a data file and count the occurrences of a value in each column and in each row. The data set is structured like this:
0 0 2
0 2 2
0 2 0
2 0 0
0 0 0
0 1 0
In order to count the number of "2" in each column:
I imported the file:
val ip = sc.textFile("/home/../data-scala.txt")
I created an array to save my results
var ArrayCol = Array.ofDim[Long](3)
val cols = ip.map(line => line.split(" "))
for (i <- 0 to 2) {
  ArrayCol(i) = cols.map(col => col(i)).filter(_.contains("2")).count()
}
and I counted the number of "2" contained in each column.
Now I would like to do the same for each row. Do you have any suggestions?

cols.map(r => r.count(_ == "2"))
Or shell example:
scala> val cols = sc.parallelize(List("0 1 2", "2 0 2")).map(_.split(" "))
scala> cols.map(_.count(_ == "2")).collect()
res1: Array[Int] = Array(1, 2)

OK, thank you.
cols.map(r => r.count(_ == "2"))
works fine to count how many "2"s there are in each row.
How would you count how many "2"s there are in each column?
I think there is a clearer solution than mine.
Thanks.
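One possibly cleaner way to get the per-column counts, sketched here under the assumption that cols is the RDD[Array[String]] built in the question above and that every row has the same number of fields, is to aggregate all columns in a single pass instead of running one Spark job per column:

// Map each cell to 1 or 0 depending on whether it equals "2",
// then sum the arrays element-wise across all rows.
val countsPerColumn: Array[Long] =
  cols.map(_.map(v => if (v == "2") 1L else 0L))
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

countsPerColumn(i) then holds the number of "2"s in column i, without the mutable ArrayCol or the loop.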

Related

How to access elements column-wise in Spark dataframes

I have a text file which contains the following data
3 5
10 20 30 40 50
0 0 0 2 5
5 10 10 10 10
Question:
The first line of the file gives the number of rows and the number of columns of the data.
Print the sum of a column if no element of the column is prime; otherwise print zero.
Output:
0
30
40
0
0
Explanation:
(0 because column 10 0 5 contains the prime number 5)
(30 because column 20 0 10 has no prime number, so print 20+0+10=30); likewise for all columns.
Please suggest a method to access the DataFrame in a column-wise manner.
General idea: zip every value with its column index, create a pair RDD, then apply a reduceByKey (the key here is the column index), checking at each step whether the number is prime.
val rows = spark.sparkContext.parallelize(
  Seq(
    Array(10, 20, 30, 40, 50),
    Array(0, 0, 0, 2, 5),
    Array(5, 10, 10, 10, 10)
  )
)

def isPrime(i: Int): Boolean = i >= 2 && !((2 until i - 1) exists (i % _ == 0))

val result = rows.flatMap { arr => arr.map(Option(_)).zipWithIndex.map(_.swap) }
  .reduceByKey {
    case (None, _) | (_, None)                          => None
    case (Some(a), Some(b)) if isPrime(a) | isPrime(b)  => None
    case (Some(a), Some(b))                             => Some(a + b)
  }.map { case (k, v) => k -> v.getOrElse(0) }

result.foreach(println)
Output (you'll have to collect the data in order to sort by column index):
(3,0)
(0,0)
(4,0)
(2,40)
(1,30)
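As noted, the pairs come back unordered. A small follow-up sketch, reusing result from the code above, collects and sorts by the column index before printing only the sums:

result.collect()                               // bring the (columnIndex, sum) pairs to the driver
  .sortBy(_._1)                                // order by column index
  .foreach { case (_, sum) => println(sum) }   // prints 0, 30, 40, 0, 0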

Set similarity join using Spark

I have two text files each line is in the form of (id, sequence of numbers).
I have a threshold value as well.
File 1 looks like the one below, where, in the first line, 0 is the id and the rest is
a sequence of numbers.
0 1 4 5 6
1 2 3 6
2 4 5 6
Similarly I have file 2 with following contents.
0 1 4 6
1 2 5 6
2 3 5
I have to find those pairs of lines whose similarity value is greater than or equal to the threshold. The similarity value is the size of the intersection of two lines divided by the size of their union. For example, line id 0 of file1 has the sequence 1,4,5,6 and line id 0 of file2 has the sequence 1,4,6. Their intersection size is 3 and their union size is 4, so their similarity is 3/4 = 0.75, which is greater than the threshold.
I have written Python code to do this task and am trying to convert it to Scala.
with open("file1.txt") as f1:
    content1 = f1.readlines()
    content1 = [x.strip() for x in content1]
with open("file2.txt") as f2:
    content2 = f2.readlines()
    content2 = [x.strip() for x in content2]

threshold = 0.5
final_index_list_with_similarity = []

for i in range(len(content1)):
    for j in range(len(content2)):
        index_content1 = content1[i][0]
        index_content2 = content2[j][0]
        s = set(content1[i][1:])
        t = set(content2[j][1:])
        intersect = s & t
        intersect_size = len(intersect) - 1
        union_size = len(s) + len(t) - intersect_size - 2  # subtracting two because I am getting two extra spaces
        similarity = intersect_size / union_size
        if similarity >= threshold:
            final_index_list_with_similarity.append((index_content1, index_content2, similarity))

print(final_index_list_with_similarity)
Output : [('0', '0', 0.75), ('1', '1', 0.5), ('2', '0', 0.5), ('2', '1', 0.5)]
What I have tried so far in Scala looks something like this:
val inputFile1 = args(0)
val inputFile2 = args(1)
val threshold = args(2).toDouble
val ouputFolder = args(3)
val conf = new SparkConf().setAppName("SetSimJoin").setMaster("local")
val sc = new SparkContext(conf)
val lines1 = sc.textFile(inputFile1).flatMap(line => line.split("\n"))
val lines2 = sc.textFile(inputFile2).flatMap(line => line.split("\n"))
val elements1 = lines1.map { x => x.drop(x.split(" ")(0).length.toInt + 1) }.flatMap { x => x.split(" ") }.map { x => (x, 1) }.reduceByKey(_+_)
val elements2 = lines2.map { x => x.drop(x.split(" ")(0).length.toInt + 1) }.flatMap { x => x.split(" ") }.map { x => (x, 1) }.reduceByKey(_+_)
This gives me the frequency of every number in the entire file.
Any help or guidance will be much appreciated.
Both files can be loaded as RDDs keyed by id, joined, and then the formula "intersection size / union size" applied:
val lines1 = sparkContext.textFile("inputFile1")
val lines2 = sparkContext.textFile("inputFile2")
val rdd1 = lines1.map(_.split(" ")).map(v => (v(0), v.tail))
val rdd2 = lines2.map(_.split(" ")).map(v => (v(0), v.tail))
val result = rdd1.join(rdd2).map(r => (
  r._1,
  r._2._1.intersect(r._2._2).size * 1.0 /
    r._2._1.union(r._2._2).distinct.size
))
result.foreach(println)
Output is:
(1,0.5)
(0,0.75)
(2,0.25)
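Note that the join above only compares lines that share the same id. If, as in the Python version, every line of file 1 should be compared with every line of file 2 and only pairs at or above the threshold kept, one option is a cartesian product. This is only a sketch, reusing rdd1 and rdd2 from the answer and a threshold value such as the one parsed in the question, and it is expensive for large inputs:

val allPairs = rdd1.cartesian(rdd2)
  .map { case ((id1, seq1), (id2, seq2)) =>
    val inter = seq1.intersect(seq2).length.toDouble   // intersection size
    val union = seq1.union(seq2).distinct.length       // union size
    (id1, id2, inter / union)
  }
  .filter { case (_, _, sim) => sim >= threshold }     // keep pairs at or above the threshold
allPairs.foreach(println)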

pyspark: get size of the second element of a groupby on rdd

I have an rdd which I create from an input like the following:
0 1
0 2
1 2
1 3
I do a groupBy like the following:
rdd2 = rdd1.groupBy(lambda x: x[0])
Now rdd2 would be something like:
[(0,[1,2]),(1,[2,3])]
My question is, how can I get the size of the list associated with each element?
Thanks
You can use mapValues and len:
rdd2.mapValues(list).mapValues(len)
Why do you even need a groupBy when you have countByKey()?
rdd = sc.parallelize(inputData)
rdd.countByKey()
Output will be a dictionary:
defaultdict(<class 'int'>, {0: 2, 1: 2})

How to sort multiple columns (more than ten columns) in Scala?

How to sort multiple columns (more than ten columns) in Scala?
For example:
1 2 3 4
4 5 6 3
1 2 1 1
2 3 5 10
Desired output:
1 2 1 1
1 2 3 3
2 3 5 4
4 5 6 10
Not much to it.
val input = io.Source.fromFile("junk.txt")  // open file
  .getLines                                 // load all contents
  .map(_.split("\\W+"))                     // turn rows into Arrays
  .map(_.map(_.toInt))                      // Arrays of Ints

val output = input.toList                   // from Iterator to List
  .transpose                                // swap rows/columns
  .map(_.sorted)                            // sort rows
  .transpose                                // swap back

output.foreach(x => println(x.mkString(" "))) // report results
Note: this allows any whitespace between the numbers, but it will fail to create the expected Array[Int] if it encounters tokens that aren't plain numbers or if a line begins with a space (the leading empty token won't parse as an Int).
Also, transpose will throw if the rows aren't all the same size.
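One way to harden the parsing step against leading spaces, sketched here as a variant of the input code above (same hypothetical junk.txt file), is to trim each line before splitting on runs of whitespace:

val input = io.Source.fromFile("junk.txt")   // hypothetical file name from the answer above
  .getLines
  .map(_.trim.split("\\s+"))                 // trim first, then split on whitespace
  .map(_.map(_.toInt))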
I followed this algorithm: first alter the dimension of the rows and columns (transpose), then sort the rows, then alter the dimension again to bring back the original row-column configuration. Here is a sample proof of concept:
object SO_42720909 extends App {

  // generating dummy data
  val unsortedData = getDummyData(2, 3)
  prettyPrint(unsortedData)
  println("----------------- ")

  // altering the dimension
  val unsortedAlteredData = alterDimension(unsortedData)
  // prettyPrint(unsortedAlteredData)

  // sorting the altered data
  val sortedAlteredData = sort(unsortedAlteredData)
  // prettyPrint(sortedAlteredData)

  // bringing back original dimension
  val sortedData = alterDimension(sortedAlteredData)
  prettyPrint(sortedData)

  def alterDimension(data: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    val col = data.size
    val row = data.head.size // make it safe
    for (i <- (0 until row))
      yield for (j <- (0 until col)) yield data(j)(i)
  }

  def sort(data: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    for (row <- data) yield row.sorted
  }

  def getDummyData(row: Int, col: Int): Seq[Seq[Int]] = {
    val r = scala.util.Random
    for (i <- (1 to row))
      yield for (j <- (1 to col)) yield r.nextInt(100)
  }

  def prettyPrint(data: Seq[Seq[Int]]): Unit = {
    data.foreach(row => {
      println(row.mkString(", "))
    })
  }
}

Unexpected behavior inside the foreachPartition method of a RDD

I evaluated the following lines of Scala code in the spark-shell:
val a = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
val b = a.coalesce(1)
b.foreachPartition { p =>
  p.map(_ + 1).foreach(println)
  p.map(_ * 2).foreach(println)
}
The output is the following:
2
3
4
5
6
7
8
9
10
11
Why does the partition p become empty after the first map?
It does not look strange to me: since p is an Iterator, once you walk through it with map it has no more values. And taking into account that length is a shortcut for size, which is implemented like this:
def size: Int = {
  var result = 0
  for (x <- self) result += 1
  result
}
you get 0.
The answer is in the Scala doc: http://www.scala-lang.org/api/2.11.8/#scala.collection.Iterator. It explicitly states that an iterator (p is an iterator) must be discarded after calling map on it.
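If both traversals are really needed inside the same foreachPartition, one common workaround, sketched here on the assumption that each partition fits in memory, is to materialize the iterator into a collection first:

b.foreachPartition { p =>
  val values = p.toList                 // consume the iterator once and keep the elements
  values.map(_ + 1).foreach(println)
  values.map(_ * 2).foreach(println)    // now this second pass prints as well
}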