How to access elements column-wise in Spark dataframes - Scala

I have a text file which contains the following data
3 5
10 20 30 40 50
0 0 0 2 5
5 10 10 10 10
Question:
The first line of the file gives the number of rows and the number of columns of the data.
For each column, print the sum of the column if no element of the column is prime; otherwise print zero.
Output:
0
30
40
0
0
Explanation:
(0 because the column 10 0 5 contains the prime number 5)
(30 because the column 20 0 10 contains no prime number, so print 20+0+10=30). The same applies to all other columns.
Please suggest a method to access the dataframe in a column-wise manner.

General idea: map every value to an Option (None if the value is prime, Some(value) otherwise), zip it with its column index to create a pair RDD, then apply a reduceByKey (the key here is the column index) that sums the values and propagates None as soon as any element of the column is prime.
val rows = spark.sparkContext.parallelize(Seq(
  Array(10, 20, 30, 40, 50),
  Array(0, 0, 0, 2, 5),
  Array(5, 10, 10, 10, 10)
))

def isPrime(i: Int): Boolean = i >= 2 && !((2 until i).exists(i % _ == 0))

val result = rows
  .flatMap { arr =>
    // mark prime elements as None so they zero out the whole column
    arr.map(v => if (isPrime(v)) None else Some(v)).zipWithIndex.map(_.swap)
  }
  .reduceByKey {
    case (Some(a), Some(b)) => Some(a + b)
    case _                  => None // any prime in the column forces a zero
  }
  .map { case (k, v) => k -> v.getOrElse(0) }

result.foreach(println)
Output (you'll have to collect the data in order to sort by column index):
(3,0)
(0,0)
(4,0)
(2,40)
(1,30)
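Since the pairs come back in arbitrary order, here is a minimal sketch (assuming the result RDD above) that collects them, sorts by column index, and prints just the sums:
result.collect().sortBy(_._1).foreach { case (_, sum) => println(sum) }
// prints 0, 30, 40, 0, 0 in column order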

Related

Element-wise addition in a Spark RDD

I have an RDD of type RDD[Array[Array[Double]]], so essentially each element is a matrix. I need to do element-wise addition.
So if the first element of the RDD is
1 2
3 4
and the second element is
5 6
7 8
at the end I need to have
6 8
10 12
I have looked into zip, but I am not sure if I can use it for this case.
Yes, you can use zip, but you'll have to use it twice, once for the rows and once for the columns:
val rdd = sc.parallelize(List(
  Array(Array(1.0, 2.0), Array(3.0, 4.0)),
  Array(Array(5.0, 6.0), Array(7.0, 8.0))
))
rdd.reduce((a, b) => a.zip(b).map { case (c, d) => c.zip(d).map { case (x, y) => x + y } })
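To check against the expected matrix from the question, a quick sketch (assuming the sample rdd defined above):
val summed = rdd.reduce((a, b) => a.zip(b).map { case (c, d) => c.zip(d).map { case (x, y) => x + y } })
summed.foreach(row => println(row.mkString(" ")))
// 6.0 8.0
// 10.0 12.0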

How to sort multiple columns (more than ten columns) in Scala?

How do I sort multiple columns (more than ten columns) in Scala?
for example:
1 2 3 4
4 5 6 3
1 2 1 1
2 3 5 10
desired output
1 2 1 1
1 2 3 3
2 3 5 4
4 5 6 10
Not much to it.
val input = io.Source.fromFile("junk.txt")   // open file
  .getLines                                  // load all contents
  .map(_.split("\\W+"))                      // turn rows into Arrays
  .map(_.map(_.toInt))                       // Arrays of Ints

val output = input.toList   // from Iterator to List
  .transpose                // swap rows/columns
  .map(_.sorted)            // sort rows
  .transpose                // swap back

output.foreach(x => println(x.mkString(" ")))   // report results
Note: This allows any whitespace between the numbers but it will fail to create the expected Array[Int] if it encounters other separators (commas, etc.) or if the line begins with a space.
Also, transpose will throw if the rows aren't all the same size.
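If leading whitespace or stray separators are a concern, one hedged workaround (not part of the original answer) is to trim each line and drop empty tokens before converting to Int:
val input = io.Source.fromFile("junk.txt")        // same sample file name as above
  .getLines
  .map(_.trim.split("\\W+").filter(_.nonEmpty))   // tolerate leading spaces and empty tokens
  .map(_.map(_.toInt))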
I followed this algorithm: first alter the dimension of the rows and columns (transpose), then sort the rows, then alter the dimension again to bring back the original row-column configuration. Here is a sample proof of concept.
object SO_42720909 extends App {

  // generating dummy data
  val unsortedData = getDummyData(2, 3)
  prettyPrint(unsortedData)
  println("----------------- ")

  // altering the dimension
  val unsortedAlteredData = alterDimension(unsortedData)
  // prettyPrint(unsortedAlteredData)

  // sorting the altered data
  val sortedAlteredData = sort(unsortedAlteredData)
  // prettyPrint(sortedAlteredData)

  // bringing back original dimension
  val sortedData = alterDimension(sortedAlteredData)
  prettyPrint(sortedData)

  def alterDimension(data: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    val col = data.size
    val row = data.head.size // make it safe
    for (i <- 0 until row)
      yield for (j <- 0 until col) yield data(j)(i)
  }

  def sort(data: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    for (row <- data) yield row.sorted
  }

  def getDummyData(row: Int, col: Int): Seq[Seq[Int]] = {
    val r = scala.util.Random
    for (i <- 1 to row)
      yield for (j <- 1 to col) yield r.nextInt(100)
  }

  def prettyPrint(data: Seq[Seq[Int]]): Unit = {
    data.foreach(row => println(row.mkString(", ")))
  }
}
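For reference, a minimal sketch (independent of either answer) applying the same transpose-sort-transpose idea directly to the sample data from the question reproduces the desired output:
val data = List(List(1, 2, 3, 4), List(4, 5, 6, 3), List(1, 2, 1, 1), List(2, 3, 5, 10))
val sorted = data.transpose.map(_.sorted).transpose   // sort each column independently
sorted.foreach(row => println(row.mkString(" ")))
// 1 2 1 1
// 1 2 3 3
// 2 3 5 4
// 4 5 6 10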

Is it possible to join two RDDs on their values to avoid expensive shuffling?

I have two RDDs, each with two columns as (K, V). In the source files for those RDDs the keys appear one under the other, and for each row a different, distinct value is assigned to the key. The text files used to create the RDDs are given at the bottom of this post.
The keys are completely different in the two RDDs, and I would like to join the RDDs based on their values and count how many common values exist for each key pair. e.g. I am trying to reach a result such as (1-5, 10), meaning that key "1" from RDD1 and key "5" from RDD2 share 10 values in common.
I work on a single machine with 256 GB of RAM and 72 cores. One text file is 500 MB while the other is 3 MB.
Here is my code:
val conf = new SparkConf().setAppName("app").setMaster("local[*]")
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4")
  .set("spark.executor.memory", "128g")
  .set("spark.driver.maxResultSize", "0")

val RDD1 = sc.textFile("\\t1.txt", 1000).map { line => val s = line.split("\t"); (s(0), s(1)) }   // (key, value)
val RDD2 = sc.textFile("\\t2.txt", 1000).map { line => val s = line.split("\t"); (s(1), s(0)) }   // (value, key)

// broadcast the smaller RDD as a value -> keys map
val RDD2BC = sc.broadcast(RDD2.groupByKey.collectAsMap)

val joined = RDD1.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- RDD2BC.value.getOrElse(v1, Iterable())
} yield (s"$k-$v2", 1))

joined.foreach(println)
val result = joined.reduceByKey((a, b) => a + b)
I try to manage this issue by using a broadcast variable, as seen in the script. If I join RDD2 (which has 250K rows) with itself, the pairs show up in the same partitions, so less shuffling takes place and it takes 3 minutes to get the results. However, when joining RDD1 with RDD2, the pairs are scattered across partitions, resulting in a very expensive shuffle, and it always ends up with:
ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 168591 ms
Based on my results:
Should I try to partition the text file used to create RDD1 into smaller chunks and join those chunks separately with RDD2?
Is there another way of joining two RDDs based on their value fields? If I use the original values as keys and join them with the join function, the value pairs are again scattered over the partitions, which again results in a very expensive reduceByKey operation, e.g.
val RDD1 = sc.textFile("\\t1.txt", 1000).map { line => val s = line.split("\t"); (s(1), s(0)) }
val RDD2 = sc.textFile("\\t2.txt", 1000).map { line => val s = line.split("\t"); (s(1), s(0)) }
RDD1.join(RDD2).map(line => (line._2, 1)).reduceByKey((a, b) => a + b)
PSEUDO DATA SAMPLE:
KEY VALUE
1 13894
1 17376
1 15688
1 22434
1 2282
1 14970
1 11549
1 26027
1 2895
1 15052
1 20815
2 9782
2 3393
2 11783
2 22737
2 12102
2 10947
2 24343
2 28620
2 2486
2 249
2 3271
2 30963
2 30532
2 2895
2 13894
2 874
2 2021
3 6720
3 3402
3 25894
3 1290
3 21395
3 21137
3 18739
...
A SMALL EXAMPLE
RDD1
2 1
2 2
2 3
2 4
2 5
2 6
3 1
3 6
3 7
3 8
3 9
4 3
4 4
4 5
4 6
RDD2
21 1
21 2
21 5
21 11
21 12
21 10
22 7
22 8
22 13
22 9
22 11
JOIN RESULTS BASED ON THIS DATA:
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(3-22,1)
(4-21,1)
(2-21,1)
(3-21,1)
(3-22,1)
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(4-21,1)
(2-21,1)
(3-21,1)
REDUCEBYKEY RESULTS:
(4-21,1)
(3-21,1)
(2-21,3)
(3-22,3)
Have you looked at using a cartesian join? You could maybe try something like below:
val rdd1 = sc.parallelize(for { x <- 1 to 3; y <- 1 to 5 } yield (x, y)) // sample RDD
val rdd2 = sc.parallelize(for { x <- 1 to 3; y <- 3 to 7 } yield (x, y)) // sample RDD with slightly displaced values from the first
val g1 = rdd1.groupByKey()
val g2 = rdd2.groupByKey()
val cart = g1.cartesian(g2).map { case ((key1, values1), (key2, values2)) =>
  ((key1, key2), (values1.toSet & values2.toSet).size)
}
When I try running the above example in a cluster, I see the following:
scala> rdd1.take(5).foreach(println)
...
(1,1)
(1,2)
(1,3)
(1,4)
(1,5)
scala> rdd2.take(5).foreach(println)
...
(1,3)
(1,4)
(1,5)
(1,6)
(1,7)
scala> cart.take(5).foreach(println)
...
((1,1),3)
((1,2),3)
((1,3),3)
((2,1),3)
((2,2),3)
The result indicates that for (key1, key2), there are 3 matching elements between the sets. Note that the result is always 3 here since the initialized input tuples' ranges overlapped by 3 elements.
The cartesian transformation does not cause a shuffle either, since it just iterates over the elements of each RDD and produces a cartesian product. You can see this by calling toDebugString on an example:
scala> val carts = rdd1.cartesian(rdd2)
carts: org.apache.spark.rdd.RDD[((Int, Int), (Int, Int))] = CartesianRDD[9] at cartesian at <console>:25
scala> carts.toDebugString
res11: String =
(64) CartesianRDD[9] at cartesian at <console>:25 []
| ParallelCollectionRDD[1] at parallelize at <console>:21 []
| ParallelCollectionRDD[2] at parallelize at <console>:21 []
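To format the counts in the "key1-key2" style the question asks for (e.g. (1-5, 10)), a small follow-up sketch assuming the cart RDD above:
val formatted = cart.map { case ((k1, k2), n) => (s"$k1-$k2", n) }
formatted.collect().foreach(println)   // e.g. (1-1,3), (1-2,3), ...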

Scala: deal with a string of numbers and accumulate counts

I have a variable:
val list = rows.sortBy(-_._2).map { case (user, list) => list }.take(20).mkString("::")
The result of println(list) should be:
60::58::51::48::47::47::45::45::43::43::42::42::42::41::41::41::40::40::40::39
And now I have to deal with these numbers (like a histogram).
If I set the break to 10, it should divide the max number (60) by 10 and make 6 buckets:
the range 0~10 (0 < x <= 10) has 0 matching numbers
the range 10~20 (10 < x <= 20) has 0 matching numbers
the range 20~30 (20 < x <= 30) has 0 matching numbers
the range 30~40 (30 < x <= 40) has 4 matching numbers
the range 40~50 (40 < x <= 50) has 13 matching numbers
the range 50~60 (50 < x <= 60) has 3 matching numbers
And then I have to save the result in two variables x and y:
x: 0~10::10~20::20~30::30~40::40~50::50~60
y: 0::0::0::4::13::3
How can I do this?
val actualList = list.split("::").map(_.toInt).toList
val step = 10
val steps = step to actualList.max by step
// for each step x, count the items y in the list with x - step < y <= x
val counts = steps.map(x => actualList.count(y => y <= x && y > x - step))
val stepsAsString = steps.map(x => s"${x - step}~$x")
And you can map them too:
steps.zip(counts).toMap
Note that this could be made more performant if the list were sorted first, but I wouldn't worry about tuning unless you need it.
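To produce the two "::"-joined strings the question asks for, a minimal sketch assuming the stepsAsString and counts values above:
val x = stepsAsString.mkString("::")   // 0~10::10~20::20~30::30~40::40~50::50~60
val y = counts.mkString("::")          // 0::0::0::4::13::3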

Transforming from native matrix format, Scalding

This question is related to the question Transforming matrix format, scalding.
But now I want to perform the reverse operation. I can do it this way:
Tsv(in, ('row, 'col, 'v))
  .read
  .groupBy('row) { _.sortBy('col).mkString('v, "\t") }
  .mapTo(('row, 'v) -> 'c) { res: (Long, String) =>
    val (row, v) = res
    v
  }
  .write(Tsv(out))
But there we have a problem with zeros. As we know, Scalding skips fields with zero values. So, for example, given the matrix:
1 0 8
4 5 6
0 8 9
In Scalding format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we can only get:
1 8
4 5 6
8 9
And that's incorrect. So, how can I deal with it? I see two possible options:
Find a way to add the zeros (though I don't know how to insert that data).
Write my own operations on my own matrix format (which is not preferable, because I'm interested in Scalding's matrix operations and don't want to rewrite all of them myself).
Maybe there are methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type

val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)

rows.crossWithTiny(cols)
  .leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
  .map('val1 -> 'val1) { v: V =>
    if (v == null) zero // this value should be 0 in your type
    else v
  }
  .groupBy('row) {
    _.toList[(Int, V)](('col, 'val1) -> 'cols)
  }
  .map('cols -> 'cols) { cols: List[(Int, V)] =>
    cols.sortBy(_._1).map(_._2).mkString("\t")
  }
  .write(TypedTsv[(Int, String)]("output"))
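As a plain-Scala illustration of the same idea (enumerate every (row, col) pair and fill missing cells with zero), not Scalding code, using the 3x3 matrix from the question:
// sparse (row, col) -> value entries, as in the Scalding triple format above
val sparse = Map((1, 1) -> 1, (1, 3) -> 8, (2, 1) -> 4, (2, 2) -> 5, (2, 3) -> 6, (3, 2) -> 8, (3, 3) -> 9)
// enumerate all rows and columns, defaulting missing cells to zero
val dense = for (r <- 1 to 3) yield (1 to 3).map(c => sparse.getOrElse((r, c), 0))
dense.foreach(row => println(row.mkString("\t")))
// 1    0    8
// 4    5    6
// 0    8    9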