I have two RDDs, both with two columns as (K, V). In the source files for those RDDs, the keys appear one under the other and each row assigns a different, distinct value to its key. The text files used to create the RDDs are given at the bottom of this post.
The keys are completely different in the two RDDs, and I would like to join the two RDDs based on their values and find out how many common values exist for each key pair. e.g. I am trying to reach a result such as (1-5, 10), meaning that key "1" from RDD1 and key "5" from RDD2 share 10 values in common.
I work on a single machine with 256 GB of RAM and 72 cores. One text file is 500 MB while the other is 3 MB.
Here is my code:
val conf = new SparkConf().setAppName("app").setMaster("local[*]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4")
.set("spark.executor.memory","128g")
.set("spark.driver.maxResultSize", "0")
val RDD1 = sc.textFile("\\t1.txt",1000).map{line => val s = line.split("\t"); (s(0),s(1))}
val RDD2 = sc.textFile("\\t2.txt",1000).map{line => val s = line.split("\t"); (s(1),s(0))}
// Broadcast the small RDD (RDD2, keyed by value) as a value -> keys lookup map
val rdd2BC = sc.broadcast(RDD2.groupByKey.collectAsMap)
val joined = RDD1.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- rdd2BC.value.getOrElse(v1, Iterable())
} yield (s"$k-$v2", 1))
joined.foreach(println)
val result = joined.reduceByKey((a,b) => a+b)
I try to manage this issue by using a broadcast variable, as seen in the script. If I join RDD2 (which has 250K rows) with itself, the pairs show up in the same partitions, so less shuffling takes place and it takes 3 minutes to get the results. However, when joining RDD1 against RDD2, the pairs are scattered across partitions, resulting in a very expensive shuffle, and it always ends up failing with:
ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 168591 ms
Based on my results:
Should I try to partition the text file used to create RDD1 into smaller chunks and join those chunks with RDD2 separately?
Is there another way of joining two RDDs based on their value fields? If I use the original values as keys and join them with the join function, the value pairs are again scattered across partitions, which again results in a very expensive reduceByKey operation. e.g.
val RDD1 = sc.textFile("\\t1.txt",1000).map{line => val s = line.split("\t"); (s(1),s(0))}
val RDD2 = sc.textFile("\\t2.txt",1000).map{line => val s = line.split("\t"); (s(1),s(0))}
RDD1.join(RDD2).map(line => (line._2,1)).reduceByKey((a,b) => (a+b))
PSEUDO DATA SAMPLE:
KEY VALUE
1 13894
1 17376
1 15688
1 22434
1 2282
1 14970
1 11549
1 26027
1 2895
1 15052
1 20815
2 9782
2 3393
2 11783
2 22737
2 12102
2 10947
2 24343
2 28620
2 2486
2 249
2 3271
2 30963
2 30532
2 2895
2 13894
2 874
2 2021
3 6720
3 3402
3 25894
3 1290
3 21395
3 21137
3 18739
...
A SMALL EXAMPLE
RDD1
2 1
2 2
2 3
2 4
2 5
2 6
3 1
3 6
3 7
3 8
3 9
4 3
4 4
4 5
4 6
RDD2
21 1
21 2
21 5
21 11
21 12
21 10
22 7
22 8
22 13
22 9
22 11
BASED ON THIS DATA JOIN RESULTS:
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(3-22,1)
(4-21,1)
(2-21,1)
(3-21,1)
(3-22,1)
(3-22,1)
(2-21,1)
(3-22,1)
(2-21,1)
(4-21,1)
(2-21,1)
(3-21,1)
REDUCEBYKEY RESULTS:
(4-21,1)
(3-21,1)
(2-21,3)
(3-22,3)
Have you looked at using a cartesian join? You could maybe try something like below:
val rdd1 = sc.parallelize(for { x <- 1 to 3; y <- 1 to 5 } yield (x, y)) // sample RDD
val rdd2 = sc.parallelize(for { x <- 1 to 3; y <- 3 to 7 } yield (x, y)) // sample RDD with slightly displaced values from the first
val g1 = rdd1.groupByKey()
val g2 = rdd2.groupByKey()
val cart = g1.cartesian(g2).map { case ((key1, values1), (key2, values2)) =>
((key1, key2), (values1.toSet & values2.toSet).size)
}
When I try running the above example in a cluster, I see the following:
scala> rdd1.take(5).foreach(println)
...
(1,1)
(1,2)
(1,3)
(1,4)
(1,5)
scala> rdd2.take(5).foreach(println)
...
(1,3)
(1,4)
(1,5)
(1,6)
(1,7)
scala> cart.take(5).foreach(println)
...
((1,1),3)
((1,2),3)
((1,3),3)
((2,1),3)
((2,2),3)
The result indicates that for (key1, key2), there are 3 matching elements between the sets. Note that the result is always 3 here since the initialized input tuples' ranges overlapped by 3 elements.
The cartesian transformation does not cause a shuffle either, since it just iterates over the elements of each RDD and produces a cartesian product. You can see this by calling the toDebugString method on an example.
scala> val carts = rdd1.cartesian(rdd2)
carts: org.apache.spark.rdd.RDD[((Int, Int), (Int, Int))] = CartesianRDD[9] at cartesian at <console>:25
scala> carts.toDebugString
res11: String =
(64) CartesianRDD[9] at cartesian at <console>:25 []
| ParallelCollectionRDD[1] at parallelize at <console>:21 []
| ParallelCollectionRDD[2] at parallelize at <console>:21 []
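To apply the same pattern to the RDD1 and RDD2 from the question, here is a rough, untested sketch (recall RDD1 was built as (key, value) while RDD2 was built as (value, key), so RDD2 is swapped back to (key, value) first):
// Group each RDD's values by its original key, then count shared values per key pair
val commonCounts = RDD1.groupByKey()
  .cartesian(RDD2.map(_.swap).groupByKey())
  .map { case ((k1, vs1), (k2, vs2)) => (s"$k1-$k2", (vs1.toSet & vs2.toSet).size) }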
Related
I have a text file which contains the following data
3 5
10 20 30 40 50
0 0 0 2 5
5 10 10 10 10
Question:
The first line of the file gives the number of rows and the number of columns of the data.
For each column, print the sum of the column if no element of the column is prime; otherwise print zero.
Output:
0
30
40
0
0
Explanation:
(0 because the column 10 0 5 contains the prime number 5)
(30 because the column 20 0 10 has no prime number, so print 20+0+10=30), and likewise for all the other columns.
Please suggest a method to access the dataframe in a column-wise manner.
General idea: zip every value with its column index to create a pair RDD, then apply a reduceByKey (the key here is the column index), checking each number for primality along the way.
val rows = spark.sparkContext.parallelize(
Seq(
Array(10,20,30,40,50),
Array(0,0,0,2,5),
Array(5,10,10,10,10)
)
)
def isPrime(i: Int): Boolean = i>=2 && ! ((2 until i-1) exists (i % _ == 0))
// Turn prime values into None up front, so primality is checked on the original
// elements rather than on partial sums produced during the reduce.
val result = rows.flatMap{ arr =>
    arr.map(v => Option(v).filterNot(isPrime)).zipWithIndex.map(_.swap)
  }
  .reduceByKey{
    case (Some(a), Some(b)) => Some(a + b)
    case _                  => None
  }.map{ case (k, v) => k -> v.getOrElse(0) }
result.foreach(println)
Output (you'll have to collect the data in order to sort by column index):
(3,0)
(0,0)
(4,0)
(2,40)
(1,30)
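As a small follow-up sketch of that last step, collecting and ordering the per-column sums on the driver (using the result from above):
// Collect the (columnIndex, sum) pairs, sort by column index, and print in order
val ordered = result.collect().sortBy(_._1).map(_._2)
ordered.foreach(println)
// prints 0, 30, 40, 0, 0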
I have an input file that looks something like this:
1: 3 5 7
3: 6 9
2: 5
......
I hope to get two lists:
the first list is made up of the numbers before ":", the second list is made up of the numbers after ":".
The two lists for the above example are:
1 3 2
3 5 7 6 9 5
I have just written the following code:
val rdd = sc.textFile("input.txt");
val s = rdd.map(_.split(":"));
But I do not know how to implement the rest. Thanks.
I would use flatMap!
So,
val rdd = sc.textFile("input.txt")
val s = rdd.map(_.split(": ")) // I recommend including the space after the colon in the split pattern
val before_colon = s.map(x => x(0))
val after_colon = s.flatMap(x => x(1).split(" "))
Now you have two RDDs, one with the items from before the colon, and one with the items after the colon!
If it is possible for the part of the text before the colon to have multiple numbers, as in an example like 1 2 3: 4 5 6, I would write val before_colon = s.flatMap(x => x(0).split(" ")) instead.
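For reference, a hedged sketch of what collecting the two RDDs should show for the sample input above:
before_colon.collect()  // Array(1, 3, 2)
after_colon.collect()   // Array(3, 5, 7, 6, 9, 5)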
I have a dataset in a file in the form:
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698
4: 145
5: 8 57544 58089 60048 65880 284186 313376
6: 8
I need to transform this into something like the following using Spark and Scala, as part of preprocessing the data:
1 1664968
2 3
2 747213
2 1664968
2 4095634
2 5535664
3 9
3 77935
3 79583
3 84707
And so on....
Can anyone provide input on how this can be done?
The length of the rows in the original file varies, as shown in the dataset example above.
I am not sure how to go about doing this transformation.
I tried something like the code below, which gives me a pair of the key and the first element after the colon.
But I am not sure how to iterate over the entire data and generate all the pairs as needed.
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Graphx").setMaster("local"))
val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")
rawLinks.take(5).foreach(println)
val formattedLinks = rawLinks.map{ rows =>
val fields = rows.split(":")
val fromVertex = fields(0)
val toVerticesArray = fields(1).split(" ")
(fromVertex, toVerticesArray(1))
}
val topFive = formattedLinks.take(5)
topFive.foreach(println)
}
val rdd = sc.parallelize(List("1: 1664968","2: 3 747213 1664968 1691047 4095634 5535664"))
val keyValues = rdd.flatMap(line => {
val Array(key, values) = line.split(":",2)
for(value <- values.trim.split("""\s+"""))
yield (key, value.trim)
})
keyValues.collect
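For the two sample lines above, that collect should return something like:
Array((1,1664968), (2,3), (2,747213), (2,1664968), (2,1691047), (2,4095634), (2,5535664))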
Split each row into 2 parts and map over the variable number of values:
def transform(s: String): Array[String] = {
val Array(head, tail) = s.split(":", 2)
tail.trim.split("""\s+""").map(x => s"$head $x")
}
> transform("2: 3 747213 1664968 1691047 4095634 5535664")
// Array(2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)
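A quick sketch of applying transform across the whole file (path taken from the question):
// Expand each row into its "key value" pairs and inspect a few of them
val pairs = sc.textFile("src/main/resources/links-simple-sorted-top100.txt").flatMap(transform)
pairs.take(10).foreach(println)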
So this question is related to the question Transforming matrix format, scalding.
But now I want to perform the reverse operation, which I can do like this:
Tsv(in, ('row, 'col, 'v))
.read
.groupBy('row) { _.sortBy('col).mkString('v, "\t") }
.mapTo(('row, 'v) -> ('c)) { res : (Long, String) =>
val (row, v) = res
v }
.write(Tsv(out))
But there we have a problem with zeros. As we know, Scalding skips zero-valued fields. So, for example, given the matrix:
1 0 8
4 5 6
0 8 9
In Scalding format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we can only get:
1 8
4 5 6
8 9
And that's incorrect. So how can I deal with it? I see two possible options:
Find a way to add the zeros back (though I don't know how to insert that data).
Write my own operations on my own matrix format (not preferable, since I'm interested in Scalding's matrix operations and don't want to rewrite all of them myself).
Or maybe there are methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type
val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)
rows.crossWithTiny(cols)
.leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
.map('val1 -> 'val1) { v: V =>
if(v == null) // this value should be 0 in your type:
zero
else
v
}
.groupBy('row) {
_.toList[(Int, V)](('col, 'val1) -> 'cols)
}
.map('cols -> 'cols) { cols: List[(Int, V)] =>
cols.sortBy(_._1).map(_._2).mkString("\t")
}
.write(TypedTsv[(Int, String)]("output"))
Is there a generic way using Breeze to achieve what you can do using broadcasting in NumPy?
Specifically, if I have an operator I'd like to apply to two 3x4 matrices, I can apply that operation element-wise. However, what I have is a 3x4 matrix and a 3-element column vector. I'd like a function which produces a 3x4 matrix created from applying the operator to each element of the matrix with the element from the vector for the corresponding row.
So for a division:
2 4 6     2     1 2 3
3 6 9  /  3  =  1 2 3
If this isn't available, I'd be willing to look at implementing it.
You can use mapPairs to achieve what I 'think' you're looking for:
import breeze.linalg.{DenseMatrix, DenseVector}

val adder = DenseVector(1, 2, 3, 4)
val result = DenseMatrix.zeros[Int](3, 4).mapPairs({
case ((row, col), value) => {
value + adder(col)
}
})
println(result)
1 2 3 4
1 2 3 4
1 2 3 4
I'm sure you can adapt what you want from simple 'adder' above.
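For instance, a hedged sketch of adapting mapPairs to the division in the question, using a per-row divisor instead of a per-column adder:
import breeze.linalg.{DenseMatrix, DenseVector}
// Divide each element by the divisor for its row
val m = DenseMatrix((2, 4, 6), (3, 6, 9))
val divisor = DenseVector(2, 3)
val divided = m.mapPairs { case ((row, _), value) => value / divisor(row) }
println(divided)
// 1  2  3
// 1  2  3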
Breeze now supports broadcasting of this sort:
scala> val dm = DenseMatrix( (2, 4, 6), (3, 6, 9) )
dm: breeze.linalg.DenseMatrix[Int] =
2 4 6
3 6 9
scala> val dv = DenseVector(2,3)
dv: breeze.linalg.DenseVector[Int] = DenseVector(2, 3)
scala> dm(::, *) :/ dv
res4: breeze.linalg.DenseMatrix[Int] =
1 2 3
1 2 3
The * operator says which axis to broadcast along. Breeze doesn't allow implicit broadcasting, except for scalar types.
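As a small follow-on sketch (not from the original answer), other element-wise operators broadcast the same way; for example, adding dv to every column of the same dm should give:
dm(::, *) + dv
// 4  6   8
// 6  9  12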