Group pairs of elements in a List - Scala

I have a list of pairs in Scala, which I parallelize into an RDD:
val seqRDD = sc.parallelize(Seq(("a","b"),("b","c"),("c","a"),("d","b"),("e","c"),("f","b"),("g","a"),("h","g"),("i","e"),("j","m"),("k","b"),("l","m"),("m","j")))
I group by the second element to compute a particular statistic and flatten the result into one list.
val checkItOut = seqRDD.groupBy(each => (each._2))
.map(each => each._2.toList)
.collect
.flatten
.toList
The output looks like this:
checkItOut: List[(String, String)] = List((c,a), (g,a), (a,b), (d,b), (f,b), (k,b), (m,j), (b,c), (e,c), (i,e), (j,m), (l,m), (h,g))
Now, what I'm trying to do is "group" all elements (not pairs) that are connected to each other through any pair into one list.
For example:
c is with a in one pair, and a is with g in another, so (a, c, g) are connected. Then c is also paired with b and e, b is paired with a, d, f, and k, and those appear with other characters in further pairs. I want all of them in one list.
I know this can be done with a BFS traversal, but I'm wondering: is there an API in Spark that does this?

GraphX's Connected Components does exactly this: http://spark.apache.org/docs/latest/graphx-programming-guide.html#connected-components
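A minimal sketch of how this could look on the example data (untested; the hashCode-based vertex ids are an assumption for illustration, and happen to be distinct for single-character strings):
import org.apache.spark.graphx.{Edge, Graph}
// assumption: derive a numeric vertex id from each letter
def vertexId(s: String): Long = s.hashCode.toLong
val vertices = seqRDD.flatMap { case (a, b) => Seq(a, b) }.distinct.map(s => (vertexId(s), s))
val edges = seqRDD.map { case (src, dst) => Edge(vertexId(src), vertexId(dst), 1) }
val graph = Graph(vertices, edges)
// connectedComponents labels every vertex with the smallest vertex id in its component;
// group the letters by that label to get one list per connected component
val components = graph.connectedComponents().vertices
  .join(vertices)
  .map { case (_, (component, letter)) => (component, letter) }
  .groupByKey()
  .values
  .map(_.toList.sorted)
components.collect.foreach(println)
// expected: two components, List(a, b, c, d, e, f, g, h, i, k) and List(j, l, m)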

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small, since you talk about broadcasting it. In that case, you are right: broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
  .map(_.sortBy(x => {
    val index = ordering_br.value.indexOf(x)
    if (index == -1) Int.MaxValue else index
  }))
Note that indexOf returns -1 if the element is not found in the list. If we left it as is, all not-found elements would end up at the beginning. Since you want them at the end, I replace -1 with a large number.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)
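A side note: indexOf does a linear scan of the ordering for every element, so for a large ordering you could broadcast a value-to-position map instead (indexMap_br and result2 are illustrative names):
// sketch: precompute value -> position so each lookup is a constant-time map access
val indexMap_br = sc.broadcast(ordering.zipWithIndex.toMap)
val result2 = other_rdd.map(_.sortBy(x => indexMap_br.value.getOrElse(x, Int.MaxValue)))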

Is it possible to extract the substream key in Akka Streams?

I can't seem to find any documentation on this, but I know that Akka Streams keeps the keys used to group a stream into substreams in memory when calling groupBy. Is it possible to extract those keys from a substream? Say I create a bunch of substreams from my main stream, pass each through a fold that counts its elements, and then store the count in a class. Can I get the key of the substream to also pass to that class? Or is there a better way of doing this? I need to count the elements per substream, but I also need to record which group each count belongs to.
A nice example is shown in the stream-cookbook:
val counts: Source[(String, Int), NotUsed] = words
// split the words into separate streams first
.groupBy(MaximumDistinctWords, identity)
// pair each word with an initial count of 1
.map(_ -> 1)
// add counting logic to the streams
.reduce((l, r) => (l._1, l._2 + r._2))
// get a stream of word counts
.mergeSubstreams
Then the following:
val words = Source(List("Hello", "world", "let's", "say", "again", "Hello", "world"))
counts.runWith(Sink.foreach(println))
Will print:
(world,2)
(Hello,2)
(let's,1)
(again,1)
(say,1)
Another example I thought of was counting numbers by their remainders modulo 4. So the following, for example:
Source(0 to 101)
.groupBy(10, x => x % 4)
.map(e => e % 4 -> 1)
.reduce((l, r) => (l._1, l._2 + r._2))
.mergeSubstreams.to(Sink.foreach(println)).run()
will print:
(0,26)
(1,26)
(2,25)
(3,25)
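Coming back to the fold-into-a-class scenario from the question, one option (a sketch; GroupCount is a hypothetical case class, not part of the Akka API) is to carry the key inside the fold's accumulator. When grouping with identity, every element of a substream equals its key, so it can simply be recorded while counting:
final case class GroupCount(key: String, count: Int)

val groupCounts: Source[GroupCount, NotUsed] = words
  .groupBy(MaximumDistinctWords, identity)
  // each element of a substream is the key itself, so record it while counting
  .fold(GroupCount("", 0))((acc, word) => GroupCount(word, acc.count + 1))
  .mergeSubstreams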

Split rdd and access subgroup of elements

I would like to split each line of my RDD on commas and access a predefined set of elements.
For example, I have an RDD like this:
a, b, c, d
e, f, g, h
and I need to split each line, then access the first and fourth elements of the first line and the second and third elements of the second line, to get this resulting RDD:
a, d
f, g
I can't hard-code the positions in my code, which is why a solution like this won't work:
rdd.map { line => val words = line.split(","); (words(0), words(3)) }
Let's assume I have a second RDD with the same number of lines, containing the positions of the elements I want for each line:
1,4
2,3
Is there a way to get my elements?
If you have a second RDD that already has the positions you want for each line, you could zip them.
From Spark docs:
<U> RDD<scala.Tuple2<T,U>> zip(RDD<U> other, scala.reflect.ClassTag<U> evidence$13)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
So in your example, a, b, c, d would be in a key-value pair with 1,4, and e, f, g, h with 2,3. So you could do something like:
val groupNumbers = lettersRDD zip numbersRDD
val result = groupNumbers.map { case (letters, positions) =>
  // the positions are 1-based, so subtract 1 to get 0-based indices
  val numbers = positions.split(",").map(_.trim.toInt - 1)
  val words = letters.split(",").map(_.trim)
  (words(numbers(0)), words(numbers(1)))
}
Note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, which fits your case of two RDDs with matching lines.

Multiline Spark sliding window

I am learning Apache Spark with Scala and would like to use it to process a DNA data set that spans multiple lines like this:
ATGTAT
ACATAT
ATATAT
I want to map this into groups of a fixed size k and count the groups. So for k=3, we would get groups of each character with the next two characters:
ATG TGT GTA TAT ATA TAC
ACA CAT ATA TAT ATA TAT
ATA TAT ATA TAT
...then count the groups (like word count):
(ATA,5), (TAT,5), (TAC,1), (ACA,1), (CAT,1), (ATG,1), (TGT,1), (GTA,1)
The problem is that the "words" span multiple lines, as does TAC in the example above. It spans the line wrap. I don't want to just count the groups in each line, but in the whole file, ignoring line endings.
In other words, I want to process the entire sequence as a sliding window of width k over the entire file as though there were no line breaks. The problem is looking ahead (or back) to the next RDD row to complete a window when I get to the end of a line.
Two ideas I had were:
Append k-1 characters from the next line:
ATGTATAC
ACATATAT
ATATAT
I tried this with the Spark SQL lead() function, but when I tried executing a flatMap, I got a NotSerializableException for WindowSpec. Is there any other way to reference the next line? Would I need to write a custom input format?
Read the entire sequence in as a single line (or join lines after reading):
ATGTATACATATATATAT
Is there a way to read multiple lines so they can be processed as one? If so, would it all need to fit into the memory of a single machine?
I realize either of these could be done as a pre-processing step, but I am wondering what the best way is to do it within Spark. Once I have it in either of these formats, I know how to do the rest, but I am stuck here.
You can make an RDD of single-character strings instead of joining the lines into one line, since joining would produce a single string, which cannot be distributed:
val rdd = sc.textFile("gene.txt")
// rdd: org.apache.spark.rdd.RDD[String] = gene.txt MapPartitionsRDD[4] at textFile at <console>:24
Simply use flatMap to split the lines into a list of single characters:
rdd.flatMap(_.split("")).collect
// res4: Array[String] = Array(A, T, G, T, A, T, A, C, A, T, A, T, A, T, A, T, A, T)
A more complete solution borrowed from this answer:
val rdd = sc.textFile("gene.txt")
// create the sliding 3 grams for each partition and record the edges
val rdd1 = rdd.flatMap(_.split("")).mapPartitionsWithIndex((i, iter) => {
  val slideList = iter.toList.sliding(3).toList
  Iterator((slideList, (slideList.head, slideList.last)))
})
// collect the edge values, concatenate edges from adjacent partitions and broadcast it
val edgeValues = rdd1.values.collect
val sewedEdges = edgeValues zip edgeValues.tail map { case (x, y) =>
  (x._2 ++ y._1).drop(1).dropRight(1).sliding(3).toList
}
val sewedEdgesMap = sc.broadcast(
  ((0 until rdd1.partitions.size) zip sewedEdges).toMap
)
// sew the edge values back to the result
rdd1.keys.mapPartitionsWithIndex((i, iter) => iter ++ List(sewedEdgesMap.value.getOrElse(i, Nil)))
  .flatMap(_.map(_.mkString(""))).collect
// res54: Array[String] = Array(ATG, TGT, GTA, TAT, ATA, TAC, ACA, CAT, ATA, TAT, ATA, TAT, ATA, TAT, ATA, TAT)
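From there, finishing the count is an ordinary word count; a hypothetical continuation of the snippet above, replacing the final collect:
val counts = rdd1.keys
  .mapPartitionsWithIndex((i, iter) => iter ++ List(sewedEdgesMap.value.getOrElse(i, Nil)))
  .flatMap(_.map(_.mkString("")))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.collect.foreach(println)
// e.g. (ATA,5), (TAT,5), (ATG,1), (TGT,1), (GTA,1), (TAC,1), (ACA,1), (CAT,1)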

Simplest way to define a type that is a sequence of a specific number of elements in Scala?

Suppose I'm doing something like the following:
val a = complicatedChainOfSteps("c")
val b = complicatedChainOfSteps("d")
I'm interested in writing code like the following (to reduce code and copy/paste errors):
val Seq(a, b) = Seq("c", "d").map(complicatedChainOfSteps(_))
but having the compiler ensure that the number of elements matches, so the following don't compile:
val Seq(a, b) = Seq("c", "d", "e").map(s => s + s)
val Seq(a, b) = Seq("c").map(s => s + s)
I know that using tuples instead to ensure that the number of elements matches works when performing multiple assignment (e.g., val (a, b) = ("c", "d")), but you cannot map over tuples (which makes sense because they have heterogeneous types).
I also know I can just define my own types for a sequence of 2 elements, a sequence of 3 elements, and so on, but is there a convenient built-in way of doing this? If not, what's the simplest way to define a type that is a sequence of a specific number of elements?
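The standard library has no fixed-length sequence type, but since tuple arity is checked at compile time, one lightweight workaround is a helper that maps a single function over both elements of a pair (map2 below is a hypothetical helper, not a standard method):
// hypothetical helper: apply one function to both elements of a pair
def map2[A, B](t: (A, A))(f: A => B): (B, B) = (f(t._1), f(t._2))

val (a, b) = map2(("c", "d"))(complicatedChainOfSteps)
// passing a 3-tuple (or a 1-tuple) here is a compile-time type error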