I am learning Apache Spark with Scala and would like to use it to process a DNA data set that spans multiple lines like this:
ATGTAT
ACATAT
ATATAT
I want to map this into groups of a fixed size k and count the groups. So for k=3, we would get groups of each character with the next two characters:
ATG TGT GTA TAT ATA TAC
ACA CAT ATA TAT ATA TAT
ATA TAT ATA TAT
...then count the groups (like word count):
(ATA,5), (TAT,5), (TAC,1), (ACA,1), (CAT,1), (ATG,1), (TGT,1), (GTA,1)
The problem is that the "words" span multiple lines, as does TAC in the example above. It spans the line wrap. I don't want to just count the groups in each line, but in the whole file, ignoring line endings.
In other words, I want to process the entire sequence as a sliding window of width k over the entire file as though there were no line breaks. The problem is looking ahead (or back) to the next RDD row to complete a window when I get to the end of a line.
Two ideas I had were:
Append k-1 characters from the next line:
ATGTATAC
ACATATAT
ATATAT
I tried this with the Spark SQL lead() function, but when I tried executing a flatMap, I got a NotSerializableException for WindowSpec. Is there any other way to reference the next line? Would I need to write a custom input format?
Read the entire sequence in as a single line (or join lines after reading):
ATGTATACATATATATAT
Is there a way to read multiple lines so they can be processed as one? If so, would it all need to fit into the memory of a single machine?
I realize either of these could be done as a pre-processing step. I was wondering what the best way to do it within Spark is. Once I have the data in either of these formats, I know how to do the rest, but I am stuck here.
You can make an RDD of single-character strings instead of joining the lines into one, since joining would produce a single string, which cannot be distributed:
val rdd = sc.textFile("gene.txt")
// rdd: org.apache.spark.rdd.RDD[String] = gene.txt MapPartitionsRDD[4] at textFile at <console>:24
Then simply use flatMap to split the lines into individual characters:
rdd.flatMap(_.split("")).collect
// res4: Array[String] = Array(A, T, G, T, A, T, A, C, A, T, A, T, A, T, A, T, A, T)
A more complete solution borrowed from this answer:
val rdd = sc.textFile("gene.txt")
// create the sliding 3 grams for each partition and record the edges
val rdd1 = rdd.flatMap(_.split("")).mapPartitionsWithIndex((i, iter) => {
  val slideList = iter.toList.sliding(3).toList
  Iterator((slideList, (slideList.head, slideList.last)))
})
// collect the edge values, concatenate edges from adjacent partitions and broadcast it
val edgeValues = rdd1.values.collect
val sewedEdges = edgeValues zip edgeValues.tail map { case (x, y) => {
  (x._2 ++ y._1).drop(1).dropRight(1).sliding(3).toList
}}
val sewedEdgesMap = sc.broadcast(
  (0 until rdd1.partitions.size) zip sewedEdges toMap
)
// sew the edge values back to the result
rdd1.keys.mapPartitionsWithIndex((i, iter) => iter ++ List(sewedEdgesMap.value.getOrElse(i, Nil))).
  flatMap(_.map(_ mkString "")).collect
// res54: Array[String] = Array(ATG, TGT, GTA, TAT, ATA, TAC, ACA, CAT, ATA, TAT, ATA, TAT, ATA, TAT, ATA, TAT)
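To finish the word-count step and check the result against the counts expected in the question, a minimal local sketch might look like the following. It uses plain Scala collections only (nothing is distributed), and the names `kmers` and `kmerCounts` are illustrative. On an RDD of the sewn 3-grams, the same counting would typically be done with `map(_ -> 1).reduceByKey(_ + _)`.

```scala
// Local reference computation with plain Scala collections (not distributed).
// The lines mirror the sample file from the question.
val lines = List("ATGTAT", "ACATAT", "ATATAT")
val k = 3

// Join the lines so windows can span line breaks, then slide a window of width k.
val kmers = lines.mkString.sliding(k).toList

// Count occurrences of each k-mer, like a word count.
val kmerCounts: Map[String, Int] =
  kmers.groupBy(identity).map { case (kmer, occurrences) => kmer -> occurrences.size }

kmerCounts.toList.sortBy(_._1).foreach(println)
```

This only works when the whole sequence fits in memory on one machine, so it is useful as a check on small inputs rather than as a replacement for the partition-sewing approach.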
Related
I am trying to use Scala's foldLeft function to compare a Stream with a List of Lists.
This is a snippet of the code I have
def freqParse(pairs: (Pair,Int,String), record: String): (Pair,Int,String) ={
val m: Pair = ("","")
val t: FreqPairs = Map((m,0))
(pairs._1,pairs._2,pairs._3)
}
val freqItems = items.map(v => (v._1)).toList
val cross = freqItems.flatMap(x => freqItems.map(y => (x, y))) // cross product to get pair of frequent items
val freq = lines.foldLeft(pairs,0,delim)(freqParse)
cross is basically a list where each element is a pair of strings: List((a,b), (a,c), ...).
lines is an input stream with a record per line (25902 lines in total).
I want to count how many times each pair (or each element in cross) occurs in the entirety of the stream. Essentially comparing all elements in cross to all elements in lines.
I decided on foldLeft because that way I can take each line in the stream, split it at a delimiter, and then check whether both elements of a pair in cross appear in it.
I am able to split each record in lines, but I don't know how to pass the cross variable to the function to begin the comparison.
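One possible way to get cross into the fold is to let the folding function close over it rather than passing it as an argument. The sketch below assumes comma-delimited records and uses small made-up data; `countPairs` and the sample values are hypothetical and not part of the original code.

```scala
// Hypothetical sample data standing in for the real stream and frequent items.
val delim = ","
val freqItems = List("a", "b", "c")
// Cross product of the frequent items, built as in the question.
val cross = freqItems.flatMap(x => freqItems.map(y => (x, y)))

def countPairs(lines: Iterator[String]): Map[(String, String), Int] =
  lines.foldLeft(Map.empty[(String, String), Int]) { (acc, record) =>
    val items = record.split(delim).map(_.trim).toSet
    // `cross` is captured by the closure; bump the count of every pair
    // whose two elements both occur in this record.
    cross.foldLeft(acc) { (m, pair) =>
      if (items(pair._1) && items(pair._2)) m.updated(pair, m.getOrElse(pair, 0) + 1)
      else m
    }
  }

val freq = countPairs(Iterator("a,b,c", "a,c", "b,d"))
freq.toList.sortBy(_._1).foreach(println)
```

The accumulator here is just the map of counts, so the fold no longer needs to carry the delimiter or a counter in a tuple.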
I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
  .map( _.sortBy(x => {
    val index = ordering_br.value.indexOf(x)
    if(index == -1) Int.MaxValue else index
  }))
Note that indexOf returns -1 if the element is not found in the list. If we left it as is, all elements that are not found would end up at the beginning. I understand that you want them at the end, so I replace -1 with a large number.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)
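If the ordering list grows, indexOf scans it once for every element being sorted. A possible variation, sketched here with plain Scala collections so it runs without a cluster, is to precompute a position lookup once. The names `position` and `sortByOrdering` are illustrative; in Spark, the Map would be the value that gets broadcast, and the function would be applied inside other_rdd.map.

```scala
val ordering = Seq("A", "B", "C", "D")
// Element -> position lookup, built once.
val position: Map[String, Int] = ordering.zipWithIndex.toMap

// Unknown elements get Int.MaxValue so they sort last.
// sortBy is stable, so unknown elements keep their relative order.
def sortByOrdering(xs: Seq[String]): Seq[String] =
  xs.sortBy(x => position.getOrElse(x, Int.MaxValue))

val sorted = Seq(
  Seq("C", "B", "F", "K"),
  Seq("B", "A", "Z", "M"),
  Seq("X", "T", "D", "C")
).map(sortByOrdering)

sorted.foreach(println)
```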
I would like to split the lines of my RDD on commas and access a predefined set of elements.
For example, I have a RDD like that:
a, b, c, d
e, f, g, h
and I need to split each line, then access the first and fourth elements of the first line and the second and third elements of the second line, to get this resulting RDD:
a, d
f, g
I can't hard-code "1" and "4" in my code, which is why a solution like this won't work:
rdd.map { line => val words = line.split(","); (words(0), words(3)) }
Let's assume I have a second RDD with the same number of lines, which contains the positions of the elements I want to get for each line:
1,4
2,3
Is there a way to get my elements?
If you have a second RDD that already has the numbers of the groups you want for each line, you could zip them.
From Spark docs:
<U> RDD<scala.Tuple2<T,U>> zip(RDD<U> other, scala.reflect.ClassTag<U> evidence$13)
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
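So in your example, a, b, c, d would be in a key-value pair with 1,4 and e, f, g, h with 2,3. So you could do something like:
val groupNumbers = lettersRDD zip numbersRDD
groupNumbers.map { case (letters, positions) =>
  // parse the 1-based positions, e.g. "1,4" becomes Seq(1, 4)
  val numbers: Seq[Int] = positions.split(",").map(_.trim.toInt).toSeq
  val words = letters.split(",").map(_.trim)
  // the positions in the question are 1-based, so subtract 1
  (words(numbers.head - 1), words(numbers(1) - 1))
}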
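The per-line logic can be checked locally before running it on RDDs. The sketch below uses plain Scala lists in place of lettersRDD and numbersRDD and assumes the positions are 1-based, as in the question; `pick` is an illustrative helper name. Note that RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition.

```scala
val letters = List("a, b, c, d", "e, f, g, h")
val numbers = List("1,4", "2,3")

// Select the fields of `line` at the 1-based positions listed in `positions`.
def pick(line: String, positions: String): List[String] = {
  val words = line.split(",").map(_.trim)
  positions.split(",").map(_.trim.toInt).map(p => words(p - 1)).toList
}

// Pair each line with its positions, as zip does for the two RDDs.
val picked = letters.zip(numbers).map { case (line, positions) => pick(line, positions) }

picked.foreach(row => println(row.mkString(", ")))
```

Because `pick` handles any number of positions, it is not limited to exactly two elements per line.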
I am given an input file in the following format:
(a,[b,c,d])
(b,[d,a])
How can I format this input to get values in the form
key => List()
The following code splits the lines on spaces:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
How can I store this kind of formatted input?
To tackle this I started with multiple data elements with and without whitespace separation.
%> cat junk.txt
(a,[b,c,d,e]) (w,[x,y,z])
(q,[wert,cv])(xx,[aa])
Then I opened the file and split the input on every leading ( paren without consuming the character.
val input = io.Source.fromFile("junk.txt")
.getLines()
.flatMap(_.split("(?=\\()"))
I also need a way to recognize the pattern I'm looking for.
val dataRE = "\\(([^,]+),\\[([^\\]]+)]".r.unanchored
Now to transform the data from Strings to Maps.
input.collect{case dataRE(k,v) => k -> v.split(",").toList}.toMap
Result: Map[String,List[String]]
Map(a -> List(b, c, d, e), w -> List(x, y, z), q -> List(wert, cv), xx -> List(aa))
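Since the question reads the file with sc.textFile, the same two regular expressions can be wrapped in a per-line function that could, presumably, be passed to flatMap on the RDD. This is a minimal sketch exercised on plain strings only; `parseLine` is an illustrative name.

```scala
// Same pattern as in the answer: captures the key and the comma-separated list body.
val dataRE = "\\(([^,]+),\\[([^\\]]+)]".r.unanchored

// Split a line before every '(' (the lookahead keeps the paren), then parse each chunk.
// Chunks that do not match the pattern are silently skipped by collect.
def parseLine(line: String): List[(String, List[String])] =
  line.split("(?=\\()").toList.collect {
    case dataRE(k, v) => k -> v.split(",").toList
  }

val parsed = List("(a,[b,c,d,e]) (w,[x,y,z])", "(q,[wert,cv])(xx,[aa])")
  .flatMap(parseLine)
  .toMap

println(parsed)
```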
My file is:
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
There are 7 rows and 5 columns (0, 1, 2, 3, 4).
I want the output to be:
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be of type Map[Int, Set[String]].
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
1. Split each row into an array of strings, then change this into an array of sets of the single string value for each column. This can all be done with one map, and distributed.
2. Now reduce this using an operation that merges the set for each column in turn. This can also be distributed.
3. Turn the single row that results into a Map.
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
1. Zip the list with the index, since that's what we need as the key of the map.
2. The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
3. Then we have a list of (key, value) pairs, which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s=>Set(s)))
  .reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
(This can be converted into a Map as before)
An earlier one-liner works neatly (transpose is exactly what's needed here) but is very difficult to distribute, since transpose inherently needs to visit every row:
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
Maybe this does the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val b = new Array[Map[String, Set[String]]](5)
for (i <- 0 to 4)
b(i) = Map(i.toString -> (Set() ++ (for (s <- a) yield s.split(",")(i))) )
println(b.mkString("\n"))
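The question asks for a single Map[Int, Set[String]], whereas the loop above yields an array of one-entry maps with String keys. A possible variation that produces the requested type is sketched below with local collections; `rows` and `columnValues` are illustrative names, and the sketch assumes every row has the same number of fields.

```scala
val rows = Array(
  "sunny,hot,high,FALSE,no",
  "sunny,hot,high,TRUE,no",
  "overcast,hot,high,FALSE,yes",
  "rainy,mild,high,FALSE,yes",
  "rainy,cool,normal,FALSE,yes",
  "rainy,cool,normal,TRUE,no",
  "overcast,cool,normal,TRUE,yes").map(_.split(","))

// For each column index, collect the distinct values seen in that column.
val columnValues: Map[Int, Set[String]] =
  rows.head.indices.map(i => i -> rows.map(_(i)).toSet).toMap

columnValues.toList.sortBy(_._1).foreach(println)
```

The values keep the case found in the file (FALSE and TRUE), so they would need a `.toLowerCase` if the lowercase form shown in the question is required.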