add prefix to spark rdd elements - scala

I have two string elements in my RDD:
"53 45 61","0 1 2".
I would like to zip and map them together as key-value pairs, adding a prefix "C" to each of the keys.
Expected output:
C53 -> 0, C45 -> 1, C61 -> 2
Currently this is the code I am using
val prefix = "C"
newRDD = RDD.map(x=>(prefix + (x._1.split(" ")) zip x._2.split(" "))
The result I am receiving is:
53 -> 0, C45 -> 1, 61 -> 2.
What am I missing here?

You're currently adding your prefix to an Array(53, 45, 61) (I didn't know you could do that). Do you mean to do x._1.split(" ").map(prefix + _) to add it to each element instead?
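For illustration, a minimal sketch of the corrected map, assuming rdd is an RDD[(String, String)] holding pairs like the two strings above (variable names are mine):
val prefix = "C"
val newRDD = rdd.map { case (keys, values) =>
  // apply the prefix to each key before zipping with the values
  keys.split(" ").map(prefix + _) zip values.split(" ")
}
// for ("53 45 61", "0 1 2") this yields Array((C53,0), (C45,1), (C61,2))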

Scala: find character occurrences from a file

Problem:
suppose, I have a text file containing data like
TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT
And I want to find occurrences of 'A', 'T', 'AAA', etc. in it.
My Approach
val source = scala.io.Source.fromFile(filePath)
val lines = source.getLines().filter(char => char != '\n')
for (line <- lines) {
val aList = line.filter(ele => ele == 'A')
println(aList)
}
This will give me output like
AAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAA
My Question
How can I find the total count of occurrences of 'A', 'T', 'AAA', etc. here? Can I use map/reduce functions for that? How?
There is even a shorter way:
lines.map(_.count(_ == 'A')).sum
This counts the 'A's in each line and sums up the results.
By the way, no filter is needed here:
val lines = source.getLines()
And as Leo C mentioned in his comment, if you start with Source.fromFile(filePath) it can be just like this:
source.count(_ == 'A')
As SoleQuantum mentions in his comment, he wants to call count more than once. The problem here is that source is a BufferedSource, which is not a Collection, but just an Iterator, which can only be used (iterated) once.
So if you want to use the source more than once you have to translate it first into a Collection.
Your example:
val stream = Source.fromResource("yourdata").mkString
stream.count(_ == 'A') // 48
stream.count(_ == 'T') // 65
Remark: String is a Collection of Chars.
For more information check: iterators
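A tiny illustration of the single-use nature of an Iterator (my example, not from the original answer):
val it = Iterator('A', 'T', 'A')
it.count(_ == 'A') // 2
it.count(_ == 'A') // 0 -- the iterator has already been consumed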
And here is the solution to get the count for all Chars:
stream.toSeq
.filterNot(_ == '\n') // filter new lines
.groupBy(identity) // group by each char
.view.mapValues(_.length) // count each group > HashMap(T -> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, A -> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, G -> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG, C -> CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
.toMap // Map(T -> 65, A -> 48, G -> 36, C -> 61)
Or as suggested by jwvh:
stream
.filterNot(_ == '\n')
.groupMapReduce(identity)(_ => 1)(_ + _)
This is Scala 2.13, let me know if you have problems with your Scala version.
Ok after the last update of the question:
stream.toSeq
.filterNot(_ == '\n') // filter new lines
.foldLeft(("", Map.empty[String, Int])){case ((a, m), c ) =>
if(a.contains(c))
(a + c, m)
else
(s"$c",
m.updated(a, m.get(a).map(_ + 1).getOrElse(1)))
}._2 // you only want the Map -> HashMap( -> 1, CCCC -> 1, A -> 25, GGG -> 1, AA -> 4, GG -> 3, GGGGG -> 1, AAA -> 5, CCC -> 1, TTTT -> 1, T -> 34, CC -> 9, TTT -> 4, G -> 22, CCCCC -> 1, C -> 31, TT -> 7)
Short explanation:
The solution uses a foldLeft.
The initial value is a pair:
a String that holds the current run of characters (empty to start)
a Map from each run (a String) to its count (empty at the start)
We have 2 main cases:
the character is the same as the one in the current run: just append it to the current String.
the character is different: update the Map with the current String; the new character then becomes the start of the current String.
Quite complex, let me know if you need more help.
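To make the fold easier to follow, here is a hand trace on the short made-up input "AATA":
// accumulator = (current run, counts so far)
// start: ("",   Map())
// 'A' -> ("A",  Map("" -> 1))             // "" does not contain 'A', so "" gets counted
// 'A' -> ("AA", Map("" -> 1))             // same character, extend the run
// 'T' -> ("T",  Map("" -> 1, "AA" -> 1))
// 'A' -> ("A",  Map("" -> 1, "AA" -> 1, "T" -> 1))
// result (_2): Map("" -> 1, "AA" -> 1, "T" -> 1)
// note the "" entry and the final run "A" never being flushed, which matches the " -> 1" in the output above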
Since scala.io.Source.fromFile(filePath) produces a stream of chars, you can use the count(Char => Boolean) function directly on your source object.
val source = scala.io.Source.fromFile(filePath)
val result = source.count(_ == 'A')
You can use the partition method and then just call length on the first part.
val y = x.partition(_ == 'A')._1.length
You can get the count by doing the following:
lines.flatten.filter(_ == 'A').size
In general regular expressions are a very good tool to find sequences of characters in a string.
You can use the r method, defined with an implicit conversion over strings, to turn a string into a pattern, e.g.
val pattern = "AAA".r
Using it is then fairly easy. Assuming your sample input
val input =
"""TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT"""
Counting the number of occurrences of a pattern is straightforward and very readable:
pattern.findAllIn(input).size // returns 4
The iterator returned by regular expression operations can also be used for more complex operations via the matchData method, e.g. printing the index of each match:
pattern. // this code would print the following lines
findAllIn(input). // 98
matchData. // 125
map(_.start). // 131
foreach(println) // 165
You can read more about Regex in Scala in the API docs (here for version 2.13.1).
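One caveat worth adding (my note, not part of the original answer): findAllIn returns non-overlapping matches, so a run of four A's counts as only one 'AAA':
"AAA".r.findAllIn("AAAA").size // 1, not 2 -- matches do not overlap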

Spark rdd split doesn't return last column?

I have the following data:
17|ABC|3|89|89|0|0|2|
17|DFD|3|89|89|0|0|2|
17|RFG|3|89|89|0|0|2|
17|TRF|3|89|89|0|0|2|
When I use the following code, I just get 8 elements instead of 9, since the last one doesn't contain any value. I can't use DataFrames as my CSV is not fixed; every line can have a different number of elements. How can I get the last column value even if it is Null/None?
My current code:
data_rdd.filter(x => x contains '|').map{line => line.split('|')}.foreach(elem => {
println("size of element ->" + elem.size)
elem.foreach{elem =>
println(elem)
}
})
In both Scala and Java, split will not return any trailing empty strings by default. Instead, you can use a slightly different version of split with a second argument (overloaded to Scala and seen in the Java docs here).
The method definition is:
split(String regex, int limit)
The second argument here limits how many times the regex pattern is applied; using a negative number will apply it as many times as possible.
Therefore, change the code to use:
.map{line => line.split("\\|", -1)}
Note that this split function takes a regex and not a normal string or char.
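For comparison, here is what the default split (no limit argument) does with the same row (my example): the trailing empty string is dropped.
"17|ABC|3|89|89|0|0|2|".split("\\|")
// Array(17, ABC, 3, 89, 89, 0, 0, 2) -- only 8 elements, the trailing "" is gone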
You can split your string as below:
scala> "17|ABC|3|89|89|0|0|2|".split("\\|", -1)
res24: Array[String] = Array(17, ABC, 3, 89, 89, 0, 0, 2, "")
updated code:
data_rdd.filter(x => x contains '|').map{line => line.split("\\|", -1)}.foreach(elem => {
println("size of element ->" + elem.size)
elem.foreach{elem =>
println(elem)
}
})

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
zeroValue is added once for each partition and should be a neutral element - in the case of + it should be 0. The exact result will depend on the number of partitions, but it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and a whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus 5 for jobResult.
You know that Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark that it needs to support 3 partitions in this RDD and that will enable it to run computations using 3 independent executors in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then fold all these partition results from the 3 executors again with 5 as the initial value.
A normal Scala equivalent can be written as:
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))
val fold = listOfFolds.fold(5)(_ + _)
So... if you are using fold on RDDs you need to provide a zero value.
But then you will ask - why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. The thing is that this zero value for RDD[T] does not depend only on our type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to calculate "the largest number greater than 15, or 15" in our RDD.
Can we do that using reduce? The answer is NO. But we can do it using fold.
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
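A quick sketch of the difference on the example RDD (my illustration, not part of the original answer):
rdd1.reduce(Math.max)                                // 5  -- plain maximum of the elements
rdd1.fold(15) { case (acc, i) => Math.max(acc, i) }  // 15 -- no element exceeds 15, so the zero value wins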

Scala Processing a file in batches

I have a flat file which contains several million lines like the one below
59, 254, 2016-09-09T00:00, 1, 6, 3, 40, 18, 0
I want to process this file in batches of X rows at a time. So I wrote this code
def func(x: Int) = {
for {
batches <- Source.fromFile("./foo.txt").getLines().sliding(x, x)
} yield batches.map("(" + _ + ")").mkString(",")
}
func(2).foreach(println)
This code produces exactly the output I want: the function walks through the entire file, taking 2 rows at a time and batching them into one string.
(59, 828, 2016-09-09T00:00, 0, 8, 2, 52, 0, 0),(59, 774, 2016-09-09T00:00, 0, 10, 2, 51, 0, 0)
But when I see Scala pros write code, everything happens inside the for comprehension and you just yield the last thing from your comprehension.
So in order to be a Scala pro I changed my code:
for {
batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
line <- batches.map("(" + _ + ")").mkString(",")
} yield line
This produces 1 character per line and not the output I expected. Why did the code behavior totally change? At least on reading they look the same to me.
In the line line <- batches.map("(" + _ + ")").mkString(","), the right-hand side produces a String (the result of mkString), and the loop iterates over this string. When you iterate over a string, the individual items are characters, so in your case line is going to be a character. What you actually want is not to iterate over that string, but to assign it to the variable name line, which you can do by replacing the <- with =: line = batches.map("(" + _ + ")").mkString(",").
By the way, sliding(2,2) can be more clearly written as grouped(2).
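Putting both suggestions together, the comprehension from the question could look like this (a sketch against the same foo.txt):
for {
  batches <- Source.fromFile("./foo.txt").getLines().grouped(2)
  line = batches.map("(" + _ + ")").mkString(",")  // `=` binds the value, it does not iterate
} yield line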
@dhg has given the explanation; here's my suggestion on how this could be done another way:
for {
batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
batch = batches.map("(" + _ + ")")
} yield batch.mkString(",")
so batch would be a traversable consisting of the 2 wrapped lines (bound with = rather than <-, as explained above)

Scala comprehension from input

I am new to Scala and I am having trouble constructing a Map from input.
Here is my problem :
I am getting input with elevator information. It consists of n lines, each one containing the elevatorFloor number and the elevatorPosition on that floor.
Example:
0 5
1 3
4 5
So here I have 3 elevators: the first one is on floor 0 at position 5, the second one on floor 1 at position 3, etc.
Is there a way in Scala to put it in a Map without using var ?
What I get so far is a Vector of all the elevators' information:
val elevators = {
for{i <- 0 until n
j <- readLine split " "
} yield j.toInt
}
I would like to be able to split each line into two variables, "elevatorFloor" and "elevatorPos", and group them in a data structure (my guess is a Map would be the appropriate choice). I would like to get something looking like:
elevators: SomeDataStructure[Int,Int] = ( 0->5, 1 -> 3, 4 -> 5)
I would like to clarify that I know I could write Java-ish code, initialise a Map and then add the values to it, but I am trying to keep as close to functional programming as possible.
Thanks for the help or comments
You can do:
val res: Map[Int, Int] =
Source.fromFile("myfile.txt")
.getLines
.map { line =>
val Array(floor, position) = line.split(' ')
floor.toInt -> position.toInt
}.toMap
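Since the question reads the lines with readLine rather than from a file, a minimal adaptation might look like this (a sketch, assuming n is the number of elevators as in the question):
val elevators: Map[Int, Int] =
  (0 until n).map { _ =>
    // each line is "floor position", e.g. "0 5"
    val Array(floor, position) = scala.io.StdIn.readLine().split(' ')
    floor.toInt -> position.toInt
  }.toMap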