Scala: find character occurrences from a file

Problem:
Suppose I have a text file containing data like:
TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT
And I want to find the occurrences of 'A', 'T', 'AAA', etc. in it.
My Approach
val source = scala.io.Source.fromFile(filePath)
val lines = source.getLines().filter(char => char != '\n')
for (line <- lines) {
  val aList = line.filter(ele => ele == 'A')
  println(aList)
}
This will give me output like
AAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAA
My Question
How can I find the total count of occurrences of 'A', 'T', 'AAA', etc. here? Can I use map/reduce functions for that? How?

There is an even shorter way:
lines.map(_.count(_ == 'A')).sum
This counts all A of each line, and sums up the result.
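As a self-contained sketch, with hypothetical in-memory lines standing in for the file:

```scala
// Hypothetical sample lines standing in for source.getLines()
val lines = List("TATTG", "AAGAG", "GATGG")

// Count the 'A's per line, then sum the per-line counts
val total = lines.map(_.count(_ == 'A')).sum
// total == 5
```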
By the way, no filter is needed here:
val lines = source.getLines()
And as Leo C mentioned in his comment, if you start with Source.fromFile(filePath), it can be just:
source.count(_ == 'A')
As SoleQuantum mentions in his comment, he wants to call count more than once. The problem here is that source is a BufferedSource, which is not a Collection but just an Iterator, and an Iterator can only be traversed once.
So if you want to use the source more than once, you have to turn it into a Collection first.
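A minimal sketch (with hypothetical data) of this single-pass behaviour:

```scala
// An Iterator is exhausted after one traversal
val it = Iterator('A', 'T', 'A')
val first  = it.count(_ == 'A') // 2
val second = it.count(_ == 'A') // 0 -- the iterator is already consumed

// Materializing into a Collection (here a String) allows repeated queries
val chars = "ATA"
val a = chars.count(_ == 'A') // 2
val t = chars.count(_ == 'T') // 1
```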
Your example:
val stream = Source.fromResource("yourdata").mkString
stream.count(_ == 'A') // 48
stream.count(_ == 'T') // 65
Remark: a String is a Collection of Chars.
For more information check: iterators
And here is the solution to get the count for all Chars:
stream.toSeq
  .filterNot(_ == '\n')     // filter out newlines
  .groupBy(identity)        // group by each char, e.g. HashMap(T -> "TTT...", A -> "AAA...", G -> "GGG...", C -> "CCC...")
  .view.mapValues(_.length) // count each group
  .toMap                    // Map(T -> 65, A -> 48, G -> 36, C -> 61)
Or as suggested by jwvh:
stream
  .filterNot(_ == '\n')
  .groupMapReduce(identity)(_ => 1)(_ + _)
This is Scala 2.13, let me know if you have problems with your Scala version.
OK, after the last update of the question:
stream.toSeq
  .filterNot(_ == '\n') // filter out newlines
  .foldLeft(("", Map.empty[String, Int])) { case ((a, m), c) =>
    if (a.contains(c))
      (a + c, m)
    else
      (s"$c", m.updated(a, m.get(a).map(_ + 1).getOrElse(1)))
  }._2 // you only want the Map -> HashMap( -> 1, CCCC -> 1, A -> 25, GGG -> 1, AA -> 4, GG -> 3, GGGGG -> 1, AAA -> 5, CCC -> 1, TTTT -> 1, T -> 34, CC -> 9, TTT -> 4, G -> 22, CCCCC -> 1, C -> 31, TT -> 7)
Short explanation:
The solution uses a foldLeft.
The initial value is a pair:
a String that holds the current run of identical characters (empty at the start)
a Map from run Strings to their counts (empty at the start)
We have 2 main cases:
the character is the same as the ones already in the current String: just append the character to that String.
the character is different: update the Map with the current String; the new character now becomes the current String.
Quite complex, let me know if you need more help.

Since scala.io.Source.fromFile(filePath) produces a stream of chars, you can use the count(Char => Boolean) function directly on your source object.
val source = scala.io.Source.fromFile(filePath)
val result = source.count(_ == 'A')

You can use the partition method and then just take the length of the matching half:
val y = x.partition(_ == 'A')._1.length

You can get the count by doing the following:
lines.flatten.filter(_ == 'A').size
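Equivalently, count avoids building the intermediate filtered collection; a sketch with hypothetical lines:

```scala
val lines = List("TAT", "AAG") // hypothetical input lines
val total = lines.flatten.count(_ == 'A')
// total == 3
```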

In general regular expressions are a very good tool to find sequences of characters in a string.
You can use the r method, defined with an implicit conversion over strings, to turn a string into a pattern, e.g.
val pattern = "AAA".r
Using it is then fairly easy. Assuming your sample input
val input =
"""TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT"""
Counting the number of occurrences of a pattern is straightforward and very readable:
pattern.findAllIn(input).size // returns 4
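Note that findAllIn counts non-overlapping matches. If overlapping runs should count (e.g. AAAA containing two AAA hits), one sketch is a zero-width lookahead pattern:

```scala
// "AAAA" contains two overlapping "AAA" runs but only one non-overlapping match
"AAA".r.findAllIn("AAAA").size     // 1
"(?=AAA)".r.findAllIn("AAAA").size // 2 -- the lookahead matches at each start position
```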
The iterator returned by regular expressions operations can also be used for more complex operations using the matchData method, e.g. printing the index of each match:
pattern.            // this code would print the following lines
  findAllIn(input). //  98
  matchData.        // 125
  map(_.start).     // 131
  foreach(println)  // 165
You can read more on Regex in Scala in the API docs (here for version 2.13.1).

Related

Saving to a custom output format in Spark / Hadoop

I have one RDD which contains multiple data structures, where one of these data structures is a Map[String, Int].
To visualize it easily, I get the following after a map transformation:
val data = ... // This is an RDD[Map[String, Int]]
In one of the elements of this RDD, the Map contains the following:
*key value*
map_id -> 7753
Oscar -> 39
Jaden -> 13
Thomas -> 1
Chris -> 52
And then it contains other names and numbers in other elements of the RDD; each map contains a certain map_id. Anyhow, if I simply do data.saveAsTextFile(path), I will get the following output in my file:
Map(map_id -> 7753, Oscar -> 39, Jaden -> 13, Thomas -> 1, Chris -> 52)
Map(...)
Map(...)
However, I would like to format it as the following:
---------------------------
map_id: 7753
---------------------------
Oscar: 39
Jaden: 13
Thomas: 1
Chris: 52
---------------------------
map_id: <some other id>
---------------------------
Name: nbr
Name2: nbr2
Basically, the map_id acts as some kind of header, then the contents, one line of space, and then the next element.
To my question: the data RDD only has two options, save as a text file or as an object file, and neither, as far as I can see, supports customizing the formatting. How could I go about doing this?
You can just map to String and write the result. For example:
def format(map: Map[String, Int]): String = {
  val id = map.get("map_id").map(_.toString).getOrElse("unknown")
  val content = map.collect {
    case (k, v) if k != "map_id" => s"$k: $v"
  }.mkString("\n")
  s"""|---------------------------
      |map_id: $id
      |---------------------------
      |$content
      |""".stripMargin
}
data.map(format(_)).saveAsTextFile(path)
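The format function can be exercised without a Spark context; here is a sketch that repeats its definition so the snippet is self-contained, applied to a small hypothetical map:

```scala
def format(map: Map[String, Int]): String = {
  val id = map.get("map_id").map(_.toString).getOrElse("unknown")
  val content = map.collect {
    case (k, v) if k != "map_id" => s"$k: $v"
  }.mkString("\n")
  s"""|---------------------------
      |map_id: $id
      |---------------------------
      |$content
      |""".stripMargin
}

println(format(Map("map_id" -> 7753, "Oscar" -> 39)))
// prints the header block for map_id 7753 followed by "Oscar: 39"
```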

Scala - have List of String, must split by comma, and put into Map

I have a List of the form: ("string,num", "string,num", ...)
I have found solutions online for how to do this with a single string, but have not been able to adapt it to a list of strings.
Also, numerical values should be cast to Int/Double before being mapped.
I would appreciate any help.
This is a perfect job for a fold
// Your input
val lines = List("a,1", "b,2", "gretzky,99", "tyler,11")

// Fold over the lines, and insert them into a map
val map = lines.foldLeft(Map.empty[String, Int]) {
  case (map, line) =>
    // Split the line on the comma and separate the two parts
    val Array(string, num) = line.split(",")
    // Add the new entry to the map
    map + (string -> num.toInt)
}
println(map)
println(map)
Output:
Map(a -> 1, b -> 2, gretzky -> 99, tyler -> 11)
Probably not the best way to do it, but it should meet your needs:
yourlist.groupBy( _.split(',')(0) ).mapValues(v=>v(0).split(',')(1))
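A more direct alternative, sketched on the same hypothetical shape of input: split each element once and build the Map with toMap:

```scala
val lines = List("a,1", "b,2", "gretzky,99")
val map = lines.map { line =>
  val Array(key, num) = line.split(",")
  key -> num.toInt
}.toMap
// map == Map("a" -> 1, "b" -> 2, "gretzky" -> 99)
```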

Read input from input file given in specified format in scala

I am given an input file in the following format:
(a,[b,c,d])
(b,[d,a])
How can I format this input to get values in the form
key => List()
The following code is used to split lines on the basis of spaces:
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
How can I store this kind of formatted input?
To tackle this I started with multiple data elements with and without whitespace separation.
%> cat junk.txt
(a,[b,c,d,e]) (w,[x,y,z])
(q,[wert,cv])(xx,[aa])
Then I opened the file and split the input on every leading ( paren without consuming the character.
val input = io.Source.fromFile("junk.txt")
  .getLines()
  .flatMap(_.split("(?=\\()"))
I also need a way to recognize the pattern I'm looking for.
val dataRE = "\\(([^,]+),\\[([^\\]]+)]".r.unanchored
Now to transform the data from Strings to Maps.
input.collect{case dataRE(k,v) => k -> v.split(",").toList}.toMap
Result: Map[String,List[String]]
Map(a -> List(b, c, d, e), w -> List(x, y, z), q -> List(wert, cv), xx -> List(aa))
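Putting the pieces together on the question's two sample lines (reading from an in-memory string rather than a file, as a sketch):

```scala
val sample = "(a,[b,c,d])\n(b,[d,a])"
val dataRE = "\\(([^,]+),\\[([^\\]]+)]".r.unanchored

val parsed = sample.linesIterator
  .flatMap(_.split("(?=\\()")) // split on each leading '(' without consuming it
  .collect { case dataRE(k, v) => k -> v.split(",").toList }
  .toMap
// parsed == Map("a" -> List("b", "c", "d"), "b" -> List("d", "a"))
```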

Scala - List to List of tuples

Beginner here.
Sorry, but I didn't find an answer, so I'm asking the question here.
I want to know how to do this using the Scala API:
(blabla))( -> List(('(',2),(')',2))
Currently I have this :
"(blabla))(".toCharArray.toList.filter(p => (p == '(' || p == ')')).sortBy(x => x)
Output :
List((, (, ), ))
Now how can I map each character to the tuples I described?
An example for a more general case:
"t:e:s:t" -> List(('t',2),('e',1),('s',1),(':',3))
Thanks
val source = "ok:ok:k::"
val chars = source.toList
val shorter = chars.distinct.map( c => (c, chars.count(_ == c)))
//> shorter : List[(Char, Int)] = List((o,2), (k,3), (:,4))
A classic groupBy + mapValues use case:
scala> val str = "ok:ok:k::"
str: String = ok:ok:k::
scala> str.groupBy(identity).mapValues(_.size) // identity <=> (x => x)
res0: scala.collection.immutable.Map[Char,Int] = Map(k -> 3, : -> 4, o -> 2)
I like sschaef's solution very much, but I was wondering if anyone could weigh in on how efficient that solution is compared to this one:
scala> val str = "ok:ok:k::"
str: String = ok:ok:k::
scala> str.foldLeft(Map[Char,Int]().withDefaultValue(0))((current, c) => current.updated(c, current(c) + 1))
res29: scala.collection.immutable.Map[Char,Int] = Map(o -> 2, k -> 3, : -> 4)
I think my solution is slower. If we have n total occurrences and m unique values:
My solution: we fold left over all occurrences, so n. For each occurrence we look up once to find the current count and then again to create the updated map. I'm assuming that creating the updated map is constant time.
Total complexity: n * 2m or O(n*m)
sschaef's solution: the groupBy, which I'm assuming just appends entries onto a list without checking the map (so for all values this would be a constant-time lookup plus appending to the list), so n. Then mapValues probably iterates over the unique values and grabs the size of each key's list. I'm assuming that getting the size of each entry's list is constant time.
Total complexity: O(n + m)
Does this seem correct or am I mistaken in my assumptions?

Storing the contents of a file in an immutable Map in scala

I am trying to implement a simple word count in Scala using an immutable Map (this is intentional), and the way I am trying to accomplish it is as follows:
Create an empty immutable map
Create a scanner that reads through the file.
While scanner.hasNext() is true:
Check if the Map contains the word; if it doesn't, initialize the count to zero
Create a new entry with key=word and value=count+1
Update the map
At the end of the iteration, the map is populated with all the values.
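The steps above can be sketched without file I/O by folding over a hypothetical token list, threading the immutable Map through as the accumulator:

```scala
val tokens = List("a", "b", "a") // hypothetical tokens instead of scanner output
val counts = tokens.foldLeft(Map.empty[String, Int]) { (m, token) =>
  m.updated(token, m.getOrElse(token, 0) + 1)
}
// counts == Map("a" -> 2, "b" -> 1)
```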
My code is as follows:
val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}
The idea is that wordMap will have all the word counts at the end of the iteration...
Whenever I try to run this snippet, I get the following exception
recursive value wordMap needs type.
Can somebody point out why I am getting this exception and what can I do to remedy it?
Thanks
val wordMap = wordMap + (token,currentCount)
This line is redefining an already-defined variable. If you want to do this, you need to define wordMap with var and then just use
wordMap = wordMap + (token,currentCount)
Though how about this instead?:
io.Source.fromFile("textfile.txt")         // read from the file
  .getLines.flatMap{ line =>               // for each line
    line.split("\\s+")                     // split the line into tokens
      .groupBy(identity).mapValues(_.size) // count each token in the line
  }                                        // this produces an iterator of token counts
  .toStream                                // make a Stream so we can groupBy
  .groupBy(_._1).mapValues(_.map(_._2).sum) // combine all the per-line counts
  .toList
Note that the per-line pre-aggregation is used to try to reduce the memory required; counting across the entire file at once might be too big.
If your file is really massive, I would suggest doing this in parallel (since word counting is trivial to parallelize) using either Scala's parallel collections or Hadoop (using one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).
EDIT: Detailed explanation:
Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:
val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)
Now identity is a function that just returns its argument, so if we `groupBy(identity)`, we map each distinct word type to its word tokens:
val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))
And finally, we want to count up the number of tokens for each type:
val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)
Since we map this over all the lines in the file, we end up with token counts for each line.
So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.
Assume the file is:
a b c b
b c d d d
e f c
Then we get:
val countsByLine =
  io.Source.fromFile("textfile.txt")         // read from the file
    .getLines.flatMap{ line =>               // for each line
      line.split("\\s+")                     // split the line into tokens
        .groupBy(identity).mapValues(_.size) // count each token in the line
    }                                        // this produces an iterator of token counts

println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))
So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. Then the groupBy groups all counts of the same word type together.
val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))
And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:
val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList)
List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))
And there you have it.
You have a few mistakes: you've defined wordMap twice (val declares a value). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).
Try this:
var wordMap = Map.empty[String, Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while (input.hasNext()) {
  val token = input.next()
  wordMap += token -> (wordMap(token) + 1)
}