Efficiently counting occurrences of each character in a file - scala

I am new to Scala and want the fastest way to get a map of occurrence counts for each character in a text file. How can I do that? (I used groupBy, but I believe it is too slow.)

I think that groupBy() is probably pretty efficient, but it only collects the elements into groups, which means that counting them requires a second traversal.
To count all Chars in a single traversal you'd probably need something like this.
val tally = Array.ofDim[Long](128) // one slot per ASCII code point (0 to 127); a character outside that range would throw
io.Source.fromFile("someFile.txt").foreach(tally(_) += 1)
Array was used for its fast indexing. The index is the character that was counted.
tally('e') //res0: Long = 74
tally('x') //res1: Long = 1
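The array approach assumes every character in the file fits in the ASCII range. A minimal sketch of the same single-pass idea for arbitrary characters, assuming Scala 2.13 or later (for scala.util.Using), could look like this. The names CharTally, count and countFile are hypothetical and not part of the answer above.

```scala
import scala.collection.mutable
import scala.io.Source
import scala.util.Using

object CharTally {
  // Single pass over any stream of characters; not limited to ASCII.
  def count(chars: Iterator[Char]): Map[Char, Long] = {
    val tally = mutable.Map.empty[Char, Long].withDefaultValue(0L)
    chars.foreach(c => tally(c) += 1)
    tally.toMap
  }

  // Hypothetical helper: opens the file, counts, and closes it even on failure.
  def countFile(path: String): Map[Char, Long] =
    Using.resource(Source.fromFile(path))(src => count(src))
}
```

A Source is itself an Iterator[Char], so count works on a file as well as on "some text".iterator. The hash map is slower per lookup than array indexing, which is the trade-off for not having to know the character range up front.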

You can do the following:
Read the file first:
import scala.io.Source
val lines = Source.fromFile("/Users/Al/.bash_profile").getLines().toSeq
You can then write a method that takes the lines read and counts the occurrences of a given character:
// Sums, over all lines, the number of times c appears.
def getCharCount(c: Char, lines: Seq[String]): Int = {
  lines.foldLeft(0) { (acc, elem) =>
    elem.count(_ == c) + acc
  }
}

Related

Functional programming in Scala: Output the word (or list of words) that occurs the most times in the text file?

Output the word (or list of words) that occurs the most times in the text file (irrespective of case – i.e. “word” and “Word” are treated the same for this purpose). We are only interested in words that contain alphabetic characters [A-Z a-z], so ignore any digits (numbers), punctuation, etc.
If there are several words that occur most often with equal frequency then all these words should be printed as a list. Alongside the word(s) you should output the number of occurrences. For example:
The word(s) that occur most often are [“and”, “it”, “the”] each with 10 occurrences in the text.
I have the following code:
val counter: Map[String, Int] = scala.io.Source.fromFile(file).getLines
  .flatMap(_.split("[^-A-Za-z]+")).foldLeft(Map.empty[String, Int]) {
    (count, word) => count + (word.toLowerCase -> (count.getOrElse(word, 0) + 1))
  }
val list = counter.toList.sortBy(_._2).reverse
This goes as far as creating a list of the words in descending order of occurrences. I don't know how to proceed from here.
Well, you are almost there ...
val maxNum = list.headOption.fold(0)(_._2) // What's the max number? (list is sorted in descending order, so it is at the head)
list
  .iterator // not necessary, but makes it a bit faster to perform chained transformations
  .takeWhile(_._2 == maxNum) // Get all words that have that count
  .map(_._1) // drop the counts, keep only words
  .foreach(println) // Print them out
One fairly major problem with your solution is that it sorts the whole list just to find the maximum, which is unnecessary, as pointed out in the comment.
Just do
val maxNum = counter.maxByOption(_._2).fold(0)(_._2)
counter
  .iterator
  .collect { case (w, `maxNum`) => w }
  .foreach(println)
Also, a bit of a "cosmetic" improvement to your counting is to use groupMapReduce (available since Scala 2.13), which does what you've accomplished with foldLeft a bit more elegantly:
val counter = source.getLines
  .flatMap(_.split("\\b")) // \b is a regex symbol for "word boundary"
  .filter(_.matches("\\w+")) // filter out the delimiters - you have a little bug here, that results in your counting spaces as "words"
  .toSeq // groupMapReduce is defined on collections, not on Iterator
  .groupMapReduce(identity)(_ => 1)(_ + _) // group data by word, replace each occurrence of a word with `1`, and add them all up
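Putting the pieces together, here is a hedged sketch of the whole task: count case-insensitively, keep only alphabetic words, and return every word tied for the top count. It assumes Scala 2.13 or later; WordFrequency and mostFrequent are hypothetical names chosen for illustration.

```scala
object WordFrequency {
  // Returns every word tied for the highest count, together with that count.
  def mostFrequent(lines: Iterator[String]): (List[String], Int) = {
    val counts: Map[String, Int] = lines
      .flatMap(_.split("[^A-Za-z]+")) // split on anything that is not a letter
      .filter(_.nonEmpty)             // split can yield an empty leading token
      .map(_.toLowerCase)             // "Word" and "word" count as the same word
      .toSeq
      .groupMapReduce(identity)(_ => 1)(_ + _)

    val top = counts.valuesIterator.maxOption.getOrElse(0)
    // Backticks make `top` a value to match against, not a new binding.
    val words = counts.collect { case (w, `top`) => w }.toList.sorted
    (words, top)
  }
}
```

Possible usage, with `file` standing for whatever path the question's code reads from:

val (words, n) = WordFrequency.mostFrequent(scala.io.Source.fromFile(file).getLines())
println(s"The word(s) that occur most often are ${words.mkString("[", ", ", "]")} each with $n occurrences in the text.")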

How to handle consecutive and non-consecutive numbers in a list in Scala?

val keywords = List("do", "abstract","if")
val resMap = io.Source
  .fromFile("src/demo/keyWord.txt")
  .getLines()
  .zipWithIndex
  .foldLeft(Map.empty[String,Seq[Int]].withDefaultValue(Seq.empty[Int])){
    case (m, (line, idx)) =>
      val subMap = line.split("\\W+")
        .toSeq //separate the words
        .filter(keywords.contains) //keep only key words
        .groupBy(identity) //make a Map w/ keyword as key
        .mapValues(_.map(_ => idx+1)) //and List of line numbers as value
        .withDefaultValue(Seq.empty[Int])
      keywords.map(kw => (kw, m(kw) ++ subMap(kw))).toMap
  }
println("keyword\t\tlines\t\tcount")
keywords.sorted.foreach{kw =>
  println(kw + "\t\t" +
    resMap(kw).distinct.mkString("[",",","]") + "\t\t" +
    resMap(kw).length)
}
This code is not mine; I am using it for study purposes. I am still learning, and I am stuck on collapsing consecutive line numbers into ranges. For example, the word "if" appears on many lines, and when three or more consecutive line numbers appear they should be written with a dash in between, e.g. 20-22 rather than 20, 21, 22. How can I implement this?
output:
keyword lines count
abstract [1] 1
do [6] 1
if [14,15,16,17,18] 5
But I want the result to be [14-18], because the word "if" is on lines 14 to 18.
First off, I'll give the customary caution that SO isn't meant to be a place to crowdsource answers to homework or projects. I'll give you the benefit of the doubt that this isn't the case.
That said, I hope you gain some understanding about breaking down this problem from this suggestion:
Your existing implementation has nothing in place to tell whether the Int values are actually consecutive, so you will need to add some code that sorts the Ints returned from resMap(kw).distinct to set yourself up for the next steps. You can figure out how to do this.
You will then need to group the Ints by their consecutive nature. For example, if you have (14,15,16,18,19,20,22) then this really needs to be further grouped into ((14,15,16),(18,19,20),(22)). You can come up with your own algorithm for this.
Map over the outer collection (which is a Seq[Seq[Int]] at this point), handling each inner Seq differently depending on whether its length is greater than 1. If it is, you can safely call head and last to get the Ints that you need for rendering your range. Alternatively, you can more idiomatically write a for-comprehension that composes the values from headOption and lastOption to build the same range string. You said something about a length of 3 in your question, so you can adjust this step to meet that need as necessary.
Lastly, you now have a Seq[String] looking like ("14-16","18-20","22") that you need to join together using a mkString call similar to the one you already have with the square brackets.
For reference, you should get further acquainted with the Scaladoc for the Seq trait:
https://www.scala-lang.org/api/2.12.8/scala/collection/Seq.html
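The steps above can be sketched roughly as follows. This is only one possible reading of them, not the answerer's own code; LineRanges, runs, render and the minRun parameter are hypothetical names introduced for illustration.

```scala
object LineRanges {
  // Steps 1 and 2: sort, de-duplicate, then split into runs of consecutive numbers.
  // Each run is built back to front, so it is reversed once at the end.
  def runs(nums: Seq[Int]): List[List[Int]] =
    nums.distinct.sorted.foldLeft(List.empty[List[Int]]) {
      case ((last :: rest) :: done, n) if n == last + 1 => (n :: last :: rest) :: done
      case (done, n)                                    => List(n) :: done
    }.map(_.reverse).reverse

  // Steps 3 and 4: render runs of at least `minRun` numbers as "first-last",
  // shorter runs as plain comma-separated numbers, then join everything.
  def render(nums: Seq[Int], minRun: Int = 3): String =
    runs(nums).map { run =>
      if (run.length >= minRun) s"${run.head}-${run.last}" else run.mkString(",")
    }.mkString("[", ",", "]")
}
```

With the data from the question, LineRanges.render(resMap(kw)) would replace the resMap(kw).distinct.mkString("[",",","]") call.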
Here's one way to go about it.
// Expects nums in ascending order.
def collapseConsecutives(nums: Seq[Int]): List[String] =
  if (nums.isEmpty) Nil // nums.last would throw on an empty Seq
  else nums.foldRight((nums.last, List.empty[List[Int]])) {
    case (n, (prev, acc)) if prev - n == 1 => (n, (n :: acc.head) :: acc.tail)
    case (n, (_, acc))                     => (n, List(n) :: acc)
  }._2.map { ns =>
    if (ns.length < 3) ns.mkString(",") //1 or 2 non-collapsables
    else s"${ns.head}-${ns.last}"       //3 or more, collapsed
  }
usage:
println(kw + "\t\t" +
collapseConsecutives(resMap(kw).distinct).mkString("[",",","]") + "\t\t" +
resMap(kw).length)

Scala: Split array into chunks by some logic

Is there a predefined function in Scala to split a list into several lists by some logic? I found the grouped method, but it doesn't fit my needs.
For example, I have List of strings: List("questions", "tags", "users", "badges", "unanswered").
And I want to split this list by the maximum total length of the strings (for example 12). In other words, in each resulting chunk the sum of the lengths of all strings should not be more than 12:
List("questions"), List("tags", "users"), List("badges"), List("unanswered")
EDIT: I don't necessarily need to find the optimal way of combining strings into chunks, just a linear loop that checks the next string in the list; if its length doesn't fit within the limit (12), the current chunk is returned and that string starts the next chunk.
There is no built-in mechanism to do that, as far as I know, but you could achieve something like that with a foldLeft and a bit of coding:
val test = List("questions", "tags", "users", "badges", "unanswered")
test.foldLeft(List.empty[List[String]]) {
  // <= so that a chunk may reach the limit of 12 exactly ("not more than 12")
  case ((head :: tail), word) if head.map(_.length).sum + word.length <= 12 =>
    (word :: head) :: tail
  case (result, word) =>
    List(word) :: result
}
-> res0: List[List[String]] = List(List(unanswered), List(badges), List(users, tags), List(questions))
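Note that the fold above builds both the chunks and the words inside each chunk in reverse order. A hedged variant that restores the original order and carries the running length along instead of re-summing it might look like this. Chunker and chunkByLength are hypothetical names, and the limit is treated as inclusive.

```scala
object Chunker {
  // Greedy, order-preserving split: start a new chunk when adding the next
  // string would push the chunk's total length above `maxLen`.
  def chunkByLength(words: List[String], maxLen: Int): List[List[String]] =
    words.foldLeft(List.empty[(Int, List[String])]) {
      // The current chunk still has room: prepend the word, update the length.
      case ((len, chunk) :: done, w) if len + w.length <= maxLen =>
        (len + w.length, w :: chunk) :: done
      // No chunk yet, or the word does not fit: open a new chunk.
      case (done, w) =>
        (w.length, List(w)) :: done
    }.reverseIterator.map(_._2.reverse).toList
}
```

A string longer than maxLen ends up alone in its own chunk; whether that is acceptable depends on the use case.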
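If you can make reasonable assumptions about the max length of a string (e.g. 10 characters), you could instead put a fixed number of elements in each chunk with sliding (with equal size and step it behaves like grouped), which is much faster for a long list: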
val elementsPerChunk = ??? // do some maths like
// CHUNK_SIZE / MAX_ELEM_LENGTH
val test = List("questions", "tags", "users", "badges", "unanswered")
test.sliding(elementsPerChunk, elementsPerChunk).toList

Function to return List of Map while iterating over String, kmer count

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.
The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.
Therefore, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC)
I've parsed the input and created one huge string containing the entire sequence, because the newlines throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.
Now I am trying to write a function that will generate a list of maps List[Map[String, Int]] with which it should be easy to use scala's groupBy function to get the count of the common k-mers
import scala.io.Source
object Main {
  def main(args: Array[String]) {
    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray
    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")
    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)
  }
  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}
Couple of questions:
How to create/generate List[Map[String,Int]]?
How would you do it?
Any help and/or advice is definitely appreciated!
You're pretty close—there are three fairly minor problems with your code.
The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.
The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
So the following should work:
def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` rather than `until`, so the last k-mer (starting at seq.length - k) is included
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}
You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.
It's worth noting, by the way, that the sliding method on Seq does exactly what you want:
scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC
I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
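Since groupBy(identity) yields a Map from each k-mer to the list of its occurrences, one more step is needed to get counts. A minimal sketch, assuming Scala 2.13 or later for groupMapReduce (on older versions, groupBy(identity) followed by mapping each value to its size does the same), with KmerCount and kmerCounts as hypothetical names:

```scala
object KmerCount {
  // Counts every k-mer (substring of length k) in `seq`.
  def kmerCounts(seq: String, k: Int): Map[String, Int] =
    // sliding returns one short window when the input is shorter than k,
    // so that case is handled explicitly.
    if (k <= 0 || seq.length < k) Map.empty
    else seq.sliding(k).toSeq.groupMapReduce(identity)(_ => 1)(_ + _)
}
```

For a very long sequence, calling toSeq materializes every window at once; folding the sliding iterator into a mutable map would avoid that, at the cost of slightly more code.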

Scala split argument over several lines and parse to Int

This is probably going to end up being very simple, but I ask more to help me learn better Scala idioms (Python guy by trade looking to learn some scala tricks.)
I'm doing some HackerRank problems, and the required method of input is to read lines from stdin. The spec is quoted below:
The first line contains the number of test cases T. T test cases
follow. Each case contains two integers N and M.
So the input passed to the script looks something like this:
4
2 2
3 2
2 3
4 4
I'm wondering what would be the proper, idiomatic way to do this. I've thought of a few:
Use io.Source.stdin.getLines.zipWithIndex, then from within a foreach, if the index is greater than 0, split on whitespace and map to (_.toInt)
Use the same getLines function to get the input and then pattern match against the index.
Split on whitespace and newlines to make a single list of numbers, map toInt, pop the first element (problem size) and then group the rest in pairs to make tuples of arguments for my problem function.
I'm wondering what more experienced scala programmers would consider the best way to parse these args, where the 2 element lines would be args to a function and the first, single digit line is just the number of problems to solve.
Maybe you're looking for something like this?
def f(x: Int, y: Int) = { f"do something with $x and $y" }
io.Source.stdin.getLines() // Source has getLines, not readLines
  .map(_.trim.split("\\s+").map(_.toInt)) // split and convert to ints
  .collect { case Array(a, b) => f(a, b) } // pass to f if there are two arguments
  .foreach(println) // print the result of each function call
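The collect-based version silently skips the count line, along with any other line that does not hold exactly two numbers. If the test-case count T should actually be honored, a sketch along these lines could be used. ParseCases and parse are hypothetical names, and failing loudly on a malformed line is a design choice made here, not something the spec requires.

```scala
object ParseCases {
  // First line is the number of cases T; each of the next T lines holds N and M.
  def parse(lines: Iterator[String]): List[(Int, Int)] = {
    val t = lines.next().trim.toInt
    lines.take(t).map { line =>
      line.trim.split("\\s+").map(_.toInt) match {
        case Array(n, m) => (n, m)
        case _ => throw new IllegalArgumentException(s"expected two integers, got: $line")
      }
    }.toList
  }
}
```

Possible usage: ParseCases.parse(scala.io.Source.stdin.getLines()).foreach { case (n, m) => println(f(n, m)) }, where f is the problem function defined above.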
Another way to read the input for HackerRank problems is with scala.io.StdIn:
import scala.io.StdIn
import scala.collection.mutable.ArrayBuffer
object Solution {
  def main(args: Array[String]): Unit = {
    val q = StdIn.readInt() // number of test cases
    val lines = ArrayBuffer[Array[Int]]()
    // read q lines, each holding two space-separated integers
    (1 to q).foreach(_ => lines += StdIn.readLine().split(" ").map(_.toInt))
    for (a <- lines) {
      val n = a(0)
      val m = a(1)
      val ans = n * m
      println(ans)
    }
  }
}
I have tested it on Hacker Rank platform today and the output is:
4
6
6
16