Scala: Split array into chunks by some logic - scala

Is there a predefined function in Scala to split a list into several lists by some logic? I found the grouped method, but it doesn't fit my needs.
For example, I have a List of strings: List("questions", "tags", "users", "badges", "unanswered").
I want to split this list by the maximum total length of the strings (for example 12). In other words, in each resulting chunk the sum of the lengths of all strings should not be more than 12:
List("questions"), List("tags", "users"), List("badges"), List("unanswered")
EDIT: I don't necessarily need to find the optimal way of combining strings into chunks, just a linear loop that checks the next string in the list; if its length doesn't fit within the limit (12), the current chunk is returned and the next string starts the next chunk.

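There is no built-in mechanism to do that, as far as I know, but you could achieve something like it with a foldLeft and a bit of coding. Note that the chunks, and the words inside each chunk, come out in reverse order because they are built by prepending: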
val test = List("questions", "tags", "users", "badges", "unanswered")
test.foldLeft(List.empty[List[String]]) {
case ((head :: tail), word) if head.map(_.length).sum + word.length <= 12 => // "not more than 12" includes exactly 12
(word :: head) :: tail
case (result, word) =>
List(word) :: result
}
-> res0: List[List[String]] = List(List(unanswered), List(badges), List(users, tags), List(questions))
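If you need the chunks in their original order, one option is to reverse both levels after the fold. This is only a sketch: chunkByLength and maxLen are names made up here for illustration, and it uses the same greedy rule the question describes.

```scala
// Hypothetical helper: greedy, order-preserving chunking by total string length.
def chunkByLength(words: List[String], maxLen: Int): List[List[String]] =
  words
    .foldLeft(List.empty[List[String]]) {
      // The current chunk still has room: prepend the word to it.
      case (current :: done, word) if current.map(_.length).sum + word.length <= maxLen =>
        (word :: current) :: done
      // Otherwise start a new chunk (an over-long word gets a chunk of its own).
      case (chunks, word) =>
        List(word) :: chunks
    }
    .map(_.reverse) // restore word order inside each chunk
    .reverse        // restore chunk order

val chunks = chunkByLength(List("questions", "tags", "users", "badges", "unanswered"), 12)
// chunks == List(List("questions"), List("tags", "users"), List("badges"), List("unanswered"))
```

Summing the current chunk on every step keeps the code short, but carrying the running length in the accumulator would avoid the repeated work on long chunks.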

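If you can make reasonable assumptions about the maximum length of a string (e.g. 10 characters), you could instead use sliding (with step equal to size this is the same as grouped), which is much faster for a long list. Each chunk then holds a fixed number of elements, so chunks may be shorter than the limit would allow: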
val elementsPerChunk = ??? // do some maths like
// CHUNK_SIZE / MAX_ELEM_LENGTH
val test = List("questions", "tags", "users", "badges", "unanswered")
test.sliding(elementsPerChunk, elementsPerChunk).toList

Related

Functional programming in Scala: Output the word (or list of words) that occurs the most times in the text file?

Output the word (or list of words) that occurs the most times in the text file (irrespective of case – i.e. “word” and “Word” are treated the same for this purpose). We are only interested in words that contain alphabetic characters [A-Z a-z], so ignore any digits (numbers), punctuation, etc.
If there are several words that occur most often with equal frequency then all these words should be printed as a list. Alongside the word(s) you should output the number of occurrences. For example:
The word(s) that occur most often are [“and”, “it”, “the”] each with 10 occurrences in the text.
I have the following code:
val counter: Map[String, Int] = scala.io.Source.fromFile(file).getLines
.flatMap(_.split("[^-A-Za-z]+")).foldLeft(Map.empty[String, Int]) {
(count, word) => count + (word.toLowerCase -> (count.getOrElse(word, 0) + 1))
}
val list = counter.toList.sortBy(_._2).reverse
This goes as far as creating a list of the words in descending order of occurrences. I don't know how to proceed from here.
Well, you are almost there ...
val maxNum = list.headOption.fold(0)(_._2) // What's the max number? (head of the sorted list, not of the Map)
list
.iterator // not necessary, but makes it a bit faster to perform chained transformations
.takeWhile(_._2 == maxNum) // Get all words that have that count
.map(_._1) // drop the counts, keep only words
.foreach(println) // Print them out
One kinda major problem with your solution is that you shouldn't sort the list just to find the maximum, as pointed out in the comment.
Just do
val maxNum = counter.maxByOption(_._2).fold(0)(_._2)
counter
.iterator
.collect { case (w, `maxNum`) => w }
.foreach(println)
Also, a bit of a "cosmetic" improvement to your counting is to use groupMapReduce that does what you've accomplished with foldLeft a bit more elegantly:
val counter = source.getLines()
.flatMap(_.split("\\b")) // \b is a regex symbol for "word boundary"
.filter(_.matches("\\w+")) // filter out the delimiters - you have a little bug here, that results in your counting spaces as "words"
.toSeq // groupMapReduce is defined on collections, not on Iterator
.groupMapReduce(identity)(_ => 1)(_ + _) // group data by word, replace each occurrence of a word with `1`, and add them all up
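Putting the pieces together, here is a minimal sketch that works on in-memory lines instead of a file. mostFrequentWords is a made-up name, and it assumes Scala 2.13+ (for groupMapReduce and maxOption). It lower-cases words while grouping, so "The" and "the" count as the same word, and it splits on anything that is not a letter, so digits are ignored.

```scala
// Hypothetical helper: returns the most frequent words and their shared count.
def mostFrequentWords(lines: Seq[String]): (Set[String], Int) = {
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split("[^A-Za-z]+")) // split on anything that is not a letter
      .filter(_.nonEmpty)             // split can yield empty strings
      .groupMapReduce(_.toLowerCase)(_ => 1)(_ + _)
  val max = counts.values.maxOption.getOrElse(0)
  // Keep every word whose count equals the maximum (handles ties).
  (counts.collect { case (w, `max`) => w }.toSet, max)
}

val (words, n) = mostFrequentWords(Seq("The cat and the dog", "and THE bird, 42 times"))
println(s"The word(s) that occur most often are ${words.mkString("[", ", ", "]")} each with $n occurrences in the text.")
```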

How to consecutive and non-consecutive list in scala?

val keywords = List("do", "abstract","if")
val resMap = io.Source
.fromFile("src/demo/keyWord.txt")
.getLines()
.zipWithIndex
.foldLeft(Map.empty[String,Seq[Int]].withDefaultValue(Seq.empty[Int])){
case (m, (line, idx)) =>
val subMap = line.split("\\W+")
.toSeq //separate the words
.filter(keywords.contains) //keep only key words
.groupBy(identity) //make a Map w/ keyword as key
.mapValues(_.map(_ => idx+1)) //and List of line numbers as value
.withDefaultValue(Seq.empty[Int])
keywords.map(kw => (kw, m(kw) ++ subMap(kw))).toMap
}
println("keyword\t\tlines\t\tcount")
keywords.sorted.foreach{kw =>
println(kw + "\t\t" +
resMap(kw).distinct.mkString("[",",","]") + "\t\t" +
resMap(kw).length)
}
This code is not mine and I don't own it; I'm using it for study purposes. I am still learning, and I am stuck on how to collapse consecutive line numbers in the list. For example, the word "if" appears on many lines, and when three or more consecutive line numbers appear they should be written with a dash in between, e.g. 20-22 rather than 20, 21, 22. How can I implement this? I just want to learn.
output:
keyword lines count
abstract [1] 1
do [6] 1
if [14,15,16,17,18] 5
But I want the result to be such as [14-18] because word "if" is in line 14 to 18.
First off, I'll give the customary caution that SO isn't meant to be a place to crowdsource answers to homework or projects. I'll give you the benefit of the doubt that this isn't the case.
That said, I hope you gain some understanding about breaking down this problem from this suggestion:
your existing implementation has nothing in place to understand if the int values are indeed consecutive, so you are going to need to add some code that sorts the Ints returned from resMap(kw).distinct in order to set yourself up for the next steps. You can figure out how to do this.
you will then need to group the Ints by their consecutive nature. For example, if you have (14,15,16,18,19,20,22) then this really needs to be further grouped into ((14,15,16),(18,19,20),(22)). You can come up with your algorithm for this.
map over the outer collection (which is a Seq[Seq[Int]] at this point), having different handling depending on whether or not the length of the inside Seq is greater than 1. If greater than one, you can safely call head and last to get the Ints that you need for rendering your range. Alternatively, you can more idiomatically make a for-comprehension that composes the values from headOption and lastOption to build the same range string. You said something about length of 3 in your question, so you can adjust this step to meet that need as necessary.
lastly, now you have Seq[String] looking like ("14-16","18-20","22") that you need to join together using a mkString call similar to what you already have with the square brackets
For reference, you should get further acquainted with the Scaladoc for the Seq trait:
https://www.scala-lang.org/api/2.12.8/scala/collection/Seq.html
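For readers who want to check their own attempt, the steps above could be sketched roughly as follows. consecutiveRuns, renderRuns and minLen are names invented for this illustration, and it is only one of several ways to do the grouping.

```scala
// Hypothetical helpers following the steps above; all names are illustrative.
// Steps 1-2: sort, then group into runs of consecutive values.
def consecutiveRuns(nums: Seq[Int]): List[List[Int]] =
  nums.distinct.sorted.foldLeft(List.empty[List[Int]]) {
    case (run :: done, n) if n == run.head + 1 => (n :: run) :: done // extend the current run
    case (runs, n)                             => List(n) :: runs    // start a new run
  }.map(_.reverse).reverse

// Steps 3-4: render runs of at least minLen as "first-last", shorter runs as plain numbers.
def renderRuns(nums: Seq[Int], minLen: Int = 3): String =
  consecutiveRuns(nums)
    .map(run => if (run.length >= minLen) s"${run.head}-${run.last}" else run.mkString(","))
    .mkString("[", ",", "]")

println(renderRuns(Seq(14, 15, 16, 18, 19, 20, 22))) // prints [14-16,18-20,22]
```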
Here's one way to go about it.
def collapseConsecutives(nums :Seq[Int]) :List[String] =
if (nums.isEmpty) Nil //nums.last would throw on an empty Seq
else nums.foldRight((nums.last, List.empty[List[Int]])) {
case (n, (prev,acc)) if prev-n == 1 => (n, (n::acc.head) :: acc.tail)
case (n, ( _ ,acc)) => (n, List(n) :: acc)
}._2.map{ ns =>
if (ns.length < 3) ns.mkString(",") //1 or 2 non-collapsables
else s"${ns.head}-${ns.last}" //3 or more, collapsed
}
usage:
println(kw + "\t\t" +
collapseConsecutives(resMap(kw).distinct).mkString("[",",","]") + "\t\t" +
resMap(kw).length)

Efficiently counting occurrences of each character in a file - scala

I am new to Scala. I want the fastest way to get a map of the number of occurrences of each character in a text file. How can I do that? (I used groupBy, but I believe it is too slow.)
I think that groupBy() is probably pretty efficient, but it simply collects the elements, which means that counting them requires a 2nd traversal.
To count all Chars in a single traversal you'd probably need something like this.
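val tally = Array.ofDim[Long](128) // one slot per ASCII code (0 to 127)
io.Source.fromFile("someFile.txt").foreach(tally(_) += 1)
Array was used for its fast indexing. The index is the character that was counted. Note that a character outside the ASCII range would index past the end of the array.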
tally('e') //res0: Long = 74
tally('x') //res1: Long = 1
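If the file may contain characters outside ASCII, a mutable map keeps the single traversal at the cost of slower lookups than an array. This is only a sketch; countChars is a name made up here. Since scala.io.Source is an Iterator[Char], a Source can be passed in directly.

```scala
import scala.collection.mutable

// Hypothetical helper: counts every Char in one pass, whatever its code.
def countChars(chars: Iterator[Char]): mutable.Map[Char, Long] = {
  val tally = mutable.Map.empty[Char, Long].withDefaultValue(0L)
  chars.foreach(c => tally(c) += 1) // the default of 0 covers the first sighting
  tally
}

val counts = countChars("hello wörld".iterator)
// e.g. countChars(scala.io.Source.fromFile("someFile.txt")) for a file
```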
You can do the following:
Read the file first:
import scala.io.Source
val lines = Source.fromFile("/Users/Al/.bash_profile").getLines.toSeq
You can then write a method that takes the lines read and counts the occurrences of a given character:
def getCharCount(c: Char, lines: Seq[String]) = {
lines.foldLeft(0){(acc, elem) =>
elem.toSeq.count(_ == c) + acc
}
}

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Ints frequencies, and I only care if they're at least thresh. Once it passed thresh there's no need to check anymore, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].
First of all, you can't do better than O(N), as you need to check each element of your initial list at least once. Your current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
update the map only when the current counter value is less than or equal to the threshold. This eliminates a huge number of the most expensive operations: map updates
try a faster map instead of IntMap. If you know that the values of the initial List are in a fixed range, you can use an Array instead of IntMap (index as the key). Another possible option is a mutable HashMap with sufficient initial capacity. As my benchmark shows, it actually makes a significant difference
As @ixx proposed, after incrementing the value in the map, check whether it's equal to thresh and in this case add it immediately to the result list. This saves you one linear traversal (which appears to be not that significant for large input)
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Authors solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= thresh) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer
val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.
You could use the foldleft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
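To show the idea without pulling in a library, here is a small hand-rolled Count-Min Sketch. It is not Algebird's API; CountMinSketch, approxFrequent, width and depth are all names and defaults chosen for this sketch. Each row hashes the item into one counter, and the estimate is the minimum across rows, so it can overestimate but never underestimate. The result may therefore contain false positives but will not miss an item that really reached the threshold.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hand-rolled illustration, not a library class: depth rows of width counters each.
final class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by mixing the row index into the hash.
  private def bucket(item: Int, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.finalizeHash(MurmurHash3.mix(row, item), 1), width)

  def add(item: Int): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += 1

  // Collisions can only inflate a counter, so the minimum is an upper bound on the true count.
  def estimate(item: Int): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

// Hypothetical driver: one pass, memory fixed by width * depth regardless of distinct values.
def approxFrequent(items: Iterable[Int], thresh: Int, width: Int = 1024, depth: Int = 4): Set[Int] = {
  val sketch = new CountMinSketch(width, depth)
  val result = mutable.Set.empty[Int]
  items.foreach { i =>
    sketch.add(i)
    if (sketch.estimate(i) >= thresh) result += i
  }
  result.toSet
}

val approx = approxFrequent(List(1, 2, 3, 2, 1, 4, 3, 2, 1), thresh = 3)
```

The result set itself still grows with the number of frequent items; only the counting structure has fixed size.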
You can change your foldLeft example a bit by using a mutable.Set that is built incrementally and at the same time used as a filter for iterating over your Seq with withFilter. However, because I'm using withFilter I cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq the first time, because withFilter filters the Seq in the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aivean's microbenchmark I can see that it is still slightly slower than his approach, but still better than the author's first approach.
Authors solution
377
Ixx solution:
399
Ixx solution2 (eliminated excessive map writes):
110
Sascha Kolbergs solution:
72
Aivean solution:
54

Scala Sub string combinations with delimiter

My input string is
element1-element2-element3-element4a|element4b-element5-element6a|element6b
All the elements (sub strings) are separated by - and for some of the elements there will be alternatives separated by | (pipe).
A valid output string contains the elements separated by - (dash) only, with exactly one alternative chosen wherever alternatives are separated by |.
The list of all valid output strings has to be returned.
Output:
element1-element2-element3-element4a-element5-element6a
element1-element2-element3-element4b-element5-element6a
element1-element2-element3-element4a-element5-element6b
element1-element2-element3-element4b-element5-element6b
This can be done using a while loop and string functions, but that gets complex.
(I'm a traditional Java programmer.)
Can this be implemented more efficiently using Scala features?
Note: the input can contain any number of elements and pipes.
This seems to fit the bill.
def getCombinations(input: String) = {
val group = """(\w|\|)+""".r // Match groups of letters and pipes
val word = """\w+""".r // Match groups of letters in between pipes
val groups = group.findAllIn(input).map(word.findAllIn(_).toVector).toList
// Use fold to construct a 'tree' of vectors, appending each possible entry in a
// pipe-separated group to each previous prefix. We're using vectors because
// the append time is O(1) rather than O(n).
val tree = groups match {
case (x :: tail) => {
val head = x.map(Vector(_)) // split each element in the head into its own node
tail.foldLeft(head) { (acc, elems) =>
for (elem <- elems; xs <- acc) yield (xs :+ elem)
}
}
case _ => Nil // Handle the case of 0 inputs
}
tree.map(_.mkString("-")) // Combine each of our trees back into a dash-separated string
}
I haven't tested this with extensive input, but the runtime complexity shouldn't be too bad. Introducing an 'Or' pipe causes the output to grow, but that's due to the nature of the problem: the number of results is the product of the number of alternatives in each group.
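For comparison, the same fold can be written with plain split calls instead of regular expressions. This is a sketch under the assumption that elements never contain - or | themselves; combinations is a name made up here.

```scala
// Hypothetical alternative using split instead of regular expressions.
def combinations(input: String): List[String] =
  if (input.isEmpty) Nil
  else
    input
      .split('-')               // one group per dash-separated position
      .map(_.split('|').toList) // the alternatives within each group
      .foldLeft(List(Vector.empty[String])) { (prefixes, alternatives) =>
        // Iterating alternatives in the outer loop reproduces the order listed in the question.
        for (alt <- alternatives; prefix <- prefixes) yield prefix :+ alt
      }
      .map(_.mkString("-"))

val result = combinations("element1-element2-element3-element4a|element4b-element5-element6a|element6b")
result.foreach(println)
```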