Iterator of repeated words in a file


Suppose I'm writing a function to find "repeated words" in a text file. For example, in aaa aaa bb cc cc bb dd the repeated words are aaa and cc but not bb, because the two bb instances don't appear next to each other.
The function receives an iterator and returns an iterator, like this:
def foo(in: Iterator[String]): Iterator[String] = ???
foo(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb")) // Iterator("aaa", "cc")
foo(Iterator("a", "a", "a", "b", "c", "b")) // Iterator("a")
How would you write foo? Note that the input is huge and the words do not all fit in memory (but the number of repeated words is relatively small).
P.S. Later I would also like to enhance foo to return the positions of the repeated words, the number of repetitions, etc.

UPDATE:
OK then. Let's specify a bit more precisely what you want:
input | expected
|
a |
aa | a
abc |
aabc | a
aaabbbbbbc | ab
aabaa | aa
aabbaa | aba
Is that right? If so, this is a working solution. I'm not sure about performance, but at least it is lazy (it doesn't load everything into memory).
// assume there are no nulls in the iterator
def foo[T >: Null](it: Iterator[T]) = {
  (Iterator(null) ++ it).sliding(3, 1).collect {
    case Seq(a, b, c) if b == c && a != b => c
  }
}
We need the ugly Iterator(null) ++ prefix because we look at three elements at a time and need a way to tell whether the first two elements of the input are the same.
This is a pure implementation, and it has some advantages over an imperative one (e.g. the ones in other answers). The most important is that it is lazy:
import scala.util.Random
// an infinite iterator!
val it = Iterator.iterate('a')(s => (s + (if (Random.nextBoolean()) 1 else 0)).toChar)
// foo pulls only as many elements as it needs to produce these 10 items,
// so it should not blow up
foo(it).take(10)
// an imperative implementation (fooImp, see the link below) will blow up in this situation
fooImp(it).take(10)
Here are all the implementations from this and the other answers in this thread:
https://scalafiddle.io/sf/w5yozTA/15
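As a quick check of the sliding-window version against the input/expected table above, here is a minimal sketch. The names repeated and check are made up for this example, and each character of the input string stands in for one word:

```scala
// The sliding-window version from above, renamed `repeated` for this sketch.
def repeated[T >: Null](it: Iterator[T]): Iterator[T] =
  (Iterator(null) ++ it).sliding(3, 1).collect {
    case Seq(a, b, c) if b == c && a != b => c
  }

// Hypothetical helper: treat every character of `in` as one word.
def check(in: String): String =
  repeated(in.iterator.map(_.toString)).mkString

// The rows of the table: input -> expected
val table = List(
  "" -> "", "a" -> "", "aa" -> "a", "abc" -> "", "aabc" -> "a",
  "aaabbbbbbc" -> "ab", "aabaa" -> "aa", "aabbaa" -> "aba"
)
table.foreach { case (in, expected) =>
  assert(check(in) == expected, s"unexpected result for '$in'")
}
```

Inputs shorter than two elements produce only a partial window, which the Seq(a, b, c) pattern does not match, so they yield nothing.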
WITH INDEXES AND POSITIONS
In a comment you asked whether it would be easy to add the number of repetitions and the indices of the repeated words. I thought about it for a while and came up with something like this. I'm not sure it performs well, but it should be lazy (i.e. it should work for big files).
/** Returns an Iterator that replaces consecutive equal items with (item, index, count).
    It contains all items from the original iterator. */
def pack[T >: Null](it: Iterator[T]) = {
  // two nulls, one for each sliding(...)
  (Iterator(null: T) ++ it ++ Iterator(null: T))
    .sliding(2, 1).zipWithIndex
    // skip windows whose two items are the same
    .filter { case (x, _) => x(0) != x(1) }
    // calculate how many items were skipped
    .sliding(2, 1).collect {
      case Seq((a, idx1), (b, idx2)) => (a(1), idx1, idx2 - idx1)
    }
}
def foo[T >: Null](it: Iterator[T]) = pack(it).filter(_._3 > 1)
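To see what pack produces, here is a small worked trace on the question's first example. It is only a sketch: pack is restated with an explicit return type so the block stands alone, and the value names are made up for this example.

```scala
// Same logic as pack above, with the result type spelled out.
def pack[T >: Null](it: Iterator[T]): Iterator[(T, Int, Int)] =
  (Iterator(null: T) ++ it ++ Iterator(null: T))
    .sliding(2, 1).zipWithIndex
    .filter { case (w, _) => w(0) != w(1) }   // keep only run boundaries
    .sliding(2, 1).collect {
      case Seq((a, i1), (_, i2)) => (a(1), i1, i2 - i1)
    }

val runs = pack(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb")).toList
// every run of the input as (word, start index, length)
assert(runs == List(("aaa", 0, 2), ("bb", 2, 1), ("cc", 3, 2), ("bb", 5, 1)))

// keeping only runs longer than 1 gives the repeated words
val repeatedOnly = runs.filter(_._3 > 1)
assert(repeatedOnly == List(("aaa", 0, 2), ("cc", 3, 2)))
```

The index of a boundary window equals the position of the word that starts the next run, which is why idx2 - idx1 is the run length.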
OLD ANSWER (BEFORE THE QUESTION WAS UPDATED)
Another (simpler) solution could be something like this:
import scala.collection.immutable._
// Create a new iterator each time we print it.
def it = Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "dd", "ee", "ee", "ee", "ee", "ee", "aaa", "aaa", "ff", "ff", "zz", "gg", "aaa", "aaa")
// yep... this is the whole implementation :)
def foo(it: Iterator[String]) = it.sliding(2, 1).collect { case Seq(a, b) if a == b => a }
println(foo(it).toList) // don't care about duplicates
// List(aaa, cc, dd, ee, ee, ee, ee, aaa, ff, aaa)
println(foo(it).toSet) // throws away duplicates but doesn't keep order
// Set(cc, aaa, ee, ff, dd)
println(foo(it).to[ListSet]) // throws away duplicates and keeps order (Scala 2.12 syntax; 2.13 uses .to(ListSet))
// ListSet(aaa, cc, dd, ee, ff)
// oh... and keep the result longer than 5 items while testing.
// Small Scala collections (e.g. Sets) behave a bit differently below this limit (they keep order),
// so just test with slightly bigger sequences :)
https://scalafiddle.io/sf/w5yozTA/1

Here is a solution with an Accumulator:
case class Acc(word: String = "", count: Int = 0, index: Int = 0)
def foo(in: Iterator[String]) =
  in.zipWithIndex
    .foldLeft(List(Acc())) { case (Acc(w, c, i) :: xs, (word: String, index)) =>
      if (word == w) // keep counting
        Acc(w, c + 1, i) :: xs
      else
        Acc(word, 1, index) :: Acc(w, c, i) :: xs
    }.filter(_.count > 1)
    .reverse
val it = Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "aaa", "aaa", "aaa", "aaa")
Calling foo(it) returns List(Acc(aaa,2,0), Acc(cc,2,3), Acc(aaa,4,7))
It also handles if the same word has another group with repeated words.
And you have the index of the occurrences as well as the count.
Let me know if you need more explanation.

Here's a solution that uses only the original iterator. No intermediate collections. So everything stays completely lazy and is suitable for very large input data.
def foo(in: Iterator[String]): Iterator[String] =
  Iterator.unfold(in.buffered) { itr => // <--- Scala 2.13
    def loop: Option[String] =
      if (!itr.hasNext) None
      else {
        val str = itr.next()
        if (!itr.hasNext) None
        else if (itr.head == str) {
          while (itr.hasNext && itr.head == str) itr.next() // remove repeats
          Some(str)
        }
        else loop
      }
    loop.map(_ -> itr)
  }
Testing:
val it = Iterator("aaa", "aaa", "aaa", "bb", "cc", "cc", "bb", "dd")
foo(it) // Iterator("aaa", "cc")
//pseudo-infinite iterator
val piIt = Iterator.iterate(8)(_+1).map(_/3) //2,3,3,3,4,4,4,5,5,5, etc.
foo(piIt.map(_.toString)) //3,4,5,6, etc.
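For Scala versions before 2.13, where Iterator.unfold is not available, roughly the same idea can be sketched with a hand-written Iterator over a buffered input. This is not the answer's code: repeatedRuns is a hypothetical name, and the sketch additionally reports the start index and length of each run, which the question's P.S. asks about.

```scala
// Hypothetical helper: yields (word, start index, run length) for every run of
// length >= 2, pulling elements from the input only when a result is requested.
def repeatedRuns(in: Iterator[String]): Iterator[(String, Int, Int)] =
  new Iterator[(String, Int, Int)] {
    private val itr = in.buffered
    private var pos = 0                                   // index of the next unread word
    private var pending: Option[(String, Int, Int)] = None

    // Read runs until one of length >= 2 is found or the input ends.
    private def advance(): Unit =
      while (pending.isEmpty && itr.hasNext) {
        val word = itr.next()
        val start = pos
        pos += 1
        while (itr.hasNext && itr.head == word) { itr.next(); pos += 1 }
        if (pos - start > 1) pending = Some((word, start, pos - start))
      }

    def hasNext: Boolean = { advance(); pending.isDefined }

    def next(): (String, Int, Int) = {
      advance()
      val run = pending.getOrElse(throw new NoSuchElementException("no more repeated runs"))
      pending = None
      run
    }
  }

val found = repeatedRuns(Iterator("aaa", "aaa", "aaa", "bb", "cc", "cc", "bb", "dd")).toList
assert(found == List(("aaa", 0, 3), ("cc", 4, 2)))
```

Only the current word and a counter are held at any time, so memory use does not grow with the input. As with any such iterator, hasNext will not return on an infinite input that contains no further repeats.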

It's somewhat more complex than the other answers, but it uses relatively little additional memory, and it is probably faster.
def repeatedWordsIndex(in: Iterator[String]): java.util.Iterator[String] = {
  val initialCapacity = 4096
  val res = new java.util.ArrayList[String](initialCapacity) // or mutable.Buffer or mutable.Set, if you want Scala
  var prev: String = null
  var next: String = null
  var prevEquals = false // true while we are inside a run that was already recorded
  while (in.hasNext) {
    next = in.next()
    if (next == prev) {
      if (!prevEquals) res.add(prev)
      prevEquals = true
    } else {
      prevEquals = false
    }
    prev = next
  }
  res.iterator // you may need to deduplicate the result
}

You could traverse the collection using foldLeft with its accumulator being a Tuple of Map and String to keep track of the previous word for the conditional word counts, followed by a collect, as shown below:
def foo(in: Iterator[String]): Iterator[String] =
in.foldLeft((Map.empty[String, Int], "")){ case ((m, prev), word) =>
val count = if (word == prev) m.getOrElse(word, 0) + 1 else 1
(m + (word -> count), word)
}._1.
collect{ case (word, count) if count > 1 => word }.
iterator
foo(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd")).toList
// res1: List[String] = List("aaa", "cc")
To also capture the counts and indexes of the repeated words, just index the collection and apply a similar tactic for the conditional word count:
def bar(in: Iterator[String]): Map[(String, Int), Int] =
in.zipWithIndex.foldLeft((Map.empty[(String, Int), Int], "", 0)){
case ((m, pWord, pIdx), (word, idx)) =>
val idx1 = if (word == pWord) idx min pIdx else idx
val count = if (word == pWord) m.getOrElse((word, idx1), 0) + 1 else 1
(m + ((word, idx1) -> count), word, idx1)
}._1.
filter{ case ((_, _), count) => count > 1 }
bar(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "cc", "cc", "cc"))
// res2: Map[(String, Int), Int] = Map(("cc", 7) -> 3, ("cc", 3) -> 2, ("aaa", 0) -> 2)
UPDATE:
As per the revised requirement, to minimize memory usage, one approach would be to keep the Map to a minimal size by removing elements of count 1 (which would be the majority if few words are repeated) on-the-fly during the foldLeft traversal. Method baz below is a revised version of bar:
def baz(in: Iterator[String]): Map[(String, Int), Int] =
(in ++ Iterator("")).zipWithIndex.
foldLeft((Map.empty[(String, Int), Int], (("", 0), 0), 0)){
case ((m, pElem, pIdx), (word, idx)) =>
val sameWord = word == pElem._1._1
val idx1 = if (sameWord) idx min pIdx else idx
val count = if (sameWord) m.getOrElse((word, idx1), 0) + 1 else 1
val elem = ((word, idx1), count)
val newMap = m + ((word, idx1) -> count)
if (sameWord) {
(newMap, elem, idx1)
} else
if (pElem._2 == 1)
(newMap - pElem._1, elem, idx1)
else
(newMap, elem, idx1)
}._1.
filter{ case ((word, _), _) => word != "" }
baz(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "cc", "cc", "cc"))
// res3: Map[(String, Int), Int] = Map(("aaa", 0) -> 2, ("cc", 3) -> 2, ("cc", 7) -> 3)
Note that the dummy empty String appended to the input collection is to ensure that the last word gets properly processed as well.

Related

Conditional concatenation of iterator elements - A Scala idiomatic solution

I have an Iterator of Strings and would like to concatenate each element preceding one that matches a predicate, e.g. for an Iterator of
Iterator("a", "b", "c break", "d break", "e")
and a predicate of
!line.endsWith("break")
I would like to print out
(Group: 0): a-b-c break
(Group: 1): d break
(Group: 2): e
(without needing to hold in memory more than a single group at a time)
I know I can achieve this with an iterator like below, but there has to be a more "Scala" way of writing this, right?
import scala.collection.mutable.ListBuffer
object IteratingAndAccumulating extends App {
class AccumulatingIterator(lines: Iterator[String])extends Iterator[ListBuffer[String]] {
override def hasNext: Boolean = lines.hasNext
override def next(): ListBuffer[String] = getNextLine(lines, new ListBuffer[String])
def getNextLine(lines: Iterator[String], accumulator: ListBuffer[String]): ListBuffer[String] = {
val line = lines.next
accumulator += line
if (line.endsWith("break") || !lines.hasNext) accumulator
else getNextLine(lines, accumulator)
}
}
new AccumulatingIterator(Iterator("a", "b", "c break", "d break", "e"))
.map(_.mkString("-")).zipWithIndex.foreach{
case (conc, i) =>
println(s"(Group: $i): $conc")
}
}
many thanks,
Fil

Here is a simple solution if you don't mind loading the entire contents into memory at once (here predicate(x) is true when x is the last element of a group, e.g. when it ends with "break"):
val lines: List[List[String]] = it.foldLeft(List(List.empty[String])) {
  case (head :: tail, x) if predicate(x) => Nil :: (x :: head) :: tail
  case (head :: tail, x) => (x :: head) :: tail
}.dropWhile(_.isEmpty).map(_.reverse).reverse
If you would rather iterate through the strings and groups one by one, it gets a little more involved:
// first "instrument" the iterator, by "demarcating" group boundaries with None:
val instrumented: Iterator[Option[String]] = it.flatMap {
  case x if predicate(x) => Seq(Some(x), None)
  case x => Seq(Some(x))
}
// and now wrap it in another iterator that constructs the groups:
val lines: Iterator[Iterator[String]] = Iterator.continually {
  instrumented.takeWhile(_.nonEmpty).flatten
}.takeWhile(_.nonEmpty)
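For comparison, the same grouping can be sketched as a small generic iterator that holds only one group in memory at a time. This is an illustration rather than the answer's method; groupsUntil and isEnd are names invented here.

```scala
// Hypothetical helper: cuts `it` into groups, closing a group right after
// each element for which `isEnd` holds. Only the current group is kept in memory.
def groupsUntil[A](it: Iterator[A])(isEnd: A => Boolean): Iterator[List[A]] =
  new Iterator[List[A]] {
    def hasNext: Boolean = it.hasNext

    def next(): List[A] = {
      if (!it.hasNext) throw new NoSuchElementException("no more groups")
      val group = List.newBuilder[A]
      var closed = false
      while (!closed && it.hasNext) {
        val x = it.next()
        group += x
        closed = isEnd(x)
      }
      group.result()
    }
  }

val groups = groupsUntil(Iterator("a", "b", "c break", "d break", "e"))(_.endsWith("break"))
val report = groups.map(_.mkString("-")).zipWithIndex
  .map { case (conc, i) => s"(Group: $i): $conc" }
  .toList
assert(report == List("(Group: 0): a-b-c break", "(Group: 1): d break", "(Group: 2): e"))
```

Unlike the takeWhile-based version, this does not reuse an iterator after calling a method on it, which the Scala documentation advises against.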

How to get all possible partitions for a list in Scala

I have a list of strings, e.g. List("A", "B", "C"). I would like to get all the possible partitions of it in Scala. The result I expect is:
def func(xs: List[String]): List[List[List[String]]] = {
  // some operations
}
In: func(List("A", "B", "C"))
Out:
[
[["A"], ["B"], ["C"]],
[["A", "B"], ["C"]],
[["A", "C"], ["B"]],
[["B", "C"], ["A"]],
[["A", "B", "C"]]
]

This is a solution using Set:
def partitions[T](seq: TraversableOnce[T]): Set[Set[Set[T]]] = {
def loop(set: Set[T]): Set[Set[Set[T]]] =
if (set.size < 2) {
Set(Set(set))
} else {
set.subsets.filter(_.nonEmpty).flatMap(sub =>
loop(set -- sub).map(_ + sub - Set.empty)
).toSet
}
loop(seq.toSet)
}
Using Set makes the logic easier, but it does remove duplicate values if they are present in the original list. The same logic can be used for List, but you need to implement the set-like operations such as subsets.
Just for reference, here is an implementation using List which will preserve duplicates in the input list.
def partitions[T](list: List[T]): List[List[List[T]]] =
list match {
case Nil | _ :: Nil => // 0/1 elements
List(List(list))
case head :: tail => // 2+ elements
partitions(tail).flatMap(part => {
val joins =
part.indices.map(i =>
part.zipWithIndex.map { case (p, j) =>
if (i == j) {
head +: p
} else {
p
}
}
)
(List(head) +: part) +: joins
})
}
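A quick way to sanity-check a partition function is to count its results: a list of n distinct elements has as many partitions as the n-th Bell number (1, 2, 5, 15 for n = 1 to 4). The sketch below uses a compact restatement of the List-based recursion; it is written for this check only, and its empty-list base case (one partition with no blocks) differs slightly from the code above.

```scala
// Compact restatement of the recursion: the head either forms a new block
// or joins exactly one existing block of a partition of the tail.
def partitions[T](list: List[T]): List[List[List[T]]] = list match {
  case Nil => List(Nil)                        // one partition: no blocks at all
  case head :: tail =>
    partitions(tail).flatMap { part =>
      val joined = part.indices.map(i => part.updated(i, head :: part(i)))
      (List(head) :: part) :: joined.toList
    }
}

val abc = partitions(List("A", "B", "C"))
assert(abc.size == 5)                                    // Bell number B(3)
assert(abc.contains(List(List("A", "B", "C"))))          // everything in one block
assert(abc.contains(List(List("A"), List("B"), List("C"))))
assert(partitions(List(1, 2, 3, 4)).size == 15)          // Bell number B(4)
```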

Append auto-incrementing suffix to duplicated elements of a List

Given the following list :
val l = List("A", "A", "C", "C", "B", "C")
How can I add an auto-incrementing suffix to every element so that I end up with a list containing no duplicates, like the following (the ordering doesn't matter):
List("A0", "A1", "C0", "C1", "C2", "B0")

I found it out by myself just after writing this question:
val l = List("A", "A", "C", "C", "B", "C")
l.groupBy(identity) // Map(A->List(A,A),C->List(C,C,C),B->List(B))
.values.flatMap(_.zipWithIndex) // List((A,0),(A,1),(C,0),(C,1),(C,2),(B,0))
.map{ case (str, i) => s"$str$i"}
If there is a better solution (using foldLeft maybe), please let me know.
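One possible foldLeft variant is sketched below. It is only an illustration: withSuffix is a made-up name, and unlike the groupBy version it keeps the elements in their original order.

```scala
// Hypothetical helper: thread a map of "how many times seen so far" through
// the fold and use the current count as the suffix.
def withSuffix(xs: List[String]): List[String] = {
  val (_, out) = xs.foldLeft((Map.empty[String, Int], List.empty[String])) {
    case ((seen, acc), x) =>
      val n = seen.getOrElse(x, 0)
      (seen.updated(x, n + 1), s"$x$n" :: acc)   // prepend, reverse once at the end
  }
  out.reverse
}

assert(withSuffix(List("A", "A", "C", "C", "B", "C")) ==
  List("A0", "A1", "C0", "C1", "B0", "C2"))
```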

In a straightforward single pass:
import scala.collection.mutable
def transformList(list: List[String]): List[String] = {
  // next suffix to use for each element seen so far
  val buf: mutable.Map[String, Int] = mutable.Map.empty
  list.map { x =>
    val i = buf.getOrElseUpdate(x, 0)
    val result = s"$x$i"
    buf.put(x, i + 1)
    result
  }
}
transformList(List("A", "A", "C", "C", "B", "C"))

Perhaps not the most readable solution, but...
def appendCount(l: List[String]): List[String] = {
  // Since we're doing zero-based counting, we need to use `getOrElse(e, -1) + 1`
  // to indicate a first-time element count as 0.
  val counts =
    l.foldLeft(Map[String, Int]())((acc, e) =>
      acc + (e -> (acc.getOrElse(e, -1) + 1))
    )
  val (appendedList, _) =
    l.foldRight((List[String](), counts)) { case (e, (li, m)) =>
      // Prepend the element with its count to the accumulated list.
      // Decrement that element's count within the map of element counts.
      (s"$e${m(e)}" :: li, m + (e -> (m(e) - 1)))
    }
  appendedList
}
The idea here is that you create a count of each element in the list. You then iterate from the back of the list of original values and append the count to the value while decrementing the count map.
You need to define a helper here because foldRight will require both the new List[String] and the counts as an accumulator (and, as such, will return both). You'll just ignore the counts at the end (they'll all be -1 anyway).
I'd say your way is probably more clear. You'll need to benchmark to see which is faster if that's a concern.

Simple functionnal way for grouping successive elements? [duplicate]

I'm trying to 'group' a string into segments; I guess this example explains it more succinctly:
scala> val str: String = "aaaabbcddeeeeeeffg"
... (do something)
res0: List("aaaa","bb","c","dd","eeeeee","ff","g")
I can think of a few ways to do this in an imperative style (with vars and stepping through the string to find groups), but I was wondering if a better functional solution could be attained. I've been looking through the Scala API but there doesn't seem to be anything that fits my needs.
Any help would be appreciated.

You can split the string recursively with span:
def s(x: String): List[String] = if (x.size == 0) Nil else {
  val (l, r) = x.span(_ == x(0))
  l :: s(r)
}
Tail recursive:
@annotation.tailrec def s(x: String, y: List[String] = Nil): List[String] = {
  if (x.size == 0) y.reverse
  else {
    val (l, r) = x.span(_ == x(0))
    s(r, l :: y)
  }
}

It seems that all the other answers concentrate on collection operations, but a pure string + regex solution is much simpler:
str.split("""(?<=(\w))(?!\1)""").toList
In this regex I use a positive lookbehind that captures the previous char and a negative lookahead for that captured char, so the string is split wherever the next char differs from the previous one.

def group(s: String): List[String] = s match {
  case "" => Nil
  case s => s.takeWhile(_ == s.head) :: group(s.dropWhile(_ == s.head))
}
Edit: Tail recursive version:
def group(s: String, result: List[String] = Nil): List[String] = s match {
  case "" => result.reverse
  case s => group(s.dropWhile(_ == s.head), s.takeWhile(_ == s.head) :: result)
}
It can be used just like the other one, because the second parameter has a default value and thus doesn't have to be supplied.

Make it one-liner:
scala> val str = "aaaabbcddddeeeeefff"
str: java.lang.String = aaaabbcddddeeeeefff
scala> str.groupBy(identity).map(_._2)
res: scala.collection.immutable.Iterable[String] = List(eeeee, fff, aaaa, bb, c, dddd)
UPDATE:
As @Paul mentioned the order, here is an updated version:
scala> str.groupBy(identity).toList.sortBy(_._1).map(_._2)
res: List[String] = List(aaaa, bb, c, dddd, eeeee, fff)
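One thing to keep in mind with groupBy is that it collects all equal characters regardless of where they occur, so it matches a run-based split only when each character appears in a single run (and sorting by character restores the order only when the runs happen to be in alphabetical order). A small sketch of the difference, where runs is a hypothetical span-based helper and the input "aabaa" is chosen for illustration:

```scala
// Hypothetical helper: split a string into runs of equal adjacent characters.
def runs(s: String): List[String] =
  if (s.isEmpty) Nil
  else {
    val (run, rest) = s.span(_ == s.head)
    run :: runs(rest)
  }

val input = "aabaa"

// groupBy merges both runs of 'a' into one group.
val grouped = input.groupBy(identity).toList.sortBy(_._1).map(_._2.mkString)
assert(grouped == List("aaaa", "b"))

// The span-based split keeps the two runs of 'a' apart.
assert(runs(input) == List("aa", "b", "aa"))
```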

You could use some helper functions like this:
val str = "aaaabbcddddeeeeefff"
def zame(chars:List[Char]) = chars.partition(_==chars.head)
def q(chars:List[Char]):List[List[Char]] = chars match {
case Nil => Nil
case rest =>
val (thesame,others) = zame(rest)
thesame :: q(others)
}
q(str.toList) map (_.mkString)
This should do the trick, right? No doubt it can be cleaned up into one-liners even further

A functional* solution using fold:
def group(s : String) : Seq[String] = {
s.tail.foldLeft(Seq(s.head.toString)) { case (carry, elem) =>
if ( carry.last(0) == elem ) {
carry.init :+ (carry.last + elem)
}
else {
carry :+ elem.toString
}
}
}
There is a lot of cost hidden in all those sequence operations performed on strings (via implicit conversion). I guess the real complexity depends heavily on the kind of Seq the strings are converted to.
(*) As far as I know, all or most operations in the collection library depend on iterators, which in my opinion are an inherently non-functional concept. But the code looks functional, at least.

Starting with Scala 2.13, List provides the unfold builder, which can be combined with String::span:
List.unfold("aaaabbaaacdeeffg") {
case "" => None
case rest => Some(rest.span(_ == rest.head))
}
// List[String] = List("aaaa", "bb", "aaa", "c", "d", "ee", "ff", "g")
or alternatively, coupled with Scala 2.13's Option#unless builder:
List.unfold("aaaabbaaacdeeffg") {
rest => Option.unless(rest.isEmpty)(rest.span(_ == rest.head))
}
// List[String] = List("aaaa", "bb", "aaa", "c", "d", "ee", "ff", "g")
Details:
Unfold (def unfold[A, S](init: S)(f: (S) => Option[(A, S)]): List[A]) is based on an internal state (init) which is initialized in our case with "aaaabbaaacdeeffg".
For each iteration, we span (def span(p: (Char) => Boolean): (String, String)) this internal state in order to find the prefix containing the same symbol and produce a (String, String) tuple which contains the prefix and the rest of the string. span is very convenient in this context, as it produces exactly what unfold expects: a tuple containing the next element of the list and the new internal state.
The unfolding stops when the internal state is "" in which case we produce None as expected by unfold to exit.

Edit: I should have read more carefully. The code below is not functional.
Sometimes, a little mutable state helps:
def group(s: String) = {
  var tmp = ""
  val b = Seq.newBuilder[String]
  s.foreach { c =>
    if (tmp != "" && tmp.head != c) {
      b += tmp
      tmp = ""
    }
    tmp += c
  }
  b += tmp
  b.result()
}
The runtime is O(n) if segments have at most constant length; tmp += c probably creates the most overhead, since it copies the string each time. Use a string builder instead for a strict O(n) runtime.
group("aaaabbcddeeeeeeffg")
> Seq[String] = List(aaaa, bb, c, dd, eeeeee, ff, g)
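The string-builder variant suggested above could look roughly like the following sketch. It is one possible reading of that suggestion, not the answer's own code; it also returns an empty list for an empty input instead of a list containing one empty string.

```scala
// Same single pass as above, but the current run is accumulated in a
// StringBuilder, so appending a character does not copy the run so far.
def group(s: String): List[String] = {
  val out = List.newBuilder[String]
  val cur = new StringBuilder
  s.foreach { c =>
    if (cur.nonEmpty && cur.charAt(0) != c) {
      out += cur.toString   // the run ended: emit it
      cur.clear()
    }
    cur += c
  }
  if (cur.nonEmpty) out += cur.toString   // emit the final run, if any
  out.result()
}

assert(group("aaaabbcddeeeeeeffg") == List("aaaa", "bb", "c", "dd", "eeeeee", "ff", "g"))
```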

If you want to use scala API you can use the built in function for that:
str.groupBy(c => c).values
Or, if you want it sorted and in a list:
str.groupBy(c => c).values.toList.sorted

Combining multiple Lists of arbitrary length

I am looking for an approach to join multiple Lists in the following manner:
ListA a b c
ListB 1 2 3 4
ListC + # * § %
..
..
..
Resulting List: a 1 + b 2 # c 3 * 4 § %
In words: the elements are taken in sequential order, starting with the first list, and combined into the resulting list. There could be an arbitrary number of input lists, varying in length.
I tried several approaches with variants of zip and sliding iterators, but none worked, and in particular none handled the varying list lengths. There has to be an elegant way in Scala ;)

val lists = List(ListA, ListB, ListC)
lists.flatMap(_.zipWithIndex).sortBy(_._2).map(_._1)
It's pretty self-explanatory. It just zips each value with its position on its respective list, sorts by index, then pulls the values back out.
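Spelled out on the question's data, the approach can be sketched as follows. The list names are taken from the question, with all elements written as strings for simplicity. It relies on sortBy being a stable sort, so elements with the same index keep the order of their source lists.

```scala
val listA = List("a", "b", "c")
val listB = List("1", "2", "3", "4")
val listC = List("+", "#", "*", "§", "%")
val lists = List(listA, listB, listC)

// Pair every element with its position in its own list, sort by that
// position (stable, so list order is kept within one position), then
// drop the positions again.
val merged = lists.flatMap(_.zipWithIndex).sortBy(_._2).map(_._1)

assert(merged == List("a", "1", "+", "b", "2", "#", "c", "3", "*", "4", "§", "%"))
```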

Here's how I would do it:
class ListTests extends FunSuite {
test("The three lists from his example") {
val l1 = List("a", "b", "c")
val l2 = List(1, 2, 3, 4)
val l3 = List("+", "#", "*", "§", "%")
// All lists together
val l = List(l1, l2, l3)
// Max length of a list (to pad the shorter ones)
val maxLen = l.map(_.size).max
// Wrap the elements in Option and pad with None
val padded = l.map { list => list.map(Some(_)) ++ Stream.continually(None).take(maxLen - list.size) }
// Transpose
val trans = padded.transpose
// Flatten the lists then flatten the options
val result = trans.flatten.flatten
// Voilà
assert(List("a", 1, "+", "b", 2, "#", "c", 3, "*", 4, "§", "%") === result)
}
}

Here's an imperative solution if efficiency is paramount:
def combine[T](xss: List[List[T]]): List[T] = {
val b = List.newBuilder[T]
var its = xss.map(_.iterator)
while (!its.isEmpty) {
its = its.filter(_.hasNext)
its.foreach(b += _.next)
}
b.result
}

You can use padTo, transpose, and flatten to good effect here:
lists.map(_.map(Some(_)).padTo(lists.map(_.length).max, None)).transpose.flatten.flatten

Here's a small recursive solution.
def flatList(lists: List[List[Any]]) = {
def loop(output: List[Any], xss: List[List[Any]]): List[Any] = (xss collect { case x :: xs => x }) match {
case Nil => output
case heads => loop(output ::: heads, xss.collect({ case x :: xs => xs }))
}
loop(List[Any](), lists)
}
And here is a simple streams approach which can cope with an arbitrary sequence of sequences, each of potentially infinite length.
def flatSeqs[A](ssa: Seq[Seq[A]]): Stream[A] = {
def seqs(xss: Seq[Seq[A]]): Stream[Seq[A]] = xss collect { case xs if !xs.isEmpty => xs.head } match {
case Nil => Stream.empty
case heads => heads #:: seqs(xss collect { case xs if !xs.isEmpty => xs.tail })
}
seqs(ssa).flatten
}
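On Scala 2.13, where Stream is deprecated in favour of LazyList, a comparable version might look like the sketch below. It assumes the inputs are themselves LazyLists so that infinite inputs are possible; flatSeqs keeps the name used above, while rounds and live are names made up here.

```scala
// Assumes Scala 2.13+. Each "round" takes the head of every input that is
// still non-empty; the rounds are produced lazily and then flattened.
def flatSeqs[A](ssa: Seq[LazyList[A]]): LazyList[A] = {
  def rounds(xss: Seq[LazyList[A]]): LazyList[Seq[A]] = {
    val live = xss.filter(_.nonEmpty)          // forces at most the head of each input
    if (live.isEmpty) LazyList.empty
    else live.map(_.head) #:: rounds(live.map(_.tail))
  }
  rounds(ssa).flatten
}

// Finite inputs of different lengths
val finite = flatSeqs(Seq(LazyList("a", "b", "c"), LazyList("1", "2", "3", "4"), LazyList("+", "#", "*", "§", "%")))
assert(finite.toList == List("a", "1", "+", "b", "2", "#", "c", "3", "*", "4", "§", "%"))

// Infinite inputs: only as much as is requested gets evaluated
val firstFour = flatSeqs(Seq(LazyList.from(0), LazyList.from(100))).take(4).toList
assert(firstFour == List(0, 100, 1, 101))
```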

Here's something short but not exceedingly efficient:
def heads[A](xss: List[List[A]]) = xss.map(_.splitAt(1)).unzip
def interleave[A](xss: List[List[A]]) = Iterator.
iterate(heads(xss)){ case (_, tails) => heads(tails) }.
map(_._1.flatten).
takeWhile(! _.isEmpty).
flatten.toList

Here's a recursive solution that's O(n). The accepted solution (using sort) is O(n log n). Some testing I've done suggests the second solution, using transpose, is also O(n log n) due to the implementation of transpose. The use of reverse below looks suspicious (since it's an O(n) operation itself), but convince yourself that it can't be called too often or on too-large lists.
def intercalate[T](lists: List[List[T]]) : List[T] = {
def intercalateHelper(newLists: List[List[T]], oldLists: List[List[T]], merged: List[T]): List[T] = {
(newLists, oldLists) match {
case (Nil, Nil) => merged
case (Nil, zss) => intercalateHelper(zss.reverse, Nil, merged)
case (Nil::xss, zss) => intercalateHelper(xss, zss, merged)
case ( (y::ys)::xss, zss) => intercalateHelper(xss, ys::zss, y::merged)
}
}
intercalateHelper(lists, List.empty, List.empty).reverse
}
