I'm trying to take an iterator of Strings and turn it into an iterator of collections of strings, based on an arbitrary splitting function.
So say I have
val splitter: String => Boolean = s => s.isEmpty
then I want it to take
val data = List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator
and have
def f[A] (input: Iterator[A], splitFcn: A => Boolean): Iterator[X[A]]
where X can be any collection-like class you want, so long as it can be converted into a Seq, such that
f(data, splitter).foreach(x => println(x.toList))
outputs
List("abc", "def")
List("ghi", "jkl", "mno")
List("pqr")
Is there a clean way to do this, that does not require collecting the results of the input iterator entirely into memory?
This should do what you want:
val splitter: String => Boolean = s => s.isEmpty
val data = List("abc", "def", "", "ghi", "jkl", "", "mno", "pqr")
def splitList[A](l: List[A], p: A => Boolean):List[List[A]] = {
l match {
case Nil => Nil
case _ =>
val (h, t) = l.span(a => !p(a))
h :: splitList(t.drop(1), p)
}
}
println(splitList(data, splitter))
//prints List(List(abc, def), List(ghi, jkl), List(mno, pqr))
Here it is:
scala> val data = List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator
data: Iterator[String] = non-empty iterator
scala> val splitter: String => Boolean = s => s.isEmpty
splitter: String => Boolean = <function1>
scala> def f[A](in: Iterator[A], sf: A => Boolean): Iterator[Iterator[A]] =
| in.hasNext match {
| case false => Iterator()
| case true => Iterator(in.takeWhile(x => !sf(x))) ++ f(in, sf)
| }
f: [A](in: Iterator[A], sf: A => Boolean)Iterator[Iterator[A]]
scala> f(data, splitter) foreach (x => println(x.toList))
List(abc, def)
List(ghi, jkl, mno)
List(pqr)
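A variation on the iterator approach, as a minimal sketch: the Scala documentation says an iterator should not be used again after calling a method such as takeWhile on it, so the version below reads the input directly and builds one group at a time. The name splitIterator and its curried signature are my own choices for illustration, not part of the answer above. Only the group currently being built is held in memory.

```scala
// Hypothetical helper (name and signature are illustrative).
// Emits one List per run of non-separator elements; the separator itself is dropped.
def splitIterator[A](in: Iterator[A])(isSeparator: A => Boolean): Iterator[List[A]] =
  new Iterator[List[A]] {
    def hasNext: Boolean = in.hasNext

    def next(): List[A] = {
      if (!in.hasNext) throw new NoSuchElementException("next on empty iterator")
      val group = List.newBuilder[A]
      var closed = false
      // Consume elements until a separator or the end of the input is reached.
      while (!closed && in.hasNext) {
        val x = in.next()
        if (isSeparator(x)) closed = true else group += x
      }
      group.result()
    }
  }

val groups = splitIterator(
  List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator
)(_.isEmpty).toList
```

Note that a leading separator or two consecutive separators produce an empty group with this sketch; filter those out if you do not want them.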
UPDATE #2: Travis Brown answered another question using Scalaz-streams, an interesting package that might be helpful to you here. I am just starting to look at the package, but was quickly able to use it to read data from a file containing this:
abc
def

ghi
jkl
mno

pqr
and produce another file that looked like this:
Vector(abc, def, )
Vector(ghi, jkl, mno, )
Vector(pqr)
The library only holds the Vector being accumulated in memory. Here's my code (which should be considered dangerous, as I barely know anything about Scalaz-streams):
import scalaz.stream._
io.linesR("/tmp/a")
.pipe( process1.chunkBy(_.nonEmpty) )
.map( _.toString + "\n" )
.pipe(text.utf8Encode)
.to( io.fileChunkW("/tmp/b") )
.run.run
Key to your task is the chunkBy(_.nonEmpty), which accumulates lines into a Vector until it hits an empty line. I have no idea at this point why you have to say run twice.
Old stuff below.
UPDATE #1: Ah! I just discovered the new constraint that it not all be read into memory. This solution isn't for you, then; you'd want Iterators or Streams.
I'm guessing that you'd want to enrich Traversable. And with the function in a separate argument list, the compiler can infer the types. For performance you would probably only want to make one pass over the data. And to avoid crashing with large datasets (and for performance), you wouldn't want any recursion that is not tail-recursion. Given this enricher:
import scala.annotation.tailrec
implicit class EnrichedTraversable[A]( val xs:Traversable[A] ) extends AnyVal {
def splitWhere( f: A => Boolean ) = {
@tailrec
def loop( xs:Traversable[A], group:Seq[A], groups:Seq[Seq[A]] ):Seq[Seq[A]] =
if ( xs.isEmpty ) {
groups :+ group
} else {
val x = xs.head
val rest = xs.tail
if ( f(x) ) loop( rest, Vector(), groups :+ group )
else loop( rest, group :+ x, groups )
}
loop( xs, Vector(), Vector() )
}
}
you can do this:
List("a","b","","c","d") splitWhere (_.isEmpty)
Here are some tests you might want to check out, to be sure the semantics are what you want (I personally like splits to behave this way):
val xs = List("a","b","","d","e","","f","g") //> xs : List[String] = List(a, b, "", d, e, "", f, g)
xs splitWhere (_.isEmpty) //> res0: Seq[Seq[String]] = Vector(Vector(a, b), Vector(d, e), Vector(f, g))
List("a","b","") splitWhere (_.isEmpty) //> res1: Seq[Seq[String]] = Vector(Vector(a, b), Vector())
List("") splitWhere (_.isEmpty) //> res2: Seq[Seq[String]] = Vector(Vector(), Vector())
List[String]() splitWhere (_.isEmpty) //> res3: Seq[Seq[String]] = Vector(Vector())
Vector("a","b","","c") splitWhere (_.isEmpty) //> res4: Seq[Seq[String]] = Vector(Vector(a, b), Vector(c))
I think Stream is what you want, since a Stream is evaluated lazily (not everything has to be in memory at once).
def split[A](inputStream: Stream[A], splitter: A => Boolean): Stream[List[A]] = {
var accumulationList: List[A] = Nil
def loop(inputStream: Stream[A]): Stream[List[A]] = {
if (inputStream.isEmpty) {
if (accumulationList.isEmpty)
Stream.empty[List[A]]
else
accumulationList.reverse #:: Stream.empty[List[A]]
} else if (splitter(inputStream.head)) {
val outputList = accumulationList.reverse
accumulationList = Nil
if (outputList.isEmpty)
loop(inputStream.tail)
else
outputList #:: loop(inputStream.tail)
} else {
accumulationList = inputStream.head :: accumulationList
loop(inputStream.tail)
}
}
loop(inputStream)
}
val splitter = { s: String => s.isEmpty }
val list = List("asdf", "aa", "", "fw", "", "wfwf", "", "")
val stream = split(list.toStream, splitter)
stream foreach println
The output is:
List(asdf, aa)
List(fw)
List(wfwf)
EDIT:
I have not looked at it in detail, but I guess my recursive method loop could be replaced by a foldLeft or foldRight.
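For what it's worth, here is a rough sketch of what a foldRight version might look like. It is an assumption on my part rather than a drop-in replacement: it works on a strict List, so it gives up the laziness of the Stream version, and the name splitFold is made up for this example.

```scala
// Hypothetical foldRight variant (strict, not lazy).
// Walks the list from the right, starting with one empty group.
def splitFold[A](xs: List[A])(isSeparator: A => Boolean): List[List[A]] =
  xs.foldRight(List(List.empty[A])) {
    case (x, groups) if isSeparator(x) => Nil :: groups            // start a new group
    case (x, current :: done)          => (x :: current) :: done   // prepend to the current group
    case (x, Nil)                      => List(List(x))            // unreachable: groups is never empty
  }.filter(_.nonEmpty) // drop the empty groups, like the Stream version does

val folded = splitFold(List("asdf", "aa", "", "fw", "", "wfwf", "", ""))(_.isEmpty)
```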
Related
Is there a way to split a list of strings like the following:
List("lorem", "ipsum" ,"X", "sit", "amet", "consectetur")
at every element that matches a predicate such as x => x.equals("X"), so that the result would be:
List(List("lorem", "ipsum"), List("sit", "amet", "consectetur"))
Ideally in a simple, functional way?
The unfold() (Scala 2.13.x) way.
val lst =
List("X","lorem","ipsum","X","sit","amet","X","consectetur","X")
List.unfold(lst){st =>
Option.when(st.nonEmpty){
val (nxt, rst) = st.span(_ != "X")
(nxt, rst.drop(1))
}
}
//res0: List[List[String]] = List(List()
// , List(lorem, ipsum)
// , List(sit, amet)
// , List(consectetur))
You can use a tail-recursive function to easily keep track of each chunk like this:
def splitEvery[A](data: List[A])(p: A => Boolean): List[List[A]] = {
@annotation.tailrec
def loop(remaining: List[A], currentChunk: List[A], acc: List[List[A]]): List[List[A]] =
remaining match {
case a :: tail =>
if (p(a))
loop(
remaining = tail,
currentChunk = List.empty,
currentChunk.reverse :: acc
)
else
loop(
remaining = tail,
a :: currentChunk,
acc
)
case Nil =>
(currentChunk.reverse :: acc).reverse
}
loop(
remaining = data,
currentChunk = List.empty,
acc = List.empty
)
}
Which can be used like this:
val data = List("lorem", "ipsum" ,"X", "sit", "amet", "consectetur")
val result = splitEvery(data)(_ == "X")
// result: List[List[String]] = List(List(lorem, ipsum), List(sit, amet, consectetur))
You can see the code running here.
I am currently working on a function that takes a Map[String, List[String]] and a String as arguments. The map contains user IDs and the IDs of the films each user liked. What I need to do is return a List[List[String]] containing the other movies that were liked by the users who liked the movie passed into the function.
The function declaration looks as follows:
def movies(m: Map[String, List[String]], mov: String) : List[List[String]]= {
}
So let's imagine the following:
val m1: Map[String, List[String]] = Map("1" -> List("b", "a"), "2" -> List("y", "x"), "3" -> List("c", "a"))
val movieID = "a"
movies(m1, movieID)
This should return:
List(List("b"), List("c"))
I have tried using
m1.filter(x => x._2.contains(movieID))
So that only Lists containing movieID are kept in the map, but my problem is that I need to remove movieID from every list it occurs in, and then return the result as a List[List[String]].
You could use collect:
val m = Map("1" -> List("b", "a"), "2" -> List("y", "x"), "3" -> List("c", "a"))
def movies(m: Map[String, List[String]], mov: String) = m.collect {
case (_, l) if l.contains(mov) => l.filterNot(_ == mov)
}
movies(m, "a") //List(List(b), List(c))
The problem with this approach is that it iterates over every movie list twice: first with contains and then with filterNot. We could optimize it with a tail-recursive function that looks for the element and, if it is found, returns the list without it:
import scala.annotation.tailrec
def movies(m: Map[String, List[String]], mov: String) = {
@tailrec
def withoutElement[T](l: List[T], mov: T, acc: List[T] = Nil): Option[List[T]] = {
l match {
case x :: xs if x == mov => Some(acc.reverse ++ xs)
case x :: xs => withoutElement(xs, mov, x :: acc)
case Nil => None
}
}
m.values.flatMap(withoutElement(_, mov))
}
The solution from Krzysztof is a good one. Here's an alternate way to traverse every List just once.
def movies(m: Map[String, List[String]], mov: String) =
m.values.toList.flatMap{ss =>
val tpl = ss.foldLeft((false, List.empty[String])){
case ((_,res), `mov`) => (true, res)
case ((keep,res), str) => (keep, str::res)
}
if (tpl._1) Some(tpl._2) else None
}
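Another single-pass option, sketched here under the assumption that the order of the remaining movies should be kept as-is: partition splits each list into the matching and non-matching elements in one traversal. The name moviesLikedWith is only used for this example.

```scala
// Hypothetical variant using partition: one traversal per list,
// and the remaining movies keep their original order.
def moviesLikedWith(m: Map[String, List[String]], mov: String): List[List[String]] =
  m.values.toList.flatMap { liked =>
    val (hits, others) = liked.partition(_ == mov)
    if (hits.nonEmpty) Some(others) else None // keep only users who liked mov
  }

val sample = Map("1" -> List("b", "a"), "2" -> List("y", "x"), "3" -> List("c", "a"))
val likedWithA = moviesLikedWith(sample, "a")
```

Unlike withoutElement above, this removes every occurrence of mov from a list, not just the first one.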
This should work for you:
object DemoAbc extends App {
val m1 = Map(1 -> List("b", "a"), 2 -> List("y", "x"), 3 -> List("c", "a"))
val movieID = "a"
def movies(m: Map[Int, List[String]], mov: String): List[List[String]] = {
val ans = m.foldLeft(List.empty[List[String]])((a: List[List[String]], b: (Int, List[String])) => {
if (b._2.contains(mov))
b._2.filter(_ != mov) :: a
else a
})
ans
}
print(movies(m1, movieID))
}
I am having trouble calculating the sum of elements in Scala that have the same title (my key in this case).
Currently my input can be described as:
val listInput1 =
List(
"itemA,CATA,2,4 ",
"itemA,CATA,3,1 ",
"itemB,CATB,4,5",
"itemB,CATB,4,6"
)
val listInput2 =
List(
"itemA,CATA,2,4 ",
"itemB,CATB,4,5",
"itemC,CATC,1,2"
)
The required output for lists in input should be
val listoutput1 =
List(
"itemA,CATA,5,5 ",
"itemB,CATB,8,11"
)
val listoutput2 =
List(
"itemA , CATA, 2,4 ",
"itemB,CATB,4,5",
"itemC,CATC,1,2"
)
I wrote the following function:
def sumByTitle(listInput: List[String]): List[String] =
listInput.map(_.split(",")).groupBy(_(0)).map {
case (title, features) =>
"%s,%s,%d,%d".format(
title,
features.head.apply(1),
features.map(_(2).toInt).sum,
features.map(_(3).toInt).sum)}.toList
It doesn't give me the expected result as it changes the order of lines.
How can I fix that?
The ListMap is designed to preserve the order of items inserted to the Map.
import collection.immutable.ListMap
def sumByTitle(listInput: List[String]): List[String] = {
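val itemPttrn = raw"(.*,)(\d+),(\d+)\s*".r // the comma inside the first group stops the greedy .* from swallowing digits of a multi-digit number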
listInput.foldLeft(ListMap.empty[String, (Int,Int)].withDefaultValue((0,0))) {
case (lm, str) =>
val itemPttrn(k, a, b) = str //unsafe
val (x, y) = lm(k)
lm.updated(k, (a.toInt + x, b.toInt + y))
}.toList.map { case (k, (a, b)) => s"$k$a,$b" }
}
This is a bit unsafe in that it will throw if the input string doesn't match the regex pattern.
sumByTitle(listInput1)
//res0: List[String] = List(itemA,CATA,5,5, itemB,CATB,8,11)
sumByTitle(listInput2)
//res1: List[String] = List(itemA,CATA,2,4, itemB,CATB,4,5, itemC,CATC,1,2)
You'll note that the trailing space, if there is one, is not preserved.
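If you would rather keep the groupBy from the question, one possible sketch is to remember the order in which titles first appear and look the groups up in that order. The name sumByTitleOrdered is mine, and the sketch assumes every line has exactly four comma-separated fields with integers in the last two; it will throw otherwise.

```scala
// Hypothetical variant: groupBy for the sums, distinct titles for the order.
def sumByTitleOrdered(lines: List[String]): List[String] = {
  val rows = lines.map(_.split(",").map(_.trim)) // trim handles stray spaces
  val grouped = rows.groupBy(_(0))               // title -> all rows with that title
  rows.map(_(0)).distinct.map { title =>         // titles in first-seen order
    val g = grouped(title)
    s"$title,${g.head(1)},${g.map(_(2).toInt).sum},${g.map(_(3).toInt).sum}"
  }
}

val summed1 = sumByTitleOrdered(
  List("itemA,CATA,2,4 ", "itemA,CATA,3,1 ", "itemB,CATB,4,5", "itemB,CATB,4,6")
)
val summed2 = sumByTitleOrdered(
  List("itemA,CATA,2,4 ", "itemB,CATB,4,5", "itemC,CATC,1,2")
)
```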
If you are just interested in sorting you can simply return the sorted list:
val listInput1 =
List(
"itemA , CATA, 2,4 ",
"itemA , CATA, 3,1 ",
"itemB,CATB,4,5",
"itemB,CATB,4,6"
)
val listInput2 =
List(
"itemA , CATA, 2,4 ",
"itemB,CATB,4,5",
"itemC,CATC,1,2"
)
def sumByTitle(listInput: List[String]): List[String] =
listInput.map(_.split(",")).groupBy(_(0)).map {
case (title, features) =>
"%s,%s,%d,%d".format(
title,
features.head.apply(1),
features.map(_(2).trim.toInt).sum,
features.map(_(3).trim.toInt).sum)}.toList.sorted
println("LIST 1")
sumByTitle(listInput1).foreach(println)
println("LIST 2")
sumByTitle(listInput2).foreach(println)
You can find the code on Scastie for you to play around with.
As a side note, you may be interested in separating the serialization and deserialization from your business logic.
Here you can find another Scastie notebook with a relatively naive approach for a first step towards separating concerns.
def foldByTitle(listInput: List[String]): List[Item] =
listInput.map(Item.parseItem).foldLeft(List.empty[Item])(sumByTitle)
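val sumByTitle: (List[Item], Item) => List[Item] = (acc, curr) =>
  if (acc.exists(_.name == curr.name)) {
    // Update the existing item in place so that its position is kept.
    acc.map(i => if (i.name == curr.name) i.copy(num1 = i.num1 + curr.num1, num2 = i.num2 + curr.num2) else i)
  } else {
    acc :+ curr // first occurrence: append, so the input order is preserved
  }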
case class Item(name: String, category: String, num1: Int, num2: Int)
object Item {
def parseItem(serializedItem: String): Item = {
val itemTokens = serializedItem.split(",").map(_.trim)
Item(itemTokens.head, itemTokens(1), itemTokens(2).toInt, itemTokens(3).toInt)
}
}
This way the initial order of the elements is kept.
Slightly simplifying, my problem comes from a list of strings input that I want to parse with a function parse returning Either[String,Int].
Then list.map(parse) returns a list of Eithers. The next step in the program is to format an error message summing up all the errors or passing on the list of parsed integers.
Let's call the solution I'm looking for partitionEithers.
Calling
partitionEithers(List(Left("foo"), Right(1), Left("bar")))
Would give
(List("foo", "bar"),List(1))
Finding something like this in the standard library would be best. Failing that some kind of clean, idiomatic and efficient solution would be best. Also some kind of efficient utility function I could just paste into my projects would be ok.
I was very confused by these 3 earlier questions. As far as I can tell, none of those questions matches my case, but some answers there seem to contain valid answers to this question.
Scala collections offer a partition function:
val eithers: List[Either[String, Int]] = List(Left("foo"), Right(1), Left("bar"))
eithers.partition(_.isLeft) match {
case (leftList, rightList) =>
(leftList.map(_.left.get), rightList.map(_.right.get))
}
=> res0: (List[String], List[Int]) = (List(foo, bar),List(1))
UPDATE
If you want to wrap it in a (maybe even somewhat type safer) generic function:
import scala.reflect.ClassTag
def partitionEither[Left : ClassTag, Right : ClassTag](in: List[Either[Left, Right]]): (List[Left], List[Right]) =
in.partition(_.isLeft) match {
case (leftList, rightList) =>
(leftList.collect { case Left(l: Left) => l }, rightList.collect { case Right(r: Right) => r })
}
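If you are on Scala 2.13 or later, the standard library has partitionMap, which should cover this case directly without calling .left.get or .right.get. A minimal sketch (the value names are just for illustration):

```scala
// Requires Scala 2.13+: partitionMap sends Left values to the first
// collection and Right values to the second, in a single pass.
val parsed: List[Either[String, Int]] = List(Left("foo"), Right(1), Left("bar"))

val (errors, values) = parsed.partitionMap(identity)
```

Here errors is a List[String] and values is a List[Int], each in the original relative order.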
You could use separate from MonadPlus (scalaz) or MonadCombine (cats) :
import scala.util.{Either, Left, Right}
import scalaz.std.list._
import scalaz.std.either._
import scalaz.syntax.monadPlus._
val l: List[Either[String, Int]] = List(Right(1), Left("error"), Right(2))
l.separate
// (List[String], List[Int]) = (List(error),List(1, 2))
I don't really get the amount of contortions of the other answers. So here is a one liner:
scala> val es:List[Either[Int,String]] =
List(Left(1),Left(2),Right("A"),Right("B"),Left(3),Right("C"))
es: List[Either[Int,String]] = List(Left(1), Left(2), Right(A), Right(B), Left(3), Right(C))
scala> es.foldRight( (List[Int](), List[String]()) ) {
case ( e, (ls, rs) ) => e.fold( l => ( l :: ls, rs), r => ( ls, r :: rs ) )
}
res5: (List[Int], List[String]) = (List(1, 2, 3),List(A, B, C))
Here is an imperative implementation mimicking the style of Scala collection internals.
I wonder if there should be something like this in there, since I, at least, run into this from time to time.
import collection._
import generic._
def partitionEithers[L, R, E, I, CL, CR]
(lrs: I)
(implicit evI: I <:< GenTraversableOnce[E],
evE: E <:< Either[L, R],
cbfl: CanBuildFrom[I, L, CL],
cbfr: CanBuildFrom[I, R, CR])
: (CL, CR) = {
val ls = cbfl()
val rs = cbfr()
ls.sizeHint(lrs.size)
rs.sizeHint(lrs.size)
lrs.foreach { e => evE(e) match {
case Left(l) => ls += l
case Right(r) => rs += r
} }
(ls.result(), rs.result())
}
partitionEithers(List(Left("foo"), Right(1), Left("bar"))) == (List("foo", "bar"), List(1))
partitionEithers(Set(Left("foo"), Right(1), Left("bar"), Right(1))) == (Set("foo", "bar"), Set(1))
You can use foldRight.
def f(s: Seq[Either[String, Int]]): (Seq[String], Seq[Int]) = {
s.foldRight((Seq[String](), Seq[Int]())) { case (c, r) =>
c match {
case Left(le) => (le +: r._1, r._2)
case Right(ri) => (r._1 , ri +: r._2)
}
}
}
val eithers: List[Either[String, Int]] = List(Left("foo"), Right(1), Left("bar"))
scala> f(eithers)
res0: (Seq[String], Seq[Int]) = (List(foo, bar),List(1))
I'm trying to write some code to make it easy to chain functions that return Scalaz Validation types. One method I am trying to write is analogous to Validation.flatMap (short-circuiting the validation), which I will call andPipe. The other is analogous to |@| on ApplicativeBuilder (accumulating errors), except that it only returns the final Success type; I will call it andPass.
Suppose I have functions:
def allDigits: (String) => ValidationNEL[String, String]
def maxSizeOfTen: (String) => ValidationNEL[String, String]
def toInt: (String) => ValidationNEL[String, Int]
As an example, I would like to first pass the input String to both allDigits and maxSizeOfTen. If there are failures, it should short-circuit by not calling the toInt function and return either or both of the failures that occurred. If successful, I would like to pass the Success value to the toInt function. From there, it would either succeed with the output value being an Int, or fail, returning only the validation failure from toInt.
def intInput: (String) => ValidationNEL[String,Int] = (allDigits andPass maxSizeOfTen) andPipe toInt
Is there a way to do this without my add-on implementation below?
Here is my Implementation:
trait ValidationFuncPimp[E,A,B] {
val f: (A) => Validation[E, B]
/** If this validation passes, pass to f2, otherwise fail without accumulating. */
def andPipe[C](f2: (B) => Validation[E,C]): (A) => Validation[E,C] = (a: A) => {
f(a) match {
case Success(x) => f2(x)
case Failure(x) => Failure(x)
}
}
/** Run this validation and the other validation, Success only if both are successful. Fail accumulating errors. */
def andPass[D](f2: (A) => Validation[E,D])(implicit S: Semigroup[E]): (A) => Validation[E,D] = (a:A) => {
(f(a), f2(a)) match {
case (Success(x), Success(y)) => Success(y)
case (Failure(x), Success(y)) => Failure(x)
case (Success(x), Failure(y)) => Failure(y)
case (Failure(x), Failure(y)) => Failure(S.append(x, y))
}
}
}
implicit def toValidationFuncPimp[E,A,B](valFunc : (A) => Validation[E,B]): ValidationFuncPimp[E,A,B] = {
new ValidationFuncPimp[E,A,B] {
val f = valFunc
}
}
I'm not claiming that this answer is necessarily any better than drstevens's, but it takes a slightly different approach and wouldn't fit in a comment there.
First for our validation methods (note that I've changed the type of toInt a bit, for reasons I'll explain below):
import scalaz._, Scalaz._
def allDigits: (String) => ValidationNEL[String, String] =
s => if (s.forall(_.isDigit)) s.successNel else "Not all digits".failNel
def maxSizeOfTen: (String) => ValidationNEL[String, String] =
s => if (s.size <= 10) s.successNel else "Too big".failNel
def toInt(s: String) = try(s.toInt.right) catch {
case _: NumberFormatException => NonEmptyList("Still not an integer").left
}
I'll define a type alias for the sake of convenience:
type ErrorsOr[+A] = NonEmptyList[String] \/ A
Now we've just got a couple of Kleisli arrows:
val validator = Kleisli[ErrorsOr, String, String](
allDigits.flatMap(x => maxSizeOfTen.map(x *> _)) andThen (_.disjunction)
)
val integerizer = Kleisli[ErrorsOr, String, Int](toInt)
Which we can compose:
val together = validator >>> integerizer
And use like this:
scala> together("aaa")
res0: ErrorsOr[Int] = -\/(NonEmptyList(Not all digits))
scala> together("12345678900")
res1: ErrorsOr[Int] = -\/(NonEmptyList(Too big))
scala> together("12345678900a")
res2: ErrorsOr[Int] = -\/(NonEmptyList(Not all digits, Too big))
scala> together("123456789")
res3: ErrorsOr[Int] = \/-(123456789)
Using flatMap on something that isn't monadic makes me a little uncomfortable, and combining our two ValidationNEL methods into a Kleisli arrow in the \/ monad—which also serves as an appropriate model for our string-to-integer conversion—feels a little cleaner to me.
This is relatively concise with little "added code". It is still sort of wonky though because it ignores the successful result of applying allDigits.
scala> val validated = for {
| x <- allDigits
| y <- maxSizeOfTen
| } yield x *> y
validated: String => scalaz.Validation[scalaz.NonEmptyList[String],String] = <function1>
scala> val validatedToInt = (str: String) => validated(str) flatMap(toInt)
validatedToInt: String => scalaz.Validation[scalaz.NonEmptyList[String],Int] = <function1>
scala> validatedToInt("10")
res25: scalaz.Validation[scalaz.NonEmptyList[String],Int] = Success(10)
Alternatively you could keep both of the outputs of allDigits and maxSizeOfTen.
val validated2 = for {
x <- allDigits
y <- maxSizeOfTen
} yield x <|*|> y
I'm curious if someone else could come up with a better way to combine these. It's not really composition...
val validatedToInt = (str: String) => validated2(str) flatMap(_ => toInt(str))
Both validated and validated2 accumulate failures as shown below:
scala> def allDigits: (String) => ValidationNEL[String, String] = _ => failure(NonEmptyList("All Digits Fail"))
allDigits: String => scalaz.Scalaz.ValidationNEL[String,String]
scala> def maxSizeOfTen: (String) => ValidationNEL[String, String] = _ => failure(NonEmptyList("max > 10"))
maxSizeOfTen: String => scalaz.Scalaz.ValidationNEL[String,String]
scala> val validated = for {
| x <- allDigits
| y <- maxSizeOfTen
| } yield x *> y
validated: String => scalaz.Validation[scalaz.NonEmptyList[String],String] = <function1>
scala> val validated2 = for {
| x <- allDigits
| y <- maxSizeOfTen
| } yield x <|*|> y
validated2: String => scalaz.Validation[scalaz.NonEmptyList[String],(String, String)] = <function1>
scala> validated("ten")
res1: scalaz.Validation[scalaz.NonEmptyList[String],String] = Failure(NonEmptyList(All Digits Fail, max > 10))
scala> validated2("ten")
res3: scalaz.Validation[scalaz.NonEmptyList[String],(String, String)] = Failure(NonEmptyList(All Digits Fail, max > 10))
Use ApplicativeBuilder with the first two, so that the errors accumulate,
then flatMap toInt, so toInt only gets called if the first two succeed.
val validInt: String => ValidationNEL[String, Int] =
for {
validStr <- (allDigits |@| maxSizeOfTen)((x,_) => x);
i <- toInt
} yield(i)
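To make the intended semantics concrete without any Scalaz dependency, here is a rough sketch that models a validation as a plain Either with a List of error messages. This is an illustration only, not the Scalaz API: the alias V and the standalone andPass and andPipe functions are assumptions made for this example, mirroring the names from the question.

```scala
import scala.util.Try

// Stand-in for ValidationNEL[String, A]: Left carries the accumulated errors.
type V[A] = Either[List[String], A]

val allDigits: String => V[String] =
  s => if (s.forall(_.isDigit)) Right(s) else Left(List("Not all digits"))
val maxSizeOfTen: String => V[String] =
  s => if (s.length <= 10) Right(s) else Left(List("Too big"))
val toInt: String => V[Int] =
  s => Try(s.toInt).toOption.toRight(List("Still not an integer"))

// Runs both checks on the same input and accumulates errors from both.
def andPass[A, B, C](f: A => V[B], g: A => V[C]): A => V[C] = a =>
  (f(a), g(a)) match {
    case (Right(_), Right(y)) => Right(y)
    case (Left(e1), Left(e2)) => Left(e1 ++ e2)
    case (Left(e), _)         => Left(e)
    case (_, Left(e))         => Left(e)
  }

// Short-circuits: g is only called when f succeeds.
def andPipe[A, B, C](f: A => V[B], g: B => V[C]): A => V[C] =
  a => f(a).flatMap(g)

val intInput: String => V[Int] = andPipe(andPass(allDigits, maxSizeOfTen), toInt)
```

With this sketch, intInput("12345678900a") reports both errors, while toInt is never reached for that input.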