groupby scala list of string - scala

I am facing a problem to calculate the sum of elements in Scala having the same title (my key in this case).
Currently my input can be described as:
val listInput1 =
List(
"itemA,CATA,2,4 ",
"itemA,CATA,3,1 ",
"itemB,CATB,4,5",
"itemB,CATB,4,6"
)
val listInput2 =
List(
"itemA,CATA,2,4 ",
"itemB,CATB,4,5",
"itemC,CATC,1,2"
)
The required output for lists in input should be
val listoutput1 =
List(
"itemA,CATA,5,5 ",
"itemB,CATB,8,11"
)
val listoutput2 =
List(
"itemA , CATA, 2,4 ",
"itemB,CATB,4,5",
"itemC,CATC,1,2"
)
I wrote the following function:
def sumByTitle(listInput: List[String]): List[String] =
listInput.map(_.split(",")).groupBy(_(0)).map {
case (title, features) =>
"%s,%s,%d,%d".format(
title,
features.head.apply(1),
features.map(_(2).toInt).sum,
features.map(_(3).toInt).sum)}.toList
It doesn't give me the expected result as it changes the order of lines.
How can I fix that?

The ListMap is designed to preserve the order of items inserted to the Map.
import collection.immutable.ListMap
def sumByTitle(listInput: List[String]): List[String] = {
val itemPttrn = raw"(.*)(\d+),(\d+)\s*".r
listInput.foldLeft(ListMap.empty[String, (Int,Int)].withDefaultValue((0,0))) {
case (lm, str) =>
val itemPttrn(k, a, b) = str //unsafe
val (x, y) = lm(k)
lm.updated(k, (a.toInt + x, b.toInt + y))
}.toList.map { case (k, (a, b)) => s"$k$a,$b" }
}
This is a bit unsafe in that it will throw if the input string doesn't match the regex pattern.
sumByTitle(listInput1)
//res0: List[String] = List(itemA,CATA,5,5, itemB,CATB,8,11)
sumByTitle(listInput2)
//res1: List[String] = List(itemA,CATA,2,4, itemB,CATB,4,5, itemC,CATC,1,2)
You'll note that the trailing space, if there is one, is not preserved.

If you are just interested in sorting you can simply return the sorted list:
val listInput1 =
List(
"itemA , CATA, 2,4 ",
"itemA , CATA, 3,1 ",
"itemB,CATB,4,5",
"itemB,CATB,4,6"
)
val listInput2 =
List(
"itemA , CATA, 2,4 ",
"itemB,CATB,4,5",
"itemC,CATC,1,2"
)
def sumByTitle(listInput: List[String]): List[String] =
listInput.map(_.split(",")).groupBy(_(0)).map {
case (title, features) =>
"%s,%s,%d,%d".format(
title,
features.head.apply(1),
features.map(_(2).trim.toInt).sum,
features.map(_(3).trim.toInt).sum)}.toList.sorted
println("LIST 1")
sumByTitle(listInput1).foreach(println)
println("LIST 2")
sumByTitle(listInput2).foreach(println)
You can find the code on Scastie for you to play around with.
As a side note, you may be interested in separating the serialization and deserialization from your business logic.
Here you can find another Scastie notebook with a relatively naive approach for a first step towards separating concerns.

def foldByTitle(listInput: List[String]): List[Item] =
listInput.map(Item.parseItem).foldLeft(List.empty[Item])(sumByTitle)
val sumByTitle: (List[Item], Item) => List[Item] = (acc, curr) =>
acc.find(_.name == curr.name).fold(curr :: acc) { i =>
acc.filterNot(_.name == curr.name) :+ i.copy(num1 = i.num1 + curr.num1, num2 = i.num2 + curr.num2)
}
case class Item(name: String, category: String, num1: Int, num2: Int)
object Item {
def parseItem(serializedItem: String): Item = {
val itemTokens = serializedItem.split(",").map(_.trim)
Item(itemTokens.head, itemTokens(1), itemTokens(2).toInt, itemTokens(3).toInt)
}
}
This way the initial order of the elements to kept.

Related

how to get the index of the duplicate pair in the a list using scala

I have a Scala list below :
val numList = List(1,2,3,4,5,1,2)
I want to get index of the same element pair in the list. The output should look like (0,5),(1,6)
How can I achieve using map?
def catchDuplicates(num : List[Int]) : (Int , Int) = {
val count = 0;
val emptyMap: HashMap[Int, Int] = HashMap.empty[Int, Int]
for (i <- num)
if (emptyMap.contains(i)) {
emptyMap.put(i, (emptyMap.get(i)) + 1) }
else {
emptyMap.put(i, 1)
}
}
Let's make the challenge a little more interesting.
val numList = List(1,2,3,4,5,1,2,1)
Now the result should be something like (0, 5, 7),(1, 6), which makes it pretty clear that returning one or more tuples is not going to be feasible. Returning a List of List[Int] would make much more sense.
def catchDuplicates(nums: List[Int]): List[List[Int]] =
nums.zipWithIndex //List[(Int,Int)]
.groupMap(_._1)(_._2) //Map[Int,List[Int]]
.values //Iterable[List[Int]]
.filter(_.lengthIs > 1)
.toList //List[List[Int]]
You might also add a .view in order to minimize the number of traversals and intermediate collections created.
def catchDuplicates(nums: List[Int]): List[List[Int]] =
nums.view
.zipWithIndex
.groupMap(_._1)(_._2)
.collect{case (_,vs) if vs.sizeIs > 1 => vs.toList}
.toList
How can I achieve using map?
You can't.
Because you only want to return the indexes of the elements that appear twice; which is a very different kind of transformation than the one that map expects.
You can use foldLeft thought.
object catchDuplicates {
final case class Result[A](elem: A, firstIdx: Int, secondIdx: Int)
private final case class State[A](seenElements: Map[A, Int], duplicates: List[Result[A]]) {
def next(elem: A, idx: Int): State[A] =
seenElements.get(key = elem).fold(
ifEmpty = this.copy(seenElements = this.seenElements + (elem -> idx))
) { firstIdx =>
State(
seenElements = this.seenElements.removed(key = elem),
duplicates = Result(elem, firstIdx, secondIdx = idx) :: this.duplicates
)
}
}
private object State {
def initial[A]: State[A] =
State(
seenElements = Map.empty,
duplicates = List.empty
)
}
def apply[A](data: List[A]): List[Result[A]] =
data.iterator.zipWithIndex.foldLeft(State.initial[A]) {
case (acc, (elem, idx)) =>
acc.next(elem, idx)
}.duplicates // You may add a reverse here if order is important.
}
Which can be used like this:
val numList = List(1,2,3,4,5,1,2)
val result = catchDuplicates(numList)
// result: List[Result] = List(Result(2,1,6), Result(1,0,5))
You can see the code running here.
I think returning tuple is not a good option instead you should try Map like -
object FindIndexOfDupElement extends App {
val numList = List(1, 2, 3, 4, 5, 1, 2)
#tailrec
def findIndex(elems: List[Int], res: Map[Int, List[Int]] = Map.empty, index: Int = 0): Map[Int, List[Int]] = {
elems match {
case head :: rest =>
if (res.get(head).isEmpty) {
findIndex(rest, res ++ Map(head -> (index :: Nil)), index + 1)
} else {
val updatedMap: Map[Int, List[Int]] = res.map {
case (key, indexes) if key == head => (key, (indexes :+ index))
case (key, indexes) => (key, indexes)
}
findIndex(rest, updatedMap, index + 1)
}
case _ => res
}
}
println(findIndex(numList).filter(x => x._2.size > 1))
}
you can clearly see the number(key) and respective index in the map -
HashMap(1 -> List(0, 5), 2 -> List(1, 6))

concateneate lines having same title

I have the following issue
Given this list in input , I want to concateneate integers for each line having the same title,
val listIn= List("TitleB,Int,11,0",
"TitleB,Int,1,0",
"TitleB,Int,1,0",
"TitleB,Int,3,0",
"TitleA,STR,3,0",
"TitleC,STR,4,5")
I wrote the following function
def sumB(list: List[String]): List[String] = {
val itemPattern = raw"(.*)(\d+),(\d+)\s*".r
list.foldLeft(ListMap.empty[String, (Int,Int)].withDefaultValue((0,0))) {
case (line, stri) =>
val itemPattern(k,i,j) = stri
val (a, b) = line(k)
line.updated(k, (i.toInt + a, j.toInt + b))
}.toList.map { case (k, (i, j)) => s"$k$i,$j" }
}
Expected output would be:
List("TitleB,Int,16,0",
"TitleA,STR,3,0",
"TitleC,STR,4,5")
Since you are looking to preserve the order of the titles as they appear in the input data, I would suggest you to use LinkedHashMap with foldLeft as below
val finalResult = listIn.foldLeft(new mutable.LinkedHashMap[String, (String, String, Int, Int)]){ (x, y) => {
val splitted = y.split(",")
if(x.keySet.contains(Try(splitted(0)).getOrElse(""))){
val oldTuple = x(Try(splitted(0)).getOrElse(""))
x.update(Try(splitted(0)).getOrElse(""), (Try(splitted(0)).getOrElse(""), Try(splitted(1)).getOrElse(""), oldTuple._3+Try(splitted(2).toInt).getOrElse(0), oldTuple._4+Try(splitted(3).toInt).getOrElse(0)))
x
}
else {
x.put(Try(splitted(0)).getOrElse(""), (Try(splitted(0)).getOrElse(""), Try(splitted(1)).getOrElse(""), Try(splitted(2).toInt).getOrElse(0), Try(splitted(3).toInt).getOrElse(0)))
x
}
}}.mapValues(iter => iter._1+","+iter._2+","+iter._3+","+iter._4).values.toList
finalResult should be your desired output
List("TitleB,Int,16,0", "TitleA,STR,3,0", "TitleC,STR,4,5")

group by with foldleft scala

I have the following list in input:
val listInput1 =
List(
"itemA,CATs,2,4",
"itemA,CATS,3,1",
"itemB,CATQ,4,5",
"itemB,CATQ,4,6",
"itemC,CARC,5,10")
and I want to write a function in scala using groupBy and foldleft ( just one function) in order to sum up third and fourth colum for lines having the same title(first column here), the wanted output is :
val listOutput1 =
List(
"itemA,CATS,5,5",
"itemB,CATQ,8,11",
"itemC,CARC,5,10"
)
def sumIndex (listIn:List[String]):List[String]={
listIn.map(_.split(",")).groupBy(_(0)).map{
case (title, label) =>
"%s,%s,%d,%d".format(
title,
label.head.apply(1),
label.map(_(2).toInt).sum,
label.map(_(3).toInt).sum)}.toList
}
Kind regards
The logic in your code looks sound, here it is with a case class implemented as that handles edge cases more cleanly:
// represents a 'row' in the original list
case class Item(
name: String,
category: String,
amount: Int,
price: Int
)
// safely converts the row of strings into case class, throws exception otherwise
def stringsToItem(strings: Array[String]): Item = {
if (strings.length != 4) {
throw new Exception(s"Invalid row: ${strings.foreach(print)}; must contain only 4 entries!")
} else {
val n = strings.headOption.getOrElse("N/A")
val cat = strings.lift(1).getOrElse("N/A")
val amt = strings.lift(2).filter(_.matches("^[0-9]*$")).map(_.toInt).getOrElse(0)
val p = strings.lastOption.filter(_.matches("^[0-9]*$")).map(_.toInt).getOrElse(0)
Item(n, cat, amt, p)
}
}
// original code with case class and method above used
listInput1.map(_.split(","))
.map(stringsToItem)
.groupBy(_.name)
.map { case (name, items) =>
Item(
name,
category = items.head.category,
amount = items.map(_.amount).sum,
price = items.map(_.price).sum
)
}.toList
You can solve it with a single foldLeft, iterating the input list only once. Use a Map to aggregate the result.
listInput1.map(_.split(",")).foldLeft(Map.empty[String, Int]) {
(acc: Map[String, Int], curr: Array[String]) =>
val label: String = curr(0)
val oldValue: Int = acc.getOrElse(label, 0)
val newValue: Int = oldValue + curr(2).toInt + curr(3).toInt
acc.updated(label, newValue)
}
result: Map(itemA -> 10, itemB -> 19, itemC -> 15)
If you have a list as
val listInput1 =
List(
"itemA,CATs,2,4",
"itemA,CATS,3,1",
"itemB,CATQ,4,5",
"itemB,CATQ,4,6",
"itemC,CARC,5,10")
Then you can write a general function that can be used with foldLeft and reduceLeft as
def accumulateLeft(x: Map[String, Tuple3[String, Int, Int]], y: Map[String, Tuple3[String, Int, Int]]): Map[String, Tuple3[String, Int, Int]] ={
val key = y.keySet.toList(0)
if(x.keySet.contains(key)){
val oldTuple = x(key)
x.updated(key, (y(key)._1, oldTuple._2+y(key)._2, oldTuple._3+y(key)._3))
}
else{
x.updated(key, (y(key)._1, y(key)._2, y(key)._3))
}
}
and you can call them as
foldLeft
listInput1
.map(_.split(","))
.map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
.foldLeft(Map.empty[String, Tuple3[String, Int, Int]])(accumulateLeft)
.map(x => x._1+","+x._2._1+","+x._2._2+","+x._2._3)
.toList
//res0: List[String] = List(itemA,CATS,5,5, itemB,CATQ,8,11, itemC,CARC,5,10)
reduceLeft
listInput1
.map(_.split(","))
.map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
.reduceLeft(accumulateLeft)
.map(x => x._1+","+x._2._1+","+x._2._2+","+x._2._3)
.toList
//res1: List[String] = List(itemA,CATS,5,5, itemB,CATQ,8,11, itemC,CARC,5,10)
Similarly you can just interchange the variables in the general function so that it can be used with foldRight and reduceRight as
def accumulateRight(y: Map[String, Tuple3[String, Int, Int]], x: Map[String, Tuple3[String, Int, Int]]): Map[String, Tuple3[String, Int, Int]] ={
val key = y.keySet.toList(0)
if(x.keySet.contains(key)){
val oldTuple = x(key)
x.updated(key, (y(key)._1, oldTuple._2+y(key)._2, oldTuple._3+y(key)._3))
}
else{
x.updated(key, (y(key)._1, y(key)._2, y(key)._3))
}
}
and calling the function would give you
foldRight
listInput1
.map(_.split(","))
.map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
.foldRight(Map.empty[String, Tuple3[String, Int, Int]])(accumulateRight)
.map(x => x._1+","+x._2._1+","+x._2._2+","+x._2._3)
.toList
//res2: List[String] = List(itemC,CARC,5,10, itemB,CATQ,8,11, itemA,CATs,5,5)
reduceRight
listInput1
.map(_.split(","))
.map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
.reduceRight(accumulateRight)
.map(x => x._1+","+x._2._1+","+x._2._2+","+x._2._3)
.toList
//res3: List[String] = List(itemC,CARC,5,10, itemB,CATQ,8,11, itemA,CATs,5,5)
So you don't really need a groupBy and can use any of the foldLeft, foldRight, reduceLeft or reduceRight functions to get your desired output.

Error processing scala list

def trainBestSeller(events: RDD[BuyEvent], n: Int, itemStringIntMap: BiMap[String, Int]): Map[String, Array[(Int, Int)]] = {
val itemTemp = events
// map item from string to integer index
.flatMap {
case BuyEvent(user, item, category, count) if itemStringIntMap.contains(item) =>
Some((itemStringIntMap(item),category),count)
case _ => None
}
// cache to use for next times
.cache()
// top view with each category:
val bestSeller_Category: Map[String, Array[(Int, Int)]] = itemTemp.reduceByKey(_ + _)
.map(row => (row._1._2, (row._1._1, row._2)))
.groupByKey
.map { case (c, itemCounts) =>
(c, itemCounts.toArray.sortBy(_._2)(Ordering.Int.reverse).take(n))
}
.collectAsMap.toMap
// top view with all category => cateogory ALL
val bestSeller_All: Map[String, Array[(Int, Int)]] = itemTemp.reduceByKey(_ + _)
.map(row => ("ALL", (row._1._1, row._2)))
.groupByKey
.map {
case (c, itemCounts) =>
(c, itemCounts.toArray.sortBy(_._2)(Ordering.Int.reverse).take(n))
}
.collectAsMap.toMap
// merge 2 map bestSeller_All and bestSeller_Category
val bestSeller = bestSeller_Category ++ bestSeller_All
bestSeller
}
List processing
Your list processing seems okay. I did a small recheck
def main( args: Array[String] ) : Unit = {
case class JString(x: Int)
case class CompactBuffer(x: Int, y: Int)
val l = List( JString(2435), JString(3464))
val tuple: (List[JString], CompactBuffer) = ( List( JString(2435), JString(3464)), CompactBuffer(1,4) )
val result: List[(JString, CompactBuffer)] = tuple._1.map((_, tuple._2))
val result2: List[(JString, CompactBuffer)] = {
val l = tuple._1
val cb = tuple._2
l.map( x => (x,cb) )
}
println(result)
println(result2)
}
Result is (as expected)
List((JString(2435),CompactBuffer(1,4)), (JString(3464),CompactBuffer(1,4)))
Further analysis
Analysis is required, if that does not solve your problem:
Where are types JStream (from org.json4s.JsonAST ?) and CompactBuffer ( Spark I suppose ) from?
How exactly looks the code, that creates pair ? What exactly are you doing? Please provide code excerpts!

Reverse of flatMap in scala

I'm trying to take a iterator of Strings, and turn it into an iterator of collections of strings based on an arbitrary splitting function.
So say I have
val splitter: String => Boolean = s => s.isEmpty
then I want it to take
val data = List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator
and have
def f[A] (input: Iterator[A], splitFcn: A => Boolean): Iterator[X[A]]
where X can be any collection-like class you want, so long as it can be converted into a Seq, such that
f(data, splitter).foreach(println(_.toList))
outputs
List("abc", "def")
List("ghi", "jkl", "mno")
List("pqr")
Is there a clean way to do this, that does not require collecting the results of the input iterator entirely into memory?
This should do what you want:
val splitter: String => Boolean = s => s.isEmpty
val data = List("abc", "def", "", "ghi", "jkl", "", "mno", "pqr")
def splitList[A](l: List[A], p: A => Boolean):List[List[A]] = {
l match {
case Nil => Nil
case _ =>
val (h, t) = l.span(a => !p(a))
h :: splitList(t.drop(1), p)
}
}
println(splitList(data, splitter))
//prints List(List(abc, def), List(ghi, jkl), List(mno, pqr))
Here it is:
scala> val data = List("abc", "def", "", "ghi", "jkl", "mno", "", "pqr").iterator
data: Iterator[String] = non-empty iterator
scala> val splitter: String => Boolean = s => s.isEmpty
splitter: String => Boolean = <function1>
scala> def f[A](in: Iterator[A], sf: A => Boolean): Iterator[Iterator[A]] =
in.hasNext match {
| case false => Iterator()
| case true => Iterator(in.takeWhile(x => !sf(x))) ++ f(in, sf)
| }
f: [A](in: Iterator[A], sf: A => Boolean)Iterator[Iterator[A]]
scala> f(data, splitter) foreach (x => println(x.toList))
List(abc, def)
List(ghi, jkl, mno)
List(pqr)
UPDATE #2: Travis Brown answered another question using Scalaz-streams, an interesting package that might be helpful to you here. I am just starting to look at the package, but was quickly able to use it to read data from a file containing this:
abc
def
ghi
jkl
mno
pqr
and produce another file that looked like this:
Vector(abc, def, )
Vector(ghi, jkl, mno, )
Vector(pqr)
The library only holds the Vector being accumulated in memory. Here's my code (which should be considered dangerous, as I barely know anything about Scalaz-streams):
import scalaz.stream._
io.linesR("/tmp/a")
.pipe( process1.chunkBy(_.nonEmpty) )
.map( _.toString + "\n" )
.pipe(text.utf8Encode)
.to( io.fileChunkW("/tmp/b") )
.run.run
Key to your task is the chunkBy(_.nonEmpty), which accumulates lines into a Vector until it hits an empty line. I have no idea at this point why you have to say run twice.
Old stuff below.
UPDATE #1: Ah! I just discovered the new constraint that it not all be read into memory. This solution isn't for you, then; you'd want Iterators or Streams.
I'm guessing that you'd want to enrich Traversable. And with the function in a separate argument list, the compiler can infer the types. For performance you would probably only want to make one pass over the data. And to avoid crashing with large datasets (and for performance), you wouldn't want any recursion that is not tail-recursion. Given this enricher:
implicit class EnrichedTraversable[A]( val xs:Traversable[A] ) extends AnyVal {
def splitWhere( f: A => Boolean ) = {
#tailrec
def loop( xs:Traversable[A], group:Seq[A], groups:Seq[Seq[A]] ):Seq[Seq[A]] =
if ( xs.isEmpty ) {
groups :+ group
} else {
val x = xs.head
val rest = xs.tail
if ( f(x) ) loop( rest, Vector(), groups :+ group )
else loop( rest, group :+ x, groups )
}
loop( xs, Vector(), Vector() )
}
}
you can do this:
List("a","b","","c","d") splitWhere (_.isEmpty)
Here are some tests you might want to check out, to be sure the semantics are what you want (I personally like splits to behave this way):
val xs = List("a","b","","d","e","","f","g") //> xs : List[String] = List(a, b, "", d, e, "", f, g)
xs splitWhere (_.isEmpty) //> res0: Seq[Seq[String]] = Vector(Vector(a, b), Vector(d, e), Vector(f, g))
List("a","b","") splitWhere (_.isEmpty) //> res1: Seq[Seq[String]] = Vector(Vector(a, b), Vector())
List("") splitWhere (_.isEmpty) //> res2: Seq[Seq[String]] = Vector(Vector(), Vector())
List[String]() splitWhere (_.isEmpty) //> res3: Seq[Seq[String]] = Vector(Vector())
Vector("a","b","","c") splitWhere (_.isEmpty) //> res4: Seq[Seq[String]] = Vector(Vector(a, b), Vector(c))
I think Stream is what you want since they are evaluated lazily (not everything in memory).
def split[A](inputStream: Stream[A], splitter: A => Boolean): Stream[List[A]] = {
var accumulationList: List[A] = Nil
def loop(inputStream: Stream[A]): Stream[List[A]] = {
if (inputStream.isEmpty) {
if (accumulationList.isEmpty)
Stream.empty[List[A]]
else
accumulationList.reverse #:: Stream.empty[List[A]]
} else if (splitter(inputStream.head)) {
val outputList = accumulationList.reverse
accumulationList = Nil
if (outputList.isEmpty)
loop(inputStream.tail)
else
outputList #:: loop(inputStream.tail)
} else {
accumulationList = inputStream.head :: accumulationList
loop(inputStream.tail)
}
}
loop(inputStream)
}
val splitter = { s: String => s.isEmpty }
val list = List("asdf", "aa", "", "fw", "", "wfwf", "", "")
val stream = split(list.toStream, splitter)
stream foreach println
The output is:
List(asdf, aa)
List(fw)
List(wfwf)
EDIT:
I have not looked at it in detail, but I guess my recursive method loop could be replaced by a foldLeft or foldRight.