I have the following list as input:
val listInput1 =
  List(
    "itemA,CATs,2,4",
    "itemA,CATS,3,1",
    "itemB,CATQ,4,5",
    "itemB,CATQ,4,6",
    "itemC,CARC,5,10")
and I want to write a function in Scala, using groupBy and foldLeft (just one function), that sums up the third and fourth columns for lines having the same title (the first column here). The wanted output is:
val listOutput1 =
  List(
    "itemA,CATS,5,5",
    "itemB,CATQ,8,11",
    "itemC,CARC,5,10"
  )
def sumIndex(listIn: List[String]): List[String] = {
  listIn.map(_.split(",")).groupBy(_(0)).map {
    case (title, label) =>
      "%s,%s,%d,%d".format(
        title,
        label.head.apply(1),
        label.map(_(2).toInt).sum,
        label.map(_(3).toInt).sum)
  }.toList
}
Kind regards
The logic in your code looks sound. Here it is with a case class added, which handles edge cases more cleanly:
// represents a 'row' in the original list
case class Item(
  name: String,
  category: String,
  amount: Int,
  price: Int
)

// converts a row of strings into the case class; throws if the row doesn't have exactly 4 entries
def stringsToItem(strings: Array[String]): Item = {
  if (strings.length != 4) {
    throw new Exception(s"Invalid row: ${strings.mkString(",")}; must contain exactly 4 entries!")
  } else {
    val n = strings.headOption.getOrElse("N/A")
    val cat = strings.lift(1).getOrElse("N/A")
    val amt = strings.lift(2).filter(_.matches("^[0-9]*$")).map(_.toInt).getOrElse(0)
    val p = strings.lastOption.filter(_.matches("^[0-9]*$")).map(_.toInt).getOrElse(0)
    Item(n, cat, amt, p)
  }
}
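A quick sanity check of the parser, using a row from listInput1; the second call is a made-up malformed row of my own, just to show the fallback:

stringsToItem("itemA,CATs,2,4".split(","))   // Item(itemA,CATs,2,4)
stringsToItem("itemA,CATs,x,4".split(","))   // Item(itemA,CATs,0,4) -- non-numeric amount falls back to 0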
// original code with case class and method above used
listInput1.map(_.split(","))
  .map(stringsToItem)
  .groupBy(_.name)
  .map { case (name, items) =>
    Item(
      name,
      category = items.head.category,
      amount = items.map(_.amount).sum,
      price = items.map(_.price).sum
    )
  }.toList
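Note this pipeline returns List[Item] rather than the List[String] the question asks for. If you need the strings back, a small formatting step can be appended; toCsv is a helper name I'm introducing here, not from the question:

// hypothetical helper: render an Item back to a CSV row
def toCsv(i: Item): String = s"${i.name},${i.category},${i.amount},${i.price}"

listInput1.map(_.split(","))
  .map(stringsToItem)
  .groupBy(_.name)
  .map { case (name, items) =>
    Item(name, items.head.category, items.map(_.amount).sum, items.map(_.price).sum)
  }
  .map(toCsv)
  .toList
// e.g. List(itemA,CATs,5,5, itemB,CATQ,8,11, itemC,CARC,5,10) -- groupBy does not guarantee order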
You can solve it with a single foldLeft, iterating the input list only once. Use a Map to aggregate the result.
listInput1.map(_.split(",")).foldLeft(Map.empty[String, Int]) {
  (acc: Map[String, Int], curr: Array[String]) =>
    val label: String = curr(0)
    val oldValue: Int = acc.getOrElse(label, 0)
    val newValue: Int = oldValue + curr(2).toInt + curr(3).toInt
    acc.updated(label, newValue)
}
result: Map(itemA -> 10, itemB -> 19, itemC -> 15)
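If you also need the category and the two column sums separately, to rebuild the output strings, the same single-pass foldLeft can carry a tuple per key instead of a lone Int; a sketch, assuming well-formed four-column rows:

listInput1.map(_.split(",")).foldLeft(Map.empty[String, (String, Int, Int)]) {
  (acc, curr) =>
    val (cat, a, b) = acc.getOrElse(curr(0), (curr(1), 0, 0))
    acc.updated(curr(0), (cat, a + curr(2).toInt, b + curr(3).toInt))
}.map { case (title, (cat, a, b)) => s"$title,$cat,$a,$b" }.toList
// e.g. List(itemA,CATs,5,5, itemB,CATQ,8,11, itemC,CARC,5,10) -- keeps the first category it sees per key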
If you have a list as
val listInput1 =
  List(
    "itemA,CATs,2,4",
    "itemA,CATS,3,1",
    "itemB,CATQ,4,5",
    "itemB,CATQ,4,6",
    "itemC,CARC,5,10")
Then you can write a general function that can be used with foldLeft and reduceLeft as
def accumulateLeft(x: Map[String, Tuple3[String, Int, Int]],
                   y: Map[String, Tuple3[String, Int, Int]]): Map[String, Tuple3[String, Int, Int]] = {
  val key = y.keySet.toList(0)
  if (x.keySet.contains(key)) {
    val oldTuple = x(key)
    x.updated(key, (y(key)._1, oldTuple._2 + y(key)._2, oldTuple._3 + y(key)._3))
  } else {
    x.updated(key, (y(key)._1, y(key)._2, y(key)._3))
  }
}
and you can call it as
foldLeft
listInput1
  .map(_.split(","))
  .map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
  .foldLeft(Map.empty[String, Tuple3[String, Int, Int]])(accumulateLeft)
  .map(x => x._1 + "," + x._2._1 + "," + x._2._2 + "," + x._2._3)
  .toList
//res0: List[String] = List(itemA,CATS,5,5, itemB,CATQ,8,11, itemC,CARC,5,10)
reduceLeft
listInput1
  .map(_.split(","))
  .map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
  .reduceLeft(accumulateLeft)
  .map(x => x._1 + "," + x._2._1 + "," + x._2._2 + "," + x._2._3)
  .toList
//res1: List[String] = List(itemA,CATS,5,5, itemB,CATQ,8,11, itemC,CARC,5,10)
Similarly you can just interchange the variables in the general function so that it can be used with foldRight and reduceRight as
def accumulateRight(y: Map[String, Tuple3[String, Int, Int]],
                    x: Map[String, Tuple3[String, Int, Int]]): Map[String, Tuple3[String, Int, Int]] = {
  val key = y.keySet.toList(0)
  if (x.keySet.contains(key)) {
    val oldTuple = x(key)
    x.updated(key, (y(key)._1, oldTuple._2 + y(key)._2, oldTuple._3 + y(key)._3))
  } else {
    x.updated(key, (y(key)._1, y(key)._2, y(key)._3))
  }
}
and calling the function would give you
foldRight
listInput1
  .map(_.split(","))
  .map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
  .foldRight(Map.empty[String, Tuple3[String, Int, Int]])(accumulateRight)
  .map(x => x._1 + "," + x._2._1 + "," + x._2._2 + "," + x._2._3)
  .toList
//res2: List[String] = List(itemC,CARC,5,10, itemB,CATQ,8,11, itemA,CATs,5,5)
reduceRight
listInput1
  .map(_.split(","))
  .map(array => Map(array(0) -> (array(1), array(2).toInt, array(3).toInt)))
  .reduceRight(accumulateRight)
  .map(x => x._1 + "," + x._2._1 + "," + x._2._2 + "," + x._2._3)
  .toList
//res3: List[String] = List(itemC,CARC,5,10, itemB,CATQ,8,11, itemA,CATs,5,5)
So you don't really need a groupBy and can use any of the foldLeft, foldRight, reduceLeft or reduceRight functions to get your desired output.
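As a further alternative, on Scala 2.13+ groupMapReduce fuses the grouping and the reduction into a single call; a sketch under the same well-formed-input assumption:

listInput1
  .map(_.split(","))
  .groupMapReduce(_(0))(a => (a(1), a(2).toInt, a(3).toInt)) {
    case ((cat, x1, y1), (_, x2, y2)) => (cat, x1 + x2, y1 + y2)
  }
  .map { case (k, (cat, x, y)) => s"$k,$cat,$x,$y" }
  .toList
// e.g. List(itemA,CATs,5,5, itemB,CATQ,8,11, itemC,CARC,5,10)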
I am currently working on a function that takes in a Map[String, List[String]] and a String as arguments. The map contains a user ID and the IDs of films that they liked. What I need to do is return a List[List[String]] which contains the other movies that were liked by the users who liked the movie that was passed into the function.
The function declaration looks as follows:
def movies(m: Map[String, List[String]], mov: String): List[List[String]] = {
}
So let's imagine the following:
val m1: Map[Int, List[String]] = Map(1 -> List("b", "a"), 2 -> List("y", "x"), 3 -> List("c", "a"))
val movieID = "a"
movies(m1, movieID)
This should return:
List(List("b"), List("c"))
I have tried using
m1.filter(x => x._2.contains(movieID))
So that only Lists containing movieID are kept in the map, but my problem is that I need to remove movieID from every list it occurs in, and then return the result as a List[List[String]].
You could use collect:
val m = Map("1" -> List("b", "a"), "2" -> List("y", "x"), "3" -> List("c", "a"))
def movies(m: Map[String, List[String]], mov: String) = m.collect {
case (_, l) if l.contains(mov) => l.filterNot(_ == mov)
}
movies(m, "a") //List(List(b), List(c))
The problem with this approach is that it iterates over every movie list twice, first with contains and then with filterNot. We could optimize it with a tail-recursive function, which looks for the element and, if found, just returns the list without it:
import scala.annotation.tailrec
def movies(m: Map[String, List[String]], mov: String) = {
  @tailrec
  def withoutElement[T](l: List[T], mov: T, acc: List[T] = Nil): Option[List[T]] = {
    l match {
      case x :: xs if x == mov => Some(acc.reverse ++ xs)
      case x :: xs => withoutElement(xs, mov, x :: acc)
      case Nil => None
    }
  }
  m.values.flatMap(withoutElement(_, mov))
}
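For the example map above this behaves the same as the collect version; the result is an Iterable, so add toList if you need the exact List[List[String]] type:

movies(m, "a").toList // List(List(b), List(c))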
The solution from Krzysztof is a good one. Here's an alternate way to traverse every List just once.
def movies(m: Map[String, List[String]], mov: String) =
  m.values.toList.flatMap { ss =>
    val tpl = ss.foldLeft((false, List.empty[String])) {
      case ((_, res), `mov`) => (true, res)
      case ((keep, res), str) => (keep, str :: res)
    }
    if (tpl._1) Some(tpl._2) else None
  }
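For the example input this gives the same result; note that the surviving elements come back in reverse order within each list, which is invisible here because each inner result has a single element:

movies(m, "a") // List(List(b), List(c))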
This should work for you:
object DemoAbc extends App {
  val m1 = Map(1 -> List("b", "a"), 2 -> List("y", "x"), 3 -> List("c", "a"))
  val movieID = "a"

  def movies(m: Map[Int, List[String]], mov: String): List[List[String]] = {
    val ans = m.foldLeft(List.empty[List[String]])((a: List[List[String]], b: (Int, List[String])) => {
      if (b._2.contains(mov))
        b._2.filter(_ != mov) :: a
      else a
    })
    ans
  }

  print(movies(m1, movieID))
}
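For the example map this prints List(List(c), List(b)); the outer order is reversed relative to the input because each matching list is prepended to the accumulator.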
I am facing a problem calculating the sum of elements in Scala for lines having the same title (my key in this case).
Currently my input can be described as:
val listInput1 =
  List(
    "itemA,CATA,2,4 ",
    "itemA,CATA,3,1 ",
    "itemB,CATB,4,5",
    "itemB,CATB,4,6"
  )

val listInput2 =
  List(
    "itemA,CATA,2,4 ",
    "itemB,CATB,4,5",
    "itemC,CATC,1,2"
  )
The required output for the input lists should be:
val listoutput1 =
  List(
    "itemA,CATA,5,5 ",
    "itemB,CATB,8,11"
  )

val listoutput2 =
  List(
    "itemA , CATA, 2,4 ",
    "itemB,CATB,4,5",
    "itemC,CATC,1,2"
  )
I wrote the following function:
def sumByTitle(listInput: List[String]): List[String] =
  listInput.map(_.split(",")).groupBy(_(0)).map {
    case (title, features) =>
      "%s,%s,%d,%d".format(
        title,
        features.head.apply(1),
        features.map(_(2).toInt).sum,
        features.map(_(3).toInt).sum)
  }.toList
It doesn't give me the expected result as it changes the order of lines.
How can I fix that?
The ListMap is designed to preserve the order of items inserted into the Map.
import collection.immutable.ListMap
def sumByTitle(listInput: List[String]): List[String] = {
  val itemPttrn = raw"(.*),(\d+),(\d+)\s*".r
  listInput.foldLeft(ListMap.empty[String, (Int, Int)].withDefaultValue((0, 0))) {
    case (lm, str) =>
      val itemPttrn(k, a, b) = str  // unsafe: throws if the pattern doesn't match
      val (x, y) = lm(k)
      lm.updated(k, (a.toInt + x, b.toInt + y))
  }.toList.map { case (k, (a, b)) => s"$k,$a,$b" }
}
This is a bit unsafe in that it will throw if the input string doesn't match the regex pattern.
sumByTitle(listInput1)
//res0: List[String] = List(itemA,CATA,5,5, itemB,CATB,8,11)
sumByTitle(listInput2)
//res1: List[String] = List(itemA,CATA,2,4, itemB,CATB,4,5, itemC,CATC,1,2)
You'll note that the trailing space, if there is one, is not preserved.
If you are just interested in sorting you can simply return the sorted list:
val listInput1 =
  List(
    "itemA , CATA, 2,4 ",
    "itemA , CATA, 3,1 ",
    "itemB,CATB,4,5",
    "itemB,CATB,4,6"
  )

val listInput2 =
  List(
    "itemA , CATA, 2,4 ",
    "itemB,CATB,4,5",
    "itemC,CATC,1,2"
  )
def sumByTitle(listInput: List[String]): List[String] =
  listInput.map(_.split(",")).groupBy(_(0)).map {
    case (title, features) =>
      "%s,%s,%d,%d".format(
        title,
        features.head.apply(1),
        features.map(_(2).trim.toInt).sum,
        features.map(_(3).trim.toInt).sum)
  }.toList.sorted
println("LIST 1")
sumByTitle(listInput1).foreach(println)
println("LIST 2")
sumByTitle(listInput2).foreach(println)
You can find the code on Scastie for you to play around with.
As a side note, you may be interested in separating the serialization and deserialization from your business logic.
Here you can find another Scastie notebook with a relatively naive approach for a first step towards separating concerns.
def foldByTitle(listInput: List[String]): List[Item] =
  listInput.map(Item.parseItem).foldLeft(List.empty[Item])(sumByTitle)

val sumByTitle: (List[Item], Item) => List[Item] = (acc, curr) =>
  acc.find(_.name == curr.name).fold(curr :: acc) { i =>
    acc.filterNot(_.name == curr.name) :+ i.copy(num1 = i.num1 + curr.num1, num2 = i.num2 + curr.num2)
  }

case class Item(name: String, category: String, num1: Int, num2: Int)

object Item {
  def parseItem(serializedItem: String): Item = {
    val itemTokens = serializedItem.split(",").map(_.trim)
    Item(itemTokens.head, itemTokens(1), itemTokens(2).toInt, itemTokens(3).toInt)
  }
}
This way the initial order of the elements is kept.
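A quick usage sketch; renderItem is an assumed helper mirroring parseItem, not part of the snippet above:

// assumed helper: render an Item back to a CSV row
def renderItem(i: Item): String = s"${i.name},${i.category},${i.num1},${i.num2}"

foldByTitle(listInput1).map(renderItem)
// e.g. List(itemA,CATA,5,5, itemB,CATB,8,11)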
I need to sort the keys in an RDD, but there is no natural sorting order (not ascending or descending). I wouldn't even know how to write a Comparator to do it. Say I had a map of apples, pears, oranges, and grapes, I want to sort by oranges, apples, grapes, and pears.
Any ideas on how to do this in Spark/Scala? Thanks!
In Scala, you need to look for the Ordering[T] trait rather than the Comparator interface -- mostly a cosmetic difference so that the focus is on the attribute of the data rather than a thing which compares two instances of the data. Implementing the trait requires that the compare(T,T) method be defined. A very explicit version of the enumerated comparison could be:
object fruitOrdering extends Ordering[String] {
  def compare(lhs: String, rhs: String): Int = (lhs, rhs) match {
    case ("orange", "orange") => 0
    case ("orange", _)        => -1
    case ("apple", "orange")  => 1
    case ("apple", "apple")   => 0
    case ("apple", _)         => -1
    case ("grape", "orange")  => 1
    case ("grape", "apple")   => 1
    case ("grape", "grape")   => 0
    case ("grape", _)         => -1
    case ("pear", "orange")   => 1
    case ("pear", "apple")    => 1
    case ("pear", "grape")    => 1
    case ("pear", "pear")     => 0
    case ("pear", _)          => -1
    case _ => 0
  }
}
Or, to slightly adapt zero323's answer:
object fruitOrdering2 extends Ordering[String] {
  private val values = Seq("orange", "apple", "grape", "pear")
  // generate the map based off of indices so we don't have to worry about human error during updates
  private val ordinalMap = values.zipWithIndex.toMap.withDefaultValue(Int.MaxValue)

  def compare(lhs: String, rhs: String): Int = ordinalMap(lhs).compare(ordinalMap(rhs))
}
Now that you have an instance of Ordering[String], you need to inform the sortBy method to use this ordering rather than the built-in one. If you look at the signature for RDD#sortBy you'll see the full signature is
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
That implicit Ordering[K] in the second parameter list is normally looked up by the compiler for pre-defined orderings -- that's how it knows what the natural ordering should be. Any implicit parameter, however, can be given an explicit value instead. Note that if you supply one implicit value then you need to supply all, so in this case we also need to provide the ClassTag[K]. That's always generated by the compiler but can be easily explicitly generated using scala.reflect.classTag.
Specifying all of that, the invocation would look like:
import scala.reflect.classTag
rdd.sortBy { case (key, _) => key }(fruitOrdering, classTag[String])
That's still pretty messy, though, isn't it? Luckily we can use implicit classes to take away a lot of the cruft. Here's a snippet that I use fairly commonly:
package com.example.spark

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

package object implicits {
  implicit class RichSortingRDD[A : ClassTag](underlying: RDD[A]) {
    def sorted(implicit ord: Ordering[A]): RDD[A] =
      underlying.sortBy(identity)(ord, implicitly[ClassTag[A]])

    def sortWith(fn: (A, A) => Int): RDD[A] = {
      val ord = new Ordering[A] { def compare(lhs: A, rhs: A): Int = fn(lhs, rhs) }
      sorted(ord)
    }
  }

  implicit class RichSortingPairRDD[K : ClassTag, V](underlying: RDD[(K, V)]) {
    def sortByKey(implicit ord: Ordering[K]): RDD[(K, V)] =
      underlying.sortBy { case (key, _) => key }(ord, implicitly[ClassTag[K]])

    def sortByKeyWith(fn: (K, K) => Int): RDD[(K, V)] = {
      val ord = new Ordering[K] { def compare(lhs: K, rhs: K): Int = fn(lhs, rhs) }
      sortByKey(ord)
    }
  }
}
And in action:
import com.example.spark.implicits._
val rdd = sc.parallelize(Seq(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)))
rdd.sortByKey(fruitOrdering).collect
// Array[(String, Double)] = Array((orange,5.6), (apple,5.0), (grape,0.3))
rdd.sortByKey.collect // Natural ordering by default
// Array[(String, Double)] = Array((apple,5.0), (grape,0.3), (orange,5.6))
rdd.sortWith(_._2 compare _._2).collect // sort by the value instead
// Array[(String, Double)] = Array((grape,0.3), (apple,5.0), (orange,5.6))
If the only way you can describe the order is enumeration then simply enumerate:
val order = Map("orange" -> 0L, "apple" -> 1L, "grape" -> 2L, "pear" -> 3L)
val rdd = sc.parallelize(Seq(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)))
val sorted = rdd.sortBy{case (key, _) => order.getOrElse(key, Long.MaxValue)}
sorted.collect
// Array[(String, Double)] = Array((orange,5.6), (apple,5.0), (grape,0.3))
There is a sortBy method in Spark which allows you to define an arbitrary ordering and whether you want ascending or descending. E.g.
scala> val rdd = sc.parallelize(Seq ( ("a", 1), ("z", 7), ("p", 3), ("a", 13) ))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[331] at parallelize at <console>:70
scala> rdd.sortBy( _._2, ascending = false) .collect.mkString("\n")
res34: String =
(a,13)
(z,7)
(p,3)
(a,1)
scala> rdd.sortBy( _._1, ascending = false) .collect.mkString("\n")
res35: String =
(z,7)
(p,3)
(a,1)
(a,13)
scala> rdd.sortBy
def sortBy[K](f: T => K, ascending: Boolean, numPartitions: Int)(implicit ord: scala.math.Ordering[K], ctag: scala.reflect.ClassTag[K]): RDD[T]
The last part shows the signature of sortBy. The previous examples sort by the second and then the first element of the pair, respectively.
Edit: answered too quickly, without checking your question, sorry... Anyway, you would define your ordering like in your example:
def myord(fruit: String) = fruit match {
  case "oranges" => 1
  case "apples" => 2
  case "grapes" => 3
  case "pears" => 4
  case _ => 5
}

val rdd = sc.parallelize(Seq("apples", "oranges", "pears", "grapes", "other"))
Then, the result of ordering would be:
scala> rdd.sortBy[Int](myord, ascending = true).collect.mkString("\n")
res1: String =
oranges
apples
grapes
pears
other
I don't know about Spark, but with pure Scala collections that would be
_.sortBy(_.fruitType)
For example,
val l: List[String] = List("the", "big", "bang")
val sortedByFirstLetter = l.sortBy(_.head)
// List(big, bang, the)
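For the enumerated fruit order specifically, the same index-map trick shown earlier carries over to plain collections; a sketch:

val rank = Map("orange" -> 0, "apple" -> 1, "grape" -> 2, "pear" -> 3)
List("pear", "apple", "orange", "grape").sortBy(f => rank.getOrElse(f, Int.MaxValue))
// List(orange, apple, grape, pear)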
def trainBestSeller(events: RDD[BuyEvent], n: Int, itemStringIntMap: BiMap[String, Int]): Map[String, Array[(Int, Int)]] = {
  val itemTemp = events
    // map item from string to integer index
    .flatMap {
      case BuyEvent(user, item, category, count) if itemStringIntMap.contains(item) =>
        Some(((itemStringIntMap(item), category), count))
      case _ => None
    }
    // cache to use for next times
    .cache()

  // top view within each category:
  val bestSeller_Category: Map[String, Array[(Int, Int)]] = itemTemp.reduceByKey(_ + _)
    .map(row => (row._1._2, (row._1._1, row._2)))
    .groupByKey
    .map { case (c, itemCounts) =>
      (c, itemCounts.toArray.sortBy(_._2)(Ordering.Int.reverse).take(n))
    }
    .collectAsMap.toMap

  // top view across all categories => category ALL
  val bestSeller_All: Map[String, Array[(Int, Int)]] = itemTemp.reduceByKey(_ + _)
    .map(row => ("ALL", (row._1._1, row._2)))
    .groupByKey
    .map { case (c, itemCounts) =>
      (c, itemCounts.toArray.sortBy(_._2)(Ordering.Int.reverse).take(n))
    }
    .collectAsMap.toMap

  // merge the two maps bestSeller_Category and bestSeller_All
  val bestSeller = bestSeller_Category ++ bestSeller_All
  bestSeller
}
List processing
Your list processing seems okay. I did a small recheck
def main(args: Array[String]): Unit = {
  case class JString(x: Int)
  case class CompactBuffer(x: Int, y: Int)

  val l = List(JString(2435), JString(3464))
  val tuple: (List[JString], CompactBuffer) = (List(JString(2435), JString(3464)), CompactBuffer(1, 4))

  val result: List[(JString, CompactBuffer)] = tuple._1.map((_, tuple._2))
  val result2: List[(JString, CompactBuffer)] = {
    val l = tuple._1
    val cb = tuple._2
    l.map(x => (x, cb))
  }

  println(result)
  println(result2)
}
Result is (as expected)
List((JString(2435),CompactBuffer(1,4)), (JString(3464),CompactBuffer(1,4)))
Further analysis
Analysis is required, if that does not solve your problem:
Where do the types JString (from org.json4s.JsonAST?) and CompactBuffer (Spark, I suppose) come from?
What exactly does the code that creates the pair look like? What exactly are you doing? Please provide code excerpts!
I have a Summary case class:
case class Summary(
  title: String,
  desc: String,
  image: Option[String],
  path: String,
  timestamp: Int
)
I attempt to retrieve a Summary for each ID I have:
val rawSummaryList = nids.map {
  case Some(id) => (id, Node.nodeSumByNid(id))
  case None => (-1, None)
}
This returns a value of type List[(Int, Option[drupalslick.data.Summary])] and each Summary may, or may not, contain an image.
How do I:
Drop (Int, Option[Summary]) pairs where either Summary or Summary.image is None?
Convert (Int, Option[Summary]) to (Int, Summary)?
val rawSummaryList: List[(Int, Option[Summary])] = ???

val filtered: List[(Int, Summary)] = rawSummaryList collect {
  case (id, Some(summary)) if summary.image.isDefined => (id, summary)
}
You can use the collect function, which is a combination of map and filter, e.g.:
val list: List[(Int, Option[drupalslick.data.Summary])] = ???
list.collect { case (int, Some(smr @ Summary(_, _, Some(_), _, _))) => (int, smr) }
You can use for-comprehensions:
val rawSummaryList: List[(Int, Summary)] = for {
  Some(id) <- nids
  summary <- Node.nodeSumByNid(id)
  image <- summary.image
} yield (id, summary)
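For comparison, the same filtering written as an explicit collect chain, assuming, as above, that Node.nodeSumByNid returns Option[Summary]:

val rawSummaryList2: List[(Int, Summary)] = nids.collect {
  case Some(id) => (id, Node.nodeSumByNid(id))
}.collect {
  case (id, Some(summary)) if summary.image.isDefined => (id, summary)
}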