What is the difference between partition and groupBy?

I am reading through Twitter's Scala School right now and was looking at the groupBy and partition methods for collections. And I am not exactly sure what the difference between the two methods is.
I did some testing on my own:
scala> List(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
res8: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
scala> List(1, 2, 3, 4, 5, 6).groupBy(_ % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3, 5), true -> List(2, 4, 6))
So does this mean that partition returns a list of two lists and groupBy returns a Map with boolean keys and list values? Both have the same "effect" of splitting a list into two different parts based on a condition. I am not sure why I would use one over the other. So, when would I use partition over groupBy and vice-versa?

groupBy is better suited for lists of more complex objects.
Say, you have a class:
case class Beer(name: String, cityOfBrewery: String)
and a List of beers:
val beers = List(Beer("Bitburger", "Bitburg"), Beer("Frueh", "Cologne") ...)
you can then group beers by cityOfBrewery:
val beersByCity = beers.groupBy(_.cityOfBrewery)
Now you can get a list of all beers brewed in any city present in your data:
beersByCity("Cologne") // List(Beer("Frueh", "Cologne"), ...)
Neat, isn't it?

"And I am not exactly sure what the difference between the two methods is."
The difference is in their signatures. partition expects a function A => Boolean, while groupBy expects a function A => K.
In your case the function you pass to groupBy happens to be A => Boolean too, but that is not always what you want: sometimes you need to group by a function that returns something other than a Boolean.
For example, if you want to group a List of strings by their length, you need groupBy.
"So, when would I use partition over groupBy and vice-versa?"
Use groupBy if the function you apply does not return a Boolean (i.e. f(x) for an input x yields some other type of result). If it does return a Boolean, you can use either one; it is up to you whether you prefer a Map or a (List, List) as output.
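As a small illustration of the two signatures, here is a sketch on a made-up list of strings (the names words, shortWords, longWords and byLength are only for this example). partition takes a predicate and gives back a pair; groupBy takes a key function and gives back a Map with one entry per distinct key.

```scala
val words = List("a", "bb", "cc", "ddd")

// partition: predicate A => Boolean, result is a pair (matching, non-matching)
val (shortWords, longWords) = words.partition(_.length < 2)

// groupBy: key function A => K, result is a Map from each key to its group
val byLength = words.groupBy(_.length)

println(shortWords) // List(a)
println(longWords)  // List(bb, cc, ddd)
println(byLength(2)) // List(bb, cc)
```

With three distinct lengths there are three groups, which a single partition call cannot produce.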

partition is for when you need to split a collection into two based on yes/no logic (even/odd numbers, uppercase/lowercase letters, you name it). groupBy has a more general use: producing many groups based on some function. Let's say you want to split a corpus of words into bins depending on their first letter (resulting in up to 26 groups); that is simply not possible with a single .partition.

Related

how to find all possible combinations between tuples without duplicates scala

suppose I have list of tuples:
val a = ListBuffer((1, 5), (6, 7))
Update: The elements inside each Tuple2 in a are assumed to be distinct; in other words, a can contain, for example, (1,4) and (1,5), but not (1,1) or (2,2).
I want to generate results of all combinations of ListBuffer a between these two tuples but without duplication. The result will look like:
ListBuffer[(1,5,6), (1,5,7), (6,7,1), (6,7,5)]
Update: The elements in each result Tuple3 are also distinct. The tuples themselves are distinct as well, meaning that as long as (6,7,1) is present, (1,7,6) should not be in the result.
If, for example, val a = ListBuffer((1, 4), (1, 5)), then the output should be ListBuffer[(1,4,5)], with (1,4,1) and (1,5,1) discarded.
How can I do that in Scala?
Note: This is just an example. Usually a contains tens of scala.Tuple2 values.

If the individual elements are unique, as you've commented, then you should be able to flatten everything (un-tuple), get the desired combinations(), and re-tuple.
val a = collection.mutable.ListBuffer((1, 4), (1, 5))
a.flatMap(t => Seq(t._1, t._2)) //un-tuple
  .distinct //no duplicates
  .combinations(3) //unique sets of 3
  .map{case Seq(x,y,z) => (x,y,z)} //re-tuple
  .toList //if you don't want an iterator
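For reference, a self-contained sketch of the same flatten / distinct / combinations / re-tuple idea, written against an immutable List instead of a ListBuffer (that substitution, and the names pairs and triples, are assumptions made for this example). It uses the first input from the question.

```scala
val pairs = List((1, 5), (6, 7))

val triples: List[(Int, Int, Int)] =
  pairs
    .flatMap { case (x, y) => List(x, y) }    // un-tuple: List(1, 5, 6, 7)
    .distinct                                 // drop repeated elements
    .combinations(3)                          // every unordered choice of 3 elements
    .collect { case List(x, y, z) => (x, y, z) } // re-tuple
    .toList

println(triples) // List((1,5,6), (1,5,7), (1,6,7), (5,6,7))
```

The result contains the same four sets of elements as the expected output in the question; only the order inside some tuples differs, for example (1,6,7) instead of (6,7,1).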

Scala Shuffle A List Randomly And repeat it

I want to shuffle a Scala list randomly.
I know I can do this by using scala.util.Random.shuffle
But each call gives me a differently ordered list. In some cases I want the shuffle to always give me the same output. How can I achieve that?
Basically, I want to shuffle a list randomly at first and then repeat it in the same order: generate the list randomly the first time, and then, based on some parameter, repeat the same shuffling.

Use setSeed() to seed the generator before shuffling. Then if you want to repeat a shuffle reuse the seed.
For example:
scala> util.Random.setSeed(41L) // some seed chosen for no particular reason
scala> util.Random.shuffle(Seq(1,2,3,4))
res0: Seq[Int] = List(2, 4, 1, 3)
That shuffled: 1st -> 3rd, 2nd -> 1st, 3rd -> 4th, 4th -> 2nd
Now we can repeat that same shuffle pattern.
scala> util.Random.setSeed(41L) // same seed
scala> util.Random.shuffle(Seq(2,4,1,3)) // result of previous shuffle
res1: Seq[Int] = List(4, 3, 2, 1)
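A variation on the same idea, as a sketch: instead of reseeding the shared util.Random object, create a separate scala.util.Random instance per shuffle from a stored seed. The names xs, first and second are only for this example, and 41L is the same arbitrary seed as above.

```scala
import scala.util.Random

val xs = List(1, 2, 3, 4, 5, 6)

// Two generators built from the same seed produce the same random sequence,
// so shuffling equal inputs with them gives equal outputs.
val first = new Random(41L).shuffle(xs)
val second = new Random(41L).shuffle(xs)

println(first == second) // true
```

Keeping the generator local avoids interference from other code that draws numbers from the global generator between the two shuffles.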

Let a be the seed and b be the number of times you want to shuffle.
There are two ways to do this.
You can call scala.util.Random.setSeed(a), where a can be any integer. After you finish shuffling b times, set the seed to a again and the next b shuffles will come out in the same order.
The other way is to shuffle the indices of the list b times, save the result as a nested list or vector, and then map it onto your collection:
val arr = List("Bob", "Knight", "John")
val b = 3 // number of shuffles; example value
val randomer = (0 until b).map(_ => scala.util.Random.shuffle(arr.indices.toList))
randomer.map(x => x.map(y => arr(y)))
You can reuse the same randomer for any other list of the same size that you want to shuffle, by mapping it in the same way.

Functional Programming way to calculate something like a rolling sum

Let's say I have a list of numerics:
val list = List(4,12,3,6,9)
For every element in the list, I need to find the rolling sum, i.e. the final output should be:
List(4, 16, 19, 25, 34)
Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both?
Something like map(initial)((curr,prev) => curr+prev)
I want to achieve this without maintaining any shared global state.
EDIT: I would like to be able to do the same kinds of computation on RDDs.

You may use scanLeft
list.scanLeft(0)(_ + _).tail
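To see why the .tail is needed, here is the same call split into two steps (the names withInitial and rolling are only for this sketch). scanLeft emits the initial value first and then every intermediate accumulator, so the result is one element longer than the input.

```scala
val list = List(4, 12, 3, 6, 9)

// scanLeft keeps every intermediate accumulator, starting with the initial value 0
val withInitial = list.scanLeft(0)(_ + _)

// dropping the leading 0 leaves one running total per input element
val rolling = withInitial.tail

println(withInitial) // List(0, 4, 16, 19, 25, 34)
println(rolling)     // List(4, 16, 19, 25, 34)
```

Inside the function passed to scanLeft, the first argument is the previous running total and the second is the current element, which is the "current and previous" shape asked for in the question.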

The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length
  // Total of every partition except the last, turned into a running offset per partition
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(num.zero)(num.plus))
  ).collect
    .scanLeft(num.zero)(num.plus)
  // Within each partition, start from that partition's offset and accumulate
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid.) It is the associativity that is key for the parallelization. Without this associativity you're generally going to be stuck with running through the entries of the RDD in a serial fashion.

I don't know which functionalities are supported by Spark RDDs, so I am not sure whether this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know in a comment and I will delete it):
list.zipWithIndex.map{x => list.take(x._2+1).sum}
This code works for me. For each element it takes its index and sums the first index + 1 elements of the list (notice the +1, since zipWithIndex starts at 0).
When printing it, I get the following:
List(4, 16, 19, 25, 34)

How do I perform set theory minus operation between two lists in Scala?

I have the following case class
case class Cart(userId: Int, ProductId :Int, SellerId:Int, Qty: Int)
I have the following lists :
val mergedCart :List[Cart]= List(Cart(900,1,1,2),Cart(900,2,2,2),Cart(901,3,3,2),Cart(901,2,2,2),Cart(901,1,1,2),Cart(900,4,2,1))
val userCart:List[Cart] = List(Cart(900,1,1,2),Cart(900,2,2,2),Cart(900,4,2,1))
val guestCart:List[Cart] = List(Cart(901,3,3,2),Cart(901,2,2,2),Cart(901,1,1,2))
val commonCart = List(Cart(900,2,2,4), Cart(900,1,1,4))
My requirement is that I have to get the following list as the output:
List(Cart(900,2,2,4),Cart(900,1,1,4),Cart(901,3,3,2),Cart(900,4,2,1))
The final list should have the common objects from userCart and guestCart based on the ProductId,SellerId combination and the quantity of both the objects get added. Then, the other objects present in userCart and guestCart which do not match the common objects should also be present in the final list in the output.
I am new to Scala and I am not able to solve this, kindly help me with this code.

If you don't care about ordering in resulting list (so basically your result is a Set) , it's as simple as that:
def sum(a: Cart, b: Cart) = {
  //require(a.userId == b.userId)
  a.copy(Qty = a.Qty + b.Qty)
}
(userCart ++ guestCart)
  .groupBy(x => x.ProductId -> x.SellerId)
  .mapValues(_.reduce(sum _))
  .values
  .toList //toSet is more appropriate here
Results:
List(Cart(900,4,2,1), Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2))
(!) Be aware that I just took the first userId in case of a collision (see the sum function). However, this preserves the priority of users over guests, if that is what is intended.
Represented as a Set, this result is equal to your requirement:
scala> val mRes = List(Cart(900,4,2,1), Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2))
mRes: List[Cart] = List(Cart(900,4,2,1), Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2))
scala> val req = List(Cart(900,2,2,4),Cart(900,1,1,4),Cart(901,3,3,2),Cart(900,4,2,1))
req: List[Cart] = List(Cart(900,2,2,4), Cart(900,1,1,4), Cart(901,3,3,2), Cart(900,4,2,1))
scala> mRes.toSet == req.toSet
res17: Boolean = true
Explanations:
++ concatenates two lists
groupBy groups values by a key function (here x.ProductId -> x.SellerId, which is equivalent to the tuple (x.ProductId, x.SellerId)). It preserves order inside each group, but the groups themselves aren't ordered, which is why the order of the resulting list is undefined. It returns Map[Key, List[Value]], in your case Map[(Int, Int), List[Cart]]
mapValues transforms each list of carts
reduce inside mapValues reduces each list of carts to a single cart by combining them with the sum function
I didn't have to reattach objects with a unique (x.ProductId, x.SellerId), as they were represented as lists with one element, so reduce didn't touch them: it just returned the first (and only) element.
a.copy(Qty = ...) makes a copy of a with a modified Qty field. In our case I take the left element as a template, so elements that come earlier in (userCart ++ guestCart) have higher priority when the userId is chosen.
Answering the headline's question about subtracting two sets:
scala> Set(1,2,3,4) - 4
res16: scala.collection.immutable.Set[Int] = Set(1, 2, 3)
scala> Set(1,2,3,4) -- Set(3,4)
res15: scala.collection.immutable.Set[Int] = Set(1, 2)
If the elements of the sets are instances of case classes (and hashCode/equals weren't overridden), all fields are compared to check equality between two elements.
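Since the headline asks about lists rather than sets, here is a sketch of two list-level counterparts using standard collection methods (the names all, removed, viaDiff and viaFilter are only for this example). They differ in how they treat repeated elements.

```scala
val all = List(1, 1, 2, 3, 4)
val removed = List(1, 3)

// diff is a multiset difference: each element of `removed` cancels one occurrence
val viaDiff = all.diff(removed)

// filterNot with a membership test drops every occurrence
val removedSet = removed.toSet
val viaFilter = all.filterNot(removedSet.contains)

println(viaDiff)   // List(1, 2, 4)
println(viaFilter) // List(2, 4)
```

Both keep the original order of the remaining elements, unlike a round trip through Set.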
There is a theoretical connection between the groupBy solution and set theory. First, notice that my solution can be expressed with SQL's GROUP BY + AGGREGATE (groupBy with a reduce catamorphism in Scala). SQL is mostly based on relational algebra, which in turn is partially based on set theory.
P.S. By convention, field/value/variable names in Scala should start with a lowercase letter. A leading capital letter signals a constant.

How to sort a list in scala

I am a newbie in scala and I need to sort a very large list with 40000 integers.
The operation is performed many times. So performance is very important.
What is the best method for sorting?

You can sort the list with List.sortWith() by providing a suitable function literal. For example, the following code sorts the initial list alphabetically by the lowercased first character of each element and prints the result:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s: String, t: String) => s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
A much shorter version, using Scala's type inference:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s, t) => s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
For integers there is List.sorted, just use this:
val list = List(4, 3, 2, 1)
val sortedList = list.sorted
println(sortedList)

Just check the docs.
List has several methods for sorting. myList.sorted works for types that already have a defined ordering (like Int, String and others). myList.sortWith and myList.sortBy take a function that helps define the order.
Also, the first link on Google for "scala List sort": http://alvinalexander.com/scala/how-sort-scala-sequences-seq-list-array-buffer-vector-ordering-ordered
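A short sketch of the three methods side by side, plus one option that may be worth measuring when performance matters. The sample data and names are made up for this example, and the suggestion to sort an Array[Int] in place with scala.util.Sorting.quickSort is an assumption about what might help, not a benchmarked result.

```scala
val nums = List(4, 1, 3, 2)
val names = List("ccc", "a", "bb")

val ascending = nums.sorted            // uses the implicit Ordering[Int]
val descending = nums.sortWith(_ > _)  // explicit "comes before" function
val byLength = names.sortBy(_.length)  // order by a derived key

// In-place sort of a primitive array; no new collection is allocated
val arr = Array(4, 1, 3, 2)
scala.util.Sorting.quickSort(arr)

println(ascending)  // List(1, 2, 3, 4)
println(descending) // List(4, 3, 2, 1)
println(byLength)   // List(a, bb, ccc)
println(arr.toList) // List(1, 2, 3, 4)
```

If the sort really runs many times, timing both variants on the actual data is the only reliable way to choose between them.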

You can use (1 to 400000).toList.sorted
