How to convert Seq[Object] to Map[User, Set[String]] in Scala

It's really hard to explain in the title, but here's what I want to do. I'm pretty new to Scala. I have a case class User, where two users are equal if they have the same user id.
case class UserCustomFeature(
  hobby: String,
  users: Set[User]
)
My input is a Seq[UserCustomFeature], so basically a list of hobby -> users mappings. For example,
[('tv' -> Set('user1', 'user2')),
('swimming' -> Set('user2', 'user3'))]
And I want the result to be
('user1' -> Set('tv')),
('user2' -> Set('tv', 'swimming')),
('user3' -> Set('swimming'))
I have something like this so far, but I'm not sure how to group the pairs afterwards:
userHobbyMap
  .map {
    case (hobby, users) =>
      users.map(user => (user, hobby))
  }

case class User(id: String)

case class UserCustomFeature(
  hobby: String,
  users: Set[User]
)

val input = Seq(
  UserCustomFeature("tv", Set(User("1"), User("2"))),
  UserCustomFeature("swimming", Set(User("2"), User("3")))
)

val output = (for (UserCustomFeature(h, us) <- input; u <- us) yield (u, h))
  .groupBy(_._1)
  .mapValues(_.map(_._2).toSet)

output foreach println
Generates output:
(User(1),Set(tv))
(User(3),Set(swimming))
(User(2),Set(tv, swimming))
Brief explanation:
The for-comprehension flattens the sequence of UserCustomFeatures into a list of (user, hobby) pairs.
groupBy groups the pairs by user (the first component).
map(_._2) drops the redundant user from each grouped pair, leaving only the hobbies.
toSet converts the resulting list of hobbies into a set of hobbies.
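For what it's worth, mapValues is deprecated on Scala 2.13; a minimal sketch of the same idea using groupMap (assuming the same input as above) could look like this:
// Scala 2.13+ sketch: build (user, hobby) pairs, group by user, keep hobbies as a Set
val output213: Map[User, Set[String]] =
  input
    .flatMap { case UserCustomFeature(h, us) => us.map(u => (u, h)) }
    .groupMap { case (u, _) => u } { case (_, h) => h } // Map[User, Seq[String]]
    .view
    .mapValues(_.toSet)
    .toMap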

Related

How can I group by the individual elements of a list of elements in Scala

Forgive me if I'm not naming things by their actual names, I've just started to learn Scala. I've been looking around for a while, but cannot find a clear answer to my question.
Suppose I have a list of objects, each object has two fields: x: Int and l: List[String], where the Strings, in my case, represent categories.
The l lists can be of arbitrary length, so an object can belong to multiple categories. Furthermore, various objects can belong to the same category. My goal is to group the objects by the individual categories, where the categories are the keys. This means that if an object is linked to say "N" categories, it will occur in "N" of the key-value pairs.
So far I managed to groupBy the lists of categories through:
objectList.groupBy(x => x.l)
However, this obviously groups the objects by list of categories rather than by categories.
I'm trying to do this with immutable collections avoiding loops etc.
If anyone has some ideas that would be much appreciated!
Thanks
EDIT:
By request, the actual case class and what I am trying to do.
case class Car(make: String, model: String, fuelCapacity: Option[Int], category:Option[List[String]])
Once again, a car can belong to multiple categories. Let's say List("SUV", "offroad", "family").
I want to group by category elements rather than by the whole list of categories, and have the fuelCapacity as the values, in order to be able to extract average fuelCapacity per category amongst other metrics.
Using your EDIT as a guide.
case class Car(
  make: String,
  model: String,
  fuelCapacity: Option[Int],
  category: Option[List[String]]
)

val cars: List[Car] = ???

// all currently known category strings
val cats: Set[String] = cars.flatMap(_.category).flatten.toSet

// category -> list of cars in this category
// (note: category.exists(_.contains(cat)) checks inside the Option's List;
// Option.contains(cat) would compare the whole List against a String and never match)
val catMap: Map[String, List[Car]] =
  cats.map(cat => (cat, cars.filter(_.category.exists(_.contains(cat))))).toMap

// category -> average fuel capacity for cars in this category
val fcAvg: Map[String, Double] =
  catMap.map { case (cat, carsInCat) =>
    val fcaps: List[Int] = carsInCat.flatMap(_.fuelCapacity)
    if (fcaps.isEmpty) (cat, -1d)
    else (cat, fcaps.sum.toDouble / fcaps.length)
  }
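For illustration only, a hypothetical cars value (standing in for the ??? above) and the averages fcAvg would then contain:
// hypothetical sample data, not from the question
val cars: List[Car] = List(
  Car("Toyota", "RAV4", Some(55), Some(List("SUV", "family"))),
  Car("Jeep", "Wrangler", Some(81), Some(List("SUV", "offroad"))),
  Car("Fiat", "500", None, Some(List("city")))
)
// fcAvg would then be Map(SUV -> 68.0, family -> 55.0, offroad -> 81.0, city -> -1.0)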
Something like the following?
objectList // Seq[YourType]
.flatMap(o => o.l.map(c => c -> o)) // Seq[(String, YourType)]
.groupBy { case (c,_) => c } // Map[String,Seq[(String,YourType)]]
.mapValues { items => items.map { case (_, o) => o } } // Map[String, Seq[YourType]]
(Deliberately "heavy" to help you understand the idea behind it)
EDIT, or as of Scala 2.13 thanks to groupMap:
objectList // Seq[YourType]
.flatMap(o => o.l.map(c => c -> o)) // Seq[(String, YourType)]
.groupMap { case (c,_) => c } { case (_, o) => o } // Map[String,Seq[YourType]]
You are very close; you just need to split each individual element of the list before grouping, so try something like this:
// I used a Set instead of a List,
// since I don't think the order of categories matters,
// and having the same category twice doesn't make sense either.
final case class MyObject(x: Int, categories: Set[String] = Set.empty) {
  def addCategory(category: String): MyObject =
    this.copy(categories = this.categories + category)
}

def groupByCategories(data: List[MyObject]): Map[String, List[Int]] =
  data
    .flatMap(o => o.categories.map(c => c -> o.x))
    .groupMap(_._1)(_._2)
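A quick usage sketch (the sample values are made up, not from the question):
// hypothetical data to exercise groupByCategories
val data = List(
  MyObject(10, Set("SUV", "family")),
  MyObject(20, Set("SUV")),
  MyObject(30, Set("offroad"))
)
groupByCategories(data)
// Map(SUV -> List(10, 20), family -> List(10), offroad -> List(30))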

Grouping duplicates in a csv file

I have a CSV file onto which I've applied a case class and made into a list. The CSV file is like this:
"user_id","age","liked_ad","location"
2145,34,true,USA
6786,25,true,UK
9025,21,false,USA
1145,40,false,UK
It goes on. Ultimately I am trying to find the top user_ids with the most liked_ads (true values). I know that there are duplicates within the CSV file, because I did:
val origFile = processCSV("src/main/resources/advert-data.csv")
val origFileLength = origFile.length
val uniqueList = origFile.distinct
val uniqueListLength = uniqueList.length
The two lengths were different. I am thinking I need to group all the user_ids so that all entries for the same user_id are in one group, where I can then count how many trues are in that user's entries. I am completely stuck on the right way to go about this.
This is my processCSV function at the moment -
final case class AdvertInfo(userId: Int, age: Int, likedAd: Boolean, location: String)

def processCSV(file: String): List[AdvertInfo] = {
  val data = io.Source.fromFile(file)
  data
    .getLines()
    .map(_.split(',').iterator.map(_.trim).toList)
    .flatMap {
      case userIdRaw :: ageRaw :: likedAdRaw :: locationRaw :: Nil =>
        for {
          userId <- userIdRaw.toIntOption
          age <- ageRaw.toIntOption
          likedAd <- likedAdRaw.toBooleanOption
          location <- Some(locationRaw)
        } yield AdvertInfo(userId, age, likedAd, location)
      case _ =>
        None
    }.toList
}
Your description is a bit confusing but I think what you want is:
origFile.filter(_.likedAd)
.groupMapReduce(_.userId)(_ => 1)(_+_) //Scala 2.13.x
The result is a Map with the user_ids as the keys and the count of all the liked_ad=="true" as the values.
From there you can .toList.sortBy(-_._2) in order to get the ranking-by-liked-count.
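Putting the pieces together, a small sketch of the full ranking, assuming origFile: List[AdvertInfo] comes from processCSV above:
// count liked ads per user, then rank users by that count, highest first (Scala 2.13+)
val topLikers: List[(Int, Int)] =
  origFile
    .filter(_.likedAd)
    .groupMapReduce(_.userId)(_ => 1)(_ + _) // Map[userId, likedCount]
    .toList
    .sortBy { case (_, count) => -count }

topLikers.take(10).foreach { case (userId, count) =>
  println(s"$userId liked $count ads")
}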

Consolidate a list of Futures into a Map in Scala

I have two case classes P(id: String, ...) and Q(id: String, ...), and two functions returning futures:
One that retrieves a list of objects given a list of id-s:
def retrieve(ids: Seq[String]): Future[Seq[P]] = Future { ... }
The length of the result might be shorter than the input, if not all id-s were found.
One that further transforms P to some other type Q:
def transform(p: P): Future[Q] = Future { ... }
What I would like in the end is, the following. Given ids: Seq[String], calculate a Future[Map[String, Option[Q]]].
Every id from ids should be a key in the map, with id -> Some(q) when it was retrieved successfully (ie. present in the result of retrieve) and also transformed successfully. Otherwise, the map should contain id -> None or Empty.
How can I achieve this?
Is there an .id property on P or Q? You would need one to create the map. Something like this?
for {
  ps <- retrieve(ids)
  qs <- Future.sequence(ps.map(p => transform(p)))
} yield ids.map(id => id -> qs.find(_.id == id)).toMap
Keep in mind that Map[String,Option[X]] is usually not necessary, since if you have Map[String,X] the .get method on the map will give you an Option[X].
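To illustrate that last remark, a tiny sketch:
// Map[String, X] already gives you an Option[X] via .get
val m: Map[String, Int] = Map("a" -> 1)
m.get("a") // Some(1)
m.get("b") // None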
Edit: Now assumes that P has a member id that equals the original id-String, otherwise the connection between ids and ps gets lost after retrieve.
def consolidatedMap(ids: Seq[String]): Future[Map[String, Option[Q]]] = {
  for {
    ps <- retrieve(ids)
    qOpts <- Future.traverse(ps) { p =>
      transform(p).map(Option(_)).recover {
        // TODO: don't sweep `Throwable` under the rug in your real code
        case t: Throwable => None
      }
    }
  } yield {
    val qMap = (ps.map(_.id) zip qOpts).toMap
    ids.map { id => (id, qMap.getOrElse(id, None)) }.toMap
  }
}
This builds an intermediate Map from the retrieved Ps and transformed Qs, so that building the ids-to-Q-options map works in linear time.
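A hedged usage sketch, assuming the P/Q definitions from the question and an implicit ExecutionContext in scope:
import scala.concurrent.ExecutionContext.Implicits.global

// hypothetical ids; any id that retrieve or transform can't resolve ends up mapped to None
consolidatedMap(Seq("a", "b", "c")).foreach { result =>
  // e.g. Map(a -> Some(Q(...)), b -> None, c -> Some(Q(...)))
  println(result)
}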

How to create Map using multiple lists in Spark

I am trying to figure out how to access particular elements from RDD myRDD with example entries below:
(600,List((600,111,7,1), (615,111,3,5)))
(601,List((622,112,2,1), (615,111,3,5), (456,111,9,12)))
I want to extract some data from a Redis DB, using the 3rd field of the sub-tuples as the ID. For example, in the case of (600,List((600,111,7,1), (615,111,3,5))), the IDs are 7 and 3.
In the case of (601,List((622,112,2,1), (615,111,3,5), (456,111,9,12))), the IDs are 2, 3 and 9.
The problem is that I don't know how to collect values using multiple IDs. In the given code below, I use line._2(3), but it's not correct, because this way I access sublists instead of the fields inside these sublists.
Should I use flatMap or similar?
val newRDD = myRDD.mapPartitions(iter => {
  val redisPool = new Pool(new JedisPool(new JedisPoolConfig(), "localhost", 6379, 2000))
  iter.map({ line => (line._1,
    redisPool.withJedisClient { client =>
      val start_date: String = Dress.up(client).hget("id:" + line._2(3), "start_date")
      val end_date: String = Dress.up(client).hget("id:" + line._2(3), "end_date")
      val additionalData = List((start_date, end_date))
      Map(("base_data", line._2), ("additional_data", additionalData))
    })
  })
})
newRDD.collect().foreach(println)
If we assume that the Redis DB contains some relevant data, then the resulting newRDD could be the following:
(600,Map("base_data" -> List((600,111,7,1), (615,111,3,5)), "additional_data" -> List((2014,2015),(2015,2016))))
(601,Map("base_data" -> List((622,112,2,1), (615,111,3,5), (456,111,9,12)), "additional_data" -> List((2010,2015),(2011,2016),(2014,2016))))
To get a list of the third elements of each tuple in line._2, use line._2.map(_._3) (assuming the type of line is (Int, List[(Int, Int, Int, Int)]), as it appears from your example, and types like Any aren't involved). Overall, it seems like your code should look like this:
iter.map({ case (first, second) => (first,
  redisPool.withJedisClient { client =>
    val additionalData = second.map { tuple =>
      val start_date: String = Dress.up(client).hget("id:" + tuple._3, "start_date")
      val end_date: String = Dress.up(client).hget("id:" + tuple._3, "end_date")
      (start_date, end_date)
    }
    Map(("base_data", second), ("additional_data", additionalData))
  })
})
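To see just the ID extraction in isolation (no Spark or Redis needed), a minimal sketch using the sample data from the question:
// the third field of every sub-tuple is the ID
val line = (600, List((600, 111, 7, 1), (615, 111, 3, 5)))
val ids = line._2.map(_._3) // List(7, 3)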

Scala: batch process Map of index-based form fields

Part of a web app I'm working on handles forms that need to be bound to a collection of model (case class) instances. See this question
So, if I were to add several users at one time, form fields would be named email[0], email[1], password[0], password[1], etc.
Posting the form results in a Map[String, Seq[String]]
Now, what I would like to do is to process the Map in batches, by index, so that for each iteration I can bind a User instance, creating a List[User] as the final result of the bindings.
The hacked approach I'm thinking of is to regex match against "[\d]" in the Map keys and then find the highest index via filter or count; with that, then (0 to n).toList map { ?? } through the number of form-field rows, calling the binding/validation method (which also takes a Map[String, Seq[String]]) accordingly.
What is a concise way to achieve this?
Assuming that:
All map keys are in form "field[index]"
There is only one value in Seq for each key.
If there is an entry for "email[x]" then there is an entry for "password[x]", and vice versa.
I would do something like this:
val request = Map(
  "email[0]" -> Seq("alice#example.com"),
  "email[1]" -> Seq("bob#example.com"),
  "password[0]" -> Seq("%vT*n7#4"),
  "password[1]" -> Seq("Bfts7B&^")
)

case class User(email: String, password: String)

val Field = """(.+)\[(\d+)\]""".r

val userList = request
  .groupBy { case (Field(_, idx), _) => idx.toInt }
  .mapValues { userMap =>
    def extractField(name: String) =
      userMap.collect { case (Field(`name`, _), values) => values.head }.head
    User(extractField("email"), extractField("password"))
  }
  .toList.sortBy(_._1).map(_._2)
Running this in the REPL (via :paste) gives:
request: scala.collection.immutable.Map[String,Seq[String]] = Map(email[0] -> List(alice#example.com),
email[1] -> List(bob#example.com), password[0] -> List(%vT*n7#4), password[1] -> List(Bfts7B&^))
defined class User
Field: scala.util.matching.Regex = (.+)\[(\d+)\]
userList: List[User] = List(User(alice#example.com,%vT*n7#4), User(bob#example.com,Bfts7B&^))