I am having some trouble with groupByKey in Scala and Spark.
I have two case classes:
case class Employee(id_employee: Long, name_emp: String, salary: String)
For the moment I use this second case class:
case class Company(id_company: Long, employee: Seq[Employee])
However, I want to replace it with this new one:
case class Company(id_company: Long, name_comp: String, employee: Seq[Employee])
There is a parent Dataset (df1) that I use with groupByKey to create Company objects:
val companies = df1.groupByKey(v => v.id_company)
  .mapGroups {
    case (k, iter) => Company(k, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
  }
  .collect()
This code works; it returns objects like this one:
Company(1234,List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))
But I can't figure out how to add the company name_comp to those objects (this field exists in df1), in order to retrieve objects like this (using the new case class):
Company(1234, NYTimes, List(Employee(0987, John, 30000),Employee(4567, Bob, 50000)))
Since you want both the company id and name, you can use a tuple as the key when you group your data. This makes both values available when constructing the Company objects:
df1.groupByKey(v => (v.id_company, v.name_comp))
  .mapGroups { case ((id, name), iter) =>
    Company(id, name, iter.map(x => Employee(x.id_employee, x.name_emp, x.salary)).toSeq)
  }
  .collect()
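For reference, this assumes every row of df1 carries both the company and the employee fields; a minimal sketch of such a row type (the name CompanyEmployeeRow is hypothetical, inferred from the fields accessed above) would be:

// hypothetical row type for df1, inferred from the fields accessed above
case class CompanyEmployeeRow(
  id_company: Long,
  name_comp: String,
  id_employee: Long,
  name_emp: String,
  salary: String
)
// df1: Dataset[CompanyEmployeeRow]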
Given this case class:
case class Categories(fruit: String, amount: Double, mappedTo: String)
I have a list containing the following:
List(
Categories("Others",22.38394964594807,"Others"),
Categories("Others",77.6160503540519,"Others")
)
I want to combine two elements in the list by summing up their amount if they are in the same category, so that the end result in this case would be:
List(Categories("Others",99.99999999999997,"Others"))
How can I do that?
Since groupMapReduce was only introduced in Scala 2.13, I'll try to provide an alternative approach to Martinjn's great answer.
Assuming we have:
case class Categories(Fruit: String, amount: Double, mappedTo: String)
val categories = List(
Categories("Apple",22.38394964594807,"Others"),
Categories("Apple",77.6160503540519,"Others")
)
If you want to aggregate by both mappedTo and Fruit:
val result = categories.groupBy(c => (c.Fruit, c.mappedTo)).map {
  case ((fruit, mappedTo), categories) => Categories(fruit, categories.map(_.amount).sum, mappedTo)
}
Code run can be found at Scastie.
If you want to aggregate only by mappedTo, taking an arbitrary Fruit (here, the first one in each group), you can do:
val result = categories.groupBy(c => c.mappedTo).map {
  case (mappedTo, categories) => Categories(categories.head.Fruit, categories.map(_.amount).sum, mappedTo)
}
Code run can be found at Scastie.
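Note that groupBy returns a Map, so mapping over it yields an Iterable rather than the List shown in the question; a quick sketch of recovering the List shape (resultList is just an illustrative name):

val resultList: List[Categories] = result.toList
// e.g. List(Categories(Apple,99.99999999999997,Others)) for the sample data above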
You want to group your list entries by category and reduce each group to a single value. There is groupMapReduce for that: it groups the entries, maps each entry (which you don't really need here), and then reduces each group to a single value.
Given
case class Category(category: String, amount: Double)
if you have a val myList: List[Category], then you want to group on Category#category and reduce each group by merging its members, summing up the amounts.
That gives:
// group on category, map with identity (we don't need to transform the entries),
// then reduce each group by keeping the name and summing the amounts
myList.groupMapReduce(_.category)(identity) {
  case (Category(name, amount1), Category(_, amount2)) =>
    Category(name, amount1 + amount2)
}
In theory just a groupReduce would have been enough, but that doesn't exist, so we're stuck with the identity here.
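For completeness, a small self-contained sketch using the two sample amounts from the question (groupMapReduce returns a Map keyed by category, so .values.toList recovers the List shape):

val myList = List(
  Category("Others", 22.38394964594807),
  Category("Others", 77.6160503540519)
)

val merged: Map[String, Category] =
  myList.groupMapReduce(_.category)(identity) {
    case (Category(name, amount1), Category(_, amount2)) =>
      Category(name, amount1 + amount2)
  }

merged.values.toList
// List(Category(Others,99.99999999999997))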
If I have a case class like below:
case class Student(name: String, activities: Seq[String], grade: String)
And I have a List like this:
val students = List(
Student("John", List("soccer", "Video Games"), "9th"),
Student("Jane", List("sword fighting", "debate"), "10th"),
Student("Boy Wonder", List("1", "5", "2"), "5th")
)
How can I sort the contents based on name and activities attributes to form a string? In the scenario above the string would be:
boywonder_1_2_5_5th_jane_debate_swordfighting_10th_john_soccer_videogames_9th
The sorting in this case is done like this:
First the elements are sorted by name -- that's why boywonder comes first in the final string
Then each element's activities are sorted as well -- that's why Boy Wonder's activities appear as 1_2_5
You need to:
Make everything lowercase.
Sort the inner list activities.
Sort the outer list students, by name.
Turn everything into a String.
Here is the code:
students
  .map { student =>
    student.copy(
      name = student.name.toLowerCase,
      // lowercase first, then sort, so the order is not skewed by capital letters
      activities = student.activities.map(_.toLowerCase).sorted
    )
  }
  .sortBy(_.name)
  // join each student's parts, and the students themselves, with underscores
  .map(student => s"${student.name}_${student.activities.mkString("_")}_${student.grade}")
  .mkString("_")
  .replaceAll("\\s", "")
// res: String = "boywonder_1_2_5_5th_jane_debate_swordfighting_10th_john_soccer_videogames_9th"
I have the following file in Hadoop
val dataset=sc.textFile("/user/hue/mycompanies1.csv")
It looks like this:
CS,84,Jimmys Bistro, Jimmys
CS,90,Pauls Fish
CS,100, Happy Hardware
My Scala/Spark code looks like this:
case class Company(
  record_type: String,
  company_num: Integer,
  company_name: String,
  nickname: String
)

val company = dataset.map(k => k.split(",")).map(
  k => Company(k(0).trim, k(1).toInt, k(2).trim, k(3).trim)
)

company.toDF().registerTempTable("company_table4")
When I try to access the company RDD afterwards, I get a NullPointerException because of the missing nickname value in the data. How do I deal with this gracefully?
Since the nickname is optional, I would change the case class to reflect that, and then use one of various ways to optionally obtain the index-3 element, e.g.:
case class Company (
record_type: String,
company_num: Integer,
company_name: String,
nickname: Option[String]
)
val company = dataset.map(k => k.split(",")).map(
  k => Company(k(0).trim, k(1).toInt, k(2).trim, k.drop(3).headOption.map(_.trim))
)
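If you then register the result as before, Spark's reflection-based schema maps the Option[String] field to a nullable column, so missing nicknames simply come back as NULL. A rough sketch, assuming the usual implicits are in scope (import sqlContext.implicits._ on Spark 1.x or import spark.implicits._ on Spark 2.x+):

// assumed: import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (Spark 2.x+)
val companyDF = company.toDF()
companyDF.registerTempTable("company_table4")
// on Spark 2.x+ prefer: companyDF.createOrReplaceTempView("company_table4")

// rows without a nickname now have NULL in that column instead of throwing
// sqlContext.sql("SELECT company_name, nickname FROM company_table4").show()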
I have some nested case classes, and the addresses field is a Seq[Address]:
// ... means other fields
case class Street(name: String, ...)
case class Address(street: Street, ...)
case class Company(addresses: Seq[Address], ...)
case class Employee(company: Company, ...)
I have an employee:
val employee = Employee(Company(Seq(
Address(Street("aaa street")),
Address(Street("bbb street")),
Address(Street("bpp street")))))
It has 3 addresses.
And I want to capitalize only the streets that start with "b". My code is a mess, like the following:
val modified = employee.copy(company = employee.company.copy(addresses =
  employee.company.addresses.map { address =>
    address.copy(street = address.street.copy(name = {
      if (address.street.name.startsWith("b")) {
        address.street.name.capitalize
      } else {
        address.street.name
      }
    }))
  }))
The modified employee is then:
Employee(Company(List(
Address(Street(aaa street)),
Address(Street(Bbb street)),
Address(Street(Bpp street)))))
I'm looking for a way to improve it but can't find one. I even tried Monocle, but couldn't apply it to this problem.
Is there any way to make it better?
PS: there are two key requirements:
use only immutable data
don't lose other existing fields
As Peter Neyens points out, Shapeless's SYB works really nicely here, but it will modify all Street values in the tree, which may not always be what you want. If you need more control over the path, Monocle can help:
import monocle.Traversal
import monocle.function.all._, monocle.macros._, monocle.std.list._
val employeeStreetNameLens: Traversal[Employee, String] =
  GenLens[Employee](_.company).composeTraversal(
    GenLens[Company](_.addresses)
      .composeTraversal(each)
      .composeLens(GenLens[Address](_.street))
      .composeLens(GenLens[Street](_.name))
  )
val capitalizer = employeeStreetNameLens.modify {
  case s if s.startsWith("b") => s.capitalize
  case s => s
}
As Julien Truffaut points out in an edit, you can make this even more concise (but less general) by creating a lens all the way to the first character of the street name:
import monocle.std.string._
val employeeStreetNameFirstLens: Traversal[Employee, Char] =
  GenLens[Employee](_.company.addresses)
    .composeTraversal(each)
    .composeLens(GenLens[Address](_.street.name))
    .composeOptional(headOption)

val capitalizer = employeeStreetNameFirstLens.modify {
  case 'b' => 'B'
  case s => s
}
There are symbolic operators that would make the definitions above a little more concise, but I prefer the non-symbolic versions.
And then (with the result reformatted for clarity):
scala> capitalizer(employee)
res3: Employee = Employee(
Company(
List(
Address(Street(aaa street)),
Address(Street(Bbb street)),
Address(Street(Bpp street))
)
)
)
Note that as in the Shapeless answer, you'll need to change your Employee definition to use List instead of Seq, or if you don't want to change your model, you could build that transformation into the Lens with an Iso[Seq[A], List[A]].
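For reference, a rough sketch of that Iso approach (assuming the same Monocle 1.x GenLens/compose* API used above; the names seqToList and employeeStreetNameLensSeq are just illustrative), which lets addresses stay a Seq:

import monocle.Iso

// illustrative Iso between Seq[A] and List[A]
def seqToList[A]: Iso[Seq[A], List[A]] =
  Iso[Seq[A], List[A]](_.toList)(l => l)

val employeeStreetNameLensSeq: Traversal[Employee, String] =
  GenLens[Employee](_.company)
    .composeLens(GenLens[Company](_.addresses))
    .composeIso(seqToList)
    .composeTraversal(each)
    .composeLens(GenLens[Address](_.street))
    .composeLens(GenLens[Street](_.name))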
If you are open to replacing the addresses in Company from Seq to List, you can use "Scrap Your Boilerplate" from shapeless (example).
import shapeless._, poly._
case class Street(name: String)
case class Address(street: Street)
case class Company(addresses: List[Address])
case class Employee(company: Company)
val employee = Employee(Company(List(
Address(Street("aaa street")),
Address(Street("bbb street")),
Address(Street("bpp street")))))
You can create a polymorphic function which capitalizes the name of a Street if the name starts with a "b".
object capitalizeStreet extends ->(
(s: Street) => {
val name = if (s.name.startsWith("b")) s.name.capitalize else s.name
Street(name)
}
)
Which you can use as:
val afterCapitalize = everywhere(capitalizeStreet)(employee)
// Employee(Company(List(
// Address(Street(aaa street)),
// Address(Street(Bbb street)),
// Address(Street(Bpp street)))))
Take a look at quicklens.
You could do it like this:
import com.softwaremill.quicklens._
case class Street(name: String)
case class Address(street: Street)
case class Company(address: Seq[Address])
case class Employee(company: Company)
object Foo {
  def foo(e: Employee) = {
    modify(e)(_.company.address.each.street.name).using {
      case name if name.startsWith("b") => name.capitalize
      case name => name
    }
  }
}
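Calling it on the employee from the question then gives the same result as the other answers (a quick sketch; note that this answer's Company field is named address rather than addresses):

val employee = Employee(Company(Seq(
  Address(Street("aaa street")),
  Address(Street("bbb street")),
  Address(Street("bpp street")))))

Foo.foo(employee)
// Employee(Company(List(Address(Street(aaa street)), Address(Street(Bbb street)), Address(Street(Bpp street)))))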
Given a list of Order objects...
case class Order(val id: String, val orderType: Option[String])
case class Transaction (val id: String, ...)
val orders = List(Order("1", Some("sell")), Order("2", None), ...)
... I need to create a sequence of Futures for all those orders that have a type (i.e. orderType is defined):
val transactions: Seq[Future[Transaction]] =
  orders.filter(_.orderType.isDefined).map { order =>
    trxService.findTransactions(order.id) // this returns a Future[Transaction]
  }
The code above first invokes filter, which creates a new List containing only orders with orderType set to Some, and then creates a sequence of Futures out of it. Is there a more efficient way to accomplish this?
You can combine the filter and map into a single pass using collect:
val transactions: Seq[Future[Transaction]] = orders.collect {
  case order if order.orderType.isDefined => trxService.findTransactions(order.id)
}
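An equivalent single-pass alternative, if you prefer to work through the Option directly (a sketch reusing the same trxService call from the question):

val transactions: Seq[Future[Transaction]] =
  orders.flatMap(order => order.orderType.map(_ => trxService.findTransactions(order.id)))

Either way, only orders with a defined orderType produce a Future, and no intermediate filtered List is built.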