I have a class called Group
class Group(val id: Int, val name: String, val category: String)
I am trying to convert an Array[Group] to a Map[String, Seq[Group]] with category: String as the key. I want to create an empty Seq[Group] and add the Group if the key does not exist, and otherwise update the Seq[Group]. I am not sure how to update the Seq if the key already exists.
groupBy will do it all.
arrayOfGroups.groupBy(_.category)
The result will just be a Map[String, Array[Group]] (because the original container was an array). An Array is not exactly a Seq, so if you want one, you can do
arrayOfGroups.groupBy(_.category).mapValues(_.toSeq)
You can replace toSeq with any more specific transformation (toList, toVector, ...).
It would also be possible to do arrayOfGroups.toSeq.groupBy(_.category)
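For a self-contained picture, here is a minimal sketch with made-up sample data (the names and values are hypothetical):
val arrayOfGroups = Array(
  new Group(1, "alpha", "a"),
  new Group(2, "beta",  "b"),
  new Group(3, "gamma", "a"))

// groupBy builds the whole Map in one pass; no manual
// "create an empty Seq, append if the key exists" bookkeeping is needed.
// The trailing .toMap makes the result strict on Scala 2.13,
// where mapValues returns a view.
val byCategory: Map[String, Seq[Group]] =
  arrayOfGroups.groupBy(_.category).mapValues(_.toSeq).toMap
// byCategory("a") now holds the two groups whose category is "a"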
Related
I have a list that I would like to group, and then get the max value for each group. For example, given a list of user actions, get the last action per user.
case class UserActions(userId: String, actionId: String, actionTime: java.sql.Timestamp) extends Ordered[UserActions] {
def compare(that: UserActions) = this.actionTime.before(that.actionTime).compareTo(true)
}
val actions = List(UserActions("1","1",new java.sql.Timestamp(0L)),UserActions("1","1",new java.sql.Timestamp(1L)))
When I try the following groupBy:
actions.groupBy(_.userId)
I receive a Map:
scala.collection.immutable.Map[String,List[UserActions]] = Map(1 -> List(UserActions(1,1,1970-01-01 00:00:00.0), UserActions(1,1,1970-01-01 00:00:00.001)))
Which is fine, but when I try to add the maxBy I get an error:
actions.groupBy(_.userId).maxBy(_._2)
<console>:13: error: diverging implicit expansion for type
Ordering[List[UserActions]]
starting with method $conforms in object Predef
actions.groupBy(_.userId).maxBy(_._2)
What should I change?
So you have a Map of String (userId) -> List[UserActions] and you want each list reduced to just its max element?
actions.groupBy(_.userId).mapValues(_.max)
//res0: Map[String,UserActions] = Map(1 -> UserActions(1,1,1969-12-31 16:00:00.0))
You don't need maxBy() because you've already added the information needed to order/compare different UserActions elements.
Likewise if you just want the max from the original list.
actions.max
You'd use maxBy() if you wanted the maximum as measured by some other parameter.
actions.maxBy(_.actionId.length)
Your compare method should be:
def compare(that: UserActions) = this.actionTime.compareTo(that.actionTime)
Then do actions.groupBy(_.userId).mapValues(_.max) as @jwvh shows.
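For reference, here is the whole pipeline with the corrected compare, on made-up sample data (the second action's id and timestamp are hypothetical):
case class UserActions(userId: String, actionId: String, actionTime: java.sql.Timestamp) extends Ordered[UserActions] {
  // Timestamp.compareTo already returns the negative/zero/positive Int that Ordered expects
  def compare(that: UserActions) = this.actionTime.compareTo(that.actionTime)
}

val actions = List(
  UserActions("1", "1", new java.sql.Timestamp(0L)),
  UserActions("1", "2", new java.sql.Timestamp(1L)))

actions.groupBy(_.userId).mapValues(_.max)
// Map(1 -> UserActions(1,2,1970-01-01 00:00:00.001)) -- the latest action per user
// (the exact timestamp rendering depends on your timezone)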
You want to group values by userId, which you can do with groupBy().
actions.groupBy(_.userId)
Then you can take just the values from the key/value pairs.
actions.groupBy(_.userId).values
You used maxBy(_._2), so I think you want the max element of each group. UserActions is a case class, not a tuple, so _._2 won't compile on it; compare by the action time instead, using map().
actions.groupBy(_.userId).values.map(_.maxBy(_.actionTime.getTime))
Below is data obtained by joining two tables:
joinDataRdd.take(5).foreach(println)
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT)))
(65722,((164249,365,2,119.98,59.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164250,730,5,400.0,80.0),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164251,1004,1,399.98,399.98),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164252,627,5,199.95,39.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
When I try the following:
val data = joinDataRdd.map(x => (x._1, x._2._1.split(",")(3)))
it throws an error:
value split is not a member of (String, String, String, String, String)
I also tried:
val data = joinDataRdd.map(x => (x._1, x._2._1._1.split(",")(3)))
You are trying to call split on a tuple, which is why you get that error. In
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT)))
the position x._2._1 is the inner tuple (102921,249,2,109.94,54.97), so if you want to dig inside it you need to go one level deeper.
It looks like the values are already in a tuple, so you don't need to split the string. Is
val data = joinDataRdd.map(x=>(x._1,x._2._1._4))
what you are looking for?
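To see the fix without Spark, here is a plain-Scala sketch of a single row with the same nested shape (the element types are guesses from the printed rows):
val row = (41234, ((102921, 249, 2, 109.94, 54.97),
                   ("2014-04-04 00:00:00.0", 3182, "PENDING_PAYMENT")))

row._2._1     // (102921,249,2,109.94,54.97) -- already a tuple, so there is nothing to split
row._2._1._4  // 109.94 -- the fourth field, the same value split(",")(3) was aiming for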
I have a class that looks like this:
case class Person(id: String, name: String, refId: String)
And I have a list of Person.
I want to have a map with
key = refId
value = List[Person] that have the same refId (duplicate keys)
What I did:
val persons = getPersons() // get the List from somewhere
val refMap = new mutable.HashMap[String, Seq[Person]]()
for (person <- persons) {
  refMap.put(person.refId, refMap.getOrElse(person.refId, new ArrayBuffer[Person]) :+ person)
}
That was my first idea and it works, but I want something more Scala-like, or at least something that looks better. Do you have an idea?
I also tried what is written here: Convert List of tuple to map (and deal with duplicate key ?)
But that uses tuples and I couldn't get it to work either.
I also tried mapping my list to tuples first, but:
1. I don't want to iterate over the list twice when it's not necessary (once to create the tuples, once to create the map).
2. I failed with tuples too.
Any help for a better code would be nice.
Try groupBy:
getPersons().groupBy(_.refId): Map[String, List[Person]]
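A quick check with made-up data shows the duplicate keys collapsing into lists, in a single pass and with no mutable map:
val persons = List(
  Person("1", "Ann", "r1"),
  Person("2", "Bob", "r1"),
  Person("3", "Cay", "r2"))

persons.groupBy(_.refId)
// Map(r1 -> List(Person(1,Ann,r1), Person(2,Bob,r1)),
//     r2 -> List(Person(3,Cay,r2)))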
Generally what I am trying to achieve:
I think I would like to remove the case classes from the RDD, but keep the RDD, and am unsure how to do that.
Specifically, what I am trying to do:
What I am trying to achieve is to turn each row of an RDD into JSON, but the JSON can only be a flat list of key:value pairs. When I turn it into JSON in its current form I get
{"CCABINDeviceDataPartial":
{"Tran_Id":"1234weqr",
"TranData":{"Processor_Id":"qqq","Merchant_Id":"1234"},
"BillingAndShippingData":{"Billing_City":"MyCity","Billing_State":"State","Billing_Zip":"000000","Billing_Country":"MexiCanada","Shipping_City":"MyCity","Shipping_State":"State","Shipping_Zip":"000000","Shipping_Country":"USico"}
...
}
}
What I want is
{"Tran_Id":"1234weqr",
"Processor_Id":"qqq",
"Merchant_Id":"1234",
"Billing_City":"MyCity",
"Billing_State":"State",
"Billing_Zip":"000000",
"Billing_Country":"MexiCanada",
"Shipping_City":"MyCity",
"Shipping_State":"State",
"Shipping_Zip":"000000",
"Shipping_Country":"USico"
...
}
I have what I call a parent case class that looks like this:
case class CCABINDeviceDataPartial(Tran_Id: String, TranData: TranData,
BillingAndShippingData: BillingAndShippingData, AcquirerData: AcquirerData,
TimingData: TimingData, RBD_Tran_Id: String, DeviceData1: DeviceData1, ACS_Time: Long,
Payfone_Alias: String, TranStatusData: TranStatusData, Centurion_BIN_Class: String,
BankData: BankData, DeviceData2: DeviceData2, ACS_Host: String,
DeviceData3: DeviceData3, txn_status: String, Device_Type: String,
TranOutcome: TranOutcome, AcsData: AcsData, DateTimeData: DateTimeData)
Now TranData, BillingAndShippingData, AcquirerData, and some others are also case classes. I presume this was done to get around the 22-field limit on case classes in older Scala versions. If you "unroll" everything there are 76 elements in total.
My only working idea is to break out the case classes into DataFrames and then join them together one at a time. This seems a bit onerous, and I am hoping there is a way to just "flatten" the RDD. I have looked at the API documentation for RDDs but don't see anything obvious.
Additional Notes
This is how I currently convert things to JSON.
First, I convert the RDD to a DataFrame with
def rddDistinctToTable(txnData: RDD[CCABINDeviceDataPartial], instanceSpark: SparkService,
                       tableName: String): DataFrame = {
  import instanceSpark.sql.implicits._
  val fullTxns = txnData.filter(x => x.Tran_Id != "0")
  val uniqueTxns = rddToDataFrameHolder(fullTxns.distinct()).toDF()
  uniqueTxns.registerTempTable(tableName)
  uniqueTxns
}
Then I convert to JSON and write to Elasticsearch with
sparkStringJsonRDDFunctions(uniqueTxns.toJSON)
.saveJsonToEs(instanceSpark.sc.getConf.get("es.resource"))
Quick and simple solution:
convert the RDD to a DataFrame
use select to flatten the records (you can use dots to access nested objects, like df.select("somecolumn.*", "another.nested.column"))
use write.json to write the result out as JSON (see the sketch below)
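A minimal sketch of those three steps, assuming a SparkSession named spark and the txnData RDD from the question (only two of the nested structs are shown; the real select would list them all):
import spark.implicits._

val df = txnData.toDF()  // RDD[CCABINDeviceDataPartial] -> DataFrame

// a dotted path ending in .* expands a struct's fields into top-level columns
val flat = df.select(
  $"Tran_Id",
  $"TranData.*",               // Processor_Id, Merchant_Id, ...
  $"BillingAndShippingData.*"  // Billing_City, Billing_State, ...
)

flat.write.json("/path/to/output")  // one flat JSON object per row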
I'm using Play2 + Scala + ReactiveMongo to build a web application. Since MongoDB doesn't require all documents to have the same structure, I make heavy use of case classes with Option parameters to implement models. For example:
case class PersonInfo(address: Option[String],
telephone: Option[String],
age: Option[Int],
job: Option[String])
case class Person(id: UUID, name: String, info: PersonInfo)
Now suppose I want to merge two PersonInfo objects, say old and newInfo, for example in an update function. What I do right now is:
val updPInfo = old.copy(
  address   = newInfo.address   orElse old.address,
  telephone = newInfo.telephone orElse old.telephone,
  age       = newInfo.age       orElse old.age,
  job       = newInfo.job       orElse old.job)
This way I get an object with the new values where the new object specified them, and the old values for the rest.
This actually works fine, but it is a bit ugly when the list of parameters grows.
Is there a more elegant way to do that?
If the only place you need this is in Mongo, you can do it like this:
collection.update(
  Json.obj("_id" -> id),
  Json.obj("$set" -> Json.toJson(newInfo))
)
That way you'll have the correct representation in the DB, which you can read and use afterwards.
or
If you need it in Scala, you can merge the two JSON representations:
val merged = Json.toJson(old).as[JsObject].deepMerge(Json.toJson(newInfo).as[JsObject]).as[PersonInfo]
which is not necessarily better than what you are doing now.
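For completeness, a self-contained sketch of the deepMerge approach (assuming the standard macro-generated Format, which omits None fields from the JSON, so the old values survive the merge):
import play.api.libs.json._

implicit val personInfoFormat: OFormat[PersonInfo] = Json.format[PersonInfo]

// deepMerge is defined on JsObject, hence the .as[JsObject] casts
def mergeInfo(old: PersonInfo, newInfo: PersonInfo): PersonInfo =
  Json.toJson(old).as[JsObject]
    .deepMerge(Json.toJson(newInfo).as[JsObject])
    .as[PersonInfo]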