Here is the data produced by joining two tables:
joinDataRdd.take(5).foreach(println)
(41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT)))
(65722,((164249,365,2,119.98,59.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164250,730,5,400.0,80.0),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164251,1004,1,399.98,399.98),(2014-05-23 00:00:00.0,4077,COMPLETE)))
(65722,((164252,627,5,199.95,39.99),(2014-05-23 00:00:00.0,4077,COMPLETE)))
When I try the following:
val data = joinDataRdd.map(x=>(x._1,x._2._1.split(",")(3)))
it throws an error:
value split is not a member of (String, String, String, String, String)
I also tried:
val data = joinDataRdd.map(x=>(x._1,x._2._1._1.split(",")(3)))
You are trying to call split on a tuple, which is why you get that error message. At the position x._2._1 in (41234,((102921,249,2,109.94,54.97),(2014-04-04 00:00:00.0,3182,PENDING_PAYMENT))), the value is the inner tuple (102921,249,2,109.94,54.97). So if you want to dig inside that tuple, you need to advance one more position.
It looks like the values are already in a tuple, so you don't need to split the string. Is
val data = joinDataRdd.map(x=>(x._1,x._2._1._4))
what you are looking for?
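A minimal runnable sketch of that access pattern (the sample rows and column meanings here are assumptions based on the output above):

import org.apache.spark.sql.SparkSession

object JoinTupleExample extends App {
  val spark = SparkSession.builder.master("local[*]").appName("join-tuple").getOrCreate()
  val sc = spark.sparkContext

  // (orderId, (orderItemId, productId, quantity, subtotal, price)) -- hypothetical sample row
  val orderItems = sc.parallelize(Seq((41234, (102921, 249, 2, 109.94, 54.97))))
  // (orderId, (orderDate, customerId, status)) -- hypothetical sample row
  val orders = sc.parallelize(Seq((41234, ("2014-04-04 00:00:00.0", 3182, "PENDING_PAYMENT"))))

  // After the join, each value is already a pair of the two source tuples, not a String,
  // so fields are accessed positionally rather than by splitting on commas.
  val joinDataRdd = orderItems.join(orders)
  val data = joinDataRdd.map(x => (x._1, x._2._1._4))

  data.foreach(println)   // (41234,109.94)

  spark.stop()
}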
In the following piece of code, entities is a Map[String, Seq[String]] object that I receive from some other piece of code. The goal is to map the entities object into a two-column Spark DataFrame, but before I get there I found some very unusual results.
val data: Map[String, Seq[String]] = Map("idtag" -> Seq("things", "associated", "with", "id"))
println(data)
println(data.toSeq)
data.toSeq.foreach{println}
data.toSeq.map{case(id: String, names: Seq[String]) => names}.foreach{println}
val eSeq: Seq[(String, Seq[String])] = entities.toSeq
println(eSeq.head)
println(eSeq.head.getClass)
println(eSeq.head._1.getClass)
println(eSeq.head._2.getClass)
eSeq.map{case(id: String, names: Seq[String]) => names}.foreach{println}
The output of the above on the console is:
Map(idtag -> List(things, associated, with, id))
ArrayBuffer((idtag,List(things, associated, with, id)))
(idtag,List(things, associated, with, id))
List(things, associated, with, id)
(0CY4NZ-E,["MEC", "Marriott-MEC", "Media IQ - Kimberly Clark c/o Mindshare", "Mindshare", "WPP", "WPP Plc", "Wavemaker Global", "Wavemaker Global Ltd"])
class scala.Tuple2
class java.lang.String
class java.lang.String
Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to scala.collection.Seq
at package.EntityList$$anonfun$toStorage$4.apply(EntityList.scala:31)
The data object that I hardcoded acts as expected. The .toSeq function on that hardcoded map produces a Seq (implemented as an ArrayBuffer) of tuples, and these tuples can be processed through mapping.
But with the entities object, you can see that when I take the first element using .head, it is effectively a Tuple2[String, String]. How can that possibly happen? How does the second element of the tuple turn into a String and cause the exception?
Further confusing me, if the last line is changed to reflect the Tuple2[String, String]:
eSeq.map{case(id: String, names: String) => names}.foreach{println}
then we get a compile error:
/path/to/repo/src/main/scala/package/EntityList.scala:31: error: pattern type is incompatible with expected type;
found : String
required: Seq[String]
eSeq.map{case(id: String, names: String) => names}.foreach{println}
I can't replicate this odd behavior with a Map[String, Seq[String]] that I create myself, as you can see in the code above. Can anyone explain this behavior and why it happens?
The problem appears to be that entities.toSeq is lying about the type of the data that it is returning, so I would look at "some other piece of code" and check it is doing the right thing.
Specifically, it claims to return Seq[(String, Seq[String])] and the compiler believes it. But getClass shows that the second object in the tuple is actually java.lang.String not Seq[String].
If that is what's happening, then the match statement would use unapply to extract the value and would then fail when it tries to cast names to the declared type, which matches the ClassCastException you see.
I note that the string appears to be a list of strings enclosed in [ ], so it seems possible that whatever is creating entities is failing to parse this into a Seq but claiming that it has succeeded.
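For what it's worth, here is a hedged sketch (with made-up data) of how an upstream unchecked cast can produce exactly this symptom, because the String and Seq[String] type arguments are erased at runtime:

object ErasureSketch extends App {
  // Hypothetical upstream code: the value is really a raw String (a JSON-like blob that was never parsed).
  val raw: Map[String, Any] = Map(
    "0CY4NZ-E" -> """["MEC", "Marriott-MEC", "Mindshare"]"""
  )

  // The cast compiles and "succeeds": only Map itself is checked at runtime,
  // the String / Seq[String] type arguments are erased.
  val entities: Map[String, Seq[String]] = raw.asInstanceOf[Map[String, Seq[String]]]

  val eSeq: Seq[(String, Seq[String])] = entities.toSeq
  println(eSeq.head._2.getClass)   // class java.lang.String, despite the declared type

  // The failure only surfaces when the value is actually used as a Seq, and
  // (as in the question) it shows up as a ClassCastException rather than a MatchError:
  eSeq.map { case (id: String, names: Seq[String]) => names }.foreach(println)
}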
My underlying function is defined like this:
def rowToSHA1(s: Seq[Any]): String = {
  // return the SHA-1 of the sequence
}
Here is the definition of my udf:
val toSha = udf[String, Seq[Any]](rowToSHA1)
df.withColumn("shavalue",(toSha(array($"id",$"name",$"description",$"accepted")))
It works when I pass only string columns as parameters, but I get an error when one of them is a boolean:
org.apache.spark.sql.AnalysisException: cannot resolve 'array(`id`, `name`,
`description`, `accepted`)' due to data type mismatch: input to function
array should all be the same type, but it's [string, string, string,
boolean];;
I'm exploring the use of a generic function; is it a good idea?
FIX: I converted my column to a string before applying the function:
df.withColumn("shavalue", toSha(array($"id", $"name", $"description", $"accepted".cast("string"))))
The best solution I know for this kind of situation is to just convert everything to String. When you read/create the DataFrame, make sure everything is a String, or convert it at some point; later you can convert it back to any other type.
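As a self-contained sketch of the whole approach (the SHA-1 implementation and the sample rows here are assumptions, not the original code):

import java.security.MessageDigest

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, udf}

object ShaUdfSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("sha-udf").getOrCreate()
  import spark.implicits._

  // Hypothetical rows standing in for the real DataFrame.
  val df = Seq((1, "widget", "a widget", true)).toDF("id", "name", "description", "accepted")

  // One possible rowToSHA1: hash the concatenated string forms of the values.
  def rowToSHA1(s: Seq[Any]): String =
    MessageDigest.getInstance("SHA-1")
      .digest(s.mkString("|").getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  val toSha = udf[String, Seq[Any]](rowToSHA1)

  // array() requires all of its inputs to share one type, so cast the non-string columns first.
  val result = df.withColumn(
    "shavalue",
    toSha(array($"id".cast("string"), $"name", $"description", $"accepted".cast("string")))
  )
  result.show(false)

  spark.stop()
}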
I have a list that I would like to group, and then for each group get the max value. For example, from a list of user actions, get the last action per user.
case class UserActions(userId: String, actionId: String, actionTime: java.sql.Timestamp) extends Ordered[UserActions] {
def compare(that: UserActions) = this.actionTime.before(that.actionTime).compareTo(true)
}
val actions = List(UserActions("1","1",new java.sql.Timestamp(0L)),UserActions("1","1",new java.sql.Timestamp(1L)))
When I try the following groupBy:
actions.groupBy(_.userId)
I receive a Map
scala.collection.immutable.Map[String,List[UserActions]] = Map(1 -> List(UserActions(1,1,1970-01-01 00:00:00.0), UserActions(1,1,1970-01-01 00:00:00.001)))
Which is fine, but when I try to add the maxBy I get an error:
actions.groupBy(_.userId).maxBy(_._2)
<console>:13: error: diverging implicit expansion for type
Ordering[List[UserActions]]
starting with method $conforms in object Predef
actions.groupBy(_.userId).maxBy(_._2)
What should I change?
Thanks
Nir
So you have a Map of String (userId) -> List[UserActions] and you want each list reduced to just its max element?
actions.groupBy(_.userId).mapValues(_.max)
//res0: Map[String,UserActions] = Map(1 -> UserActions(1,1,1969-12-31 16:00:00.0))
You don't need maxBy() because you've already added the information needed to order/compare different UserActions elements.
Likewise if you just want the max from the original list.
actions.max
You'd use maxBy() if you wanted the maximum as measured by some other parameter.
actions.maxBy(_.actionId.length)
Your compare method should be:
def compare(that: UserActions) = this.actionTime.compareTo(that.actionTime)
Then do actions.groupBy(_.userId).mapValues(_.max) as @jwvh shows.
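Putting both pieces together, a minimal sketch (assuming Scala 2.13, where mapValues returns a view, hence the trailing toMap):

import java.sql.Timestamp

object LastActionPerUser extends App {
  case class UserActions(userId: String, actionId: String, actionTime: Timestamp)
      extends Ordered[UserActions] {
    // Corrected compare: order by actionTime.
    def compare(that: UserActions): Int = this.actionTime.compareTo(that.actionTime)
  }

  val actions = List(
    UserActions("1", "1", new Timestamp(0L)),
    UserActions("1", "2", new Timestamp(1L)),
    UserActions("2", "3", new Timestamp(5L))
  )

  // Last (max) action per user.
  val lastPerUser: Map[String, UserActions] =
    actions.groupBy(_.userId).mapValues(_.max).toMap

  println(lastPerUser)
}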
You want to group values by using userId. You can do it by using groupBy().
actions.groupBy(_.userId)
Then you can take only values from key value pairs.
actions.groupBy(_.userId).values
You used maxBy(_._2), so I think you want the max element of each group. You can do that by mapping over the values, for example taking the max by timestamp:
actions.groupBy(_.userId).values.map(_.maxBy(_.actionTime.getTime))
I would like a function to consume a tuple of seven elements, but the compiler won't let me, giving the error shown below. I failed to find a proper way to do it. Is it even possible without explicitly writing out all the type parameters, like Tuple7[String,String...,String], and is it even a good idea to use Scala like this?
def store(record:Tuple7): Unit = {
}
Error:(25, 20) class Tuple7 takes type parameters
def store(record: Tuple7): Unit = {
^
As stated by Luis, you have to define which type goes in each position of the tuple.
I'd like to add some approaches to express the same behaviour in different ways:
Tuple Syntax
For that you have two choices of syntax:
Tuple3[String, Int, Double]
(String, Int, Double)
Approach using Case Classes for better readability
Long tuples are hard to handle, especially when types are repeated. Scala offers a different approach for handling this: instead of a Tuple7 you can use a case class with seven fields. The gain in this approach is that you can now attach meaningful names to each field, and the type of each position makes more sense when a name is attached to it.
The chance of putting values in the wrong positions is also reduced:
(String, Int, String, Int)
// vs
case class Person(name: String, age: Int, taxNumber: String, numberOfChildren: Int)
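For example, a hedged sketch of what store could look like with a case class instead of a Tuple7 (the field names here are made up):

object CaseClassStoreSketch extends App {
  // Hypothetical record type standing in for Tuple7[String, ..., String].
  case class Record(
    firstName: String,
    lastName: String,
    street: String,
    city: String,
    country: String,
    phone: String,
    email: String
  )

  def store(record: Record): Unit = {
    // Fields are accessed by name instead of by position (_1 ... _7).
    println(s"storing ${record.firstName} ${record.lastName}, ${record.city}")
  }

  store(Record("Ada", "Lovelace", "Main St 1", "London", "UK", "555-0100", "ada@example.com"))
}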
Using Seq with pattern matching
If your intention was to have a sequence of data, a Seq in combination with pattern matching could also be a nice fit:
List("name", 24, "", 5) match {
  case (name: String) :: (age: Int) :: _ :: _ :: Nil => doSomething(name, age)
}
This only works nicely in a quite reduced scope. Normally you would lose a lot of type information, as the List is of type List[Any].
You could do the following:
def store(record: (String, String, String, String, String, String, String)):Unit = {
}
which is equivalent to:
def store(record: Tuple7[String, String, String, String, String, String, String]):Unit = {
}
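Either form can then be called with a plain tuple literal, for example:

store(("one", "two", "three", "four", "five", "six", "seven"))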
You can read more about it in Programming in Scala, 2nd Edition, chapter "Next Steps in Scala", sub-chapter "Step 9. use Tuples".
I have a class called Group
class Group(id: Int, name: String, category: String) {
}
I am trying to convert an Array[Group] into a Map[String, Seq[Group]] with category: String as the key. I want to create an empty Seq[Group] and add the Group if the key does not exist, otherwise update the Seq[Group]. I am not sure how to update the Seq if the key already exists.
groupBy will do it all.
arrayOfGroups.groupBy(_.category)
It's just that the result will be a Map[String, Array[Group]] (because the original container was an array). Array is not exactly a Seq, so if you want one, you may do
arrayOfGroups.groupBy(_.category).mapValues(_.toSeq)
You may replace the toSeq by any more precise transformation.
It would also be possible to do arrayOfGroups.toSeq.groupBy(_.category).
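A minimal end-to-end sketch (using a case class, which is an assumption, so that category is accessible as a field; in Scala 2.13 mapValues returns a view, hence the trailing toMap):

object GroupByCategorySketch extends App {
  case class Group(id: Int, name: String, category: String)

  val arrayOfGroups = Array(
    Group(1, "alpha", "letters"),
    Group(2, "beta", "letters"),
    Group(3, "one", "numbers")
  )

  val byCategory: Map[String, Seq[Group]] =
    arrayOfGroups.groupBy(_.category).mapValues(_.toSeq).toMap

  println(byCategory)
  // letters -> the two "letters" groups, numbers -> the single "numbers" group
}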