I am translating Spark 1.6 RDD code to Spark 2.x Datasets.
The original code was:
val sample_data : Dataset[(Int, Array[Double])]
val samples : Array[Array[Array[Double]]] = sample_data.rdd
.groupBy(x => x._1)
.map(x => {
val (id: Int, points: Iterable[(Int, Array[Double])]) = x
val data1 = points.map(x => x._2).toArray
data1
}).collect()
Calling sample_data.rdd no longer works, so I am trying to do the same operations using Datasets. The new approach uses flatMapGroups:
val sample_data : Dataset[(Int, Array[Double])]
val samples : Array[Array[Array[Double]]] = sample_data
.groupByKey(x => x._1)
.flatMapGroups ( (id: Int, points: Iterable[(Int, Array[Double])]) =>
Iterator(points.map((x:Int, y:Array[Double]) => y)).toList
).collect()
The error given is:
Error:(36, 25) overloaded method value map with alternatives:
  [B, That](f: ((Int, Array[Double])) => B)(implicit bf: scala.collection.generic.CanBuildFrom[Iterable[(Int, Array[Double])],B,That])That
  [B](f: ((Int, Array[Double])) => B)Iterator[B]
cannot be applied to ((Int, Array[Double]) => Array[Double])
    Iterator(points.map((x:Int, y:Array[Double]) => y)).toList
Can you please provide an example of how to use flatMapGroups and how to understand the given error?
points is actually an Iterator, but you are declaring it as an Iterable, so the compiler is telling you to make it an Iterator. The error also reflects that map expects a single function taking the tuple (Int, Array[Double]), not a two-parameter function, which is why _._2 is used below.
This is what you are trying to do:
val samples: Array[Array[Array[Double]]] = sample_data
.groupByKey(_._1)
.flatMapGroups((id: Int, points: Iterator[(Int, Array[Double])]) =>
Iterator(points.map(_._2).toArray)
).collect()
Rewrapping in an Iterator isn't serving any purpose here, so you can just use mapGroups like so:
.mapGroups((_, points) => points.map(_._2).toArray)
However, in both cases there is no Encoder for an Array[Array[_]]. Look here for more detail.
So either implement the implicit Encoder yourself (starting from the existing Encoders), or stick to the RDD interface.
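If you want to stay on the Dataset API, one possible workaround (a sketch, assuming spark.implicits._ is already in scope for sample_data) is to provide a Kryo-based binary Encoder for the nested array type:

import org.apache.spark.sql.{Encoder, Encoders}

// Sketch: a Kryo (binary) encoder so mapGroups is allowed to return Array[Array[Double]]
implicit val nestedArrayEncoder: Encoder[Array[Array[Double]]] =
  Encoders.kryo[Array[Array[Double]]]

val samples: Array[Array[Array[Double]]] = sample_data
  .groupByKey(_._1)
  .mapGroups((_, points) => points.map(_._2).toArray)
  .collect()

The trade-off is that the intermediate values are stored as an opaque binary blob rather than a typed Spark SQL column.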
Related
I have a function in Scala to which I pass arguments, and I use it like this:
val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, f(v))}
My function f is:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
def f(v: Array[String]): Int = {
  val parsedDates = v.map(LocalDate.parse(_, formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
And this is the error I get:
found : Iterable[Array[String]]
required: Array[String]
I already tried using:
val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, for (date <- v) f(date))}
But I get massive errors.
Just to get a better picture, data in concat is:
1974,1974-06-22
1966,1966-07-20
1954,1954-06-19
1994,1994-06-27
1954,1954-06-26
2006,2006-07-04
2010,2010-07-07
1990,1990-06-30
...
It is type RDD[String].
How can I properly iterate over that and get a single Int from that function f?
The RDD types at each step of your pipeline are:
concat.map(_.split(",")) gives an RDD[Array[String]]
for instance Array("1954", "1954-06-19")
concat.map(_.split(",")).keyBy(_(0)) gives RDD[(String, Array[String])]
for instance ("1954", Array("1954", "1954-06-19"))
concat.map(_.split(",")).keyBy(_(0)).groupByKey() gives RDD[(String, Iterable[Array[String]])]
for instance ("1954", Iterable(Array("1954", "1954-06-19"), Array("1954", "1954-06-26")))
Thus when you map at the end, the type of values is Iterable[Array[String]].
Since your input is "1974,1974-06-22", the solution could consist in replacing your keyBy transformation by a map:
input.map(_.split(",")).map(x => x(0) -> x(1)).groupByKey().map{case (k, v) => (k, f(v))}
Indeed, .map(x => x(0) -> x(1)) (instead of .map(x => x(0) -> x), which is what keyBy(_(0)) is syntactic sugar for) makes the value the second element of the split array instead of the whole array, giving RDD[(String, String)] at this second step rather than RDD[(String, Array[String])].
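Note that with this change the grouped values become an Iterable[String] rather than an Array[String], so f has to accept that type (or the values have to be converted). A minimal sketch of the whole pipeline under that assumption:

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.rdd.RDD

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

// f now takes any collection of date strings instead of an Array[String]
def f(v: Iterable[String]): Int = {
  val parsedDates = v.map(LocalDate.parse(_, formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}

// concat: RDD[String] with lines such as "1974,1974-06-22"
val evega: RDD[(String, Int)] = concat
  .map(_.split(","))
  .map(x => x(0) -> x(1)) // RDD[(String, String)]: year -> date string
  .groupByKey()           // RDD[(String, Iterable[String])]
  .map { case (k, v) => (k, f(v)) }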
After grouping my dataset, it looks like this:
(AD_PRES,1)
(AD_VP,2)
(FI_ACCOUNT,5)
(FI_MGR,1)
(IT_PROG,5)
(PU_CLERK,5)
(PU_MAN,1)
(SA_MAN,5)
(ST_CLERK,20)
(ST_MAN,5)
Here I want to sort by key descending and by value ascending, so I tried the lines of code below.
emp_data.map(s => (s.JOB_ID, s.FIRST_NAME.concat(",").concat(s.LAST_NAME))).groupByKey().map({
case (x, y) => (x, y.toList.size)
}).sortBy(s => (s._1, s._2))(Ordering.Tuple2(Ordering.String.reverse, Ordering.Int.reverse))
It causes the exception below.
not enough arguments for expression of type (implicit ord: Ordering[(String, Int)], implicit ctag: scala.reflect.ClassTag[(String, Int)])org.apache.spark.rdd.RDD[(String, Int)]. Unspecified value parameter ctag.
RDD.sortBy takes both ordering and class tags as implicit arguments.
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
You cannot just provide a subset of these and expect things to work. Instead you can provide a block-local implicit ordering:
{
implicit val ord = Ordering.Tuple2[String, Int](Ordering.String.reverse, Ordering.Int.reverse)
emp_data.map(s => (s.JOB_ID, s.FIRST_NAME.concat(",").concat(s.LAST_NAME))).groupByKey().map({
case (x, y) => (x, y.toList.size)
}).sortBy(s => (s._1, s._2))
}
Though you should really use reduceByKey, not groupByKey, in a case like this.
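A sketch of that variant (assuming emp_data is an RDD of records exposing JOB_ID, as in your snippet): counting with reduceByKey avoids materializing the full per-key lists, and the same block-local ordering drives the sort.

{
  implicit val ord: Ordering[(String, Int)] =
    Ordering.Tuple2(Ordering.String.reverse, Ordering.Int.reverse)

  emp_data
    .map(s => (s.JOB_ID, 1))
    .reduceByKey(_ + _)        // (JOB_ID, count) without building the grouped lists
    .sortBy(s => (s._1, s._2))
}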
I am trying to get this variable GroupsByP to have a certain type. GroupsByP is defined from a db connection select/collect statement whose records have 3 fields: 2 strings (p and Id) and an int (Order).
The expected result should be of the form Map[p, Set[(Id, Order)]].
val GroupsByP = db.pLinkGroups.collect()
.groupBy(_.p)
.map(group => group._1 -> (group._2.map(_.Id -> group._2.map(_.Order)).toSet))
My desired type for this variable is
Map[String, Set[(String, Int)]]
but the actual type is Map[String, Set[(String, Array[Int])]].
If I got your question right, this should do it:
val GroupsByP: Map[String, Set[(String, Int)]] = input.collect()
.groupBy(_.p)
.map(group => group._1 -> group._2.map(record => (record.Id, record.Order)).toSet)
You should be mapping each record into a (Id, Order) tuple.
A very similar but perhaps more readable implementation might be:
val GroupsByP: Map[String, Set[(String, Int)]] = input.collect()
.groupBy(_.p)
.mapValues(_.map(record => (record.Id, record.Order)).toSet)
I am trying some basic logic using Scala. I tried the code below but it throws an error.
scala> val data = ("HI",List("HELLO","ARE"))
data: (String, List[String]) = (HI,List(HELLO, ARE))
scala> data.flatmap( elem => elem)
<console>:22: error: value flatmap is not a member of (String, List[String])
data.flatmap( elem => elem)
Expected Output :
(HI,HELLO,ARE)
Could someone help me fix this issue?
You are trying to flatMap over a tuple, which won't work. The following will work:
val data = List(List("HI"),List("HELLO","ARE"))
val a = data.flatMap(x => x)
This is very trivial in Scala:
val data = ("HI",List("HELLO","ARE"))
println( data._1 :: data._2 )
What exact data structure are you working with?
If you are clear about your data structure:
type rec = (String, List[String])
val data : rec = ("HI",List("HELLO","ARE"))
val f = ( v: (String, List[String]) ) => v._1 :: v._2
f(data)
A couple of observations:
Currently there is no flatten method for tuples (unless you use shapeless).
flatMap cannot be directly applied to a list of elements which are a mix of elements and collections.
In your case, you can make element "HI" part of a List:
val data = List(List("HI"), List("HELLO","ARE"))
data.flatMap(identity)
Or, you can define a function to handle your mixed element types accordingly:
val data = List("HI", List("HELLO","ARE"))
def flatten(l: List[Any]): List[Any] = l.flatMap{
case x: List[_] => flatten(x)
case x => List(x)
}
flatten(data)
You are trying to flatMap on a Tuple2, which is not available in the current API.
If you don't want to change your input, you can extract the first value from the Tuple2 and then the values of the second (the list), as below:
val data = ("HI",List("HELLO","ARE"))
val output = (data._1, data._2(0), data._2(1))
println(output)
If that's what you want:
val data = ("HI",List("HELLO,","ARE").mkString(""))
println(data)
>>(HI,HELLO,ARE)
Some context can be found here; the idea is that I have created a graph from tuples collected from a request on a Hive table. Those correspond to trade relations between countries.
Having built the graph this way, the vertices are not labelled. I want to study the distribution of degrees and get the most connected countries' names. I tried two options:
First: I tried to map the vertex indices to the vertices' string names with the function idMapbis, inside the function that collects and prints the ten most connected degrees.
Second: I tried to add labels to the vertices of the graph itself.
In both cases I get the following error: Task not serializable.
Full code:
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val sqlContext= new org.apache.spark.sql.hive.HiveContext(sc)
val data = sqlContext.sql("select year, trade_flow, reporter_iso, partner_iso, sum(trade_value_us) from comtrade.annual_hs where length(commodity_code)='2' and not partner_iso='WLD' group by year, trade_flow, reporter_iso, partner_iso").collect()
val data_2010 = data.filter(line => line(0)==2010)
val couples = data_2010.map(line=>(line(2),line(3))) //pays->pays
couples look like this: Array[(Any, Any)] = Array((MWI,MOZ), (WSM,AUS), (MDA,CRI), (KNA,HTI), (PER,ERI), (SWE,CUB),...
val idMap = sc.broadcast(couples
.flatMap{case (x: String, y: String) => Seq(x, y)}
.distinct
.zipWithIndex
.map{case (k, v) => (k, v.toLong)}
.toMap)
val edges: RDD[(VertexId, VertexId)] = sc.parallelize(couples
.map{case (x: String, y: String) => (idMap.value(x), idMap.value(y))})
val graph = Graph.fromEdgeTuples(edges, 1)
Built this way, vertices look like (68,1), for example.
val degrees: VertexRDD[Int] = graph.degrees.cache()
//Most connected vertices
def topNamesAndDegrees(degrees: VertexRDD[Int], graph: Graph[Int, Int]): Array[(Int, Int)] = {
val namesAndDegrees = degrees.innerJoin(graph.vertices) {
(id, degree, k) => (id.toInt, degree)}
val ord = Ordering.by[(Int, Int), Int](_._2)
namesAndDegrees.map(_._2).top(10)(ord)}
topNamesAndDegrees(degrees, graph).foreach(println)
We get: (79,1016), (64,912), (55,889)...
First option to retrieve the names:
val idMapbis = sc.parallelize(couples
.flatMap{case (x: String, y: String) => Seq(x, y)}
.distinct
.zipWithIndex
.map{case (k, v) => (v,k)}
.toMap)
def topNamesAndDegrees(degrees: VertexRDD[Int], graph: Graph[Int, Int]): Array[(String, Int)] = {
val namesAndDegrees = degrees.innerJoin(graph.vertices) {
(id, degree, name) => (idMapbis.value(id.toInt), degree)}
val ord = Ordering.by[(String, Int), Int](_._2)
namesAndDegrees.map(_._2).top(10)(ord)}
topNamesAndDegrees(degrees, graph).foreach(println)
The task is not serializable, but the function idMapbis works, since there is no error with idMapbis.value(graph.vertices.take(1)(0)._1.toInt).
Option 2:
graph.vertices.map{case (k, v) => (k,idMapbis.value(k.toInt))}
The task is not serializable again (for context, here is how topNamesAndDegrees is modified to obtain the names of the most connected vertices with this option):
def topNamesAndDegrees(degrees: VertexRDD[Int], graph: Graph[Int, Int]): Array[(String, Int)] = {
val namesAndDegrees = degrees.innerJoin(graph.vertices) {
(id, degree, name) => (name, degree)}
val ord = Ordering.by[(String, Int), Int](_._2)
namesAndDegrees.map(_._2).top(10)(ord)}
topNamesAndDegrees(degrees, graph).foreach(println)
I am interested in understanding how to improve one of these options, or maybe both if someone sees how.
The problem with your attempts is that idMapbis is an RDD. Since we already know your data fits into memory, you can simply use a broadcast variable as before:
val idMapRev = sc.broadcast(idMap.value.map{case (k, v) => (v, k)}.toMap)
graph.mapVertices{case (id, _) => idMapRev.value(id)}
Alternatively you could use the correct labels from the beginning:
val countries: RDD[(VertexId, String)] = sc
.parallelize(idMap.value.map(_.swap).toSeq)
val relationships: RDD[Edge[Int]] = sc.parallelize(couples
.map{case (x: String, y: String) => Edge(idMap.value(x), idMap.value(y), 1)}
)
val graph = Graph(countries, relationships)
The second approach has one important advantage: if the graph is large, you can relatively easily replace broadcast variables with joins.
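For example, a sketch of the join-based variant, reusing countries and graph from the second approach (topByDegree is just a hypothetical name):

// Join degrees with country names, then take the ten most connected countries
val topByDegree: Array[(String, Int)] = graph.degrees
  .join(countries)                                  // RDD[(VertexId, (Int, String))]
  .map { case (_, (degree, name)) => (name, degree) }
  .top(10)(Ordering.by[(String, Int), Int](_._2))

topByDegree.foreach(println)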