I need to "extract" some data contained in an Iterable[MyObject] (it was an RDD[MyObject] before a groupBy).
My initial RDD[MyObject]:
|-----------|---------|----------|
| startCity | endCity | Customer |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 1 | 1 |
| | |----|-----|
| | | 2 | 1 |
| | |----|-----|
| | | 3 | 50 |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 5 | 40 |
| | |----|-----|
| | | 6 | 41 |
| | |----|-----|
| | | 7 | 2 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 9 | 15 |
| | |----|-----|
| | | 10| 16 |
| | |----|-----|
| | | 11| 46 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 13| 7 |
| | |----|-----|
| | | 14| 9 |
| | |----|-----|
| | | 15| 60 |
|-----------|---------|----|-----|
| Barcelona | London | ID | Age |
| | |----|-----|
| | | 17| 66 |
| | |----|-----|
| | | 18| 53 |
| | |----|-----|
| | | 19| 11 |
|-----------|---------|----|-----|
I need to count them by age range, grouped by startCity - endCity.
The final result should be:
|-----------|---------|-------------|
| startCity | endCity | Customer |
|-----------|---------|-------------|
| Paris | London | Range| Count|
| | |------|------|
| | |0-2 | 3 |
| | |------|------|
| | |3-18 | 0 |
| | |------|------|
| | |19-99 | 3 |
|-----------|---------|-------------|
| New-York | Paris | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 4 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
| Barcelona | London | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 1 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
At the moment I'm doing this by counting the same data 3 times (first with the 0-2 range, then 10-20, then 21-99).
Like:
val ite: Iterable[MyObject] = ...
ite.count(x => x.age match {
  case Some(age) => age >= 0 && age <= 2
  case None => false
})
It works and gives me an Int, but it's not efficient at all since I have to scan the same data once per range. What's the best way to do this, please?
Thanks
EDIT: The Customer object is a case class.
def computeRange(age: Int) =
  if (age <= 2)
    "0-2"
  else if (age <= 10)
    "2-10"
  // etc, you get the idea
Then, with an RDD of case class MyObject(id: String, age: Int):
rdd
  .map(x => computeRange(x.age) -> 1)
  .reduceByKey(_ + _)
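For a concrete picture, here is a minimal end-to-end sketch of this approach. The SparkContext named sc, the sample data, and the completed range function (filled in with the 0-2 / 3-18 / 19-99 buckets from the question) are all assumptions for illustration:
import org.apache.spark.rdd.RDD

case class MyObject(id: String, age: Int)

// One possible completion of computeRange, using the question's buckets.
def computeRange(age: Int): String =
  if (age <= 2) "0-2"
  else if (age <= 18) "3-18"
  else "19-99"

// `sc` is assumed to be an existing SparkContext.
val rdd: RDD[MyObject] =
  sc.parallelize(Seq(MyObject("1", 1), MyObject("2", 1), MyObject("3", 50)))

val counts = rdd
  .map(x => computeRange(x.age) -> 1)
  .reduceByKey(_ + _)

counts.collect().foreach(println) // prints (0-2,2) and (19-99,1)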
Edit:
If you need to group by some columns, you can do it this way, provided that you have an RDD[(SomeColumns, Iterable[MyObject])]. The following lines would give you a map that associates each "range" with its number of occurrences.
def computeMapOfOccurances(list: Iterable[MyObject]): Map[String, Int] =
  list
    .map(_.age)
    .map(computeRange)
    .groupBy(x => x)
    .mapValues(_.size)

val result1 = rdd
  .mapValues(computeMapOfOccurances(_))
And if you need to flatten your data, you can write:
val result2 = result1
  .flatMapValues(_.toSeq)
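One caveat with this approach: groupBy only produces keys for ranges that actually occur, so the zero rows from the desired output (such as 3-18 -> 0 for Paris - London) would be missing. A hedged sketch of padding them, using the three range labels from the question (computePaddedCounts is a hypothetical name):
// Hypothetical helper: same counting as computeMapOfOccurances, but padded so
// that ranges with zero occurrences still appear in the result.
val allRanges = Seq("0-2", "3-18", "19-99")

def computePaddedCounts(list: Iterable[MyObject]): Map[String, Int] = {
  val counted = list
    .map(x => computeRange(x.age))
    .groupBy(identity)
    .mapValues(_.size)
  // Every range appears in the result, even when its count is 0.
  allRanges.map(r => r -> counted.getOrElse(r, 0)).toMap
}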
Assuming that you have Customer as a case class, as below:
case class Customer(ID: Int, Age: Int)
And your RDD[MyObject] is an RDD of the following case class:
case class MyObject(startCity: String, endCity: String, customer: List[Customer])
So, using the above case classes, your input (shown in table format in the question) would be as below:
MyObject(Paris,London,List(Customer(1,1), Customer(2,1), Customer(3,50)))
MyObject(Paris,London,List(Customer(5,40), Customer(6,41), Customer(7,2)))
MyObject(New-York,Paris,List(Customer(9,15), Customer(10,16), Customer(11,46)))
MyObject(New-York,Paris,List(Customer(13,7), Customer(14,9), Customer(15,60)))
MyObject(Barcelona,London,List(Customer(17,66), Customer(18,53), Customer(19,11)))
And you've also mentioned that after grouping you have an Iterable[MyObject], which corresponds to the step below:
val groupedRDD = rdd.groupBy(myobject => (myobject.startCity, myobject.endCity))
// groupedRDD: org.apache.spark.rdd.RDD[((String, String), Iterable[MyObject])]
So the next step is to use mapValues to iterate through the Iterable[MyObject], count the ages belonging to each range, and finally convert to the output you require, as below:
val finalResult = groupedRDD.mapValues(x => {
  val rangeAge = Map("0-2" -> 0, "3-18" -> 0, "19-99" -> 0)
  val list = x.flatMap(y => y.customer.map(z => z.Age)).toList
  updateCounts(list, rangeAge).map(pair => CustomerOut(pair._1, pair._2)).toList
})
where updateCounts is a recursive function
def updateCounts(ageList: List[Int], map: Map[String, Int]): Map[String, Int] = ageList match {
  case head :: tail =>
    if (head >= 0 && head < 3)
      updateCounts(tail, map ++ Map("0-2" -> (map("0-2") + 1)))
    else if (head >= 3 && head < 19)
      updateCounts(tail, map ++ Map("3-18" -> (map("3-18") + 1)))
    else
      updateCounts(tail, map ++ Map("19-99" -> (map("19-99") + 1)))
  case Nil => map
}
and CustomerOut is another case class
case class CustomerOut(Range: String, Count: Int)
so the finalResult is as below
((Barcelona,London),List(CustomerOut(0-2,0), CustomerOut(3-18,1), CustomerOut(19-99,2)))
((New-York,Paris),List(CustomerOut(0-2,0), CustomerOut(3-18,4), CustomerOut(19-99,2)))
((Paris,London),List(CustomerOut(0-2,3), CustomerOut(3-18,0), CustomerOut(19-99,3)))
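Two asides on this solution, both hedged sketches rather than drop-in replacements. First, updateCounts can be written without explicit recursion; a foldLeft version with the same bucketing logic:
// Equivalent fold-based counting; m(key) is safe because rangeAge pre-seeds all keys.
def updateCountsFold(ageList: List[Int], map: Map[String, Int]): Map[String, Int] =
  ageList.foldLeft(map) { (m, age) =>
    val key =
      if (age >= 0 && age < 3) "0-2"
      else if (age >= 3 && age < 19) "3-18"
      else "19-99"
    m + (key -> (m(key) + 1))
  }

Second, groupBy shuffles every MyObject across the cluster; if only the per-route counts are needed, a keyed reduce avoids materializing the Iterable[MyObject] groups entirely (rangeOf is an assumed helper using the question's buckets):
// Count ((startCity, endCity), range) triples directly; no groups are built.
def rangeOf(age: Int): String =
  if (age <= 2) "0-2"
  else if (age <= 18) "3-18"
  else "19-99"

val counts = rdd
  .flatMap(o => o.customer.map(c => ((o.startCity, o.endCity, rangeOf(c.Age)), 1)))
  .reduceByKey(_ + _)
// counts: RDD[((String, String, String), Int)], one entry per route and range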
Related
I have a Spark DataFrame containing ranges of numbers (a start column and an end column) and a column containing the type of each range.
I want to create a new DataFrame with two columns: the first lists every number covered by a range (stepping by one), and the second gives that range's type.
To explain more, this is the input DataFrame:
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result:
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?
Try this.
import org.apache.spark.sql.functions.{explode, sequence}
import spark.implicits._

val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions.{explode, udf}

val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}

df.withColumn("nbr", explode(sequence($"start", $"end")))
  .drop("start", "end")
  .show(false)
I have two dataframes, one with my data and another one to compare against. What I want to do is check whether a value falls within a range defined by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want inequality in one of the comparisons:
val result = df_player.join(
df_check,
df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
"left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
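Since Df_Check is tiny, it may also be worth broadcasting it so the non-equi join doesn't degenerate into an expensive shuffle; a sketch of the same join with a broadcast hint:
import org.apache.spark.sql.functions.broadcast

// Non-equi joins can expand into near-cartesian comparisons; broadcasting the
// small lookup table keeps each Power row's range check local to its partition.
val resultBroadcast = df_player.join(
  broadcast(df_check),
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")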
I have a dataframe with different columns; what I am trying to do is compute the mean of these columns, ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output has to be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)) { (x, y) => x + y } / a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
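An alternative sketch that stays row-wise and skips the explode + join round trip: sum the non-null columns and divide by the count of non-null columns (column names as in the question; resultRowwise is an illustrative name):
import org.apache.spark.sql.functions._

val cols = Seq("Power", "Vision", "KXD").map(col)

// Treat nulls as 0 in the sum, and count only the non-null columns.
val sumExpr = cols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val cntExpr = cols.map(c => when(c.isNotNull, 1).otherwise(0)).reduce(_ + _)

// If every column is null, the division yields null rather than failing.
val resultRowwise = df.withColumn("MEAN", sumExpr / cntExpr)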
I have a question regarding how to make a calculated pivot table from several query results in PostgreSQL. I've managed to produce three query results but have no idea how to combine and calculate all the data into a single table. I've tried to google it, but found that most questions are about making a pivot table from a single table, which I'm able to do using sum, case, and group by. Well, here's the simplified version of my query results.
Result from query 1, which contains gross values:
| city | code | gross |
|-------|------|--------|
| city1 | 21 | 194793 |
| city1 | 25 | 139241 |
| city1 | 28 | 231365 |
| city2 | 21 | 282025 |
| city2 | 25 | 334458 |
| city2 | 28 | 410852 |
| city3 | 21 | 109237 |
Result from query 2, which contains positive adjustments:
| city | code | adj_pos |
|-------|------|---------|
| city1 | 21 | 16259 |
| city1 | 25 | 13634 |
| city1 | 28 | 45854 |
| city2 | 25 | 18060 |
| city2 | 28 | 18220 |
Result from query 3, which contains negative adjustments:
| city | code | adj_neg |
|-------|------|---------|
| city1 | 25 | 23364 |
| city2 | 21 | 27478 |
| city2 | 25 | 23474 |
And what I want to do is create something like this:
| city | 21_gross | 25_gross | 28_gross | 21_pos | 25_pos | 28_pos | 21_neg | 25_neg | 28_neg |
|-------|----------|----------|----------|--------|--------|--------|--------|--------|--------|
| city1 | 194793 | 139241 | 231365 | 16259 | 13634 | 45854 | | 23364 | |
| city2 | 282025 | 334458 | 410852 | | 18060 | 18220 | 27478 | 23474 | |
| city3 | 109237 | | | | | | | | |
or perhaps the final calculation, which comes from gross + positive adjustment - negative adjustment for each city and each code, like this:
| city | 21_nett | 25_nett | 28_nett |
|-------|---------|---------|---------|
| city1 | 211052 | 129511 | 277219 |
| city2 | 254547 | 329044 | 429072 |
| city3 | 109237 | 0 | 0 |
Any suggestions will be appreciated. Thank you!
I think the best you can achieve is to get the pivoting part as JSON - http://sqlfiddle.com/#!17/b7d64/23:
select
city,
json_object_agg(
code,
coalesce(gross,0) + coalesce(adj_pos,0) - coalesce(adj_neg,0)
) as js
from q1
left join q2 using (city,code)
left join q3 using (city,code)
group by city
What is a Comonad? If possible, describe it in Scala syntax.
I found the scalaz implementation, but it's not clear where it can be useful.
Well, monads allow you to put values into them and to chain computations that go from a plain value to a monadic one (flatMap takes an A => M[B]). Comonads allow you to extract values from them and to chain computations that go from a comonadic value to a plain one (cobind takes a W[A] => B).
The natural intuition is that they'll usually appear where you have a W[A] and want to extract an A.
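To make the duality concrete, here is a minimal Comonad type class sketch; method names follow the scalaz 6 style used in the code below (copure is extract, cojoin is duplicate), and the default cobind is one standard way to derive it:
trait Comonad[W[_]] {
  def copure[A](w: W[A]): A                   // dual of point/pure: take a value out
  def cojoin[A](w: W[A]): W[W[A]]             // dual of join: duplicate the context
  def fmap[A, B](w: W[A], f: A => B): W[B]    // ordinary functor map
  // cobind is the dual of flatMap: it consumes a W[A] => B instead of an A => M[B].
  def cobind[A, B](w: W[A])(f: W[A] => B): W[B] = fmap(cojoin(w), f)
}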
See this very interesting post, which touches on comonads a bit casually but, to me at least, makes them very clear.
What follows is a literal translation of code from this blog post.
// U is the "universe" from the blog post: an infinite zipper with a focused center cell.
case class U[X](left: Stream[X], center: X, right: Stream[X]) {
  def shiftRight = this match {
    case U(a, b, c #:: cs) => U(b #:: a, c, cs)
  }
  def shiftLeft = this match {
    case U(a #:: as, b, c) => U(as, a, b #:: c)
  }
}
// Not necessary, as Comonad also has fmap.
/*
implicit object uFunctor extends Functor[U] {
def fmap[A, B](x: U[A], f: A => B): U[B] = U(x.left.map(f), f(x.center), x.right.map(f))
}
*/
implicit object uComonad extends Comonad[U] {
def copure[A](u: U[A]): A = u.center
def cojoin[A](a: U[A]): U[U[A]] = U(Stream.iterate(a)(_.shiftLeft).tail, a, Stream.iterate(a)(_.shiftRight).tail)
def fmap[A, B](x: U[A], f: A => B): U[B] = U(x.left.map(f), x.center |> f, x.right.map(f))
}
def rule(u: U[Boolean]) = u match {
case U(a #:: _, b, c #:: _) => !(a && b && !c || (a == b))
}
def shift[A](i: Int, u: U[A]) = {
Stream.iterate(u)(x => if (i < 0) x.shiftLeft else x.shiftRight).apply(i.abs)
}
def half[A](u: U[A]) = u match {
case U(_, b, c) => Stream(b) ++ c
}
def toList[A](i: Int, j: Int, u: U[A]) = half(shift(i, u)).take(j - i)
val u = U(Stream continually false, true, Stream continually false)
// `=>>` is scalaz's cobind: it applies `rule` at every position of the universe.
val s = Stream.iterate(u)(_ =>> rule)
val s0 = s.map(r => toList(-20, 20, r).map(x => if (x) '#' else ' '))
val s1 = s.map(r => toList(-20, 20, r).map(x => if (x) '#' else ' ').mkString("|")).take(20).force.mkString("\n")
println(s1)
Output:
| | | | | | | | | | | | | | | | | | | |#| | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#| | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#| |#| | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#|#|#| | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#| | | |#| | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#| | |#|#| | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#| |#| |#| |#| | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#|#|#|#|#|#|#| | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#| | | | | | | |#| | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#| | | | | | |#|#| | | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#| |#| | | | | |#| |#| | | | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#|#|#| | | | |#|#|#|#| | | | | | | |
| | | | | | | | | | | | | | | | | | | |#| | | |#| | | |#| | | |#| | | | | | |
| | | | | | | | | | | | | | | | | | | |#|#| | |#|#| | |#|#| | |#|#| | | | | |
| | | | | | | | | | | | | | | | | | | |#| |#| |#| |#| |#| |#| |#| |#| | | | |
| | | | | | | | | | | | | | | | | | | |#|#|#|#|#|#|#|#|#|#|#|#|#|#|#|#| | | |
| | | | | | | | | | | | | | | | | | | |#| | | | | | | | | | | | | | | |#| | |
| | | | | | | | | | | | | | | | | | | |#|#| | | | | | | | | | | | | | |#|#| |
| | | | | | | | | | | | | | | | | | | |#| |#| | | | | | | | | | | | | |#| |#|
| | | | | | | | | | | | | | | | | | | |#|#|#|#| | | | | | | | | | | | |#|#|#|#
The scalaz library provides a ComonadStore, which extends Comonad. It is defined like this:
trait ComonadStore[F[_], S] extends Comonad[F] { self =>
def pos[A](w: F[A]): S
def peek[A](s: S, w: F[A]): A
def peeks[A](s: S => S, w: F[A]): A =
peek(s(pos(w)), w)
def seek[A](s: S, w: F[A]): F[A] =
peek(s, cojoin(w))
def seeks[A](s: S => S, w: F[A]): F[A] =
peeks(s, cojoin(w))
def experiment[G[_], A](s: S => G[S], w: F[A])(implicit FG: Functor[G]): G[A] =
FG.map(s(pos(w)))(peek(_, w))
}
And Store (which is analogous to (S => A, S)) has a Comonad instance. You can look at this question, which explains it more specifically.
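For intuition, a minimal Store sketch along those lines, assuming nothing beyond the (S => A, S) pairing mentioned above (the names here are illustrative, not scalaz's):
// Store pairs an accessor S => A with a current position S.
case class Store[S, A](peek: S => A, pos: S) {
  def extract: A = peek(pos)                       // copure: read at the position
  def duplicate: Store[S, Store[S, A]] =           // cojoin: a store of stores
    Store(s => Store(peek, s), pos)
  def map[B](f: A => B): Store[S, B] = Store(peek andThen f, pos)
  def seek(s: S): Store[S, A] = Store(peek, s)     // move the focus
}

// Usage: focus on index 1 of a vector, then peek around it.
val v = Vector(10, 20, 30)
val st = Store[Int, Int](v.apply, 1)
st.extract         // 20
st.seek(2).extract // 30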
You also have the Coreader and Cowriter Comonads, which are the duals of the Reader and Writer Monads; here is an excellent blog post that discusses them in Scala.