I have a simple dataframe as an example:
val someDF = Seq(
  (1, "A"),
  (2, "A"),
  (3, "A"),
  (4, "B"),
  (5, "B"),
  (6, "A"),
  (7, "A"),
  (8, "A")
).toDF("t", "state")
// this part is half pseudocode
someDF.aggregate((acc, cur) => {
  if (acc.last.state != cur.state) {
    acc.add(cur)
  }
}, List()).show(truncate = false)
"t" column represents points in time and "state" column represents the state at that point in time.
What I wish to find is the first time where each change occurs plus the first row, as in:
(1, "A")
(4, "B")
(6, "A")
I looked at SQL solutions too, but they involve complex self-joins and window functions that I don't completely understand; still, an SQL solution is fine as well.
There are numerous functions in Spark (fold, aggregate, reduce, ...) that I feel could do this, but I couldn't grasp the differences between them since I'm new to Spark concepts like partitioning, and it's tricky to tell whether partitioning could affect the results.
You can use the window function lag to compare with the previous row, and row_number to check whether it's the first row:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val result = someDF.withColumn(
  "change",
  lag("state", 1).over(Window.orderBy("t")) =!= col("state") ||
    row_number().over(Window.orderBy("t")) === 1
).filter("change").drop("change")

result.show
+---+-----+
| t|state|
+---+-----+
| 1| A|
| 4| B|
| 6| A|
+---+-----+
For an SQL solution:
someDF.createOrReplaceTempView("mytable")

val result = spark.sql("""
  select t, state
  from (
    select
      t, state,
      lag(state) over (order by t) != state or row_number() over (order by t) = 1 as change
    from mytable
  )
  where change
""")
I have two dataframes. I want to replace the values in col1 of df1 that are null with the values from col1 of df2. Please keep in mind that df1 can have > 10^6 rows (as can df2), and that df1 has some additional columns that differ from the additional columns of df2.
I know how to do a join, but I do not know how to do this kind of conditional join in Spark with Scala.
df1
name | col1 | col2 | col3
----------------------------
foo | 0.1 | ...
bar | null |
hello | 0.6 |
foobar | null |
df2
name | col1 | col7
--------------------
lorem | 0.1 |
bar | 0.52 |
foobar | 0.47 |
EDIT:
This is my current solution:
df1.select("name", "col2", "col3").join(df2, (df1("name") === df2("name")), "left").select(df1("name"), col("col1"))
EDIT2:
val df1 = Seq(
  ("foo", Seq(0.1), 10, "a"),
  ("bar", Seq(), 20, "b"),
  ("hello", Seq(0.1), 30, "c"),
  ("foobar", Seq(), 40, "d")
).toDF("name", "col1", "col2", "col3")

val df2 = Seq(
  ("lorem", Seq(0.1), "x"),
  ("bar", Seq(0.52), "y"),
  ("foobar", Seq(0.47), "z")
).toDF("name", "col1", "col7")

display(df1.
  join(df2, Seq("name"), "left_outer").
  select(df1("name"), coalesce(df1("col1"), df2("col1")).as("col1")))
returns:
name | col1
bar | []
foo | [0.1]
foobar | []
hello | [0.1]
Consider using coalesce on col1 after performing the left join. To handle both nulls and empty arrays (in the case of ArrayType), as per the revised requirement in the comments section, a when/otherwise clause is used, as shown below:
val df1 = Seq(
  ("foo", Some(Seq(0.1)), 10, "a"),
  ("bar", None, 20, "b"),
  ("hello", Some(Seq(0.1)), 30, "c"),
  ("foobar", Some(Seq()), 40, "d")
).toDF("name", "col1", "col2", "col3")

val df2 = Seq(
  ("lorem", Seq(0.1), "x"),
  ("bar", Seq(0.52), "y"),
  ("foobar", Seq(0.47), "z")
).toDF("name", "col1", "col7")

df1.
  join(df2, Seq("name"), "left_outer").
  select(
    df1("name"),
    coalesce(
      when(lit(df1.schema("col1").dataType.typeName) === "array" && size(df1("col1")) === 0, df2("col1")).otherwise(df1("col1")),
      df2("col1")
    ).as("col1")
  ).
  show
/*
+------+------+
| name| col1|
+------+------+
| foo| [0.1]|
| bar|[0.52]|
| hello| [0.1]|
|foobar|[0.47]|
+------+------+
*/
UPDATE:
It appears that Spark, surprisingly, does not short-circuit conditionA && conditionB the way most other languages do: even when conditionA is false, conditionB may still be evaluated, and replacing && with nested when/otherwise does not resolve the issue either. It might be due to limitations in how the internally translated case/when/else SQL is executed.
As a result, the above when/otherwise data-type check via the array-specific function size() fails when col1 is not an ArrayType. Given that, I would forgo the dynamic column type check and run different queries depending on whether col1 is an ArrayType or not, assuming that's known upfront:
df1.
  join(df2, Seq("name"), "left_outer").
  select(
    df1("name"),
    coalesce(
      when(size(df1("col1")) === 0, df2("col1")).otherwise(df1("col1")), // <-- if col1 is an array
      // df1("col1"), // <-- if col1 is not an array
      df2("col1")
    ).as("col1")
  ).
  show
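If the type of col1 is only known at runtime, one way to follow that recommendation is to branch on the schema on the driver side, outside the Column expression. A minimal sketch, assuming the same df1/df2 as above:
import org.apache.spark.sql.functions.{coalesce, size, when}
import org.apache.spark.sql.types.ArrayType

// decide on the driver which variant applies, based on df1's schema,
// so the Column expression never mixes array and non-array logic
val col1IsArray = df1.schema("col1").dataType.isInstanceOf[ArrayType]

val fixedCol1 =
  if (col1IsArray)
    coalesce(when(size(df1("col1")) === 0, df2("col1")).otherwise(df1("col1")), df2("col1"))
  else
    coalesce(df1("col1"), df2("col1"))

df1.
  join(df2, Seq("name"), "left_outer").
  select(df1("name"), fixedCol1.as("col1")).
  show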
I have two DataFrames with the same schema (but 100+ columns):
Small size: 1000 rows
Bigger size: 90000 rows
How can I check that every row in the first exists in the second? What is the "Spark way" of doing this? Should I use map and then deal with it at the Row level, or should I use a join and then some sort of comparison with the smaller DataFrame?
You can use except, which returns all rows of the first dataset that are not present in the second:
smaller.except(bigger).isEmpty
You can inner join the DataFrames and count to check if there is a difference.
def isIncluded(smallDf: DataFrame, biggerDf: DataFrame): Boolean = {
  val keys = smallDf.columns.toSeq
  val joinedDf = smallDf.join(biggerDf, keys) // you might want to broadcast smallDf for performance
  joinedDf.count == smallDf.count
}
However, I think the except method is clearer. I'm not sure about the performance (it might just be a join underneath).
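If you do want to spell it out as a join, a left anti-join on all columns expresses the same containment check (a sketch only; this is not necessarily what except compiles to, and unlike except it treats nulls as non-matching):
// rows of `smaller` that have no exact match in `bigger`; empty means fully included
val missing = smaller.join(bigger, smaller.columns.toSeq, "left_anti")
val isIncluded = missing.count == 0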
I would do it with a join, probably.
A left outer join from the small data frame to the large one keeps every row of the small data frame; the rows that are missing from the large data frame come back with nulls on the large side, so you can filter on those and then just check whether the result is empty or not.
Code:
val seq1 = Seq(
  ("A", "abc", 0.1, 0.0, 0),
  ("B", "def", 0.15, 0.5, 0),
  ("C", "ghi", 0.2, 0.2, 1),
  ("D", "jkl", 1.1, 0.1, 0),
  ("E", "mno", 0.1, 0.1, 0)
)

val seq2 = Seq(
  ("A", "abc", "a", "b", "?"),
  ("C", "ghi", "a", "c", "?")
)

val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")

df2.join(df1, df1("cA") === df2("cA"), "leftOuter").show
Output:
+---+---+---+---+---+---+---+---+---+---+
| cA| cB| cH| cI| cJ| cA| cB| cC| cD| cE|
+---+---+---+---+---+---+---+---+---+---+
| C|ghi| a| c| ?| C|ghi|0.2|0.2| 1|
| A|abc| a| b| ?| A|abc|0.1|0.0| 0|
+---+---+---+---+---+---+---+---+---+---+
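To finish the check described above, keep only the df2 rows that found no match and test whether that set is empty. A sketch using the same df1/df2 (joining on cA only, as in the example):
// df2 rows with no matching cA in df1 come back with nulls on the df1 side
val missing = df2.join(df1, df1("cA") === df2("cA"), "leftOuter")
  .filter(df1("cA").isNull)
val allIncluded = missing.count == 0 // true if every df2 row exists in df1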
I want to convert this basic SQL query to Spark:
select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade
I have tried using window functions in Spark like this:
val windowSpec = Window.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df1.select(
  $"Arrest"
).groupBy($"Arrest").agg(sum(count("*")) over windowSpec, count("*")).show()
+------+-------------------------------------------------------------------------------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|count(1)|
+------+-------------------------------------------------------------------------------+--------+
|  true|                                                                         665517|  184964|
| false|                                                                         665517|  480553|
+------+-------------------------------------------------------------------------------+--------+
But when I try dividing by count(*) it throws an error:
df1.select(
  $"Arrest"
).groupBy($"Arrest").agg(count("*") / sum(count("*")) over windowSpec, count("*")).show()
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
My question is: in the first query I'm already using count() inside sum() and don't get any error about using an aggregate function inside another aggregate function, so why do I get that error in the second one?
An example:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
  ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")

val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  .withColumn("fraction", col("sum") / sum("sum").over())

df2.show
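For reference, the sums and fractions expected with this sample data (Val1 totals 100; row order of show may vary):
/*
 c1=A sum=9  fraction=0.09
 c1=B sum=10 fraction=0.1
 c1=C sum=1  fraction=0.01
 c1=D sum=50 fraction=0.5
 c1=E sum=30 fraction=0.3
*/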
You will need to tailor this to your own situation, e.g. count instead of sum, as follows:
val df2 = df
  .groupBy("c1")
  .agg(count("*"))
  .withColumn("fraction", col("count(1)") / sum("count(1)").over())
returning:
+---+--------+-------------------+
| c1|count(1)| fraction|
+---+--------+-------------------+
| E| 1|0.16666666666666666|
| B| 1|0.16666666666666666|
| D| 1|0.16666666666666666|
| C| 1|0.16666666666666666|
| A| 2| 0.3333333333333333|
+---+--------+-------------------+
You can multiply by 100 to get a percentage. I note that the alias did not seem to work the same way as it did for the sum, so I worked around that and left the comparison above. Again, you will need to tailor this to your specifics; this is part of my general modules for research and such.
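For example, the x 100 variant could look like this (the cnt alias is my own naming here, and aliasing behaviour may vary, so treat it as a sketch):
val df3 = df
  .groupBy("c1")
  .agg(count("*").as("cnt"))
  .withColumn("percentage", col("cnt") / sum("cnt").over() * 100)

df3.show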
I have an RDD of (key, value) pairs that I transformed into an RDD of (key, List(value1, value2, value3)) as follows.
val rddInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10)))
val rddReduced = rddInit.groupByKey.mapValues(_.toList)
rddReduced.take(3).foreach(println)
This code gives me the following RDD:
(1,List(2, 3)) (2,List(5, 7)) (3,List(10))
But now I would like to go back to rddInit from the RDD I just computed (the rddReduced RDD).
My first guess is to compute some kind of cross product between the key and each element of the List, like this:
import scala.collection.mutable.ListBuffer

rddReduced.map {
  case (x, y) =>
    val myList: ListBuffer[(Int, Int)] = ListBuffer()
    for (element <- y) {
      myList += ((x, element))
    }
    myList.toList
}.flatMap(x => x).take(5).foreach(println)
With this code, I get the initial RDD as a result. But I don't think using a ListBuffer inside a Spark job is good practice. Is there any other way to solve this problem?
I'm surprised no one has offered a solution with Scala's for-comprehension (that gets "desugared" to flatMap and map at compile time).
I don't use this syntax very often, but when I do...I find it quite entertaining. Some people prefer for-comprehension over a series of flatMap and map, esp. for more complex transformations.
// that's what you ended up with after `groupByKey.mapValues`
val rddReduced: RDD[(Int, List[Int])] = ...

val r = for {
  (k, values) <- rddReduced
  v <- values
} yield (k, v)
scala> :type r
org.apache.spark.rdd.RDD[(Int, Int)]
scala> r.foreach(println)
(3,10)
(2,5)
(2,7)
(1,2)
(1,3)
// even nicer to our eyes
scala> r.toDF("key", "value").show
+---+-----+
|key|value|
+---+-----+
| 1| 2|
| 1| 3|
| 2| 5|
| 2| 7|
| 3| 10|
+---+-----+
After all, that's why we enjoy the flexibility of Scala, isn't it?
It's obviously not good practice to use that kind of operation.
From what I learned in a Spark Summit course, you should use DataFrames and Datasets as much as possible; with them you benefit from a lot of optimizations from the Spark engine.
What you want to do is called explode, and it's performed by applying the explode method from the sql.functions package.
The solution would be something like this:
import spark.implicits._
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.collect_list
val dfInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10))).toDF("x", "y")
val dfReduced = dfInit.groupBy("x").agg(collect_list("y") as "y")
val dfResult = dfReduced.withColumn("y", explode($"y"))
dfResult will contain the same data as dfInit.
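A quick way to convince yourself of that is to diff the two DataFrames (a sketch; except also removes duplicates, which doesn't matter for this sample data):
// both counts are 0 when the two DataFrames hold the same rows
println(dfResult.except(dfInit).count)
println(dfInit.except(dfResult).count)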
Here's one way to restore the grouped RDD back to the original:
val rddRestored = rddReduced.flatMap {
  case (k, v) => v.map((k, _))
}
rddRestored.collect.foreach(println)
(1,2)
(1,3)
(2,5)
(2,7)
(3,10)
According to your question, I think this is what you want to do:
rddReduced.map { case (x, y) => y.map((x, _)) }.flatMap(identity).take(5).foreach(println)
You get a list after the group by, which you can map over again.
I have two DataFrames (Scala Spark), A and B. When A("id") == B("a_id") I want to update A("value") to B("value"). Since DataFrames have to be recreated, I'm assuming I have to do some joins and withColumn calls, but I'm not sure how. In SQL it would be a simple update on a natural join, but for some reason this seems difficult in Spark?
Indeed, a left join and a select call would do the trick:
// assuming "spark" is an active SparkSession:
import org.apache.spark.sql.functions._
import spark.implicits._
// some sample data; Notice it's convenient to NAME the dataframes using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")
// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
  .select($"id", coalesce($"B.value", $"A.value") as "value")
// result:
// +---+-----+
// | id|value|
// +---+-----+
// | 1| b1|
// | 2| b2|
// | 3| a3|
// +---+-----+
Notice that there's no real "update" here: result is a new DataFrame which you can use (write / count / ...), but the original DataFrames remain unchanged.
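So if you need to keep working with the "updated" data, just carry result forward or persist it; the output path below is purely illustrative:
// A and B are untouched; reuse `result` in later steps or write it out
result.cache()
result.write.mode("overwrite").parquet("/tmp/updated_A") // hypothetical path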