How to merge values from multiple rows so they can be processed together - Spark Scala

I have multiple database rows per personId, with columns that may or may not have values. I'm using colors here because the data is text, not numeric, so it doesn't lend itself to built-in aggregation functions. A simplified example is:
PersonId ColA ColB ColC
100 red
100 green
100 gold
100 green
110 yellow
110 white
110
120
etc...
I want to be able to decide, in a function, which column data to use per unique PersonId. A three-way join of the table against itself would be a good solution if the data didn't have multiple values (colors) per column. For example, that join merges three of the rows into one but still produces multiple rows:
PersonId ColA ColB ColC
100 red green gold
100 green
110 white yellow
110
120
So the solution I'm looking for is something that will allow me to address all the values (colors) for a person in one place (function) so the decision can be made across all their data.
The real data of course has more columns, but the primary ones for this decision are these three. The data is being read in Scala Spark as a DataFrame, and I'd prefer to use the DataFrame API rather than SQL. I don't know whether any of the exotic window or groupBy functions will help, or if it's going to come down to plain old iterate and accumulate.
The technique used in "How to aggregate values into collection after groupBy?" might be applicable, but it's a bit of a leap.

Consider using a custom UDF for this:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((100, "red", null, null), (100, null, "white", null), (100, null, null, "green"), (200, null, "red", null)).toDF("PID", "A", "B", "C")
df.show()
+---+----+-----+-----+
|PID| A| B| C|
+---+----+-----+-----+
|100| red| null| null|
|100|null|white| null|
|100|null| null|green|
|200|null| red| null|
+---+----+-----+-----+
// Pick the first non-empty value collected for a column (null if nothing was collected).
val customUDF = udf((values: Seq[String]) => {
  val nonEmpty = values.filter(_.nonEmpty)  // collect_set already drops nulls
  if (nonEmpty.isEmpty) null else nonEmpty.head
})
df.groupBy($"PID").agg(customUDF(collect_set($"A")).as("colA"), customUDF(collect_set($"B")).as("colB"), customUDF(collect_set($"C")).as("colC")).show
+---+----+-----+-----+
|PID|colA| colB| colC|
+---+----+-----+-----+
|100| red|white|green|
|200|null| red| null|
+---+----+-----+-----+
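If the decision really needs to see all of a person's values across the three columns at once, as the question asks, one variant is to pass all three collected sets into a single UDF. This is only a sketch under that assumption; decideUDF and its rule (prefer ColA, then ColB, then ColC) are hypothetical:
import org.apache.spark.sql.functions._
import spark.implicits._

// collect_set drops nulls, so each Seq holds only the non-null values for that person.
val decideUDF = udf((as: Seq[String], bs: Seq[String], cs: Seq[String]) =>
  // Hypothetical rule: take the first value found, preferring ColA over ColB over ColC.
  (as ++ bs ++ cs).headOption.orNull
)

df.groupBy($"PID")
  .agg(decideUDF(collect_set($"A"), collect_set($"B"), collect_set($"C")).as("chosen"))
  .show()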

Related

Spark: Combining Disparate-Rate DataFrames in Time

Using Spark and Scala, I have two DataFrames with data values.
I'm trying to accomplish something that would be trivial when processing serially, but seems daunting when processing in a cluster.
Let's say I have two sets of values. One of them is very regular:
Relative Time | Value1
10            | 1
20            | 2
30            | 3
And I want to combine it with another value that is very irregular:
Relative Time | Value2
1             | 100
22            | 200
And get this (driven by Value1):
Relative Time | Value1 | Value2
10            | 1      | 100
20            | 2      | 100
30            | 3      | 200
Note: There are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive.
Also note: I depict Value2 as arriving very slowly, and it might, but it could also arrive much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest. Because of this, doing a union of the two and windowing it doesn't seem practical.
How would I accomplish this in Spark?
I think you can:
1. Do a full outer join between the two tables.
2. Use the last function to look back to the closest value of value2.
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last
val df1 = spark.sparkContext.parallelize(Seq(
(10, 1),
(20, 2),
(30, 3)
)).toDF("Relative Time", "value1")
val df2 = spark.sparkContext.parallelize(Seq(
(1, 100),
(22, 200)
)).toDF("Relative Time", "value2_temp")
val df = df1.join(df2, Seq("Relative Time"), "outer")
val window = Window.orderBy("Relative Time")
val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull)
  .drop("value2_temp")
result.show()
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
| 10| 1| 100|
| 20| 2| 100|
| 30| 3| 200|
+-------------+------+------+
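One caveat for the "both massive" scenario mentioned in the question: a window defined with orderBy but no partitionBy pulls every row into a single partition for the sort. If the data has a natural grouping column, partitioning the window keeps the work distributed; a sketch, assuming a hypothetical grouping column "deviceId":
// "deviceId" is a hypothetical grouping column; the rest of the answer stays the same.
val window = Window.partitionBy("deviceId").orderBy("Relative Time")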

PySpark: Group by two columns, count the pairs, and divide the average of two different columns

I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count each combination into a column called "count". I also need to take the average of total_amount and the average of trip_distance and divide one by the other into a column called "trip_rate". The end DF should be:
PULocationID | DOLocationID | count | trip_rate
123          | 422          | 1     | 5.2435
3            | 27           | 4     | 6.6121
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F
new_df = df.groupBy(
    "PULocationID",
    "DOLocationID",
).agg(
    F.count(F.lit(1)).alias("count"),
    F.avg(F.col("total_amount")).alias("avg_amt"),
    F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
    "PULocationID",
    "DOLocationID",
    "count",
    (F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate"),
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+

How to select the N highest values for each category in spark scala

Say I have this dataset:
val main_df = Seq(
  ("yankees-mets", 8, 20),
  ("yankees-redsox", 4, 14),
  ("yankees-mets", 6, 17),
  ("yankees-redsox", 2, 10),
  ("yankees-mets", 5, 17),
  ("yankees-redsox", 5, 10)
).toDF("teams", "homeruns", "hits")
I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return 8 and 6, since those were that team's 2 highest homerun totals.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for pivot, since a pivot means:
"A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns."
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import spark.implicits._

main_df
  .withColumn(
    "rank",
    rank().over(Window.partitionBy("teams").orderBy($"homeruns".desc))
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
Then, if you no longer need the rank column, you can just drop it.
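For the general case (top N rows per team, not just yankees-mets), a sketch of the same approach, here with N = 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import spark.implicits._

val N = 2
main_df
  .withColumn("rank", rank().over(Window.partitionBy("teams").orderBy($"homeruns".desc)))
  .where($"rank" <= N) // rank() keeps ties; use row_number() if you want exactly N rows per team
  .drop("rank")
  .show()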

How to join two dataframes and subtract two columns from the dataframe

I have two dataframes which look like the below.
I am trying to find the difference between the two amounts based on ID.
Dataframe 1:
ID I Amt
1 null 200
null 2 200
3 null 600
Dataframe 2:
ID I Amt
2 null 300
3 null 400
Output
Df
ID Amt(df2-df1)
2 100
3 -200
Query that doesn't work (the subtraction doesn't work):
df = df1.join(df2, df1["coalesce(ID, I)"] == df2["coalesce(ID, I)"], 'inner').select(
    (df1["amt"] - df2["amt"]), df1["coalesce(ID, I)"]).show()
I would do a couple of things differently. To make it easier to know what column is in what dataframe, I would rename them. I would also do the coalesce outside of the join itself.
val joined = df1.withColumn("joinKey",coalesce($"ID",$"I")).select($"joinKey",$"Amt".alias("DF1_AMT")).join(
df2.withColumn("joinKey",coalesce($"ID",$"I")).select($"joinKey",$"Amt".alias("DF2_AMT")),"joinKey")
Then you can easily perform your calculation:
joined.withColumn("DIFF",$"DF2_AMT" - $"DF1_AMT").show
+-------+-------+-------+------+
|joinKey|DF1_AMT|DF2_AMT| DIFF|
+-------+-------+-------+------+
| 2| 200| 300| 100.0|
| 3| 600| 400|-200.0|
+-------+-------+-------+------+
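If you only need the two columns from the question's expected output, a small follow-up select on the joined result gets there (the output column names here are just illustrative):
// Column names are adapted from the question's expected output; adjust as needed.
joined
  .select($"joinKey".as("ID"), ($"DF2_AMT" - $"DF1_AMT").as("Amt_diff"))
  .show()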

Count the number of non-null values in a Spark DataFrame

I have a data frame with some columns, and before doing analysis, I'd like to understand how complete the data frame is. So I want to filter the data frame and count for each column the number of non-null values, possibly returning a dataframe back.
Basically, I am trying to achieve the same result as expressed in this question but using Scala instead of Python.
Say you have:
import spark.implicits._
val df = Seq((Some(0), Some(4), Some(3)), (None, Some(3), Some(4)), (None, None, Some(5)))
  .toDF("x", "y", "z")
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?
One straightforward option is to use the .describe() function to get a summary of your data frame, where the count row includes a count of non-null values:
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+
Although I like Psidom's answer, I'm often more interested in the fraction of null values, because just the number of non-null values doesn't tell you much...
You can do something like:
import org.apache.spark.sql.functions.{sum, when, count}

df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x: fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y: fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z: fraction null")
).show()
EDIT: sum(when($"x".isNotNull, 0).otherwise(1)) can also be written as count("*") - count($"x"), since count($"x") only counts non-null values. As I find this not obvious, I tend to use the sum notation, which is clearer.
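A minimal sketch of that equivalence, expressing the same fractions with count only:
import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._

df.agg(
  ((count(lit(1)) - count($"x")) / count(lit(1))).as("x: fraction null"),
  ((count(lit(1)) - count($"y")) / count(lit(1))).as("y: fraction null"),
  ((count(lit(1)) - count($"z")) / count(lit(1))).as("z: fraction null")
).show()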
Here's how I did it in Scala 2.11, Spark 2.3.1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.agg(
count("x").divide(count(lit(1)))
.as("x: percent non-null")
// ...copy paste that for columns y and z
).head()
count("x") counts only the non-null values of x, while count(lit(1)) counts every row.
If you instead want to count percent null in population, find the complement of our count-based equation:
lit(1).minus(
count("x").divide(count(lit(1)))
)
.as("x: percent null")
It's also worth knowing that you can cast nullness to an integer, then sum it.
But it's probably less performant:
// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
.divide(count(lit(1)))
.as("x: percent null")
Here is the simplest query (for a single column):
df.filter($"x".isNotNull).count
Another short option is to map count over every column (this needs import org.apache.spark.sql.functions.count):
df.select(df.columns map count: _*)
or, keeping the original column names:
df.select(df.columns map count: _*).toDF(df.columns: _*)
Spark 2.3+
(for string and numeric type columns)
df.summary("count").show()
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+