Calculate mean for several columns in Spark Scala

I'm looking for a way to calculate a statistic, e.g. the mean, over several selected columns in Spark using Scala. Given that data is my Spark DataFrame, it's easy to calculate a mean for one column only, e.g.
data.agg(avg("var1") as "mean var1").show
Also, we can easily calculate a mean cross-tabulated by values of some other columns e.g.:
data.groupBy("category").agg(avg("var1") as "mean_var1").show
But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:
scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
data.select("var1", "var2").mean().show
^

This is what you need to do:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq((1, 2, 3), (3, 4, 5), (1, 2, 4)).toDF("A", "B", "C")
df1.select(df1.columns.map(mean(_)): _*).show()
Output:
+------------------+------------------+------+
| avg(A)| avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665| 4.0|
+------------------+------------------+------+
This works for selected columns:
df1.select(Seq("A", "B").map(mean(_)): _*).show()
Output:
+------------------+------------------+
| avg(A)| avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
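If you need more than one statistic per column, a minimal sketch (reusing the same df1 and only standard functions from org.apache.spark.sql.functions; the column list and aliases are just examples) is to build the aggregate expressions with flatMap:
import org.apache.spark.sql.functions.{col, mean, stddev}
val cols = Seq("A", "B")
// one aggregate expression per (column, statistic) pair
val aggs = cols.flatMap(c => Seq(mean(col(c)).as(s"mean_$c"), stddev(col(c)).as(s"stddev_$c")))
df1.agg(aggs.head, aggs.tail: _*).show()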
Hope this helps!

If you already have the dataset you can do this:
ds.describe("age")
Which will return this:
+-------+----+
|summary| age|
+-------+----+
|  count|10.0|
|   mean|53.3|
| stddev|11.6|
|    min|18.0|
|    max|92.0|
+-------+----+
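Note that describe also accepts several column names at once, which fits the original question about multiple columns; a minimal hedged sketch (the salary column is just a made-up example):
// summary statistics (count, mean, stddev, min, max) for several columns in one call
ds.describe("age", "salary").show()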

Related

Check the minimum by iterating one row in a dataframe over all the rows in another dataframe

Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally much larger.
For every Place in DF1, I want to determine the City in DF2 with the shortest distance. The distance can be calculated based on the index. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distance based on a calculation with the indexes. For the distance calculation there is a function defined:
val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
The code compiles, but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I'm doing something wrong when iterating over each row in DF2 while taking one row from DF1, but I couldn't find a solution.
What am I doing wrong? Am I heading in the right direction?
You are getting this error because the index column you are using only exists in DF2 and not DF1 where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to:
Cross join DF1 and DF2 so that every index of DF1 is matched with every index of DF2
Determine the distance using your udf
Find the min on this new cross-joined dataframe with the distances
This may look like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}

val distance = udf(
  (indexA: Long, indexB: Long) => {
    h3.instance.h3Distance(indexA, indexB)
  })

val resultDF = DF1.crossJoin(DF2)
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  // Instead of using a groupBy and then matching the min distance of the aggregation
  // with the initial df, I've chosen to use a window function min to determine the
  // min_distance of each group (determined by Place) and to filter by the city with
  // the min distance to each place
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")
This will result in a dataframe with the columns from both dataframes and an additional column, distance.
NB. Your current approach, which compares every item in one df to every item in another df, is an expensive operation. If you have the opportunity to filter early (e.g. joining on heuristic columns, i.e. other columns which may indicate that a place is likely to be closer to a city), this is recommended; a rough sketch of that idea follows.
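As a sketch of that early-filtering idea only: the country column below is hypothetical, purely to illustrate a heuristic join key that turns the cross join into a regular join before the distance computation (reusing the imports and the distance udf from above).
// Hypothetical heuristic: only compare places and cities in the same country
val resultFilteredDF = DF1.join(DF2, DF1("country") === DF2("country"))
  .withColumn("distance", distance(col("IndexA"), col("IndexB")))
  .withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
  .where(col("distance") === col("min_distance"))
  .drop("min_distance")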
Let me know if this works for you.
If you have only a few cities (fewer than or around 1000), you can avoid the crossJoin and Window shuffle by collecting the cities into an array and then performing the distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}

val citiesIndexes = df2.select("City", "IndexB")
  .collect()
  .map(row => (row.getString(0), row.getLong(1)))

val result = df1.withColumn(
  "City",
  array_min(
    transform(
      typedLit(citiesIndexes),
      x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
    )
  ).getItem("col2")
)
This piece of code works for Spark 3.0 and later. If you are on a Spark version older than 3.0, you should replace the array_min(...).getItem("col2") part with a user-defined function.
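For those older versions, a hedged sketch of such a replacement (the closestCity and resultPre3 names are mine) is a udf that takes IndexA, scans the collected citiesIndexes array, and returns the name of the city whose index minimizes h3Distance:
// Assumes citiesIndexes from above; the array is small enough to live in the udf closure
val closestCity = udf((indexA: Long) =>
  citiesIndexes.minBy { case (_, indexB) => h3.instance.h3Distance(indexA, indexB) }._1
)
val resultPre3 = df1.withColumn("City", closestCity(col("IndexA")))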

How to calculate the average of rows between a start index and end index of a column in a dataframe in Spark using Scala?

I have a Spark dataframe with a column of float values. I am trying to find the average of the values between row 11 and row 20. Please note, I am not trying to compute any sort of moving average. I tried using a partition window like so:
var avgClose= avg(priceDF("Close")).over(partitionWindow.rowsBetween(11,20))
It returns an 'org.apache.spark.sql.Column' result. I don't know how to view avgClose.
I am new to Spark and Scala. Appreciate your help in getting this.
Assign an increasing id to your table, then average over the rows whose ids fall in that range.
import org.apache.spark.sql.functions.{avg, monotonically_increasing_id}
import spark.implicits._
val df = Seq(20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1).toDF("val1")
val dfWithId = df.withColumn("id", monotonically_increasing_id())
val avgClose = dfWithId.filter($"id" >= 11 && $"id" <= 20).agg(avg("val1"))
avgClose.show()
result:
+---------+
|avg(val1)|
+---------+
| 5.0|
+---------+
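One caveat: monotonically_increasing_id guarantees increasing, unique ids but not consecutive row numbers across partitions, so on multi-partition data the filtered range may not correspond to rows 11 to 20. A hedged sketch of a stricter alternative uses zipWithIndex to attach a truly consecutive index (reusing df, avg and the implicits from above):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}
// attach a 0-based, consecutive row index regardless of partitioning
val withIdx = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  df.schema.add(StructField("id", LongType))
)
withIdx.filter($"id" >= 11 && $"id" <= 20).agg(avg("val1")).show()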

Filtering a DataFrame on date columns comparison

I am trying to filter a DataFrame by comparing two date columns, using Scala and Spark. Based on the filtered DataFrame, further calculations run on top to compute new columns.
Simplified, my data frame has the following schema:
|-- received_day: date (nullable = true)
|-- finished: int (nullable = true)
On top of that I create two new columns, t_end and t_start, that will be used for filtering the DataFrame. They lie 10 and 20 days before the original column received_day:
val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10))
  .withColumn("t_start", date_sub(col("received_day"), 20))
I now want to have a new calculated column that indicates, for each row of data, how many rows of the dataframe fall in the t_start to t_end period. I thought I could achieve this the following way:
val dfWithCount = dfWithDates
  .withColumn("cnt", lit(
    dfWithDates.filter(
      $"received_day".lt(col("t_end"))
        && $"received_day".gt(col("t_start"))).count()))
However, this count only returns 0, and I believe that the problem is in the argument that I am passing to lt and gt.
From a related question (Filtering a spark dataframe based on date) I realized that I need to pass a string value. If I try with hard-coded values like lt(lit("2018-12-15")), the filtering works. So I tried casting my columns to StringType:
val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10).cast(DataTypes.StringType))
  .withColumn("t_start", date_sub(col("received_day"), 20).cast(DataTypes.StringType))
But the filter still returns an empty dataFrame.
I would assume that I am not handling the data type right.
I am running on Scala 2.11.0 with Spark 2.0.2.
Yes, you are right. For $"received_day".lt(col("t_end")), each received_day value is compared with the current row's t_end value, not with the whole dataframe. So each time you'll get zero as the count.
You can solve this by writing a simple udf. Here is how you can solve the issue:
Creating sample input dataset:
import org.apache.spark.sql.{Row, SparkSession}
import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((Date.valueOf("2018-10-12"), 1),
  (Date.valueOf("2018-10-13"), 1),
  (Date.valueOf("2018-09-25"), 1),
  (Date.valueOf("2018-10-14"), 1)).toDF("received_day", "finished")

val dfWithDates = df
  .withColumn("t_start", date_sub(col("received_day"), 20))
  .withColumn("t_end", date_sub(col("received_day"), 10))
dfWithDates.show()
+------------+--------+----------+----------+
|received_day|finished| t_start| t_end|
+------------+--------+----------+----------+
| 2018-10-12| 1|2018-09-22|2018-10-02|
| 2018-10-13| 1|2018-09-23|2018-10-03|
| 2018-09-25| 1|2018-09-05|2018-09-15|
| 2018-10-14| 1|2018-09-24|2018-10-04|
+------------+--------+----------+----------+
Here, for 2018-09-25 we expect a count of 3.
Generate output:
val count_udf = udf((received_day: Date) => {
  dfWithDates.filter(col("t_end").gt(s"$received_day") && col("t_start").lt(s"$received_day")).count()
})
val dfWithCount = dfWithDates.withColumn("count",count_udf(col("received_day")))
dfWithCount.show()
+------------+--------+----------+----------+-----+
|received_day|finished| t_start| t_end|count|
+------------+--------+----------+----------+-----+
| 2018-10-12| 1|2018-09-22|2018-10-02| 0|
| 2018-10-13| 1|2018-09-23|2018-10-03| 0|
| 2018-09-25| 1|2018-09-05|2018-09-15| 3|
| 2018-10-14| 1|2018-09-24|2018-10-04| 0|
+------------+--------+----------+----------+-----+
To make the computation faster, I would suggest caching dfWithDates, since the same operation is repeated for each row.
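Note also that referring to a DataFrame inside a udf may fail outside local mode, since executors have no access to the driver-side dfWithDates. A hedged sketch of an equivalent, cluster-friendly variant is a non-equi self-join (the left, right and counted names are mine):
import org.apache.spark.sql.functions.{col, count}
// count, per row, how many [t_start, t_end] windows in the dataframe contain its received_day
val left = dfWithDates.alias("l")
val right = dfWithDates.select("t_start", "t_end").alias("r")
val counted = left
  .join(right,
    col("r.t_end").gt(col("l.received_day")) && col("r.t_start").lt(col("l.received_day")),
    "left")
  .groupBy("l.received_day", "l.finished", "l.t_start", "l.t_end")
  .agg(count(col("r.t_end")).as("count"))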
You can format a date value as a string with any pattern using DateTimeFormatter:
import java.time.format.DateTimeFormatter
date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))

How to divide the value of current row with the following one?

In Spark SQL version 1.6, using DataFrames, is there a way to calculate, for a specific column and for every row, the fraction obtained by dividing the current row by the next one?
For example, if I have a table with one column, like so
Age
100
50
20
4
I'd like the following output
Fraction
2
2.5
5
The last row is dropped because it has no "next row" to divide by.
Right now I am doing it by ranking the table and joining it with itself, matching rank with rank + 1.
Is there a better way to do this?
Can this be done with a Window function?
A Window function does only part of the trick. The other part can be done by defining a udf function:
def div = udf((age: Double, lag: Double) => lag / age)
First we need to find the lag using a Window function, and then pass that lag and age into the udf function to find the div:
import sqlContext.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val dataframe = Seq(
  ("A", 100),
  ("A", 50),
  ("A", 20),
  ("A", 4)
).toDF("person", "Age")

val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over(windowSpec))
And finally call the udf function:
newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show
Final output would be
+------+---+
|person|div|
+------+---+
| A|2.0|
| A|2.5|
| A|5.0|
+------+---+
Edited
As @Jacek suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and the / operator, so we don't even need to call the udf function:
newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
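For completeness, a hedged sketch that reads more literally as "current row divided by the next one" uses lead over the same descending window (reusing dataframe and windowSpec from above; the column names next and div are mine):
import org.apache.spark.sql.functions.lead
// lead(Age, 1) is the value of the following row within the descending window,
// so Age / next is exactly the current row divided by the next one
dataframe
  .withColumn("next", lead(dataframe("Age"), 1) over(windowSpec))
  .na.drop(Seq("next"))
  .withColumn("div", col("Age") / col("next"))
  .drop("Age").drop("next")
  .show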

Convert Spark Data Frame to org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]

I'm very new to Scala and Spark 2.1.
I'm trying to calculate the correlation between many elements in a data frame which looks like this:
item_1 | item_2 | item_3 | item_4
1 | 1 | 4 | 3
2 | 0 | 2 | 0
0 | 2 | 0 | 1
Here is what I've tried:
val df = sqlContext.createDataFrame(Seq(
  (1, 1, 4, 3),
  (2, 0, 2, 0),
  (0, 2, 0, 1)
)).toDF("item_1", "item_2", "item_3", "item_4")
val items = df.select(array(df.columns.map(col(_)): _*)).rdd.map(_.getSeq[Double](0))
And calculate the correlation between elements:
val correlMatrix: Matrix = Statistics.corr(items, "pearson")
Which gives the following error message:
<console>:89: error: type mismatch;
found : org.apache.spark.rdd.RDD[Seq[Double]]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val correlMatrix: Matrix = Statistics.corr(items, "pearson")
I don't know how to create the org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] from a data frame.
This might be a really easy task, but I'm struggling with it a bit and would be happy for any advice.
You can, for example, use VectorAssembler. Assemble the vectors and convert to an RDD:
import org.apache.spark.ml.feature.VectorAssembler

val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd
Extract Vectors from Row:
Spark 1.x:
rows.map(_.getAs[org.apache.spark.mllib.linalg.Vector](0))
Spark 2.x:
rows
.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
.map(org.apache.spark.mllib.linalg.Vectors.fromML)
Regarding your code:
You have Integer columns, not Double.
The data is not an array, so you cannot use _.getSeq[Double](0).
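If you prefer to stay close to your original select-based approach, a minimal hedged sketch (assuming the same df as above) is to cast the columns to double and build mllib Vectors from each Row before calling Statistics.corr:
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.functions.col

// cast the Integer columns to double, then turn each Row into a dense mllib Vector
val items = df.select(df.columns.map(c => col(c).cast("double")): _*)
  .rdd
  .map(row => Vectors.dense(row.toSeq.map(_.asInstanceOf[Double]).toArray))

val correlMatrix: Matrix = Statistics.corr(items, "pearson")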
If your goal is to perform Pearson correlations, you don't really have to use RDDs and Vectors. Here's an example of performing Pearson correlations directly on DataFrame columns (the columns in question are of Double type).
Code:
import org.apache.spark.sql.{SQLContext, Row, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import org.apache.spark.sql.functions._
val rb = spark.read
  .option("delimiter", "|")
  .option("header", "false")
  .option("inferSchema", "true")
  .format("csv")
  .load("rb.csv")
  .toDF("name", "beerId", "brewerId", "abv", "style", "appearance", "aroma", "palate", "taste", "overall", "time", "reviewer")
  .cache()

rb.agg(
  corr("overall", "taste"),
  corr("overall", "aroma"),
  corr("overall", "palate"),
  corr("overall", "appearance"),
  corr("overall", "abv")
).show()
In this example, I'm importing a dataframe (with a custom delimiter, no header, and inferred data types), and then simply performing an agg on the dataframe with multiple correlations inside it.
Output:
+--------------------+--------------------+---------------------+-------------------------+------------------+
|corr(overall, taste)|corr(overall, aroma)|corr(overall, palate)|corr(overall, appearance)|corr(overall, abv)|
+--------------------+--------------------+---------------------+-------------------------+------------------+
| 0.8762432795943761| 0.789023067942876| 0.7008942639550395| 0.5663593891357243|0.3539158620897098|
+--------------------+--------------------+---------------------+-------------------------+------------------+
As you can see from the results, the (overall, taste) columns are highly correlated, while (overall, abv) not so much.
Here's a link to the Scala Docs DataFrame page which has the Aggregation Correlation Function.