Looking to substract every value in a row based on the value of a separate DF - scala

As the title states, I would like to subtract each value of a specific column by the mean of that column.
Here is my code attempt:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
.withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()

You can either use Spark's DataFrame functions or a mere SQL query to a DataFrame to aggregate the values of the means for the columns you are focusing on (rating1, rating2).
val moviePairs = spark.createDataFrame(
Seq(
("Moonlight", 7, 8),
("Lord Of The Drinks", 10, 1),
("The Disaster Artist", 3, 5),
("Airplane!", 7, 9),
("2001", 5, 1),
)
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
Output for the test input DataFrame moviePairs (with the good ol' double precision loss which you can manage as seen here):
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+

Related

Spark - group and aggregate only several smallest items

In short
I have cartesian-product (cross-join) of two dataframes and function which gives some score for given element of this product. I want now to get few "best matched" elements of the second DF for every member of the first DF.
In details
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which for pair of the elements from the first and second dataset gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of two sets and calculate score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want only to retain only few (say 3) of "matches", those having the best score (say, least score).
The question is
How to get the "matches" sorted and reduced to top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by inner field.
Is there a way to ensure optimization in case of large input DFs - e.g. choosing minimums directly while aggregating. I know it could be done easily if I wrote the code without spark - keeping small array or priority queue for every id1 and adding element where it should be, possibly dropping out some previously added.
E.g. it's ok that cross-join is costly operation, but I want to avoid wasting memory on the results most of which I'm going to drop in the next step. My real use case deals with DFs with less than 1 mln entries so cross-join is yet viable but as we want to select only 10-20 top matches for each id1 it seems to be quite desirable not to keep unnecessary data between steps.
For start we need to take only the first n rows. To do this we are partitioning the DF by 'id1' and sorting the groups by the res. We use it to add row number column to the DF, like that we can use where function to take the first n rows. Than you can continue doing the same code your wrote. Grouping by 'id1' and collecting the list. Only now you already have the highest rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val n = 3
val w = Window.partitionBy($"id1").orderBy($"res".desc)
val res = r.withColumn("rn", row_number.over(w)).where($"rn" <= n).groupBy("id1").agg(collect_set(struct("id2", "res")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
val sortTakeUDF = udf{(xs: Seq[Row], n: Int)} => xs.sortBy(_.getAs[Int]("res")).reverse.take(n).map{case Row(x: String, y:Int)}}
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "res")), lit(n)).as("matches"))
In here we create a udf that take the array column and an integer value n. The udf sorts the array by your 'res' and returns only the first n elements.

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of values in each column like this -
So far, I've tried this (in Spark 2.2.0) but throws a stack trace -
val get_count: (String => Long) = (c: String) => {
df.groupBy("id")
.agg(sum(c) as "s")
.select("s")
.collect()(0)
.getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
(0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
.toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations computed in aggs. One would expect something more simple like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)
maybe to ensure that at least one column is used, and this is why you need to split aggs in aggs.head and aggs.tail.
What i do is to define a method to create a struct from the desired values:
def kv (columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This functions receives a list of columns to transpose (your 3 last columns in your case) and transform them in a struct with the column name as key and the column value as value
And then use that method to create an struct and process it as you want
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previous defined function to have the desired columns as the said struct, and then deconstruct the struct to have a column key and a column value in each row.
Then you can aggregate by the column name and sum the values
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

Spark2 Dataframe/RDD process in groups

I have the following table stored in Hive called ExampleData:
+--------+-----+---|
|Site_ID |Time |Age|
+--------+-----+---|
|1 |10:00| 20|
|1 |11:00| 21|
|2 |10:00| 24|
|2 |11:00| 24|
|2 |12:00| 20|
|3 |11:00| 24|
+--------+-----+---+
I need to be able to process the data by Site. Unfortunately partitioning it by Site doesn't work (there are over 100k sites, all with fairly small amounts of data).
For each site, I need to select the Time column and Age column separately, and use this to feed into a function (which ideally I want to run on the executors, not the driver)
I've got a stub of how I think I want it to work, but this solution would only run on the driver, so it's very slow. I need to find a way of writing it so it will run an executor level:
// fetch a list of distinct sites and return them to the driver
//(if you don't, you won't be able to loop around them as they're not on the executors)
val distinctSites = spark.sql("SELECT site_id FROM ExampleData GROUP BY site_id LIMIT 10")
.collect
val allSiteData = spark.sql("SELECT site_id, time, age FROM ExampleData")
distinctSites.foreach(row => {
allSiteData.filter("site_id = " + row.get(0))
val times = allSiteData.select("time").collect()
val ages = allSiteData.select("ages").collect()
processTimesAndAges(times, ages)
})
def processTimesAndAges(times: Array[Row], ages: Array[Row]) {
// do some processing
}
I've tried broadcasting the distinctSites across all nodes, but this did not prove fruitful.
This seems such a simple concept and yet I have spent a couple of days looking into this. I'm very new to Scala/Spark, so apologies if this is a ridiculous question!
Any suggestions or tips are greatly appreciated.
RDD API provides a number of functions which can be used to perform operations in groups starting with low level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Example :
rdd.map( tup => ((tup._1, tup._2, tup._3), tup) ).
groupByKey().
forEachPartition( iter => doSomeJob(iter) )
In DataFrame you can use aggregate functions,GroupedData class provides a number of methods for the most common functions, including count, max, min, mean and sum
Example :
val df = sc.parallelize(Seq(
(1, 10.3, 10), (1, 11.5, 10),
(2, 12.6, 20), (3, 2.6, 30))
).toDF("Site_ID ", "Time ", "Age")
df.show()
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
| 1| 10.3| 10|
| 1| 11.5| 10|
| 2| 12.6| 20|
| 3| 2.6| 30|
+--------+-----+---+
df.groupBy($"Site_ID ").count.show
+--------+-----+
|Site_ID |count|
+--------+-----+
| 1| 2|
| 3| 1|
| 2| 1|
+--------+-----+
Note : As you have mentioned that solution is very slow ,You need to use partition ,in your case range partition is good option.
http://dev.sortable.com/spark-repartition/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:
scala> avgsessiontime.show()
+-----------------+
| avg|
+-----------------+
|2.073455735838315|
+-----------------+
I need to store the value 2.073455735838315 in a variable. I tried using
avgsessiontime.collect
but that starts giving me Task not serializable exceptions. So to avoid that I started using foreachPrtition. But I dont know how to extract the value 2.073455735838315 in an array variable.
scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]
But when I do this:
avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))
I get a blank/empty result. Even the length returns empty.
avgsessiontime.foreachPartition(x => for (name <- x) name.length)
I know name is of type org.apache.spark.sql.Row then it should return both those results.
You might need:
avgsessiontime.first.getDouble(0)
Here use first to extract the Row object, and .getDouble(0) to extract value from the Row object.
val df = Seq(2.0743).toDF("avg")
df.show
+------+
| avg|
+------+
|2.0743|
+------+
df.first.getDouble(0)
// res6: Double = 2.0743
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Same way you can extract values of any type i.e double,string. You just need to give data type while selecting values from df.
rdd and dataframes/datasets are distributed in nature, and foreach and foreachPartition are executed on executors, transforming dataframe or rdd on executors itself without returning anything. So if you want to return the variable to the driver node then you will have to use collect.
Supposing you have a dataframe as
+-----------------+
|avg |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+
doing the following will print all the values, which you can store in a variable too
avgsessiontime.rdd.collect().foreach(x => println(x(0)))
it will print
2.073455735838315
2.073455735838316
Now if you want only the first one then you can do
avgsessiontime.rdd.collect()(0)(0)
which will give you
2.073455735838315
I hope the answer is helpful

Scala add new column to dataframe by expression

I am going to add new column to a dataframe with expression.
for example, I have a dataframe of
+-----+----------+----------+-----+
| C1 | C2 | C3 |C4 |
+-----+----------+----------+-----+
|steak|1 |1 | 150|
|steak|2 |2 | 180|
| fish|3 |3 | 100|
+-----+----------+----------+-----+
and I want to create a new column C5 with expression "C2/C3+C4", assuming there are several new columns need to add, and the expressions may be different and come from database.
Is there a good way to do this?
I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to eval it.
And normally I am using df.withColumn to add new column.
Seems I need to create an UDF, but how can I pass the columns value as parameters to UDF? especially there maybe multiple expression need different columns calculate.
This can be done using expr to create a Column from an expression:
val df = Seq((1,2)).toDF("x","y")
val myExpression = "x+y"
import org.apache.spark.sql.functions.expr
df.withColumn("z",expr(myExpression)).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Two approaches:
import spark.implicits._ //so that you could use .toDF
val df = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
import org.apache.spark.sql.functions._
// 1st approach using expr
df.withColumn("C5", expr("C2/(C3 + C4)")).show()
// 2nd approach using selectExpr
df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()
+-----+---+---+---+--------------------+
| C1| C2| C3| C4| C5|
+-----+---+---+---+--------------------+
|steak| 1| 1|150|0.006622516556291391|
|steak| 2| 2|180| 0.01098901098901099|
| fish| 3| 3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
In Spark 2.x, you can create a new column C5 with expression "C2/C3+C4" using withColumn() and org.apache.spark.sql.functions._,
val currentDf = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
val requiredDf = currentDf
.withColumn("C5", (col("C2")/col("C3")+col("C4")))
Also, you can do the same using org.apache.spark.sql.Column as well.
(But the space complexity is bit higher in this approach than using org.apache.spark.sql.functions._ due to the Column object creation)
val requiredDf = currentDf
.withColumn("C5", (new Column("C2")/new Column("C3")+new Column("C4")))
This worked perfectly for me. I am using Spark 2.0.2.