Expand expression in Spark Scala aggregation

I'm trying to convert simple aggregation code from PySpark to Scala.
The dataframes:
# PySpark
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [([10, 100],),
     ([20, 200],)],
    ['vals'])
// Scala
val df = Seq(
  (Seq(10, 100)),
  (Seq(20, 200)),
).toDF("vals")
Aggregation expansion - OK in PySpark:
df2 = df.agg(
    *[F.sum(F.col("vals")[i]).alias(f"col{i}") for i in range(2)]
)
df2.show()
# +----+----+
# |col0|col1|
# +----+----+
# | 30| 300|
# +----+----+
But in Scala...
val df2 = df.agg(
  (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
)
(0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
^
On line 2: error: no `: _*` annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
The syntax seems almost the same as this select, which works fine:
val df2 = df.select(
  (0 until 2).map(i => $"vals"(i).alias(s"col$i")): _*
)
Does expression expansion work in Scala Spark aggregations? How?

I don't fully understand why the compiler behaves this way, but it seems it is not unpacking your Seq[Column] into vararg parameters.
As #RvdV has mentioned in his post, the signature of the method is
def agg(expr: Column, exprs: Column*): DataFrame
so a temporary solution is to unpack it manually, like:
val seq = Seq(0, 1).map(i => sum($"vals"(i)).alias(s"col$i"))
val df2 = df.agg(seq(0), seq(1))
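More generally, the same workaround can be written without hard-coding the indices by splitting the generated sequence into a head and a tail; a small sketch, assuming the Seq[Column] is non-empty:
// Build the aggregation columns once, then feed head + tail into
// agg's (Column, Column*) signature.
val aggs = (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i"))
val df2 = df.agg(aggs.head, aggs.tail: _*)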

If you look at the documentation of Dataset.agg, you see that it first has a fixed parameter and then a list of unspecified length:
def agg(expr: Column, exprs: Column*): DataFrame
So you need to supply one aggregation as the first argument; after that, you can expand the list into the varargs parameter. Something like
val df2 = df.agg(
  first($"vals"),
  (0 until 2).map(i => sum($"vals"(i)).alias(s"col$i")): _*
)
or any other single aggregation in front of the list should work.
I don't know why it is like this; maybe it's a limitation of the API so that you can't pass an empty list and end up with no aggregation at all?

Related

filter after flatMap spark

I have a simple example
val arrayStructureData = Seq(
  Row("test1|value1"),
  Row("test2|value2"),
  Row("test3|value3")
)
val arrayStructureSchema = new StructType()
  .add("name", StringType)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
import spark.implicits._
val distPhens = df.flatMap(row => row.getString(0).split("\\|"))
  .filter(x => x.like("test[0-9]+"))
  .toDF("distinct_phens")
where I'm trying to run filter after flatMap. The desired output is:
value1
value2
value3
If I understand correctly, like expects a column but I am not sure how to "refer" to the column after flatMap has been executed.
I need this filter operation to run after flatMap.
Thanks in advance.
You can refer to the column object using col, and do an rlike filter:
val result = df
  .flatMap(row => row.getString(0).split("\\|"))
  .filter(col("value").rlike("test[0-9]+"))
result.show
+-----+
|value|
+-----+
|test1|
|test2|
|test3|
+-----+
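As a side note, flatMap on a DataFrame yields a Dataset[String] whose single column is named value by default; a small sketch that renames it first so the filtered column is explicit, reusing the question's distinct_phens name:
// flatMap produces a Dataset[String] with a column called "value";
// renaming it up front makes the column reference self-explanatory.
val distPhens = df
  .flatMap(row => row.getString(0).split("\\|"))
  .toDF("distinct_phens")
  .filter(col("distinct_phens").rlike("test[0-9]+"))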

Use Map to replace column values in Spark

I have to map a list of columns to another column in a Spark dataset; think of something like this:
val translationMap: Map[Column, Column] = Map(
  lit("foo") -> lit("bar"),
  lit("baz") -> lit("bab")
)
And I have a dataframe like this one:
val df = Seq("foo", "baz").toDF("mov")
So I intend to perform the translation like this:
df.select(
  col("mov"),
  translationMap(col("mov"))
)
but this piece of code throws the following error:
key not found: movs
java.util.NoSuchElementException: key not found: movs
Is there a way to perform such a translation without chaining hundreds of whens? Bear in mind that translationMap could have a lot of key-value pairs.
Instead of Map[Column, Column] you should use a Column containing a map literal:
import org.apache.spark.sql.functions.typedLit
val translationMap: Column = typedLit(Map(
  "foo" -> "bar",
  "baz" -> "bab"
))
The rest of your code can stay as-is:
df.select(
  col("mov"),
  translationMap(col("mov"))
).show
+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo| bar|
|baz| bab|
+---+---------------------------------------+
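As a side note, on Spark 2.4+ the same lookup can also be written with element_at instead of applying the map column directly; a small sketch:
import org.apache.spark.sql.functions.{col, element_at}

// element_at(mapColumn, key) returns the value for the key, or null if it is absent.
df.select(col("mov"), element_at(translationMap, col("mov")).as("translated")).show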
You cannot refer to a Scala collection declared on the driver like this inside a distributed DataFrame. An alternative is to use a UDF, which will not be efficient if you have a large dataset, since UDFs are not optimized by Spark.
val translationMap = Map("foo" -> "bar", "baz" -> "bab")
val getTranslationValue = udf((x: String) => translationMap.getOrElse(x, null.asInstanceOf[String]))
df.select(col("mov"), getTranslationValue($"mov").as("value")).show
//+---+-----+
//|mov|value|
//+---+-----+
//|foo| bar|
//|baz| bab|
//+---+-----+
Another solution would be to load the Map as a Dataset[(String, String)] and then join the two datasets, using mov as the key.
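A minimal sketch of that join-based approach, assuming spark.implicits._ is in scope as in the question (the broadcast hint is optional and assumes the translation table is small):
import org.apache.spark.sql.functions.broadcast

// Turn the driver-side map into a small DataFrame and join it on "mov".
val translationDf = Seq("foo" -> "bar", "baz" -> "bab").toDF("mov", "translated")
df.join(broadcast(translationDf), Seq("mov"), "left").show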

describe() function over rows instead of columns

As described in:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
the describe() function works on each numerical column. Is it possible to do it across rows instead? My DataFrame is 53 columns by 346,143 rows, so transposing is not an option. How can I do it?
I'm using Spark 2.11
You can write your own UDF. Either you make a separate UDF for each quantity, or you put everything into one UDF that returns a complex result:
val df = Seq(
  (1.0, 2.0, 3.0, 4.0, 5.0)
).toDF("x1", "x2", "x3", "x4", "x5")

val describe = udf { xs: Seq[Double] =>
  val xmin = xs.min
  val xmax = xs.max
  val mean = xs.sum / xs.size.toDouble
  (xmin, xmax, mean)
}

df
  .withColumn("describe", describe(array("*")))
  .withColumn("min", $"describe._1")
  .withColumn("max", $"describe._2")
  .withColumn("mean", $"describe._3")
  .drop($"describe")
  .show
gives:
+---+---+---+---+---+---+---+----+
| x1| x2| x3| x4| x5|min|max|mean|
+---+---+---+---+---+---+---+----+
|1.0|2.0|3.0|4.0|5.0|1.0|5.0| 3.0|
+---+---+---+---+---+---+---+----+
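For comparison, the same per-row min/max/mean can also be computed without a UDF, using built-in functions; a minimal sketch, assuming all columns are numeric:
import org.apache.spark.sql.functions.{col, greatest, least, lit}

// least/greatest work row-wise across the given columns;
// the mean is the sum of the columns divided by their count.
val cols = df.columns.map(col)
df.withColumn("min", least(cols: _*))
  .withColumn("max", greatest(cols: _*))
  .withColumn("mean", cols.reduce(_ + _) / lit(cols.length))
  .show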

Sum up the values of the DataFrame based on conditions

I have a DataFrame that is created as follows:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF()
This is the content of data.txt:
123,2016-11-09,1
124,2016-11-09,2
123,2016-11-10,1
123,2016-11-11,1
123,2016-11-12,1
124,2016-11-13,1
124,2016-11-14,1
Is it possible to filter df in order to get the sum of 3rd column values for 123 for the last N days starting from now? I am interested in a flexible solution so that N could be defined as a parameter.
For example, if today would be 2016-11-16 and N would be equal to 5, then the sum of 3rd column values for 124 would be equal to 2.
This is my current solution:
df = sc
.textFile("s3n://bucket/key/data.txt")
.map(_.split(","))
.toDF(["key","date","qty"])
val starting_date = LocalDate.now().minusDays(x_last_days)
df.filter(col("key") === "124")
.filter(to_date(df("date")).gt(starting_date))
.agg(sum(col("qty")))
but it does not seem to work properly. 1. The line where I define the column names ["key","date","qty"] does not compile with Scala 2.10.6 and Spark 1.6.2. 2. It also returns a DataFrame, while I need an Int. Should I just do toString.toInt?
Neither of the following will compile:
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
// <console>:1: error: illegal start of simple expression
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF(["key","date","qty"])
^
scala> val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
// <console>:27: error: value toDF is not a member of org.apache.spark.rdd.RDD[Array[String]]
// val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1")).map(_.split(",")).toDF
^
The first one fails because the syntax is incorrect, and the second fails because, as the error says, toDF is not a member of RDD[Array[String]]; in other words, that operation is not supported there.
The latter will compile with Spark 2.x, but the following solution would still apply; otherwise you'd end up with a DataFrame with a single column of type ArrayType.
Now let's solve the issue:
scala> :pa
// Entering paste mode (ctrl-D to finish)
import sqlContext.implicits._ // you don't need to import this in the shell.
val df = sc.parallelize(Seq("123,2016-11-09,1","124,2016-11-09,2","123,2016-11-10,1","123,2016-11-11,1","123,2016-11-12,1","124,2016-11-13,1","124,2016-11-14,1"))
.map{ _.split(",") match { case Array(a,b,c) => (a,b,c) }}.toDF("key","date","qty")
// Exiting paste mode, now interpreting.
// df: org.apache.spark.sql.DataFrame = [key: string, date: string, qty: string]
You can apply any filter you want and compute the aggregation needed, e.g.:
scala> val df2 = df.filter(col("key") === "124").agg(sum(col("qty")))
// df2: org.apache.spark.sql.DataFrame = [sum(qty): double]
scala> df2.show
// +--------+
// |sum(qty)|
// +--------+
// | 4.0|
// +--------+
PS: The above code has been tested in Spark 1.6.2 and 2.0.0
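To address the second point in the question (getting a plain number rather than a DataFrame), the aggregated value can be collected back to the driver; a minimal sketch:
import org.apache.spark.sql.functions.{col, sum}

// first() collects the single aggregated row; since qty was read as a string,
// sum() yields a double, which is then truncated to an Int.
val total: Int = df
  .filter(col("key") === "124")
  .agg(sum(col("qty")))
  .first()
  .getDouble(0)
  .toInt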

apache spark groupBy pivot function

I am new to Spark and using Spark 1.6.1. I am using the pivot function to create a new column based on an integer value. Say I have a CSV file like this:
year,winds
1990,50
1990,55
1990,58
1991,45
1991,42
1991,58
I am loading the csv file like this:
var df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/sample.csv")
I want to aggregate the winds column, filtering for winds greater than 55, so that I get an output file like this:
year, majorwinds
1990,2
1991,1
I am using the code below:
val df2=df.groupBy("major").pivot("winds").agg(>55)->"count")
But I get this error
error: expected but integer literal found
What is the correct syntax here? Thanks in advance
In your case, if you just want output like:
+----+----------+
|year|majorwinds|
+----+----------+
|1990| 2|
|1991| 1|
+----+----------+
It's not necessary to use pivot.
You can achieve this with filter, groupBy and count:
df.filter($"winds" >= 55)
.groupBy($"year")
.count()
.withColumnRenamed("count", "majorwinds")
.show()
Use this generic function to do the pivot (dynamicRow and getSchema are helper functions that are assumed to be defined elsewhere):
def transpose(sqlCxt: SQLContext, df: DataFrame, compositeId: Vector[String],
              pair: (String, String), distinctCols: Array[Any]): DataFrame = {
  // Key each row by its composite id and map the pivot column's value to the cell value
  val rdd = df.map { row =>
    (compositeId.collect { case id => row.getAs(id).asInstanceOf[Any] },
      scala.collection.mutable.Map(row.getAs(pair._1).asInstanceOf[Any] -> row.getAs(pair._2).asInstanceOf[Any]))
  }
  // Merge the per-row maps so each composite id ends up with one map of pivoted values
  val pairRdd = rdd.reduceByKey(_ ++ _)
  // dynamicRow and getSchema are the author's helpers, not shown here
  val rowRdd = pairRdd.map(r => dynamicRow(r, distinctCols))
  sqlCxt.createDataFrame(rowRdd, getSchema(compositeId ++ distinctCols))
}
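For comparison, Spark's built-in pivot on grouped data (available since 1.6) covers this particular case without dropping to RDDs; a minimal sketch that buckets winds into a boolean column first:
import org.apache.spark.sql.functions._

// Count rows per year, split by whether winds reached 55, using the built-in pivot.
df.withColumn("major", ($"winds" >= 55).cast("string"))
  .groupBy($"year")
  .pivot("major", Seq("true", "false"))
  .count()
  .show()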