I am looking for a way to add a new column to a DataFrame in Scala that calculates the row-wise min/max of the values in col1, col2, ..., col10.
I know I can do it with a UDF but maybe there is an easier way.
Thanks!
Porting this Python answer by user6910411:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, 3, 0, 9, "a", "b", "c")
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "Col7")
val cols = Seq("col1", "col2", "col3", "col4")
val rowMax = greatest(
cols map col: _*
).alias("max")
val rowMin = least(
cols map col: _*
).alias("min")
df.select($"*", rowMin, rowMax).show
// +----+----+----+----+----+----+----+---+---+
// |col1|col2|col3|col4|col5|col6|Col7|min|max|
// +----+----+----+----+----+----+----+---+---+
// | 1| 3| 0| 9| a| b| c| 0| 9|
// +----+----+----+----+----+----+----+---+---+
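If you really have ten columns as in the question, you can build the column list programmatically instead of typing it out (a small sketch, assuming the columns are named col1 through col10):
val cols = (1 to 10).map(i => s"col$i")
val rowMin = least(cols.map(col): _*).alias("min")
val rowMax = greatest(cols.map(col): _*).alias("max")
df.select($"*", rowMin, rowMax)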
Is it possible to evaluate formulas in a DataFrame that refer to columns? For example, if I have data like this (Scala example):
val df = Seq(
( 1, "(a+b)/d", 1, 20, 2, 3, 1 ),
( 2, "(c+b)*(a+e)", 0, 1, 2, 3, 4 ),
( 3, "a*(d+e+c)", 7, 10, 6, 2, 1 )
)
.toDF( "Id", "formula", "a", "b", "c", "d", "e" )
df.show()
Expected results: a result column with each row's formula evaluated against that row's column values, i.e. 7 for Id 1, 12 for Id 2, and 63 for Id 3.
I have been unable to get selectExpr, expr, eval() or combinations of them to work.
You can use the Scala toolbox eval in a UDF:
import org.apache.spark.sql.functions.{array, col, udf}
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val cols = df.columns.tail
val eval_udf = udf(
(r: Seq[String]) =>
tb.eval(tb.parse(
("val %s = %s;" * cols.tail.size).format(
cols.tail.zip(r.tail).flatMap(x => List(x._1, x._2)): _*
) + r(0)
)).toString
)
val df2 = df.select(col("id"), eval_udf(array(df.columns.tail.map(col):_*)).as("result"))
df2.show
+---+------+
| id|result|
+---+------+
| 1| 7|
| 2| 12|
| 3| 63|
+---+------+
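For intuition: for the first row the string handed to tb.parse is roughly val a = 1;val b = 20;val c = 2;val d = 3;val e = 1;(a+b)/d, which the toolbox evaluates to 7 (integer division).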
A slightly different version of mck's answer: replace the variables in the formula column with their corresponding values from the other columns, then call an eval UDF:
import org.apache.spark.sql.functions.{expr, udf}
import spark.implicits._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val eval = udf((f: String) => {
val toolbox = currentMirror.mkToolBox()
toolbox.eval(toolbox.parse(f)).toString
})
val formulaExpr = expr(df.columns.drop(2).foldLeft("formula")((acc, c) => s"replace($acc, '$c', $c)"))
df.select($"Id", eval(formulaExpr).as("result")).show()
//+---+------+
//| Id|result|
//+---+------+
//| 1| 7|
//| 2| 12|
//| 3| 63|
//+---+------+
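For reference, the foldLeft above builds a nested SQL expression of the form replace(replace(replace(replace(replace(formula, 'a', a), 'b', b), 'c', c), 'd', d), 'e', e), so for the first row the eval UDF receives the string (1+20)/3.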
Given a DataFrame like the following:
import spark.implicits._
val df1 = List(
("id1", Array(0,2)),
("id1",Array(2,1)),
("id2",Array(0,3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[0, 2]|
|id1|[2, 1]|
|id2|[0, 3]|
+---+------+
I want to groupBy id and take the element-wise max (max pooling) of the value arrays. The max-pooled value for id1 is Array(2, 2). The result I want is:
import spark.implicits._
val res = List(
("id1", Array(2,2)),
("id2",Array(0,3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[2, 2]|
|id2|[0, 3]|
+---+------+
import spark.implicits._
val df1 = List(
("id1", Array(0,2,3)),
("id1",Array(2,1,4)),
("id2",Array(0,7,3))
).toDF("id", "value")
val df2rdd = df1.rdd
.map(x => (x(0).toString,x.getSeq[Int](1)))
.reduceByKey((x,y) => {
val arrlength = x.length
var i = 0
val resarr = scala.collection.mutable.ArrayBuffer[Int]()
while(i < arrlength){
if (x(i) >= y(i)){
resarr.append(x(i))
} else {
resarr.append(y(i))
}
i += 1
}
resarr
}).toDF("id","newvalue")
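If you would rather stay in the DataFrame API and are on Spark 2.4 or later (an assumption about your version, not stated in the question), the same element-wise max can be sketched with the built-in higher-order functions instead of reduceByKey:
import org.apache.spark.sql.functions._
// collect the arrays per id, then fold them with an element-wise greatest
val pooled = df1
  .groupBy("id")
  .agg(collect_list("value").as("values"))
  .withColumn("value", expr(
    "aggregate(slice(values, 2, size(values)), values[0], " +
    "(acc, x) -> zip_with(acc, x, (a, b) -> greatest(a, b)))"))
  .drop("values")
pooled.show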
You can do it like below:
//Input df
+---+---------+
| id| value|
+---+---------+
|id1|[0, 2, 3]|
|id1|[2, 1, 4]|
|id2|[0, 7, 3]|
+---+---------+
//Solution approach:
import org.apache.spark.sql.functions.{collect_set, udf}
val grouped = df1.groupBy("id").agg(collect_set("value").as("value"))
val maxUDF = udf { (s: Seq[Seq[Int]]) =>
  s.reduceLeft((prev, next) => prev.zip(next).map(tup => if (tup._1 > tup._2) tup._1 else tup._2))
}
grouped.withColumn("value", maxUDF(grouped.col("value"))).show
//Sample Output:
+---+---------+
| id| value|
+---+---------+
|id1|[2, 2, 4]|
|id2|[0, 7, 3]|
+---+---------+
I hope this helps.
I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set; however, it gives me an error saying grouping expressions sequence is empty. Is it possible to find the number of distinct values in each row's array? That way, in the same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is an aggregation function, hence it should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
import spark.implicits._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
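If you are on Spark 2.4 or later (an assumption, the question does not say), the UDF can also be avoided with the built-in array_distinct and size functions:
import org.apache.spark.sql.functions.{array_distinct, size}
// count the distinct elements of the array column without a UDF
codeDF.withColumn("carrier_count", size(array_distinct($"carrier_code"))).show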
Without a UDF, using RDD conversion and back to a DF, for posterity:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map { case (k, vL) => (k, vL.toSet.size) }
y.collect
// Manipulate back to your DF, via conversion, join, what not.
Returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; as stated, this one is more for posterity.
You can also use a UDF and do it like this.
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF = udf { (str: String) => val strArr = str.split(":"); strArr(0) + ":" + strArr(1).stripPrefix("[").stripSuffix("]").split(",").distinct.length.toString }
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:2|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:2|
+-----------+-------------+
Using Python's Pandas, one can do bulk operations on multiple columns in one pass like this:
# assuming we have a DataFrame with, among others, the following columns
cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
df[cols] = df[cols] / df['another_column']
Is there a similar functionality using Spark in Scala?
Currently I end up doing:
val df2 = df.withColumn("col1", $"col1" / $"another_column")
.withColumn("col2", $"col2" / $"another_column")
.withColumn("col3", $"col3" / $"another_column")
.withColumn("col4", $"col4" / $"another_column")
.withColumn("col5", $"col5" / $"another_column")
.withColumn("col6", $"col6" / $"another_column")
.withColumn("col7", $"col7" / $"another_column")
.withColumn("col8", $"col8" / $"another_column")
You can use foldLeft to process the column list as below:
val df = Seq(
(1, 20, 30, 4),
(2, 30, 40, 5),
(3, 10, 30, 2)
).toDF("id", "col1", "col2", "another_column")
val cols = Array("col1", "col2")
val df2 = cols.foldLeft( df )( (acc, c) =>
acc.withColumn( c, df(c) / df("another_column") )
)
df2.show
+---+----+----+--------------+
| id|col1|col2|another_column|
+---+----+----+--------------+
| 1| 5.0| 7.5| 4|
| 2| 6.0| 8.0| 5|
| 3| 5.0|15.0| 2|
+---+----+----+--------------+
For completeness: a slightly different version from Leo C's, not using foldLeft but a single select expression instead:
import org.apache.spark.sql.functions._
import spark.implicits._
val toDivide = List("col1", "col2")
val newColumns = toDivide.map(name => col(name) / col("another_column") as name)
val df2 = df.select(($"id" :: newColumns) :+ $"another_column": _*)
Produces the same output.
You can use a plain select on the operated columns. The solution is very similar to the Python Pandas solution.
//Define the dataframe df1
case class ARow(col1: Int, col2: Int, anotherCol: Int)
val df1 = spark.createDataset(Seq(
ARow(1, 2, 3),
ARow(4, 5, 6),
ARow(7, 8, 9))).toDF
// Perform the operation using a map
val cols = Array("col1", "col2")
val opCols = cols.map(c => df1(c)/df1("anotherCol"))
// Select the columns operated
val df2 = df1.select(opCols: _*)
Calling .show on df2:
df2.show()
+-------------------+-------------------+
|(col1 / anotherCol)|(col2 / anotherCol)|
+-------------------+-------------------+
| 0.3333333333333333| 0.6666666666666666|
| 0.6666666666666666| 0.8333333333333334|
| 0.7777777777777778| 0.8888888888888888|
+-------------------+-------------------+
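If you want to keep the original column names (closer to the Pandas behaviour), a small variation is to alias each expression back to its source column name; this is just a sketch building on the code above:
// alias the divided columns back to their original names and keep anotherCol
val namedCols = cols.map(c => (df1(c) / df1("anotherCol")).as(c))
val df3 = df1.select(namedCols :+ df1("anotherCol"): _*)
df3.show()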
I know I can extract columns like this:
userData1.select(userData1("job"))
But what if I already have a column, or an array of columns? How do I get a DataFrame out of it? What has worked for me so far is:
userData1.select(userData1("id"), userData1("age"))
This is a bit verbose and ugly compared to what you can do in R:
userData1[, c("id", "age")]
And what about rows? For example:
userData1.head(5)
How do you convert this into a new dataframe?
To select multiple columns you can use varargs syntax:
import org.apache.spark.sql.DataFrame
val df: DataFrame = sc.parallelize(Seq(
(1, "x", 2.0), (2, "y", 3.0), (3, "z", 4.0)
)).toDF("foo", "bar", "baz")
val fooAndBaz: DataFrame = df.select(Seq($"foo", $"baz"): _*)
// or df.select(Seq("foo", "baz") map col: _*)
fooAndBaz.show
// +---+---+
// |foo|baz|
// +---+---+
// | 1|2.0|
// | 2|3.0|
// | 3|4.0|
// +---+---+
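Note that select also has a String varargs overload, so for plain column names the shortest form is simply:
df.select("foo", "baz").show
// same output as above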
The PySpark equivalent is arguments unpacking:
df = sqlContext.createDataFrame(
[(1, "x", 2.0), (2, "y", 3.0), (3, "z", 4.0)],
("foo", "bar", "baz"))
df.select(*["foo", "baz"]).show()
## +---+---+
## |foo|baz|
## +---+---+
## | 1|2.0|
## | 2|3.0|
## | 3|4.0|
## +---+---+
To limit the number of rows without collecting, you can use the limit method:
val firstTwo: DataFrame = df.limit(2)
firstTwo.show
// +---+---+---+
// |foo|bar|baz|
// +---+---+---+
// | 1| x|2.0|
// | 2| y|3.0|
// +---+---+---+
which is equivalent to the SQL LIMIT clause:
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df LIMIT 2").show
// +---+---+---+
// |foo|bar|baz|
// +---+---+---+
// | 1| x|2.0|
// | 2| y|3.0|
// +---+---+---+
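If you really need a DataFrame built from the rows that head(n) returns (an Array[Row] collected to the driver), one possible sketch is below; in general limit is preferable because it never collects the rows:
// rebuild a DataFrame from rows already collected with head (illustrative sketch)
val firstRows = df.head(2)   // Array[Row] on the driver
val rebuilt = sqlContext.createDataFrame(sc.parallelize(firstRows), df.schema)
rebuilt.show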