select column of a dataframe using udf - scala

I use spark-shell and want to create a DataFrame (df2) from another DataFrame (df1) using select and a UDF. But there is an error when I try to show df2 with df2.show(1):
var df1 = sql(s"select * from table_1")
val slice = udf((items: Array[String]) =>
  if (items == null) items
  else {
    if (items.size <= 20)
      items
    else
      items.slice(0, 20)
  }
)
var df2 = df1.select($"col1", slice($"col2"))
and the df1 schema is:
scala> df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.printSchema
root
|-- col1: string (nullable = true)
|-- UDF(col2): array (nullable = true)
| |-- element: string (containsNull = true)
error:
Failed to execute user defined function($anonfun$1: (array<string>) => array<string>)

I used Seq[String] instead of Array[String] in the UDF and the issue was resolved.
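For reference, a minimal sketch of the corrected UDF (assuming the same 20-element cap): Spark passes an ArrayType column to a Scala UDF as a Seq (a WrappedArray), not as an Array, which is why the original version failed at runtime.
import org.apache.spark.sql.functions.udf
// Spark hands array<string> values to the UDF as Seq[String], so declare the parameter that way.
val slice = udf((items: Seq[String]) =>
  if (items == null) items
  else items.take(20) // take(20) covers both branches: it is a no-op for 20 elements or fewer
)
val df2 = df1.select($"col1", slice($"col2"))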

Related

How to convert two columns to List[(Long,Int)] - Spark UDF

I am new to Spark and am trying to call an existing converter method. I have a Spark DataFrame df with two columns. I want to convert these two columns to a List[(Long,Int)].
val df = spark.read.parquet(s"<file path>")
  .select(explode($"groups") as "groups")
  .select(explode($"groups.lanes") as "lanes")
  .select($"lanes.lane_path" as "lane_path")
Schema
root
|-- lane_path: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- diffs: array (nullable = true)
| | |-- element: integer (containsNull = true)
UDF
val generateValues = spark.udf.register("my_udf",
  (laneAttrs: List[(Long, Int)], x: Long, y: Int) => {
    Converter.convert(laneAttrs, x, y)
      .map {
        case coord @ Left(f) => throw new InterruptedException(s"$coord: $f")
        case Right(result) => Map("a" -> result.a, "b" -> result.b)
      }
  })
How can I pass "$lane_path.coordinates" and "$lane_path.diffs" as a List[(Long,Int)] to call the generateValues UDF?
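This question was left unanswered above; a hedged sketch of one way it could work, assuming Spark >= 2.4 for arrays_zip: Spark cannot hand an array column to a Scala UDF as List[(Long, Int)], it arrives as Seq[Row], so one can zip the two arrays into a single array of structs and convert inside a UDF. The toPairs helper below is hypothetical, not part of the original post.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{arrays_zip, udf}
// Turn the zipped array<struct<coordinates: long, diffs: int>> into a List[(Long, Int)].
val toPairs = udf((zipped: Seq[Row]) =>
  zipped.map(r => (r.getLong(0), r.getInt(1))).toList
)
val withPairs = df.withColumn(
  "lane_attrs",
  toPairs(arrays_zip($"lane_path.coordinates", $"lane_path.diffs"))
)
// lane_attrs (plus literal x and y values) could then be fed to a UDF like generateValues.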

Select column based on Map

I have the following dataframe df with the following schema:
|-- type: string (nullable = true)
|-- record_sales: array (nullable = false)
| |-- element: string (containsNull = false)
|-- record_marketing: array (nullable = false)
| |-- element: string (containsNull = false)
and a map
val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
I want a new column "record" that is either the value of record_sales or record_marketing based on the value of type.
I've tried some variants of this:
val typeMapCol = typedLit(typemap)
val df2 = df.withColumn("record", col(typeMapCol(col("type"))))
But nothing has worked. Does anyone have any idea? Thanks!
You can iterate over the map typemap and use the when function to build case/when expressions depending on the value of the type column:
val recordCol = typemap.map{case (k,v) => when(col("type") === k, col(v))}.toSeq
val df2 = df.withColumn("record", coalesce(recordCol: _*))
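A minimal end-to-end sketch of that approach; the sample rows below are made up for illustration, and the column names follow the question.
import org.apache.spark.sql.functions.{coalesce, col, when}
import spark.implicits._
val df = Seq(
  ("sales", Seq("s1", "s2"), Seq("m1")),
  ("marketing", Seq("s3"), Seq("m2", "m3"))
).toDF("type", "record_sales", "record_marketing")
val typemap = Map("sales" -> "record_sales", "marketing" -> "record_marketing")
val recordCol = typemap.map { case (k, v) => when(col("type") === k, col(v)) }.toSeq
df.withColumn("record", coalesce(recordCol: _*)).show(false)
// rows with type = "sales" get record_sales in "record", the others get record_marketing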

Spark - select multiple columns from array object

I have a dataset with following schema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- subEntities: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- status: string (nullable = true)
| | |-- subEntityId: long (nullable = true)
| | |-- subEntityName: string (nullable = true)
dataset.select($"id", $"name", $"subEntities.subEntityId", $"subEntities.subEntityName") put subEntityId and subEntityName into separate arrays. How to select multiple columns and put them into single array?
If working on Spark >= 2.4 you can use the transform function to generate an array which contains a subset of the original array's fields:
import org.apache.spark.sql.functions.expr
dataset.withColumn("newArray", expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))"))
// or with select
dataset.select(
  $"id",
  $"name",
  expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))").as("newArray")
)
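For reference, a hedged sketch of the schema this should produce (nullability flags omitted since they may vary): the new column is a single array whose elements are structs carrying both fields.
dataset
  .withColumn("newArray",
    expr("transform(subEntities, i -> struct(i.subEntityId, i.subEntityName))"))
  .printSchema()
// root
//  |-- id: string
//  |-- name: string
//  |-- subEntities: array (as before)
//  |-- newArray: array
//  |    |-- element: struct
//  |    |    |-- subEntityId: long
//  |    |    |-- subEntityName: string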
.withColumn("status",col("subEntities").getField("status"))
.withColumn("subEntityId",col("subEntities").getField("subEntityId"))
To extract value out of your array
Below is working example
import org.apache.spark.sql.functions._

object ExplodeArrauy {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._

    val df = List(
      bean57("1", Array(bean55("aaa", 2), bean55("aaa1", 21))),
      bean57("2", Array(bean55("bbb", 3), bean55("bbb3", 31)))
    ).toDF

    df
      .withColumn("status", col("subEntities").getField("status"))
      .withColumn("subEntityId", col("subEntities").getField("subEntityId"))
      .show()
  }
}

case class bean57(id: String, subEntities: Array[bean55])
case class bean55(status: String, subEntityId: Long)

scala.collection.mutable.ArrayBuffer cannot be cast to java.lang.Double (Spark)

I have a DataFrame like this:
root
|-- midx: double (nullable = true)
|-- future: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = false)
| | |-- _2: long (nullable = false)
Using this code I am trying to transform it into a DataFrame with the target schema shown below:
val T = withFfutures.where($"midx" === 47.0).select("midx", "future").collect().map((row: Row) =>
  Row {
    row.getAs[Seq[Row]]("future").map { case Row(e: Long, f: Long) =>
      (row.getAs[Double]("midx"), e, f)
    }
  }
).toList
root
|-- id: double (nullable = true)
|-- event: long (nullable = true)
|-- future: long (nullable = true)
So the plan is to turn the array of (event, future) into a DataFrame that has those two fields as columns. I am trying to convert T into a DataFrame like this:
val schema = StructType(Seq(
  StructField("id", DoubleType, nullable = true),
  StructField("event", LongType, nullable = true),
  StructField("future", LongType, nullable = true)
))
val df = sqlContext.createDataFrame(context.parallelize(T), schema)
But when I try to look into df I get this error:
java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to java.lang.Double
After a while I found the problem: first and foremost, the array of structs in the column has to be read as Row objects (Seq[Row]). So the final code to build the final DataFrame should look like this:
val T = withFfutures.select("midx", "future").collect().flatMap((row: Row) =>
  row.getAs[Seq[Row]]("future").map { case Row(e: Long, f: Long) =>
    (row.getAs[Double]("midx"), e, f)
  }.toList
).toList
val all = context.parallelize(T).toDF("id", "event", "future")
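A hedged alternative, not from the original post: the same reshaping can be done without collecting to the driver by exploding the array and flattening the struct fields directly.
import org.apache.spark.sql.functions.explode
val all = withFfutures
  .select($"midx".as("id"), explode($"future").as("fe"))
  .select($"id", $"fe._1".as("event"), $"fe._2".as("future"))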

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name.
for (i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i),
    df.columns(i).toLowerCase
  )
}
If structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
// |-- _1: long (nullable = false)
// |-- _2: string (nullable = true)
// |-- _3: string (nullable = true)
// |-- _4: double (nullable = false)
the simplest thing you can do is to use the toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
If you want to rename individual columns you can use either select with alias:
df.select($"_1".alias("x1"))
which can be easily generalized to multiple columns:
val lookup = Map("_1" -> "foo", "_3" -> "bar")
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
or withColumnRenamed:
df.withColumnRenamed("_1", "x1")
which can be used with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
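Applied to the original question (lowercasing every header), a minimal sketch using either of the approaches above:
// toDF with all names lowercased (assumes the lowercased names remain unique)
val lowered = df.toDF(df.columns.map(_.toLowerCase): _*)
// or fold withColumnRenamed over all columns
val lowered2 = df.columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, c.toLowerCase))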
With nested structures (structs) one possible option is renaming by selecting a whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
"""{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))
nested.printSchema
// root
// |-- foobar: struct (nullable = true)
// | |-- foo: struct (nullable = true)
// | | |-- bar: struct (nullable = true)
// | | | |-- first: double (nullable = true)
// | | | |-- second: double (nullable = true)
// |-- id: long (nullable = true)
@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")
nested.select(foobarRenamed, $"id").printSchema
// root
// |-- record: struct (nullable = false)
// | |-- location: struct (nullable = false)
// | | |-- point: struct (nullable = false)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
// |-- id: long (nullable = true)
Note that it may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
"struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
or:
import org.apache.spark.sql.types._
nested.select($"foobar".cast(
StructType(Seq(
StructField("location", StructType(Seq(
StructField("point", StructType(Seq(
StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
For those of you interested in the PySpark version (it is essentially the same in Scala):
merchants_df_renamed = merchants_df.toDF(
'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()
Result:
root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map { c => t.col(c).as(p + c + s) }: _*)
In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
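A hedged usage sketch of that join scenario; the DataFrame and column names here are made up for illustration.
// Suffix every column on each side, then join; same-named columns stay distinguishable.
val left = aliasAllColumns(ordersDf, s = "_l")
val right = aliasAllColumns(customersDf, s = "_r")
val joined = left.join(right, left("customer_id_l") === right("customer_id_r"))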
Suppose the dataframe df has 3 columns id1, name1, price1
and you wish to rename them to id2, name2, price2
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)
I found this approach useful in many cases.
Sometimes column names in a SQL Server or MySQL table have the following format:
Ex: Account Number, customer number
But Hive tables do not support column names containing spaces, so please use the solution below to rename your old column names.
Solution:
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
Two-table join without renaming the join key:
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
  day1 = day1.withColumnRenamed(x, y)
}
works!