I am working with Spark 2.3.2.
On one column within my DataFrame I am performing many spark.sql.functions sequentially. How can I wrap this sequence of functions into a user-defined function (UDF) to make it reusable?
Here is my example, focusing on the single column "columnName". First I create my test data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructType}

val testSchema = new StructType()
.add("columnName", new StructType()
.add("2020-11", LongType)
.add("2020-12", LongType)
)
val testRow = Seq(Row(Row(1L, 2L)))
val testRDD = spark.sparkContext.parallelize(testRow)
val testDF = spark.createDataFrame(testRDD, testSchema)
testDF.printSchema()
/*
root
|-- columnName: struct (nullable = true)
| |-- 2020-11: long (nullable = true)
| |-- 2020-12: long (nullable = true)
*/
testDF.show(false)
/*
+----------+
|columnName|
+----------+
|[1, 2] |
+----------+
*/
And here is the sequence of applied Spark SQL functions (just as an example):
val testResult = testDF
.select(explode(split(regexp_replace(to_json(col("columnName")), "[\"{}]", ""), ",")).as("result"))
I am failing to create a UDF "myUDF" such that I can get the same result when calling
val testResultWithUDF = testDF.select(myUDF(col("columnName")))
This is what I "would like" to do:
def parseAndExplode(spalte: Column): Column = {
explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}
val myUDF = udf(parseAndExplode _)
testDF.withColumn("udf_result", myUDF(col("columnName"))).show(false)
but it is throwing an Exception:
Schema for type org.apache.spark.sql.Column is not supported
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
I also tried using a Row as the input parameter, but then again failed when trying to apply the built-in SQL functions.
There is no need to use a UDF here. explode, split and most other functions from org.apache.spark.sql.functions already return an object of type Column.
def parseAndExplode(spalte: Column): Column = {
explode(split(regexp_replace(to_json(spalte), "[\"{}]", ""), ","))
}
testDF.withColumn("udf_result",parseAndExplode('columnName)).show(false)
prints
+----------+----------+
|columnName|udf_result|
+----------+----------+
|[1, 2] |2020-11:1 |
|[1, 2] |2020-12:2 |
+----------+----------+
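That said, if you really do need a genuine UDF (for example to embed custom Scala logic that the built-in functions cannot express), here is a minimal sketch. It assumes the struct column arrives in the UDF as a Row and that its values render sensibly with toString; the name structToPairs is just illustrative:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, explode, udf}

// Pair each field name of the struct with its value, e.g. "2020-11:1".
// explode stays outside the UDF, because a UDF cannot return a generator.
val structToPairs = udf { (r: Row) =>
  r.schema.fieldNames.zip(r.toSeq).map { case (k, v) => s"$k:$v" }
}

testDF.select(explode(structToPairs(col("columnName"))).as("result")).show(false)
This should produce the same result rows as the built-in chain above.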
I need help converting a flat dataset into a nested format using Apache Spark / Scala.
Is it possible to automatically create a nested structure derived from input column namespaces
[level 1].[level 2]? In my example, the nesting level is determined by the period symbol '.' within the column headers.
I assume this is possible to achieve using a map function. I am open to alternative solutions, particularly if there is a more elegant way of achieving the same outcome.
package org.acme.au
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import scala.collection.Seq
object testNestedObject extends App {
// Configure spark
val spark = SparkSession.builder()
.appName("Spark batch demo")
.master("local[*]")
.config("spark.driver.host", "localhost")
.getOrCreate()
// Start spark
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
// Define schema for input data
val flatSchema = new StructType()
.add(StructField("id", StringType, false))
.add(StructField("name", StringType, false))
.add(StructField("custom_fields.fav_colour", StringType, true))
.add(StructField("custom_fields.star_sign", StringType, true))
// Create a row with dummy data
val row1 = Row("123456", "John Citizen", "Blue", "Scorpio")
val row2 = Row("990087", "Jane Simth", "Green", "Taurus")
val flatData = Seq(row1, row2)
// Convert into dataframe
val dfIn = spark.createDataFrame(spark.sparkContext.parallelize(flatData), flatSchema)
// Print to console
dfIn.printSchema()
dfIn.show()
// Convert flat data into nested structure as either Parquet or JSON format
val dfOut = dfIn.rdd
.map(
row => ( /* TODO: Need help with mapping flat data to nested structure derived from input column namespaces
*
* For example:
*
* <id>12345</id>
* <name>John Citizen</name>
* <custom_fields>
* <fav_colour>Blue</fav_colour>
* <star_sign>Scorpio</star_sign>
* </custom_fields>
*
*/ ))
// Stop spark
sc.stop()
}
This solution is for the revised requirement that the JSON output would consist of an array of {K:valueK, V:valueV} rather than {valueK1: valueV1, valueK2: valueV2, ...}. For example:
// FROM:
"custom_fields":{"fav_colour":"Blue", "star_sign":"Scorpio"}
// TO:
"custom_fields":[{"key":"fav_colour", "value":"Blue"}, {"key":"star_sign", "value":"Scorpio"}]
Sample code below:
import org.apache.spark.sql.functions._
val dfIn = Seq(
(123456, "John Citizen", "Blue", "Scorpio"),
(990087, "Jane Simth", "Green", "Taurus")
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign")
val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
// Array(custom_fields.fav_colour, custom_fields.star_sign)
val structColsMap = structCols.map(_.split("\\.")).
groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
// Map(custom_fields -> Array(fav_colour, star_sign))
val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
val cols = kv._2.map( v =>
struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").as("value"))
)
accDF.withColumn(kv._1, array(cols: _*))
}
val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)
dfResult.show(false)
// +------+------------+----------------------------------------+
// |id |name |custom_fields |
// +------+------------+----------------------------------------+
// |123456|John Citizen|[[fav_colour,Blue], [star_sign,Scorpio]]|
// |990087|Jane Simth |[[fav_colour,Green], [star_sign,Taurus]]|
// +------+------------+----------------------------------------+
dfResult.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- custom_fields: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- key: string (nullable = false)
// | | |-- value: string (nullable = true)
dfResult.toJSON.show(false)
// +-------------------------------------------------------------------------------------------------------------------------------+
// |value |
// +-------------------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","custom_fields":[{"key":"fav_colour","value":"Blue"},{"key":"star_sign","value":"Scorpio"}]}|
// |{"id":990087,"name":"Jane Simth","custom_fields":[{"key":"fav_colour","value":"Green"},{"key":"star_sign","value":"Taurus"}]} |
// +-------------------------------------------------------------------------------------------------------------------------------+
Note that we cannot make value type Any to accommodate a mix of different types, as Spark DataFrame API doesn't support type Any. As a consequence, the value in the array must be of a given type (e.g. String). Like the previous solution, this also handles only up to one nested level.
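If the flat columns had mixed types, one possible workaround is to cast every value to a common type while building the key/value structs. A minimal sketch reusing structColsMap from above, with string chosen as the common type for illustration:
val dfExpandedStr = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    // cast each value to string so all array elements share one type
    struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").cast("string").as("value"))
  )
  accDF.withColumn(kv._1, array(cols: _*))
}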
This can be solved with a dedicated case class and a UDF that converts the input data into case class instances. For example:
Define the case class
case class NestedFields(fav_colour: String, star_sign: String)
Define the UDF that takes the original column values as input and returns an instance of NestedFields:
private val asNestedFields = udf((fc: String, ss: String) => NestedFields(fc, ss))
Transform the original DataFrame and drop the flat columns:
val res = dfIn.withColumn("custom_fields", asNestedFields($"`custom_fields.fav_colour`", $"`custom_fields.star_sign`"))
.drop($"`custom_fields.fav_colour`")
.drop($"`custom_fields.star_sign`")
It produces
root
|-- id: string (nullable = false)
|-- name: string (nullable = false)
|-- custom_fields: struct (nullable = true)
| |-- fav_colour: string (nullable = true)
| |-- star_sign: string (nullable = true)
+------+------------+---------------+
| id| name| custom_fields|
+------+------------+---------------+
|123456|John Citizen|[Blue, Scorpio]|
|990087| Jane Simth|[Green, Taurus]|
+------+------------+---------------+
Here's a generalized solution that first assembles a Map of column names that contain a '.', traverses the Map to add converted struct columns to the DataFrame, and finally drops the original columns containing the '.'. A slightly more generalized dfIn is used as the sample data.
import org.apache.spark.sql.functions._
val dfIn = Seq(
(123456, "John Citizen", "Blue", "Scorpio", "a", 1),
(990087, "Jane Simth", "Green", "Taurus", "b", 2)
).
toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign", "s.c1", "s.c2")
val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
// Array(custom_fields.fav_colour, custom_fields.star_sign, s.c1, s.c2)
val structColsMap = structCols.map(_.split("\\.")).
groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
// Map(s -> Array(c1, c2), custom_fields -> Array(fav_colour, star_sign))
val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "." + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)
dfResult.show
// +------+------------+-----+--------------+
// |id |name |s |custom_fields |
// +------+------------+-----+--------------+
// |123456|John Citizen|[a,1]|[Blue,Scorpio]|
// |990087|Jane Simth |[b,2]|[Green,Taurus]|
// +------+------------+-----+--------------+
dfResult.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = true)
// |-- s: struct (nullable = false)
// | |-- c1: string (nullable = true)
// | |-- c2: integer (nullable = false)
// |-- custom_fields: struct (nullable = false)
// | |-- fav_colour: string (nullable = true)
// | |-- star_sign: string (nullable = true)
Note that this solution handles only up to one nested level.
To convert each row to JSON format, consider using toJSON as follows:
dfResult.toJSON.show(false)
// +---------------------------------------------------------------------------------------------------------------------+
// |value |
// +---------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","s":{"c1":"a","c2":1},"custom_fields":{"fav_colour":"Blue","star_sign":"Scorpio"}}|
// |{"id":990087,"name":"Jane Simth","s":{"c1":"b","c2":2},"custom_fields":{"fav_colour":"Green","star_sign":"Taurus"}} |
// +---------------------------------------------------------------------------------------------------------------------+
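The original question also mentions Parquet as an output format; once dfResult has the nested schema, writing it out is just a call to the DataFrameWriter. A minimal sketch with placeholder output paths:
// Persist the nested result as Parquet or JSON files (paths are placeholders)
dfResult.write.mode("overwrite").parquet("/tmp/nested_output_parquet")
dfResult.write.mode("overwrite").json("/tmp/nested_output_json")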
I'm reading a JSON file into a spark data frame in Scala. I have a JSON field like
"areaGlobalIdList":[2389,3,2,1,2147,2142,2518]
Spark is automatically inferring the datatype of this field as array[long]. I tried concat_ws, but it seems to work only with array[string]. When I tried converting this to array[string], the output showed up as
scala> val cmrdd = sc.textFile("/user/nkthn/cm.json")
scala> val cmdf = sqlContext.read.json(cmrdd)
scala> val dfResults = cmdf.select($"areaGlobalIdList".cast(StringType)).withColumn("AREAGLOBALIDLIST", regexp_replace($"areaGlobalIdList" , ",", "." ))
scala> dfResults.show(20,false)
+------------------------------------------------------------------+
|AREAGLOBALIDLIST |
+------------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#6364b584|
+------------------------------------------------------------------+
I'm expecting the output to be
[2389.3.2.1.2147.2142.2518]
Any assistance is greatly appreciated.
Given the schema of the areaGlobalIdList column as
|-- areaGlobalIdList: array (nullable = true)
| |-- element: long (containsNull = false)
You can achieve this with a simple udf function as follows:
import org.apache.spark.sql.functions._
val concatWithDot = udf((array: collection.mutable.WrappedArray[Long]) => array.mkString("."))
df.withColumn("areaGlobalIdList", concatWithDot($"areaGlobalIdList")).show(false)
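Alternatively, if you would rather avoid a UDF, you could cast the array elements to strings and then use concat_ws. A sketch, assuming your Spark version supports casting array<long> to array<string>:
import org.apache.spark.sql.functions.concat_ws
df.withColumn("areaGlobalIdList", concat_ws(".", $"areaGlobalIdList".cast("array<string>"))).show(false)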
I would like to read HBase data in Spark Streaming code, for lookups and further enrichment of the streaming data. I am using spark-hbase-connector_2.10-1.0.3.jar.
In my code, the following lines run successfully:
val docRdd =
sc.hbaseTable[(Option[String], Option[String])]("hbase_customer_profile")
.select("id","gender").inColumnFamily("data")
docRdd.count returns the right count.
docRdd is of type
HBaseReaderBuilder(org.apache.spark.SparkContext@3a49e5,hbase_customer_profile,Some(data),WrappedArray(id, gender),None,None,List())
How can I read all the rows in the id and gender columns? Also, how can I convert docRdd into a DataFrame so that Spark SQL can be used?
You can read all rows from the RDD using
docRdd.collect().foreach(println)
To convert the RDD to a DataFrame you could define a case class:
case class Customer(rowKey: String, id: Option[String], gender: Option[String])
I have added the row key to the case class; that's not strictly necessary, so if you don't need it, you can omit it.
Then map over the RDD:
// Row key, id, gender
type Record = (String, Option[String], Option[String])
val rdd =
sc.hbaseTable[Record]("customers")
.select("id","gender")
.inColumnFamily("data")
.map(r => Customer(r._1, r._2, r._3))
and then - based on the case class - convert the RDD to a DataFrame
import sqlContext.implicits._
val df = rdd.toDF()
df.show()
df.printSchema()
The output from spark-shell looks like this:
scala> df.show()
+---------+----+------+
| rowKey| id|gender|
+---------+----+------+
|customer1| 1| null|
|customer2|null| f|
|customer3| 3| m|
+---------+----+------+
scala> df.printSchema()
root
|-- rowKey: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
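Since the question also asks about using Spark SQL, you can register the DataFrame as a temporary table and query it. A minimal sketch using the pre-2.0 API that matches the rest of this code (in Spark 2.x you would call createOrReplaceTempView instead):
df.registerTempTable("customers")
sqlContext.sql("SELECT rowKey, id, gender FROM customers WHERE gender = 'f'").show()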
Let's assume that we have a Spark DataFrame
df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame
with the following schema
df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
| |-- element: string (containsNull = true)
Given that each row of the tk column is an array of strings, how to write a Scala function that will return the number of elements in each row?
You don't have to write a custom function because there is one:
import org.apache.spark.sql.functions.size
df.select(size($"tk"))
If you really want to, you can write a UDF:
import org.apache.spark.sql.functions.udf
val size_ = udf((xs: Seq[String]) => xs.size)
or even create a custom expression, but there is really no point in that.
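For completeness, a minimal usage sketch on a small sample DataFrame (assuming the shell implicits, e.g. import sqlContext.implicits._, are in scope):
import org.apache.spark.sql.functions.size
val sample = Seq(("a", Seq("x", "y", "z")), ("b", Seq("x"))).toDF("rawFV", "tk")
sample.select($"rawFV", size($"tk").as("tk_size")).show()
// tk_size is 3 for the first row and 1 for the second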
One way is to access the array elements using SQL, as below.
df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
df2.show()
To get size of array column,
val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()
If your Spark version is older, you can use HiveContext instead of Spark's SQL Context.
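For reference, a minimal sketch of creating a HiveContext in Spark 1.x (assuming your Spark build includes Hive support):
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc) // sc is an existing SparkContext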
I would also consider an approach that traverses the array elements directly.