I am new to Spark and Scala. I have a JSON array struct as input, similar to the schema below.
root
|-- entity: struct (nullable = true)
| |-- email: string (nullable = true)
| |-- primaryAddresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- postalCode: string (nullable = true)
| | | |-- streetAddress: struct (nullable = true)
| | | | |-- line1: string (nullable = true)
I flattened the array struct to the below sample Dataframe
+-------------+--------------------------------------+--------------------------------------+
|entity.email |entity.primaryAddresses[0].postalCode |entity.primaryAddresses[1].postalCode |....
+-------------+--------------------------------------+--------------------------------------+
|a@b.com      |                                      |                                      |
|a@b.com      |                                      |12345                                 |
|a@b.com      |12345                                 |                                      |
|a@b.com      |0                                     |0                                     |
+-------------+--------------------------------------+--------------------------------------+
My end goal is to calculate presence/absence/zero counts for each of the columns for data quality metrics. But before I calculate the data quality metrics, I am looking for an approach to derive one new column for each of the array column elements, as below, such that:
if all values of a particular array element are empty, then the derived column is empty for that element
if at least one value is present for an array element, the element presence is considered 1
if all values of an array element are zero, then I mark the element as zero (I calibrate this as presence = 1 and zero = 1 when I calculate data quality later)
Below is a sample intermediate dataframe that I am trying to achieve, with a column derived for each of the array elements. The original array element columns are dropped.
+-------------+--------------------------------------+
|entity.email |entity.primaryAddresses.postalCode    |.....
+-------------+--------------------------------------+
|a@b.com      |                                      |
|a@b.com      |1                                     |
|a@b.com      |1                                     |
|a@b.com      |0                                     |
+-------------+--------------------------------------+
The input JSON record elements are dynamic and can change. To derive a column for each array element, I build a Scala map whose key is the column name without the array index (example: entity.primaryAddresses.postalCode) and whose value is the list of array-element columns to run rules on for that key. I am looking for an approach to achieve the above intermediate data frame.
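For illustration, this is roughly the shape of the map I build (the key and the column names below are just the ones from the sample schema above):
// Key: column name without the array index; value: the flattened, indexed columns to run rules on
val rulesMap: Map[String, Seq[String]] = Map(
  "entity.primaryAddresses.postalCode" -> Seq(
    "entity.primaryAddresses[0].postalCode",
    "entity.primaryAddresses[1].postalCode"
  )
)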
One concern is that for certain input files, after I flatten the dataframe the column count exceeds 70k+. And since the record count is expected to be in the millions, I am wondering whether, instead of flattening the JSON, I should explode each of the elements for better performance.
Appreciate any ideas. Thank you.
I created a helper function so you can directly call df.explodeColumns on a DataFrame.
The code below will flatten multi-level array and struct type columns.
Use the function below to extract the columns, then apply your transformations on the result.
scala> df.printSchema
root
|-- entity: struct (nullable = false)
| |-- email: string (nullable = false)
| |-- primaryAddresses: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- postalCode: string (nullable = false)
| | | |-- streetAddress: struct (nullable = false)
| | | | |-- line1: string (nullable = false)
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.annotation.tailrec
import scala.util.Try
implicit class DFHelpers(df: DataFrame) {
  // Expands one level of struct columns into "<parent>_<field>" columns.
  def columns = {
    df.schema.fields.flatMap { data =>
      data match {
        case column if column.dataType.isInstanceOf[StructType] =>
          column.dataType.asInstanceOf[StructType].fields.map { field =>
            val columnName = column.name
            val fieldName = field.name
            col(s"${columnName}.${fieldName}").as(s"${columnName}_${fieldName}")
          }.toList
        case column => List(col(s"${column.name}"))
      }
    }
  }

  // Recursively flattens struct columns until none are left.
  def flatten: DataFrame = {
    val empty = df.schema.filter(_.dataType.isInstanceOf[StructType]).isEmpty
    empty match {
      case false => df.select(columns: _*).flatten
      case _ => df
    }
  }

  // Explodes every array column (explode_outer keeps rows with null/empty arrays)
  // and flattens the result until no array columns remain.
  def explodeColumns = {
    @tailrec
    def columns(cdf: DataFrame): DataFrame = cdf.schema.fields.filter(_.dataType.typeName == "array") match {
      case c if !c.isEmpty => columns(c.foldLeft(cdf)((dfa, field) => {
        dfa.withColumn(field.name, explode_outer(col(s"${field.name}"))).flatten
      }))
      case _ => cdf
    }
    columns(df.flatten)
  }
}
scala> df.explodeColumns.printSchema
root
|-- entity_email: string (nullable = false)
|-- entity_primaryAddresses_postalCode: string (nullable = true)
|-- entity_primaryAddresses_streetAddress_line1: string (nullable = true)
You can leverage a custom user-defined function (UDF) to compute the data quality metrics.
val postalUdf = udf((postalCode0: String, postalCode1: String) => {
  // Presence/zero rule from the question: empty if all values are empty,
  // "0" if all present values are zero, "1" if at least one value is present.
  val present = Seq(postalCode0, postalCode1).filter(v => v != null && v.nonEmpty)
  if (present.isEmpty) null else if (present.forall(_ == "0")) "0" else "1"
})
Then use it to create a new dataframe column:
df
.withColumn("postcalCode", postalUdf(col("postalCode_0"), col("postalCode_1")))
.show()
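Since the set of array columns is dynamic, you could also drive the same rule from the map described in the question (derived column name -> flattened source columns). A minimal sketch, assuming a hypothetical rulesMap of that shape; the UDF below generalizes the presence/zero rule to any number of source columns:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, udf}

// null if all values are empty, "0" if all present values are zero, "1" otherwise
val presenceUdf = udf((values: Seq[String]) => {
  val present = Option(values).getOrElse(Seq.empty).filter(v => v != null && v.nonEmpty)
  if (present.isEmpty) null else if (present.forall(_ == "0")) "0" else "1"
})

// rulesMap is hypothetical: derived column name -> flattened, indexed source column names
def deriveColumns(flatDf: DataFrame, rulesMap: Map[String, Seq[String]]): DataFrame =
  rulesMap.foldLeft(flatDf) { case (acc, (derivedName, sourceCols)) =>
    acc
      .withColumn(derivedName, presenceUdf(array(sourceCols.map(c => col(s"`$c`")): _*)))
      .drop(sourceCols: _*)
  }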
Spark 2.4.5
In my data frame, I have an array of structs, and the array holds snapshots of a field over time.
Now, I am looking for a way to keep only the snapshots where the data has changed.
My schema is as below
root
|-- fee: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- updated_at: long (nullable = true)
| | |-- fee: float (nullable = true)
|-- status: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- updated_at: long (nullable = true)
| | |-- status: string (nullable = true)
Existing output:
+------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
|fee |status |
+------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
|[[1584579671000, 12.11], [1584579672000, 12.11], [1584579673000, 12.11]]|[[1584579671000, Closed-A], [1584579672000, Closed-A], [1584579673000, Closed-B], [1584579674000, Closed], [1584579675000, Closed-A]]|
+------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
As the 'fee' value has not changed, it should have only one entry.
As the status has changed a few times, the output would be [[1584579671000, Closed-A], [1584579673000, Closed-B], [1584579674000, Closed], [1584579675000, Closed-A]].
Note that the status 'Closed-A' appears twice.
Trying to get the below output:
+------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
|fee |status |
+------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
|[[1584579671000, 12.11]]|[[1584579671000, Closed-A], [1584579673000, Closed-B], [1584579674000, Closed], [1584579675000, Closed-A]]|
+------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
Note: Trying not to have a user-defined function.
Using the Spark Dataframe APIs, the above problem can be approached as follows: add a monotonically increasing id to uniquely identify each record; explode and flatten the dataframe; group by fee and status separately (as per the requirements); aggregate each grouped dataframe by id to collect the structs; join both dataframes using id; the id can be dropped in the final dataframe.
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.functions.collect_list
import org.apache.spark.sql.functions.struct
val idDF = df.withColumn("id", monotonically_increasing_id)
val explodeDf = idDF
.select(col("id"), col("status"), explode(col("fee")).as("fee"))
.select(col("id"), col("fee"), explode(col("status")).as("status"))
val flatDF = explodeDf.select(col("id"), col("fee.fee"), col("fee.updated_at").as("updated_at_fee"), col("status.status"), col("status.updated_at").as("updated_at_status"))
val feeDF = flatDF.groupBy("id", "fee").min("updated_at_fee")
val feeSelectDF = feeDF.select(col("id"), col("fee"), col("min(updated_at_fee)").as("updated_at"))
val feeAggDF = feeSelectDF.groupBy("id").agg(collect_list(struct("fee", "updated_at")).as("fee"))
val statusDF = flatDF.groupBy("id", "status").min("updated_at_status")
val statusSelectDF = statusDF.select(col("id"), col("status"), col("min(updated_at_status)").as("updated_at"))
val statusAggDF = statusSelectDF.groupBy("id").agg(collect_list(struct("status", "updated_at")).as("status"))
val finalDF = feeAggDF.join(statusAggDF, "id")
finalDF.show(10)
finalDF.printSchema()
I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
It's schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string. So, I can update this field using udf as:
def updateString = udf((s: String) => {
"Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do same operation on root--people--person column which contains array of person. How to achieve this using udf?
def updateArray = udf((arr: Seq[Row]) => ???)
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only one field. In your UDF, you need to return a Tuple1 and then cast the output of your UDF to keep the field names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
.withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
For this, you just need to update your function; everything else remains the same.
Here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the column order is changed
I just updated your function: instead of using Row, I am using Seq[String] here.
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// keeping all the columns for testing purposes; you could drop them if you don't want them
Let me know if you would like to know more about this.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
create UDF for our requirements
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the udf
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak it a bit (though I think hardly any tweak is required), but this contains most of what you need to solve your problem.
I have a JSON string that I load into a Spark DataFrame. The JSON string can have between 0 and 3 key-value pairs.
When more than one kv pair is sent, product_facets is correctly formatted as an array, like below:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
}}}
I can now use the explode function:
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
However, when only one key-value pair is sent, the source JSON string for entry is not formatted as an array with square brackets:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":{"key":"test","value":"success"}}
}}}
The schema for product tag looks like:
| |-- product: struct (nullable = true)
| | |-- product_facets: struct (nullable = true)
| | | |-- entry: string (nullable = true)
| | |-- product_name: string (nullable = true)
How can I change entry to an array of key-value pairs that is compatible with the explode function? My end goal is to pivot the keys into individual columns, and I want to group by after exploding the kv pairs. I tried using from_json but could not get it to work.
val schema =
StructType(
Seq(
StructField("entry", ArrayType(
StructType(
Seq(
StructField("key", StringType),
StructField("value",StringType)
)
)
))
)
)
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
But the above creates a new column kvPairsFromJson that looks like "[]", and using explode does nothing.
Any pointers on what's going on, or whether there is a better way to do this?
I think one approach could be:
1. Create a udf which takes the entry value as a JSON string and converts it to List(Tuple(K, V)).
2. In the udf, check whether the entry value is an array or not and do the conversion accordingly.
The code below illustrates the above approach:
// one row where entry is array and other non-array
val ds = Seq("""{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""", """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}""").toDS
val df = spark.read.json(ds)
// Schema used by udf to generate output column
import org.apache.spark.sql.types._
val outputSchema = ArrayType(StructType(Seq(
StructField("key", StringType, false),
StructField("value", StringType, false)
)))
// Converts non-array entry value to array
val toArray = udf((json: String) => {
import com.fasterxml.jackson.databind._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
val jsonMapper = new ObjectMapper()
jsonMapper.registerModule(DefaultScalaModule)
if(!json.startsWith("[")) {
val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
List((jsonMap("key"), jsonMap("value")))
} else {
jsonMapper.readValue(json, classOf[List[Map[String, String]]]).map(f => (f("key"), f("value")))
}
}, outputSchema)
val arrayResult = df.select(col("id").as("id"), toArray(col("productData.product.product_facets.entry")).as("entry"))
val arrayExploded = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
val explodedToCols = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry")).select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
Results in:
scala> arrayResult.printSchema
root
|-- id: long (nullable = true)
|-- entry: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = false)
| | |-- value: string (nullable = false)
scala> arrayExploded.printSchema
root
|-- id: long (nullable = true)
|-- entry: struct (nullable = true)
| |-- key: string (nullable = false)
| |-- value: string (nullable = false)
scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry |
+---+--------------------------------+
|1 |[[test, success], [test2, fail]]|
|2 |[[test, success]] |
+---+--------------------------------+
scala> arrayExploded.show(false)
+---+---------------+
|id |entry |
+---+---------------+
|1 |[test, success]|
|1 |[test2, fail] |
|2 |[test, success]|
+---+---------------+
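Since the stated end goal is to pivot the keys into individual columns, a minimal sketch of that last step, building on the explodedToCols dataframe above (the first aggregation assumes at most one value per id and key), could be:
import org.apache.spark.sql.functions.first

// one column per distinct key, holding that key's value for each id
val pivoted = explodedToCols
  .groupBy("id")
  .pivot("key")
  .agg(first("value"))

pivoted.show(false)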
I have a column in a dataframe that is an array [always of a single item], that looks like this:
root
|-- emdaNo: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- _value: string (nullable = true)
| | |-- id: string (nullable = true)
I can't for the life of me work out how to get the _value out of it and into a string...
Assuming x is the dataframe, i've tried:
x.select($"arrayName._value") // Yields ["myStringHere"]
and
x.select($"arrayName[0]._value") // Errors
How do I get a nice string of the value held in _value, please?
case class Element(_value: String, id: String)
val df = Seq(Array(Element("foo", "bar"))).toDF("emdaNo")
df.select(element_at($"emdaNo._value", 1) as "_value").show()
Output:
+------+
|_value|
+------+
| foo|
+------+
Alternatively (and before Spark 2.4)
df.select($"emdaNo._value"(0))
or
df.select($"emdaNo._value".getItem(0))
I am trying to flatten the schema of an existing dataframe with nested fields. The structure of my dataframe is something like this:
root
|-- Id: long (nullable = true)
|-- Type: string (nullable = true)
|-- Uri: string (nullable = true)
|-- Type: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Gender: array (nullable = true)
| |-- element: string (containsNull = true)
Type and Gender can contain an array of elements, one element, or a null value.
I tried to use the following code:
var resDf = df.withColumn("FlatType", explode(df("Type")))
But as a result, in the resulting data frame I lose the rows for which Type was null. For example, if I have 10 rows, and in 7 rows Type is null while in 3 it is not null, after I use explode the resulting data frame has only three rows.
How can I keep rows with null values but explode array of values?
I found some kind of workaround but am still stuck in one place. For standard types we can do the following:
def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  var exploded = None: Option[org.apache.spark.sql.Column]
  colType.toLowerCase() match {
    case "string" =>
      val avoidNull = udf((column: Seq[String]) =>
        if (column == null) Seq[String](null)
        else column)
      exploded = Some(explode(avoidNull(df(field))))
    case "boolean" =>
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]()
        else xs)
      exploded = Some(explode(avoidNull(df(field))))
    case _ => exploded = Some(explode(df(field)))
  }
  exploded.get
}
And after that just use it like this:
val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
However, I have a problem for struct type for the following type of structure:
|-- Address: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AddressType: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- DEA: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Number: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- ExpirationDate: array (nullable = true)
| | | | | |-- element: timestamp (containsNull = true)
| | | | |-- Status: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
How can we process that kind of schema when DEA is null?
Thank you in advance.
P.S. I tried to use lateral views, but the result is the same.
Maybe you can try using when:
val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type")))
As shown in the when function's documentation, the value null is inserted for the values that do not match the conditions.
I think what you want is to use explode_outer instead of explode.
See the Apache Spark docs for explode and explode_outer.
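For example, a minimal sketch on the dataframe from the question (assuming the same df and the Type column shown above):
import org.apache.spark.sql.functions.{col, explode_outer}

// explode_outer keeps rows whose array is null or empty, producing a null in the new column
val resDf = df.withColumn("FlatType", explode_outer(col("Type")))
resDf.show()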