Remove field from array.struct in Spark - scala

I want to delete one field from an array of structs, as follows:
case class myObj (id: String, item_value: String, delete: String)
case class myObj2 (id: String, item_value: String)
val df2=Seq (
("1", "2","..100values", Seq(myObj ("A", "1a","1"),myObj ("B", "4r","2"))),
("1", "2","..100values", Seq(myObj ("X", "1p","11"),myObj ("V", "7w","8")))
).toDF("1","2","100fields","myArr")
val deleteColumn : (mutable.WrappedArray[myObj]=>mutable.WrappedArray[myObj2])= {
(array: mutable.WrappedArray[myObj]) => array.map(o => myObj2(o.id, o.item_value))
}
val myUDF3 = functions.udf(deleteColumn)
df2.withColumn("newArr",myUDF3($"myArr")).show(false)
The error is very clear:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array<struct<id:string,item_value:string,delete:string>>) => array<struct< id:string,item_value:string>>)
The types do not match, but that is exactly what I want to do: convert from one structure to another. How can I do this?
I am using a UDF because df.map() is not a good fit for mapping a single column: it forces me to specify all the columns. So far I have not found a better way to apply this mapping to one column.

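You can rewrite your UDF to take a Seq[Row] instead of the custom object, as below. Spark passes struct values to a Scala UDF as Row, not as your case class.
// Read the first two struct fields by position and drop the third ("delete")
val deleteColumn = udf((value: Seq[Row]) => {
value.map(row => myObj2(row.getString(0), row.getString(1)))
})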
df2.withColumn("newArr", deleteColumn($"myArr"))
Output:
+---+---+-----------+---------------------+----------------+
|1 |2 |100fields |myArr |newArr |
+---+---+-----------+---------------------+----------------+
|1 |2 |..100values|[[A,1a,1], [B,4r,2]] |[[A,1a], [B,4r]]|
|1 |2 |..100values|[[X,1p,11], [V,7w,8]]|[[X,1p], [V,7w]]|
+---+---+-----------+---------------------+----------------+

Without a UDF, you can remove fields from an array of structs by combining dropFields with transform (Column.dropFields is available from Spark 3.1, the Scala transform function from Spark 3.0).
Test input:
val df = spark.createDataFrame(Seq(("v1", "v2", "v3", "v4"))).toDF("f1", "f2", "f3", "f4")
.select(
array(
struct("f1", "f2"),
struct(col("f3").as("f1"), col("f4").as("f2")),
).as("myArr")
)
df.printSchema()
// root
// |-- myArr: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- f1: string (nullable = true)
// | | |-- f2: string (nullable = true)
Script:
val df2 = df.withColumn(
"myArr",
transform(
$"myArr",
x => x.dropFields("f2")
)
)
df2.printSchema()
// root
// |-- myArr: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- f1: string (nullable = true)
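Applied to the array from the original question, the same idea might look like the sketch below. It assumes Spark 3.1 or later and a local SparkSession. The flat column names (id1, v1, d1, ...) and the single sample row are made up for illustration, and the structs are built from plain columns rather than from the myObj case class.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, struct, transform}

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("drop-field").getOrCreate()
import spark.implicits._

// Build an array<struct<id, item_value, delete>> column from flat sample columns
val df = Seq(("A", "1a", "1", "B", "4r", "2"))
  .toDF("id1", "v1", "d1", "id2", "v2", "d2")
  .select(
    array(
      struct(col("id1").as("id"), col("v1").as("item_value"), col("d1").as("delete")),
      struct(col("id2").as("id"), col("v2").as("item_value"), col("d2").as("delete"))
    ).as("myArr")
  )

// Drop the "delete" field from every struct in the array, keeping the original column
val result = df.withColumn("newArr", transform(col("myArr"), x => x.dropFields("delete")))
result.printSchema()
```

Because no UDF is involved, the remaining struct fields keep their names and types without any cast.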

Related

How to convert two columns to List[(Long,Int)] - Spark UDF

I am new to Spark and trying to call an existing converter method. I have a Spark DataFrame df with two columns. I want to convert these two columns to a List[(Long,Int)].
val df = (spark.read.parquet(s"<file path>")
.select(
explode($"groups") as "groups"))
.select(
explode($"groups.lanes") as "lanes")
.select(
$"lanes.lane_path" as "lane_path")
Schema
root
|-- lane_path: struct (nullable = true)
| |-- coordinates: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- diffs: array (nullable = true)
| | |-- element: integer (containsNull = true)
UDF
val generateValues = spark.udf.register("my_udf",
(laneAttrs: List[(Long,Int)], x: Long, y: Int) => {
Converter.convert(laneAttrs, x, y)
.map {
case coord @ Left(f) => throw new InterruptedException(s"$coord: $f")
case Right(result) => Map("a" -> result.a,
"b" -> result.b)
}
})
How can I pass "$lane_path.coordinates" and "$lane_path.diffs" as List[(Long,Int)] to call generateValues UDF?
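One possible approach, sketched here under assumptions: Spark hands array columns to a Scala UDF as Seq values, so the UDF could accept the two arrays separately as Seq[Long] and Seq[Int] and pair them up itself. The helper name toLaneAttrs is hypothetical, and the sketch assumes neither array contains null elements.

```scala
// Pair each coordinate with the diff at the same index.
// zip stops at the shorter input, so arrays of unequal length are truncated.
def toLaneAttrs(coordinates: Seq[Long], diffs: Seq[Int]): List[(Long, Int)] =
  coordinates.zip(diffs).toList

// Inside Spark this could be wrapped roughly as follows (not run here,
// because Converter is only known from the question):
//   udf((coords: Seq[Long], diffs: Seq[Int], x: Long, y: Int) =>
//     Converter.convert(toLaneAttrs(coords, diffs), x, y) ...)
// and called with $"lane_path.coordinates" and $"lane_path.diffs" as the first two arguments.
val laneAttrs = toLaneAttrs(Seq(10L, 20L, 30L), Seq(1, 2, 3))
println(laneAttrs)
```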

how to update spark dataframe column containing array using udf

I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
Its schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string. So, I can update this field using udf as:
def updateString = udf((s: String) => {
"Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do the same operation on the root--people--person column, which contains an array of persons. How can I achieve this using a udf?
def updateArray = udf((arr: Seq[Row]) => ???)
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only one field. In your UDF, you need to return a Tuple1 and then cast the output of the UDF to keep the field names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
.withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
You just need to update your function; everything else remains the same.
Here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// only the column order is changed
I updated your function to take Seq[String] instead of Row:
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// all columns are kept for testing purposes; drop the ones you don't need.
Let me know if you want to know more about this.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
create UDF for our requirements
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the udf
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak this a bit (though I doubt much tweaking is required), but it contains most of what you need to solve your problem.
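For comparison, here is a sketch that avoids a UDF altogether by rebuilding the people struct with the built-in transform function. It assumes Spark 3.0 or later and a local SparkSession; the intermediate column name names is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit, struct, transform}

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("prefix-array").getOrCreate()
import spark.implicits._

// Recreate the question's shape: people is a struct holding one array field, person
val df = Seq((Seq("jack", "jill", "hero"), "joker"))
  .toDF("names", "person")
  .select(struct(col("names").as("person")).as("people"), col("person"))

// Prefix every array element and wrap the result back into a struct with the same field name
val updated = df.withColumn(
  "people",
  struct(transform(col("people.person"), x => concat(lit("Mr. "), x)).as("person"))
)
updated.printSchema()
updated.show(false)
```

Naming the rebuilt field with as("person") keeps the struct<person:array<string>> shape, so no cast should be needed afterwards.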

Change to empty array if another column is false

I am trying to create a dataframe that returns an empty array for a nested struct type if another column is false. I created a dummy dataframe to illustrate my problem.
import spark.implicits._
val newDf = spark.createDataFrame(Seq(
("user1","true", Some(8), Some("usd"), Some("tx1")),
("user1", "true", Some(9), Some("usd"), Some("tx2")),
("user2", "false", None, None, None))).toDF("userId","flag", "amount", "currency", "transactionId")
val amountStruct = struct("amount"
,"currency").alias("amount")
val transactionStruct = struct("transactionId"
, "amount").alias("transactions")
val dataStruct = struct("flag","transactions").alias("data")
val finalDf = newDf.
withColumn("amount", amountStruct).
withColumn("transactions", transactionStruct).
select("userId", "flag","transactions").
groupBy("userId", "flag").
agg(collect_list("transactions").alias("transactions")).
withColumn("data", dataStruct).
drop("transactions","flag")
This is the output:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, [[, [,]]]]|
| user1|[true, [[tx1, [8,...|
+------+--------------------+
and schema:
root
|-- userId: string (nullable = true)
|-- data: struct (nullable = false)
| |-- flag: string (nullable = true)
| |-- transactions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- transactionId: string (nullable = true)
| | | |-- amount: struct (nullable = false)
| | | | |-- amount: integer (nullable = true)
| | | | |-- currency: string (nullable = true)
The output I want:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, []] |
| user1|[true, [[tx1, [8,...|
+------+--------------------+
I've tried doing this before the collect_list, but with no luck.
import org.apache.spark.sql.functions.typedLit
val emptyArray = typedLit(Array.empty[(String, Array[(Int, String)])])
testDf.withColumn("transactions", when($"flag" === "false", emptyArray).otherwise($"transactions")).show()
You were moments from victory. The approach with collect_list is the way to go, it just needs a little nudge.
TL;DR Solution
val newDf = spark
.createDataFrame(
Seq(
("user1", "true", Some(8), Some("usd"), Some("tx1")),
("user1", "true", Some(9), Some("usd"), Some("tx2")),
("user2", "false", None, None, None)
)
)
.toDF("userId", "flag", "amount", "currency", "transactionId")
val dataStruct = struct("flag", "transactions")
val finalDf2 = newDf
.groupBy("userId", "flag")
.agg(
collect_list(
when(
$"transactionId".isNotNull && $"amount".isNotNull && $"currency".isNotNull,
struct(
$"transactionId",
struct($"amount", $"currency").alias("amount")
)
)).alias("transactions"))
.withColumn("data", dataStruct)
.drop("transactions", "flag")
Explanation
SQL Aggregate Function Behavior
First of all, Spark follows SQL conventions here. SQL aggregate functions that operate on a column (and collect_list is one of them) ignore NULL inputs as if they were never there.
Let's take a look at how collect_list behaves:
Seq(
("a", Some(1)),
("a", Option.empty[Int]),
("a", Some(3)),
("b", Some(10)),
("b", Some(20)),
("b", Option.empty[Int])
)
.toDF("col1", "col2")
.groupBy($"col1")
.agg(collect_list($"col2") as "col2_list")
.show()
And the result is:
+----+---------+
|col1|col2_list|
+----+---------+
| b| [10, 20]|
| a| [1, 3]|
+----+---------+
Tracking Down Nullability
It looks like collect_list behaves properly. So the reason you are seeing those blanks in your output is that the column that gets passed to the collect_list is not nullable.
To prove it let's examine the schema of the DataFrame just before it gets aggregated:
newDf
.withColumn("amount", amountStruct)
.withColumn("transactions", transactionStruct)
.printSchema()
root
|-- userId: string (nullable = true)
|-- flag: string (nullable = true)
|-- amount: struct (nullable = false)
| |-- amount: integer (nullable = true)
| |-- currency: string (nullable = true)
|-- currency: string (nullable = true)
|-- transactionId: string (nullable = true)
|-- transactions: struct (nullable = false)
| |-- transactionId: string (nullable = true)
| |-- amount: struct (nullable = false)
| | |-- amount: integer (nullable = true)
| | |-- currency: string (nullable = true)
Note the transactions: struct (nullable = false) part. It proves the suspicion.
If we translate all the nested nullability to Scala, here's what you have:
case class Row(
transactions: Transactions,
// other fields
)
case class Transactions(
transactionId: Option[String],
amount: Option[Amount],
)
case class Amount(
amount: Option[Int],
currency: Option[String]
)
And here's what you want instead:
case class Row(
transactions: Option[Transactions], // this is optional now
// other fields
)
case class Transactions(
transactionId: String, // while this is not optional
amount: Amount, // neither is this
)
case class Amount(
amount: Int, // neither is this
currency: String // neither is this
)
Fixing the Nullability
Now the last step is simple. To make the column that is the input to collect_list "properly" nullable you have to check the nullability of all the amount, currency and transactionId columns.
The result will be NOT NULL if and only if all the input columns are NOT NULL.
You can use the same when API method to construct the result. If the otherwise clause is omitted, when implicitly returns NULL, which is exactly what you need.
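The effect can be seen in isolation with a smaller sketch. It assumes a local SparkSession and a made-up two-column DataFrame (txDf) with only userId and transactionId.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct, when}

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("nullable-struct").getOrCreate()
import spark.implicits._

val txDf = Seq(
  ("user1", Some("tx1")),
  ("user1", Some("tx2")),
  ("user2", Option.empty[String])
).toDF("userId", "transactionId")

// when(...) without otherwise yields NULL for user2's row,
// and collect_list skips NULL inputs, leaving an empty array for that group
val grouped = txDf
  .groupBy("userId")
  .agg(
    collect_list(
      when(col("transactionId").isNotNull, struct(col("transactionId")))
    ).as("transactions")
  )
grouped.show(false)
```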

Convert a json string to array of key-value pairs in Spark scala

I have a JSON string that I load into a Spark DataFrame. The JSON string can have between 0 and 3 key-value pairs.
When more than one key-value pair is sent, product_facets is correctly formatted as an array, like below:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
}}}
I can now use the explode function:
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
However, when only one key-value pair is sent, the source JSON string for entry is not formatted as an array with square brackets:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":{"key":"test","value":"success"}}
}}}
The schema for product tag looks like:
| |-- product: struct (nullable = true)
| | |-- product_facets: struct (nullable = true)
| | | |-- entry: string (nullable = true)
| | |-- product_name: string (nullable = true)
How can I change the entry to an array of key value pairs that is compatible with the explode function. My end goal is to pivot the keys into individual columns and I want to use group by on exploding the kv pairs. I tried using from_json but could not get it to work.
val schema =
StructType(
Seq(
StructField("entry", ArrayType(
StructType(
Seq(
StructField("key", StringType),
StructField("value",StringType)
)
)
))
)
)
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
But the above creates a new column kvPairsFromJson that looks like "[]", and using explode on it does nothing.
Any pointers on what's going on, or is there a better way to do this?
I think one approach could be :
1. Create a udf which takes entry value as json string, and converts it to List( Tuple(K, V))
2. In udf, check if entry value is array or not and do conversion accordingly.
The code below explains above approach:
// one row where entry is array and other non-array
val ds = Seq("""{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""", """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}""").toDS
val df = spark.read.json(ds)
// Schema used by udf to generate output column
import org.apache.spark.sql.types._
val outputSchema = ArrayType(StructType(Seq(
StructField("key", StringType, false),
StructField("value", StringType, false)
)))
// Converts non-array entry value to array
val toArray = udf((json: String) => {
import com.fasterxml.jackson.databind._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
val jsonMapper = new ObjectMapper()
jsonMapper.registerModule(DefaultScalaModule)
if(!json.startsWith("[")) {
val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
List((jsonMap("key"), jsonMap("value")))
} else {
jsonMapper.readValue(json, classOf[List[Map[String, String]]]).map(f => (f("key"), f("value")))
}
}, outputSchema)
val arrayResult = df.select(col("id").as("id"), toArray(col("productData.product.product_facets.entry")).as("entry"))
val arrayExploded = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
val explodedToCols = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry")).select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
Results in:
scala> arrayResult.printSchema
root
|-- id: long (nullable = true)
|-- entry: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = false)
| | |-- value: string (nullable = false)
scala> arrayExploded.printSchema
root
|-- id: long (nullable = true)
|-- entry: struct (nullable = true)
| |-- key: string (nullable = false)
| |-- value: string (nullable = false)
scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry |
+---+--------------------------------+
|1 |[[test, success], [test2, fail]]|
|2 |[[test, success]] |
+---+--------------------------------+
scala> arrayExploded.show(false)
+---+---------------+
|id |entry |
+---+---------------+
|1 |[test, success]|
|1 |[test2, fail] |
|2 |[test, success]|
+---+---------------+
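A lighter variant of the same idea, sketched under assumptions, is to normalize the string so that it is always a JSON array and leave the parsing to from_json. The helper name asJsonArray is hypothetical.

```scala
// Wrap a single JSON object in brackets so the value is always a JSON array.
// Strings that already start with "[" are returned unchanged; null stays null.
def asJsonArray(json: String): String =
  if (json == null) null
  else if (json.trim.startsWith("[")) json
  else s"[$json]"

val single = asJsonArray("""{"key":"test","value":"success"}""")
val multiple = asJsonArray("""[{"key":"a","value":"1"},{"key":"b","value":"2"}]""")
println(single)
println(multiple)
```

Wrapped as udf(asJsonArray _), the normalized column could then be passed to from_json with an ArrayType(StructType(...)) schema that describes the key/value entries themselves, rather than a struct with an entry field.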

How to extract values from JSON-encoded column? [duplicate]

This question already has answers here:
Querying Spark SQL DataFrame with complex types
I have a Spark Dataframe with the following schema.
[{ "map": {
"completed-stages": 1,
"total-stages": 1 },
"rec": "test-plan",
"status": {
"state": "SUCCESS"
}
},
{ "map": {
"completed-stages": 1,
"total-stages": 1 },
"rec": "test-proc",
"status": {
"state": "FAILED"
}
}]
I want to transform it into another DF having the following schema
[{"rec": "test-plan", "status": "SUCCESS"}, {"rec": "test-proc", "status": "FAILED"}]
I have written the following code, but it doesn't compile and complains of wrong encoding.
val fdf = DF.map(f => {
val listCommands = f.get(0).asInstanceOf[WrappedArray[Map[String, Any]]]
val m = listCommands.map(h => {
var rec = "none"
var status = "none"
if(h.exists("status" == "state" -> _)) {
status = (h.get("status") match {
case Some(x) => x.asInstanceOf[HashMap[String, String]].getOrElse("state", "none")
case _ => "none"
})
if(h.contains("rec")) {
rec = (h.get("rec") match {
case Some(x: String) => x
case _ => "none"
})
}
}
Map("status"->status, "rec"->rec)
})
val rm = m.flatten
rm
})
Please suggest the right way.
That's going to be tricky since the top-level elements of the JSON records are not the same, i.e. you have map1 and map2, and hence the schema is inconsistent. I'd speak to the "data producer" and request a change so that the name of the command is described by a separate element.
Given the schema of the DataFrame is as follows:
scala> commands.printSchema
root
|-- commands: array (nullable = true)
| |-- element: string (containsNull = true)
and the number of elements (rows) in it:
scala> commands.count
res1: Long = 1
You have to explode the commands array of elements first followed by accessing the fields of interest.
// 1. Explode the array
val commandsExploded = commands.select(explode($"commands") as "command")
scala> commandsExploded.count
res2: Long = 2
Let's create the schema of the JSON-encoded records. One could be as follows.
// Note that it accepts map1 and map2 fields
import org.apache.spark.sql.types._
val schema = StructType(
StructField("map1",
StructType(
StructField("completed-stages", LongType, true) ::
StructField("total-stages", LongType, true) :: Nil), true) ::
StructField("map2",
StructType(
StructField("completed-stages", LongType, true) ::
StructField("total-stages", LongType, true) :: Nil), true) ::
StructField("rec", StringType,true) ::
StructField("status", StructType(
StructField("state", StringType, true) :: Nil), true
) :: Nil)
With that, you should use from_json standard function that takes a column with JSON-encoded strings and a schema.
val commands = commandsExploded.select(from_json($"command", schema) as "command")
scala> commands.show(truncate = false)
+-------------------------------+
|command |
+-------------------------------+
|[[1, 1],, test-plan, [SUCCESS]]|
|[, [1, 1], test-proc, [FAILED]]|
+-------------------------------+
Let's have a look at the schema of the commands dataset.
scala> commands.printSchema
root
|-- command: struct (nullable = true)
| |-- map1: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- map2: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- rec: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- state: string (nullable = true)
Nested fields such as rec and status.state are accessible with dot notation on the command struct.
val recs = commands.select(
$"command.rec" as "rec",
$"command.status.state" as "status")
scala> recs.show
+---------+-------+
| rec| status|
+---------+-------+
|test-plan|SUCCESS|
|test-proc| FAILED|
+---------+-------+
Converting it to a single-record JSON-encoded dataset requires Dataset.toJSON followed by collect_list standard function.
val result = recs.toJSON.agg(collect_list("value"))
scala> result.show(truncate = false)
+-------------------------------------------------------------------------------+
|collect_list(value) |
+-------------------------------------------------------------------------------+
|[{"rec":"test-plan","status":"SUCCESS"}, {"rec":"test-proc","status":"FAILED"}]|
+-------------------------------------------------------------------------------+
You didn't provide the schema of the DataFrame, so the following might not work for you.
I saved the JSON sample in a test.json file and read it with val df=spark.read.option("multiLine",true).json("test.json"). In that case, to get the JSON you want, you can just run df.select($"rec",$"status.state".as("status")).write.json("test1.json")