Remove an element from an array of struct in Spark Scala - scala

I want to implement functionality to remove an element from an array of struct in Spark Scala. For the date "2019-01-26" I want to remove the entire struct from the array column. Following is my code:
import org.apache.spark.sql.types._
val df=Seq(("123","Jack",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"))),("124","jones",Seq(("2020-04-26","200","72","ABC"),("2020-05-26","300","71","ABC"),("2020-06-26","200","70","ABC"),("2020-08-26","300","69","ABC"),("2020-08-26","300","69","ABC"))),("125","daniel",Seq(("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF"),("2019-01-26","200","70","DEF")))).toDF("id","name","history").withColumn("history",$"history".cast("array<struct<infodate:Date,amount1:Integer,amount2:Integer,detail:string>>"))
scala> df.printSchema
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- infodate: date (nullable = true)
| | |-- amount1: integer (nullable = true)
| | |-- amount2: integer (nullable = true)
| | |-- detail: string (nullable = true)
So for the date 2019-01-26, I want the struct containing it removed from the array column.
I managed to find a solution, but it involves a lot of hardcoding,
and I'm searching for a solution/suggestion that is more optimal.
Hardcoded solution:
import org.apache.spark.sql.functions._

val dfnew = df.withColumn("history",
  array_except(
    col("history"),
    array(
      struct(
        lit("2019-01-26").cast(DataTypes.DateType).alias("infodate"),
        lit("200").cast(DataTypes.IntegerType).alias("amount1"),
        lit("70").cast(DataTypes.IntegerType).alias("amount2"),
        lit("DEF").alias("detail")
      )
    )
  )
)
Is there any way of doing it optimally, with one filter condition only on the
date "2019-01-26", that removes the matching struct from the array column?

I use an expression / filter here. Obviously it's a string, so you can replace the date with a value so that there is even less hardcoding (a parameterised sketch follows the output below). Filters are handy expressions as they let you use SQL notation to reference sub-components of the struct.
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn( "history" ,
expr( "filter( history , x -> x.infodate != '2019-01-26' )" )
).show(10,false)
// Exiting paste mode, now interpreting.
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |history |
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
|123|Jack |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC]] |
|124|jones |[[2020-04-26, 200, 72, ABC], [2020-05-26, 300, 71, ABC], [2020-06-26, 200, 70, ABC], [2020-08-26, 300, 69, ABC], [2020-08-26, 300, 69, ABC]]|
|125|daniel|[] |
+---+------+--------------------------------------------------------------------------------------------------------------------------------------------+
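For example, a minimal sketch of pulling the date out into a variable (the dateToRemove name is purely illustrative):

import org.apache.spark.sql.functions.expr

// Hypothetical variable name; this could just as well be a function parameter.
val dateToRemove = "2019-01-26"

val dfnew = df.withColumn(
  "history",
  expr(s"filter(history, x -> x.infodate != '$dateToRemove')")
)

Note that an element with a null infodate would also be dropped by this predicate, since the comparison evaluates to null inside filter.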

Related

In SparkSQL how could I select a subset of columns from a nested struct and keep it as a nested struct in the result using SQL statement?

I can do the following statement in SparkSQL:
result_df = spark.sql("""select
one_field,
field_with_struct
from purchases""")
And the resulting data frame will have the full struct in field_with_struct.
one_field | field_with_struct
----------|-------------------------
123       | {name1,val1,val2,f2,f4}
555       | {name2,val3,val4,f6,f7}
I want to select only a few fields from field_with_struct, but keep them in a struct in the resulting data frame. Something like this, if it were possible (this is not real code):
result_df = spark.sql("""select
one_field,
struct(
field_with_struct.name,
field_with_struct.value2
) as my_subset
from purchases""")
To get this:
one_field | my_subset
----------|--------------
123       | {name1,val2}
555       | {name2,val4}
Is there any way of doing this with SQL? (not with fluent API)
There's a much simpler solution making use of arrays_zip, no need to explode/collect_list (which can be error prone/difficult with complex data since it relies on using something like an id column):
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
| | |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
EDIT: Adding the corresponding Spark SQL code, since that was requested by the OP:
>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
In fact, the pseudo code which I provided works. For a nested array of objects it's not so straightforward: first the array has to be exploded (EXPLODE() function), then a subset selected, and after that it's possible to apply COLLECT_LIST().
WITH
unfold_by_items AS (SELECT id, EXPLODE(Items) AS item FROM spark_tbl_items)
, format_items AS (SELECT
      id
    , STRUCT(
          item.item_id
        , item.name
      ) AS item
    FROM unfold_by_items)
, fold_by_items AS (SELECT id, COLLECT_LIST(item) AS Items FROM format_items GROUP BY id)
SELECT * FROM fold_by_items
This chooses only two fields from the struct in Items and in the end returns a dataset which again contains an Items array.
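As the arrays_zip answer above noted, on Spark 2.4+ the same reshaping can also be done without the explode/collect round trip by using transform. A minimal sketch against the same spark_tbl_items table, keeping the field names from the CTE above:

// Sketch only: assumes Spark 2.4+ and that spark_tbl_items is registered as a view
// with an Items column of type array of struct.
val foldedDf = spark.sql("""
  SELECT id,
         transform(Items, item -> struct(item.item_id, item.name)) AS Items
  FROM spark_tbl_items
""")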

How to apply SHA2 to a particular column that is inside an array of structs, dynamically, in Hive or Spark SQL?

I have data in Hive:
id    name   kyc
1001  smith  [pnno:999, ssn:12345, email:ss#mail.com]
When we select these columns, the output will be:
1001, smith, [999, 12345, ss#mail.com]
I have to apply SHA2 to the ssn inside this array column, and the output should display
1001, smith, [999, *****(sha2 masked value), ss#mail.com]
The output should keep the same array-of-struct format.
I am currently creating a separate view and joining back. Is there any way to handle this dynamically in a Hive query, or inside Spark/Scala using a dataframe?
Also, can this be done with any Spark/Scala config?
Thank you
You can use transform to hash the ssn field in the array of structs:
// sample dataframe
df.show(false)
+----+-----+---------------------------+
|id |name |kyc |
+----+-----+---------------------------+
|1001|smith|[[999, 12345, ss#mail.com]]|
+----+-----+---------------------------+
// sample schema
df.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = false)
// |-- kyc: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- pnno: integer (nullable = false)
// | | |-- ssn: integer (nullable = false)
// | | |-- email: string (nullable = false)
val df2 = df.withColumn(
  "kyc",
  expr("""
    transform(kyc,
      x -> struct(x.pnno pnno, sha2(string(x.ssn), 512) ssn, x.email email)
    )
  """)
)
df2.show(false)
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |kyc |
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
|1001|smith|[[999, 3627909a29c31381a071ec27f7c9ca97726182aed29a7ddd2e54353322cfb30abb9e3a6df2ac2c20fe23436311d678564d0c8d305930575f60e2d3d048184d79, ss#mail.com]]|
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
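The same transform also works as a plain Spark SQL statement if you prefer to stay in SQL. A sketch (the kyc_tbl view name is just for illustration):

// Sketch: assumes Spark 2.4+; "kyc_tbl" is a hypothetical view name.
df.createOrReplaceTempView("kyc_tbl")

val masked = spark.sql("""
  SELECT id, name,
         transform(kyc, x -> struct(x.pnno AS pnno,
                                    sha2(string(x.ssn), 512) AS ssn,
                                    x.email AS email)) AS kyc
  FROM kyc_tbl
""")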

Converting a dataframe to an array of struct of column names and values

Suppose I have a dataframe like this
val customer = Seq(
("C1", "Jackie Chan", 50, "Dayton", "M"),
("C2", "Harry Smith", 30, "Beavercreek", "M"),
("C3", "Ellen Smith", 28, "Beavercreek", "F"),
("C4", "John Chan", 26, "Dayton","M")
).toDF("cid","name","age","city","sex")
How can I get the cid values in one column and the rest of the values as an array<struct<column_name, column_value>> in Spark?
The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:
val cols = customer.columns.tail
val result = customer.select('cid,
array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]] |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]] |
+---+-----------------------------------------------------------+
result.printSchema()
root
|-- cid: string (nullable = true)
|-- array: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = false)
| | |-- value: string (nullable = true)
You can do it using array and struct functions:
customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))
will make:
|-- cid: string (nullable = true)
|-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- column_name: string (nullable = false)
| | |-- column_value: string (nullable = true)
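If you want a friendlier column name than the generated one above, and an explicit cast for the non-string column, the same idea can be written as in this sketch (the name_value_pairs alias is just for illustration):

import org.apache.spark.sql.functions.{array, lit, struct}

customer.select(
  $"cid",
  array(
    struct(lit("name") as "column_name", $"name" as "column_value"),
    struct(lit("age") as "column_name", $"age".cast("string") as "column_value")
  ) as "name_value_pairs"  // hypothetical alias
)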
Map columns might be a better way to deal with the overall problem. You don't have to cast each value to string yourself; Spark will coerce the map values to a common type (string here) for you.
from pyspark.sql.functions import create_map, col, lit, when, map_concat

df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"), col("sex")
    ).alias('map_col')
)
or wrap the map col in an array if you want it
This way you can still do numerical or string transformations on the relevant key or value. For example:
df2 = df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"), col("sex")
    ).alias('map_col')
)
df2.select('*',
    map_concat(col('map_col'), create_map(lit('u_age'), when(col('map_col')['age'] < 18, True)))
)
Hope that makes sense, typed this straight in here so forgive if there's a bracket missing somewhere

In PySpark how to parse an embedded JSON

I am new to PySpark.
I have a JSON file which has below schema
df = spark.read.json(input_file)
df.printSchema()
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns: type and UrlsInfo.element.DisplayUrl.
This is my attempt, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related to JSON file parsing in PySpark, but it doesn't answer my question.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+

Add derived column (as array of struct) based on values and ordering of other columns in Spark Scala dataframe

I have a Scala Spark dataframe with four columns (all string type) - P, Q, R, S - and a primary key (called PK) (integer type).
Each of these 4 columns may have null values. The left to right ordering of the columns is the importance/relevance of the column and needs to be preserved. The structure of the base dataframe stays the same as shown.
I want the final output to be as follows:
root
|-- PK: integer (nullable = true)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: string (nullable = true)
|-- categoryList: array (nullable = true)
| |-- myStruct: struct (nullable = true)
| | |-- category: string (nullable = true)
| | |-- relevance: integer (nullable = true)
I need to create a new column derived from the 4 columns P, Q, R, S based on the following algorithm:
For each row, check whether each of the four column values (P, Q, R, S) exists in the Map "mapM".
If the value exists, the "category" in the struct will be the corresponding value from mapM. If it does not exist in mapM, the category shall be null.
The "relevance" in the struct shall be the order of the column from left to right: P -> 1, Q -> 2, R -> 3, S -> 4.
The array formed by these four structs is then added to a new column on the dataframe provided.
I'm new to Scala and here is what I have until now:
case class relevanceCaseClass(category: String, relevance: Integer)
def myUdf = udf((code: String, relevance: Integer) => relevanceCaseClass(mapM.value.getOrElse(code, null), relevance))
df.withColumn("newColumn", myUdf(col("P/Q/R/S"), 1))
The problem with this is that I cannot pass the value of the ordering inside the withColumn function. I need to let the myUdf function know the value of the relevance. Am I doing something fundamentally wrong?
Thus I should get the output:
PK P Q R S newCol
1 a b c null array(struct("a", 1), struct(null, 2), struct("c", 3), struct(null, 4))
Here, the value "b" was not found in the map and hence the value (for category) is null. Since the value for column S was already null, it stayed null. The relevance is according to the left-right column ordering.
Given an input dataframe (as given in the OP):
+---+---+---+---+----+
|PK |P |Q |R |S |
+---+---+---+---+----+
|1 |a |b |c |null|
+---+---+---+---+----+
root
|-- PK: integer (nullable = false)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: null (nullable = true)
and a broadcasted Map as
val mapM = spark.sparkContext.broadcast(Map("a" -> "a", "c" -> "c"))
You can define the udf function and call that udf function as below
def myUdf = udf((pqrs: Seq[String]) => pqrs.zipWithIndex.map(code => relevanceCaseClass(mapM.value.getOrElse(code._1, "null"), code._2+1)))
val finaldf = df.withColumn("newColumn", myUdf(array(col("P"), col("Q"), col("R"), col("S"))))
with case class as in OP
case class relevanceCaseClass(category: String, relevance: Integer)
which should give you your desired output i.e. finaldf would be
+---+---+---+---+----+--------------------------------------+
|PK |P |Q |R |S |newColumn |
+---+---+---+---+----+--------------------------------------+
|1 |a |b |c |null|[[a, 1], [null, 2], [c, 3], [null, 4]]|
+---+---+---+---+----+--------------------------------------+
root
|-- PK: integer (nullable = false)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: null (nullable = true)
|-- newColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- category: string (nullable = true)
| | |-- relevance: integer (nullable = true)
I hope the answer is helpful
You can pass multiple columns to the udf, as in the following example:
case class Relevance(category: String, relevance: Integer)
def myUdf = udf((p: String, q: String, r: String, s: String) => Seq(
  Relevance(mapM.value.getOrElse(p, null), 1),
  Relevance(mapM.value.getOrElse(q, null), 2),
  Relevance(mapM.value.getOrElse(r, null), 3),
  Relevance(mapM.value.getOrElse(s, null), 4)
))
df.withColumn("newColumn", myUdf(df("P"), df("Q"), df("R"), df("S")))