How to find the "lowest" element from array<struct>? - scala

I have a dataframe with the following schema:
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
The VALUES column looks like this:
[["1","a"],["2","b"],["3","c"],["4","d"]]
[["4","g"]]
[["3","e"],["4","f"]]
I want to take the element of VALUES with the lowest integer, i.e.
the result df should look like this (each value will now be a StructType, not an Array[Struct]):
["1","a"]
["4","g"]
["3","e"]
Can someone please guide me on how I can approach this problem by creating a UDF?
Thanks in advance.

You don't need a UDF for that. Just use sort_array and pick the first element.
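For reference, a toy dataframe like the one shown below could be built along these lines (a sketch, assuming a spark-shell session with spark.implicits._ in scope; the exact nullability flags may differ):
import spark.implicits._

// Tuples become structs with fields _1/_2, so cast the array to rename them to col1/col2.
val df = Seq(
  Seq(("4", "a"), ("2", "b"), ("1", "c")),
  Seq(("1", "a")),
  Seq(("3", "b"), ("1", "v"))
).toDF("data_arr")
  .select($"data_arr".cast("array<struct<col1:string,col2:string>>").as("data_arr"))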
df.show
+--------------------+
| data_arr|
+--------------------+
|[[4,a], [2,b], [1...|
| [[1,a]]|
| [[3,b], [1,v]]|
+--------------------+
df.printSchema
root
|-- data_arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = false)
import org.apache.spark.sql.functions.sort_array
df.withColumn("first_asc", sort_array($"data_arr")(0)).show
+--------------------+---------+
| data_arr|first_asc|
+--------------------+---------+
|[[4,a], [2,b], [1...| [1,c]|
| [[1,a]]| [1,a]|
| [[3,b], [1,v]]| [1,v]|
+--------------------+---------+

Using the same dataframe as in the example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Sort the (String, String) pairs lexicographically and keep the smallest one.
val findSmallest = udf((rows: Seq[Row]) => {
  rows.map(row => (row.getAs[String](0), row.getAs[String](1))).sorted.head
})
df.withColumn("SMALLEST", findSmallest($"VALUES"))
This will give a result like this:
+---+--------------------+--------+
| ID|              VALUES|SMALLEST|
+---+--------------------+--------+
|  1|[[1,a], [2,b], [3...|   [1,a]|
|  2|             [[4,g]]|   [4,g]|
|  3|      [[3,e], [4,f]]|   [3,e]|
+---+--------------------+--------+
If you only want the final values, use select("SMALLEST").
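One caveat for both answers above: _v1 is compared as a string, so "10" would sort before "2". A minimal numeric-safe sketch (findSmallestNumeric is a hypothetical name, and it assumes _v1 always parses as an Int):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical variant: pick the element whose _v1 is numerically smallest.
val findSmallestNumeric = udf((rows: Seq[Row]) => {
  rows.map(row => (row.getAs[String](0), row.getAs[String](1))).minBy(_._1.toInt)
})
df.withColumn("SMALLEST", findSmallestNumeric($"VALUES"))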

Related

Scala spark: extract columns from a schema

I have a schema that looks like the following:
|-- contributors: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- type: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- id: string (nullable = true)
I would like to have a dataframe that has the columns key, name and id.
I have used the following code to get name and id, but how do I get the column key?
df.select(explode(col("contributors")))
  .select(explode(col("value")))
  .select(col("col.*"))
Update
I tried to apply the first solution to the following schema, but the compiler does not like it. I would like to get value._name and subgenres.element.value._name.
|-- mainGenre: struct (nullable = true)
| |-- value: struct (nullable = true)
| | |-- _name: string (nullable = true)
| |-- subgenres: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- value: struct (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- _name: string (nullable = true)
| | | |-- name: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
I tried to create a variable with value._name and then use it in my second variable, like this:
val col_mainGenre_name = df_r.select(col("mainGenre.*"))
  .select(col("value.*"))
  .select(col("_name"))
  .drop("readableName")
  .drop("description")
val df_exploded = df_r.select(col("mainGenre.*"))
  .select(col_mainGenre_name, col("value.*"))
You can add the key column in your second and third select; the select method of a dataframe accepts several columns as arguments.
You should modify your code as follows:
import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.*"))
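For reference, a toy contributors dataframe matching the input shown below could be built along these lines (a sketch, assuming a spark-shell session with spark.implicits._ in scope; Contribution is a hypothetical helper case class):
import spark.implicits._

case class Contribution(`type`: String, name: String, id: String)

val df = Seq(
  Map(
    "key1" -> Seq(Contribution("type11", "name11", "id11"), Contribution("type12", "name12", "id12")),
    "key2" -> Seq(Contribution("type21", "name21", "id21"))
  ),
  Map(
    "key3" -> Seq(Contribution("type31", "name31", "id31"), Contribution("type32", "name32", "id32")),
    "key4" -> Seq.empty[Contribution]
  )
).toDF("contributors")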
With the following contributors input column:
+--------------------------------------------------------------------------------------------+
|contributors |
+--------------------------------------------------------------------------------------------+
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []} |
+--------------------------------------------------------------------------------------------+
You get the following output:
+----+------+------+----+
|key |type |name |id |
+----+------+------+----+
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
+----+------+------+----+
If you want to keep only the name and id columns from value, you should also modify the last select to keep only the col.name and col.id columns:
import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.name"), col("col.id"))
With the same contributors column input, you get your expected output:
+----+------+----+
|key |name |id |
+----+------+----+
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
+----+------+----+

Flattening map<string,string> column in spark scala

Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map type column and select all the columns which end with _id.
I'm using the code below.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES"))
exploded.show()
I am getting the below sample output.
+----------+--------------------+--------------------+
|      date|                 key|               value|
+----------+--------------------+--------------------+
|2021-05-18|             SYST_id|                  85|
|2021-05-18|            RECVR_id|               12345|
|2021-05-18|          Account_Id|               12345|
|2021-05-18|               Vb_id|                 845|
|2021-05-18|         SYS_INFO_id|                 640|
|2021-05-18|              mem_id|                 456|
+----------+--------------------+--------------------+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|      date|SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|     85|       1|     12345|  845|        640|   456|
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as the aggregation function, as sketched below.
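For instance, the collect_list variant could look like this (a sketch; each pivoted cell then becomes an array of all values seen for that date/key pair):
import org.apache.spark.sql.functions.collect_list

exploded.groupBy("date").pivot("key").agg(collect_list("value")).show()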
Edit:
To add srcId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns

exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
  .groupBy("date").pivot("key").agg(first("value")).show()
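Alternatively, since the stated goal is every attribute key ending in _id, a suffix filter could replace the explicit list (a sketch; the case-insensitive match is an assumption so that Account_Id is kept too):
import org.apache.spark.sql.functions.{first, lower}

exploded
  .filter(lower($"key").endsWith("_id"))
  .groupBy("date").pivot("key").agg(first("value"))
  .show()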

In PySpark how to parse an embedded JSON

I am new to PySpark.
I have a JSON file which has the schema below:
df = spark.read.json(input_file)
df.printSchema()
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns, type and UrlsInfo.element.DisplayUrl.
This is the code I tried, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related to JSON file parsing in PySpark, but it doesn't answer my question.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+

How to cast a WrappedArray[WrappedArray[(String, String)]] to Array[String] in Spark (Scala)

Hi,
I'm working with the Spark framework in Scala. My dataframe has a column with the following structure and content:
+---------------------------------------------------------------------------------------------+
|Email_Code |
+---------------------------------------------------------------------------------------------+
|[WrappedArray([3,spain]), WrappedArray([,]), WrappedArray([3,spain])] |
|[WrappedArray([3,spain]), WrappedArray([3,spain])] |
+---------------------------------------------------------------------------------------------+
|-- Email_Code: array (nullable = true)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = false)
| | | |-- Code: string (nullable = true)
| | | |-- Value: string (nullable = true)
I am trying to develop a UDF that takes all the values of the "Code" field present in the array, but I have not been able to.
I would like an output like the following:
+---------------------------------------------------------------------------------------------+
|Email_Code |
+---------------------------------------------------------------------------------------------+
|[3,,3] |
|[3,3] |
+---------------------------------------------------------------------------------------------+
Any help please?
I managed to fix it:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Flatten the nested array and keep only each struct's Code field.
val transformation = udf((data: Seq[Seq[Row]]) => data.flatten.map { case Row(code: String, value: String) => code })
df.withColumn("result", transformation($"columnName"))
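If you are on Spark 2.4 or later, the same result can likely be obtained without a UDF by flattening the nested array and extracting the field (a sketch, assuming the Email_Code column shown above):
import org.apache.spark.sql.functions.flatten

// flatten turns array<array<struct>> into array<struct>;
// extracting the Code field from it then yields an array<string>.
df.withColumn("result", flatten($"Email_Code").getField("Code"))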

How to project an array of structs in spark dataframe API

I have a Dataframe like this:
val df = Seq(
  Seq(("a", "b", "c"))
).toDF("arr")
  .select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,b,c]]|
+---------+
I want to select only c1 and c3, such that:
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|  [[a,c]]|
+---------+
Can this be done without a UDF?
I can do it with a UDF, but I'd like a solution without it, something like:
df.select($"arr.c1".as("arr"))
root
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
But this only works for selecting a single struct field. I've also tried:
df.select(array(struct($"arr.c1", $"arr.c3")).as("arr"))
but this gives:
root
|-- arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- c1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- c3: array (nullable = true)
| | | |-- element: string (containsNull = true)
I can only give an answer for the Python API but I am sure the Scala API has something very similar.
The key is the function arrays_zip, which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."
Example (still from the documentation):
from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
# Prints: [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]
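For completeness, arrays_zip is also available in the Scala API from Spark 2.4 onwards, so the original dataframe could presumably be projected like this (a sketch; the aliases are only there to keep the struct field names as c1 and c3):
import org.apache.spark.sql.functions.arrays_zip

// Field extraction on an array<struct> column yields an array of that field;
// arrays_zip stitches the two arrays back into an array<struct<c1,c3>>.
val projected = df.select(arrays_zip($"arr.c1".as("c1"), $"arr.c3".as("c3")).as("arr"))
projected.printSchema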