Spark: Check whether a value exists in a nested array without exploding - scala

I have a dataset like below:
val df = Seq(("beatles", Seq(Seq("help", "hey jude"))),
("romeo", Seq(Seq("help2", "hey judge"),Seq("help3", "they judge")))).toDF("col1", "col2")
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I want to add a column to the dataframe, hasHitSong, which will iterate the sequence of hitsongs under col2, check if a hit song exist, for eg. "Hey Jude" and mark it as 1, else 0.
| col1 | col2 | hasHitSongs |
|---------|-------------------------------------------------|-------------|
| beatles | ["help", "hey jude"] | 1 |
| romeo | [["help2", "hey judge"],["help3", "hey judge"]] | 0 |
Is there a way to do this without exploding the column col2 and just iterating the nested arrays under col2?

If you are using spark version 2.4 or higher version:
Using built-in function
df.withColumn("hasHitSongs", array_contains(flatten(col("col2")), "hey jude"))
Using higher order function
df.withColumn("hasHitSongs, expr("exists(col2, a -> exists(a, b -> b = 'hey jude'))"))

Related

Flattening map<string,string> column in spark scala

Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype.: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map type column and select all the columns which ends with _id.
Im using the below code.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES")).show()
I am getting the below sample output.
---+----------+--------------------+--------------------+
|date | key| value|
+----------+--------------------+--------------------+
|2021-05-18|SYST_id | 85|
|2021-05-18|RECVR_id | 1|
|2021-05-18|Account_Id| | 12345|
|2021-05-18|Vb_id | 845|
|2021-05-18|SYS_INFO_id | 640|
|2021-05-18|mem_id | 456|
------------------------------------------------------
However, my required output is as below.
+---+-------+--------------+-----------+------------+-------+-------------+-------+
|date | SYST_id | RECVR_id | Account_Id | Vb_id | SYS_INFO_id| mem_id|
+----+------+--------------+-----------+------------+-------+-------------+-------+
|2021-05-18| 85 | 1 | 12345 | 845 | 640 | 456 |
+-----------+--------------+-----------+------------+-------+-------------+-------+
Could someone pls assist.
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as aggregation function.
Edit:
To add scrId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()

Retrive subkey values of all the keys in json spark dataframe

i have a data frame with schema like below: (I have large number of keys )
|-- loginRequest: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
|-- loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
I want to create a column with status of all the keys of responseHeader.status
Expected
+--------------------+--------------------+------------+
| loginRequest| loginResponse| status |
+--------------------+--------------------+------------+
|[0,1] | null| 0 |
| null|[0,1] | 0 |
| null| [0,1]| 0 |
| null| [1,0]| 1 |
+--------------------+--------------------+-------------
Thanks in Advance
A simple select will solve your problem.
You have a nest field :
loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status
A quick way would be to flatten your dataframe.
Doing something like this:
df.select(df.col("loginRequest.*"),df.col("loginResponse.*"))
And get it working from there:
Or,
You could use something like this:
var explodeDF = df.withColumn("statusRequest", df("loginRequest. responseHeader"))
which you helped me into and these questions:
Flattening Rows in Spark
DataFrame explode list of JSON objects
In order to get it to populate either from response or request, you can use and when condition in spark.
- How to use AND or OR condition in when in Spark
You are able to get the subfields with the . delimiter in the select statement and with the help of the coalesce method, you should get exactly what you aim for, i.e. let's call the input dataframe df with your specified input schema, then this piece of code should do the work:
import org.apache.spark.sql.functions.{coalesce, col}
val df_status = df.withColumn("status",
coalesce(
col("loginRequest.responseHeader.status"),
col("loginResponse.responseHeader.status")
)
)
What coalesce does, is that it takes first non-null value in the order of the input columns to the method and in case there is no non-null value, it will return null (see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-).

PySpark Dataframe Transpose as List

I'm working with pyspark sql api, and trying to group rows with repeated values into a list of rest of contents. It's similar to transpose, but instead of pivoting all values, will put values into array.
Current output:
group_id | member_id | name
55 | 123 | jake
55 | 234 | tim
65 | 345 | chris
Desired output:
group_id | members
55 | [[123, 'jake'], [234, 'tim']]
65 | [345, 'chris']
You need to groupby the group_id and use pyspark.sql.functions.collect_list() as the aggregation function.
As for combining the member_id and name columns, you have two options:
Option 1: Use pyspark.sql.functions.array:
from pyspark.sql.functions import array, collect_list
df1 = df.groupBy("group_id")\
.agg(collect_list(array("member_id", "name")).alias("members"))
df1.show(truncate=False)
#+--------+-------------------------------------------------+
#|group_id|members |
#+--------+-------------------------------------------------+
#|55 |[WrappedArray(123, jake), WrappedArray(234, tim)]|
#|65 |[WrappedArray(345, chris)] |
#+--------+-------------------------------------------------+
This returns a WrappedArray of arrays of strings. The integers are converted to strings because you can't have mixed type arrays.
df1.printSchema()
#root
# |-- group_id: integer (nullable = true)
# |-- members: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: string (containsNull = true)
Option 2: Use pyspark.sql.functions.struct
from pyspark.sql.functions import collect_list, struct
df2 = df.groupBy("group_id")\
.agg(collect_list(struct("member_id", "name")).alias("members"))
df2.show(truncate=False)
#+--------+-----------------------+
#|group_id|members |
#+--------+-----------------------+
#|65 |[[345,chris]] |
#|55 |[[123,jake], [234,tim]]|
#+--------+-----------------------+
This returns an array of structs, with named fields for member_id and name
df2.printSchema()
#root
# |-- group_id: integer (nullable = true)
# |-- members: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- member_id: integer (nullable = true)
# | | |-- name: string (nullable = true)
What's useful about the struct method is that you can access elements of the nested array by name using the dot accessor:
df2.select("group_id", "members.member_id").show()
#+--------+----------+
#|group_id| member_id|
#+--------+----------+
#| 65| [345]|
#| 55|[123, 234]|
#+--------+----------+

Lookup table in Spark

I have a dataframe in Spark with no clearly defined schema that I want to use as a lookup table. For example, the dataframe below:
+------------------------------------------------------------------------+
|lookupcolumn |
+------------------------------------------------------------------------+
|[val1,val2,val3,val4,val5,val6] |
+------------------------------------------------------------------------+
The schema would look like this:
|-- lookupcolumn: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- key3: string (nullable = true)
| |-- key4: string (nullable = true)
| |-- key5: string (nullable = true)
| |-- key6: string (nullable = true)
I'm saying "schema not clearly defined" since the number of keys is unknown while the data is being read, so I leave it to Spark to infer the schema.
Now, if I have another dataframe with a column as below:
+-----------------+
| datacolumn|
+-----------------+
| key1 |
| key3 |
| key5 |
| key2 |
| key4 |
+-----------------+
and I want the result to be:
+-----------------+
| resultcolumn|
+-----------------+
| val1 |
| val3 |
| val5 |
| val2 |
| val4 |
+-----------------+
I tried a UDF like this:
val get_val = udf((keyindex: String) => {
val res = lookupDf.select($"lookupcolumn"(keyindex).alias("result"))
res.head.toString
})
But it throws a Null Pointer exception error.
Can someone tell me what's wrong with the UDF, and if there's a better/simpler way of doing this lookup in Spark?
I assume that the lookup table is quite small, in this case it would make more sense to collect it to the driver and convert it to a normal Map. Then use this Map in the UDF function. It can be done in many way, for example like this:
val values = lookupDf.select("lookupcolumn.*").head.toSeq.map(_.toString)
val keys = lookupDf.select("lookupcolumn.*").columns
val lookup_map = keys.zip(values).toMap
Using the above lookup_map variable, the UDF will simply be:
val lookup = udf((key: String) => lookup_map.get(key))
And the final dataframe can be obtained by:
val df2 = df.withColumn("resultcolumn", lookup($"datacolumn"))

Adding attribute of type Array[long] from existing attribute value in DF

I am using spark 2.0 and have a use case where I need to convert the attribute type of a column from string to Array[long].
Suppose I have a dataframe with schema :
root
|-- unique_id: string (nullable = true)
|-- column2 : string (nullable = true)
DF :
+----------+---------+
|unique_id | column2 |
+----------+---------+
| 1 | 123 |
| 2 | 125 |
+----------+---------+
now i want to add a new column with name "column3" of type Array[long]having the values from "column2"
like :
root
|-- unique_id: string (nullable = true)
|-- column2: long (nullable = true)
|-- column3: array (nullable = true)
| |-- element: long (containsNull = true)
new DF :
+----------+---------+---------+
|unique_id | column2 | column3 |
+----------+---------+---------+
| 1 | 123 | [123] |
| 2 | 125 | [125] |
+----------+---------+---------+
I there a way to achieve this ?
You can simply use withColumn and array function as
df.withColumn("column3", array(df("columnd")))
And I also see that you are trying to change the column2 from string to Long. A simple udf function should do the trick. So final solution would be
def changeToLong = udf((str: String) => str.toLong)
val finalDF = df
.withColumn("column2", changeToLong(col("column2")))
.withColumn("column3", array(col("column2")))
You need to import functions library too as
import org.apache.spark.sql.functions._