Lookup table in Spark - scala

I have a dataframe in Spark with no clearly defined schema that I want to use as a lookup table. For example, the dataframe below:
+------------------------------------------------------------------------+
|lookupcolumn |
+------------------------------------------------------------------------+
|[val1,val2,val3,val4,val5,val6] |
+------------------------------------------------------------------------+
The schema would look like this:
|-- lookupcolumn: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- key3: string (nullable = true)
| |-- key4: string (nullable = true)
| |-- key5: string (nullable = true)
| |-- key6: string (nullable = true)
I'm saying "schema not clearly defined" since the number of keys is unknown while the data is being read, so I leave it to Spark to infer the schema.
Now, if I have another dataframe with a column as below:
+-----------------+
| datacolumn|
+-----------------+
| key1 |
| key3 |
| key5 |
| key2 |
| key4 |
+-----------------+
and I want the result to be:
+-----------------+
| resultcolumn|
+-----------------+
| val1 |
| val3 |
| val5 |
| val2 |
| val4 |
+-----------------+
I tried a UDF like this:
val get_val = udf((keyindex: String) => {
val res = lookupDf.select($"lookupcolumn"(keyindex).alias("result"))
res.head.toString
})
But it throws a NullPointerException.
Can someone tell me what's wrong with the UDF, and if there's a better/simpler way of doing this lookup in Spark?

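The UDF fails because it references lookupDf inside the function body: a UDF runs on the executors, where a DataFrame (a driver-side object) cannot be used. Assuming the lookup table is quite small, it makes more sense to collect it to the driver and convert it to a normal Map, then use this Map in the UDF. This can be done in many ways, for example: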
val values = lookupDf.select("lookupcolumn.*").head.toSeq.map(_.toString)
val keys = lookupDf.select("lookupcolumn.*").columns
val lookup_map = keys.zip(values).toMap
Using the above lookup_map variable, the UDF will simply be:
val lookup = udf((key: String) => lookup_map.get(key))
And the final dataframe can be obtained by:
val df2 = df.withColumn("resultcolumn", lookup($"datacolumn"))
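A self-contained sketch of the same idea, with a UDF-free variant, might look as follows. The sample data, the local SparkSession and the names lookupMap, mapCol and result are assumptions for illustration only; typedLit turns the collected Map into a literal map column, so the lookup stays a built-in expression.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, typedLit}

// Local session for illustration only (assumption)
val spark = SparkSession.builder().master("local[*]").appName("lookup-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data shaped like the question's dataframes
val lookupDf = Seq(("val1", "val2", "val3"))
  .toDF("key1", "key2", "key3")
  .select(struct($"key1", $"key2", $"key3").as("lookupcolumn"))
val df = Seq("key1", "key3", "key2").toDF("datacolumn")

// Collect the single lookup row to the driver and build a plain Map,
// skipping null values so that toString cannot fail
val flat = lookupDf.select("lookupcolumn.*")
val lookupMap: Map[String, String] = flat.columns.zip(flat.head.toSeq)
  .collect { case (k, v) if v != null => k -> v.toString }
  .toMap

// Turn the Map into a literal map column and index it with the key column
val mapCol = typedLit(lookupMap)
val result = df.withColumn("resultcolumn", mapCol($"datacolumn"))
result.show()
```

Whether this is preferable to the UDF is a judgment call; it mainly avoids serializing a closure and keeps the plan visible to the optimizer.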

Flattening map<string,string> column in spark scala

Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map column and turn every key that ends with _id into its own column.
I'm using the code below.
val exploded = batch_df.select($"event_date".as("date"), explode($"ATTRIBUTES"))
exploded.show()
I am getting the below sample output.
+----------+-----------+-----+
|date      |key        |value|
+----------+-----------+-----+
|2021-05-18|SYST_id    |85   |
|2021-05-18|RECVR_id   |1    |
|2021-05-18|Account_Id |12345|
|2021-05-18|Vb_id      |845  |
|2021-05-18|SYS_INFO_id|640  |
|2021-05-18|mem_id     |456  |
+----------+-----------+-----+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|date      |SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|85     |1       |12345     |845  |640        |456   |
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as the aggregation function.
Edit:
To add srcId and srctype, simply add these columns to the select statement (and to the groupBy as well, if they should survive the pivot):
val exploded = batch_df.select($"event_date".as("date"), $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()
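Since the question asks for keys ending in _id, the explode, filter and pivot steps can be sketched end to end as below. The sample row, the local SparkSession and the names batchDf and pivoted are assumptions; the suffix check is done on the lower-cased key so that "Account_Id" also matches.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, first, lower}

// Local session for illustration only (assumption)
val spark = SparkSession.builder().master("local[*]").appName("pivot-map-sketch").getOrCreate()
import spark.implicits._

// Hypothetical row: a date plus a map<string,string> of attributes
val batchDf = Seq(
  ("2021-05-18", Map("SYST_id" -> "85", "Account_Id" -> "12345", "hostname" -> "h1"))
).toDF("date", "ATTRIBUTES")

// explode on a map column yields two columns named "key" and "value"
val exploded = batchDf.select($"date", explode($"ATTRIBUTES"))

// Keep only keys ending in "_id" (case-insensitively), then pivot keys into columns
val pivoted = exploded
  .filter(lower($"key").endsWith("_id"))
  .groupBy("date")
  .pivot("key")
  .agg(first("value"))
pivoted.show()
```

Filtering before the pivot keeps the number of generated columns small, which matters when the map holds many unrelated keys.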

Spark: Check whether a value exists in a nested array without exploding

I have a dataset like below:
val df = Seq(("beatles", Seq(Seq("help", "hey jude"))),
("romeo", Seq(Seq("help2", "hey judge"),Seq("help3", "they judge")))).toDF("col1", "col2")
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I want to add a column to the dataframe, hasHitSongs, which iterates over the nested sequences of songs under col2 and checks whether a given hit song exists, e.g. "hey jude", marking the row as 1 if it does and 0 otherwise.
| col1    | col2                                             | hasHitSongs |
|---------|--------------------------------------------------|-------------|
| beatles | [["help", "hey jude"]]                           | 1           |
| romeo   | [["help2", "hey judge"],["help3", "they judge"]] | 0           |
Is there a way to do this without exploding the column col2 and just iterating the nested arrays under col2?
If you are using Spark 2.4 or later:
Using built-in functions:
df.withColumn("hasHitSongs", array_contains(flatten(col("col2")), "hey jude"))
Using higher-order functions:
df.withColumn("hasHitSongs", expr("exists(col2, a -> exists(a, b -> b = 'hey jude'))"))
Both expressions produce a boolean column; append .cast("int") to get 1/0 instead.
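Both variants can be tried against the question's sample data with a sketch like the following. The local SparkSession and the names viaFlatten and viaExists are assumptions, and the cast to int is added only to match the 1/0 output the question asks for.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, expr, flatten}

// Local session for illustration only (assumption); requires Spark 2.4+
val spark = SparkSession.builder().master("local[*]").appName("nested-array-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("beatles", Seq(Seq("help", "hey jude"))),
  ("romeo", Seq(Seq("help2", "hey judge"), Seq("help3", "they judge")))
).toDF("col1", "col2")

// Variant 1: flatten the nested arrays into one array, then test membership
val viaFlatten = df.withColumn(
  "hasHitSongs", array_contains(flatten($"col2"), "hey jude").cast("int"))

// Variant 2: nested exists, one level per array dimension
val viaExists = df.withColumn(
  "hasHitSongs", expr("exists(col2, a -> exists(a, b -> b = 'hey jude'))").cast("int"))

viaFlatten.show(false)
viaExists.show(false)
```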

Pivot spark dataframe array of kv pairs into individual columns

I have following schema:
root
|-- id: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- config: struct (nullable = true)
| |-- entry: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key: string (nullable = true)
| | | |-- value: string (nullable = true)
There will be at most 3 key-value pairs (k1, k2, k3) in the array. I would like to turn each key into its own column, populated with the value from the same key-value pair:
+--------+----------+----------+----------+---------+
|id |date |k1 |k2 |k3 |
+--------+----------+----------+----------+---------+
| id1 |2019-08-12|id1-v1 |id1-v2 |id1-v3 |
| id2 |2019-08-12|id2-v1 |id2-v2 |id2-v3 |
+--------+----------+----------+----------+---------+
So far I tried something like this:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
.select($"id", $"date", $"config.entry" as "kvpairs")
.withColumn($"kvpairs".getItem(0).getField("key").toString(), $"kvpairs".getItem(0).getField("value"))
.withColumn($"kvpairs".getItem(1).getField("key").toString(), $"kvpairs".getItem(1).getField("value"))
.withColumn($"kvpairs".getItem(2).getField("key").toString(), $"kvpairs".getItem(2).getField("value"))
But in this case, the column names are shown as kvpairs[0][key], kvpairs[1][key] and kvpairs[2][key] as shown below:
+--------+----------+---------------+---------------+---------------+
|id |date |kvpairs[0][key]|kvpairs[1][key]|kvpairs[2][key]|
+--------+----------+---------------+---------------+---------------+
| id1 |2019-08-12| id1-v1 | id1-v2 | id1-v3 |
| id2 |2019-08-12| id2-v1 | id2-v2 | id2-v3 |
+--------+----------+---------------+---------------+---------------+
Two questions:
1. Is my approach right? Is there a better and easier way to pivot this such that I get one row per array, with the 3 kv pairs as 3 columns? I want to handle cases where the order of the kv pairs may differ.
2. If the above approach is fine, how do I alias the column name to the data of the "key" element in the array?
Using multiple withColumn together with getItem will not work since the order of the kv pairs may differ. What you can do instead is explode the array and then use pivot as follows:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
.select($"id", $"date", explode($"config.entry") as "exploded")
.select($"id", $"date", $"exploded.*")
.groupBy("id", "date")
.pivot("key")
.agg(first("value"))
The usage of first inside the aggregation here assumes there will be a single value for each key. Otherwise collect_list or collect_set can be used.
Result:
+---+----------+------+------+------+
|id |date      |k1    |k2    |k3    |
+---+----------+------+------+------+
|id1|2019-08-12|id1-v1|id1-v2|id1-v3|
|id2|2019-08-12|id2-v1|id2-v2|id2-v3|
+---+----------+------+------+------+
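Because the question states that the keys are known up front (k1, k2, k3), a shuffle-free alternative to explode plus pivot is to convert the array of pairs into a map and read each key directly. This is only a sketch under assumptions: it needs Spark 2.4+ for map_from_entries, assumes keys are unique within each array, and uses a simplified schema where the array sits in a top-level column named entry rather than under config.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.map_from_entries

// Local session for illustration only (assumption)
val spark = SparkSession.builder().master("local[*]").appName("kv-to-columns-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data; note that the pair order differs between the two rows
val sourceDF = Seq(
  ("id1", "2019-08-12", Seq(("k1", "id1-v1"), ("k2", "id1-v2"), ("k3", "id1-v3"))),
  ("id2", "2019-08-12", Seq(("k3", "id2-v3"), ("k1", "id2-v1"), ("k2", "id2-v2")))
).toDF("id", "date", "entry")

// array<struct<key, value>> -> map<key, value>
val withMap = sourceDF.withColumn("kv", map_from_entries($"entry"))

// One output column per known key, independent of the pair order
val keyCols = Seq("k1", "k2", "k3").map(k => $"kv".getItem(k).as(k))
val result = withMap.select(Seq($"id", $"date") ++ keyCols: _*)
result.show()
```

The pivot approach remains the better fit when the set of keys is not known in advance.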

How to count the elements in a column of arrays?

I'm trying to count the number of elements in FavouriteCities column in the following DataFrame.
+-----------------+
| FavouriteCities |
+-----------------+
| [NY, Canada] |
+-----------------+
The schema is as follows:
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Expected output should be something like,
+------------+-------------+
| City | Count |
+------------+-------------+
| NY | 1 |
| Canada | 1 |
+------------+-------------+
I have tried using agg() and count() as shown below, but this fails to extract the individual elements from the array and instead tries to find the most common set of elements in the column.
data.agg(count("FavouriteCities").alias("count"))
Can someone please guide me with this?
To match schema you've shown:
scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Explode:
val counts = data
.select(explode($"FavouriteCities") as "City")
.groupBy("City")
.count
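counts now holds one row per city with its count, which matches the expected output. To additionally find the most frequent city, aggregate: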
import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)

Adding attribute of type Array[long] from existing attribute value in DF

I am using spark 2.0 and have a use case where I need to convert the attribute type of a column from string to Array[long].
Suppose I have a dataframe with schema :
root
|-- unique_id: string (nullable = true)
|-- column2 : string (nullable = true)
DF :
+----------+---------+
|unique_id | column2 |
+----------+---------+
| 1 | 123 |
| 2 | 125 |
+----------+---------+
Now I want to add a new column named "column3" of type Array[long], holding the values from "column2", like:
root
|-- unique_id: string (nullable = true)
|-- column2: long (nullable = true)
|-- column3: array (nullable = true)
| |-- element: long (containsNull = true)
new DF :
+----------+---------+---------+
|unique_id | column2 | column3 |
+----------+---------+---------+
| 1 | 123 | [123] |
| 2 | 125 | [125] |
+----------+---------+---------+
Is there a way to achieve this?
You can simply use withColumn and the array function:
df.withColumn("column3", array(df("column2")))
I also see that you are trying to change column2 from string to long. A simple udf function should do the trick, so the final solution would be:
def changeToLong = udf((str: String) => str.toLong)
val finalDF = df
.withColumn("column2", changeToLong(col("column2")))
.withColumn("column3", array(col("column2")))
You need to import functions library too as
import org.apache.spark.sql.functions._
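As an alternative to the udf, the built-in Column.cast can do the string-to-long conversion. The sketch below uses the question's sample values; the local SparkSession is an assumption, and note that under default (non-ANSI) settings a string that is not a valid number becomes null rather than raising an error.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array

// Local session for illustration only (assumption)
val spark = SparkSession.builder().master("local[*]").appName("array-column-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("1", "123"), ("2", "125")).toDF("unique_id", "column2")

val finalDF = df
  .withColumn("column2", $"column2".cast("long")) // string -> long without a udf
  .withColumn("column3", array($"column2"))       // wrap the long in a one-element array

finalDF.printSchema()
finalDF.show()
```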