PySpark exploding nested JSON into multiple columns and rows

I am new to PySpark and not yet familiar with all the functions and capabilities it has to offer.
I have a PySpark DataFrame with a column which contains nested JSON values, for example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

rows = [['Alice', """{
    "level1": {
        "tag1": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
        }
    },
    "level2": {
        "tag1": {
            "key1": "value1"
        }
    },
    "level3": {
        "tag1": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
        },
        "tag2": {
            "key1": "value1"
        }
    }
}"""]]
columns = ['name', 'Levels']
df = spark.createDataFrame(rows, columns)
The number of levels, tags, and key:value pairs in each tag is not in my control and may change.
My goal is to create a new DataFrame from the original, with a new row for each (level, tag, key, value) tuple and the corresponding columns. Therefore, from the row in the example, there will be 8 new rows of the form:
(name, level, tag, key, value)
Alice, level1, tag1, key1, value1
Alice, level1, tag1, key2, value2
Alice, level1, tag1, key3, value3
Alice, level2, tag1, key1, value1
Alice, level3, tag1, key1, value1
Alice, level3, tag1, key2, value2
Alice, level3, tag1, key3, value3
Alice, level3, tag2, key1, value1

As a first step, the JSON is transformed into an array of (level, tag, key, value) tuples using a UDF. The second step is to explode the array to get the individual rows:
from pyspark.sql import functions as F
from pyspark.sql import types as T

df = ...  # DataFrame whose Levels column has been parsed into a struct

def to_array(lvl):
    # Walk the nested structure level -> tag -> key and emit flat tuples.
    def to_tuple(lvl):
        levels = lvl.asDict()
        for l in levels:
            tags = levels[l].asDict()
            for t in tags:
                keys = tags[t].asDict()
                for k in keys:
                    yield (l, t, k, keys[k])
    return list(to_tuple(lvl))

outputschema = T.ArrayType(T.StructType([
    T.StructField("level", T.StringType(), True),
    T.StructField("tag", T.StringType(), True),
    T.StructField("key", T.StringType(), True),
    T.StructField("value", T.StringType(), True)
]))

to_array_udf = F.udf(to_array, outputschema)

df.withColumn("tmp", to_array_udf("Levels")) \
  .withColumn("tmp", F.explode("tmp")) \
  .select("Levels", "tmp.*") \
  .show()
Output:
+--------------------+------+----+----+------+
| Levels| level| tag| key| value|
+--------------------+------+----+----+------+
|{{{value1, value2...|level1|tag1|key1|value1|
|{{{value1, value2...|level1|tag1|key2|value2|
|{{{value1, value2...|level1|tag1|key3|value3|
|{{{value1, value2...|level2|tag1|key1|value1|
|{{{value1, value2...|level3|tag1|key1|value1|
|{{{value1, value2...|level3|tag1|key2|value2|
|{{{value1, value2...|level3|tag1|key3|value3|
|{{{value1, value2...|level3|tag2|key1|value1|
+--------------------+------+----+----+------+
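Note that the UDF above traverses Levels with asDict(), so it assumes the elided df = ... step has already parsed the JSON string into a struct. If Levels is still the raw JSON string from the question, a minimal sketch of an alternative, assuming the nesting is always level -> tag -> key -> string value, is to parse it into nested maps with from_json and explode three times, with no UDF at all:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Assumed schema: three levels of string-keyed maps (level -> tag -> key -> value).
map_schema = T.MapType(
    T.StringType(),
    T.MapType(T.StringType(), T.MapType(T.StringType(), T.StringType()))
)

result = (
    df.withColumn("parsed", F.from_json("Levels", map_schema))
      .select("name", F.explode("parsed").alias("level", "tags"))
      .select("name", "level", F.explode("tags").alias("tag", "keys"))
      .select("name", "level", "tag", F.explode("keys").alias("key", "value"))
)
result.show()
This produces the same (name, level, tag, key, value) rows while avoiding the serialization overhead of a Python UDF.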

Related

Spark join dataframe based on column of different type spark 1.6

I have 2 DataFrames, df1 and df2. I am joining df1 and df2 based on columns col1 and col2. However, the datatype of col1 is string in df1 and the type of col2 is int in df2. When I try to join like below,
val df3 = df1.join(df2, df1("col1") === df2("col2"), "inner").select(df2("col2"))
the join does not work and returns an empty result. Is it possible to get the proper output without changing the type of col2 in df2?
val dDF1 = List("1", "2", "3").toDF("col1")
val dDF2 = List(1, 2).toDF("col2")
val res1DF = dDF1.join(dDF2, dDF1.col("col1") === dDF2.col("col2").cast("string"), "inner")
.select(dDF2.col("col2"))
res1DF.printSchema()
res1DF.show(false)
// root
// |-- col2: integer (nullable = false)
//
// +----+
// |col2|
// +----+
// |1 |
// |2 |
// +----+
Get the schema of a DataFrame:
val sch1 = dDF1.schema
sch1: org.apache.spark.sql.types.StructType = StructType(StructField(col1,StringType,true))
// public StructField(String name,
// DataType dataType,
// boolean nullable,
// Metadata metadata)
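For reference, a rough PySpark equivalent of the same cast-inside-join fix (df1, df2 and the column names are assumed to match the question):
# Cast the int column to string only inside the join condition,
# so df2's schema is left unchanged.
res1 = (
    df1.join(df2, df1["col1"] == df2["col2"].cast("string"), "inner")
       .select(df2["col2"])
)
res1.printSchema()
res1.show()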

Apache Spark SQL query and DataFrame as reference data

I have two Spark DataFrames:
cities DataFrame with the following column:
city
-----
London
Austin
bigCities DataFrame with the following column:
name
------
London
Cairo
I need to transform the cities DataFrame and add an additional Boolean column bigCity. The value of this column must be calculated based on the condition "cities.city IN bigCities.name".
I can do this in the following way (with a static bigCities collection):
cities.createOrReplaceTempView("cities")
var resultDf = spark.sql("SELECT city, CASE WHEN city IN ('London', 'Cairo') THEN 'Y' ELSE 'N' END AS bigCity FROM cities")
but I don't know how to replace the static bigCities collection ['London', 'Cairo'] with bigCities DataFrame in the query. I want to use bigCities as the reference data in the query.
Please advise how to achieve this.
val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter").
withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")).
drop("name")
You can use collect_list() on the bigCities table. Check this out:
scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]
scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]
scala> df_city.createOrReplaceTempView("cities")
scala> df_bigCities.createOrReplaceTempView("bigCities")
scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y |
|Austin|N |
+------+-------+
scala>
If the dataset is big, you can use collect_set, which deduplicates the collected values and will be more efficient.
scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y |
|Austin|N |
+------+-------+
scala>

Join two rdds in Spark where first rdd's value is second rdd's key

There are two RDDs. The first one is a (key, value) pair RDD, rdd_1:
key1, [value1, value2]
The second one is also a (key, value) pair RDD, rdd_2:
(key2, value3), (key3, value4)...
I want to join rdd_1 and rdd_2, where rdd_1's value1 & value2 serve as keys (key2) in rdd_2. The result that I want is
key1, [value1: value3, value2: value4]
I can process rdd_1 with flatMap and then swap the order, which means:
key1, [value1, value2] -> (key1, value1), (key1, value2) -> (value1, key1), (value2, key1)
then join with rdd_2, and then swap the order again & merge by key1...
Is there a more efficient way to do it? Thanks.
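For reference, the flatMap-and-swap pipeline described above would look roughly like this (a sketch; the rdd_1 and rdd_2 names are assumed):
# (key1, [value1, value2]) -> (value1, key1), (value2, key1)
pairs = rdd_1.flatMap(lambda kv: [(v, kv[0]) for v in kv[1]])
# join on rdd_2's key: (value1, (key1, value3)), (value2, (key1, value4))
joined = pairs.join(rdd_2)
# swap back to key1 and merge: key1 -> {value1: value3, value2: value4}
result = (joined.map(lambda kv: (kv[1][0], (kv[0], kv[1][1])))
                .groupByKey()
                .mapValues(dict))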
Why not use DataFrames? They are much faster than RDDs.
With DataFrames, you can do something like this:
from pyspark.sql import functions as f

x = [(0, [1, 2]), (1, [7, 8])]
y = [(1, 4), (2, 6), (8, 4), (7, 3)]
x = spark.createDataFrame(sc.parallelize(x)).toDF("id", "vals")
y = spark.createDataFrame(sc.parallelize(y)).toDF("id2", "val")

(x.join(y, f.expr("array_contains(vals, id2)"))
  .select("id", f.struct(["id2", "val"]).alias("map"))
  .groupBy("id")
  .agg(f.collect_list("map").alias("map"))).show()
+---+--------------+
| id| map|
+---+--------------+
| 0|[[1,4], [2,6]]|
| 1|[[8,4], [7,3]]|
+---+--------------+
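If an actual key-to-value map (as in the desired key1, [value1: value3, ...] output) is preferred over a list of structs, Spark 2.4+ has map_from_entries; a sketch on the same x and y DataFrames:
from pyspark.sql import functions as f

# Build the map directly from the collected (id2, val) struct entries.
(x.join(y, f.expr("array_contains(vals, id2)"))
  .groupBy("id")
  .agg(f.map_from_entries(f.collect_list(f.struct("id2", "val"))).alias("map"))
  .show())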

Spark dataframe explode pair of lists

My DataFrame has 2 columns which look like this:
col_id       | col_name
-------------------------------
[id1, id2]   | [name1, name2]
[id3, id4]   | [name3, name4]
....
so for each row, there are 2 matching arrays of the same length in columns col_id and col_name. What I want is to get each pair of id/name as a separate row, like:
col_id| col_name
-----------
id1 | name1
-----------
id2 | name2
....
explode seems like the function to use but I can't seem to get it to work. What I tried is:
rdd.explode(col("col_id"), col("col_name")) ({
case row: Row =>
val ids: java.util.List[String] = row.getList(0)
val names: java.util.List[String] = row.getList(1)
var res: Array[(String, String)] = new Array[(String, String)](ids.size)
for (i <- 0 until ids.size) {
res :+ (ids.get(i), names.get(i))
}
res
})
This however returns only nulls so it might just be my poor knowledge of Scala. Can anyone point out the issue?
Looks like the last 10 minutes out of the past 1-2 hours did the trick. The problem with the snippet above is that res :+ (...) builds a new array and discards it instead of appending to res in place, so the pre-allocated array of nulls was returned. This works just fine:
df.explode(col("id"), col("name")) ({
case row: Row =>
val ids: List[String] = row.getList(0).asScala.toList
val names: List[String] = row.getList(1).asScala.toList
ids zip names
})
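On newer Spark versions the deprecated Dataset.explode can be avoided altogether. A PySpark sketch of the same zip-and-explode idea using arrays_zip (Spark 2.4+; the col_id/col_name column names from the question are assumed):
from pyspark.sql import functions as F

result = (
    df.select(F.explode(F.arrays_zip("col_id", "col_name")).alias("z"))
      .select(F.col("z.col_id").alias("col_id"),
              F.col("z.col_name").alias("col_name"))
)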

Create a dataframe from a hashmap with keys as column names and values as rows in Spark

I have a DataFrame with a column which is a map, like this:
scala> df.printSchema
root
|-- A1: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I need to select all the keys of the map as column names and the values as rows.
For example, let's say I have 2 records like this:
1. key1 -> value1, key2 -> value2, key3 -> value3 ....
2. key1 -> value11, key3 -> value13, key4 -> value14 ...
I want the output DataFrame as:
key1    | key2   | key3    | key4
--------+--------+---------+--------
value1  | value2 | value3  | null
value11 | null   | value13 | value14
How can I do this?
First we need to create an id column by which we can group your data, then explode the map column A1, and finally reshape your df using pivot():
import org.apache.spark.sql.functions.{monotonically_increasing_id, explode, first}
df.withColumn("id", (monotonically_increasing_id()))
.select($"id", explode($"A1"))
.groupBy("id")
.pivot("key")
.agg(first("value")).show()
+---+-------+------+-------+-------+
| id| key1| key2| key3| key4|
+---+-------+------+-------+-------+
| 0| value1|value2| value3| null|
| 1|value11| null|value13|value14|
+---+-------+------+-------+-------+
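The same explode-plus-pivot idea in PySpark, for reference (column name A1 as in the question):
from pyspark.sql import functions as F

pivoted = (
    df.withColumn("id", F.monotonically_increasing_id())
      .select("id", F.explode("A1"))   # explode on a map yields key/value columns
      .groupBy("id")
      .pivot("key")
      .agg(F.first("value"))
)
pivoted.show()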
Assuming that the column with the map is named "my_map":
Get the set of unique keys (skip this step if you already have the keys beforehand):
import org.apache.spark.sql.functions._

val keys = df
  .select(explode(expr("map_keys(my_map)")).as("keys_to_rows"))
  .agg(collect_set("keys_to_rows"))
  .collect()
  .head.getSeq[String](0)
Select the map values by keys as columns:
df.select(
  keys.map(key => col(s"my_map.$key").as(key)): _*
)