Unable to explode() Map[String, Struct] in Spark - scala

I've been struggling with this for a while and still can't wrap my head around it.
I'm trying to flatMap (or use .withColumn with explode() instead, as it seems easier so I don't lose column names), but I always get the error UDTF expected 2 aliases but got 'name' instead.
I've revisited some similar questions, but none of them shed any light as their schemas are too simple.
The column of the schema I'm trying to perform flatMap with is the following one...
StructField(CarMake,
  StructType(
    List(
      StructField(
        Models,
        MapType(
          StringType,
          StructType(
            List(
              StructField(Variant, StringType),
              StructField(GasOrPetrol, StringType)
            )
          )
        )
      )
    )
  )
)
What I'm trying to achieve by calling explode() like this...
carsDS
.withColumn("modelsAndVariant", explode($"carmake.models"))
...is to get a Row without that nested Map and Struct, so I end up with as many rows as there are variants.
Example input
(country: Sweden, carMake: Volvo, carMake.Models: {"850": ("T5", "petrol"), "V50": ("T5", "petrol")})
Example output
(country: Sweden, carMake: Volvo, Model: "850", Variant: "T5", GasOrPetrol: "petrol")
(country: Sweden, carMake: Volvo, Model: "V50", Variant: "T5", GasOrPetrol: "petrol")
Basically, flattening the nested Map and its inner Struct so that everything ends up at the same level.

Try this:
case class Models(variant: String, gasOrPetrol: String)
case class CarMake(brand: String, models: Map[String, Models])
case class MyRow(carMake: CarMake)

val df = List(
  MyRow(CarMake("volvo", Map(
    "850" -> Models("T5", "petrol"),
    "V50" -> Models("T5", "petrol")
  )))
).toDF()
df.printSchema()
df.show()
gives
root
|-- carMake: struct (nullable = true)
| |-- brand: string (nullable = true)
| |-- models: map (nullable = true)
| | |-- key: string
| | |-- value: struct (valueContainsNull = true)
| | | |-- variant: string (nullable = true)
| | | |-- gasOrPetrol: string (nullable = true)
+--------------------+
| carMake|
+--------------------+
|[volvo, [850 -> [...|
+--------------------+
Now explode. Note that withColumn does not work here, because explode on a map returns two columns (key and value), so you need to use select:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, explode}

val cols: Array[Column] = df.columns.map(col)

df
  .select((cols :+ explode($"carMake.models")): _*)
  .select((cols :+ $"key".as("model") :+ $"value.*"): _*)
  .show()
gives:
+--------------------+-----+-------+-----------+
| carMake|model|variant|gasOrPetrol|
+--------------------+-----+-------+-----------+
|[volvo, [850 -> [...| 850| T5| petrol|
|[volvo, [850 -> [...| V50| T5| petrol|
+--------------------+-----+-------+-----------+
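If you also want the brand flattened out instead of carrying the whole carMake struct (closer to the question's example output), a small variation on the same idea; a sketch, not part of the original answer:
df
  .select($"carMake.brand".as("brand"), explode($"carMake.models"))
  .select($"brand", $"key".as("model"), $"value.*")
  .show()
This yields one row per model: (volvo, 850, T5, petrol) and (volvo, V50, T5, petrol).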

Related

Concatenate keys with first element in the values array in a MapType column

The schema of the dataframe is given below.
|-- idMap: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- linked: boolean (nullable = true)
If a row has 3 keys, for example, I'm trying to convert it to a new string column of the format key1:id;key2:id;key3:id, where id comes from the element at index 0 of the value array.
What I've tried is:
1. Collect the keys to a list.
2. Create a list of columns from the list of keys:
val expr = new scala.collection.mutable.ListBuffer[org.apache.spark.sql.Column]
keyList.foldLeft(expr)((expr, key) => expr += (lit(key), lit(":"), col("idMap")(key)(0)("id"), lit(";")))
3. Add a new column with the list of columns passed to concat:
val finalDf = df.withColumn("concatColumn", concat(expr.toList:_*))
It's giving me a null column so I'm assuming this approach is flawed. Any input would be appreciated.
Edit: mck's answer below works. Using concat_ws in step 3 works as well (concat_ws takes the separator as its first argument and, unlike concat, skips null columns):
val finalDf = df.withColumn("concatColumn", concat_ws("", expr.toList: _*))
If you have Spark 3, you can use transform_values to transform the map column to get your desired output.
// sample dataframe
val df = spark.sql("select map('key1', array(struct('id1' id, true linked)), 'key2', array(struct('id2' id, false linked))) idMap")
val df2 = df.withColumn(
  "concatColumn",
  expr("""
    concat_ws(';',
      map_values(
        transform_values(
          idMap,
          (k, v) -> concat(k, ':', transform(v, y -> y.id)[0])
        )
      )
    )
  """)
)
df2.show(false)
+-----------------------------------------------+-----------------+
|idMap |concatColumn |
+-----------------------------------------------+-----------------+
|[key1 -> [[id1, true]], key2 -> [[id2, false]]]|key1:id1;key2:id2|
+-----------------------------------------------+-----------------+
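If you are on Spark 2.x and transform_values is not available, one possible workaround (a sketch, not from the original answer) is to explode the map, build the key:id pieces, and re-aggregate them per row. Note that collect_list does not guarantee the order of the pieces.
import org.apache.spark.sql.functions._

val withId = df.withColumn("rowId", monotonically_increasing_id())

val concatenated = withId
  .select(col("rowId"), explode(col("idMap")))      // explode on a map yields key and value columns
  .withColumn("kv", concat(col("key"), lit(":"), col("value")(0)("id")))
  .groupBy("rowId")
  .agg(concat_ws(";", collect_list("kv")).as("concatColumn"))

val result = withId.join(concatenated, Seq("rowId"), "left").drop("rowId")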

Scala compare dataframe complex array type field

I'm trying to create a dataframe to feed to a function as part of my unit tests. If I have the following
val myDf = sparkSession.sqlContext.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    Row(Some(Seq(MyObject(1024, 100001D), MyObject(1, -1D))))
  )),
  StructType(List(
    StructField("myList", ArrayType[???], true)
  ))
)
MyObject is a case class.
I don't know what to put for the object type. Any suggestions? I've tried ArrayType of pretty much every combination I can think of.
I'm looking for a dataframe that looks something like:
+--------------------+
| myList |
+--------------------+
| [1024, 100001] |
| [1, -1] |
+--------------------+
Coming at it the reverse way...
val s = Seq(Array(1024, 100001D), Array(1, -1D)).toDS().toDF("myList")
println(s.schema)
s.printSchema
s.show
Your schema comes out as below... DoubleType appears because 100001D and -1D are doubles.
StructType(StructField(myList,ArrayType(DoubleType,false),true))
Output you needed:
root
|-- myList: array (nullable = true)
| |-- element: double (containsNull = false)
+------------------+
| myList|
+------------------+
|[1024.0, 100001.0]|
| [1.0, -1.0]|
+------------------+
Or you can also do it this way.
case class MyObject(a: Int, b: Double)
val s = Seq(MyObject(1024, 100001D), MyObject(1, -1D)).toDS()
.select(struct($"a",$"b").as[MyObject] as "myList")
println(s.schema)
s.printSchema
s.show
Result:
//schema :
StructType(StructField(myList,StructType(StructField(a,IntegerType,false), StructField(b,DoubleType,false)),false))
root
|-- myList: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: double (nullable = false)
+----------------+
| myList|
+----------------+
|[1024, 100001.0]|
| [1, -1.0]|
+----------------+
Try this
scala> case class MyObject(prop1:Int, prop2:Double)
defined class MyObject
scala> val df = Seq((1024, 100001D), (1, -1D)).toDF("prop1","prop2").select(struct($"prop1",$"prop2").as[MyObject] as "myList")
df: org.apache.spark.sql.DataFrame = [myList: struct<prop1: int, prop2: double>]
scala> df.show(false)
+----------------+
|myList |
+----------------+
|[1024, 100001.0]|
|[1, -1.0] |
+----------------+
scala> df.printSchema
root
|-- myList: struct (nullable = false)
| |-- prop1: integer (nullable = false)
| |-- prop2: double (nullable = false)
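If you specifically want to keep the createDataFrame(rowRDD, schema) shape from the question, note that ArrayType takes an element DataType rather than a class, and the row data has to be plain Rows instead of case-class instances. A minimal sketch, assuming the two MyObject fields are an Int and a Double (named prop1 and prop2 as in the answer above):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// element type describing MyObject; the field names here are an assumption
val elementType = StructType(List(
  StructField("prop1", IntegerType, false),
  StructField("prop2", DoubleType, false)
))

val myDf = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(Seq(
    // array elements must be Rows matching elementType (use null rather than Option for missing values)
    Row(Seq(Row(1024, 100001D), Row(1, -1D)))
  )),
  StructType(List(StructField("myList", ArrayType(elementType, false), true)))
)

myDf.show()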

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

I got this error while using the code below to drop a nested column with PySpark. Why is this not working? I tried using a tilde instead of !=, as the error suggests, but it doesn't work either. So what do you do in that case?
def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm,
                            df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
I built a simple example with a struct column and a few dummy columns:
from pyspark import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
schema = StructType(
    [
        StructField('addresses',
                    StructType(
                        [StructField("state", StringType(), True),
                         StructField("street", StringType(), True),
                         StructField("country", StringType(), True),
                         StructField("code", IntegerType(), True)]
                    ))
    ]
)

rdd = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
       ({'state': 'ca', 'street': 'baker', 'country': 'USA', 'code': 101},)]
df = sql_context.createDataFrame(rdd, schema)
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))
print(df.show())
print(df.printSchema())
Output:
+--------------------+-----------+----+
| addresses| id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+
root
|-- addresses: struct (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- country: string (nullable = true)
| |-- code: integer (nullable = true)
|-- id: long (nullable = false)
|-- name: string (nullable = false)
To drop the whole struct column, you can simply use the drop function:
df2 = df.drop('addresses')
print(df2.show())
Output:
+-----------+----+
| id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
To drop specific fields in a struct column, it's a bit more complicated - there are some other similar questions here:
Dropping a nested column from Spark DataFrame
Dropping nested column of Dataframe with PySpark
In any case, I found them to be a bit complicated - my approach would just be to reassign the original column with the subset of struct fields you want to keep:
columns_to_keep = ['country', 'code']
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+----------+-----------+----+
| addresses| id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you just wanted to specify the columns you want to remove rather than the columns you want to keep:
columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+------------+-----------+----+
| addresses| id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
Hope this helps!

Converting a dataframe to an array of struct of column names and values

Suppose I have a dataframe like this
val customer = Seq(
("C1", "Jackie Chan", 50, "Dayton", "M"),
("C2", "Harry Smith", 30, "Beavercreek", "M"),
("C3", "Ellen Smith", 28, "Beavercreek", "F"),
("C4", "John Chan", 26, "Dayton","M")
).toDF("cid","name","age","city","sex")
How can I get the cid values in one column and the rest of the values in an array<struct<column_name, column_value>> in Spark?
The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:
val cols = customer.columns.tail
val result = customer.select('cid,
array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]] |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]] |
+---+-----------------------------------------------------------+
result.printSchema()
root
|-- cid: string (nullable = true)
|-- array: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = false)
| | |-- value: string (nullable = true)
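If you later want key-based lookups on those name/value pairs, map_from_entries can turn the array into a map (a sketch, assuming Spark 2.4+; not part of the original answer):
result.select($"cid", map_from_entries($"array").as("attrs")).show(false)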
You can do it using array and struct functions:
customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))
will make:
|-- cid: string (nullable = true)
|-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- column_name: string (nullable = false)
| | |-- column_value: string (nullable = true)
Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map, without having to cast it to string.
df.select('cid',
          create_map(lit("name"), col("name"), lit("age"), col("age"),
                     lit("city"), col("city"), lit("sex"), col("sex")
          ).alias('map_col')
)
or wrap the map col in an array if you want it
This way you can still do numerical or string transformations on the relevant key or value. For example:
df2 = df.select('cid',
                create_map(lit("name"), col("name"), lit("age"), col("age"),
                           lit("city"), col("city"), lit("sex"), col("sex")
                ).alias('map_col')
)

df2.select('*',
           map_concat(col('map_col'), create_map(lit('u_age'), when(col('map_col')['age'] < 18, True)))
)
Hope that makes sense; I typed this straight in here, so forgive me if there's a bracket missing somewhere.

Explode array of structs to columns in Spark

I'd like to explode an array of structs to columns (as defined by the struct fields). E.g.
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- name: string (nullable = true)
Should be transformed to
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
I can achieve this with
df
.select(explode($"arr").as("tmp"))
.select($"tmp.*")
How can I do that in a single select statement?
I thought this could work, unfortunately it does not:
df.select(explode($"arr")(".*"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: No
such struct field .* in col;
A single-step solution is available only for MapType columns:
val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF
df.select(explode($"_1") as Seq("foo", "bar")).show
+---+---+
|foo|bar|
+---+---+
| 1|bar|
| 2|foo|
+---+---+
With arrays you can use flatMap:
val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
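Note that the flatMap result comes back with generic tuple column names (_1, _2); a toDF call restores the names (a small addition, not in the original answer):
df.as[Seq[(Long, String)]].flatMap(identity).toDF("id", "name")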
A single SELECT statement can be written in SQL:
df.createOrReplaceTempView("df")
spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")
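Depending on your Spark version, the inline generator can also cover the single-step case for arrays of structs; a sketch against the question's df with its arr column, assuming Spark 2.0 or later:
df.selectExpr("inline(arr)")   // expands each struct element into id and name columns, one row per element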