Explode array of structs to columns in Spark - scala

I'd like to explode an array of structs to columns (as defined by the struct fields). E.g.
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- name: string (nullable = true)
Should be transformed to
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
I can achieve this with
df
.select(explode($"arr").as("tmp"))
.select($"tmp.*")
How can I do that in a single select statement?
I thought this could work, unfortunately it does not:
df.select(explode($"arr")(".*"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field .* in col;

A single-step solution is available only for MapType columns:
val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF
df.select(explode($"_1") as Seq("foo", "bar")).show
+---+---+
|foo|bar|
+---+---+
| 1|bar|
| 2|foo|
+---+---+
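The row expansion explode performs on a map can be sketched in plain Python (a hypothetical helper for illustration, not the Spark API):

```python
def explode_map(m):
    # One (key, value) output row per map entry, like explode on a MapType column.
    return [(k, v) for k, v in m.items()]

rows = explode_map({1: "bar", 2: "foo"})
# -> [(1, 'bar'), (2, 'foo')], matching the foo/bar columns above
```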
With arrays you can use flatMap:
val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
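What flatMap(identity) does here is flatten each row's array into individual rows; sketched in plain Python (illustration only, not Spark code):

```python
def flat_map_identity(rows):
    # Each input row holds a sequence of tuples; emit one output row per tuple.
    return [element for row in rows for element in row]

flat = flat_map_identity([[(1, "bar"), (2, "foo")]])
# -> [(1, 'bar'), (2, 'foo')]
```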
A single SELECT statement can be written in SQL:
df.createOrReplaceTempView("df")
spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")

Related

In SparkSQL how could I select a subset of columns from a nested struct and keep it as a nested struct in the result using SQL statement?

I can do the following statement in SparkSQL:
result_df = spark.sql("""select
one_field,
field_with_struct
from purchases""")
The resulting data frame will have the field with the full struct in field_with_struct.
one_field | field_with_struct
--------- | -----------------------
123       | {name1,val1,val2,f2,f4}
555       | {name2,val3,val4,f6,f7}
I want to select only a few fields from field_with_struct, but keep them in a struct in the resulting data frame. Ideally something like this (this is not real code):
result_df = spark.sql("""select
one_field,
struct(
field_with_struct.name,
field_with_struct.value2
) as my_subset
from purchases""")
To get this:
one_field | my_subset
--------- | ------------
123       | {name1,val2}
555       | {name2,val4}
Is there any way of doing this with SQL? (not with fluent API)
There's a much simpler solution using arrays_zip; no need to explode/collect_list (which can be error-prone and difficult with complex data, since it relies on something like an id column):
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
| | |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
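The element-wise pairing arrays_zip performs can be mimicked with Python's zip (a plain-Python sketch, not the Spark implementation):

```python
def arrays_zip(*arrays):
    # Combine the i-th element of each input array into one struct (here: a tuple).
    return [tuple(elems) for elems in zip(*arrays)]

structs = [{"x": 1, "y": 2, "z": 3}, {"x": 2, "y": 3, "z": 4}]
xs = [s["x"] for s in structs]  # array_of_structs.x
ys = [s["y"] for s in structs]  # array_of_structs.y
zipped = arrays_zip(xs, ys)
# -> [(1, 2), (2, 3)], matching the [{1, 2}, {2, 3}] column above
```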
EDIT: Adding the corresponding Spark SQL code, since the OP requested it:
>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
In fact, the pseudo-code I provided works. For a nested array of objects it's not so straightforward: first the array has to be exploded (with the EXPLODE() function), then a subset selected, and after that COLLECT_LIST() can reassemble the array.
WITH
unfold_by_items AS (SELECT id, EXPLODE(Items) AS item FROM spark_tbl_items)
, format_items as (SELECT
id
, STRUCT(
item.item_id
, item.name
) AS item
FROM unfold_by_items)
, fold_by_items AS (SELECT id, COLLECT_LIST(item) AS Items FROM format_items GROUP BY id)
SELECT * FROM fold_by_items
This selects only two fields from the struct inside Items, and in the end returns a dataset that again contains an array in Items.
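The explode, subset, collect_list round trip described above can be sketched in plain Python (hypothetical data and helper, for illustration only):

```python
from collections import defaultdict

def explode_subset_collect(rows, fields):
    # EXPLODE each (id, items) row, keep only `fields` of each item, regroup by id.
    grouped = defaultdict(list)
    for rid, items in rows:
        for item in items:                                     # EXPLODE
            grouped[rid].append({f: item[f] for f in fields})  # STRUCT subset
    return list(grouped.items())                               # GROUP BY + COLLECT_LIST

rows = [(1, [{"item_id": 10, "name": "a", "price": 5.0}])]
result = explode_subset_collect(rows, ["item_id", "name"])
# -> [(1, [{'item_id': 10, 'name': 'a'}])]
```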

Converting a dataframe to an array of struct of column names and values

Suppose I have a dataframe like this
val customer = Seq(
("C1", "Jackie Chan", 50, "Dayton", "M"),
("C2", "Harry Smith", 30, "Beavercreek", "M"),
("C3", "Ellen Smith", 28, "Beavercreek", "F"),
("C4", "John Chan", 26, "Dayton","M")
).toDF("cid","name","age","city","sex")
How can I get the cid values in one column and the rest of the values in an array<struct<column_name, column_value>> in Spark?
The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:
val cols = customer.columns.tail
val result = customer.select('cid,
array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]] |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]] |
+---+-----------------------------------------------------------+
result.printSchema()
root
|-- cid: string (nullable = true)
|-- array: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = false)
| | |-- value: string (nullable = true)
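The reshaping above (pairing each non-key column name with its value, cast to a string) can be sketched in plain Python with a hypothetical helper, not the Spark API:

```python
def to_name_value_array(row, key):
    # Keep the key column as-is; turn every other column into a (name, str(value)) struct.
    return row[key], [(c, str(v)) for c, v in row.items() if c != key]

row = {"cid": "C1", "name": "Jackie Chan", "age": 50, "city": "Dayton", "sex": "M"}
cid, arr = to_name_value_array(row, "cid")
# -> 'C1', [('name', 'Jackie Chan'), ('age', '50'), ('city', 'Dayton'), ('sex', 'M')]
```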
You can do it using array and struct functions:
customer.select($"cid", array(
  struct(lit("name") as "column_name", $"name" as "column_value"),
  struct(lit("age") as "column_name", $"age" as "column_value")
))
will make:
|-- cid: string (nullable = true)
|-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- column_name: string (nullable = false)
| | |-- column_value: string (nullable = true)
Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map, without having to cast it to string.
df.select('cid',
create_map(lit("name"), col("name"), lit("age"), col("age"),
lit("city"), col("city"), lit("sex"),col("sex")
).alias('map_col')
)
or wrap the map column in an array if you want it.
This way you can still do numerical or string transformations on the relevant key or value. For example:
df = df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"), col("sex")
    ).alias('map_col')
)
df.select('*',
    map_concat(col('map_col'), create_map(lit('u_age'), when(col('map_col')['age'] < 18, True)))
)
Hope that makes sense, typed this straight in here so forgive if there's a bracket missing somewhere

In PySpark how to parse an embedded JSON

I am new to PySpark.
I have a JSON file which has below schema
df = spark.read.json(input_file)
df.printSchema()
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns: type and UrlsInfo.element.DisplayUrl.
Here is my attempt, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related JSON file parsing in Pyspark, but doesn't answer my question.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+

Convert array of vectors to DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.
Here is my dataframe:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
For example, I need to extract the value of the hashValues column into a DenseVector for id 401310732094.
This can be done with a UDF:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
This will overwrite the hashValues column with a new one containing a DenseVector.
Tested with a dataframe with following schema:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
The result is:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)
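The flattening the UDF performs, concatenating the elements of every vector in the array, can be sketched in plain Python (vectors modeled as lists, not MLlib types):

```python
def flatten_vectors(vectors):
    # Concatenate all vector elements into one flat list: the DenseVector contents.
    return [value for vec in vectors for value in vec]

dense = flatten_vectors([[-949518.0], [47.0]])
# -> [-949518.0, 47.0]
```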

udf Function for DataType casting, Scala

I have the following DataFrame:
df.show()
+---------------+----+
| x| num|
+---------------+----+
|[0.1, 0.2, 0.3]| 0|
|[0.3, 0.1, 0.1]| 1|
|[0.2, 0.1, 0.2]| 2|
+---------------+----+
The columns of this DataFrame have the following data types:
df.printSchema
root
|-- x: array (nullable = true)
| |-- element: double (containsNull = true)
|-- num: long (nullable = true)
I am trying to convert the array of doubles inside the DataFrame to an array of floats. I do it with the following udf:
val toFloat = udf[(val line: Seq[Double]) => line.map(_.toFloat)]
val test = df.withColumn("testX", toFloat(df("x")))
This code does not work. Can anybody show me how to change the array type inside the DataFrame?
What I want is:
df.printSchema
root
|-- x: array (nullable = true)
| |-- element: float (containsNull = true)
|-- num: long (nullable = true)
This question is based on the question How to change the simple DataType in Spark SQL's DataFrame.
Your udf is wrongly declared. You should write it as follows:
val toFloat = udf((line: Seq[Double]) => line.map(_.toFloat))
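Note that mapping Double to Float narrows the values; a plain-Python sketch of that narrowing using the struct module (Python itself has only 64-bit floats):

```python
import struct

def to_float32(values):
    # Round-trip each double through a 32-bit float, like line.map(_.toFloat) in Scala.
    return [struct.unpack("f", struct.pack("f", v))[0] for v in values]

narrowed = to_float32([0.1, 0.2, 0.3])
# exactly representable values survive; 0.1 becomes the nearest 32-bit float
```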