concatenating json list attributes in pyspark into one value - pyspark

I'm using PySpark 3.0.1 and I have a JSON file where I need to parse a JSON column. The JSON looks as follows:
df1.select("mycol").show()
[
{"l1": 0, "l2": "abc", "l3": "xyz"},
{"l1": 1, "l2": "def", "l3": "xzz"},
{"l1": 2, "l2": "ghi", "l3": "yyy"},
]
I want either a dataframe column or a string that returns the following output, in the form "l2.value: l3.value" for each object in the list.
abc: xyz
def: xzz
ghi: yyy
So far I have this:
df1.createOrReplaceTempView("MY_TEST")
select mc.l2 ||": "|| mc.l3 from (select explode(mycol) as mc from MY_TEST)
It does give me the result I want, but each line ends up in a different row because of the explode. I need it all in one single row or one single string (including end-of-line characters):
concat(concat(mc.l2 AS l2, : ), mc.l3 AS l3)
abc: xyz
def: xzz
ghi: yyy
desired output:
result
abc: xyz/ndef: xzz/nghi: yyy
I also wonder whether there is anything more efficient, perhaps without having to go through a temp table.

You can use the transform higher-order function; since you are on Spark 3.0.1, transform is available as a SQL expression. Then concatenate the elements of the resulting array using `concat_ws`.
from pyspark.sql import functions as F

data_row = ([
    {"l1": 0, "l2": "abc", "l3": "xyz"},
    {"l1": 1, "l2": "def", "l3": "xzz"},
    {"l1": 2, "l2": "ghi", "l3": "yyy"},
], )

df = spark.createDataFrame([data_row], "STRUCT<mycol:ARRAY<STRUCT<l1: INT, l2: STRING, l3: STRING>>>")

(df.select(F.concat_ws("/n",
                       F.expr("transform(mycol, x -> concat(x.l2, ':', x.l3))"))
           .alias("result"))
 .show(truncate=False))
"""
+-------------------------+
|result |
+-------------------------+
|abc:xyz/ndef:xzz/nghi:yyy|
+-------------------------+
"""

Related

Spark: How to convert array of objects with fields key-value into columns with keys as names

I have a column that contains array of objects as a value.
Objects have the following structure:
[
  {
    "key": "param1",
    "val": "value1"
  },
  {
    "key": "param2",
    "val": "value2"
  },
  {
    "key": "param3",
    "val": "value3"
  }
]
+----------+-------------------------------------------------------------------------------------------------+
|someColumn|colName                                                                                          |
+----------+-------------------------------------------------------------------------------------------------+
|text      |[{key: "param1", val: "value1"}, {key: "param2", val: "value2"}, {key: "param3", val: "value3"}] |
+----------+-------------------------------------------------------------------------------------------------+
When I do:
df.withColumn("exploded", explode(col("colName")))
I get
+----------+------------------------------+
|someColumn|exploded                      |
+----------+------------------------------+
|text      |{key: "param1", val: "value1"}|
|text      |{key: "param2", val: "value2"}|
|text      |{key: "param3", val: "value3"}|
+----------+------------------------------+
Then I do the following:
df.select("*", "exploded.*").drop("exploded")
I get this:
+----------+------+------+
|someColumn|key   |value |
+----------+------+------+
|text      |param1|value1|
|text      |param2|value2|
|text      |param3|value3|
+----------+------+------+
I understand why I get this result, but I need a different structure.
I want to get the following result:
+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text      |value1|value2|value3|
+----------+------+------+------+
Maybe I have to transform the array of {key, val} objects to a Map and then transform the Map into columns? What sequence of transformations do I have to do?
Once you explode your dataset, you can:
df = df.groupBy("someColumn").pivot("exploded.key").agg(first("exploded.val"))
The above statement produces:
+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text |value1|value2|value3|
+----------+------+------+------+
which is what you want!
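For reference, a self-contained PySpark sketch of this explode + pivot approach (a minimal example, with the column names taken from the question):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("text", [("param1", "value1"), ("param2", "value2"), ("param3", "value3")])],
    "someColumn STRING, colName ARRAY<STRUCT<key: STRING, val: STRING>>",
)

# explode the array, then turn the distinct keys into columns
pivoted = (df
           .withColumn("exploded", F.explode("colName"))
           .groupBy("someColumn")
           .pivot("exploded.key")
           .agg(F.first("exploded.val")))
pivoted.show(truncate=False)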
I found one more solution:
val mappedDF = df
  .select(
    $"*",
    col("ColName").getField("key").as("keys"),
    col("ColName").getField("val").as("values")
  )
  .drop("ColName")
  .select(
    $"*",
    map_from_arrays($"keys", $"values").as("ColName")
  )

val keysDF = mappedDF.select(explode(map_keys($"ColName"))).distinct()
val keys = keysDF.collect().map(f => f.get(0))
val keyCols = keys.map(f => col("ColName").getItem(f).as(f.toString))
mappedDF.select(col("*") +: keyCols: _*).drop("ColName")
This solution works faster than pivot, but I'm not sure it is the best one.
BTW, if the list of keys is known and fixed, this approach becomes even faster because we don't have to collect the list of keys from the DataFrame (see the sketch below).
I wrote more general code for the case where we need to group by several columns; in my post I simplified the example for clarity.
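As an illustration of that fixed-key variant, here is a sketch in PySpark (the key names are assumed from the example above); the collect step to discover the keys is skipped entirely:

from pyspark.sql import functions as F

known_keys = ["param1", "param2", "param3"]  # assumed to be known and fixed up front

# build a key -> val map per row, then pull each known key out as its own column
mapped = df.withColumn(
    "kv", F.map_from_arrays(F.col("colName.key"), F.col("colName.val"))
)
result = mapped.select(
    "someColumn",
    *[F.col("kv").getItem(k).alias(k) for k in known_keys]
)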

Filter out struct of null values from an array of structs in spark dataframe

I have a column in a dataframe which is an array of structs. Some structs have all null values, and I would like to filter them out. For example, with the following dataframe:
+-----------+----------------------------------------------+
|advertiser |brands                                        |
+-----------+----------------------------------------------+
|Advertiser1|[{"id" : "a", "type" : "b", "name" : "c"}]    |
|Advertiser2|[{"id" : null, "type" : null, "name" : null}] |
+-----------+----------------------------------------------+
I would like to filter out the struct with the null values to get:
+-----------+-------------------------------------------+
|advertiser |brands                                     |
+-----------+-------------------------------------------+
|Advertiser1|[{"id" : "a", "type" : "b", "name" : "c"}] |
|Advertiser2|[]                                         |
+-----------+-------------------------------------------+
I'm thinking it's something along the lines of this if I can come up with a struct of null values:
.withColumn(
  "brands",
  when(
    col("brands").equalTo(*emptyStruct?*),
    null
  )
)
You can try the to_json function: brands with all null values will return [{}].
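To make that hint concrete, here is a minimal sketch of my own (in PySpark, assuming Spark 2.4+ for the filter SQL higher-order function): to_json omits null fields by default, so an all-null struct serializes to {} and can be filtered out by comparing against that string.

from pyspark.sql import functions as F

# keep only array elements whose JSON rendering is not an empty object
df = df.withColumn(
    "brands",
    F.expr("filter(brands, b -> to_json(b) != '{}')")
)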
You want to filter the array elements. Since Spark 3.0 there is a method in org.apache.spark.sql.functions with the signature:
filter(column: Column, f: Column => Column): Column
which does this for you:
df.select(
  col("advertiser"),
  filter(col("brands"),
    b =>
      b.getField("id").isNotNull
        && b.getField("type").isNotNull
        && b.getField("name").isNotNull) as "brands")
Using Spark 1.6+ you can explode the array, filter the structs and then group by advertiser (note that with this approach an advertiser whose brands are all filtered out disappears from the result instead of keeping an empty array):
df.select(col("advertiser"), explode(col("brands")) as "b")
  .filter(
    col("b.id").isNotNull &&
    col("b.type").isNotNull &&
    col("b.name").isNotNull)
  .groupBy("advertiser").agg(collect_list("b") as "brands")

Values of a Dataframe Column into an Array in Scala Spark

Say I have the dataframe
val df1 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A3", 45, "9", 1, 450),
("A4", 26, "7", 1, 333)
)).toDF("CID","age", "children", "marketplace_id","value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code no longer preserves the order once the data is repartitioned:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way to get all the values of a column into an array without changing the order?
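The thread leaves this one unanswered; one common workaround, sketched here in PySpark for brevity (the same functions exist in the Scala API), is to capture an explicit ordering column before repartitioning and sort on it when collecting:

from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [("A1", 45, "5", 1, 90), ("A2", 60, "1", 1, 120),
     ("A3", 45, "9", 1, 450), ("A4", 26, "7", 1, 333)],
    ["CID", "age", "children", "marketplace_id", "value"],
)

# monotonically_increasing_id records the order in which Spark reads the rows,
# so capture it before any repartition/shuffle and sort on it when collecting
indexed = df1.withColumn("idx", F.monotonically_increasing_id())
partitioned = indexed.repartition("CID")

children = [r["children"] for r in partitioned.orderBy("idx").select("children").collect()]
# expected: ['5', '1', '9', '7'] (the original order)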

How to get data out of Wrapped Array in Apache Spark / Scala

I have a Dataframe with rows that look like this:
[WrappedArray(1, 5DC7F285-052B-4739-8DC3-62827014A4CD, 1, 1425450997, 714909, 1425450997, 714909, {}, 2013, GAVIN, ST LAWRENCE, M, 9)]
[WrappedArray(2, 17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF, 2, 1425450997, 714909, 1425450997, 714909, {}, 2013, LEVI, ST LAWRENCE, M, 9)]
[WrappedArray(3, 53E20DA8-8384-4EC1-A9C4-071EC2ADA701, 3, 1425450997, 714909, 1425450997, 714909, {}, 2013, LOGAN, NEW YORK, M, 44)]
...
Everything before the year (2013 in this example) is nonsense that should be dropped. I would like to map the data to a Name class that I have created and put it into a new dataframe.
How do I get to the data and do that mapping?
Here is my Name class:
case class Name(year: Int, first_name: String, county: String, sex: String, count: Int)
Basically, I would like to fill my dataframe with rows and columns according to the schema of the Name class. I know how to do this part, but I just don't know how to get to the data in the dataframe.
Assuming the data is an array of strings like this:
import spark.implicits._

val df = Seq(
  Seq("1", "5DC7F285-052B-4739-8DC3-62827014A4CD", "1", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "GAVIN", "STLAWRENCE", "M", "9"),
  Seq("2", "17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF", "2", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LEVI", "ST LAWRENCE", "M", "9"),
  Seq("3", "53E20DA8-8384-4EC1-A9C4-071EC2ADA701", "3", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LOGAN", "NEW YORK", "M", "44"))
  .toDF("array")
You could either use a UDF that returns a case class, or you can use withColumn multiple times. The latter should be more efficient and can be done like this:
import org.apache.spark.sql.types.IntegerType

val df2 = df.withColumn("year", $"array"(8).cast(IntegerType))
  .withColumn("first_name", $"array"(9))
  .withColumn("county", $"array"(10))
  .withColumn("sex", $"array"(11))
  .withColumn("count", $"array"(12).cast(IntegerType))
  .drop($"array")
  .as[Name]
This will give you a Dataset[Name]:
+----+----------+-----------+---+-----+
|year|first_name|county |sex|count|
+----+----------+-----------+---+-----+
|2013|GAVIN |STLAWRENCE |M |9 |
|2013|LEVI |ST LAWRENCE|M |9 |
|2013|LOGAN |NEW YORK |M |44 |
+----+----------+-----------+---+-----+
Hope it helped!

Adding columns in Spark dataframe based on rules

I have a dataframe df which contains the data below:
**customers**  **product**  **Val_id**
1              A            1
2              B            X
3              C
4              D            Z
I have been provided 2 rules, as below:
**rule_id**  **rule_name**  **product value**  **priority**
123          ABC            A,B                1
456          DEF            A,B,D              2
The requirement is to apply these rules on dataframe df in priority order: customers who have passed rule 1 should not be considered for rule 2, and in the final dataframe two more columns, rule_id and rule_name, should be added. I have written the code below to achieve it:
val rule_name = when(col("product").isin("A","B"), "ABC").otherwise(when(col("product").isin("A","B","D"), "DEF").otherwise(""))
val rule_id = when(col("product").isin("A","B"), "123").otherwise(when(col("product").isin("A","B","D"), "456").otherwise(""))
val df1 = df_customers.withColumn("rule_name" , rule_name).withColumn("rule_id" , rule_id)
df1.show()
The final output looks like below:
**customers**  **product**  **Val_id**  **rule_name**  **rule_id**
1              A            1           ABC            123
2              B            X           ABC            123
3              C
4              D            Z           DEF            456
Is there any better way to achieve this, adding both columns by going through the entire dataset once instead of going through it twice?
Question: Is there any better way to achieve this, adding both columns by going through the entire dataset once instead of going through it twice?
Answer: you can use a UDF with a Map return type in Scala.
Limitation: with withColumn you only get a single new column (for example ruleIDandRuleName), so the one UDF has to return a Map (or some other data type acceptable as a Spark SQL column) that carries both values; otherwise you can't use the approach shown in the example snippet below.
def ruleNameAndruleId = udf((product: String) => {
  if (Seq("A", "B").contains(product)) Map("ruleName" -> "ABC", "ruleId" -> "123")
  else if (Seq("A", "B", "D").contains(product)) Map("ruleName" -> "DEF", "ruleId" -> "456")
  else Map("ruleName" -> "", "ruleId" -> "")
})
The caller will be:
df.withColumn("ruleIDandRuleName", ruleNameAndruleId(col("product"))) // returns a map containing the rule name and rule id
An alternative to your solution would be to use udf functions. It is broadly similar to the when approach, though a udf adds serialization and deserialization overhead, so it's up to you to test which is faster and more efficient.
def rule_name = udf((product : String) => {
if(Seq("A", "B").contains(product)) "ABC"
else if(Seq("A", "B", "D").contains(product)) "DEF"
else ""
})
def rule_id = udf((product : String) => {
if(Seq("A", "B").contains(product)) "123"
else if(Seq("A", "B", "D").contains(product)) "456"
else ""
})
val df1 = df_customers.withColumn("rule_name" , rule_name(col("product"))).withColumn("rule_id" , rule_id(col("product")))
df1.show()