Nest rows of dataframe as an array column - pyspark

I have a dataframe that looks more or less like this:
| id | category | value | item_id |
|----|----------|-------|---------|
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |
| 3 | 1 | 3 | 1 |
| 4 | 2 | 4 | 2 |
In this case, some of the categories must be treated as sub-categories in later parts of the code (the computation up to this point is the same regardless of the hierarchy, which is why they all live in the same table). However, now they must be nested according to a set of rules defined in a separate dataframe:
| id | children |
|----|----------|
| 1 | [2,3] |
| 2 | null |
| 3 | null |
This nesting depends on the item column. That is, for each row, only entries with the same item value whose category is listed as a child of that row's category have to be nested. This means that categories 2 and 3 for item 1 have to be nested under the entry with id 1. If output to JSON, the result should look like this:
[{
    "id": 1,
    "item": 1,
    "category": 1,
    "value": 1,
    "children": [{
        "id": 2,
        "item": 1,
        "category": 2,
        "value": 2
    }, {
        "id": 3,
        "item": 1,
        "category": 3,
        "value": 3
    }]
},
{
    "id": 4,
    "item": 2,
    "category": 1,
    "value": 4,
    "children": []
}]
While fairly simple to implement using my own code, I would like to achieve this nesting using the PySpark Dataframe API. So far, here is what I have tried:
Join the two tables so that the list of children for a certain row is added as a column:
df_data = df_data.join(df_rules, df_data.category == df_rules.id, "left")
After some coalescing, here is the result:
| id | category | value | children |
|----|----------|-------|----------|
| 1 | 1 | 1 | [2,3] |
| 2 | 2 | 2 | [] |
| 3 | 1 | 3 | [] |
| 4 | 2 | 4 | [] |
Now, I would like to apply some sort of transformation so that I get something like this:
| id | category | value | item | children |
|----|----------|-------|------|-------------------------|
| 1 | 1 | 1 | 1 |[(2,2,2,1),(3,3,3,1)] |
| 2 | 2 | 2 | 1 |[] |
| 3 | 1 | 3 | 1 |[] |
| 4 | 2 | 4 | 1 |[] |
That is, rows with ids 2 and 3 are nested into row 1. The rest receive an empty list, as there are no matches. Afterwards the subcategories can be removed, but that is trivial to implement.
I am struggling a bit to implement this. My first thought was to use something like this:
spark.sql("SELECT *, ARRAY(SELECT * FROM my_table b WHERE b.item = a.item AND b.category IN a.children) FROM my_table a")
However, as soon as I add a SELECT statement to ARRAY it complains. I have also considered window functions or UDFs, but I am not sure how I could proceed with those, or if it is even possible.

I think I found a way to do this. Here is an MCVE that uses Pandas to create the data. It assumes an already initialized Spark session:
import pandas as pd
from pyspark.sql import functions as F
columns = ['id', 'category', 'value', 'item_id']
data = [(1,1,1,1), (2,2,2,2), (3,3,3,1), (4,4,4,2)]
spark_data = spark.createDataFrame(pd.DataFrame.from_records(data=data, columns=columns))
rules_columns = ['category', 'children_rules']
rules_data = [(1, [2, 3]), (2, []), (3, [])]
spark_rules_data = spark.createDataFrame(pd.DataFrame.from_records(data=rules_data, columns=rules_columns))
First, perform a left join with the rules to apply:
joined_data = spark_data.join(spark_rules_data, on="category", how="left")
joined_data = joined_data.withColumn("children_rules", F.coalesce(F.col("children_rules"), F.array()))
joined_data.createOrReplaceTempView("joined_data")
joined_data.show()
+--------+---+-----+-------+--------------+
|category| id|value|item_id|children_rules|
+--------+---+-----+-------+--------------+
| 1| 1| 1| 1| [2, 3]|
| 3| 3| 3| 1| []|
| 2| 2| 2| 2| []|
| 4| 4| 4| 2| []|
+--------+---+-----+-------+--------------+
Next, join the table with itself according to the rules in the children_rules column:
nested_data = spark.sql("""SELECT joined_data_1.id as id, joined_data_1.category as category, joined_data_1.value as value, joined_data_1.item_id as item_id,
STRUCT(joined_data_2.id as id, joined_data_2.category as category, joined_data_2.value as value, joined_data_2.item_id as item_id) as children
FROM joined_data AS joined_data_1 LEFT JOIN joined_data AS joined_data_2
ON array_contains(joined_data_1.children_rules, joined_data_2.category)""")
nested_data.createOrReplaceTempView("nested_data")
nested_data.show()
+---+--------+-----+-------+------------+
| id|category|value|item_id| children|
+---+--------+-----+-------+------------+
| 1| 1| 1| 1|[3, 3, 3, 1]|
| 1| 1| 1| 1|[2, 2, 2, 2]|
| 3| 3| 3| 1| [,,,]|
| 2| 2| 2| 2| [,,,]|
| 4| 4| 4| 2| [,,,]|
+---+--------+-----+-------+------------+
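Note that this self-join only applies the children rules. If the item constraint from the question also has to be enforced (children must belong to the same item as their parent), an extra join condition would do it; a sketch of what that might look like (the variable name nested_data_by_item is just for illustration):
nested_data_by_item = spark.sql("""SELECT joined_data_1.id as id, joined_data_1.category as category, joined_data_1.value as value, joined_data_1.item_id as item_id,
STRUCT(joined_data_2.id as id, joined_data_2.category as category, joined_data_2.value as value, joined_data_2.item_id as item_id) as children
FROM joined_data AS joined_data_1 LEFT JOIN joined_data AS joined_data_2
ON array_contains(joined_data_1.children_rules, joined_data_2.category)
AND joined_data_1.item_id = joined_data_2.item_id""")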
Group by the category value and aggregate the children column into a list:
grouped_data = spark.sql("SELECT category, collect_set(children) as children FROM nested_data GROUP BY category")
grouped_data.createOrReplaceTempView("grouped_data")
grouped_data.show()
+--------+--------------------+
|category| children|
+--------+--------------------+
| 1|[[2, 2, 2, 2], [3...|
| 3| [[,,,]]|
| 2| [[,,,]]|
| 4| [[,,,]]|
+--------+--------------------+
Join the grouped table with the original one:
original_with_children = spark_data.join(grouped_data, on="category")
original_with_children.createOrReplaceTempView("original_with_children")
original_with_children.show()
+--------+---+-----+-------+--------------------+
|category| id|value|item_id| children|
+--------+---+-----+-------+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...|
| 3| 3| 3| 1| [[,,,]]|
| 2| 2| 2| 2| [[,,,]]|
| 4| 4| 4| 2| [[,,,]]|
+--------+---+-----+-------+--------------------+
Here is the tricky bit: we need to remove the entries in children that contain only NULL values. I tried a CASE statement with an empty array, cast to array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>> (this type comes from original_with_children.dtypes):
[('category', 'bigint'),
('id', 'bigint'),
('value', 'bigint'),
('item_id', 'bigint'),
('children',
'array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>')]
array_type = "array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>"
spark.sql(f"""SELECT *, CASE WHEN children[0]['category'] IS NULL THEN CAST(ARRAY() AS {array_type}) ELSE children END as no_null_children
FROM original_with_children""").show()
This raises the following exception (only relevant bit shown):
Py4JJavaError Traceback (most recent call last)
~/miniconda3/envs/sukiyaki_venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/miniconda3/envs/sukiyaki_venv/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o28.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<string> to array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>; line 1 pos 57;
...
I could not figure out a way to create an empty array of the right type (casting does not work, as there is no conversion between the default array of strings and an array of structs). Instead, here is my approach.
The order of the fields changes on every call and that, surprisingly, causes a type mismatch, so it is necessary to query it every time:
array_type = next(value for key, value in original_with_children.dtypes if key == 'children')
empty_array_udf = F.udf(lambda : [], array_type)
aux = original_with_children.withColumn("aux", empty_array_udf())
aux.createOrReplaceTempView("aux")
There must be a better way to create an empty column with this complicated type. The UDF causes an unnecessary overhead for such a simple thing.
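One possible alternative, assuming Spark 2.4 or later is available, is to drop the all-null placeholder structs directly with the filter higher-order function, which avoids both the UDF and the typed empty array (an untested sketch; no_null_children_alt is just an illustrative name):
no_null_children_alt = original_with_children.withColumn("children", F.expr("filter(children, c -> c.id IS NOT NULL)"))
Sticking with the UDF-based column here, the CASE expression looks like this: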
no_null_children = spark.sql("""SELECT *, CASE WHEN children[0]['category'] IS NULL THEN aux ELSE children END as no_null_children
FROM aux""")
no_null_children.createOrReplaceTempView("no_null_children")
no_null_children.show()
+--------+---+-----+-------+--------------------+---+--------------------+
|category| id|value|item_id| children|aux| no_null_children|
+--------+---+-----+-------+--------------------+---+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...| []|[[2, 2, 2, 2], [3...|
| 3| 3| 3| 1| [[,,,]]| []| []|
| 2| 2| 2| 2| [[,,,]]| []| []|
| 4| 4| 4| 2| [[,,,]]| []| []|
+--------+---+-----+-------+--------------------+---+--------------------+
Remove unnecessary columns:
removed_columns = no_null_children.drop("aux").drop("children").withColumnRenamed("no_null_children", "children")
removed_columns.createOrReplaceTempView("removed_columns")
Remove the nested entries from the top level:
nested_categories = spark.sql("""SELECT explode(children['category']) as category FROM removed_columns""")
nested_categories.createOrReplaceTempView("nested_categories")
nested_categories.show()
+--------+
|category|
+--------+
| 2|
| 3|
+--------+
result = spark.sql("SELECT * from removed_columns WHERE category NOT IN (SELECT category FROM nested_categories)")
result.show()
+--------+---+-----+-------+--------------------+
|category| id|value|item_id| children|
+--------+---+-----+-------+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...|
| 4| 4| 4| 2| []|
+--------+---+-----+-------+--------------------+
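As a side note, the NOT IN subquery can also be expressed as a left anti join with the DataFrame API, which sidesteps the null-handling quirks of NOT IN (a sketch that should be equivalent for the data shown here):
result = removed_columns.join(nested_categories, on="category", how="left_anti")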
And the final JSON result looks as desired:
result.toJSON().collect()
['{"category":1,"id":1,"value":1,"item_id":1,"children":[{"id":2,"category":2,"value":2,"item_id":2},{"id":3,"category":3,"value":3,"item_id":1}]}',
'{"category":4,"id":4,"value":4,"item_id":2,"children":[]}']
Prettified:
{
    "category": 1,
    "id": 1,
    "value": 1,
    "item_id": 1,
    "children": [
        {
            "id": 2,
            "category": 2,
            "value": 2,
            "item_id": 2
        },
        {
            "id": 3,
            "category": 3,
            "value": 3,
            "item_id": 1
        }
    ]
}
{
    "category": 4,
    "id": 4,
    "value": 4,
    "item_id": 2,
    "children": []
}
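If the nested result needs to be persisted rather than collected to the driver, writing the dataframe as JSON yields the same documents, one per line (the output path is just a placeholder):
result.write.mode("overwrite").json("/path/to/nested_result")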

Related

Spark dataframe groupby and order group?

I have the following data,
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
this could be generated by code:
val df = sc.parallelize(Array(
(1, 20, "aaa"),
(1, 5, "ggg"),
(2, 3, "ccc"),
(1, 20, "ppp"),
(1, 5, "ddd"),
(2, 20, "eee"),
(2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
Group by user_id and time, order by time, and rank within each group. Thanks!
To rank the rows you can use the dense_rank window function, and the ordering can be achieved with a final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank}
val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
.withColumn("order_id", dense_rank().over(w))
.orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the ordering within the item column is not guaranteed.
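For reference, since most of this page is about PySpark, the same approach in Python could look roughly like this (a sketch, assuming the same data is loaded into a PySpark dataframe named df):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("time")
result = df.withColumn("order_id", F.dense_rank().over(w)).orderBy("user_id", "time")
result.show()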

Sum over another column returning 'col should be column error'

I'm trying to add a new column that shows the sum of a double column (thing_to_sum) for the respective ID in the id column. However, this is currently throwing the error 'col should be Column'.
df = df.withColumn('sum_column', (df.groupBy('id').agg({'thing_to_sum': 'sum'})))
Example Data Set:
| id | thing_to_sum | sum_column |
|----|--------------|------------|
| 1 | 5 | 7 |
| 1 | 2 | 7 |
| 2 | 4 | 4 |
Any help on this would be greatly appreciated.
Also any reference on the most efficient way to do this would also be appreciated.
You can register any DataFrame as a temporary table to query it via SQLContext.sql.
myValues = [(1,5),(1,2),(2,4),(2,3),(2,1)]
df = sqlContext.createDataFrame(myValues,['id','thing_to_sum'])
df.show()
+---+------------+
| id|thing_to_sum|
+---+------------+
| 1| 5|
| 1| 2|
| 2| 4|
| 2| 3|
| 2| 1|
+---+------------+
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, thing_to_sum, sum(thing_to_sum) over (partition by id) as sum_column from table_view'
)
df1.show()
+---+------------+----------+
| id|thing_to_sum|sum_column|
+---+------------+----------+
| 1| 5| 7|
| 1| 2| 7|
| 2| 4| 8|
| 2| 3| 8|
| 2| 1| 8|
+---+------------+----------+
I think I found the solution to my own question, but advice would still be appreciated:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
sum_calc = F.sum(df.thing_to_sum).over(Window.partitionBy("id"))
df = df.withColumn("sum_column", sum_calc)
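The original groupBy idea can also be made to work by aggregating into a separate dataframe and joining it back, in case window functions are not an option (a sketch using the same df; sums and df_with_sum are just illustrative names):
from pyspark.sql import functions as F
sums = df.groupBy("id").agg(F.sum("thing_to_sum").alias("sum_column"))
df_with_sum = df.join(sums, on="id", how="left")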

PySpark Aggregation and Group By

I have seen multiple posts, but in those the aggregation is done over multiple columns. I want the aggregation based on the column OPTION_CD, with the following condition:
I have conditions attached to the dataframe query, which is giving me the error 'DataFrame' object has no attribute '_get_object_id'.
IF NULL(STRING AGG(OPTION_CD,'' order by OPTION_CD),'').
What I understand from this is that if the OPTION_CD column is null, a blank should be placed; otherwise the OPTION_CD values should be appended into one row, separated by a blank. Following is the sample table:
First there is filtering to get only 1 and 2 from COL1, then the result should be like this:
Following is the query that I am writing against my dataframe:
df_result = df.filter((df.COL1 == 1)|(df.COL1 == 2)).select(df.COL1,df.COL2,(when(df.OPTION_CD == "NULL", " ").otherwise(df.groupBy(df.OPTION_CD))).agg(
collect_list(df.OPTION_CD)))
But I am not getting the desired results. Can anyone help with this? I am using PySpark.
Your question is not expressed very clearly, but I will try to answer it.
You need to understand that a dataframe column can have only one data type for all its rows. If your initial data are integers, then you cannot check for equality with the empty string; you have to check against null values instead.
Also, collect_list returns an array of integers, so you cannot have [5, 7] in one row and '' in another. In any case, does this work for you?
from pyspark.sql.functions import col, collect_list
listOfTuples = [(1, 3, 1),(2, 3, 2),(1, 4, 5),(1, 4, 7),(5, 5, 8),(4, 1, 3),(2,4,None)]
df = spark.createDataFrame(listOfTuples , ["A", "B", "option"])
df.show()
>>>
+---+---+------+
| A| B|option|
+---+---+------+
| 1| 3| 1|
| 2| 3| 2|
| 1| 4| 5|
| 1| 4| 7|
| 5| 5| 8|
| 4| 1| 3|
| 2| 4| null|
+---+---+------+
dfFinal = df.filter((df.A == 1)|(df.A == 2)).groupby(['A','B']).agg(collect_list(df['option']))
dfFinal.show()
>>>
+---+---+--------------------+
| A| B|collect_list(option)|
+---+---+--------------------+
| 1| 3| [1]|
| 1| 4| [5, 7]|
| 2| 3| [2]|
| 2| 4| []|
+---+---+--------------------+
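If a blank-separated string (as in the STRING_AGG pseudo-code) is what is actually wanted instead of an array, one option is concat_ws over the collected values, cast to strings first (a sketch under that assumption; dfString is an illustrative name):
from pyspark.sql.functions import collect_list, concat_ws
dfString = df.filter((df.A == 1) | (df.A == 2)).groupby(['A', 'B']).agg(concat_ws(' ', collect_list(df['option'].cast('string'))).alias('options'))
dfString.show()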

Skip the current row COUNT and sum up the other COUNTS for current key with Spark Dataframe

My input:
val df = sc.parallelize(Seq(
("0","car1", "success"),
("0","car1", "success"),
("0","car3", "success"),
("0","car2", "success"),
("1","car1", "success"),
("1","car2", "success"),
("0","car3", "success")
)).toDF("id", "item", "status")
My intermediary group by output looks like this:
val df2 = df.groupBy("id", "item").agg(count("item").alias("occurences"))
+---+----+----------+
| id|item|occurences|
+---+----+----------+
| 0|car3| 2|
| 0|car2| 1|
| 0|car1| 2|
| 1|car2| 1|
| 1|car1| 1|
+---+----+----------+
The output I would like is:
Calculate the sum of occurrences of the other items for the same id, skipping the occurrences value of the current item.
For example, in the output table below for id "0", car3 appeared 2 times, car2 appeared 1 time and car1 appeared 2 times.
So for id "0", the sum of other occurrences for its "car3" item would be car2 (1) + car1 (2) = 3.
For the same id "0", the sum of other occurrences for its "car2" item would be car3 (2) + car1 (2) = 4.
This continues for the rest. Sample output:
+---+----+----------+----------------------+
| id|item|occurences| other_occurences_sum |
+---+----+----------+----------------------+
| 0|car3| 2| 3 |<- (car2+car1) for id 0
| 0|car2| 1| 4 |<- (car3+car1) for id 0
| 0|car1| 2| 3 |<- (car3+car2) for id 0
| 1|car2| 1| 1 |<- (car1) for id 1
| 1|car1| 1| 1 |<- (car2) for id 1
+---+----+----------+----------------------+
That's a perfect target for a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val w = Window.partitionBy("id")
df2.withColumn(
"other_occurences_sum", sum($"occurences").over(w) - $"occurences"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
where sum($"occurences").over(w) is a sum of all occurrences for the current id. Of course join is also valid:
df2.join(
df2.groupBy("id").agg(sum($"occurences") as "total"), Seq("id")
).select(
$"*", ($"total" - $"occurences") as "other_occurences_sum"
).show
// +---+----+----------+--------------------+
// | id|item|occurences|other_occurences_sum|
// +---+----+----------+--------------------+
// | 0|car3| 2| 3|
// | 0|car2| 1| 4|
// | 0|car1| 2| 3|
// | 1|car2| 1| 1|
// | 1|car1| 1| 1|
// +---+----+----------+--------------------+
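A PySpark version of the window-function variant, for comparison (a sketch, assuming the aggregated data is available as a PySpark dataframe named df2):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("id")
df2.withColumn("other_occurences_sum", F.sum("occurences").over(w) - F.col("occurences")).show()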

Spark - How to apply rules defined in a dataframe to another dataframe

I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B :
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply "rules" of dataframe B on dataframe A in order to get this result :
dataframe A' :
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried something like this:
dfB.collect.foreach { r =>
  var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
  dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
}
But I'm sure it's not the right way.
What is the right strategy to solve this kind of problem in Spark?
Working under the assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of the generated expression, which in the worst-case scenario can crash the planner), the simplest solution is to collect the rules locally and map them to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)
df.withColumn("PRIO", prio)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
With a larger set of rules you could:
Melt the data to convert it to a long format:
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
Join by column and value, then aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id")
.agg(min("PRIO").alias("PRIO"))
Outer join the output with df by id:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
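A PySpark sketch of the same collect-and-coalesce idea, assuming the rules fit comfortably in driver memory and that df and rules refer to equivalent PySpark dataframes:
from pyspark.sql import functions as F
rules_rows = rules.collect()  # small rule set collected to the driver
prio = F.coalesce(*[F.when(F.col(r["COLUMN"]) == r["VALUE"], F.lit(r["PRIO"])) for r in rules_rows], F.lit(20))  # 20 is the default when no rule matches
df.withColumn("PRIO", prio).show()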
Let's assume the rules in dataframe B are limited in number. I have created dataframe "df" for the table below:
+---+-------+------+
| id|COUNTRY|MONTH|
+---+-------+------+
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
+---+-------+------+
By using a UDF:
val code = udf{(x:String,y:Int)=>if(x=="US") "5" else if (x=="FR") "15" else if (y==3) "2" else "20"}
df.withColumn("PRIO",code($"COUNTRY",$"MONTH")).show()
Output:
+---+-------+------+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+------+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+------+----+