I try to group by values of metric which can be I or M and make a summe based on the values of x. The result should be stored in each row of its respective value. Normally I make it in R with group and then ungroup but I dont know the equivalent in PysSpark. Any advice?
from pyspark.sql.functions import *
from pyspark.sql.functions import last
from pyspark.sql.functions import arrays_zip
from pyspark.sql.types import *
data = [["1", "Amit", "DU", "I", "8", "6"],
["2", "Mohit", "DU", "I", "4", "2"],
["3", "rohith", "BHU", "I", "5", "3"],
["4", "sridevi", "LPU", "I", "1", "6"],
["1", "sravan", "KLMP", "M", "2", "4"],
["5", "gnanesh", "IIT", "M", "6", "8"],
["6", "gnadesh", "KLM", "M","0", "9"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
dataframe = dataframe.withColumn("x",dataframe.x.cast(DoubleType()))
This is how the data looks like
+---+-------+-------+------+----+---+
| ID| NAME|college|metric| x| y|
+---+-------+-------+------+----+---+
| 1| Amit| DU| I| 8| 6|
| 2| Mohit| DU| I| 4| 2|
| 3| rohith| BHU| I| 5| 3|
| 4|sridevi| LPU| I| 1| 6|
| 1| sravan| KLMP| M| 2| 4|
| 5|gnanesh| IIT| M| 6| 8|
| 6|gnadesh| KLM| M|0 | 9|
+---+-------+-------+------+----+---+
Expected output
+---+-------+-------+------+----+---+------+
| ID| NAME|college|metric| x| y| total|
+---+-------+-------+------+----+---+------+
| 1| Amit| DU| I| 8| 6| 18 |
| 2| Mohit| DU| I| 4| 2| 18 |
| 3| rohith| BHU| I| 5| 3| 18 |
| 4|sridevi| LPU| I| 1| 6| 18 |
| 1| sravan| KLMP| M| 2| 4| 8 |
| 5|gnanesh| IIT| M| 6| 8| 8 |
| 6|gnadesh| KLM| M| 0| 9| 8 |
+---+-------+-------+------+----+---+------+
I tried this but it does not work
dataframe.withColumn("total",dataframe.groupBy("metric").sum("x"))
You can do groupby on data and calculate the total value and then join the grouped dataframe with original data
metric_sum_df = dataframe.groupby('metric').agg(F.sum('x').alias('total'))
total_df = dataframe.join(metric_sum_df, 'metric')
I have the following data,
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
this could be generated by code:
val df = sc.parallelize(Array(
(1, 20, "aaa"),
(1, 5, "ggg"),
(2, 3, "ccc"),
(1, 20, "ppp"),
(1, 5, "ddd"),
(2, 20, "eee"),
(2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
groupby user_id,time and order by time and rank the group, thanks~
To rank the rows you can use dense_rank window function and the order can be achieved by final orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank}
val w = Window.partitionBy("user_id").orderBy("user_id", "time")
val result = df
.withColumn("order_id", dense_rank().over(w))
.orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order in the item column is not given
I have a dataframe that looks more or less like this:
| id | category | value | item_id |
|----|----------|-------|---------|
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |
| 3 | 1 | 3 | 1 |
| 4 | 2 | 4 | 2 |
In this case, some of the categories must be considered sub-categories in some parts of the code (computation up to this point is similar regardless of the hierarchy, and thus they are all on the same table). However, now they must be nested according to a certain set of rules, defined in a separate dataframe:
| id | children |
|----|----------|
| 1 | [2,3] |
| 2 | null |
| 3 | null |
This nesting depends on the item column. That is, for each row, only those entries with the same item value which have a subcategory have to be nested. Which means that the categories 2 and 3 for item 1 have to be nested under the entry with id 1. If output to JSON, the result should look like this:
[{
"id": 1,
"item": 1,
"category": 1,
"value": 1,
"children": [{
"id": 2,
"item": 1,
"category": 2,
"value": 2
}, {
"id": 3,
"item": 1,
"category": 3,
"value": 3
}]
},
{
"id": 4,
"item": 2,
"category": 1,
"value": 4,
"children": []
}]
While fairly simple to implement using my own code, I would like to achieve this nesting using the PySpark Dataframe API. So far, here is what I have tried:
Join the two tables so that the list of children for a certain row is added as a column:
df_data = df_data.join(df_rules, df_data.category == df_rules.id, "left")
After some coalescing, here is the result:
| id | category | value | children |
|----|----------|-------|----------|
| 1 | 1 | 1 | [2,3] |
| 2 | 2 | 2 | [] |
| 3 | 1 | 3 | [] |
| 4 | 2 | 4 | [] |
Now, I would like to apply some sort of transformation so that I get something like this:
| id | category | value | item | children |
|----|----------|-------|------|-------------------------|
| 1 | 1 | 1 | 1 |[(2,2,2,1),(3,3,3,1)] |
| 2 | 2 | 2 | 1 |[] |
| 3 | 1 | 3 | 1 |[] |
| 4 | 2 | 4 | 1 |[] |
That is, rows with ids 2 and 3 are nested into row 1. The rest receive an empty list, as there are no matches. Afterwards the subcategories can be removed, but that is trivial to implement.
I am struggling a bit to implement this. My first thought was to use something like this:
spark.sql("SELECT *, ARRAY(SELECT * FROM my_table b WHERE b.item = a.item AND b.category IN a.children) FROM my_table a")
However, as soon as I add a SELECT statement to ARRAY it complains. I have also considered window functions or UDFs, but I am not sure how I could proceed with those, or if it is even possible.
I think found a way to do this. Here is an MCVE using Pandas to create data. It assumes an already initialized Spark session
import pandas as pd
from pyspark.sql import functions as F
columns = ['id', 'category', 'value', 'item_id']
data = [(1,1,1,1), (2,2,2,2), (3,3,3,1), (4,4,4,2)]
spark_data = spark.createDataFrame(pd.DataFrame.from_records(data=data, columns=columns))
rules_columns = ['category', 'children_rules']
rules_data = [(1, [2, 3]), (2, []), (3, [])]
spark_rules_data = spark.createDataFrame(pd.DataFrame.from_records(data=rules_data, columns=rules_columns))
First, perform a left join with the rules to apply:
joined_data = spark_data.join(spark_rules_data, on="category", how="left")
joined_data = joined_data.withColumn("children_rules", F.coalesce(F.col("children_rules"), F.array()))
joined_data.createOrReplaceTempView("joined_data")
joined_data.show()
+--------+---+-----+-------+--------------+
|category| id|value|item_id|children_rules|
+--------+---+-----+-------+--------------+
| 1| 1| 1| 1| [2, 3]|
| 3| 3| 3| 1| []|
| 2| 2| 2| 2| []|
| 4| 4| 4| 2| []|
+--------+---+-----+-------+--------------+
Join the table with itself according to the rules in the children column
nested_data = spark.sql("""SELECT joined_data_1.id as id, joined_data_1.category as category, joined_data_1.value as value, joined_data_1.item_id as item_id,
STRUCT(joined_data_2.id as id, joined_data_2.category as category, joined_data_2.value as value, joined_data_2.item_id as item_id) as children
FROM joined_data AS joined_data_1 LEFT JOIN joined_data AS joined_data_2
ON array_contains(joined_data_1.children_rules, joined_data_2.category)""")
nested_data.createOrReplaceTempView("nested_data")
nested_data.show()
+---+--------+-----+-------+------------+
| id|category|value|item_id| children|
+---+--------+-----+-------+------------+
| 1| 1| 1| 1|[3, 3, 3, 1]|
| 1| 1| 1| 1|[2, 2, 2, 2]|
| 3| 3| 3| 1| [,,,]|
| 2| 2| 2| 2| [,,,]|
| 4| 4| 4| 2| [,,,]|
+---+--------+-----+-------+------------+
Group by category value and aggregate the children column into a list
grouped_data = spark.sql("SELECT category, collect_set(children) as children FROM nested_data GROUP BY category")
grouped_data.createOrReplaceTempView("grouped_data")
grouped_data.show()
+--------+--------------------+
|category| children|
+--------+--------------------+
| 1|[[2, 2, 2, 2], [3...|
| 3| [[,,,]]|
| 2| [[,,,]]|
| 4| [[,,,]]|
+--------+--------------------+
Join the grouped table with the original one
original_with_children = spark_data.join(grouped_data, on="category")
original_with_children.createOrReplaceTempView("original_with_children")
original_with_children.show()
+--------+---+-----+-------+--------------------+
|category| id|value|item_id| children|
+--------+---+-----+-------+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...|
| 3| 3| 3| 1| [[,,,]]|
| 2| 2| 2| 2| [[,,,]]|
| 4| 4| 4| 2| [[,,,]]|
+--------+---+-----+-------+--------------------+
Here is the tricky bit. We need to remove the entries in children with NULL values. I tried to do a CASE statement with an empty array, casting to 'array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>> (this value comes from original_with_children.dtypes:
[('category', 'bigint'),
('id', 'bigint'),
('value', 'bigint'),
('item_id', 'bigint'),
('children',
'array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>')]
array_type = "array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>"
spark.sql(f"""SELECT *, CASE WHEN children[0]['category'] IS NULL THEN CAST(ARRAY() AS {array_type}) ELSE children END as no_null_children
FROM original_with_children""").show()
This raises the following exception (only relevant bit shown):
Py4JJavaError Traceback (most recent call last)
~/miniconda3/envs/sukiyaki_venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/miniconda3/envs/sukiyaki_venv/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o28.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<string> to array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>; line 1 pos 57;
...
I could not figure out a way to create an empty array of the right type (casting does not work, as there is no conversion between the default, array of string, and an array of structures). Instead, here is my approach:
The order of the fields changes on every call and that surprisingly causes a type mismatch, so it is necessar to query it every time
array_type = next(value for key, value in original_with_children.dtypes if key == 'children')
empty_array_udf = F.udf(lambda : [], array_type)
aux = original_with_children.withColumn("aux", empty_array_udf())
aux.createOrReplaceTempView("aux")
There must be a better way to create an empty column with this complicated type. The UDF causes an unnecessary overhead for such a simple thing.
no_null_children = spark.sql("""SELECT *, CASE WHEN children[0]['category'] IS NULL THEN aux ELSE children END as no_null_children
FROM aux""")
no_null_children.createOrReplaceTempView("no_null_children")
no_null_children.show()
+--------+---+-----+-------+--------------------+---+--------------------+
|category| id|value|item_id| children|aux| no_null_children|
+--------+---+-----+-------+--------------------+---+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...| []|[[2, 2, 2, 2], [3...|
| 3| 3| 3| 1| [[,,,]]| []| []|
| 2| 2| 2| 2| [[,,,]]| []| []|
| 4| 4| 4| 2| [[,,,]]| []| []|
+--------+---+-----+-------+--------------------+---+--------------------+
Remove unnecessary columns:
result = no_null_children.drop("aux").drop("children").withColumnRenamed("no_null_children", "children")
Remove nested entries from the top level
nested_categories = spark.sql("""SELECT explode(children['category']) as category FROM removed_columns""")
nested_categories.createOrReplaceTempView("nested_categories")
nested_categories.show()
+--------+
|category|
+--------+
| 2|
| 3|
+--------+
result = spark.sql("SELECT * from removed_columns WHERE category NOT IN (SELECT category FROM nested_categories)")
result.show()
+--------+---+-----+-------+--------------------+
|category| id|value|item_id| children|
+--------+---+-----+-------+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...|
| 4| 4| 4| 2| []|
+--------+---+-----+-------+--------------------+
And the final JSON result looks as desired:
result.toJSON().collect()
['{"category":1,"id":1,"value":1,"item_id":1,"children":[{"id":2,"category":2,"value":2,"item_id":2},{"id":3,"category":3,"value":3,"item_id":1}]}',
'{"category":4,"id":4,"value":4,"item_id":2,"children":[]}']
Prettified:
{
"category":1,
"id":1,
"value":1,
"item_id":1,
"children":[
{
"id":2,
"category":2,
"value":2,
"item_id":2
},
{
"id":3,
"category":3,
"value":3,
"item_id":1
}
]
}
{
"category":4,
"id":4,
"value":4,
"item_id":2,
"children":[
]
}
I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B :
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply "rules" of dataframe B on dataframe A in order to get this result :
dataframe A' :
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried someting like that :
dfB.collect.foreach( r =>
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
)
But I'm sure it's not the right way.
What are the strategy to solve this problem in Spark ?
Working under assumption that the set of rules is reasonably small (possible concerns are the size of the data and the size of generated expression, which in the worst case scenario, can crash the planner) the simplest solution is to use local collection and map it to a SQL expression:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)
df.withColumn("PRIO", prio)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
With larger set of rules you could:
melt data to convert to a long format.
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
join by column and value.
Aggregate PRIOR by id with appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id")
.agg(min("PRIO").alias("PRIO"))
Outer join the output with df by id.
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
lets assume rules of dataframeB is limited
I have created dataframe "df" for below table
+---+-------+------+
| id|COUNTRY|MONTH|
+---+-------+------+
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
+---+-------+------+
By using UDF
val code = udf{(x:String,y:Int)=>if(x=="US") "5" else if (x=="FR") "15" else if (y==3) "2" else "20"}
df.withColumn("PRIO",code($"COUNTRY",$"MONTH")).show()
output
+---+-------+------+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+------+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+------+----+