I have a dataframe like the following:
+-----------+-----------+---------------+------+---------------------+
|best_col |A |B | C |<many more columns> |
+-----------+-----------+---------------+------+---------------------+
| A | 14 | 26 | 32 | ... |
| C | 13 | 17 | 96 | ... |
| B | 23 | 19 | 42 | ... |
+-----------+-----------+---------------+------+---------------------+
I want to end up with a DataFrame like this:
+-----------+-----------+---------------+------+---------------------+----------+
|best_col |A |B | C |<many more columns> | result |
+-----------+-----------+---------------+------+---------------------+----------+
| A | 14 | 26 | 32 | ... | 14 |
| C | 13 | 17 | 96 | ... | 96 |
| B | 23 | 19 | 42 | ... | 19 |
+-----------+-----------+---------------+------+---------------------+----------+
Essentially, I want to add a column result that will choose the value from the column specified in the best_col column. best_col only contains column names that are present in the DataFrame. Since I have dozens of columns, I want to avoid using a bunch of when statements to check when col(best_col) === A etc. I tried doing col(col("best_col").toString()), but this didn't work. Is there an easy way to do this?
Using map_filter introduced in Spark 3.0:
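import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

val df = Seq(
  ("A", 14, 26, 32),
  ("C", 13, 17, 96),
  ("B", 23, 19, 42)
).toDF("best_col", "A", "B", "C")

// Caveat: column values become map keys, so a row where two columns hold the same
// value may fail under Spark 3's default duplicate-map-key policy.
df
  // Map each column value to a flag: "is this the column named in best_col?"
  .withColumn("result", map(df.columns.tail.flatMap(c => Seq(col(c), col("best_col") === lit(c))): _*))
  // Keep only the entry whose flag is true.
  .withColumn("result", map_filter(col("result"), (_, isBest) => isBest))
  // The single remaining key is the selected value.
  .withColumn("result", map_keys(col("result"))(0))
  .show()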
+--------+---+---+---+------+
|best_col| A| B| C|result|
+--------+---+---+---+------+
| A| 14| 26| 32| 14|
| C| 13| 17| 96| 96|
| B| 23| 19| 42| 19|
+--------+---+---+---+------+
I have a dataset that has a group, ID and target column. I am attempting to eliminate null target values by the Group column, ignoring the ID column. I'd like to do this in PySpark.
| Group | ID | Target |
| ----- | --- | -------- |
| A | B | 10 |
| A | B | 10 |
| A | B | 10 |
| A | C | null |
| A | C | null |
| A | C | null |
| B | D | null |
| B | D | null |
| B | D | null |
This is the resulting dataset I'm looking for:
| Group | ID | Target |
| ----- | --- | -------- |
| A | B | 10 |
| A | B | 10 |
| A | B | 10 |
| B | D | null |
| B | D | null |
| B | D | null |
In other words, if the group has a target value already, I don't need the values in that group that have a null target, regardless of their ID. However, I need to make sure every group has a target that is not null, so if there is a group that has only null targets, they cannot be dropped.
You can compute max(Target) per group and assign it to every row of the group. Then keep a row if the group maximum is null (the group has only null targets), or if the maximum is not null and the row's Target is also not null.
from pyspark.sql import functions as F
from pyspark.sql import Window
data = [("A", "B", 10),
("A", "B", 10),
("A", "B", 10),
("A", "C", None),
("A", "C", None),
("A", "C", None),
("B", "D", None),
("B", "D", None),
("B", "D", None),]
df = spark.createDataFrame(data, ("Group", "ID", "Target",))
window_spec = Window.partitionBy("Group")
df.withColumn("max_target", F.max("Target").over(window_spec))\
.filter((F.col("max_target").isNull()) |
(F.col("Target").isNotNull() & F.col("max_target").isNotNull()))\
.drop("max_target")\
.show()
Output
+-----+---+------+
|Group| ID|Target|
+-----+---+------+
| A| B| 10|
| A| B| 10|
| A| B| 10|
| B| D| null|
| B| D| null|
| B| D| null|
+-----+---+------+
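The filter rule can also be checked without a cluster. The following is a plain-Python sketch (not Spark code) that applies the same rule to the sample rows; `keep_rows` is a hypothetical helper name used only for illustration. It relies on the observation that a non-null Target implies a non-null group maximum, so the condition reduces to "group maximum is null, or Target is not null".

```python
def keep_rows(rows):
    """Keep a row if its group has no non-null target, or if its own target is non-null.

    rows: list of (group, id, target) tuples; target may be None.
    """
    # Per-group maximum over non-null targets; groups with only nulls get no entry.
    group_max = {}
    for group, _id, target in rows:
        if target is not None:
            group_max[group] = max(target, group_max.get(group, target))

    return [
        row for row in rows
        if group_max.get(row[0]) is None or row[2] is not None
    ]


rows = (
    [("A", "B", 10)] * 3
    + [("A", "C", None)] * 3
    + [("B", "D", None)] * 3
)
result = keep_rows(rows)
for row in result:
    print(row)
```

On the sample data this keeps the three ("A", "B", 10) rows and the three ("B", "D", None) rows, which matches the Spark output above.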
In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
Index | user_name | text | label |
0 | u1 | t0 | 0 |
1 | u1 | t1 | 1 |
2 | u2 | t2 | 0 |
3 | u1 | t3 | 1 |
4 | u2 | t4 | 0 |
5 | u2 | t5 | 1 |
6 | u2 | t6 | 1 |
7 | u1 | t7 | 0 |
8 | u1 | t8 | 1 |
9 | u1 | t9 | 0 |
The table after the new column (text_list) should be as follows, storing last K = 2 messages for each user.
Index | user_name | text | label | text_list |
0 | u1 | t0 | 0 | [] |
1 | u1 | t1 | 1 | [] |
2 | u2 | t2 | 0 | [] |
3 | u1 | t3 | 1 | [t1] |
4 | u2 | t4 | 0 | [] |
5 | u2 | t5 | 1 | [] |
6 | u2 | t6 | 1 | [t5] |
7 | u1 | t7 | 0 | [t3, t1] |
8 | u1 | t8 | 1 | [t3, t1] |
9 | u1 | t9 | 0 | [t8, t3] |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
If you are using spark version >= 2.4, there is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
1. Get a list of structs of the text and label columns over a window using collect_list.
2. Filter the array where label = 1 and take the text values, sort the array in descending order using sort_array, and take the first two elements using slice.
It would be something like this:
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window : first row to row before current row
w = Window.partitionBy('user_name').orderBy('index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
.withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
.withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
)
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks to the solution posted here. I modified it slightly (since it assumed the text field can be sorted) and was finally able to come to a working solution. Here it is:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, collect_list, slice, reverse

K = 2
# Window: from the user's first row up to the row just before the current one.
windowPast = Window.partitionBy("user_name").orderBy("Index").rowsBetween(Window.unboundedPreceding, -1)

(df
 # when() without otherwise() yields null for label != 1, and collect_list skips nulls.
 .withColumn("text_list", collect_list(when(col("label") == 1, col("text"))).over(windowPast))
 # Most recent first, then keep the first K.
 .withColumn("text_list", slice(reverse(col("text_list")), 1, K))
 .sort(col("Index"))
 .show())
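To cross-check the window-based result on a small sample, the naive per-user queue described in the question can be sketched in plain Python. This is only a reference for testing, not a scalable replacement; `last_k_labelled_texts` is a hypothetical name, and the rows are assumed to be already ordered by Index.

```python
from collections import defaultdict, deque


def last_k_labelled_texts(rows, k):
    """rows: iterable of (index, user_name, text, label), ordered by index.

    Returns each row extended with the last k texts of the same user that
    had label 1 in earlier rows, most recent first.
    """
    # One bounded queue per user; the newest text sits on the left, and
    # appendleft on a full deque drops the oldest text from the right.
    recent = defaultdict(lambda: deque(maxlen=k))
    out = []
    for index, user, text, label in rows:
        # Snapshot before adding the current row, so the row never sees itself.
        out.append((index, user, text, label, list(recent[user])))
        if label == 1:
            recent[user].appendleft(text)
    return out


sample = [
    (0, "u1", "t0", 0), (1, "u1", "t1", 1), (2, "u2", "t2", 0),
    (3, "u1", "t3", 1), (4, "u2", "t4", 0), (5, "u2", "t5", 1),
    (6, "u2", "t6", 1), (7, "u1", "t7", 0), (8, "u1", "t8", 1),
    (9, "u1", "t9", 0),
]
expected_lists = [row[4] for row in last_k_labelled_texts(sample, 2)]
print(expected_lists)
```

Comparing `expected_lists` with the collected text_list column of the Spark result is a quick way to confirm that the window frame and ordering behave as intended.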
I have a DataFrame in which I'm trying to update a cell depending on some conditions (like SQL UPDATE ... WHERE ...).
For example, let's say I have the following DataFrame:
+-------+-------+
|datas |isExist|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | O |
| AA | O |
+-------+-------+
How can I update the value to X when datas = AA and isExist = O? Here is the expected output:
+-------+-------+
|IPCOPE2|IPROPE2|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | X |
| AA | X |
+-------+-------+
I could do a filter followed by a union, but I don't think that's the best solution. I could also use when, but then I would have to create a new row containing the same values except for the isExist column. That is acceptable in this example, but what if I have 20 columns?
You can create a new column using withColumn (holding either the original or the updated value) and then drop the isExist column.
I am not sure why you do not want to use when, as it seems to be exactly what you need. When withColumn is given an existing column name, it simply replaces that column with the new value:
df.withColumn("isExist",
when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
.show()
+-----+-------+
|datas|isExist|
+-----+-------+
| AA| x|
| BB| x|
| CC| O|
| CC| O|
| DD| O|
| AA| x|
| AA| x|
| AA| X|
| AA| X|
+-----+-------+
Then you can use withColumnRenamed to change the names of your columns. (e.g. df.withColumnRenamed("datas", "IPCOPE2"))
I have a dataframe that looks more or less like this:
| id | category | value | item_id |
|----|----------|-------|---------|
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |
| 3 | 1 | 3 | 1 |
| 4 | 2 | 4 | 2 |
In this case, some of the categories must be considered sub-categories in some parts of the code (computation up to this point is similar regardless of the hierarchy, and thus they are all on the same table). However, now they must be nested according to a certain set of rules, defined in a separate dataframe:
| id | children |
|----|----------|
| 1 | [2,3] |
| 2 | null |
| 3 | null |
This nesting depends on the item column. That is, for each row, only those entries with the same item value which have a subcategory have to be nested. Which means that the categories 2 and 3 for item 1 have to be nested under the entry with id 1. If output to JSON, the result should look like this:
[{
"id": 1,
"item": 1,
"category": 1,
"value": 1,
"children": [{
"id": 2,
"item": 1,
"category": 2,
"value": 2
}, {
"id": 3,
"item": 1,
"category": 3,
"value": 3
}]
},
{
"id": 4,
"item": 2,
"category": 1,
"value": 4,
"children": []
}]
While fairly simple to implement using my own code, I would like to achieve this nesting using the PySpark Dataframe API. So far, here is what I have tried:
Join the two tables so that the list of children for a certain row is added as a column:
df_data = df_data.join(df_rules, df_data.category == df_rules.id, "left")
After some coalescing, here is the result:
| id | category | value | children |
|----|----------|-------|----------|
| 1 | 1 | 1 | [2,3] |
| 2 | 2 | 2 | [] |
| 3 | 1 | 3 | [] |
| 4 | 2 | 4 | [] |
Now, I would like to apply some sort of transformation so that I get something like this:
| id | category | value | item | children |
|----|----------|-------|------|-------------------------|
| 1 | 1 | 1 | 1 |[(2,2,2,1),(3,3,3,1)] |
| 2 | 2 | 2 | 1 |[] |
| 3 | 1 | 3 | 1 |[] |
| 4 | 2 | 4 | 1 |[] |
That is, rows with ids 2 and 3 are nested into row 1. The rest receive an empty list, as there are no matches. Afterwards the subcategories can be removed, but that is trivial to implement.
I am struggling a bit to implement this. My first thought was to use something like this:
spark.sql("SELECT *, ARRAY(SELECT * FROM my_table b WHERE b.item = a.item AND b.category IN a.children) FROM my_table a")
However, as soon as I add a SELECT statement to ARRAY it complains. I have also considered window functions or UDFs, but I am not sure how I could proceed with those, or if it is even possible.
I think I found a way to do this. Here is an MCVE using Pandas to create the data. It assumes an already initialized Spark session.
import pandas as pd
from pyspark.sql import functions as F
columns = ['id', 'category', 'value', 'item_id']
data = [(1,1,1,1), (2,2,2,2), (3,3,3,1), (4,4,4,2)]
spark_data = spark.createDataFrame(pd.DataFrame.from_records(data=data, columns=columns))
rules_columns = ['category', 'children_rules']
rules_data = [(1, [2, 3]), (2, []), (3, [])]
spark_rules_data = spark.createDataFrame(pd.DataFrame.from_records(data=rules_data, columns=rules_columns))
First, perform a left join with the rules to apply:
joined_data = spark_data.join(spark_rules_data, on="category", how="left")
joined_data = joined_data.withColumn("children_rules", F.coalesce(F.col("children_rules"), F.array()))
joined_data.createOrReplaceTempView("joined_data")
joined_data.show()
+--------+---+-----+-------+--------------+
|category| id|value|item_id|children_rules|
+--------+---+-----+-------+--------------+
| 1| 1| 1| 1| [2, 3]|
| 3| 3| 3| 1| []|
| 2| 2| 2| 2| []|
| 4| 4| 4| 2| []|
+--------+---+-----+-------+--------------+
Join the table with itself according to the rules in the children column
nested_data = spark.sql("""SELECT joined_data_1.id as id, joined_data_1.category as category, joined_data_1.value as value, joined_data_1.item_id as item_id,
STRUCT(joined_data_2.id as id, joined_data_2.category as category, joined_data_2.value as value, joined_data_2.item_id as item_id) as children
FROM joined_data AS joined_data_1 LEFT JOIN joined_data AS joined_data_2
ON array_contains(joined_data_1.children_rules, joined_data_2.category)""")
nested_data.createOrReplaceTempView("nested_data")
nested_data.show()
+---+--------+-----+-------+------------+
| id|category|value|item_id| children|
+---+--------+-----+-------+------------+
| 1| 1| 1| 1|[3, 3, 3, 1]|
| 1| 1| 1| 1|[2, 2, 2, 2]|
| 3| 3| 3| 1| [,,,]|
| 2| 2| 2| 2| [,,,]|
| 4| 4| 4| 2| [,,,]|
+---+--------+-----+-------+------------+
Group by category value and aggregate the children column into a list
grouped_data = spark.sql("SELECT category, collect_set(children) as children FROM nested_data GROUP BY category")
grouped_data.createOrReplaceTempView("grouped_data")
grouped_data.show()
+--------+--------------------+
|category| children|
+--------+--------------------+
| 1|[[2, 2, 2, 2], [3...|
| 3| [[,,,]]|
| 2| [[,,,]]|
| 4| [[,,,]]|
+--------+--------------------+
Join the grouped table with the original one
original_with_children = spark_data.join(grouped_data, on="category")
original_with_children.createOrReplaceTempView("original_with_children")
original_with_children.show()
+--------+---+-----+-------+--------------------+
|category| id|value|item_id| children|
+--------+---+-----+-------+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...|
| 3| 3| 3| 1| [[,,,]]|
| 2| 2| 2| 2| [[,,,]]|
| 4| 4| 4| 2| [[,,,]]|
+--------+---+-----+-------+--------------------+
Here is the tricky bit. We need to remove the entries in children with NULL values. I tried a CASE statement with an empty array cast to 'array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>' (this value comes from original_with_children.dtypes):
[('category', 'bigint'),
('id', 'bigint'),
('value', 'bigint'),
('item_id', 'bigint'),
('children',
'array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>')]
array_type = "array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>"
spark.sql(f"""SELECT *, CASE WHEN children[0]['category'] IS NULL THEN CAST(ARRAY() AS {array_type}) ELSE children END as no_null_children
FROM original_with_children""").show()
This raises the following exception (only relevant bit shown):
Py4JJavaError Traceback (most recent call last)
~/miniconda3/envs/sukiyaki_venv/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
~/miniconda3/envs/sukiyaki_venv/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o28.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<string> to array<struct<id:bigint,category:bigint,value:bigint,item_id:bigint>>; line 1 pos 57;
...
I could not figure out a way to create an empty array of the right type (casting does not work, as there is no conversion between the default, array of string, and an array of structures). Instead, here is my approach:
The order of the fields changes on every call, which surprisingly causes a type mismatch, so it is necessary to query the type every time:
array_type = next(value for key, value in original_with_children.dtypes if key == 'children')
empty_array_udf = F.udf(lambda : [], array_type)
aux = original_with_children.withColumn("aux", empty_array_udf())
aux.createOrReplaceTempView("aux")
There must be a better way to create an empty column with this complicated type. The UDF causes an unnecessary overhead for such a simple thing.
no_null_children = spark.sql("""SELECT *, CASE WHEN children[0]['category'] IS NULL THEN aux ELSE children END as no_null_children
FROM aux""")
no_null_children.createOrReplaceTempView("no_null_children")
no_null_children.show()
+--------+---+-----+-------+--------------------+---+--------------------+
|category| id|value|item_id| children|aux| no_null_children|
+--------+---+-----+-------+--------------------+---+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...| []|[[2, 2, 2, 2], [3...|
| 3| 3| 3| 1| [[,,,]]| []| []|
| 2| 2| 2| 2| [[,,,]]| []| []|
| 4| 4| 4| 2| [[,,,]]| []| []|
+--------+---+-----+-------+--------------------+---+--------------------+
Remove unnecessary columns:
removed_columns = no_null_children.drop("aux").drop("children").withColumnRenamed("no_null_children", "children")
removed_columns.createOrReplaceTempView("removed_columns")
Remove nested entries from the top level
nested_categories = spark.sql("""SELECT explode(children['category']) as category FROM removed_columns""")
nested_categories.createOrReplaceTempView("nested_categories")
nested_categories.show()
+--------+
|category|
+--------+
| 2|
| 3|
+--------+
result = spark.sql("SELECT * from removed_columns WHERE category NOT IN (SELECT category FROM nested_categories)")
result.show()
+--------+---+-----+-------+--------------------+
|category| id|value|item_id| children|
+--------+---+-----+-------+--------------------+
| 1| 1| 1| 1|[[2, 2, 2, 2], [3...|
| 4| 4| 4| 2| []|
+--------+---+-----+-------+--------------------+
And the final JSON result looks as desired:
result.toJSON().collect()
['{"category":1,"id":1,"value":1,"item_id":1,"children":[{"id":2,"category":2,"value":2,"item_id":2},{"id":3,"category":3,"value":3,"item_id":1}]}',
'{"category":4,"id":4,"value":4,"item_id":2,"children":[]}']
Prettified:
{
"category":1,
"id":1,
"value":1,
"item_id":1,
"children":[
{
"id":2,
"category":2,
"value":2,
"item_id":2
},
{
"id":3,
"category":3,
"value":3,
"item_id":1
}
]
}
{
"category":4,
"id":4,
"value":4,
"item_id":2,
"children":[
]
}
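For sanity-checking the Spark pipeline on small inputs, the same steps can be mirrored in plain Python. This is only a sketch under stated assumptions: `nest_by_category` is a hypothetical helper, rows are plain dicts, and, like the SQL above, children are matched on category alone (item_id is not part of the join condition).

```python
def nest_by_category(rows, rules):
    """Mirror the join / collect / filter steps on plain dicts.

    rows:  list of dicts with keys id, category, value, item_id
    rules: dict mapping a category to the list of its child categories
    """
    # Attach to every row the rows whose category is listed as its child.
    with_children = []
    for row in rows:
        wanted = rules.get(row["category"]) or []
        children = [dict(r) for r in rows if r["category"] in wanted]
        with_children.append({**row, "children": children})

    # Categories that ended up nested somewhere are dropped from the top level.
    nested = {
        child["category"]
        for row in with_children
        for child in row["children"]
    }
    return [row for row in with_children if row["category"] not in nested]


columns = ["id", "category", "value", "item_id"]
data = [(1, 1, 1, 1), (2, 2, 2, 2), (3, 3, 3, 1), (4, 4, 4, 2)]
rows = [dict(zip(columns, record)) for record in data]
rules = {1: [2, 3], 2: [], 3: []}

nested_rows = nest_by_category(rows, rules)
for row in nested_rows:
    print(row)
```

With the sample data this leaves categories 1 and 4 at the top level, with the rows for categories 2 and 3 nested under category 1, which is the same shape as the JSON above. Restricting children to the parent's item_id would need an extra condition in both the SQL join and this sketch.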