Combine text from multiple rows in pyspark

I created a PySpark dataframe using the following code
testlist = [
{"category":"A","name":"A1"},
{"category":"A","name":"A2"},
{"category":"B","name":"B1"},
{"category":"B","name":"B2"}
]
spark_df = spark.createDataFrame(testlist)
Result:
category name
A A1
A A2
B B1
B B2
I want to make it appear as follows:
category name
A A1, A2
B B1, B2
I tried the following code which does not work
spark_df.groupby('category').agg('name', lambda x:x + ', ')
Can anyone help identify what I am doing wrong and the best way to make this happen ?

One option is to use pyspark.sql.functions.collect_list() as the aggregate function.
from pyspark.sql.functions import collect_list
grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name"))
This will collect the values for name into a list and the resultant output will look like:
grouped_df.show()
#+---------+---------+
#|category |name |
#+---------+---------+
#|A |[A1, A2] |
#|B |[B1, B2] |
#+---------+---------+
Update 2019-06-10:
If you wanted your output as a concatenated string, you can use pyspark.sql.functions.concat_ws to concatenate the values of the collected list, which will be better than using a udf:
from pyspark.sql.functions import concat_ws
grouped_df.withColumn("name", concat_ws(", ", "name")).show()
#+---------+-------+
#|category |name |
#+---------+-------+
#|A |A1, A2 |
#|B |B1, B2 |
#+---------+-------+
Original Answer: If you wanted your output as a concatenated string, you'd have to use a udf. For example, you can first do the groupBy() as above and then apply a udf to join the collected list:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
concat_list = udf(lambda lst: ", ".join(lst), StringType())
grouped_df.withColumn("name", concat_list("name")).show()
#+---------+-------+
#|category |name |
#+---------+-------+
#|A |A1, A2 |
#|B |B1, B2 |
#+---------+-------+
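For anyone unsure what groupby + collect_list + concat_ws actually computes, here is a minimal plain-Python sketch of the same logic. It does not use Spark at all; `group_concat` and `rows` are hypothetical names made up for illustration, and unlike Spark (where the order of collected values is not guaranteed after a shuffle) this version keeps input order.

```python
from collections import defaultdict

# Same data as the question, as plain (category, name) tuples.
rows = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")]

def group_concat(rows, sep=", "):
    """Collect names per category, then join each list into one string."""
    grouped = defaultdict(list)
    for category, name in rows:
        grouped[category].append(name)   # the collect_list step
    # the concat_ws step: join each collected list with the separator
    return {category: sep.join(names) for category, names in grouped.items()}

result = group_concat(rows)
print(result)
```

In Spark the two steps can also be written in a single agg call, as the Scala answer further down does with concat_ws around collect_list.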

UNIQUE values
If you want unique values, use collect_set instead of collect_list:
from pyspark.sql.functions import collect_set
grouped_df = spark_df.groupby('category').agg(collect_set('name').alias("unique_name"))
grouped_df.show(5)

Another option is this
>>> df.rdd.reduceByKey(lambda x,y: x+','+y).toDF().show()
+---+-----+
| _1| _2|
+---+-----+
| A|A1,A2|
| B|B1,B2|
+---+-----+
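To make the reduceByKey idea a bit more concrete: it folds the values of each key together two at a time with the given function. Below is a rough plain-Python model of that behaviour (no Spark involved). `reduce_by_key` and `pairs` are hypothetical names, and the model processes values in input order, which a real cluster does not promise.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# (key, value) pairs, mirroring the rows of the dataframe.
pairs = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")]

def reduce_by_key(pairs, func):
    """Group pairs by key, then fold each group's values with func."""
    # sorted() is stable, so values keep their input order within a key
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

reduced = reduce_by_key(pairs, lambda x, y: x + "," + y)
print(reduced)
```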

Related

pyspark, get rows where first column value equals id and second column value is between two values, do this for each row in a dataframe

So I have one pyspark dataframe like so, let's call it dataframe a:
+-------------------+---------------+----------------+
| reg| val1| val2 |
+-------------------+---------------+----------------+
| N110WA| 1590030660| 1590038340000|
| N876LF| 1590037200| 1590038880000|
| N135MH| 1590039060| 1590040080000|
And another like this, let's call it dataframe b:
+-----+-------------+-----+-----+---------+----------+---+----+
| reg| postime| alt| galt| lat| long|spd| vsi|
+-----+-------------+-----+-----+---------+----------+---+----+
|XY679|1590070078549| 50| 130|18.567169|-69.986343|132|1152|
|HI949|1590070091707| 375| 455| 18.5594|-69.987804|148|1344|
|JX784|1590070110666| 825| 905|18.544968|-69.990414|170|1216|
Is there some way to create a numpy array or pyspark dataframe where, for each row in dataframe a, all the rows in dataframe b with the same reg and a postime between val1 and val2 are included?
You can try the solution below. Let us know if it works or if anything else is expected.
I have modified the inputs a little in order to showcase the working solution.
Input here
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',1590030660,1590038340000), ('N110WA',1590070078549,1590070078559)],[ "reg","val1","val2"])
df_b = spark.createDataFrame([('N110WA',1590070078549)],[ "reg","postime"])
df_a.show()
df_a
+------+-------------+-------------+
| reg| val1| val2|
+------+-------------+-------------+
|N110WA| 1590030660|1590038340000|
|N110WA|1590070078549|1590070078559|
+------+-------------+-------------+
df_b
+------+-------------+
| reg| postime|
+------+-------------+
|N110WA|1590070078549|
+------+-------------+
Solution here
from pyspark.sql import types as T
from pyspark.sql import functions as F
# join first so that postime is available next to val1 and val2
df_a = df_a.join(df_b,'reg','left')
df_a = df_a.withColumn('condition_col', F.when(((F.col('postime') >= F.col('val1')) & (F.col('postime') <= F.col('val2'))),'1').otherwise('0'))
df_a = df_a.filter(F.col('condition_col') == '1').drop('condition_col')
df_a.show()
Final Output
+------+-------------+-------------+-------------+
| reg| val1| val2| postime|
+------+-------------+-------------+-------------+
|N110WA|1590070078549|1590070078559|1590070078549|
+------+-------------+-------------+-------------+
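Yes, assuming df_a and df_b are both pyspark dataframes, you can use an inner join with the range condition in the join expression:
delta = 0  # optional tolerance added on both ends of the range; 0 gives the exact range from the question
df = df_a.join(df_b, [
    df_a.reg == df_b.reg,
    df_b.postime >= df_a.val1 - delta,
    df_b.postime <= df_a.val2 + delta
], "inner")
This keeps only the pairs of rows with the same reg whose postime falls between val1 and val2.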
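As a sanity check on the condition itself (same reg, postime between val1 and val2 inclusive), here is a small plain-Python sketch that applies it to the sample rows used in the first answer. It is only a model of the join logic, not Spark code; `range_join`, `rows_a` and `rows_b` are hypothetical names.

```python
# Sample rows from the first answer: (reg, val1, val2) and (reg, postime).
rows_a = [
    ("N110WA", 1590030660, 1590038340000),
    ("N110WA", 1590070078549, 1590070078559),
]
rows_b = [("N110WA", 1590070078549)]

def range_join(rows_a, rows_b):
    """Pair each row of a with every row of b that has the same reg
    and a postime inside [val1, val2]."""
    return [
        (reg_a, val1, val2, postime)
        for reg_a, val1, val2 in rows_a
        for reg_b, postime in rows_b
        if reg_a == reg_b and val1 <= postime <= val2
    ]

matches = range_join(rows_a, rows_b)
print(matches)
```

Only the second row of a matches, which agrees with the Final Output table shown above.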

Merge repeated values into a field in a Dataframe with scala and spark

I have a DF like following :
ID hier
1 Z1
1 Z2
2 Z1
2 Z2
and the output required is a DF like the next :
ID hier
1 Z1,Z2
2 Z1,Z2
I know about fold and reduce, but I am not clear on how to use them in this case.
Fold and Reduce are functional methods. Working with DataFrames provides a relational algebra to express your transformations. You should consider using the collect_list built-in function for your question :
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,"Z1"),
(1,"Z2"),
(2,"Z1"),
(2,"Z2")
).toDF("ID", "hier")
df.groupBy($"ID").agg(collect_list($"hier").as("hier"))
.show(false)
+---+------------------+
|ID |hier |
+---+------------------+
|2 |[Z1, Z2] |
|1 |[Z1, Z2] |
+---+------------------+
And if you want a String, you can use this transformation instead :
df.groupBy($"ID").agg(concat_ws(",",collect_list($"hier")).as("hier"))
+---+-----+
|ID |hier |
+---+-----+
|2 |Z1,Z2|
|1 |Z1,Z2|
+---+-----+

Pyspark - groupby concat string columns by order

I have a dataframe with the following columns - User, Order, Food.
For example:
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
I would like to concatenate all of the foods into a single string, sorted by Order and grouped per User.
If I run the following:
df.groupBy("User").agg(concat_ws(" $ ",collect_list("Food")).alias("Food List"))
I get a single list but the foods are not concatenated in order.
User Food List
B Salad
C Bread
A Eggs $ Water $ Peaches
What is a good way to get the food list concatenated in order?
Try using a window here:
Build the DataFrame
import pandas as pd
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import mean, pandas_udf, PandasUDFType
from pyspark.sql.types import *
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
df.show()
+----+-----+-------+
|User|Order| Food|
+----+-----+-------+
| A| 1| Eggs|
| B| 1| Salad|
| A| 2|Peaches|
| C| 1| Bread|
| A| 3| Water|
+----+-----+-------+
Create window and apply a udf to join the strings:
w = Window.partitionBy('User').orderBy('Order').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
@pandas_udf(StringType(), PandasUDFType.GROUPED_AGG)
def _udf(v):
    # v is a pandas Series holding the Food values of the window, already ordered by Order
    return ' $ '.join(v)
df = df.withColumn('Food List', _udf(df['Food']).over(w)).dropDuplicates(['User', 'Food List']).drop(*['Order', 'Food'])
df.show(truncate=False)
+----+----------------------+
|User|Food List |
+----+----------------------+
|B |Salad |
|C |Bread |
|A |Eggs $ Peaches $ Water|
+----+----------------------+
Based on the possible duplicate comment - collect_list by preserving order based on another variable, I was able to come up with a solution.
First define a sorter function. This takes a struct, sorts by order and then returns the list of items in a string format separated by ' $ '
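# define udf
from pyspark.sql.functions import udf, collect_list, struct
from pyspark.sql.types import StringType
def sorter(l):
    res = sorted(l, key=lambda x: x.Order)
    return ' $ '.join([item[1] for item in res])
sort_udf = udf(sorter,StringType())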
Then create the struct and run the sorter function:
SortedFoodList = (df.groupBy("User")
.agg(collect_list(struct("Order","Food")).alias("food_list"))
.withColumn("sorted_foods",sort_udf("food_list"))
.drop("food_list")
)
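The sorter logic can be tried out without a Spark session. The sketch below uses a namedtuple as a hypothetical stand-in for the (Order, Food) struct rows that collect_list(struct(...)) hands to the udf, so take it as an illustration of the sorting step only.

```python
from collections import namedtuple

# Hypothetical stand-in for the struct rows produced by struct("Order", "Food").
Item = namedtuple("Item", ["Order", "Food"])

def sorter(items):
    """Sort the collected items by Order and join the foods with ' $ '."""
    res = sorted(items, key=lambda x: x.Order)
    return " $ ".join(item.Food for item in res)

# User A's items, deliberately out of order as collect_list may return them.
food_list = [Item(3, "Water"), Item(1, "Eggs"), Item(2, "Peaches")]
sorted_foods = sorter(food_list)
print(sorted_foods)
```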

spark scala cartesian product of each element in a column

I have a dataframe which is like :
df:
col1 col2
a [p1,p2,p3]
b [p1,p4]
Desired output is that:
df_out:
col1 col2 col3
p1 p2 a
p1 p3 a
p2 p3 a
p1 p4 b
I did some research and I think that converting the df to an rdd and then using flatMap with a cartesian product is ideal for the problem. However, I could not combine them.
Thanks,
It looks like you are trying to do combination rather than cartesian. Please check my understanding.
This is in PySpark but the only python thing is the UDF, the rest is just DataFrame operations.
The process is:
Create the dataframe
Define a UDF to get all pairs of combinations, ignoring order
Use the UDF to convert the array into an array of structs, one struct per pair
Explode the results to get one row per pair
Select each struct field and the original column 1 into the desired result columns
from itertools import combinations
from pyspark.sql import functions as F
df = spark.createDataFrame([
("a", ["p1", "p2", "p3"]),
("b", ["p1", "p4"])
],
["col1", "col2"]
)
# define a udf that takes an array and returns an array of struct of two strings
@F.udf("array<struct<_1: string, _2: string>>")
def combinations_list(x):
    # materialize the iterator so Spark receives a list of pairs
    return list(combinations(x, 2))
resultDf = df.select("col1", F.explode(combinations_list(df.col2)).alias("combos"))
resultDf.selectExpr("combos._1 as col1", "combos._2 as col2", "col1 as col3").show()
Result:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| p1| p2| a|
| p1| p3| a|
| p2| p3| a|
| p1| p4| b|
+----+----+----+
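To illustrate the difference between combinations and a cartesian product mentioned at the top of this answer, here is a short plain-Python comparison using only itertools. The variable names are made up for the example.

```python
from itertools import combinations, product

values = ["p1", "p2", "p3"]

# Unordered pairs without repetition: what the desired output contains.
pairs = list(combinations(values, 2))

# Full cartesian product of the list with itself: includes (p1, p1),
# and both (p1, p2) and (p2, p1).
cartesian = list(product(values, repeat=2))

print(pairs)
print(len(cartesian))
```

The three pairs match the rows for col3 = a in the result above, while the cartesian product would give nine rows.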

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a Dataframe in pyspark? I am using python 3.6 with spark 2.2.1. I have just started learning the spark environment and my data looks like below
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a Dataframe as follows
---------------------------------
|ID | words |
---------------------------------
1 | ['apple','ball','ballon'] |
2 | ['cat','camel','james'] |
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
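Note that the question shows IDs starting at 1, while enumerate starts at 0 by default. If that matters, enumerate takes a start argument. A minimal sketch of just the list-building step, in plain Python (`records` is a hypothetical name):

```python
my_data = [['apple', 'ball', 'ballon'], ['cat', 'camel', 'james'], ['none', 'focus', 'cake']]

# Build (ID, words) tuples with IDs starting at 1 instead of 0.
records = [(i, words) for i, words in enumerate(my_data, start=1)]

for record in records:
    print(record)
```

A list like this should be usable with spark.createDataFrame(records, ["ID", "words"]) in the same way as the Row-based version.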
Try this -
data_array = []
for i in range (0,len(my_data)) :
    data_array.extend([(i, my_data[i])])
df = spark.createDataFrame(data = data_array, schema = ["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
# zipWithIndex yields (element, index) pairs, so the words column comes first
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+