I have a dataframe with the following columns - User, Order, Food.
For example:
df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
I would like to concatenate all of the foods into a single string, sorted by order and grouped per user.
If I run the following:
df.groupBy("User").agg(concat_ws(" $ ",collect_list("Food")).alias("Food List"))
I get a single list but the foods are not concatenated in order.
User  Food List
B     Salad
C     Bread
A     Eggs $ Water $ Peaches
What is a good way to get the food list concatenated in order?
Try using a window here:
Build the DataFrame
import pandas as pd
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(pd.DataFrame([['A','B','A','C','A'],[1,1,2,1,3],['Eggs','Salad','Peaches','Bread','Water']],index=['User','Order','Food']).T)
df.show()
+----+-----+-------+
|User|Order| Food|
+----+-----+-------+
| A| 1| Eggs|
| B| 1| Salad|
| A| 2|Peaches|
| C| 1| Bread|
| A| 3| Water|
+----+-----+-------+
Create a window and apply a pandas UDF to join the strings:
w = Window.partitionBy('User').orderBy('Order').rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

@pandas_udf(StringType(), PandasUDFType.GROUPED_AGG)
def _udf(v):
    return ' $ '.join(v)

df = df.withColumn('Food List', _udf(df['Food']).over(w)).dropDuplicates(['User', 'Food List']).drop('Order', 'Food')
df.show(truncate=False)
+----+----------------------+
|User|Food List |
+----+----------------------+
|B |Salad |
|C |Bread |
|A |Eggs $ Peaches $ Water|
+----+----------------------+
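A pandas UDF is not strictly required here: collect_list evaluated over the same ordered window w already collects the foods in the window's order, so concat_ws can be applied to the result directly. A minimal sketch starting from the original df (df_alt is just a name I use for the result):

df_alt = (df.withColumn('Food List', F.concat_ws(' $ ', F.collect_list('Food').over(w)))
            .dropDuplicates(['User', 'Food List'])
            .drop('Order', 'Food'))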
Based on the possible duplicate comment - collect_list by preserving order based on another variable, I was able to come up with a solution.
First, define a sorter function. It takes the collected list of structs, sorts it by Order, and returns the Food values joined into a single string separated by ' $ ':
from pyspark.sql.functions import udf, collect_list, struct
from pyspark.sql.types import StringType

# define udf
def sorter(l):
    res = sorted(l, key=lambda x: x.Order)
    return ' $ '.join([item[1] for item in res])

sort_udf = udf(sorter, StringType())
Then collect the (Order, Food) structs per user and run the sorter function on the collected list:
SortedFoodList = (df.groupBy("User")
.agg(collect_list(struct("Order","Food")).alias("food_list"))
.withColumn("sorted_foods",sort_udf("food_list"))
.drop("food_list")
)
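On Spark 2.4+, the same result can also be obtained with built-in functions only, avoiding the Python UDF. A minimal sketch, assuming Order sorts correctly as stored (no_udf is just a name I use here):

from pyspark.sql import functions as F

no_udf = (df.groupBy("User")
            .agg(F.sort_array(F.collect_list(F.struct("Order", "Food"))).alias("food_list"))
            .withColumn("sorted_foods",
                        F.array_join(F.expr("transform(food_list, x -> x.Food)"), " $ "))
            .drop("food_list"))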
Incrementally derive an ID from a name column; on the next load, if new values appear in that name column, assign them new IDs that are not already assigned in the previous data.
Example - first load:
Name
a
b
b
a
Result:

ID  Name
1   a
2   b
2   b
1   a
Next load:
Name
a
b
b
a
c
d
c
Result:

ID  Name
1   a
2   b
2   b
1   a
3   c
4   d
3   c
As described in the question, I am looking for a solution in PySpark.
You can create an additional dataframe df_map where you store your IDs between loads. If needed, this dataframe can be saved to and restored from disk.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1 = spark.createDataFrame(
    data=[['a'], ['b'], ['b'], ['a']],
    schema=["name"]
)
df2 = spark.createDataFrame(
    data=[['a'], ['b'], ['b'], ['a'], ['c'], ['d'], ['c'], ['0']],
    schema=["name"]
)

w = Window.orderBy('name')

# create an empty map
df_map = spark.createDataFrame([], schema='name string, id int')
df_map.show()
# get additional name->id map for df1
n = df_map.select(F.count('id').alias('n')).collect()[0].n
df_map = df1.subtract(df_map.select('name')).withColumn('id', F.row_number().over(w) + F.lit(n)).union(df_map)
df_map.show()
# map can be saved to disk between runs
# get additional name->id map for df2
n = df_map.select(F.count('id').alias('n')).collect()[0].n
df_map = df2.subtract(df_map.select('name')).withColumn('id', F.row_number().over(w) + F.lit(n)).union(df_map)
df_map.show()
# join to get the final dataframe
df2.join(df_map, on='name').show()
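As mentioned above, df_map can be persisted between runs; a minimal sketch (the parquet path is hypothetical):

# save the map at the end of a run (hypothetical path)
df_map.write.mode("overwrite").parquet("/path/to/df_map")

# restore it at the start of the next run
df_map = spark.read.parquet("/path/to/df_map")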
You can use a window and dense_rank. The code below orders the dataframe by the 'name' column and gives each unique name an incremental unique ID.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

window = W.orderBy('name')

(
    df
    .withColumn('id', F.dense_rank().over(window))
).show()
+----+---+
|name| id|
+----+---+
| a| 1|
| a| 1|
| b| 2|
| b| 2|
| c| 3|
| c| 3|
| d| 4|
+----+---+
I'm new to PySpark and am trying to transform some data.
Given dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:

A     B     C
id1a  id1b  id1c
id2a  id2b  id2c
id3a  id3b  id3c
id4a  null  null
I have tried pivot, but that only gives the first value.
There might be a better way, however one approach is to split the column on spaces to create an array of the entries and then use higher-order functions (Spark 2.4+) to split each entry on '='. Then explode and create two columns, one with the key and one with the value. Then assign a row number within each partition, group by it and pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum", F.row_number().over(w)).groupBy("Rnum")
            .pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
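Note that w orders by the same column it partitions by, so the pairing of rows into each Rnum group relies on the incidental order of the exploded data. A sketch of a more deterministic variant, carrying the original row id and the position within the row as the ordering key (df1b, w2 and final2 are names I introduce):

from pyspark.sql import functions as F, Window

df1b = (df.withColumn("line", F.monotonically_increasing_id())
          .select("line", F.posexplode(F.expr("transform(split(Col1, ' '), x -> split(x, '='))")))
          .select("line", "pos", F.col("col")[0].alias("cols"), F.col("col")[1].alias("vals")))
w2 = Window.partitionBy("cols").orderBy("line", "pos")
final2 = (df1b.withColumn("Rnum", F.row_number().over(w2))
              .groupBy("Rnum").pivot("cols").agg(F.first("vals"))
              .orderBy("Rnum"))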
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list('value')).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+
I am comparing a condition in a PySpark join in my application by using the substring function. This function is returning a Column type instead of a value.
substring(trim(coalesce(df.col1)), 13, 3) returns
Column<b'substring(trim(coalesce(col1), 13, 3)'>
I tried with expr but am still getting the same Column type result:
expr("substring(trim(coalesce(df.col1)),length(trim(coalesce(df.col1))) - 2, 3)")
I want to compare the values coming from the substring to the value of another dataframe column. Both are of string type.
pyspark:
substring(trim(coalesce(df.col1)), length(trim(coalesce(df.col1))) -2, 3) == df2["col2"]
Let's say col1 = 'abcdefghijklmno'.
The expected output of the substring function should be 'mno' based on the above definition.
Creating sample dataframes to join:
list1 = [('ABC','abcdefghijklmno'),('XYZ','abcdefghijklmno'),('DEF','abcdefghijklabc')]
df1=spark.createDataFrame(list1, ['col1', 'col2'])
list2 = [(1,'mno'),(2,'mno'),(3,'abc')]
df2=spark.createDataFrame(list2, ['col1', 'col2'])
import pyspark.sql.functions as f
Creating a substring condition that reads the last three characters:
cond=f.substring(df1['col2'], -3, 3)==df2['col2']
newdf=df1.join(df2,cond)
>>> newdf.show()
+----+---------------+----+----+
|col1| col2|col1|col2|
+----+---------------+----+----+
| ABC|abcdefghijklmno| 1| mno|
| ABC|abcdefghijklmno| 2| mno|
| XYZ|abcdefghijklmno| 1| mno|
| XYZ|abcdefghijklmno| 2| mno|
| DEF|abcdefghijklabc| 3| abc|
+----+---------------+----+----+
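If you prefer to keep the length(trim(...)) expression from the question, one way is to compute it as a helper column on df1 before the join. A minimal sketch (df1_with_key and last3 are names I introduce):

df1_with_key = df1.withColumn("last3", f.expr("substring(trim(col2), length(trim(col2)) - 2, 3)"))
newdf2 = df1_with_key.join(df2, df1_with_key["last3"] == df2["col2"]).drop("last3")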
Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
---------------------------------
|ID | words                     |
---------------------------------
|1  | ['apple','ball','ballon'] |
|2  | ['cat','camel','james']   |
|3  | ['none','focus','cake']   |
---------------------------------
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
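If the IDs should start at 1 as in the expected output, enumerate accepts a start value:

# start the IDs at 1 instead of 0
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data, 1)]).show()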
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))

df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from datetime import datetime
from pyspark.sql import Row

utc = datetime.utcnow()  # example value; use your own timestamp
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple approach:
my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+--+
|words                |id|
+---------------------+--+
|[apple, ball, ballon]|0 |
|[cat, camel, james]  |1 |
|[none, focus, cake]  |2 |
+---------------------+--+
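To match the column order and 1-based IDs asked for in the question, the pair can be swapped before calling toDF (a sketch):

spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .map(lambda x: (x[1] + 1, x[0])) \
    .toDF(["ID", "words"]).show(truncate=False)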
I am very new to Scala and Spark.
I have read a text file into a dataframe and successfully split the single column into columns (essentially the file is a space-delimited CSV):
val irisDF:DataFrame = spark.read.csv("src/test/resources/iris-in.txt")
irisDF.show()
val dfnew:DataFrame = irisDF.withColumn("_tmp", split($"_c0", " ")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3"),
$"_tmp".getItem(3).as("col4")
).drop("_tmp")
This works.
But what if I do not know how many columns there are in the data file? How do I dynamically generate the columns depending on the number of items produced by the split function?
You can create a sequence of select expressions, and then apply all of them to the select method with the :_* syntax:
Example Data:
val df = Seq("a b c d", "e f g").toDF("c0")
df.show
+-------+
| c0|
+-------+
|a b c d|
| e f g|
+-------+
If you want five columns from the c0 column (the number of columns needs to be determined before doing this):
val selectExprs = 0 until 5 map (i => $"temp".getItem(i).as(s"col$i"))
df.withColumn("temp", split($"c0", " ")).select(selectExprs:_*).show
+----+----+----+----+----+
|col0|col1|col2|col3|col4|
+----+----+----+----+----+
| a| b| c| d|null|
| e| f| g|null|null|
+----+----+----+----+----+