Get last n items in pyspark

For a dataset like -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use pyspark functions directly to get the last three items the user purchased in the past 5 days? I know a udf can do that, but I am wondering if any existing function can achieve this.
My expected output is like the one below; anything similar is okay too.
id last_three_item
1 [apple, peach, apple]
2 [water, banana, apple]
3 [water, water]
Thanks!

You can use pandas_udf for this.
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_last_3(x):
    # x is the pandas Series of items for one id; the earlier orderBy("timestamp")
    # is assumed to carry through, so the most recent purchases sit at the end
    return x.tail(3).to_list()

sdf \
    .orderBy("timestamp") \
    .groupBy("id") \
    .agg(pudf_get_last_3("item").alias("last_three_item")) \
    .show(truncate=False)
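If you would rather avoid a UDF entirely, built-in functions can get you there as well. Below is a minimal sketch, assuming the DataFrame is named sdf as in the answer above and using a fixed cutoff date so it works against the sample data (in a live job the cutoff would typically be f.date_sub(f.current_date(), 5)):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

cutoff = "2022-08-11"  # stand-in for "5 days ago"; an assumption for this sketch

w = Window.partitionBy("id").orderBy(f.col("timestamp").desc())

last_three = (
    sdf
    .filter(f.col("timestamp") >= cutoff)
    .withColumn("rn", f.row_number().over(w))  # 1 = most recent purchase per id
    .filter(f.col("rn") <= 3)                  # keep at most three rows per id
    .groupBy("id")
    # collect (rn, item) structs and sort them so the list order is deterministic
    .agg(f.sort_array(f.collect_list(f.struct("rn", "item"))).alias("s"))
    .select("id", f.col("s.item").alias("last_three_item"))
)

last_three.show(truncate=False)

This keeps everything in the DataFrame API: row_number over a window picks the three most recent rows per id, and collect_list gathers them into an array.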

Related

pyspark: counting the number of occurrences of each distinct value

I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a Spark dataframe whose column A has the values 1, 1, 2, 2, 1.
I want to count how many times each distinct value (in this case, 1 and 2) appears in column A, and print something like
distinct_values | number_of_appearance
1 | 3
2 | 2
I am just posting this because I think the other answer with the alias could be confusing. What you need are the groupBy and count methods:
from pyspark.sql.types import IntegerType

l = [1, 1, 2, 2, 1]
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
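The same count can also be written in Spark SQL if you prefer that style; a small sketch, assuming the dataframe above is registered under an arbitrary view name:

df.createOrReplaceTempView("values_table")
spark.sql("""
    SELECT value, COUNT(*) AS count
    FROM values_table
    GROUP BY value
""").show()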
I am not sure if you are looking for the solution below. Here are my thoughts on this; suppose you have a dataframe like this:
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show();
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know the distinct country codes that appear in this particular dataframe, printed under alias names.
import pyspark.sql.functions as func

df.groupBy('country').count() \
    .select(func.col("country").alias("distinct_country"),
            func.col("count").alias("country_count")) \
    .show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
Were you looking for something similar to this?
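If you also want the most frequent countries listed first, the aggregated result can simply be ordered before showing it; a small sketch building on the snippet above:

df.groupBy('country').count() \
    .select(func.col("country").alias("distinct_country"),
            func.col("count").alias("country_count")) \
    .orderBy(func.col("country_count").desc()) \
    .show()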

How to convert an array of dataFrames into a single data frame? [duplicate]

This question already has answers here: Spark unionAll multiple dataframes.
I have an array of data frames called "dataFrames", which looks like this:
dataFrames(0)
+----------+--------------------+---------+-------------+
|Periodo | frutas|freq |prods_qty |
+----------+--------------------+---------+-------------+
| 1|Apple, Watermelon | 1| 2|
| 1|Banana, StrawBerry | 2| 2|
+----------+--------------------+---------+-------------+
dataFrames(1)
+----------+--------------------+---------+-------------+
|Periodo | frutas|freq |prods_qty |
+----------+--------------------+---------+-------------+
| 2|Naranjas, Fresas | 7| 2|
| 2|Pineapple, Apples | 9| 2|
+----------+--------------------+---------+-------------+
Well, I need to get a single dataframe like this:
+----------+--------------------+---------+-------------+
|Periodo | frutas|freq |prods_qty |
+----------+--------------------+---------+-------------+
| 1|Apple, Watermelon | 1| 2|
| 1|Banana, StrawBerry | 2| 2|
| 2|Naranjas, Fresas | 7| 2|
| 2|Pineapple, Apples | 9| 2|
+----------+--------------------+---------+-------------+
In this example the array has two elements, but it could be any size.
Is it possible to achieve this, or do I need to store the dataframes in a Hive table?
Thanks in advance.
You can reduce a sequence or array of DataFrames together using unionAll:
val dfs = Array(df1, df2, df3)
val all = dfs.reduce(_ unionAll _)
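For reference, the same reduce pattern works in PySpark too; a minimal sketch, assuming data_frames is a Python list of DataFrames that share the same schema (on Spark 2.0+ union replaces the deprecated unionAll):

from functools import reduce
from pyspark.sql import DataFrame

# data_frames is assumed to be a Python list of DataFrames with identical schemas
combined = reduce(DataFrame.union, data_frames)
combined.show()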

Group rows that match sub string in a column using scala

I have the following df, fol:
Zip | Name | id |
abc | xyz | 1 |
def | wxz | 2 |
abc | wex | 3 |
bcl | rea | 4 |
abc | txc | 5 |
def | rfx | 6 |
abc | abc | 7 |
I need to group all the names that contain 'x' by Zip and count them, using Scala.
Desired Output:
Zip | Count |
abc | 3 |
def | 2 |
Any help is highly appreciated
As @Shaido mentioned in the comment above, all you need are filter, groupBy and an aggregation:
import org.apache.spark.sql.functions._
fol.filter(col("Name").contains("x")) //filtering the rows that has x in the Name column
.groupBy("Zip") //grouping by Zip column
.agg(count("Zip").as("Count")) //counting the rows in each groups
.show(false)
and you should have the desired output
+---+-----+
|Zip|Count|
+---+-----+
|abc|3 |
|def|2 |
+---+-----+
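For comparison, the same filter-then-count can be written in PySpark; a sketch assuming the DataFrame is again named fol:

from pyspark.sql import functions as F

fol.filter(F.col("Name").contains("x")) \
    .groupBy("Zip") \
    .agg(F.count("Zip").alias("Count")) \
    .show()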
You want to groupBy the data frame below.
+---+----+---+
|zip|name| id|
+---+----+---+
|abc| xyz| 1|
|def| wxz| 2|
|abc| wex| 3|
|bcl| rea| 4|
|abc| txc| 5|
|def| rfx| 6|
|abc| abc| 7|
+---+----+---+
Then you can simply use the groupBy function, passing the column name, followed by count, which will give you the result. (Note that this counts all rows per zip; add the filter from the answer above if you only want names containing 'x'.)
val groupedDf: DataFrame = df.groupBy("zip").count()
groupedDf.show()
// +---+-----+
// |zip|count|
// +---+-----+
// |bcl| 1|
// |abc| 4|
// |def| 2|
// +---+-----+

Merging and aggregating dataframes using Spark Scala

I have a dataset; after a transformation using Spark Scala (1.6.2), I got the following two dataframes.
DF1:
|date | country | count|
| 1872| Scotland| 1|
| 1873| England | 1|
| 1873| Scotland| 1|
| 1875| England | 1|
| 1875| Scotland| 2|
DF2:
| date| country | count|
| 1872| England | 1|
| 1873| Scotland| 1|
| 1874| England | 1|
| 1875| Scotland| 1|
| 1875| Wales | 1|
Now, from the above two dataframes, I want to aggregate the counts by date per country, like the following output. I tried using union and joining, but was not able to get the desired results.
Expected output from the two dataframes above:
| date| country | count|
| 1872| England | 1|
| 1872| Scotland| 1|
| 1873| Scotland| 2|
| 1873| England | 1|
| 1874| England | 1|
| 1875| Scotland| 3|
| 1875| Wales | 1|
| 1875| England | 1|
Kindly help me find a solution.
The best way is to perform a union and then a groupBy on the two columns; with sum you can specify which column to add up:
df1.unionAll(df2)
.groupBy("date", "country")
.sum("count")
Output:
+----+--------+----------+
|date| country|sum(count)|
+----+--------+----------+
|1872|Scotland| 1|
|1875| England| 1|
|1873| England| 1|
|1875| Wales| 1|
|1872| England| 1|
|1874| England| 1|
|1873|Scotland| 2|
|1875|Scotland| 3|
+----+--------+----------+
Using the DataFrame API, you can use a unionAll followed by a groupBy to achieve this.
import org.apache.spark.sql.functions.sum
import spark.implicits._ // for $"count"; on Spark 1.6 this would be sqlContext.implicits._

DF1.unionAll(DF2)
  .groupBy("date", "country")
  .agg(sum($"count").as("count"))
This will first put all rows from the two dataframes into a single dataframe. Then, by grouping on the date and country columns, it is possible to get the aggregate sum of the count column by date per country, as asked. The as("count") part renames the aggregated column to count.
Note: in newer Spark versions (2.0+), unionAll is deprecated and replaced by union.
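In PySpark the same union-then-aggregate pattern looks almost identical; a sketch assuming the two DataFrames are named df1 and df2:

from pyspark.sql import functions as F

df1.union(df2) \
    .groupBy("date", "country") \
    .agg(F.sum("count").alias("count")) \
    .show()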

How to append column values in Spark SQL?

I have the below table:
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
What I am trying to achieve is to concatenate the genre values where the movieId and movieName are the same, like this:
+-------+---------+---------------------------+
|movieId|movieName| genre |
+-------+---------+---------------------------+
| 1| example1| action|thriller|romance |
| 2| example2| action|fantastic |
+-------+---------+---------------------------+
Use groupBy and collect_list to get a list of all genres for each movie. Then combine these into a single string using concat_ws (if the order is important, first apply sort_array). A small example with the given sample dataframe:
val df2 = df.groupBy("movieId", "movieName")
.agg(collect_list($"genre").as("genre"))
.withColumn("genre", concat_ws("|", sort_array($"genre")))
Gives the result:
+-------+---------+-----------------------+
|movieId|movieName|genre |
+-------+---------+-----------------------+
|1 |example1 |action|thriller|romance|
|2 |example2 |action|fantastic |
+-------+---------+-----------------------+
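The same functions exist in PySpark, so the transformation translates directly; a sketch assuming the input DataFrame is named df:

from pyspark.sql import functions as F

df2 = (df.groupBy("movieId", "movieName")
         .agg(F.collect_list("genre").alias("genre"))
         .withColumn("genre", F.concat_ws("|", F.sort_array("genre"))))

df2.show(truncate=False)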