How to limit and partition data in a PySpark DataFrame - pyspark

I have the below data:
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|restaurant_id| restaurant_name| city|state|postal_code| stars|review_count|cuisine_name|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
| 62112| Neptune Oyster| Boston| MA| 02113|4.500000000000000000| 5115| American|
| 62112| Neptune Oyster| Boston| MA| 02113|4.500000000000000000| 5115| Thai|
| 60154|Giacomo's Ristora...| Boston| MA| 02113|4.000000000000000000| 3520| Italian|
| 61455|Atlantic Fish Com...| Boston| MA| 02116|4.000000000000000000| 2575| American|
| 57757| Top of the Hub| Boston| MA| 02199|3.500000000000000000| 2273| American|
| 58631| Carmelina's| Boston| MA| 02113|4.500000000000000000| 2250| Italian|
| 58895| The Beehive| Boston| MA| 02116|3.500000000000000000| 2184| American|
| 56517|Lolita Cocina & T...| Boston| MA| 02116|4.000000000000000000| 2179| American|
| 56517|Lolita Cocina & T...| Boston| MA| 02116|4.000000000000000000| 2179| Mexican|
| 58440| Toro| Boston| MA| 02118|4.000000000000000000| 2175| Spanish|
| 58615| Regina Pizzeria| Boston| MA| 02113|4.000000000000000000| 2071| Italian|
| 58723| Gaslight| Boston| MA| 02118|4.000000000000000000| 2056| American|
| 58723| Gaslight| Boston| MA| 02118|4.000000000000000000| 2056| French|
| 60920| Modern Pastry Shop| Boston| MA| 02113|4.000000000000000000| 2042| Italian|
| 59453|Gourmet Dumpling ...| Boston| MA| 02111|3.500000000000000000| 1990| Taiwanese|
| 59453|Gourmet Dumpling ...| Boston| MA| 02111|3.500000000000000000| 1990| Chinese|
| 59204|Russell House Tavern|Cambridge| MA| 02138|4.000000000000000000| 1965| American|
| 60732|Eastern Standard ...| Boston| MA| 02215|4.000000000000000000| 1890| American|
| 60732|Eastern Standard ...| Boston| MA| 02215|4.000000000000000000| 1890| French|
| 56970| Border Café|Cambridge| MA| 02138|4.000000000000000000| 1880| Mexican|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
I want to partition the data by city, state and cuisine, order it by stars and review count, and finally limit the number of records per partition.
Can this be done with PySpark?

You can add a row_number over each window and filter on it to limit the number of records per window. The maximum number of rows per window is controlled by the max_number_of_rows_per_partition variable in the code below.
Since your question did not specify how you want stars and review_count ordered, I have assumed both to be descending.
import pyspark.sql.functions as F
from pyspark.sql import Window

# One window per (city, state, cuisine), ordered by stars and review count descending.
window_spec = Window.partitionBy("city", "state", "cuisine_name") \
    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

# Number the rows within each window and keep only the first few per window.
df.withColumn("row_number", F.row_number().over(window_spec)) \
    .filter(F.col("row_number") <= max_number_of_rows_per_partition) \
    .drop("row_number") \
    .show(200, False)
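As a quick sanity check (assuming the filtered result above is assigned to a DataFrame, called limited_df here purely for illustration), you can count the rows per group and confirm that no group exceeds the limit:
# limited_df is a hypothetical name for the result of the row_number filter above.
limited_df.groupBy("city", "state", "cuisine_name") \
    .count() \
    .filter(F.col("count") > max_number_of_rows_per_partition) \
    .show()  # an empty result means every group respects the limit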

Related

pyspark dataframe retrieve the first value in each sequence within an ordered column

Edited with a new example for clarity.
The following data:
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 375| 12:25:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 650| 12:29:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
should yield the output below, where each row is compared to the previous row's 'ID' and 'location' columns. I need to log each time an ID was logged at a different location. An ID can visit the same location again later in the sequence, therefore dropDuplicates on ID and location isn't possible.
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
A Window partitioned by ID and ordered by loggedTime can be used to get the location from the previous row with lag. The rows where the current and previous location are the same can then be filtered out:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("ID").orderBy("loggedTime")

df.withColumn("prev_location", F.lag("location").over(w)) \
    .filter("prev_location is null or location <> prev_location") \
    .drop("prev_location") \
    .show()
Output:
+---+--------+-------------------+
| ID|location| loggedTime|
+---+--------+-------------------+
| 67| 312|1970-01-01 00:09:00|
| 67| 375|1970-01-01 00:23:00|
| 67| 650|1970-01-01 00:26:00|
| 75| 650|1970-01-01 00:27:00|
| 75| 800|1970-01-01 00:30:00|
+---+--------+-------------------+
How about using group by?
from pyspark.sql.functions import col, min
df = df.groupBy(col("ID"), col("location")).agg(min(col("loggedTime")).alias("loggedTime"))
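As noted in the question, though, an ID can visit the same location again later in the sequence, and a plain group by on ID and location would merge that later revisit into the earlier visit, whereas the lag-based filter above keeps it. A minimal sketch of the group-by variant for comparison (assuming the same df as above):
from pyspark.sql import functions as F
# Keeps only one row per (ID, location) pair, so a later revisit of the same location is lost.
df.groupBy("ID", "location") \
    .agg(F.min("loggedTime").alias("loggedTime")) \
    .orderBy("ID", "loggedTime") \
    .show()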

Sliding window over a period of weeks in Spark

I have a dataset:
+---------------+-----------+---------+--------+
| Country | Timezone |Year_Week|MinUsers|
+---------------+-----------+---------+--------+
|Germany |1.0 |2019-01 |4322 |
|Germany |1.0 |2019-02 |4634 |
|Germany |1.0 |2019-03 |5073 |
|Germany |1.0 |2019-04 |4757 |
|Germany |1.0 |2019-05 |5831 |
|Germany |1.0 |2019-06 |5026 |
|Germany |1.0 |2019-07 |5038 |
|Germany |1.0 |2019-08 |5005 |
|Germany |1.0 |2019-09 |5766 |
|Germany |1.0 |2019-10 |5204 |
|Germany |1.0 |2019-11 |5240 |
|Germany |1.0 |2019-12 |5306 |
|Germany |1.0 |2019-13 |5381 |
|Germany |1.0 |2019-14 |5659 |
|Germany |1.0 |2019-15 |5518 |
|Germany |1.0 |2019-16 |6666 |
|Germany |1.0 |2019-17 |5594 |
|Germany |1.0 |2019-18 |5395 |
|Germany |1.0 |2019-19 |5482 |
|Germany |1.0 |2019-20 |5582 |
|Germany |1.0 |2019-21 |5492 |
|Germany |1.0 |2019-22 |5889 |
|Germany |1.0 |2019-23 |6514 |
|Germany |1.0 |2019-24 |5112 |
|Germany |1.0 |2019-25 |4795 |
|Germany |1.0 |2019-26 |4673 |
|Germany |1.0 |2019-27 |5330 |
+---------------+-----------+---------+--------+
I want to slide over the dataset with a window of 25 weeks and calculate the average of MinUsers over that period. The final result should look like this:
+---------------+-----------+---------+-------------+
| Country | Timezone |Year_Week|Avg(MinUsers)|
+---------------+-----------+---------+-------------+
|Germany |1.0 |2019-25 |6006.12 |
|Germany |1.0 |2019-26 |2343.16 |
|Germany |1.0 |2019-27 |8464.2 |
+---------------+-----------+---------+-------------+
*The Avg(MinUsers) values are dummy numbers.
I want the average per country, per timezone, per Year_Week:
df
.groupBy("Country", "Timezone", "Year_Week")
.agg(min("NumUserPer4Hour").alias("MinUsers"))
.withColumn("Avg", avg("MinUsers").over(Window.partitionBy("Country", "Timezone").rowsBetween(-25, 0).orderBy("Year_Week")))
.orderBy("Country", "Year_Week")
I'm not sure how to add the partition information there. I tried a tumbling window as well, but it didn't work well.
It would be great if someone could help in this regard.
This can be solved with a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._  // for toDF and the $ column syntax (imported automatically in spark-shell)
val df = Seq(("Germany",1.0,"2019-01",4322),
("Germany",1.0,"2019-02",4634),
("Germany",1.0,"2019-03",5073),
("Germany",1.0,"2019-04",4757),
("Germany",1.0,"2019-05",5831),
("Germany",1.0,"2019-06",5026),
("Germany",1.0,"2019-07",5038),
("Germany",1.0,"2019-08",5005),
("Germany",1.0,"2019-09",5766),
("Germany",1.0,"2019-10",5204),
("Germany",1.0,"2019-11",5240),
("Germany",1.0,"2019-12",5306),
("Germany",1.0,"2019-13",5381),
("Germany",1.0,"2019-14",5659),
("Germany",1.0,"2019-15",5518),
("Germany",1.0,"2019-16",6666),
("Germany",1.0,"2019-17",5594),
("Germany",1.0,"2019-18",5395),
("Germany",1.0,"2019-19",5482),
("Germany",1.0,"2019-20",5582),
("Germany",1.0,"2019-21",5492),
("Germany",1.0,"2019-22",5889),
("Germany",1.0,"2019-23",6514),
("Germany",1.0,"2019-24",5112),
("Germany",1.0,"2019-25",4795),
("Germany",1.0,"2019-26",4673),
("Germany",1.0,"2019-27",5330)
).toDF("Country", "Timezone", "Year_Week", "MinUsers")
val w = Window.partitionBy("Country", "Timezone")
.orderBy("Year_Week")
.rowsBetween(-25, Window.currentRow)
df.select(
$"Country",
$"Timezone",
$"Year_week",
avg($"MinUsers").over(w).as("Avg(MinUsers)")
)
.filter($"Year_Week" >= "2019-25")
.show()
The filter is only there to reduce the output to the rows shown in your question; the window function still calculates the average for every row. When the frame would extend past the beginning of the DataFrame, the average is simply computed over the rows that do exist in that window.
The above code produces:
+-------+--------+---------+-----------------+
|Country|Timezone|Year_week| Avg(MinUsers)|
+-------+--------+---------+-----------------+
|Germany| 1.0| 2019-25| 5371.24|
|Germany| 1.0| 2019-26|5344.384615384615|
|Germany| 1.0| 2019-27|5383.153846153846|
+-------+--------+---------+-----------------+
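Since the surrounding thread is about PySpark, here is a rough PySpark equivalent of the same rolling window (a sketch, assuming a DataFrame df with the Country, Timezone, Year_Week and MinUsers columns shown above):
from pyspark.sql import functions as F
from pyspark.sql import Window

# The frame covers the current row plus up to 25 preceding rows within each (Country, Timezone).
w = Window.partitionBy("Country", "Timezone") \
    .orderBy("Year_Week") \
    .rowsBetween(-25, Window.currentRow)

df.select(
    "Country",
    "Timezone",
    "Year_Week",
    F.avg("MinUsers").over(w).alias("Avg(MinUsers)")) \
    .filter(F.col("Year_Week") >= "2019-25") \
    .show()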
If it is a date field, you can use the following code. You can replace DAYS with WEEKS, MONTHS, YEARS, etc.:
spark.sql(
"""SELECT *, avg(some_value) OVER (
PARTITION BY Country, Timezone
ORDER BY CAST(Year_Week AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS avg FROM df""").show()
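The interval-based RANGE frame is generally only available through SQL; in the DataFrame API a common workaround is to order the window by the timestamp cast to seconds and express the frame in seconds. A sketch, assuming a real timestamp column (named event_ts here for illustration) and a value column some_value:
from pyspark.sql import functions as F
from pyspark.sql import Window

seven_days = 7 * 86400  # frame width in seconds
w = Window.partitionBy("Country", "Timezone") \
    .orderBy(F.col("event_ts").cast("long")) \
    .rangeBetween(-seven_days, 0)

df.withColumn("avg", F.avg("some_value").over(w)).show()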

Not getting the other column when using Spark SQL groupBy with max?

I have a dataset for movie ratings per year.
+--------------------+----------+----------+
| movie_title|imdb_score|title_year|
+--------------------+----------+----------+
| Avatar?| 7.9| 2009|
|Pirates of the Ca...| 7.1| 2007|
| Spectre?| 6.8| 2015|
|The Dark Knight R...| 8.5| 2012|
|Star Wars: Episod...| 7.1| null|
| John Carter?| 6.6| 2012|
| Spider-Man 3?| 6.2| 2007|
| Tangled?| 7.8| 2010|
|Avengers: Age of ...| 7.5| 2015|
|Harry Potter and ...| 7.5| 2009|
|Batman v Superman...| 6.9| 2016|
| Superman Returns?| 6.1| 2006|
| Quantum of Solace?| 6.7| 2008|
|Pirates of the Ca...| 7.3| 2006|
| The Lone Ranger?| 6.5| 2013|
| Man of Steel?| 7.2| 2013|
|The Chronicles of...| 6.6| 2008|
| The Avengers?| 8.1| 2012|
|Pirates of the Ca...| 6.7| 2011|
| Men in Black 3?| 6.8| 2012|
|The Hobbit: The B...| 7.5| 2014|
|The Amazing Spide...| 7.0| 2012|
| Robin Hood?| 6.7| 2010|
|The Hobbit: The D...| 7.9| 2013|
| The Golden Compass?| 6.1| 2007|
| King Kong?| 7.2| 2005|
| Titanic?| 7.7| 1997|
|Captain America: ...| 8.2| 2016|
| Battleship?| 5.9| 2012|
| Jurassic World?| 7.0| 2015|
| Skyfall?| 7.8| 2012|
| Spider-Man 2?| 7.3| 2004|
| Iron Man 3?| 7.2| 2013|
|Alice in Wonderland?| 6.5| 2010|
|X-Men: The Last S...| 6.8| 2006|
|Monsters University?| 7.3| 2013|
|Transformers: Rev...| 6.0| 2009|
|Transformers: Age...| 5.7| 2014|
|Oz the Great and ...| 6.4| 2013|
|The Amazing Spide...| 6.7| 2014|
| TRON: Legacy?| 6.8| 2010|
I need to find the best rated movie for each year based on imdb_score.
I have created a data frame and also a temp view using df.createOrReplaceTempView("movie_metadata").
When I am executing
spark.sql("select max(imdb_score), title_year from movie_metadata group by title_year"),
I am getting the correct result:
+---------------+----------+
|max(imdb_score)|title_year|
+---------------+----------+
| 8.3| 1959|
| 8.7| 1990|
| 8.7| 1975|
| 8.7| 1977|
| 8.9| 2003|
| 8.4| 2007|
| 9.0| 1974|
| 8.6| 2015|
| 8.3| 1927|
| 8.1| 1955|
| 8.5| 2006|
| 8.2| 1978|
| 8.3| 1925|
| 8.3| 1961|
which shows the max score for each year, but I also need the movie_title that has the highest score.
When I am executing
spark.sql("select last(movie_title), max(imdb_score), title_year from movie_metadata group by title_year") with
movie_title as last or first, I am not getting the correct movie_title with the max score for that year.
I am also getting an exception without the first or last function. Please suggest the right way to do it.
Thanks
You can use a window function:
df.createOrReplaceTempView("Movies")
sparkSession.sqlContext.sql("select title_year, movie_title, imdb_score from (select *, row_number() OVER (PARTITION BY title_year ORDER BY imdb_score DESC) as rn FROM Movies) tmp where rn = 1").show(false)
If you prefer to avoid creating a temp view:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val window = Window.partitionBy("title_year").orderBy(col("imdb_score").desc)
df.withColumn("rn", row_number() over window).where(col("rn") === 1).drop(col("rn")).select(Seq(col("title_year"), col("movie_title"), col("imdb_score")): _*).show(false)
Hope it helps
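If you are working in PySpark rather than Scala, the same row_number approach looks roughly like this (a sketch, assuming the DataFrame is named df):
from pyspark.sql import functions as F
from pyspark.sql import Window

window = Window.partitionBy("title_year").orderBy(F.col("imdb_score").desc())

# Keep only the top-rated movie per year, then leave out the helper column in the select.
df.withColumn("rn", F.row_number().over(window)) \
    .filter(F.col("rn") == 1) \
    .select("title_year", "movie_title", "imdb_score") \
    .show(truncate=False)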

Duplicating the record count in apache spark

This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
That solution works fine; however, the expected output should count records in different categories conditionally.
So, the output should look like:
+-------+--------+-----+
|   city|   media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share1 | 2|
| Boston| share2 | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share1 | 1|
| Warsaw|share2 | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Here, if the action is share, I need it counted in both share1 and share2. When I count it programmatically, I use a case statement along the lines of: case when action is share then share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark or SQL?
A simple filter and a few unions should give you your desired output:
import org.apache.spark.sql.functions.lit
import spark.implicits._  // for the $ column syntax (imported automatically in spark-shell)
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
media.union(action.filter($"media" =!= "share"))
.union(share.withColumn("media", lit("share1")))
.union(share.withColumn("media", lit("share2")))
.show(false)
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+
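For reference, a PySpark translation of the same filter-and-union idea (a sketch, assuming the sales DataFrame defined above):
from pyspark.sql import functions as F

media = sales.groupBy("city", "media").count()
action = sales.groupBy("city", "action").count() \
    .select("city", F.col("action").alias("media"), "count")
share = action.filter(F.col("media") == "share")

media.union(action.filter(F.col("media") != "share")) \
    .union(share.withColumn("media", F.lit("share1"))) \
    .union(share.withColumn("media", F.lit("share2"))) \
    .show(truncate=False)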

How to transpose column to row with PySpark

I'm trying to transpose some columns of my table to rows. I found this previous post: Transpose column to row with Spark
I actually want to go the opposite way. Initially, I have:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
And what I want is:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-----+-----+-------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
How can I do it? Thanks!
Hi, you can use 'when' to emulate a SQL CASE-like statement. With that statement you redistribute the data over columns: if col_id is 'col_2' and you are computing col_1, you simply put 0.
After that, a simple sum reduces the number of rows:
from pyspark.sql import functions as F
# Spread col_value over one column per col_id value, then collapse the rows per A with a sum.
df2 = df.select(df.A,
    F.when(df.col_id == 'col_1', df.col_value).otherwise(0).alias('col_1'),
    F.when(df.col_id == 'col_2', df.col_value).otherwise(0).alias('col_2'))
df2.groupBy("A").agg(
    F.sum("col_1").alias('col_1'),
    F.sum("col_2").alias('col_2')).show()