Edited with a new example for clarity.
The following data:
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 375| 12:25:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 650| 12:29:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
should yield the result below, where each row is compared to the previous row's 'ID' and 'location'. I need to log each time an ID was logged at a different location. An ID can visit the same location again later in the sequence, therefore dropDuplicates on ID and location isn't possible.
+------+---------+-----------+
| ID| location| loggedTime|
+------+---------+-----------+
| 67| 312| 12:09:00|
| 67| 375| 12:23:00|
| 67| 650| 12:26:00|
| 75| 650| 12:27:00|
| 75| 800| 12:30:00|
+------+---------+-----------+
A Window partitioned by ID and ordered by loggedTime can be used to get the location from the previous row via lag. Rows where the current and the previous location are the same can then be filtered out:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy("ID").orderBy("loggedTime")

df.withColumn("prev_location", F.lag("location").over(w)) \
  .filter("prev_location is null or location <> prev_location") \
  .drop("prev_location") \
  .show()
Output:
+---+--------+-------------------+
| ID|location| loggedTime|
+---+--------+-------------------+
| 67| 312|1970-01-01 00:09:00|
| 67| 375|1970-01-01 00:23:00|
| 67| 650|1970-01-01 00:26:00|
| 75| 650|1970-01-01 00:27:00|
| 75| 800|1970-01-01 00:30:00|
+---+--------+-------------------+
How about using group by?
from pyspark.sql.functions import col, min
df = df.groupBy(col("id"), col("location")).agg(min(col("loggedTime")))
I have a dataset:
+---------------+-----------+---------+--------+
| Country | Timezone |Year_Week|MinUsers|
+---------------+-----------+---------+--------+
|Germany |1.0 |2019-01 |4322 |
|Germany |1.0 |2019-02 |4634 |
|Germany |1.0 |2019-03 |5073 |
|Germany |1.0 |2019-04 |4757 |
|Germany |1.0 |2019-05 |5831 |
|Germany |1.0 |2019-06 |5026 |
|Germany |1.0 |2019-07 |5038 |
|Germany |1.0 |2019-08 |5005 |
|Germany |1.0 |2019-09 |5766 |
|Germany |1.0 |2019-10 |5204 |
|Germany |1.0 |2019-11 |5240 |
|Germany |1.0 |2019-12 |5306 |
|Germany |1.0 |2019-13 |5381 |
|Germany |1.0 |2019-14 |5659 |
|Germany |1.0 |2019-15 |5518 |
|Germany |1.0 |2019-16 |6666 |
|Germany |1.0 |2019-17 |5594 |
|Germany |1.0 |2019-18 |5395 |
|Germany |1.0 |2019-19 |5482 |
|Germany |1.0 |2019-20 |5582 |
|Germany |1.0 |2019-21 |5492 |
|Germany |1.0 |2019-22 |5889 |
|Germany |1.0 |2019-23 |6514 |
|Germany |1.0 |2019-24 |5112 |
|Germany |1.0 |2019-25 |4795 |
|Germany |1.0 |2019-26 |4673 |
|Germany |1.0 |2019-27 |5330 |
+---------------+-----------+---------+--------+
I want to slide over the dataset with a window of 25 weeks and calculate the average of MinUsers over that period. So the final result should look like:
+---------------+-----------+---------+-------------+
| Country | Timezone |Year_Week|Avg(MinUsers)|
+---------------+-----------+---------+-------------+
|Germany |1.0 |2019-25 |6006.12 |
|Germany |1.0 |2019-26 |2343.16 |
|Germany |1.0 |2019-27 |8464.2 |
+---------------+-----------+---------+-------------+
*The Avg(MinUsers) values are dummy numbers.
I want the average per Country per Timezone per Year_Week:
df
.groupBy("Country", "Timezone", "Year_Week")
.agg(min("NumUserPer4Hour").alias("MinUsers"))
.withColumn("Avg", avg("MinUsers").over(Window.partitionBy("Country", "Timezone").rowsBetween(-25, 0).orderBy("Year_Week")))
.orderBy("Country", "Year_Week")
I'm not sure how to add the partition information there. I tried a tumbling window as well, but it didn't work well.
It would be great if someone could help with this.
This can be solved with a Window Function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg
import spark.implicits._  // for $ and toDF; spark is the active SparkSession
val df = Seq(("Germany",1.0,"2019-01",4322),
("Germany",1.0,"2019-02",4634),
("Germany",1.0,"2019-03",5073),
("Germany",1.0,"2019-04",4757),
("Germany",1.0,"2019-05",5831),
("Germany",1.0,"2019-06",5026),
("Germany",1.0,"2019-07",5038),
("Germany",1.0,"2019-08",5005),
("Germany",1.0,"2019-09",5766),
("Germany",1.0,"2019-10",5204),
("Germany",1.0,"2019-11",5240),
("Germany",1.0,"2019-12",5306),
("Germany",1.0,"2019-13",5381),
("Germany",1.0,"2019-14",5659),
("Germany",1.0,"2019-15",5518),
("Germany",1.0,"2019-16",6666),
("Germany",1.0,"2019-17",5594),
("Germany",1.0,"2019-18",5395),
("Germany",1.0,"2019-19",5482),
("Germany",1.0,"2019-20",5582),
("Germany",1.0,"2019-21",5492),
("Germany",1.0,"2019-22",5889),
("Germany",1.0,"2019-23",6514),
("Germany",1.0,"2019-24",5112),
("Germany",1.0,"2019-25",4795),
("Germany",1.0,"2019-26",4673),
("Germany",1.0,"2019-27",5330)
).toDF("Country", "Timezone", "Year_Week", "MinUsers")
val w = Window.partitionBy("Country", "Timezone")
  .orderBy("Year_Week")
  .rowsBetween(-25, Window.currentRow)

df.select(
    $"Country",
    $"Timezone",
    $"Year_week",
    avg($"MinUsers").over(w).as("Avg(MinUsers)")
  )
  .filter($"Year_Week" >= "2019-25")
  .show()
The filter is there to reduce the output to the rows in your question, but the window function calculates the average for every row. When the frame reaches past the beginning of the DataFrame (fewer than 25 preceding weeks), it simply averages over the rows that do exist in that window; drop the filter to see those rows as well.
The above code produces:
+-------+--------+---------+-----------------+
|Country|Timezone|Year_week| Avg(MinUsers)|
+-------+--------+---------+-----------------+
|Germany| 1.0| 2019-25| 5371.24|
|Germany| 1.0| 2019-26|5344.384615384615|
|Germany| 1.0| 2019-27|5383.153846153846|
+-------+--------+---------+-----------------+
If it is a date field, you can use the following; replace DAYS with WEEKS, MONTHS, YEARS, etc. as needed:
spark.sql(
"""SELECT *, avg(some_value) OVER (
PARTITION BY Country, Timezone
ORDER BY CAST(Year_Week AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS avg FROM df""").show()
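For comparison, a sketch of the same range frame through the DataFrame API, assuming a hypothetical date/timestamp column event_date and the placeholder value column some_value from the SQL above; rangeBetween takes its bounds in the units of the ordering expression, so the timestamp is cast to epoch seconds:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// 7 days expressed in seconds, matching the epoch-second ordering column
val rangeWindow = Window
  .partitionBy("Country", "Timezone")
  .orderBy(col("event_date").cast("timestamp").cast("long"))
  .rangeBetween(-7 * 24 * 60 * 60, Window.currentRow)

df.withColumn("avg", avg(col("some_value")).over(rangeWindow)).show()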
I have a dataset for movie ratings per year.
+--------------------+----------+----------+
| movie_title|imdb_score|title_year|
+--------------------+----------+----------+
| Avatar?| 7.9| 2009|
|Pirates of the Ca...| 7.1| 2007|
| Spectre?| 6.8| 2015|
|The Dark Knight R...| 8.5| 2012|
|Star Wars: Episod...| 7.1| null|
| John Carter?| 6.6| 2012|
| Spider-Man 3?| 6.2| 2007|
| Tangled?| 7.8| 2010|
|Avengers: Age of ...| 7.5| 2015|
|Harry Potter and ...| 7.5| 2009|
|Batman v Superman...| 6.9| 2016|
| Superman Returns?| 6.1| 2006|
| Quantum of Solace?| 6.7| 2008|
|Pirates of the Ca...| 7.3| 2006|
| The Lone Ranger?| 6.5| 2013|
| Man of Steel?| 7.2| 2013|
|The Chronicles of...| 6.6| 2008|
| The Avengers?| 8.1| 2012|
|Pirates of the Ca...| 6.7| 2011|
| Men in Black 3?| 6.8| 2012|
|The Hobbit: The B...| 7.5| 2014|
|The Amazing Spide...| 7.0| 2012|
| Robin Hood?| 6.7| 2010|
|The Hobbit: The D...| 7.9| 2013|
| The Golden Compass?| 6.1| 2007|
| King Kong?| 7.2| 2005|
| Titanic?| 7.7| 1997|
|Captain America: ...| 8.2| 2016|
| Battleship?| 5.9| 2012|
| Jurassic World?| 7.0| 2015|
| Skyfall?| 7.8| 2012|
| Spider-Man 2?| 7.3| 2004|
| Iron Man 3?| 7.2| 2013|
|Alice in Wonderland?| 6.5| 2010|
|X-Men: The Last S...| 6.8| 2006|
|Monsters University?| 7.3| 2013|
|Transformers: Rev...| 6.0| 2009|
|Transformers: Age...| 5.7| 2014|
|Oz the Great and ...| 6.4| 2013|
|The Amazing Spide...| 6.7| 2014|
| TRON: Legacy?| 6.8| 2010|
I need to find the best rated movie for each year based on imdb_score.
I have created a DataFrame and also a temp view using df.createOrReplaceTempView("movie_metadata").
When I execute
spark.sql("select max(imdb_score), title_year from movie_metadata group by title_year")
I get the correct result:
+---------------+----------+
|max(imdb_score)|title_year|
+---------------+----------+
| 8.3| 1959|
| 8.7| 1990|
| 8.7| 1975|
| 8.7| 1977|
| 8.9| 2003|
| 8.4| 2007|
| 9.0| 1974|
| 8.6| 2015|
| 8.3| 1927|
| 8.1| 1955|
| 8.5| 2006|
| 8.2| 1978|
| 8.3| 1925|
| 8.3| 1961|
which shows the max score for each year, but I also need the movie_title that has the highest score.
When I execute
spark.sql("select last(movie_title), max(imdb_score), title_year from movie_metadata group by title_year")
with movie_title wrapped in last or first, I do not get the correct movie_title for the max score of that year.
I also get an exception without the first or last function. Please suggest the right way to do it.
Thanks
You can use Window:
df.createOrReplaceTempView("Movies")
sparkSession.sqlContext.sql(
  """select title_year, movie_title, imdb_score
    |from (select *, row_number() OVER (PARTITION BY title_year ORDER BY imdb_score DESC) as rn
    |      FROM Movies) tmp
    |where rn = 1""".stripMargin).show(false)
If you prefer to do it without creating a temp view:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val window = Window.partitionBy("title_year").orderBy(col("imdb_score").desc)
df.withColumn("rn", row_number() over window).where(col("rn") === 1).drop(col("rn")).select(Seq(col("title_year"), col("movie_title"), col("imdb_score")): _*).show(false)
Hope it helps
This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
All good with that solution; however, the expected output should count things in different categories conditionally.
So the output should look like this:
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share1 | 2|
| Boston| share2 | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share1 | 1|
| Warsaw|share2 | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Here, if the action is share, I need it counted in both share1 and share2. When I do this programmatically, I use a case statement: when the action is share, share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark, or SQL?
A simple filter and a couple of unions should give you your desired output:
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
media.union(action.filter($"media" =!= "share"))
.union(share.withColumn("media", lit("share1")))
.union(share.withColumn("media", lit("share2")))
.show(false)
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+
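As an aside, if you want something closer to the case-when style described in the question, a sketch (against the same sales DataFrame; labels is just an illustrative name) is to map each row to every label it should count toward, explode that array, and do a single groupBy:
import org.apache.spark.sql.functions._

// a "share" action counts toward share1 and share2; any other action counts toward itself
val labels = when($"action" === "share", array($"media", lit("share1"), lit("share2")))
  .otherwise(array($"media", $"action"))

sales.select($"city", explode(labels).as("media"))
  .groupBy("city", "media")
  .count()
  .show(false)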
I'm trying to transpose some columns of my table to rows. I found the previous post: Transpose column to row with Spark
I actually want to go the opposite way. Initially, I have:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
And what I want is:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-----+-----+-------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
How can I do it? Thanks!
Hi, you can use 'when' to emulate a SQL CASE-like statement. With it you redistribute the data over columns: if col_id is 'col_2' and you are calculating col_1, you simply put 0.
After that, a simple sum reduces the number of rows.
from pyspark.sql import functions as F

# spread col_value over one column per col_id (0 where it does not apply), then sum per A
df2 = df.select(df.A,
                F.when(df.col_id == 'col_1', df.col_value).otherwise(0).alias('col_1'),
                F.when(df.col_id == 'col_2', df.col_value).otherwise(0).alias('col_2'))
df2.groupBy('A') \
   .agg(F.sum('col_1').alias('col_1'), F.sum('col_2').alias('col_2')) \
   .show()