Not getting the other column when using Spark SQL group by with max - Scala

I have a dataset for movie ratings per year.
+--------------------+----------+----------+
| movie_title|imdb_score|title_year|
+--------------------+----------+----------+
| Avatar?| 7.9| 2009|
|Pirates of the Ca...| 7.1| 2007|
| Spectre?| 6.8| 2015|
|The Dark Knight R...| 8.5| 2012|
|Star Wars: Episod...| 7.1| null|
| John Carter?| 6.6| 2012|
| Spider-Man 3?| 6.2| 2007|
| Tangled?| 7.8| 2010|
|Avengers: Age of ...| 7.5| 2015|
|Harry Potter and ...| 7.5| 2009|
|Batman v Superman...| 6.9| 2016|
| Superman Returns?| 6.1| 2006|
| Quantum of Solace?| 6.7| 2008|
|Pirates of the Ca...| 7.3| 2006|
| The Lone Ranger?| 6.5| 2013|
| Man of Steel?| 7.2| 2013|
|The Chronicles of...| 6.6| 2008|
| The Avengers?| 8.1| 2012|
|Pirates of the Ca...| 6.7| 2011|
| Men in Black 3?| 6.8| 2012|
|The Hobbit: The B...| 7.5| 2014|
|The Amazing Spide...| 7.0| 2012|
| Robin Hood?| 6.7| 2010|
|The Hobbit: The D...| 7.9| 2013|
| The Golden Compass?| 6.1| 2007|
| King Kong?| 7.2| 2005|
| Titanic?| 7.7| 1997|
|Captain America: ...| 8.2| 2016|
| Battleship?| 5.9| 2012|
| Jurassic World?| 7.0| 2015|
| Skyfall?| 7.8| 2012|
| Spider-Man 2?| 7.3| 2004|
| Iron Man 3?| 7.2| 2013|
|Alice in Wonderland?| 6.5| 2010|
|X-Men: The Last S...| 6.8| 2006|
|Monsters University?| 7.3| 2013|
|Transformers: Rev...| 6.0| 2009|
|Transformers: Age...| 5.7| 2014|
|Oz the Great and ...| 6.4| 2013|
|The Amazing Spide...| 6.7| 2014|
| TRON: Legacy?| 6.8| 2010|
I need to find the best-rated movie for each year based on imdb_score.
I have created a DataFrame and also a temp view using df.createOrReplaceTempView("movie_metadata").
When I execute
spark.sql("select max(imdb_score), title_year from movie_metadata group by title_year"),
I get the correct result:
+---------------+----------+
|max(imdb_score)|title_year|
+---------------+----------+
| 8.3| 1959|
| 8.7| 1990|
| 8.7| 1975|
| 8.7| 1977|
| 8.9| 2003|
| 8.4| 2007|
| 9.0| 1974|
| 8.6| 2015|
| 8.3| 1927|
| 8.1| 1955|
| 8.5| 2006|
| 8.2| 1978|
| 8.3| 1925|
| 8.3| 1961|
which shows the max score for each year, but I also need the movie_title that has the highest score.
When I execute
spark.sql("select last(movie_title), max(imdb_score), title_year from movie_metadata group by title_year") with
movie_title wrapped in last or first, I do not get the correct movie_title for the max score of that year.
Without first or last I get an exception. Please suggest the right way to do it.
Thanks

You can use Window:
df.createOrReplaceTempView("Movies")
sparkSession.sqlContext.sql(
  """select title_year, movie_title, imdb_score
    |from (select *,
    |             row_number() OVER (PARTITION BY title_year ORDER BY imdb_score DESC) as rn
    |      FROM Movies) tmp
    |where rn = 1""".stripMargin).show(false)
If you prefer without creating a temp view:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val window = Window.partitionBy("title_year").orderBy(col("imdb_score").desc)
df.withColumn("rn", row_number() over window).where(col("rn") === 1).drop(col("rn")).select(Seq(col("title_year"), col("movie_title"), col("imdb_score")): _*).show(false)
Hope it helps
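A small follow-up on the design choice: row_number() keeps exactly one movie per year even when several tie on the top score. If ties should all be kept, rank() can be swapped in; a sketch under that assumption (not part of the original answer):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Same window definition as above, but rank() gives 1 to every movie that ties
// for the year's best imdb_score, so all of them survive the rn = 1 filter.
val w = Window.partitionBy("title_year").orderBy(col("imdb_score").desc)
df.withColumn("rn", rank().over(w))
  .where(col("rn") === 1)
  .drop("rn")
  .select("title_year", "movie_title", "imdb_score")
  .show(false)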

Related

Scala: filtering out rows in a joined df based on 2 columns with same values - best way

I'm comparing 2 dataframes.
I chose to compare them column by column.
I created 2 smaller dataframes from the parent dataframes,
based on the join columns and the comparison columns.
Created the 1st dataframe:
val df1_subset = df1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 66|
| blake | lively| 66|
| eva| green| 44|
| brad| pitt| 99|
| jason| momoa| 34|
| george | clooney| 67|
| ed| sheeran| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| null| null| |
+----------+---------+-------------+
Created 2nd Dataframe:
val df1_1_subset = df1_1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 34|
| brad| pitt| 78|
| eva| green| 56|
| tom | cruise| 99|
| jason| momoa| 34|
| george | clooney| 67|
| george | clooney| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| kyle| jenner| 56|
| celena| gomez| 2|
+----------+---------+-------------+
Then I joined the 2 subsets with a full outer join to get the following:
val df_subset_joined = df1_subset.join(df1_1_subset, joinColsArray, "full_outer")
Joined Subset
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| 45|
+----------+---------+-------------+-------------+
Then I tried to filter out the rows that are the same in both comparison columns (loyalty_score in this example) by using column positions:
df_subset_joined.filter(_c2 != _c3).show
But that didn't work. I'm getting the following error:
Error:(174, 33) not found: value _c2
df_subset_joined.filter(_c2 != _c3).show
What is the most efficient way for me to get a joined dataframe where I only see the rows that do not match in the comparison columns?
I would like to keep this dynamic, so hard-coding column names is not an option.
Thank you for helping me understand this.
You need to work with aliases and make use of the null-safe comparison operator (https://spark.apache.org/docs/latest/api/sql/index.html#_9), see also https://stackoverflow.com/a/54067477/1138523
val df_subset_joined = df1_subset.as("a").join(df1_1_subset.as("b"), joinColsArray, "full_outer")
df_subset_joined.filter(!($"a.loyalty_score" <=> $"b.loyalty_score")).show
EDIT: for dynamic column names, you can use string interpolation:
import org.apache.spark.sql.functions.col
val xxx : String = ???
df_subset_joined.filter(!(col(s"a.$xxx") <=> col(s"b.$xxx"))).show
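If there are several comparison columns to check at once, the same null-safe pattern can be folded over all of them. A sketch, assuming compareCols holds the names of the non-join comparison columns (the name compareCols is illustrative, not from the original post):
import org.apache.spark.sql.functions.col

val compareCols = Seq("loyalty_score")
// Keep a row if ANY comparison column differs between the aliased sides
// (null-safe, so null vs. non-null also counts as a difference).
val anyDiff = compareCols
  .map(c => !(col(s"a.$c") <=> col(s"b.$c")))
  .reduce(_ || _)
df_subset_joined.filter(anyDiff).show()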

Sorting numeric String in Spark Dataset

Let's assume that I have the following Dataset:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-13| 300|
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-100| 870|
+-----------+----------+
Where productCode is of String type and the amount is an Int.
If one tries to order this by productCode, the result will be as follows (and this is expected because of the nature of String comparison):
def orderProducts(product: Dataset[Product]): Dataset[Product] = {
  product.orderBy("productCode")
}
// Output:
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-10| 35|
| XX-100| 870|
| XX-13| 300|
| XX-2| 410|
| XX-9| 50|
+-----------+----------+
How can I get the output ordered by the integer part of productCode, like below, using the Dataset API?
+-----------+----------+
|productCode| amount|
+-----------+----------+
| XX-1| 250|
| XX-2| 410|
| XX-9| 50|
| XX-10| 35|
| XX-13| 300|
| XX-100| 870|
+-----------+----------+
Use an expression in the orderBy. Check this out:
scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]
scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
| XX-1|250|
| XX-2|410|
| XX-9| 50|
| XX-10| 35|
| XX-13|300|
| XX-100|870|
+-----------+---+
With window functions, you could do it like this:
scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1 |250|1 |
|XX-2 |410|2 |
|XX-9 |50 |3 |
|XX-10 |35 |4 |
|XX-13 |300|5 |
|XX-100 |870|6 |
+-----------+---+----+
Note that Spark warns about moving all the data to a single partition.
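For completeness, the snippets above assume these imports are already in scope (spark-shell brings in the SQL functions automatically, but Window still has to be imported explicitly):
// Imports assumed by the split/orderBy and row_number()/Window snippets above.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, split}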

How to generate sequence on a file with millions records (daily incremental load) in Spark2

I have a business scenario to generate a surrogate key on a daily incremental table or file in Spark 2.0 with Scala 2.11.8. I know about "zipWithIndex", "row_number" and "monotonically_increasing_id()", but none of them works for a daily incremental load, as today's sequence must start at 1 + yesterday's last value.
Accumulators also won't work, as they are write-only.
Example scenario: as of yesterday's load the last customer_sk is 1001; in today's load I want customer_sk to start from 1002 and continue to the end of the file.
Note: I will have millions of rows, and the program will be running on multiple nodes in parallel.
Thanks in advance
1) Get the max customer_sk from the table.
2) Then, when using row_number, add this max customer_sk so that your sequence continues from it (see the sketch below).
If using an RDD instead, add the previous max to (zipWithIndex + 1) in the same way.
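A minimal Scala sketch of those two steps; existing (yesterday's table), incoming (today's load) and the ordering column are illustrative names, and customer_sk is assumed to be a long column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, max, row_number}

// Step 1: last key issued so far (assumes customer_sk is a LongType column).
val maxSk: Long = existing.agg(max(col("customer_sk"))).first().getLong(0)

// Step 2: continue the sequence for today's rows. row_number() needs an
// ordering; any stable column works, but note that an un-partitioned window
// pulls all rows into a single partition.
val w = Window.orderBy(col("name"))
val withSk = incoming.withColumn("customer_sk", row_number().over(w) + lit(maxSk))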
For all those who are still looking for an answer with sample code:
hdfs dfs -cat /user/shahabhi/test_file_2.csv
abhishek,shah,123,pune,2018-12-31,2018-11-30
abhishek,shah,123,pune,2018-12-31,2018-11-30
ravi,sharma,464,mumbai,20181231,20181130
Mitesh,shah,987,satara,2018-12-31,2018-11-30
shalabh,nagar,981,satara,2018-12-31,2018-11-30
Gaurav,mehta,235,ujjain,2018/12/31,2018/11/30
Gaurav,mehta,235,ujjain,2018-12-31,2018-11-30
vikas,khanna,123,ujjain,2018-12-31,2018-11-30
vinayak,kale,789,pune,2018-12-31,2018-11-30
Spark code:
import org.apache.spark.sql.functions.monotonically_increasing_id
val df =spark.read.csv("/user/shahabhi/test_file_2.csv").toDF("name","lname","d_code","city","esd","eed")
df.show()
+--------+------+------+------+----------+----------+
| name| lname|d_code| city| esd| eed|
+--------+------+------+------+----------+----------+
|abhishek| shah| 123| pune|2018-12-31|2018-11-30|
|abhishek| shah| 123| pune|2018-12-31|2018-11-30|
| ravi|sharma| 464|mumbai| 20181231| 20181130|
| Mitesh| shah| 987|satara|2018-12-31|2018-11-30|
| shalabh| nagar| 981|satara|2018-12-31|2018-11-30|
| Gaurav| mehta| 235|ujjain|2018/12/31|2018/11/30|
| Gaurav| mehta| 235|ujjain|2018-12-31|2018-11-30|
| vikas|khanna| 123|ujjain|2018-12-31|2018-11-30|
| vinayak| kale| 789| pune|2018-12-31|2018-11-30|
+--------+------+------+------+----------+----------+
val df_2=df.withColumn("surrogate_key", monotonically_increasing_id())
df_2.show()
+--------+------+------+------+----------+----------+-------------+
| name| lname|d_code| city| esd| eed|surrogate_key|
+--------+------+------+------+----------+----------+-------------+
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 0|
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 1|
| ravi|sharma| 464|mumbai| 20181231| 20181130| 2|
| Mitesh| shah| 987|satara|2018-12-31|2018-11-30| 3|
| shalabh| nagar| 981|satara|2018-12-31|2018-11-30| 4|
| Gaurav| mehta| 235|ujjain|2018/12/31|2018/11/30| 5|
| Gaurav| mehta| 235|ujjain|2018-12-31|2018-11-30| 6|
| vikas|khanna| 123|ujjain|2018-12-31|2018-11-30| 7|
| vinayak| kale| 789| pune|2018-12-31|2018-11-30| 8|
+--------+------+------+------+----------+----------+-------------+
val df_3=df.withColumn("surrogate_key", monotonically_increasing_id()+1000)
df_3.show()
+--------+------+------+------+----------+----------+-------------+
| name| lname|d_code| city| esd| eed|surrogate_key|
+--------+------+------+------+----------+----------+-------------+
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 1000|
|abhishek| shah| 123| pune|2018-12-31|2018-11-30| 1001|
| ravi|sharma| 464|mumbai| 20181231| 20181130| 1002|
| Mitesh| shah| 987|satara|2018-12-31|2018-11-30| 1003|
| shalabh| nagar| 981|satara|2018-12-31|2018-11-30| 1004|
| Gaurav| mehta| 235|ujjain|2018/12/31|2018/11/30| 1005|
| Gaurav| mehta| 235|ujjain|2018-12-31|2018-11-30| 1006|
| vikas|khanna| 123|ujjain|2018-12-31|2018-11-30| 1007|
| vinayak| kale| 789| pune|2018-12-31|2018-11-30| 1008|
+--------+------+------+------+----------+----------+-------------+

Duplicating the record count in apache spark

This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
All good with that solution; however, the expected output should be counted in different categories conditionally.
So the output should look like:
+-------+--------+-----+
|   city|   media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share1 | 2|
| Boston| share2 | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share1 | 1|
| Warsaw|share2 | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Here, if the action is share, I need it counted in both share1 and share2. When I count it programmatically, I use a case statement: case when action is share, then share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark or SQL?
Simple filters and unions should give you your desired output:
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")
media.union(action.filter($"media" =!= "share"))
.union(share.withColumn("media", lit("share1")))
.union(share.withColumn("media", lit("share2")))
.show(false)
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+
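An alternative sketch that avoids the repeated unions: fan each "share" action out into share1 and share2 with an array plus explode, then count. This is a different technique from the answer above, shown only as an option on the same sales DataFrame:
import org.apache.spark.sql.functions.{array, col, explode, lit, when}

// "share" is mapped to both share1 and share2; every other action maps to itself.
val actionCounts = sales
  .withColumn("media",
    explode(when(col("action") === "share", array(lit("share1"), lit("share2")))
      .otherwise(array(col("action")))))
  .groupBy("city", "media").count()

val mediaCounts = sales.groupBy("city", "media").count()

mediaCounts.union(actionCounts).show(false)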

How to transpose column to row with PySpark

I'm trying to transpose some columns of my table to rows. I found this previous post: Transpose column to row with Spark.
I actually want to go the opposite way. Initially, I have:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
And what I want is:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-----+-----+-------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
How can I do it? Thanks!
Hi, you can use 'when' to emulate an SQL CASE-like statement; with it you redistribute the data over columns, so if col_id is 'col_2' and you are computing col_1 you simply put 0.
After that, a simple sum reduces the number of rows.
from pyspark.sql import functions as F

df2 = df.select(df.A,
                F.when(df.col_id == 'col_1', df.col_value).otherwise(0).alias('col_1'),
                F.when(df.col_id == 'col_2', df.col_value).otherwise(0).alias('col_2'))
df2.groupBy('A').agg(F.sum('col_1').alias('col_1'),
                     F.sum('col_2').alias('col_2')).show()