Merging rows from different dataframes together in Scala - scala

For example first I have a dataframe like this
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
We have years 2012, 1997 and 2015. And we have another dataframe like this:
+----+----+-----+----------+-----+
|year|make|model|   comment|blank|
+----+----+-----+----------+-----+
|2012| BMW|    3|No comment|     |
|1997|  VW|  GTI|       get|     |
|2015|  MB| C200|      good| null|
+----+----+-----+----------+-----+
We also have years 2012, 1997 and 2015. How can we merge the rows with the same year together? Thanks.
The output should be like this
+----+-----+-----+--------------------+-----+----+-----+----------+-----+
|year| make|model|             comment|blank|make|model|   comment|blank|
+----+-----+-----+--------------------+-----+----+-----+----------+-----+
|2012|Tesla|    S|          No comment|     | BMW|    3|No comment|     |
|1997| Ford| E350|Go get one now th...|     |  VW|  GTI|       get|     |
|2015|Chevy| Volt|                null| null|  MB| C200|      good| null|
+----+-----+-----+--------------------+-----+----+-----+----------+-----+

You can get your desired table with a simple join. Something like:
val joined = df1.join(df2, df1("year") === df2("year"))
I loaded your inputs such that I see the following:
scala> df1.show
...
+----+-----+-----+--------------+
|year| make|model|       comment|
+----+-----+-----+--------------+
|2012|Tesla|    S|    No comment|
|1997| Ford| E350|Go get one now|
|2015|Chevy| Volt|          null|
+----+-----+-----+--------------+

scala> df2.show
...
+----+----+-----+----------+
|year|make|model|   comment|
+----+----+-----+----------+
|2012| BMW|    3|No comment|
|1997|  VW|  GTI|       get|
|2015|  MB| C200|      good|
+----+----+-----+----------+
When I run the join, I get:
scala> val joined = df1.join(df2, df1("year") === df2("year"))
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string]
scala> joined.show
...
+----+-----+-----+--------------+----+----+-----+----------+
|year| make|model|       comment|year|make|model|   comment|
+----+-----+-----+--------------+----+----+-----+----------+
|2012|Tesla|    S|    No comment|2012| BMW|    3|No comment|
|2015|Chevy| Volt|          null|2015|  MB| C200|      good|
|1997| Ford| E350|Go get one now|1997|  VW|  GTI|       get|
+----+-----+-----+--------------+----+----+-----+----------+
One thing to note is that the column names are ambiguous, since they are the same across both dataframes, so you may want to rename them to make operations on the resulting dataframe easier to write; see the sketch below.
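For example, a minimal sketch of prefixing df2's columns before the join (the right_ prefix and the foldLeft approach are just one illustrative choice, not part of the original answer):

val df2Renamed = df2.columns.foldLeft(df2) { (d, c) =>
  // prefix every column of df2 so the joined result has unique names
  d.withColumnRenamed(c, s"right_$c")
}

val joined = df1.join(df2Renamed, df1("year") === df2Renamed("right_year"))
// columns from df1 keep their names; df2's arrive as right_year, right_make, right_model, ...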

Related

Keep only modified rows in Pyspark

I need to clean a dataset by keeping only the rows that were modified (compared to the previous row) based on certain fields (in the example below we only consider city and sport, for each id), keeping only the first occurrence.
If a row goes back to a previous state (but not to the immediately preceding one), I still want to keep it.
Input df1
+---+----------+----------+----------+
| id|      city|     sport|      date|
+---+----------+----------+----------+
|abc|    london|  football|2022-02-11|
|abc|     paris|  football|2022-02-12|
|abc|     paris|  football|2022-02-13|
|abc|     paris|  football|2022-02-14|
|abc|     paris|  football|2022-02-15|
|abc|    london|  football|2022-02-16|
|abc|     paris|  football|2022-02-17|
|def|     paris|    volley|2022-02-10|
|def|     paris|    volley|2022-02-11|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
Desired output
+---+----------+----------+----------+
| id|      city|     sport|      date|
+---+----------+----------+----------+
|abc|    london|  football|2022-02-11|
|abc|     paris|  football|2022-02-12|
|abc|    london|  football|2022-02-16|
|abc|     paris|  football|2022-02-17|
|def|     paris|    volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
I would simply use a lag function to compare over a hash:
from pyspark.sql import functions as F, Window

output_df = (
    df.withColumn("hash", F.hash(F.col("city"), F.col("sport")))
    .withColumn(
        "prev_hash", F.lag("hash").over(Window.partitionBy("id").orderBy("date"))
    )
    .where(~F.col("hash").eqNullSafe(F.col("prev_hash")))
    .drop("hash", "prev_hash")
)
output_df.show()
+---+----------+----------+----------+
| id| city| sport| date|
+---+----------+----------+----------+
|abc| london| football|2022-02-11|
|abc| paris| football|2022-02-12|
|abc| london| football|2022-02-16|
|abc| paris| football|2022-02-17|
|def| paris| volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
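Since the rest of this page is in Scala, here is a minimal sketch of the same hash/lag idea on the Scala side (a translation under the assumption of the same column names, not part of the original answer):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, hash, lag}

// one window per id, ordered by date, so lag() sees the previous row for that id
val w = Window.partitionBy("id").orderBy("date")

val outputDf = df
  .withColumn("hash", hash(col("city"), col("sport")))
  .withColumn("prev_hash", lag("hash", 1).over(w))
  .where(!(col("hash") <=> col("prev_hash")))  // null-safe "changed since previous row"
  .drop("hash", "prev_hash")

outputDf.show()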
Though the following solution works for the given data, there are two caveats:
Spark's architecture is not suitable for serial processing like this.
As I pointed out in the comment, you must have a key attribute or combination of attributes that can bring your data back into order if it gets fragmented. A slight change in partitioning or fragmentation can change the results.
The logic is:
Shift "city" and "sport" down by one row.
Compare this row's "city" and "sport" with the shifted values. If you see a difference, that is a new row; for identical rows there will be no difference. For this we use Spark's Window utility and a "dummy_serial_key".
Filter the data which matches the above condition.
Feel free to add more columns as per your data design:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data=[["abc","london","football","2022-02-11"],["abc","paris","football","2022-02-12"],["abc","paris","football","2022-02-13"],["abc","paris","football","2022-02-14"],["abc","paris","football","2022-02-15"],["abc","london","football","2022-02-16"],["abc","paris","football","2022-02-17"],["def","paris","volley","2022-02-10"],["def","paris","volley","2022-02-11"],["ghi","manchester","basketball","2022-02-09"]], schema=["id","city","sport","date"])
df = df.withColumn("date", F.to_date("date", format="yyyy-MM-dd"))

# single dummy key so lag() looks at the immediately preceding row of the whole dataset
df = df.withColumn("dummy_serial_key", F.lit(0))
dummy_w = Window.partitionBy("dummy_serial_key").orderBy("dummy_serial_key")
df = df.withColumn("city_prev", F.lag("city", offset=1).over(dummy_w))
df = df.withColumn("sport_prev", F.lag("sport", offset=1).over(dummy_w))

# keep the first row and every row that differs from the previous one
df = df.filter(
    (F.col("city_prev").isNull())
    | (F.col("sport_prev").isNull())
    | (F.col("city") != F.col("city_prev"))
    | (F.col("sport") != F.col("sport_prev"))
)
df = df.drop("dummy_serial_key", "city_prev", "sport_prev")
+---+----------+----------+----------+
| id| city| sport| date|
+---+----------+----------+----------+
|abc| london| football|2022-02-11|
|abc| paris| football|2022-02-12|
|abc| london| football|2022-02-16|
|abc| paris| football|2022-02-17|
|def| paris| volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+

Spark: aggregate values by a list of given years

I'm new to Scala. Say I have a dataset:
>>> ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+
and I have a List<Integer> years = {1, 3, 8}, which means x years after system_year.
The goal is to calculate the total number of products sold for each of those offsets after system_year.
In other words, I have to calculate the total products sold up to the years 2013, 2015 and 2020.
The output dataset should be like this:
+----+------------------+
|year|total_product_sold|
+----+------------------+
|   1|                 6|   -> 2012 - 2013: 6 products sold
|   3|                 9|   -> 2012 - 2015: 9 products sold
|   8|                13|   -> 2012 - 2020: 13 products sold
+----+------------------+
I want to know how to do this in Scala. Should I use groupBy() in this case?
You could have used a groupBy with case/when if the year ranges didn't overlap (see the sketch after the output below), but here you'll need to do a groupBy for each year and then union the three grouped dataframes:
import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._

val years = List(1, 3, 8)

val result = years.map { y =>
  df.filter($"year".between($"system_year", $"system_year" + y))
    .groupBy(lit(y).as("year"))
    .agg(sum($"nb_product_sold").as("total_product_sold"))
}.reduce(_ union _)
result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//|   1|                 6|
//|   3|                 9|
//|   8|                13|
//+----+------------------+
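As a side note, here is a minimal sketch of the case/when alternative mentioned above. It only applies when each row falls into exactly one bucket, so the boundaries below are hypothetical non-overlapping ranges; it gives per-bucket totals, not the cumulative totals asked for in this question:

import org.apache.spark.sql.functions.{sum, when}
import spark.implicits._

val bucketed = df.withColumn("bucket",
  when($"year".between($"system_year", $"system_year" + 1), 1)
    .when($"year".between($"system_year" + 2, $"system_year" + 3), 3)
    .when($"year".between($"system_year" + 4, $"system_year" + 8), 8))  // null if outside all ranges

bucketed.where($"bucket".isNotNull)
  .groupBy("bucket")
  .agg(sum($"nb_product_sold").as("total_product_sold"))
  .show()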
There might be multiple ways of doing this, some more efficient than what I am showing you, but it works for your use case.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

//Sample data
val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012))
  .toDF("year","nb_product_sold","system_year")

//taking the difference of the years from the system year
val df1 = df.withColumn("Difference", $"year" - $"system_year")

//getting the running total for all years present in the dataframe by partitioning on year
val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w))
  .withColumn("yearlist", lit(0))
  .dropDuplicates("year", "system_year", "Difference")

//creating the years list
val years = List(1, 3, 8)

//creating a dataframe with the total count for each year, unioning them all and removing duplicates
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years) {
  val innerdf = df2.filter($"Difference" >= year - 1 && $"Difference" <= year).withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}

//again partitioning by system_year and summing over the years as per the requirement
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1)).select("yearlist", "total_product_sold")
You can see the output below:
+--------+------------------+
|yearlist|total_product_sold|
+--------+------------------+
|       1|                 6|
|       3|                 9|
|       8|                13|
+--------+------------------+

WeekOfYear column getting null in the SparkSQL

Here I am writing a SQL statement for spark.sql, but I am not able to get WEEKOFYEAR to return the week of the year; I am getting null in the output.
Below I have shown the expression I am using.
Input Data
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 8.26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,01-12-2010 8.26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 8.26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 8.26,3.39,17850,United Kingdom
SQL CODE
val summarySQlTest = spark.sql(
  """
    |select Country, WEEKOFYEAR(InvoiceDate) as WeekNumber,
    |count(distinct(InvoiceNo)) as NumInvoices,
    |sum(Quantity) as TotalQuantity,
    |round(sum(Quantity*UnitPrice),2) as InvoiceValue
    |from sales
    |group by Country,WeekNumber
    |""".stripMargin
).show()
DESIRED OUTPUT
+--------------+----------+-----------+-------------+------------+
|       Country|WeekNumber|NumInvoices|TotalQuantity|InvoiceValue|
+--------------+----------+-----------+-------------+------------+
|         Spain|        49|          1|           67|      174.72|
|       Germany|        48|         11|         1795|     3309.75|
Output I am getting
+--------------+----------+-----------+-------------+------------+
|       Country|WeekNumber|NumInvoices|TotalQuantity|InvoiceValue|
+--------------+----------+-----------+-------------+------------+
|         Spain|      null|          1|           67|      174.72|
|       Germany|      null|         11|         1795|     3309.75|
For the desired output I used the code below, but I want to solve the same thing in spark.sql.
It would also be great if someone could explain what is actually happening in to_date(col("InvoiceDate"), "dd-MM-yyyy H.mm").
val knowFunc = invoicesDF.withColumn("InvoiceDate", to_date(col("InvoiceDate"), "dd-MM-yyyy H.mm"))
  .where("year(InvoiceDate) == 2010")
  .withColumn("WeekNumber", weekofyear(col("InvoiceDate")))
  .groupBy("Country", "WeekNumber")
  .agg(sum("Quantity").as("TotalQuantity"),
    round(sum(expr("Quantity*UnitPrice")), 2).as("InvoiceValue"))
  .show()
You'll need to convert the InvoiceDate column to date type first (using to_date) before you can call weekofyear. That is also what is going on in your DataFrame code: InvoiceDate is a string like "01-12-2010 8.26", which Spark cannot implicitly cast to a date, so WEEKOFYEAR ends up operating on null and returns null; to_date with the pattern "dd-MM-yyyy H.mm" parses that string into a proper date first. I guess this also answers your last question.
val summarySQlTest = spark.sql(
  """
    |select Country, WEEKOFYEAR(to_date(InvoiceDate,'dd-MM-yyyy H.mm')) as WeekNumber,
    |count(distinct(InvoiceNo)) as NumInvoices,
    |sum(Quantity) as TotalQuantity,
    |round(sum(Quantity*UnitPrice),2) as InvoiceValue
    |from sales
    |group by Country,WeekNumber
    |""".stripMargin
).show()
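For completeness, one possible way to register the sales view that the SQL above queries (a sketch only; the file name and read options are assumptions, not from the original post):

val invoicesDF = spark.read
  .option("header", "true")       // first line of the CSV holds the column names
  .option("inferSchema", "true")
  .csv("invoices.csv")            // hypothetical path to the input shown above

invoicesDF.createOrReplaceTempView("sales")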

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case that won't work. My initial idea was to use a udf, but I'm not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use when/otherwise to populate the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}

val df2_date = df2.withColumn("date", to_date(df2("start_date_time")))
  .withColumn("check", lit(1))
  .select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0))
  .drop("date")
  .show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or, as another option, first filter df2 and then join it back with df1 on the ID column:
val df2_date = df2.withColumn("date", to_date(df2("start_date_time")))
  .filter($"date" === "2016-10-11")
  .withColumn("check", lit(1))
  .select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
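Since the question keeps start_date in a variable rather than a hard-coded literal, one way to plug it into the first comparison above (a sketch; it assumes start_date is a String in yyyy-MM-dd format, which Spark can compare against a date column):

import org.apache.spark.sql.functions.{lit, when}

val start_date = "2016-10-11"

df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === lit(start_date), $"check").otherwise(0))
  .drop("date")
  .show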

Find columns with different values

My dataframe has 120 columns. Suppose my dataframe has the below structure:
Id value1 value2 value3
a 10 1983 19
a 20 1983 20
a 10 1983 21
b 10 1984 1
b 10 1984 2
We can see here that for id a, value1 has different values (10, 20). I have to find the columns that have different values for a particular id. Is there any statistical or other approach in Spark to solve this problem?
Expected output
id new_column
a value1,value3
b value3
The following code might be a start of an answer:
val result = log.select("Id", "value1", "value2", "value3")
  .groupBy('Id)
  .agg('Id, countDistinct('value1), countDistinct('value2), countDistinct('value3))
It should do the following:
1)
log.select("Id","value1","value2","value3")
selects the relevant columns (if you want to take all columns, this step may be redundant)
2)
groupBy('Id)
groups rows with the same Id
3)
agg('Id, countDistinct('value1), countDistinct('value2), countDistinct('value3))
outputs the Id and the number (count) of distinct values per Id for each column
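To go from these per-Id distinct counts to the new_column format shown in the expected output, a possible follow-up step (a sketch, not part of the original answer; the list of value columns and the "more than one distinct value" rule are taken from the question):

import org.apache.spark.sql.functions.{col, concat_ws, countDistinct, lit, when}

val valueCols = Seq("value1", "value2", "value3")
val aggExprs = valueCols.map(c => countDistinct(col(c)).as(s"${c}_cnt"))

val counts = log.groupBy("Id").agg(aggExprs.head, aggExprs.tail: _*)

// keep the name of every column whose distinct count per Id is greater than 1,
// then join the surviving names into a comma-separated string
val result = counts.select(
  col("Id"),
  concat_ws(",", valueCols.map(c => when(col(s"${c}_cnt") > 1, lit(c))): _*).as("new_column")
)
result.show()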
You can do this in several ways, one of them being the distinct method, which is similar to the SQL behaviour. Another would be the groupBy method, where you pass as parameters the names of the columns you want to group by (e.g. df.groupBy("Id", "value1")).
Below is an example using the distinct method.
scala> case class Person(name : String, age: Int)
defined class Person
scala> val persons = Seq(Person("test", 10), Person("test", 20), Person("test", 10)).toDF
persons: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
|test| 10|
+----+---+
scala> persons.select("name", "age").distinct().show
+-----+---+
| name|age|
+-----+---+
| test| 10|
| test| 20|
+-----+---+