I need to clean a dataset by keeping only the rows that were modified compared to the previous one, based on certain fields (in the example below we only consider city and sport, for each id), keeping only the first occurrence of each run of identical rows.
If a row goes back to a previous state (but not the immediately preceding one), I still want to keep it.
Input df1
+---+----------+----------+----------+
| id|      city|     sport|      date|
+---+----------+----------+----------+
|abc|    london|  football|2022-02-11|
|abc|     paris|  football|2022-02-12|
|abc|     paris|  football|2022-02-13|
|abc|     paris|  football|2022-02-14|
|abc|     paris|  football|2022-02-15|
|abc|    london|  football|2022-02-16|
|abc|     paris|  football|2022-02-17|
|def|     paris|    volley|2022-02-10|
|def|     paris|    volley|2022-02-11|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
Desired output
+---+----------+----------+----------+
| id|      city|     sport|      date|
+---+----------+----------+----------+
|abc|    london|  football|2022-02-11|
|abc|     paris|  football|2022-02-12|
|abc|    london|  football|2022-02-16|
|abc|     paris|  football|2022-02-17|
|def|     paris|    volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
I would simply use a lag function to compare over a hash:
from pyspark.sql import functions as F, Window
output_df = (
    df.withColumn("hash", F.hash(F.col("city"), F.col("sport")))
    .withColumn(
        "prev_hash", F.lag("hash").over(Window.partitionBy("id").orderBy("date"))
    )
    # eqNullSafe is false when prev_hash is null (the first row of each id),
    # so negating it keeps that row together with every row whose hash changed
    .where(~F.col("hash").eqNullSafe(F.col("prev_hash")))
    .drop("hash", "prev_hash")
)
output_df.show()
+---+----------+----------+----------+
| id| city| sport| date|
+---+----------+----------+----------+
|abc| london| football|2022-02-11|
|abc| paris| football|2022-02-12|
|abc| london| football|2022-02-16|
|abc| paris| football|2022-02-17|
|def| paris| volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
Though the following solution works for the given data, there are 2 things to note:
Spark's architecture is not suitable for serial processing like this.
As I pointed out in the comment, you must have a key attribute or combination of attributes which can bring your data back in order if it gets fragmented. A slight change in partitioning and fragmentation can change the results.
The logic is:
Shift "city" and "sport" row by one index.
Compare with this row's "city" and "sport" with these shifted values. If you see a difference, then that is a new row. For similar rows, there will be no difference. For this we use Spark's Window util and a "dummy_serial_key".
Filter the data which matches above condition.
Feel free to add more columns as per your data design:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame(data=[["abc","london","football","2022-02-11"],["abc","paris","football","2022-02-12"],["abc","paris","football","2022-02-13"],["abc","paris","football","2022-02-14"],["abc","paris","football","2022-02-15"],["abc","london","football","2022-02-16"],["abc","paris","football","2022-02-17"],["def","paris","volley","2022-02-10"],["def","paris","volley","2022-02-11"],["ghi","manchester","basketball","2022-02-09"]], schema=["id","city","sport","date"])
df = df.withColumn("date", F.to_date("date", format="yyyy-MM-dd"))
df = df.withColumn("dummy_serial_key", F.lit(0))
dummy_w = Window.partitionBy("dummy_serial_key").orderBy("dummy_serial_key")
df = df.withColumn("city_prev", F.lag("city", offset=1).over(dummy_w))
df = df.withColumn("sport_prev", F.lag("sport", offset=1).over(dummy_w))
df = df.filter(
    (F.col("city_prev").isNull())
    | (F.col("sport_prev").isNull())
    | (F.col("city") != F.col("city_prev"))
    | (F.col("sport") != F.col("sport_prev"))
)
df = df.drop("dummy_serial_key", "city_prev", "sport_prev")
+---+----------+----------+----------+
| id| city| sport| date|
+---+----------+----------+----------+
|abc| london| football|2022-02-11|
|abc| paris| football|2022-02-12|
|abc| london| football|2022-02-16|
|abc| paris| football|2022-02-17|
|def| paris| volley|2022-02-10|
|ghi|manchester|basketball|2022-02-09|
+---+----------+----------+----------+
Related
I have a dataframe which has the following data:
+-------------+-------------+
| City| Country|
+-------------+-------------+
| Vancouver| Canada|
| Portland|United States|
|San Francisco|United States|
| Seattle|United States|
| Los Angeles|United States|
| San Diego|United States|
| Las Vegas|United States|
and so on... And I have another dataframe which has the cities as columns. I have 37 columns, i.e. 37 cities. Like this:
datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,...
2012-10-01 13:00:00,0.0,0.0,150.0,0.0,0.0,0.0,0.0,10.0,360.0,...
2012-10-01 14:00:00,76.0,80.0,87.0,80.0,88.0,81.0,21.0,23.0,49.0,62.0,92.0,
For the 1st dataframe, I converted the dataframe to a map:
val array = citydf1.collect.map(r => Map(citydf1.columns.zip(r.toSeq):_*))
array.foreach(println)
Map(City -> Vancouver, Country -> Canada), Map(City -> Portland, Country -> United States),Map(City -> San Francisco, Country -> United States)
I want to apply aggregate functions like min, max, avg and count on the columns in the 2nd dataframe.
Can I join the values of the map to those columns, if that is possible, so that I don't have to write a query for all 37 columns?
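One way to avoid writing a query for all 37 columns (a sketch, not from the original post) is to unpivot the wide dataframe into (datetime, City, value) rows with a stack expression, join that to the City/Country dataframe, and aggregate once. Here weatherDF is an assumed name for the wide dataframe; citydf1 is the City/Country dataframe from the question:
import org.apache.spark.sql.functions._

// weatherDF: assumed name for the wide dataframe (datetime + 37 city columns)
val cityCols = weatherDF.columns.filter(_ != "datetime")

// Build stack(37, 'Vancouver', `Vancouver`, 'Portland', `Portland`, ...) to unpivot
val stackExpr = s"stack(${cityCols.length}, " +
  cityCols.map(c => s"'$c', `$c`").mkString(", ") + ") as (City, value)"

val longDF = weatherDF.select(col("datetime"), expr(stackExpr))

// Attach the country through the mapping dataframe, then aggregate all cities at once
longDF.join(citydf1, Seq("City"), "left")
  .groupBy("Country", "City")
  .agg(min("value"), max("value"), avg("value"), count("value"))
  .show()
This keeps the aggregation logic in one place no matter how many city columns there are.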
I need to merge rows in the same dataframe based on a key column "id". In the sample data frame, one row has data for id, name and age; the other row has id, name and city. Rows with the same key "id" have to be merged into a single record in the final data frame. If there is just one record for an id, it should still be shown, with null values (Smith and Jake in the example below).
The computation needs to happen on real-time data, so a solution based on Spark native functions would be ideal. I have tried filtering the records based on the age and city columns into separate data frames and then performing a left join on id, but it is not very efficient. Looking for any alternate suggestions. Thanks in advance!
Sample Dataframe
val inputDF = Seq(
  ("100","John", Some(35), None),
  ("100","John", None, Some("Georgia")),
  ("101","Mike", Some(25), None),
  ("101","Mike", None, Some("New York")),
  ("103","Mary", Some(22), None),
  ("103","Mary", None, Some("Texas")),
  ("104","Smith", Some(25), None),
  ("105","Jake", None, Some("Florida"))
).toDF("id","name","age","city")
Input Dataframe
+---+-----+----+--------+
|id |name |age |city |
+---+-----+----+--------+
|100|John |35 |null |
|100|John |null|Georgia |
|101|Mike |25 |null |
|101|Mike |null|New York|
|103|Mary |22 |null |
|103|Mary |null|Texas |
|104|Smith|25 |null |
|105|Jake |null|Florida |
+---+-----+----+--------+
Expected Output Dataframe
+---+-----+----+---------+
| id| name| age| city|
+---+-----+----+---------+
|100| John| 35| Georgia|
|101| Mike| 25| New York|
|103| Mary| 22| Texas|
|104|Smith| 25| null|
|105| Jake|null| Florida|
+---+-----+----+---------+
Use first or last standard functions with ignoreNulls flag on.
first standard function
val q = inputDF
.groupBy("id", "name")
.agg(first("age", ignoreNulls = true) as "age", first("city", ignoreNulls = true) as "city")
.orderBy("id")
last standard function
val q = inputDF
.groupBy("id","name")
.agg(last("age", true) as "age", last("city") as "city")
.orderBy("id")
I have a dataframe that gives a set of id numbers and the date at which they visited a certain location, and I'm trying to find a way in Spark Scala to get the number of unique people ("id") that have visited this location on or before each day, so that one id number won't be counted twice if they visit on 2019-01-01 and then again on 2019-01-07, for example.
df.show(5,false)
+----+----------+
|  id|      date|
+----+----------+
|3424|2019-01-02|
|8683|2019-01-01|
|7690|2019-01-02|
|3424|2019-01-07|
|9002|2019-01-02|
+----+----------+
I want the output to look like this: where I groupBy(“date”) and get the count of unique id’s as a cumulative number. (So for example: next to 2019-01-03, it would give the distinct count of id’s on any day up to 2019-01-03)
+----------+-------+
|date      |cum_ct |
+----------+-------+
|2019-01-01|xxxxx  |
|2019-01-02|xxxxx  |
|2019-01-03|xxxxx  |
|...       |...    |
|2019-01-08|xxxxx  |
|2019-01-09|xxxxx  |
+----------+-------+
What would be the best way to do this after df.groupBy("date")?
You will have to use the ROW_NUMBER() function in this scenario. I have created a dataframe:
val df = Seq((1,"2019-05-03"),(1,"2018-05-03"),(2,"2019-05-03"),(2,"2018-05-03"),(3,"2019-05-03"),(3,"2018-05-03")).toDF("id","date")
df.show
+---+----------+
| id| date|
+---+----------+
| 1|2019-05-03|
| 1|2018-05-03|
| 2|2019-05-03|
| 2|2018-05-03|
| 3|2019-05-03|
| 3|2018-05-03|
+---+----------+
ID represents a person id in your case that can appear against multiple dates.
Here is the count against each date.
df.groupBy("date").count.show
+----------+-----+
| date|count|
+----------+-----+
|2018-05-03| 3|
|2019-05-03| 3|
+----------+-----+
This shows the repeated count of IDs against each date. I have used 3 IDs in total, and each date has a count of 3, which means all IDs are counted separately for each date.
Now, to my understanding, you want an ID to be counted only once against any date (depending on whether you want the latest or the oldest date).
I am going to use the latest date for every ID.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val newdf = df.withColumn("row_num", row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
The above line assigns row numbers to every ID's entries ordered by date, so row number 1 refers to the latest date of each ID. Now take the count against each ID where the row number is 1; that results in a single count for every ID (distinct).
Here is the output. I have applied a filter on the row number, and you can see in the output that the dates are the latest, i.e. 2019 in my case.
newdf.select("id","date","row_num").where("row_num = 1").show()
+---+----------+-------+
| id| date|row_num|
+---+----------+-------+
| 1|2019-05-03| 1|
| 3|2019-05-03| 1|
| 2|2019-05-03| 1|
+---+----------+-------+
Now I will take the count on newdf with the same filter, which will return a date-wise count.
newdf.groupBy("date","row_num").count().filter("row_num = 1").select("date","count").show
+----------+-----+
| date|count|
+----------+-----+
|2019-05-03| 3|
+----------+-----+
Here the total count is 3, which excludes IDs on previous dates; previously it was 6 (because IDs were repeated across multiple dates).
I hope it answers your questions.
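If you specifically need the running number of distinct ids on or before each date (as the question asks), a possible sketch, using the same df as above, is to keep only each id's first visit date, count the new ids per date, and take a cumulative sum over dates:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Each id contributes only its first visit date, so it can never be counted twice
val firstVisit = df.groupBy("id").agg(min("date").as("date"))

// Count the ids that appear for the first time on each date, then accumulate over dates
val cumCt = firstVisit
  .groupBy("date").agg(count("id").as("new_ids"))
  .withColumn("cum_ct", sum("new_ids").over(Window.orderBy("date")))
  .select("date", "cum_ct")

cumCt.show()
Dates with no new ids will not appear in this result; if every calendar date is required, join against a generated date range first.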
I have a Spark DataFrame of customers as shown below.
#SparkR code
customers <- data.frame(custID = c("001", "001", "001", "002", "002", "002", "002"),
date = c("2017-02-01", "2017-03-01", "2017-04-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
value = c('new', 'good', 'good', 'new', 'good', 'new', 'bad'))
customers <- createDataFrame(customers)
display(customers)
custID| date | value
--------------------------
001 | 2017-02-01| new
001 | 2017-03-01| good
001 | 2017-04-01| good
002 | 2017-01-01| new
002 | 2017-02-01| good
002 | 2017-03-01| new
002 | 2017-04-01| bad
In the first month observation for a custID the customer gets a value of 'new'. Thereafter they are classified as 'good' or 'bad'. However, it is possible for a customer to revert from 'good' or 'bad' back to 'new' in the case that they open a second account. When this happens I want to tag the customer with '2' instead of '1', to indicate that they opened a second account, as shown below. How can I do this in Spark? Either SparkR or PySpark commands work.
#What I want to get
custID| date | value | tag
--------------------------------
001 | 2017-02-01| new | 1
001 | 2017-03-01| good | 1
001 | 2017-04-01| good | 1
002 | 2017-01-01| new | 1
002 | 2017-02-01| good | 1
002 | 2017-03-01| new | 2
002 | 2017-04-01| bad | 2
In pyspark:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
# df is equal to your customers dataframe
df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', header=True, sep='|').cache()
df_new = df.filter(df['value'] == 'new').withColumn('tag', f.rank().over(Window.partitionBy('custID').orderBy('date')))
df = df_new.union(df.filter(df['value'] != 'new').withColumn('tag', f.lit(None)))
df = df.withColumn('tag', f.collect_list('tag').over(Window.partitionBy('custID').orderBy('date'))) \
    .withColumn('tag', f.udf(lambda x: x.pop(), IntegerType())('tag'))  # last collected tag = most recent 'new' rank
df.show()
And output:
+------+----------+-----+---+
|custID| date|value|tag|
+------+----------+-----+---+
| 001|2017-02-01| new| 1|
| 001|2017-03-01| good| 1|
| 001|2017-04-01| good| 1|
| 002|2017-01-01| new| 1|
| 002|2017-02-01| good| 1|
| 002|2017-03-01| new| 2|
| 002|2017-04-01| bad| 2|
+------+----------+-----+---+
By the way, pandas can do that easily.
This can be done using the following piece of code:
First, take all the records with value "new" and number them per custID:
# register the customers DataFrame so it can be queried as "df"
createOrReplaceTempView(customers, "df")
df_new <- sql("select * from df where value='new'")
createOrReplaceTempView(df_new, "df_new")
df_new <- sql("select *, row_number() over(partition by custID order by date)
               tag from df_new")
createOrReplaceTempView(df_new, "df_new")
# for each row, keep the tag of the latest 'new' record on or before its date
df <- sql("select custID, date, value, max(tag) as tag from
           (select t1.*, t2.tag from df t1 left outer join df_new t2 on
            t1.custID=t2.custID and t1.date>=t2.date) t group by 1,2,3")
For example, first I have a dataframe like this:
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
We have years 2012, 1997 and 2015. And we have another dataframe like this:
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|  BMW|    3|          No comment|     |
|1997|   VW|  GTI|                 get|     |
|2015|   MB| C200|                good| null|
+----+-----+-----+--------------------+-----+
We also have years 2012, 1997 and 2015. How can we merge the rows with the same year together? Thanks.
The output should be like this
+----+-----+-----+--------------------+-----+-----+-----+----------+-----+
|year| make|model|             comment|blank| make|model|   comment|blank|
+----+-----+-----+--------------------+-----+-----+-----+----------+-----+
|2012|Tesla|    S|          No comment|     |  BMW|    3|No comment|     |
|1997| Ford| E350|Go get one now th...|     |   VW|  GTI|       get|     |
|2015|Chevy| Volt|                null| null|   MB| C200|      good| null|
+----+-----+-----+--------------------+-----+-----+-----+----------+-----+
You can get your desired table with a simple join. Something like:
val joined = df1.join(df2, df1("year") === df2("year"))
I loaded your inputs such that I see the following:
scala> df1.show
...
year make model comment
2012 Tesla S No comment
1997 Ford E350 Go get one now
2015 Chevy Volt null
scala> df2.show
...
year make model comment
2012 BMW 3 No comment
1997 VW GTI get
2015 MB C200 good
When I run the join, I get:
scala> val joined = df1.join(df2, df1("year") === df2("year"))
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string]
scala> joined.show
...
year make model comment year make model comment
2012 Tesla S No comment 2012 BMW 3 No comment
2015 Chevy Volt null 2015 MB C200 good
1997 Ford E350 Go get one now 1997 VW GTI get
One thing to note is that your column names may be ambiguous as they are named the same across dataframes (so you could change their names to make operations on your resultant dataframe easier to write).
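For example, one way (a sketch) to remove the ambiguity is to prefix the second dataframe's columns, except the join key, before joining:
// Give df2's non-key columns unique names so the joined result is unambiguous
val df2Renamed = df2.columns.foldLeft(df2) { (d, c) =>
  if (c == "year") d else d.withColumnRenamed(c, s"df2_$c")
}

// Joining on a Seq of column names also keeps a single "year" column in the output
val joined = df1.join(df2Renamed, Seq("year"))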