Spark Scala Cumulative Unique Count by Date - scala

I have a dataframe of id numbers and the dates on which each id visited a certain location. I'm trying to find a way in Spark Scala to get the number of unique people ("id") that have visited this location on or before each day, so that an id is not counted twice if it visits on 2019-01-01 and then again on 2019-01-07, for example.
df.show(5,false)
+----+----------+
|id  |date      |
+----+----------+
|3424|2019-01-02|
|8683|2019-01-01|
|7690|2019-01-02|
|3424|2019-01-07|
|9002|2019-01-02|
+----+----------+
I want the output to look like this, where I groupBy("date") and get the count of unique ids as a cumulative number. (For example, the value next to 2019-01-03 would be the distinct count of ids seen on any day up to and including 2019-01-03.)
+----------+-------+
|date      |cum_ct |
+----------+-------+
|2019-01-01|xxxxx  |
|2019-01-02|xxxxx  |
|2019-01-03|xxxxx  |
|...       |...    |
|2019-01-08|xxxxx  |
|2019-01-09|xxxxx  |
+----------+-------+
What would be the best way to do this after df.groupBy("date")?

You will have to use the row_number() window function in this scenario. I have created a sample dataframe:
val df = Seq((1,"2019-05-03"),(1,"2018-05-03"),(2,"2019-05-03"),(2,"2018-05-03"),(3,"2019-05-03"),(3,"2018-05-03")).toDF("id","date")
df.show
+---+----------+
| id|      date|
+---+----------+
|  1|2019-05-03|
|  1|2018-05-03|
|  2|2019-05-03|
|  2|2018-05-03|
|  3|2019-05-03|
|  3|2018-05-03|
+---+----------+
ID represents a person id in your case that can appear against multiple dates.
Here is the count against each date.
df.groupBy("date").count.show
+----------+-----+
|      date|count|
+----------+-----+
|2018-05-03|    3|
|2019-05-03|    3|
+----------+-----+
This shows the repeated count of ids against each date. I have used 3 ids in total and each date has a count of 3, which means every id is counted separately on each date.
As I understand it, you want each id to be counted only once, against a single date (either its latest or its oldest date, depending on what you need).
I am going to use the latest date for every id.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val newdf = df.withColumn("row_num",row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
The line above numbers the rows of each id by date, newest first, so row number 1 refers to the latest date of each id. If you then count only the rows where the row number is 1, every id is counted exactly once (a distinct count).
Here is the output. I have applied a filter on the row number, and you can see that the dates are the latest ones, i.e. 2019 in my case.
newdf.select("id","date","row_num").where("row_num = 1").show()
+---+----------+-------+
| id|      date|row_num|
+---+----------+-------+
|  1|2019-05-03|      1|
|  3|2019-05-03|      1|
|  2|2019-05-03|      1|
+---+----------+-------+
Now I will take the count on newdf with the same filter, which returns the count per date.
newdf.groupBy("date","row_num").count().filter("row_num = 1").select("date","count").show
+----------+-----+
|      date|count|
+----------+-----+
|2019-05-03|    3|
+----------+-----+
Here the total count is 3, which excludes the ids' earlier dates; previously it was 6 because each id was repeated across multiple dates.
I hope this answers your question.
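The row_number() approach above counts each id once, but it does not yet produce the running total per date that the question asks for. One possible way to get there, as a rough sketch only: keep each id on its earliest date instead of its latest, count the new ids per date, and take a running sum over the dates. The object name CumulativeUniqueCount, the method cumulativeUnique and the column new_ids are made up for this illustration, and the sketch assumes the date column holds ISO yyyy-MM-dd strings (so string order equals date order).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, min, sum}

object CumulativeUniqueCount {
  // Hypothetical helper: expects columns "id" and "date" (ISO yyyy-MM-dd strings).
  def cumulativeUnique(df: DataFrame): DataFrame = {
    // 1. Keep each id only on the first date it was seen.
    val firstVisits = df.groupBy("id").agg(min("date").as("date"))

    // 2. Count how many ids show up for the first time on each date.
    val newPerDay = firstVisits.groupBy("date").agg(count("id").as("new_ids"))

    // 3. Keep dates that only have repeat visitors, with 0 new ids.
    val allDays = df.select("date").distinct()
      .join(newPerDay, Seq("date"), "left")
      .na.fill(0L, Seq("new_ids"))

    // 4. Running sum over all dates up to and including the current one.
    //    A window without partitionBy moves all rows into a single partition,
    //    which is acceptable here because there is only one row per date.
    val w = Window.orderBy("date")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    allDays.select(col("date"), sum("new_ids").over(w).as("cum_ct")).orderBy("date")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cum-unique").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (3424, "2019-01-02"), (8683, "2019-01-01"), (7690, "2019-01-02"),
      (3424, "2019-01-07"), (9002, "2019-01-02")
    ).toDF("id", "date")

    // For the five sample rows: 2019-01-01 -> 1, 2019-01-02 -> 4, 2019-01-07 -> 4
    cumulativeUnique(df).show(false)
    spark.stop()
  }
}
```

Dates on which nobody visited at all do not appear in the result; if you need them, you would have to join against a calendar dataframe first.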

Related

Merge rows from one pair of columns into another

Here's a link to an example of what I want to achieve: https://community.powerbi.com/t5/Desktop/Append-Rows-using-Another-columns/m-p/401836. Basically, I need to merge all the rows of a pair of columns into another pair of columns. How can I do this in Spark Scala?
Input
Output
Correct me if I'm wrong, but I understand that you have a dataframe with two pairs of (date, cost) columns and you want the rows of the second pair to be appended under the first pair, right?
For instance with this input (only two rows for simplicity)
df.show
+----+----------+-----------+----------+---------+
|name|     date1|      cost1|     date2|    cost2|
+----+----------+-----------+----------+---------+
|   A|2013-03-25|19923245.06|          |         |
|   B|2015-06-04| 4104660.00|2017-10-16|392073.48|
+----+----------+-----------+----------+---------+
With just a couple of selects and a union you can achieve what you want:
df.select("name", "date1", "cost1")
  .union(df.select("name", "date2", "cost2"))
  .withColumnRenamed("date1", "date")
  .withColumnRenamed("cost1", "cost")
+----+----------+-----------+
|name|      date|       cost|
+----+----------+-----------+
|   A|2013-03-25|19923245.06|
|   B|2015-06-04| 4104660.00|
|   A|          |           |
|   B|2017-10-16|  392073.48|
+----+----------+-----------+

Apache spark aggregation: aggregate column based on another column value

I am not sure if I am asking this correctly, and maybe that is why I haven't found the right answer so far. If this turns out to be a duplicate, I will delete the question.
I have the following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column and get the max value of "last_updated". For the "count" column, I want to keep the value from the row where "last_updated" has its max value. So in this case the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look something like this:
df
  .groupBy("id")
  .agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using spark 2.4.0.
Thanks for any help
You have two options; the first is the better one, as far as I understand.
OPTION 1
Define a window over the id and add a column holding the max value over that window. Then keep only the rows where the desired column equals the max value, and finally drop the original column and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
  .where("max = last_updated")
  .drop("last_updated")
  .withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping:
df.groupBy("id")
  .agg(max("last_updated").as("last_updated"))
  .join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190101|    3|
|  1|    20190201|    2|
|  1|    20190301|    1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
  .where("max = last_updated")
  .drop("last_updated")
  .withColumnRenamed("max", "last_updated")
  .show
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
|  1|    1|    20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
  .agg(max("last_updated").as("last_updated"))
  .join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
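A third variant that is sometimes used for this kind of "value from the row with the max" problem is to aggregate a struct. This is not one of the two options above, just a sketch of an alternative: Spark compares structs field by field, so max over struct(last_updated, count) picks the row with the largest last_updated and carries its count along. The names LatestRowPerId, latestPerId and latest are invented for the example. Note that if two rows of the same id tie on last_updated, this picks the one with the larger count.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, struct}

object LatestRowPerId {
  // Hypothetical helper: expects columns "id", "last_updated" and "count".
  def latestPerId(df: DataFrame): DataFrame =
    df.groupBy("id")
      // The struct is ordered by its first field, then by its second.
      .agg(max(struct(col("last_updated"), col("count"))).as("latest"))
      // Unpack the winning struct back into plain columns.
      .select(col("id"), col("latest.last_updated"), col("latest.count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("latest-row").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 20190101, 3), (1, 20190201, 2), (1, 20190301, 1))
      .toDF("id", "last_updated", "count")

    // For the sample rows this keeps (1, 20190301, 1).
    latestPerId(df).show()
    spark.stop()
  }
}
```

Compared with the window and join options, this needs only a single groupBy and no second pass over the data.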

pyspark - Can I use substring of value as a key of groupBy() function?

I have a dataframe that looks like this:
datetime | ID |
======================
20180201000000 | 275 |
20171231113024 | 534 |
20180201220000 | 275 |
20170205000000 | 28 |
What I want to do is count by ID, per month.
The following approach worked perfectly.
Add a month column by extracting it from the datetime column:
new_df = df.withColumn('month', df.datetime.substr(0,6))
Count by ID and month:
count_df = new_df.groupBy('ID','month').count()
But is there a way to use a substring of a column's values directly as an argument of groupBy()? Something like:
`count_df = df.groupBy('ID', df.datetime.substr(0,6)).count()`
This code, at least, didn't work.
If there is a way to group by a substring of the values, I wouldn't need to add a new column, which would save a lot of resources (in the case of big data).
Even if this approach is wrong, do you have a better idea for getting the same result?
Try this:
>>> df.show()
+--------------+---+
|      datetime| id|
+--------------+---+
|20180201000000|275|
|20171231113024|534|
|20180201220000|275|
|20170205000000| 28|
+--------------+---+
>>> from pyspark.sql.functions import count
>>> df.groupBy('id',df.datetime.substr(0,6)).agg(count('id')).show()
+---+-----------------------+---------+
| id|substring(datetime,0,6)|count(id)|
+---+-----------------------+---------+
|275|                 201802|        2|
|534|                 201712|        1|
| 28|                 201702|        1|
+---+-----------------------+---------+

How to randomly select rows from one dataframe using information from another dataframe

I am attempting the following in Scala Spark.
I'm hoping someone can give me some guidance on how to tackle this problem, or point me to some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to a date. I would like to randomly select a certain number of entries for each dateCountDF.month from another dataframe, entitiesDF, where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate, and then place all the results into a new dataframe. See below for example data.
I'm not at all sure how to approach this problem from a Spark SQL or Spark MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on one dataframe and then refer to the other dataframe within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
val count:Int = x(1).toString().toInt
val result = entitiesDF.take(count)
return result
})
DataFrames
**dateCountDF**
| Date | Count |
+----------+----------------+
|2016-08-31| 4|
|2015-12-31| 1|
|2016-09-30| 5|
|2016-04-30| 5|
|2015-11-30| 3|
|2016-05-31| 7|
|2016-11-30| 2|
|2016-07-31| 5|
|2016-12-31| 9|
|2014-06-30| 4|
+----------+----------------+
only showing top 10 rows
**entitiesDF**
| ID | FirstDate | LastDate |
+----------+-----------------+----------+
| 296| 2014-09-01|2015-07-31|
| 125| 2015-10-01|2016-12-31|
| 124| 2014-08-01|2015-03-31|
| 447| 2017-02-01|2017-01-01|
| 307| 2015-01-01|2015-04-30|
| 574| 2016-01-01|2017-01-31|
| 613| 2016-04-01|2017-02-01|
| 169| 2009-08-23|2016-11-30|
| 205| 2017-02-01|2017-02-01|
| 433| 2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and for each row I want to select a random number of entities in entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate
To select random rows, you can do something like this in PySpark:
import random

def sampler(df, col, records):
    # Number of rows in the DataFrame
    colmax = df.count()
    # Draw `records` distinct values from 1..colmax (this assumes `col`
    # holds consecutive integer ids starting at 1)
    vals = random.sample(range(1, colmax + 1), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the random number of rows you want, store them in a dataframe, and then append that data to the other dataframe; for this you can use unionAll.
You can also refer to this answer.
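Since the question is about Scala, here is a rough Scala sketch of one way to pick the random rows without looping over dateCountDF: join the two dataframes on the date range, shuffle the candidates of each date with row_number() over a window ordered by rand(), and keep the first Count rows. It assumes the range condition is meant as entitiesDF.FirstDate < dateCountDF.Date <= entitiesDF.LastDate (the column names in the question's condition do not match the tables shown), that dates are ISO yyyy-MM-dd strings, and that Count is an integer column. SamplePerDate, samplePerDate and rn are invented names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

object SamplePerDate {
  // Hypothetical helper.
  // dateCountDF: columns "Date", "Count"; entitiesDF: columns "ID", "FirstDate", "LastDate".
  def samplePerDate(dateCountDF: DataFrame, entitiesDF: DataFrame): DataFrame = {
    // All entities whose [FirstDate, LastDate] range covers the date (assumed condition).
    val candidates = dateCountDF.join(
      entitiesDF,
      entitiesDF("FirstDate") < dateCountDF("Date") &&
        dateCountDF("Date") <= entitiesDF("LastDate"))

    // Number the candidates of each date in random order ...
    val w = Window.partitionBy("Date").orderBy(rand())

    // ... and keep at most Count of them per date.
    candidates
      .withColumn("rn", row_number().over(w))
      .where(col("rn") <= col("Count"))
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sample-per-date").getOrCreate()
    import spark.implicits._

    val dateCountDF = Seq(("2016-08-31", 2), ("2015-12-31", 1)).toDF("Date", "Count")
    val entitiesDF = Seq(
      (125, "2015-10-01", "2016-12-31"), (574, "2016-01-01", "2017-01-31"),
      (613, "2016-04-01", "2017-02-01"), (169, "2009-08-23", "2016-11-30")
    ).toDF("ID", "FirstDate", "LastDate")

    samplePerDate(dateCountDF, entitiesDF).show()
    spark.stop()
  }
}
```

The join is a range join rather than an equi-join, so it can get expensive when both dataframes are large; with a small dateCountDF it should be fine. If a date has fewer matching entities than Count, all of them are returned.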

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge two dataframes and remove duplicates by comparing columns?
I have two dataframes with the same column names:
a.show()
+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+
b.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-12|       3|
|alice2|2015-04-13|      10|
+------+----------+--------+
What I am trying to do is merge the two dataframes so that only unique rows are displayed, applying two conditions:
1. For the same name, the duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
The final output will be:
final.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|alice2|2015-04-13|      10|
+------+----------+--------+
I used the following method.
// Take the union of the two dataframes
val df = a.unionAll(b)
// Group and take the sum
val grouped = df.groupBy("name").agg($"name", sum("duration"))
// Join
val j = df.join(grouped, "name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|   bob|2015-01-12|       7|
|alice2|2015-04-23|      10|
+------+----------+--------+
How can I now remove duplicates by comparing dates?
Would it be possible by running SQL queries after registering it as a table?
I am a beginner in Spark SQL and I feel like my way of approaching this problem is odd. Is there a better way to do this kind of data processing?
You can compute max(date) in the same groupBy(). There is no need to join the grouped result back with df.
// In 1.3.x, the grouping column "name" must be listed explicitly for it to show up in the result.
val grouped = df.groupBy("name").agg($"name", sum("duration"), max("date"))
// In 1.4+, the grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))
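For completeness, here is a small end-to-end sketch of that idea on the two sample dataframes from the question, with aliases so the result keeps the original column names instead of sum(duration) and max(date). It assumes Spark 2.x or later, where union replaces the deprecated unionAll, and it assumes the date column holds ISO yyyy-MM-dd strings so that max picks the latest date. MergeDuplicates and mergeFrames are names made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{max, sum}

object MergeDuplicates {
  // Hypothetical helper: both inputs have columns "name", "date", "duration".
  def mergeFrames(a: DataFrame, b: DataFrame): DataFrame =
    a.union(b)
      .groupBy("name")
      .agg(
        max("date").as("date"),        // latest date per name
        sum("duration").as("duration") // total duration per name
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge-duplicates").getOrCreate()
    import spark.implicits._

    val a = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)).toDF("name", "date", "duration")
    val b = Seq(("bob", "2015-01-12", 3), ("alice2", "2015-04-13", 10)).toDF("name", "date", "duration")

    // bob -> (2015-01-13, 7), alice -> (2015-04-23, 10), alice2 -> (2015-04-13, 10)
    mergeFrames(a, b).show()
    spark.stop()
  }
}
```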