Need to fetch data by grouping on a key in Spark/Scala - scala

I have a problem in Spark (v2.2.2) / Scala (v2.11.8); I mostly work with Scala/Spark in a functional style.
I have a list of persons with a rented date like below.
These are CSV files which I will convert into Parquet and read as a DataFrame.
Table: Person
+-------------------+-----------+
| ID |report_date|
+-------------------+-----------+
| 123| 2011-09-25|
| 111| 2017-08-23|
| 222| 2018-09-30|
| 333| 2020-09-30|
| 444| 2019-09-30|
+-------------------+-----------+
I want to find the start_date of the address for the period in which the person rented it, by grouping on ID.
Table: Address
+-------------------+----------+----------+
| ID |start_date|close_date|
+-------------------+----------+----------+
| 123|2008-09-23|2009-09-23|
| 123|2009-09-24|2010-09-23|
| 123|2010-09-24|2011-09-23|
| 123|2011-09-30|2012-09-23|
| 123|2012-09-24| null|
| 111|2013-09-23|2014-09-23|
| 111|2014-09-24|2015-09-23|
| 111|2015-09-24|2016-09-23|
| 111|2016-09-24|2017-09-23|
| 111|2017-09-24| null|
| 222|2018-09-24| null|
+-------------------+----------+----------+
ex: For ID 123 the rented date is 2011-09-20, which in the Address table falls in the period (start_date, close_date) = (2010-09-24, 2011-09-23) (row 3 in Address). From here I have to fetch the start_date 2010-09-24.
I have to do this on the entire dataset by joining the tables, i.e. fetch start_date from the Address table into the Person table.
I also need to handle the case where close_date is null.
Sometimes the rented date will not fall in any of the periods; in that case we need to take the row where rented_date < close_date.
Thanks in Advance.

First of all:
"I have a list of persons with a rented date like below. These are CSV files which I will convert into Parquet and read as a DataFrame."
There is no need to convert; you can read the CSV directly with Spark:
spark.read.csv("path")
spark.read.format("csv").load("path")
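Depending on the file you may also want reader options; a minimal sketch (the header/inferSchema settings and the file path are just assumptions about your CSV):
val dfPerson = spark.read
  .option("header", "true")       // assuming the CSV has a header row
  .option("inferSchema", "true")  // let Spark guess the column types
  .csv("path/to/person.csv")      // hypothetical path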
I am not sure what your expectation for the null fields is, so I would filter them out for now:
val dfAddressNotNull = dfAddress.filter($"close_date".isNotNull)
Now you need to join them together, and since the data in Address is the relevant one I would do a left join:
val joinedDf = dfAddressNotNull.join(dfPerson, Seq("ID"), "left")
Now you have addresses and persons combined.
If you now filter like this:
joinedDf.filter($"report_date" >= $"start_date" && $"report_date" < $"close_date")
you should get something close to what you want to achieve.
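Putting the pieces together, a rough end-to-end sketch (untested; dfPerson and dfAddress are the two DataFrames read above, and treating a null close_date as a still-open period is my reading of the question, so here I join the unfiltered Address table):
import org.apache.spark.sql.functions._
import spark.implicits._

val joined = dfAddress.join(dfPerson, Seq("ID"), "left")

// A row matches when report_date falls inside the period; a null
// close_date is treated as an open-ended period.
val matched = joined.filter(
  $"report_date" >= $"start_date" &&
  ($"close_date".isNull || $"report_date" <= $"close_date")
)

matched.select("ID", "report_date", "start_date").show(false)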

Related

Merge rows from one pair of columns into another

Here's a link to an example of what I want to achieve: https://community.powerbi.com/t5/Desktop/Append-Rows-using-Another-columns/m-p/401836. Basically, I need to merge all the rows of a pair of columns into another pair of columns. How can I do this in Spark Scala?
Correct me if I'm wrong, but I understand that you have a dataframe with two pairs of date/cost columns, and you want the second pair's rows appended under the first pair, right?
For instance with this input (only two rows for simplicity)
df.show
+----+----------+-----------+----------+---------+
|name| date1| cost1| date2| cost2|
+----+----------+-----------+----------+---------+
| A|2013-03-25|19923245.06| | |
| B|2015-06-04| 4104660.00|2017-10-16|392073.48|
+----+----------+-----------+----------+---------+
With just a couple of selects and a union you can achieve what you want:
df.select("name", "date1", "cost1")
.union(df.select("name", "date2", "cost2"))
.withColumnRenamed("date1", "date")
.withColumnRenamed("cost1", "cost")
+----+----------+-----------+
|name| date| cost|
+----+----------+-----------+
| A|2013-03-25|19923245.06|
| B|2015-06-04| 4104660.00|
| A| | |
| B|2017-10-16| 392073.48|
+----+----------+-----------+
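If the blank rows coming from the empty second pair are not wanted, a small follow-up filter could drop them (a sketch; it assumes the missing values show up as nulls or empty strings):
import org.apache.spark.sql.functions.{col, trim}

val merged = df.select("name", "date1", "cost1")
  .union(df.select("name", "date2", "cost2"))
  .toDF("name", "date", "cost")   // rename all three columns in one go

// Keep only rows that actually carry a date.
merged
  .filter(col("date").isNotNull && trim(col("date")) =!= "")
  .show()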

"Enrich" Spark DataFrame from another DF (or from HBase)

I am not sure this is the right title so feel free to suggest an edit. Btw, I'm really new to Scala and Spark.
Basically, I have a DF df_1 looking like this:
| ID | name | city_id |
| 0 | "abc"| 123 |
| 1 | "cba"| 124 |
...
The city_id is a key into a huge HBase table:
123; New York; ....
124; Los Angeles; ....
etc.
The result should be df_1:
| ID | name | city_name |
| 0 | "abc"| New York|
| 1 | "cba"| Los Angeles|
...
My approach was to create an external Hive table on top of HBase with the columns I need. But then again I do not know how to join them in the most efficient manner.
I suppose there is a way to do it directly from HBase, but again I do not know how.
Any hint is appreciated. :)
There is no need to create an intermediate Hive table over HBase. Spark SQL can deal with all kinds of unstructured data directly; just load the HBase data into a DataFrame with the HBase data source.
Once you have the proper HBase DataFrame, use the following sample Spark/Scala code to get the joined DataFrame:
import spark.implicits._

val df = Seq((0, "abc", 123), (1, "cda", 124), (2, "dsd", 125), (3, "gft", 126), (4, "dty", 127)).toDF("ID", "name", "city_id")
val hbaseDF = Seq((123, "New York"), (124, "Los Angeles"), (125, "Chicago"), (126, "Seattle")).toDF("city_id", "city_name")

df.join(hbaseDF, Seq("city_id"), "inner").drop("city_id").show()
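Since the question mentions efficiency: if the (city_id, city_name) lookup you pull out of HBase is small enough to fit in memory, a broadcast join is one way to avoid shuffling the large DataFrame. A sketch on the toy DataFrames above:
import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship the small lookup to every executor instead of shuffling df.
df.join(broadcast(hbaseDF), Seq("city_id"), "left")
  .drop("city_id")
  .show()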

Concatenate Dataframe rows based on timestamp value

I have a Dataframe with text messages and a timestamp value for each row.
Like so:
+--------------------------+---------------------+
| message | timestamp |
+--------------------------+---------------------+
| some text from message 1 | 2019-08-03 01:00:00 |
+--------------------------+---------------------+
| some text from message 2 | 2019-08-03 01:01:00 |
+--------------------------+---------------------+
| some text from message 3 | 2019-08-03 01:03:00 |
+--------------------------+---------------------+
I need to concatenate the messages by creating time windows of X number of minutes so that for example they look like this:
+---------------------------------------------------+
| message |
+---------------------------------------------------+
| some text from message 1 some text from message 2 |
+---------------------------------------------------+
| some text from message 3 |
+---------------------------------------------------+
After doing the concatenation I have no use for the timestamp column, so I can drop it or keep it with any value.
I have been able to do this by iterating through the entire Dataframe, adding timestamp diffs and inserting into a new Dataframe when the time window is reached. It works, but it's ugly, and I am looking for pointers on how to accomplish this in Scala in a more functional/elegant way.
I looked at Window functions, but since I am not doing aggregations it appears that I do not have a way to access the content of the groups once the WindowSpec is created, so I didn't get very far.
I also looked at the lead and lag functions, but I couldn't figure out how to use them without also going into a for loop.
I appreciate any ideas or pointers you can provide.
You can use the window datetime function (not to be confused with Window functions) to generate time windows, followed by a groupBy to aggregate messages using concat_ws:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  ("message1", "2019-08-03 01:00:00"),
  ("message2", "2019-08-03 01:01:00"),
  ("message3", "2019-08-03 01:03:00")
).toDF("message", "timestamp")

val duration = "2 minutes"

df.
  groupBy(window($"timestamp", duration)).
  agg(concat_ws(" ", collect_list($"message")).as("message")).
  show(false)
// +------------------------------------------+-----------------+
// |window |message |
// +------------------------------------------+-----------------+
// |[2019-08-03 01:00:00, 2019-08-03 01:02:00]|message1 message2|
// |[2019-08-03 01:02:00, 2019-08-03 01:04:00]|message3 |
// +------------------------------------------+-----------------+
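One caveat: collect_list gives no ordering guarantee inside a group. If the message order within each window matters, one possible variant (assuming Spark 2.4+ for the transform higher-order function; same imports as above) collects (timestamp, message) structs and sorts them before concatenating:
df.
  groupBy(window($"timestamp", duration)).
  agg(sort_array(collect_list(struct($"timestamp", $"message"))).as("msgs")).
  select(
    $"window",
    // Structs sort by their first field, so this orders the messages by timestamp.
    concat_ws(" ", expr("transform(msgs, x -> x.message)")).as("message")
  ).
  show(false)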

Find all possible combinations of a column in a dataframe when fixing a value in another column

I need to create a graph network of authors and movies. Authors that participated in at least one movie together should be connected. I have already created my vertex dataframe containing the authors' information, but I am having trouble creating an edges dataframe that shows this connection. I have the following dataframe:
author_ID | movie_ID
nm0000198 | tt0091954
nm0000198 | tt0468569
nm0000198 | tt4555426
nm0000354 | tt0134119
nm0000354 | tt0091954
nm0000721 | tt0091954
I would like to somehow fix the movie and create all possible combinations of authors that participated in that movie. Like:
movie_ID | author_A | author_B
tt0091954| nm0000198 | nm0000354
tt0091954| nm0000198 | nm0000721
tt0091954| nm0000354 | nm0000721
Please help if you can. Thanks in advance!
You can achieve this with a self-join:
dfA = df.withColumnRenamed('author_ID', 'author_A')
dfB = df.withColumnRenamed('author_ID', 'author_B')

dfA \
    .join(dfB, on=(dfA.movie_ID == dfB.movie_ID) & (dfA.author_A < dfB.author_B)) \
    .drop(dfB.movie_ID) \
    .show()
+---------+---------+---------+
| author_A| author_B| movie_ID|
+---------+---------+---------+
|nm0000198|nm0000354|tt0091954|
|nm0000198|nm0000721|tt0091954|
|nm0000354|nm0000721|tt0091954|
+---------+---------+---------+
The < clause is there to make sure we get each (author_A, author_B) pair only once.
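For reference, since the surrounding questions are in Scala, a roughly equivalent self-join in Scala (untested, same column names as in the question) might look like this:
import org.apache.spark.sql.functions.col

val dfA = df.withColumnRenamed("author_ID", "author_A")
val dfB = df.withColumnRenamed("author_ID", "author_B")

// Join on the movie and keep each unordered author pair only once.
dfA
  .join(dfB, dfA("movie_ID") === dfB("movie_ID") && col("author_A") < col("author_B"))
  .drop(dfB("movie_ID"))
  .show()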
This should work for you too; it is just another way to write the self-join.
from pyspark.sql.functions import col

joining_condition = [col("a.movie_ID") == col("b.movie_ID"), col("a.author_ID") > col("b.author_ID")]

df.alias("a") \
    .join(df.alias("b"), joining_condition) \
    .selectExpr("a.movie_ID AS movie_Id",
                "a.author_ID AS author_A",
                "b.author_ID AS author_B") \
    .show()
#+---------+---------+---------+
#| movie_Id| author_A| author_B|
#+---------+---------+---------+
#|tt0091954|nm0000354|nm0000198|
#|tt0091954|nm0000721|nm0000198|
#|tt0091954|nm0000721|nm0000354|
#+---------+---------+---------+

Spark Scala Cumulative Unique Count by Date

I have a dataframe that gives a set of ID numbers and the date on which they visited a certain location, and I'm trying to find a way in Spark Scala to get the number of unique people ("id") that have visited this location on or before each day, so that one ID number won't be counted twice if they visit on 2019-01-01 and then again on 2019-01-07, for example.
df.show(5, false)
+----+----------+
|id  |date      |
+----+----------+
|3424|2019-01-02|
|8683|2019-01-01|
|7690|2019-01-02|
|3424|2019-01-07|
|9002|2019-01-02|
+----+----------+
I want the output to look like this: I groupBy("date") and get the count of unique ids as a cumulative number. (So, for example, next to 2019-01-03 it would give the distinct count of ids on any day up to and including 2019-01-03.)
+----------+-------+
|date      |cum_ct |
+----------+-------+
|2019-01-01|xxxxx  |
|2019-01-02|xxxxx  |
|2019-01-03|xxxxx  |
|...       |...    |
|2019-01-08|xxxxx  |
|2019-01-09|xxxxx  |
+----------+-------+
What would be the best way to do this after df.groupBy("date")?
You will have to use the row_number() window function in this scenario. I have created a dataframe:
val df = Seq((1,"2019-05-03"),(1,"2018-05-03"),(2,"2019-05-03"),(2,"2018-05-03"),(3,"2019-05-03"),(3,"2018-05-03")).toDF("id","date")
df.show
+---+----------+
| id| date|
+---+----------+
| 1|2019-05-03|
| 1|2018-05-03|
| 2|2019-05-03|
| 2|2018-05-03|
| 3|2019-05-03|
| 3|2018-05-03|
+---+----------+
ID represents a person id in your case that can appear against multiple dates.
Here is the count against each date.
df.groupBy("date").count.show
+----------+-----+
| date|count|
+----------+-----+
|2018-05-03| 3|
|2019-05-03| 3|
+----------+-----+
This shows the repeated count of ids against each date: I have used 3 ids in total, and each date has a count of 3, which means every id is counted for every date it appears on.
Now, to my understanding, you want an ID to be counted only once against any date (depending on whether you want the latest or the oldest date).
I am going to use the latest date for every ID.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val newdf = df.withColumn("row_num", row_number().over(Window.partitionBy($"id").orderBy($"date".desc)))
The line above assigns a row number to every date within each ID; row number 1 refers to the latest date for that ID. Now you take the count where the row number is 1, which gives a single (distinct) count for every ID.
Here is the output. I have applied a filter on the row number, and you can see that the dates are the latest ones, i.e. 2019 in my case.
newdf.select("id","date","row_num").where("row_num = 1").show()
+---+----------+-------+
| id| date|row_num|
+---+----------+-------+
| 1|2019-05-03| 1|
| 3|2019-05-03| 1|
| 2|2019-05-03| 1|
+---+----------+-------+
Now I will take the count on newdf with the same filter, which will return a date-wise count.
newdf.groupBy("date","row_num").count().filter("row_num = 1").select("date","count").show
+----------+-----+
| date|count|
+----------+-----+
|2019-05-03| 3|
+----------+-----+
Here the total count is 3, which excludes ids from previous dates; previously it was 6 (because of the repetition of ids across multiple dates).
I hope this answers your question.
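If the goal is specifically a running count of distinct ids up to and including each date (the cum_ct column in the question), an alternative untested sketch keeps only each id's first visit and then takes a cumulative sum over the dates:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, min, sum}

// Keep only the first date each id was seen, so later visits are not recounted.
val firstVisits = df.groupBy("id").agg(min("date").as("date"))

// Count the new ids per date, then accumulate over dates in ascending order.
// Note: an un-partitioned window pulls all rows to one partition, which is fine
// for a per-date summary; dates with no first visits won't appear unless you
// join against a full date range.
val cumCt = firstVisits
  .groupBy("date").agg(count("id").as("new_ids"))
  .withColumn("cum_ct", sum("new_ids").over(Window.orderBy(col("date"))))
  .drop("new_ids")

cumCt.orderBy("date").show()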