"Enrich" Spark DataFrame from another DF (or from HBase) - scala

I am not sure this is the right title so feel free to suggest an edit. Btw, I'm really new to Scala and Spark.
Basically, I have a DF df_1 looking like this:
| ID | name | city_id |
| 0 | "abc"| 123 |
| 1 | "cba"| 124 |
...
The city_id is a row key in a huge HBase table:
123; New York; ....
124; Los Angeles; ....
etc.
The result should be df_1:
| ID | name | city_id |
| 0 | "abc"| New York|
| 1 | "cba"| Los Angeles|
...
My approach was to create an external Hive table on top of HBase with the columns I need, but then again I do not know how to join them in the most efficient manner.
I suppose there is a way to do it directly from HBase, but again I do not know how.
Any hint is appreciated. :)

There is no need to create an intermediate Hive table over HBase. Spark SQL can deal with all kinds of data sources directly; just load the HBase data into a DataFrame with an HBase data source.
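For example, with the Hortonworks Spark HBase Connector (shc) you can define a DataFrame directly over the HBase table via a catalog. A minimal sketch, assuming a table named cities with the city name stored in column family d, qualifier name (the table name, column family, qualifier and types are assumptions, adjust them to your actual schema):
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Catalog describing the HBase table layout (names and types here are assumptions).
val catalog = s"""{
  "table":{"namespace":"default", "name":"cities"},
  "rowkey":"key",
  "columns":{
    "city_id":{"cf":"rowkey", "col":"key", "type":"int"},
    "city_name":{"cf":"d", "col":"name", "type":"string"}
  }
}"""

// Load the HBase table as a DataFrame with the columns defined in the catalog.
val hbaseDF = spark.read.
  options(Map(HBaseTableCatalog.tableCatalog -> catalog)).
  format("org.apache.spark.sql.execution.datasources.hbase").
  load()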
Once you have the proper HBase DataFrame, use the following sample Spark/Scala code to get the joined DataFrame:
val df=Seq((0,"abc",123),(1,"cda",124),(2,"dsd",125),(3,"gft",126),(4,"dty",127)).toDF("ID","name","city_id")
val hbaseDF=Seq((123,"New York"),(124,"Los Angeles"),(125,"Chicago"),(126,"Seattle")).toDF("city_id","city_name")
df.join(hbaseDF,Seq("city_id"),"inner").drop("city_id").show()
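On the efficiency question: when one side of the join is much smaller than the other (here df_1 is presumably far smaller than the huge HBase-backed DataFrame), a broadcast hint lets Spark ship the small side to every executor instead of shuffling the large one. A sketch of that variant:
import org.apache.spark.sql.functions.broadcast

// Broadcast the small DataFrame; the large HBase-backed side is then joined locally on each executor.
hbaseDF.join(broadcast(df), Seq("city_id"), "inner").drop("city_id").show()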

Related

Need some data after by grouping on key in spark/scala

I have a problem in Spark (v2.2.2) / Scala (v2.11.8); I mostly work in the Scala/Spark functional style.
I have a list of persons with a rented_date like below.
These are CSV files which I will convert into Parquet and read as a DataFrame.
Table: Person
+-------------------+-----------+
| ID |report_date|
+-------------------+-----------+
| 123| 2011-09-25|
| 111| 2017-08-23|
| 222| 2018-09-30|
| 333| 2020-09-30|
| 444| 2019-09-30|
+-------------------+-----------+
I want to find out the start_date of the address for the period the person rented it, by grouping on ID.
Table: Address
+-------------------+----------+----------+
| ID |start_date|close_date|
+-------------------+----------+----------+
| 123|2008-09-23|2009-09-23|
| 123|2009-09-24|2010-09-23|
| 123|2010-09-24|2011-09-23|
| 123|2011-09-30|2012-09-23|
| 123|2012-09-24| null|
| 111|2013-09-23|2014-09-23|
| 111|2014-09-24|2015-09-23|
| 111|2015-09-24|2016-09-23|
| 111|2016-09-24|2017-09-23|
| 111|2017-09-24| null|
| 222|2018-09-24| null|
+-------------------+----------+----------+
Example: for ID 123 the rented_date is 2011-09-20, which in the Address table falls in the period (start_date, close_date) = (2010-09-24, 2011-09-23) (row 3 for 123). From here I have to fetch the start_date 2010-09-24.
I have to do this on the entire dataset by joining the tables, or fetch start_date from the Address table into the Person table.
I also need to handle rows where close_date is null.
Sometimes the rented date will not fall into any of the periods; in that case we need to take the row where rented_date < close_date.
Thanks in Advance.
First of all
I have a list of persons with a rented_date like below. These are CSV files which I will convert into Parquet and read as a DataFrame.
There is no need to convert it; you can read the CSV directly with Spark:
spark.read.csv("path")
spark.read.format("csv").load("path")
I am not sure what your expectation for the null fields is, so I would filter them out for now (assuming the Address DataFrame is called dfAddress):
val dfAddressNotNull = dfAddress.filter($"close_date".isNotNull)
Of course now you need to join them together, and since the data in Address is the relevant one, I would do a left join.
val joinedDf = dfAddressNotNull.join(dfPerson, Seq("ID"), "left")
Now you have Addresses and Persons combined.
If you now filter like this:
joinedDf.filter($"report_date" >= $"start_date" && $"report_date" < $"close_date")
you should get something close to what you want to achieve.
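Putting the steps together, a minimal end-to-end sketch (the file names person.csv / address.csv and the header option are assumptions; instead of filtering out a null close_date, it is treated here as a still-open period by coalescing it to a far-future date):
import org.apache.spark.sql.functions._
import spark.implicits._

val dfPerson  = spark.read.option("header", "true").csv("person.csv")
val dfAddress = spark.read.option("header", "true").csv("address.csv")

// Keep rows with a null close_date by treating them as open-ended periods.
val dfAddressFilled = dfAddress.withColumn(
  "close_date_filled", coalesce($"close_date", lit("9999-12-31")))

val joinedDf = dfAddressFilled.join(dfPerson, Seq("ID"), "left")

// ISO-formatted date strings compare correctly as plain strings.
joinedDf.
  filter($"report_date" >= $"start_date" && $"report_date" < $"close_date_filled").
  select($"ID", $"report_date", $"start_date").
  show()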

Concatenate Dataframe rows based on timestamp value

I have a Dataframe with text messages and a timestamp value for each row.
Like so:
+--------------------------+---------------------+
| message | timestamp |
+--------------------------+---------------------+
| some text from message 1 | 2019-08-03 01:00:00 |
+--------------------------+---------------------+
| some text from message 2 | 2019-08-03 01:01:00 |
+--------------------------+---------------------+
| some text from message 3 | 2019-08-03 01:03:00 |
+--------------------------+---------------------+
I need to concatenate the messages by creating time windows of X number of minutes so that for example they look like this:
+---------------------------------------------------+
| message |
+---------------------------------------------------+
| some text from message 1 some text from message 2 |
+---------------------------------------------------+
| some text from message 3 |
+---------------------------------------------------+
After doing the concatenation I have no use for the timestamp column so I can drop it or keep it with any value.
I have been able to do this by iterating through the entire Dataframe, adding timestamp diffs and inserting into a new Dataframe when the time window is achieved. It works but it's ugly and I am looking for some pointers into how to accomplish this in Scala in a more functional/elegant way.
I looked at Window functions, but since I am not doing aggregations it appears I have no way to access the content of the groups once the WindowSpec is created, so I didn't get very far.
I also looked at the lead and lag functions but I couldn't figure out how to use them without also having to go into a for loop.
I appreciate any ideas or pointers you can provide.
Any thoughts or pointers into how to accomplish this?
You can use the window datetime function (not to be confused with Window functions) to generate time windows, followed by a groupBy to aggregate messages using concat_ws:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  ("message1", "2019-08-03 01:00:00"),
  ("message2", "2019-08-03 01:01:00"),
  ("message3", "2019-08-03 01:03:00")
).toDF("message", "timestamp")

val duration = "2 minutes"

df.
  groupBy(window($"timestamp", duration)).
  agg(concat_ws(" ", collect_list($"message")).as("message")).
  show(false)
// +------------------------------------------+-----------------+
// |window |message |
// +------------------------------------------+-----------------+
// |[2019-08-03 01:00:00, 2019-08-03 01:02:00]|message1 message2|
// |[2019-08-03 01:02:00, 2019-08-03 01:04:00]|message3 |
// +------------------------------------------+-----------------+
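One caveat: collect_list does not guarantee the order of the collected values. If the concatenated text must follow the timestamp order within each window, a hedged variant (reusing df and duration from above) is to collect (timestamp, message) structs, sort the array, and then join the message field:
df.
  groupBy(window($"timestamp", duration)).
  agg(sort_array(collect_list(struct($"timestamp", $"message"))).as("msgs")).
  select(concat_ws(" ", $"msgs.message").as("message")).
  show(false)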

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts, where x is the number of Ids I want, determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and completely wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second I loop through the list using the iterator and trim the dataframe to only the current query in the list using df.select($"eachColumnName"...).where("query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe and add each of these one by one to the empty dataframe and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using Window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
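Dropping the count column afterwards, as mentioned in the question, is then just:
val result = newDF.drop("count")
Note that, as with any collect_list, the Ids inside each array are not guaranteed to come out ordered by descending count.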

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':0}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda / udf to merge the data column, but I am not sure how to think about that during a join.
Alternatively, I could break the properties from the JSON out into columns, something like this; maybe that is a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more Spark-esque way to do it, but incorporating the accepted answer, this appears to work for the question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i+1)]['df']

    # joins here...

    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

In spark and scala, how to convert or map a dataframe to specific columns info?

Scala.
Spark.
IntelliJ IDEA.
I have a DataFrame (multiple rows, multiple columns) from a CSV file, and I want to map it to another specific set of column info.
I was thinking of a Scala class (not a case class, because the column count is > 22) or map()..., but I don't know how to convert them.
Example
a dataframe from CSV file.
----------------------
| No | price| name |
----------------------
| 1 | 100 | "A" |
----------------------
| 2 | 200 | "B" |
----------------------
The other specific column info:
=> {product_id, product_name, seller}
First, product_id maps to 'No'.
Second, product_name maps to 'name'.
Third, seller is null or "" (an empty string).
So, finally, I want a DataFrame with this new column info:
-----------------------------------------
| product_id | product_name | seller |
-----------------------------------------
| 1 | "A" | |
-----------------------------------------
| 2 | "B" | |
-----------------------------------------
If you already have a DataFrame (e.g. old_df):
val new_df=old_df.withColumnRenamed("No","product_id").
withColumnRenamed("name","product_name").
drop("price").
withColumn("seller", ... )
Let's say your CSV file is "products.csv". First you have to load it in Spark; you can do that using:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
Once the data is loaded you will have all the column names in the DataFrame df. As you mentioned, your column names will be "No", "price", "name".
To change the name of a column you just have to use the withColumnRenamed API of the DataFrame:
val renamedDf = df.withColumnRenamed("No", "product_id").
  withColumnRenamed("name", "product_name")
Your renamedDf will have the column names you have assigned.