How to create a Scala trait which stores data from other columns in a dataset, and then create a new dataset with a column storing the trait?

I am new to Scala and am currently studying datasets for Scala and Spark. Based on the input dataset below, I am trying to create the new dataset shown further down. In the new dataset, I aim to have a column that contains a Scala trait, Seq[order_summary], where the trait stores the corresponding Name, Ticket Number, and Seat Number taken from the input dataset.
I have implemented input_dataset.groupBy("Name") to organise the dataset and have tried df.withColumn("NewColumn", struct(df("a"), df("b"))) to combine different columns. However, I would like to use a Scala trait instead, and I am also stuck on matching the name to the ticket number. Would anyone know how to resolve this or point me in the right direction?
Input dataset: input_dataset
Name is of type String; Ticket Number is of type Int.
+----+---------------+-------------+
|Name| Ticket Number | Seat Number |
+----+---------------+-------------+
|Adam| 123 | AB |
|Adam| 456 | AC |
|Adam| 789 | AD |
|Bob | 1234 | BA |
|Bob | 5678 | BB |
|Sam | 987 | CA |
|Sam | 654 | CB |
|Sam | 321 | CC |
|Sam | 876 | CD |
+----+---------------+-------------+
Output dataset
Name is of type String; Purchase Order Summary is a Seq[order_summary], where order_summary is a trait.
+----+-----------------------------------------------------+
|Name| Purchase Order Summary |
+----+-----------------------------------------------------+
|Adam|((Adam,123,AB),(Adam,456,AC),(Adam,789,AD)) |
|Bob |((Bob,1234,BA),(Bob,5678,BB)) |
|Sam |((Sam,987,CA),(Sam,654,CB),(Sam,321,CC),(Sam,876,CD))|
+----+-----------------------------------------------------+

Pretty sure Spark has a map method.
So you could just create a case class
case class PurchaseOrderSummary(name: String, ticketNum: Int, seatNum: String)
and instantiate it inside a map from your DF, then collect it into a list.
df.map(row => PurchaseOrderSummary(row.getString(0), row.getInt(1), row.getString(2))).collectAsList
collectAsList retrieves the data to the driver as a java.util.List[PurchaseOrderSummary]; use collect (or collect().toList) if you want a Scala collection instead. Note that df.map needs an encoder for the case class, which import spark.implicits._ provides.
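For the grouped output in the question itself, here is a minimal sketch, assuming the input is a DataFrame called input_dataset with the column names shown above and that spark.implicits._ is in scope. Spark cannot derive an encoder for a bare trait, so a concrete case class stands in for order_summary:
import org.apache.spark.sql.functions._
import spark.implicits._

case class OrderSummary(name: String, ticketNum: Int, seatNum: String)

// Untyped version: collect one struct per row into an array per name
val summarised = input_dataset
  .groupBy($"Name")
  .agg(collect_list(struct(
    $"Name".as("name"),
    $"Ticket Number".as("ticketNum"),
    $"Seat Number".as("seatNum")
  )).as("Purchase Order Summary"))

// Typed alternative: group a Dataset[OrderSummary] by name
val typed = input_dataset
  .select($"Name".as("name"), $"Ticket Number".as("ticketNum"), $"Seat Number".as("seatNum"))
  .as[OrderSummary]
  .groupByKey(_.name)
  .mapGroups((name, rows) => (name, rows.toSeq))
The first form keeps everything as DataFrame columns; the second gives you a typed Dataset whose second field is a Seq of the case class.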

Related

How to decode HTML entities in Spark-scala?

I have some Spark code that reads data from a database.
One of the columns (of type string), named "title", contains the following data.
+-------------------------------------------------+
|title |
+-------------------------------------------------+
|Example sentence |
|Read the &lsquo;Book&rsquo; |
|&lsquo;LOTR&rsquo; Is A Great Book |
+-------------------------------------------------+
I'd like to decode the HTML entities so that it looks as given below.
+-------------------------------------------+
|title |
+-------------------------------------------+
|Example sentence |
|Read the ‘Book’ |
|‘LOTR’ Is A Great Book |
+-------------------------------------------+
There is a library, "html-entities", for Node.js that does exactly what I am looking for,
but I am unable to find something similar for Spark/Scala.
What would be a good approach to do this?
You can use org.apache.commons.lang.StringEscapeUtils with the help of a UDF to achieve this.
import org.apache.commons.lang.StringEscapeUtils
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"column" syntax

// Decode HTML entities (e.g. &lsquo; / &rsquo;) into their characters
val decodeHtml = (html: String) => StringEscapeUtils.unescapeHtml(html)
val decodeHtmlUDF = udf(decodeHtml)

df.withColumn("title", decodeHtmlUDF($"title")).show()
/*
+--------------------+
| title|
+--------------------+
| Example sentence |
| Read the ‘Book’ |
|‘LOTR’ Is A Great...|
+--------------------+
*/
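If commons-lang3 is on the classpath instead of the older commons-lang 2.x, the equivalent method is unescapeHtml4; a minimal sketch (the null check is purely defensive):
import org.apache.commons.lang3.StringEscapeUtils
import org.apache.spark.sql.functions.{col, udf}

// unescapeHtml4 handles HTML 4 entities such as &lsquo; and &rsquo;
val decodeHtml4UDF = udf((html: String) => Option(html).map(s => StringEscapeUtils.unescapeHtml4(s)).orNull)

df.withColumn("title", decodeHtml4UDF(col("title"))).show()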

Check if a set of field values is mapped against a single value of another field in a dataframe

Consider the dataframe below with stores and the books available in each:
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S2 | B11 | 11$ |
| S1 | B15 | 29$ | <<
| S2 | B10 | 25$ |
| S2 | B16 | 30$ |
| S1 | B09 | 21$ | <
| S3 | B15 | 22$ |
+-----------+------+-------+
Suppose we need to find the stores which have two books namely, B11 and B15. Here, the answer is S1 as it stores both books.
One way of doing it is to find the intersection of the stores having book B11 with the stores having book B15, using the command below:
val df_select = df.filter($"book" === "B11").select("storename")
.join(df.filter($"book" === "B15").select("storename"), Seq("storename"), "inner")
which contains the names of the stores having both.
But instead I want a table
+-----------+------+-------+
| storename | book | price |
+-----------+------+-------+
| S1 | B11 | 10$ | <<
| S1 | B15 | 29$ | <<
| S1 | B09 | 21$ | <
+-----------+------+-------+
which contains all records related to the fulfilling store. Note that B09 is not left out. (Use case: the user can explore other books in the same store as well.)
We can do this by joining the above result back with the original dataframe:
df_select.join(df, Seq("storename"), "inner")
But I see a scalability and readability issue with step 1, as I have to keep joining one dataframe to another if the number of books is more than 2. That is painful and error-prone. Is there a more elegant way to do the same? Something like:
val storewise = Window.partitionBy("storename")
df.filter($"book".contains{"B11", "B15"}.over(storewise))
Found a simple solution using the array_except function:
Add the required set of field values as an array in a new column, req_books.
Add a column, all_books, storing all the books available in a store, using a Window.
Using the above two columns, find whether a store is missing any required book, and filter out the stores that are.
Drop the extra columns created.
Code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'symbol column syntax

val df1 = df.withColumn("req_books", array(lit("B11"), lit("B15")))
  .withColumn("all_books", collect_set('book).over(Window.partitionBy('storename)))
df1.withColumn("missing_books", array_except('req_books, 'all_books)) // array_except requires Spark 2.4+
  .filter(size('missing_books) === 0)
  .drop("missing_books", "all_books", "req_books")
  .show
Using Window functions to create an array of all values, then checking whether it contains all the necessary values.
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val bookList = List("B11", "B15") // list of books to search for
def arrayContainsMultiple(bookList: Seq[String]) =
  udf((allBooks: WrappedArray[String]) => allBooks.intersect(bookList).sorted.equals(bookList.sorted))
val filteredDF = df
  .withColumn("allBooks", collect_set($"book").over(Window.partitionBy($"storename")))
  .filter(arrayContainsMultiple(bookList)($"allBooks"))
  .drop($"allBooks")

Is there a better way to go about this process of trimming my Spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts for each query, where x is determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if the variable howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany), then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I take an empty dataframe, add each of these results to it one by one, and return the newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using Window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val howMany = 2

val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
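Dropping the count column afterwards, as noted in the question, is then just:
val finalDF = newDF.drop("count")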

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':0}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda / UDF to merge the data column, but I'm not sure how to think about that during a join.
Alternatively, I could break the properties from the JSON out into columns, something like this; might that be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more Spark-esque way to do it, but incorporating the accepted answer, this appears to work for the question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i+1)]['df']
    # joins here...
    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

except operation on two dataframes having a map column

I have two dataframes, dfA and dfB. I want to remove all occurrences of dfB's rows from dfA. The problem, however, is that they have a column of datatype map, and the except operation doesn't work with that.
+---+------------+----------------------+
|id | fee_amount | optional             |
+---+------------+----------------------+
|1  | 10.00      | {1 -> abc, 2 -> def} |
|2  | 20.0       | {3 -> pqr, 5 -> stu} |
+---+------------+----------------------+
I was thinking I could somehow drop the column and add it back afterwards, but that won't work because I wouldn't know which rows got removed from dfA. Options?
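One possible workaround, offered only as a sketch (it is not from the original thread) and assuming Spark 2.4+ for map_entries, array_sort and map_from_entries: replace the map column with a sorted array of key/value structs, which except can compare, run the set difference, and then rebuild the map.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Map columns are not comparable, but arrays of structs are, so convert
// the map to a canonical (sorted) entry array before the except.
def withComparableOptional(df: DataFrame): DataFrame =
  df.withColumn("optional_cmp", array_sort(map_entries(col("optional"))))
    .drop("optional")

val diff = withComparableOptional(dfA)
  .except(withComparableOptional(dfB))
  .withColumn("optional", map_from_entries(col("optional_cmp"))) // restore the map
  .drop("optional_cmp")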