Is there a better way to go about this process of trimming my spark DataFrame appropriately? - scala

In the following example, I want to take only the x Ids with the highest counts for each query, where x is determined by a variable called howMany.
Given this DataFrame:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following DataFrame if howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method first gets a distinct list of the query column and creates an iterator. Second, I loop through the list using the iterator and trim the DataFrame to only the current query using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then sort by count, .limit(howMany), and groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I start with an empty DataFrame, union each of these onto it one by one, and return this newly created DataFrame.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF

I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
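Dropping the count column afterwards (which the question calls trivial) is then just a drop on the result. A minimal sketch; the name trimmedDF is only illustrative:
// Drop the helper count column once the top-x Ids per query are collected
val trimmedDF = newDF.drop("count")
trimmedDF.show()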

Related

How to merge two or more columns into one?

I have a streaming DataFrame over which I want to calculate min and avg for some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The DataFrame looks like this:
+-----+-----+
|    1|    2|
+-----+-----+
|   24|   55|
|   20|   51|
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical operations to be:
+-----------+-----------+
| result(1)| result(2)|
+-----------+-----------+
|20 ,22 | 51,53 |
+-----------+-----------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct can be seen when you want to reference one field in the struct and you can use the name (not index).
q.select("structCol.name")
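As a concrete sketch building on the aggregation from the question (the min/avg field names and the underscore in the result alias are my own choices; the underscore keeps the column name free of parentheses so it can be referenced without backticks):
// Name the fields inside the struct so they can later be referenced by name
val res = List("1", "2").map(name =>
  struct(min(col(name)).as("min"), avg(col(name)).as("avg")).as(s"result_$name"))
val q = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)
// Pull out a single field of the struct by name rather than by index
q.select("result_1.min")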
What you want to do is to merge the values of multiple columns together into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you :
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+

Generating all possible combinations from a Data Frame in Apache Spark

I'm trying to do something quite simple where I have 2 arrays that have been converted into a Data Frame, and I want to show all possible combinations. So for example my output at the moment looks something like this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| Second | P |
+-----------+-----------+
However what I'm actually looking for is this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| First | P |
| Second | T |
| Second | P |
+-----------+-----------+
So far I've got some fairly straightforward code to map my arrays into columns, but being quite new to both Scala and Spark, I'm not sure how I'd grab all those combinations. Here is what I have so far:
val firstColumnValues = Array("First", "Second")
val secondColumnValues = Array("T", "P")
val xs = Array(firstColumnValues, secondColumnValues).transpose
val mapped = sparkContext.parallelize(xs).map(ys => Row(ys(0), ys(1)))
val df = mapped.toDF("A", "B")
df.show
...
case class Row(first: String, second: String)
Thanks in advance for any help
In Spark 2.3
val firstColumnValues = sc.parallelize(Array("First", "Second")).toDF("A")
val secondColumnValues = sc.parallelize(Array("T", "P")).toDF("B")
firstColumnValues.crossJoin(secondColumnValues).show
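If you would rather build the combinations before creating any DataFrame, a plain for-comprehension over the two source arrays from the question gives the same four rows. A sketch, assuming spark.implicits._ is in scope for toDF:
import spark.implicits._
// Cartesian product of the two source arrays, then convert to a DataFrame
val combinations = for {
  a <- Array("First", "Second")
  b <- Array("T", "P")
} yield (a, b)
combinations.toSeq.toDF("A", "B").show()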

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':0}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda / udf to merge the data column, but I'm not sure how to think about that during a join.
Alternatively, I could break the properties from the JSON into columns, something like this; might that be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on all n DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more Spark-esque way to do this, but incorporating the accepted answer, this appears to work for the question update:
for i in range(0, (len(validations) - 1)):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[(i+1)]['df']
    # joins here...
    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
    .unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
    .select("id", f.explode("a_list").alias("a_values"), "count")\
    .groupBy("id")\
    .agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
    .select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
---------------------------------
| ID | words                    |
---------------------------------
| 1  | ['apple','ball','ballon']|
| 2  | ['cat','camel','james']  |
---------------------------------
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.extend([(i, my_data[i])])
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["words", "id"]).show(truncate=False)
+---------------------+---+
|words                |id |
+---------------------+---+
|[apple, ball, ballon]|0  |
|[cat, camel, james]  |1  |
|[none, focus, cake]  |2  |
+---------------------+---+

Apache Spark DataFrame apply custom operation after GroupBy

I have 2 columns, say id and value; id is of type Int and value is of type List[String].
Ids are repeating, so to make them unique I apply groupBy("id") on my DataFrame. Now my problem is that I want to append the values to each other, and the value column must be distinct.
Example: I have data like
+---+---+
| id| v |
+---+---+
| 1|[a]|
| 1|[b]|
| 1|[a]|
| 2|[e]|
| 2|[b]|
+---+---+
and I want my output like this:
+---+-----+
| id|    v|
+---+-----+
|  1|[a,b]|
|  2|[e,b]|
+---+-----+
I tried this:
val uniqueDF = df.groupBy("id").agg(collect_list("v"))
uniqueDF.map { row => (row.getInt(0),
  row.getAs[Seq[String]](1).toList.distinct) }
Can I do the same after groupBy(), say in agg() or something? I do not want to apply a map operation.
thanks
val uniqueDF = df.groupBy("id").agg(collect_set("v"))
A set will only keep unique values.
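One caveat: since v is itself an array column, collect_set gives you distinct arrays (e.g. [[a], [b]]) rather than the flat [a,b] shown in the expected output. A sketch of one way to get the flat version is to explode the column before grouping:
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Explode the array column so each element becomes its own row,
// then collect the distinct elements per id
val flatUniqueDF = df
  .withColumn("v", explode(col("v")))
  .groupBy("id")
  .agg(collect_set("v").as("v"))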