Use Iterator to get top k keywords - scala

I am writing a Spark job to get the top k keywords for each country. I already have a DataFrame containing all records and plan to do
df.repartition($"country_id").mapPartitions()
to retrieve the top k keywords, but I am confused about how to write the iterator logic.
If I could write a method or call a native one, I would sort within each partition and take the top k, but that does not seem to be the right approach when the input is an iterator.
Does anyone have an idea?

You can achieve this using window functions. Let's assume that column _1 is your grouping key (the country in your case) and _2 is the keyword's count. In this case k = 2.
scala> df.show()
+---+---+
| _1| _2|
+---+---+
| 1| 3|
| 2| 2|
| 1| 4|
| 1| 1|
| 2| 0|
| 1| 10|
| 2| 5|
+---+---+
scala> df.select('*,row_number().over(Window.orderBy('_2.desc).partitionBy('_1)).as("rn")).where('rn < 3).show()
+---+---+---+
| _1| _2| rn|
+---+---+---+
| 1| 10| 1|
| 1| 4| 2|
| 2| 5| 1|
| 2| 2| 2|
+---+---+---+
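If you still want the iterator-based route from the question, the usual trick is to keep a bounded heap per key rather than sorting the whole partition. The following is a minimal plain-Python sketch of that idea, not Spark code; the row shape (country_id, keyword, count), the keyword names and the helper name top_k_per_key are assumptions made for illustration. The counts mirror the example table above.

```python
import heapq
from collections import defaultdict

def top_k_per_key(rows, k):
    """Consume an iterator of (key, keyword, count) rows once,
    keeping at most k entries per key in a min-heap."""
    heaps = defaultdict(list)  # key -> min-heap of (count, keyword)
    for key, keyword, count in rows:
        heap = heaps[key]
        if len(heap) < k:
            heapq.heappush(heap, (count, keyword))
        elif (count, keyword) > heap[0]:
            # New entry beats the smallest kept entry: swap it in.
            heapq.heapreplace(heap, (count, keyword))
    # Largest count first for each key.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}

# Hypothetical rows: (country_id, keyword, count).
rows = iter([(1, "a", 3), (2, "b", 2), (1, "c", 4), (1, "d", 1),
             (2, "e", 0), (1, "f", 10), (2, "g", 5)])
result = top_k_per_key(rows, 2)
```

The iterator is read exactly once and memory stays proportional to k per key, which is the property that makes this shape usable inside a per-partition function.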

Related

How to do a groupBy by a given column but still keep all the rows of the original DataFrame?

I want to do a groupBy and aggregate by a given column in PySpark, but I still want to keep all the rows from the original DataFrame.
For example, let's say we have the following DataFrame and we want to take the max of the "value" column; then we would get the result below.
Original DataFrame
+--+-----+
|id|value|
+--+-----+
| 1| 1|
| 1| 2|
| 2| 3|
| 2| 4|
+--+-----+
Result
+--+-----+---+
|id|value|max|
+--+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+--+-----+---+
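You can do it simply by joining the aggregated DataFrame with the original DataFrame:
from pyspark.sql import functions as F
# Compute the per-id maximum.
aggregated_df = (
    df
    .groupby('id')
    .agg(F.max('value').alias('max'))
)
# Join the maximum back onto every original row.
max_value_df = (
    df
    .join(aggregated_df, 'id')
)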
Use a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.withColumn('max', F.max('value').over(Window.partitionBy('id'))).show()
+---+-----+---+
| id|value|max|
+---+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+---+-----+---+
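Both answers compute the same thing: an aggregate per group that is then attached to every row of that group. As a plain-Python sketch of those semantics (not PySpark; the helper name with_group_max is hypothetical), using the rows from the question:

```python
def with_group_max(rows):
    """Attach the per-key maximum to every (key, value) row."""
    rows = list(rows)
    group_max = {}
    # First pass: aggregate, like groupby('id').agg(max('value')).
    for key, value in rows:
        group_max[key] = max(value, group_max.get(key, value))
    # Second pass: join the aggregate back onto each original row.
    return [(key, value, group_max[key]) for key, value in rows]

result = with_group_max([(1, 1), (1, 2), (2, 3), (2, 4)])
```

The row count is unchanged, which is the difference from a plain groupBy that collapses each group into a single row.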

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max() - min() with a groupBy, but that treated user A as one segment (8 - 1) and gave the wrong answer.
In SQLite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this:
SELECT
    COUNT(*) FILTER (WHERE a.user <>
        ( SELECT b.user
          FROM foobar AS b
          WHERE a.timestamp > b.timestamp
          ORDER BY b.timestamp DESC
          LIMIT 1
        ))
    OVER (ORDER BY timestamp) c,
    user,
    timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in SQL made that easy to finish.
Any ideas on how to scale this and do it in PySpark? I can't seem to find an adequate substitute for the "count(*) where(...)" that SQLite offered.
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The constant column dummy is only there to give the row_number window function a partition to work on.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a subgroup within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates the start of a new contiguous group. This gives the column diff; note that the first row in each user group has no previous row, so its null is filled with -1.
(2) We then assign null to every row with diff == 1. This gives the column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value in column diff2. This gives subgroupid.
This is the subgroup we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
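One thing to watch with the steps above: the subgroup label is the size of the gap, so if the same user returns twice after gaps of equal size, both segments would get the same subgroupid. A variant is to number segments with a running counter per user instead. The following is a plain-Python sketch of that variant, not PySpark; the helper name segment_ids is hypothetical and the rows are assumed to be already sorted by timestamp.

```python
def segment_ids(rows):
    """rows: (user, timestamp) pairs sorted by timestamp.
    Returns (user, timestamp, segment), where segment goes up by one
    each time a user starts a new contiguous run."""
    last_row_number = {}  # user -> row_number of that user's previous row
    segment = {}          # user -> current segment counter
    out = []
    for row_number, (user, ts) in enumerate(rows, start=1):
        prev = last_row_number.get(user)
        # First row for this user, or a gap since the user's previous row.
        if prev is None or row_number - prev != 1:
            segment[user] = segment.get(user, 0) + 1
        out.append((user, ts, segment[user]))
        last_row_number[user] = row_number
    return out

rows = [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5),
        ("A", 6), ("A", 7), ("A", 8)]
labelled = segment_ids(rows)

# Two gaps of equal size for user A still get distinct segment numbers.
equal_gaps = segment_ids([("A", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5), ("A", 6)])
```

In Spark terms this would correspond to a cumulative sum over a "new segment" flag within the per-user window, rather than a forward fill of the gap size.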
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
# Duration of each contiguous segment: last timestamp minus first timestamp.
df = df.groupBy('user', 'subgroupid').agg((F.max('timestamp') - F.min('timestamp')).alias('segment_time'))
# Sum the segment durations per user.
df = df.groupBy('user').agg(F.sum('segment_time').alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
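To check the expected totals on a small sample without Spark, the same "sum of last minus first per contiguous run" logic can be sketched in plain Python with itertools.groupby, which groups consecutive rows only. This is an illustration of the logic rather than a scalable solution; the helper name total_time_per_user is hypothetical and the rows are assumed to be sorted by timestamp.

```python
from collections import defaultdict
from itertools import groupby

def total_time_per_user(rows):
    """rows: (user, timestamp) pairs sorted by timestamp.
    Sums (last - first) timestamp over each contiguous run of a user."""
    totals = defaultdict(int)
    # groupby only merges adjacent rows with the same user,
    # so each group is exactly one contiguous segment.
    for user, run in groupby(rows, key=lambda r: r[0]):
        timestamps = [ts for _, ts in run]
        totals[user] += timestamps[-1] - timestamps[0]
    return dict(totals)

rows = [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5),
        ("A", 6), ("A", 7), ("A", 8)]
totals = total_time_per_user(rows)
```

For user A this adds (3 - 1) and (8 - 6), which matches the total_time of 4 in the table above.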

How to make VectorAssembler do not compress data?

I want to combine multiple columns into one column using VectorAssembler, but the output is compressed by default and there seems to be no option to turn this off.
val arr2= Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6))
val df=sc.parallelize(arr2).toDF("a","b","c","e","f")
val colNames=Array("a","b","c","e","f")
val assembler = new VectorAssembler()
.setInputCols(colNames)
.setOutputCol("newCol")
val transDF= assembler.transform(df).select(col("newCol"))
transDF.show(false)
The input is:
+---+---+---+---+---+
| a| b| c| e| f|
+---+---+---+---+---+
| 1| 2| 0| 0| 0|
| 1| 2| 3| 0| 0|
| 1| 2| 4| 5| 0|
| 1| 2| 2| 5| 6|
+---+---+---+---+---+
The result is:
+---------------------+
|newCol |
+---------------------+
|(5,[0,1],[1.0,2.0]) |
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+
My expected result is:
+---------------------+
|newCol |
+---------------------+
|[1.0,2.0,0.0,0.0,0.0]|
|[1.0,2.0,3.0,0.0,0.0]|
|[1.0,2.0,4.0,5.0,0.0]|
|[1.0,2.0,2.0,5.0,6.0]|
+---------------------+
What should I do to get my expected result?
If you really want to coerce all vectors to their dense representation, you can do it using a user-defined function:
val toDense = udf((v: org.apache.spark.ml.linalg.Vector) => v.toDense)
transDF.select(toDense($"newCol")).show
+--------------------+
| UDF(newCol)|
+--------------------+
|[1.0,2.0,0.0,0.0,...|
|[1.0,2.0,3.0,0.0,...|
|[1.0,2.0,4.0,5.0,...|
|[1.0,2.0,2.0,5.0,...|
+--------------------+
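It may help to see that the sparse form (5,[0,1],[1.0,2.0]) and the dense form [1.0,2.0,0.0,0.0,0.0] describe the same vector: size, then the indices of the non-zero entries, then their values. A plain-Python sketch of the expansion (not the Spark API; the function name sparse_to_dense is hypothetical):

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The first row of the assembler output, written as a sparse triple.
dense_row = sparse_to_dense(5, [0, 1], [1.0, 2.0])
```

So the assembler output is not losing data; only the storage format of the first row differs from the other rows.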

Pivot scala dataframe with conditional counting

I would like to aggregate this DataFrame and count the number of observations with a value less than or equal to the "BUCKET" field for each level. For example:
val myDF = Seq(
("foo", 0),
("foo", 0),
("bar", 0),
("foo", 1),
("foo", 1),
("bar", 1),
("foo", 2),
("bar", 2),
("foo", 3),
("bar", 3)).toDF("COL1", "BUCKET")
myDF.show
+----+------+
|COL1|BUCKET|
+----+------+
| foo| 0|
| foo| 0|
| bar| 0|
| foo| 1|
| foo| 1|
| bar| 1|
| foo| 2|
| bar| 2|
| foo| 3|
| bar| 3|
+----+------+
I can count the number of observations matching each bucket value using this code:
myDF.groupBy("COL1").pivot("BUCKET").count.show
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 1| 1| 1|
| foo| 2| 2| 1| 1|
+----+---+---+---+---+
But I want to count the number of rows with a value in the "BUCKET" field which is less than or equal to the final header after pivoting, like this:
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 2| 3| 4|
| foo| 2| 4| 5| 6|
+----+---+---+---+---+
You can achieve this using a window function, as follows:
import org.apache.spark.sql.expressions.Window.partitionBy
import org.apache.spark.sql.functions.{count, first}
myDF.
  select(
    $"COL1",
    $"BUCKET",
    // Running count per COL1, ordered by BUCKET.
    count($"BUCKET").over(partitionBy($"COL1").orderBy($"BUCKET")).as("ROLLING_COUNT")).
  groupBy($"COL1").pivot("BUCKET").agg(first("ROLLING_COUNT")).
  show()
+----+---+---+---+---+
|COL1| 0| 1| 2| 3|
+----+---+---+---+---+
| bar| 1| 2| 3| 4|
| foo| 2| 4| 5| 6|
+----+---+---+---+---+
Here you count your observations within windows determined by a key (COL1 in this case). Specifying an ordering makes the count cumulative over the window, which gives the values that are then pivoted into the final result.
This is the result of applying the window function:
myDF.
  select(
    $"COL1",
    $"BUCKET",
    count($"BUCKET").over(partitionBy($"COL1").orderBy($"BUCKET")).as("ROLLING_COUNT")).
  show()
+----+------+-------------+
|COL1|BUCKET|ROLLING_COUNT|
+----+------+-------------+
| bar| 0| 1|
| bar| 1| 2|
| bar| 2| 3|
| bar| 3| 4|
| foo| 0| 2|
| foo| 0| 2|
| foo| 1| 4|
| foo| 1| 4|
| foo| 2| 5|
| foo| 3| 6|
+----+------+-------------+
Finally, by grouping by COL1, pivoting over BUCKET, and taking the first rolling count in each cell (any one would do, since rows with the same COL1 and BUCKET carry the same rolling count), you obtain the result you were looking for.
In a way, window functions are very similar to aggregations over groupings, but are more flexible and powerful. This just scratches the surface of window functions; you can dig a little deeper by having a look at this introductory reading.
Here's one approach to get the rolling counts by traversing the pivoted BUCKET value columns with foldLeft and accumulating the counts. Note that a tuple of (DataFrame, String) is used as the foldLeft accumulator, to transform the DataFrame as well as carry the name of the column updated in the previous iteration:
val pivotedDF = myDF.groupBy($"COL1").pivot("BUCKET").count
val buckets = pivotedDF.columns.filter(_ != "COL1")
buckets.drop(1).foldLeft((pivotedDF, buckets.head))( (acc, c) =>
( acc._1.withColumn(c, col(acc._2) + col(c)), c )
)._1.show
// +----+---+---+---+---+
// |COL1| 0| 1| 2| 3|
// +----+---+---+---+---+
// | bar| 1| 2| 3| 4|
// | foo| 2| 4| 5| 6|
// +----+---+---+---+---+
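Both answers boil down to a running sum of the per-bucket counts within each COL1 value. As a quick way to check the expected numbers without Spark, here is a plain-Python sketch using itertools.accumulate; the helper name rolling_counts is hypothetical, and the rows are the ones from the question.

```python
from collections import Counter
from itertools import accumulate

def rolling_counts(rows):
    """rows: (col1, bucket) pairs.
    Returns {col1: {bucket: number of rows with BUCKET <= bucket}}."""
    buckets = sorted({b for _, b in rows})
    counts = Counter(rows)  # (col1, bucket) -> plain count, like pivot + count
    result = {}
    for key in sorted({k for k, _ in rows}):
        per_bucket = [counts[(key, b)] for b in buckets]
        # Running sum across the bucket columns, left to right.
        result[key] = dict(zip(buckets, accumulate(per_bucket)))
    return result

rows = [("foo", 0), ("foo", 0), ("bar", 0), ("foo", 1), ("foo", 1),
        ("bar", 1), ("foo", 2), ("bar", 2), ("foo", 3), ("bar", 3)]
pivoted = rolling_counts(rows)
```

A missing (col1, bucket) combination counts as zero here because Counter returns 0 for absent keys; in the foldLeft version above, a pivoted cell with no rows would be null rather than zero, so that case may need separate handling.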

How to select all columns in spark sql query in aggregation function

Hi, I am new to Spark SQL.
I have a query like this:
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This prints only 3 columns:
tagShortID,Timestamp,maxAvgValue
But I want to display all the other columns along with this column. Any help or suggestion would be appreciated.
One alternative, usually a good fit for this specific case, is to use window functions, because they avoid the need to join with the original data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("tagShortID", "Timestamp")
val result = averageDF.withColumn("maxAvgValue", max($"RSSI_Weight_avg").over(windowSpec))
You can find here a good article explaining the Window Functions functionality in Spark.
Please note that it requires either Spark 2+ or a HiveContext in Spark versions 1.4 ~ 1.6.
Here is the simple example with the column name you have
This is your averageDF dataframe with dummy data
+----------+---------+---------------+---------+--------+---------------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|
+----------+---------+---------------+---------+--------+---------------+
| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2|
| 1| 1| 1| 1| 1| 1|
| 1| 1| 1| 1| 1| 1|
+----------+---------+---------------+---------+--------+---------------+
After the groupBy and aggregation:
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This did not return all the columns you selected because, after a groupBy and aggregation, only the grouping columns and the aggregated result column are returned, as below:
+----------+---------+-----------+
|tagShortID|Timestamp|maxAvgValue|
+----------+---------+-----------+
| 2| 2| 2|
| 1| 1| 1|
+----------+---------+-----------+
To get all the columns, you need to join these two DataFrames:
averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp"))
and the final result will be
+----------+---------+---------------+---------+--------+---------------+-----------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|maxAvgValue|
+----------+---------+---------------+---------+--------+---------------+-----------+
| 2| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2| 2|
| 1| 1| 1| 1| 1| 1| 1|
| 1| 1| 1| 1| 1| 1| 1|
+----------+---------+---------------+---------+--------+---------------+-----------+
I hope this clears your confusion.