How can I add a column to a DataFrame which groups rows in chunks of N? Like NTILE, but with a fixed bucket size - scala

Say I have a DataFrame like:
+------------+-----------+-----+
| feed|artist |count|
+------------+-----------+-----+
| y| Kanye West| 9|
| y| Radiohead| 6|
| y| Zero 7| 3|
| y| Puts Marie| 1|
| gwas| Drax| 7|
| gwas| Calibre| 4|
| gwas| Aphex Twin| 1|
| gwas| Jay Z| 1|
| x| DJ Krush| 2|
| x| Titeknots| 1|
+------------+-----------+-----+
I want to add a new column which chunks the rows into buckets of N rows for each partition (feed).
It seems like the inverse of NTILE to me. NTILE lets you choose the # of buckets but I want to choose the bucket-size instead.
Here's the desired result. Notice how each feed is chunked into groups of N = 2, including the x feed which has just one chunk of 2 rows. (Edit: each partition is ordered by count, so group 1 in each partition will be the rows with the highest value for count)
+------------+-----------+-----+-----+
| feed|artist |count|group|
+------------+-----------+-----+-----+
| y| Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y| Puts Marie| 1| 2|
| gwas| Drax| 7| 1|
| gwas| Calibre| 4| 1|
| gwas| Aphex Twin| 1| 2|
| gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
+------------+-----------+-----+-----+
As a bonus, I would like each bucket to be a different size. E.g. List(2, 2, 4, 10, 10, -1) would mean that the first bucket has 2 rows, the second has 2 rows, the third has 4 rows, etc., and the final bucket (-1) contains the remainder.
EDIT
(Another useful variation)
While implementing the answers, I realized that there's another variation which I would prefer:
Add a column to a DataFrame which chunks its rows into groups of N, without knowing the size of the DataFrame.
Example:
If N = 100 and the DataFrame has 800 rows, it will chunk it into 8 buckets of 100. If the DataFrame has 950 rows, it will chunk it into 9 buckets of 100 and 1 bucket of 50. It should not require a scan/call to .count().
The example DataFrames are analogous to the ones above.
(meta: should I make a new question for this variation? I feel like "NTILE with a fixed bucket size" is a more elegant problem and probably more common than my original use-case)

If I understand you correctly, this can be handled by using an SQL expression:
import org.apache.spark.sql.functions.{expr,row_number,desc}
import org.apache.spark.sql.expressions.Window
// set up WindowSpec
val w1 = Window.partitionBy("feed").orderBy(desc("count"))
val L = List(2, 2, 4, 10, 10, -1)
// dynamically create SQL expression from the List `L` to map row_number into group-id
var sql_expr = "CASE"
var running_total = 0
for (i <- 1 to L.size) {
  running_total += L(i-1)
  sql_expr += (if (L(i-1) > 0) s" WHEN rn <= $running_total THEN $i " else s" ELSE $i END")
}
println(sql_expr)
//CASE WHEN rn <= 2 THEN 1 WHEN rn <= 4 THEN 2 WHEN rn <= 8 THEN 3 WHEN rn <= 18 THEN 4 WHEN rn <= 28 THEN 5 ELSE 6 END
val df_new = df.withColumn("rn", row_number().over(w1)).withColumn("group", expr(sql_expr)).drop("rn")
df_new.show
+----+----------+-----+-----+
|feed| artist|count|group|
+----+----------+-----+-----+
|gwas| Drax| 7| 1|
|gwas| Calibre| 4| 1|
|gwas|Aphex Twin| 1| 2|
|gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
| y|Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y|Puts Marie| 1| 2|
+----+----------+-----+-----+
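The mapping that the generated CASE expression encodes can be checked outside Spark. The following is a minimal plain-Python sketch, not part of the answer's code: the function name bucket_for_row is hypothetical, and it assumes that -1 appears only as the last entry of the list.

```python
from bisect import bisect_left
from itertools import accumulate

def bucket_for_row(row_number, sizes):
    """Map a 1-based row number to a 1-based bucket id.

    sizes lists the bucket sizes in order; a trailing -1 marks a
    catch-all bucket for the remainder (assumed to appear only last).
    """
    fixed = [s for s in sizes if s > 0]
    # Running totals are the inclusive upper bounds: [2, 4, 8, 18, 28]
    upper_bounds = list(accumulate(fixed))
    # bisect_left gives the index of the first bound >= row_number;
    # rows past the last bound land in one extra bucket (the remainder)
    return bisect_left(upper_bounds, row_number) + 1
```

For List(2, 2, 4, 10, 10, -1) this reproduces the thresholds printed above: rows 1-2 map to group 1, rows 3-4 to group 2, rows 5-8 to group 3, and anything past row 28 to group 6.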
For a fixed N, just cast (row_number-1)/N + 1 to int:
val N = 2
val df_new = df.withColumn("group", ((row_number().over(w1)-1)/N+1).cast("int"))
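The arithmetic behind the fixed-N case depends only on each row's own row number, so it does not need the total row count. A small plain-Python sketch of the same formula (the function name fixed_size_group is hypothetical):

```python
def fixed_size_group(row_number, n):
    """1-based group id for a 1-based row number with buckets of n rows."""
    # Integer division plays the role of the cast to int in the Spark version
    return (row_number - 1) // n + 1

# 950 rows with N = 100: nine full buckets and a last bucket of 50
groups = [fixed_size_group(r, 100) for r in range(1, 951)]
print(max(groups), groups.count(10))
```

In Spark, the row number still has to come from a window, so an ordering over the whole DataFrame (or per partition key) is assumed.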

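This could work:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, udf}
import spark.implicits._ // for the $"..." column syntax
val bucketDef = List(2, 2, 4, 10, 10)
val bucketRunsum = bucketDef.scanLeft(1)(_ + _) // running sum of bucket boundaries, starting at row 1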
// maps a row-number to a bucket
val indexBucketMapping = bucketRunsum.zip(bucketRunsum.tail)
  .zipWithIndex
  .map { case ((start, end), index) => ((start, end), index + 1) } // make index start at 1
// gives List(((1,3),1), ((3,5),2), ((5,9),3), ((9,19),4), ((19,29),5))
// udf to assign a bucket to a given row-number
val calcBucket = udf((rnb: Long) => indexBucketMapping
  .find { case ((start, end), _) => start <= rnb && rnb < end }
  .map(_._2) // get index
  .getOrElse(indexBucketMapping.last._2 + 1) // is in last bucket
)
// order by count descending so that group 1 holds the highest counts, as the question asks
df
  .withColumn("group", calcBucket(row_number().over(Window.partitionBy($"feed").orderBy($"count".desc))))
Alternatively (without a UDF), construct a DataFrame which maps a row-number to a bucket and then join:
val bucketSizeDef = List(2, 2, 4, 10, 10)
val bucketDef = (1 +: bucketSizeDef).zipWithIndex.map { case (bs, index) => (bs, index + 1) }
  .toDF("bucketSize", "group")
  .withColumn("i", sum($"bucketSize").over(Window.orderBy($"group")))
  .withColumn("i_to", coalesce(lead($"i", 1).over(Window.orderBy($"group")), lit(Long.MaxValue)))
  .drop($"bucketSize")
bucketDef.show()
gives:
+-----+---+-------------------+
|group| i| i_to|
+-----+---+-------------------+
| 1| 1| 3|
| 2| 3| 5|
| 3| 5| 9|
| 4| 9| 19|
| 5| 19| 29|
| 6| 29|9223372036854775807|
+-----+---+-------------------+
then join to df:
df
  // order by count descending so that group 1 holds the highest counts
  .withColumn("rnb", row_number().over(Window.partitionBy($"feed").orderBy($"count".desc)))
  .join(broadcast(bucketDef), $"rnb" >= $"i" and $"rnb" < $"i_to")
  .drop("rnb", "i", "i_to")

Related

How to compare value of one row with all the other rows in PySpark on grouped values

Problem statement
Consider the following data (see code generation at the bottom)
+-----+-----+-------+--------+
|index|group|low_num|high_num|
+-----+-----+-------+--------+
| 0| 1| 1| 1|
| 1| 1| 2| 2|
| 2| 1| 3| 3|
| 3| 2| 1| 3|
+-----+-----+-------+--------+
Then, for a given index, I want to count how many of the low_num values in its group are smaller than that index's high_num.
For instance, consider the second row, with index 1. It is in group 1 and its high_num is 2. That high_num is greater than the low_num on index 0, equal to the low_num on index 1, and smaller than the low_num on index 2. So the high_num of index 1 is greater than a low_num in the group exactly once, and I want the value in the answer column to say 1.
Dataset with desired output
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
Dataset generation code
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .getOrCreate()
)
## Example df
## Note the inclusion of "desired" which is the desired output.
df = spark.createDataFrame(
    [
        (0, 1, 1, 1, 0),
        (1, 1, 2, 2, 1),
        (2, 1, 3, 3, 2),
        (3, 2, 1, 3, 1)
    ],
    schema=["index", "group", "low_num", "high_num", "desired"]
)
Pseudocode that might have solved the problem
The pseudocode might look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w_spec = Window.partitionBy("group").rowsBetween(
Window.unboundedPreceding, Window.unboundedFollowing)
## F.collect_list_when does not exist
## F.current_col does not exist
## Probably wouldn't work like this anyways
ddf = df.withColumn("Counts",
F.size(F.collect_list_when(
F.current_col("high_number") > F.col("low_number"), 1
).otherwise(None).over(w_spec))
)
You can do a filter on the collect_list, and check its size:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'desired',
    F.expr('size(filter(collect_list(low_num) over (partition by group), x -> x < high_num))')
)
df2.show()
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
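To make the logic of the expression explicit: it collects every low_num in the row's group, keeps those smaller than the row's own high_num, and counts what is left. A plain-Python sketch of that same computation, useful for checking small samples by hand (the function name count_lower_in_group is hypothetical and this is not how Spark evaluates it):

```python
from collections import defaultdict

def count_lower_in_group(rows):
    """rows: (index, group, low_num, high_num) tuples -> {index: count}."""
    # Equivalent of collect_list(low_num) over (partition by group)
    lows = defaultdict(list)
    for _, group, low, _ in rows:
        lows[group].append(low)
    # Equivalent of size(filter(..., x -> x < high_num)) for each row
    return {
        idx: sum(1 for low in lows[group] if low < high)
        for idx, group, _, high in rows
    }

rows = [(0, 1, 1, 1), (1, 1, 2, 2), (2, 1, 3, 3), (3, 2, 1, 3)]
print(count_lower_in_group(rows))
```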

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
COUNT(*) FILTER (WHERE a.user <>
( SELECT b.user
FROM foobar AS b
WHERE a.timestamp > b.timestamp
ORDER BY b.timestamp DESC
LIMIT 1
))
OVER (ORDER BY timestamp) c,
user,
timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in sql made that easy to finish.
Any ideas on how to scale this and do it in pyspark? I can't seem to find adequate substitutes for the "count(*) where(...)" sqlite offered
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The column dummy is used so that we can apply the window function row_number.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a subgroup within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates the start of a new contiguous group. This gives the column diff; note that the first row in each group has a value of -1 (the null from lag, replaced by fillna).
(2) We then assign null to every row with diff == 1. This gives the column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value in column diff2. This gives subgroupid.
This is the subgroup we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
# time spent in each contiguous interval, then the total per user
df = df.groupBy('user', 'subgroupid').agg((F.max('timestamp') - F.min('timestamp')).alias('duration'))
df = df.groupBy('user').agg(F.sum('duration').alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
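One thing that may be worth double-checking in the steps above: subgroupid is derived from the size of the gap in row_number, so two separate runs of the same user that each follow a gap of the same size would seem to end up with the same id. A minimal plain-Python sketch of the underlying idea (sum max - min over each unbroken run of a user) can serve as a reference to compare results against on a sample. The function name total_time_per_user is hypothetical, and the data is assumed to fit in memory.

```python
from collections import defaultdict
from itertools import groupby

def total_time_per_user(rows):
    """rows: iterable of (user, timestamp) pairs -> {user: total time}."""
    totals = defaultdict(int)
    ordered = sorted(rows, key=lambda r: r[1])  # order by timestamp
    # groupby starts a new run every time the user changes
    for user, run in groupby(ordered, key=lambda r: r[0]):
        stamps = [ts for _, ts in run]
        totals[user] += stamps[-1] - stamps[0]  # last - first of the run
    return dict(totals)

rows = [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5),
        ("A", 6), ("A", 7), ("A", 8)]
print(total_time_per_user(rows))
```

In Spark, the same run id could be built by flagging rows where the user differs from the previous row and taking a running sum of that flag, which is essentially what the sqlite query in the question does.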

A sum of typedLit columns evaluates to NULL

I am trying to create a sum column by taking the sum of the row values of a set of columns in a dataframe. So I followed the following method to do it.
val temp_data = spark.createDataFrame(Seq(
(1, 5),
(2, 4),
(3, 7),
(4, 6)
)).toDF("A", "B")
val cols = List(col("A"), col("B"))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
So this method works fine and produces the expected output. However, I want to create the cols variable without specifying the column names explicitly. Therefore I've used typedLit as follows.
val cols2 = temp_data.columns.map(x=>typedLit(x)).toList
When I look at cols and cols2, they look identical.
cols: List[org.apache.spark.sql.Column] = List(A, B)
cols2: List[org.apache.spark.sql.Column] = List(A, B)
However, when I use cols2 to create my sum column, it doesn't work the way I expect it to work.
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
+---+---+----+
| A| B| sum|
+---+---+----+
| 1| 5|null|
| 2| 4|null|
| 3| 7|null|
| 4| 6|null|
+---+---+----+
Does anyone have any idea what I'm doing wrong here? Why doesn't the second method work like the first method?
lit or typedLit is not a replacement for Column. What your code does is create a list of string literals, "A" and "B":
temp_data.select(cols2: _*).show
+---+---+
| A| B|
+---+---+
| A| B|
| A| B|
| A| B|
| A| B|
+---+---+
and asks for their sum. Because neither string can be parsed as a number, the result is NULL, as shown in your output.
You might use TypedColumn here:
import org.apache.spark.sql.TypedColumn
val typedSum: TypedColumn[Any, Int] = cols.map(_.as[Int]).reduce{
(x, y) => (x + y).as[Int]
}
temp_data.withColumn("sum", typedSum).show
but it doesn't provide any practical advantage over standard Column here.
typedLit is not the right tool here and, as the other answer mentions, you don't need TypedColumn either. You can simply map over the DataFrame's column names and wrap each one with col to get a collection of Column objects.
Change your cols2 statement to the one below and try again.
val cols = temp_data.columns.map(f=> col(f))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
You will get below output.
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
Thanks

Efficient way of using for loops in scala

I am trying to divide a data frame into n groups based on certain values of its columns, and ended up with the below code.
But it doesn't look efficient in terms of nested for loops, so I am looking for a more elegant approach to implementing the following code. Can someone please provide input?
The input will be the column names based on which the data frame should be divided.
So I have a val storing the distinct values of the columns.
It stores them like this:
(0)(0) = F
(0)(1) = M
(1)(0) = drugY
(1)(1) = drugC
(1)(2) = drugX
So I have a total of 6 groups created, with column values as follows:
F and drugY
M and drugY
F and drugC
M and drugC
F and drugX
M and drugX
I don't really understand what you want to do, but if you want to generate the combinations in Spark, you can do it with SQL like this:
val patients = Seq(
(1, "f"),
(2, "m")
).toDF("id", "name")
val drugs = Seq(
(1, "drugY"),
(2, "drugC"),
(3, "drugX")
).toDF("id", "name")
patients.createOrReplaceTempView("patients")
drugs.createOrReplaceTempView("drugs")
sqlContext.sql("select p.id as patient_id, p.name as patient_name, d.id as drug_id, d.name as drug_name from patients p cross join drugs d").show
+----------+------------+-------+---------+
|patient_id|patient_name|drug_id|drug_name|
+----------+------------+-------+---------+
| 1| f| 1| drugY|
| 1| f| 2| drugC|
| 1| f| 3| drugX|
| 2| m| 1| drugY|
| 2| m| 2| drugC|
| 2| m| 3| drugX|
+----------+------------+-------+---------+
Or with the DataFrame API (crossJoin makes the cartesian product explicit):
val cartesian = patients.crossJoin(drugs)
cartesian.show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
| 1| f| 1|drugY|
| 1| f| 2|drugC|
| 1| f| 3|drugX|
| 2| m| 1|drugY|
| 2| m| 2|drugC|
| 2| m| 3|drugX|
+---+----+---+-----+
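After that you can use a crosstab to get a table of the frequency distribution (the columns are renamed first, because cartesian has duplicate id and name columns):
cartesian.toDF("patient_id", "patient_name", "drug_id", "drug_name").stat.crosstab("patient_name", "drug_name").show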
+----------------------+-----+-----+-----+
|patient_name_drug_name|drugC|drugX|drugY|
+----------------------+-----+-----+-----+
| m| 1| 1| 1|
| f| 1| 1| 1|
+----------------------+-----+-----+-----+
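If the goal is only to enumerate the value combinations for an arbitrary number of columns without nested loops, a cartesian product over the lists of distinct values does it in one call. A small plain-Python sketch of the idea (the function name value_combinations is hypothetical, and the distinct values are assumed to be already collected into local lists):

```python
from itertools import product

def value_combinations(distinct_values):
    """distinct_values: one list of distinct values per column."""
    # product(*lists) yields one tuple per combination, however many lists there are
    return list(product(*distinct_values))

combos = value_combinations([["F", "M"], ["drugY", "drugC", "drugX"]])
for sex, drug in combos:
    print(sex, "and", drug)
```

Each tuple could then be turned into a filter on the data frame to obtain the corresponding group.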

Counting both occurrences and cooccurrences in a DF

I would like to compute the mutual information (MI) between two variables x and y that I have in a Spark dataframe which looks like this:
scala> df.show()
+---+---+
| x| y|
+---+---+
| 0| DO|
| 1| FR|
| 0| MK|
| 0| FR|
| 0| RU|
| 0| TN|
| 0| TN|
| 0| KW|
| 1| RU|
| 0| JP|
| 0| US|
| 0| CL|
| 0| ES|
| 0| KR|
| 0| US|
| 0| IT|
| 0| SE|
| 0| MX|
| 0| CN|
| 1| EE|
+---+---+
In my case, x happens to be whether an event is occurring (x = 1) or not (x = 0), and y is a country code, but these variables could represent anything. To compute the MI between x and y I would like to have the above dataframe grouped by x, y pairs with the following three additional columns:
The number of occurrences of x
The number of occurrences of y
The number of occurrences of x, y
In the short example above, it would look like
x, y, count_x, count_y, count_xy
0, FR, 17, 2, 1
1, FR, 3, 2, 1
...
Then I would just have to compute the mutual information term for each x, y pair and sum them.
So far, I have been able to group by x, y pairs and aggregate a count(*) column but I couldn't find an efficient way to add the x and y counts. My current solution is to convert the DF into an array and count the occurrences and cooccurrences manually. It works well when y is a country but it takes forever when the cardinality of y gets big. Any suggestions as to how I could do it in a more Sparkish way?
Thanks in advance!
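I would go with RDDs: generate a key for each use case, count by key and join the results. This way I know exactly what the stages are.
import org.apache.spark.rdd.RDD
rdd.cache() // rdd is your data as an RDD[(Int, String)] of (x, y) pairs
// reduceByKey keeps the counts distributed; countByKey and countByValue would return local Maps
val xCnt: RDD[(Int, Long)] = rdd.map { case (x, _) => (x, 1L) }.reduceByKey(_ + _)
val yCnt: RDD[(String, Long)] = rdd.map { case (_, y) => (y, 1L) }.reduceByKey(_ + _)
val xyCnt: RDD[((Int, String), Long)] = rdd.map(xy => (xy, 1L)).reduceByKey(_ + _)
// pair every x count with every y count, keyed by (x, y)
val tmp = xCnt.cartesian(yCnt).map { case ((x, xc), (y, yc)) => ((x, y), (xc, yc)) }
// the inner join keeps only the (x, y) pairs that actually occur
val miReady = tmp.join(xyCnt).map { case ((x, y), ((xc, yc), xyc)) => ((x, y), xc, yc, xyc) }
Another option would be to use mapPartitions, work on the iterators directly and merge the results across partitions.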
I am also new to Spark, but I have an idea of what to do. I do not know if this is the perfect solution, but I thought sharing it wouldn't harm.
What I would do is probably filter() for the value 1 to create a Dataframe and filter() for the value 0 for a second Dataframe
You would get something like
1st Dataframe
DO 1
DO 1
FR 1
In the next step I would groupBy(y)
So you would get for the 1st Dataframe
DO 1 1
FR 1
As GroupedData https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/GroupedData.html
This also has a count() function which should be counting the rows per group. Unfortunately I do not have the time to try this out by myself right now but I wanted to try and help anyway.
Edit: Please let me know if this helped, otherwise I'll delete the answer so other people still take a look at this!
Recently, I had the same task to compute probabilities and here I would like to share my solution based on Spark's window aggregation functions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{col, sum}
// data is your DataFrame with two columns [x,y]
val cooccurrDF: DataFrame = data
  .groupBy(col("x"), col("y"))
  .count()
  .toDF("x", "y", "count-x-y")
val windowX: WindowSpec = Window.partitionBy("x")
val windowY: WindowSpec = Window.partitionBy("y")
val countsDF: DataFrame = cooccurrDF
  .withColumn("count-x", sum("count-x-y") over windowX)
  .withColumn("count-y", sum("count-x-y") over windowY)
countsDF.show()
First you group by every combination of the two columns and use count to get the number of co-occurrences. The windowed aggregates windowX and windowY then sum over the aggregated rows, which gives the counts for column x and column y respectively.
+---+---+---------+-------+-------+
| x| y|count-x-y|count-x|count-y|
+---+---+---------+-------+-------+
| 0| MK| 1| 17| 1|
| 0| MX| 1| 17| 1|
| 1| EE| 1| 3| 1|
| 0| CN| 1| 17| 1|
| 1| RU| 1| 3| 2|
| 0| RU| 1| 17| 2|
| 0| CL| 1| 17| 1|
| 0| ES| 1| 17| 1|
| 0| KR| 1| 17| 1|
| 0| US| 2| 17| 2|
| 1| FR| 1| 3| 2|
| 0| FR| 1| 17| 2|
| 0| TN| 2| 17| 2|
| 0| IT| 1| 17| 1|
| 0| SE| 1| 17| 1|
| 0| DO| 1| 17| 1|
| 0| JP| 1| 17| 1|
| 0| KW| 1| 17| 1|
+---+---+---------+-------+-------+
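With the three counts per pair available, the mutual information is the sum over all observed pairs of p(x,y) * log(p(x,y) / (p(x) * p(y))), where each probability is a count divided by the total number of rows. A plain-Python sketch of that last step, handy for validating the Spark result on a small sample (the function names count_table and mutual_information are hypothetical, and the natural logarithm is an assumption, since the question does not state a base):

```python
from collections import Counter
from math import log

def count_table(pairs):
    """Return {(x, y): (count_x, count_y, count_xy)} for the observed pairs."""
    pairs = list(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    count_xy = Counter(pairs)
    return {(x, y): (count_x[x], count_y[y], n) for (x, y), n in count_xy.items()}

def mutual_information(pairs):
    """Mutual information in nats, from empirical frequencies."""
    pairs = list(pairs)
    total = len(pairs)
    mi = 0.0
    for cx, cy, cxy in count_table(pairs).values():
        # p(x,y) / (p(x) * p(y)) simplifies to cxy * total / (cx * cy)
        mi += (cxy / total) * log(cxy * total / (cx * cy))
    return mi
```

Pairs that never occur contribute nothing to the sum, so iterating over the observed pairs only is sufficient.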