I have a dataframe where I want to get the first row where the indicator column is 0. For example, my dataframe will look like this:
network volume indicator Hour
YYY 20 1 10
YYY 30 0 9
YYY 40 0 8
YYY 80 1 7
TTT 50 0 10
TTT 40 1 8
TTT 10 0 4
TTT 10 1 2
The result should look like this:
network volume indicator Hour
YYY 20 1 10
YYY 30 0 9
YYY 80 1 7
TTT 50 0 10
TTT 40 1 8
TTT 10 1 2
So the rows with indicator 1 would still stay, while for each network I keep only the first time the indicator was 0. I want everything sorted by Hour in descending order when I do this, so that I get the most recent 0 indicator. How do I go about achieving this result?
Here is the required code, with inline comments to help you follow it (output updated for your latest dataset, which has multiple 1s in the indicator column):
sourceData.show()
+-------+------+---------+----+
|network|volume|indicator|Hour|
+-------+------+---------+----+
| YYY| 20| 1| 10|
| YYY| 30| 0| 9|
| YYY| 40| 0| 8|
| YYY| 80| 1| 7|
| TTT| 50| 0| 10|
| TTT| 40| 1| 8|
| TTT| 10| 0| 4|
| TTT| 10| 1| 2|
+-------+------+---------+----+
sourceData.printSchema()
root
|-- network: string (nullable = true)
|-- volume: integer (nullable = true)
|-- indicator: integer (nullable = true)
|-- Hour: integer (nullable = true)
Required Transformation Code:
//splitting your data set into two parts: indicator 1 and indicator 0
val indicator1Df = sourceData.filter("indicator == 1")
val indicator0Df = sourceData.filter("indicator == 0")
//getting the first (most recent) indicator=0 row per network
indicator0Df.createOrReplaceTempView("indicator0")
val firstIndicator0df = spark.sql("""
  select network, volume, indicator, hour
  from (
    select i0.network, i0.volume, i0.indicator, i0.hour,
           row_number() over (partition by i0.network order by i0.Hour desc) as rnk
    from indicator0 i0
  ) i
  where rnk = 1""")
//merging both dataframes back together to form your required output
val finalDf = indicator1Df.union(firstIndicator0df).orderBy($"network".desc, $"Hour".desc)
finalDf.show()
finalDf.show()
Final Output:
+-------+------+---------+----+
|network|volume|indicator|Hour|
+-------+------+---------+----+
| YYY| 20| 1| 10|
| YYY| 30| 0| 9|
| YYY| 80| 1| 7|
| TTT| 50| 0| 10|
| TTT| 40| 1| 8|
| TTT| 10| 1| 2|
+-------+------+---------+----+
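For reference, the same logic can also be expressed purely with the DataFrame window API instead of Spark SQL. This is only a minimal sketch of an equivalent approach, assuming the same sourceData as above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// rank the indicator = 0 rows within each network, most recent Hour first
val w = Window.partitionBy("network").orderBy($"Hour".desc)

val finalDf2 = sourceData.filter($"indicator" === 1)
  .union(
    sourceData.filter($"indicator" === 0)
      .withColumn("rnk", row_number().over(w))
      .filter($"rnk" === 1)
      .drop("rnk")
  )
  .orderBy($"network".desc, $"Hour".desc)
This should produce the same output as the SQL version above.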
I have a PySpark dataframe df:
status      Flag
present     1
present     0
na          1
Void        0
present     1
notpresent  0
present     0
present     0
ok          1
I want to update Flag to 1 wherever status is present or ok.
Expected:
status      Flag
present     1
present     1
na          1
Void        0
present     1
notpresent  0
present     1
present     1
ok          1
You can do so using withColumn and a check with when: recreate the Flag column, setting it to 1 if status is ok or present, and otherwise keeping the existing value.
from pyspark.sql.functions import when, col, lit
data = [
('present', 0),
('ok', 0),
('present', 1),
('void', 0),
('na', 1),
('notpresent', 0)
]
df = spark.createDataFrame(data, ['status', 'Flag'])
df.show()
df.withColumn('Flag', when(col('status').isin(['ok', 'present']), lit(1)).otherwise(col('Flag'))).show()
Output
+----------+----+
| status|Flag|
+----------+----+
| present| 0|
| ok| 0|
| present| 1|
| void| 0|
| na| 1|
|notpresent| 0|
+----------+----+
+----------+----+
| status|Flag|
+----------+----+
| present| 1|
| ok| 1|
| present| 1|
| void| 0|
| na| 1|
|notpresent| 0|
+----------+----+
The simplest way is to cast the boolean isin check directly to an integer:
df.withColumn('Flag', col('status').isin(['ok', 'present']).astype('int')).show()
+----------+----+
| status|Flag|
+----------+----+
| present| 1|
| ok| 1|
| present| 1|
| void| 0|
| na| 1|
|notpresent| 0|
+----------+----+
I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two consecutive window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{hour, count, row_number}

df
  .withColumn("hour", hour($"timestamp"))
  // count the rows per (hour, zone)
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  // rank the zones within each hour by that count, highest first
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
You can use window functions and group by with DataFrames. In your case you could use the rank() over (partition by) window function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// first group by hour and zone
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank by hour ordering by max in descending order
val df_rank = df_group
  .select(col("hour"),
          col("zone"),
          col("max"),
          rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// filter rows where rank = 1
df_rank
  .select(col("hour"),
          col("zone"),
          col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/
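For completeness, if the exercise did not insist on a window function, the same table can also be produced with plain aggregations. A minimal sketch (ties between zones are broken arbitrarily by zone name), assuming the input DataFrame is df as in the first answer:
import org.apache.spark.sql.functions.{hour, count, max, struct, col}

df
  .withColumn("hour", hour(col("timestamp")))
  .groupBy("hour", "zone")
  .agg(count("*").as("cnt"))                           // rows per (hour, zone)
  .groupBy("hour")
  .agg(max(struct(col("cnt"), col("zone"))).as("top")) // struct max keeps the row with the highest count
  .select(col("hour"), col("top.zone").as("zone"), col("top.cnt").as("max"))
  .orderBy("hour")
  .show()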
Say I have a DataFrame like:
+------------+-----------+-----+
| feed|artist |count|
+------------+-----------+-----+
| y| Kanye West| 9|
| y| Radiohead| 6|
| y| Zero 7| 3|
| y| Puts Marie| 1|
| gwas| Drax| 7|
| gwas| Calibre| 4|
| gwas| Aphex Twin| 1|
| gwas| Jay Z| 1|
| x| DJ Krush| 2|
| x| Titeknots| 1|
+------------+-----------+-----+
I want to add a new column which chunks the rows into buckets of N rows for each partition (feed).
It seems like the inverse of NTILE to me. NTILE lets you choose the # of buckets but I want to choose the bucket-size instead.
Here's the desired result. Notice how each feed is chunked into groups of N = 2, including the x feed which has just one chunk of 2 rows. (Edit: each partition is ordered by count, so group 1 in each partition will be the rows with the highest value for count)
+------------+-----------+-----+-----+
|        feed|artist     |count|group|
+------------+-----------+-----+-----+
|           y| Kanye West|    9|    1|
|           y|  Radiohead|    6|    1|
|           y|     Zero 7|    3|    2|
|           y| Puts Marie|    1|    2|
|        gwas|       Drax|    7|    1|
|        gwas|    Calibre|    4|    1|
|        gwas| Aphex Twin|    1|    2|
|        gwas|      Jay Z|    1|    2|
|           x|   DJ Krush|    2|    1|
|           x|  Titeknots|    1|    1|
+------------+-----------+-----+-----+
As a bonus, I would like each bucket to be a different size. E.g. List(2, 2, 4, 10, 10, -1) would mean that the first bucket has 2 rows, the second has 2 rows, the third has 4 rows, etc., and the final bucket (-1) contains the remainder.
EDIT
(Another useful variation)
While implementing the answers, I realized that there's another variation which I would prefer:
Add a column to a DataFrame which chunks its rows into groups of N, without knowing the size of the DataFrame.
Example:
If N = 100 and the DataFrame has 800 rows, it chunks it into 8 buckets of 100. If the DataFrame has 950 rows, it will chunk it into 9 buckets of 100, and 1 bucket of 50. It should not require a scan/call to .count().
The example DataFrames are analogous to the ones above.
(meta: should I make a new question for this variation? I feel like "NTILE with a fixed bucket size" is a more elegant problem and probably more common than my original use-case)
If I understand you correctly, this can be handled by using an SQL expression:
import org.apache.spark.sql.functions.{expr,row_number,desc}
import org.apache.spark.sql.expressions.Window
// set up WindowSpec
val w1 = Window.partitionBy("feed").orderBy(desc("count"))
val L = List(2, 2, 4, 10, 10, -1)
// dynamically create SQL expression from the List `L` to map row_number into group-id
var sql_expr = "CASE"
var running_total = 0
for (i <- 1 to L.size) {
  running_total += L(i-1)
  sql_expr += (if (L(i-1) > 0) s" WHEN rn <= $running_total THEN $i " else s" ELSE $i END")
}
println(sql_expr)
//CASE WHEN rn <= 2 THEN 1 WHEN rn <= 4 THEN 2 WHEN rn <= 8 THEN 3 WHEN rn <= 18 THEN 4 WHEN rn <= 28 THEN 5 ELSE 6 END
val df_new = df
  .withColumn("rn", row_number().over(w1))
  .withColumn("group", expr(sql_expr))
  .drop("rn")
df_new.show
+----+----------+-----+-----+
|feed| artist|count|group|
+----+----------+-----+-----+
|gwas| Drax| 7| 1|
|gwas| Calibre| 4| 1|
|gwas|Aphex Twin| 1| 2|
|gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
| y|Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y|Puts Marie| 1| 2|
+----+----------+-----+-----+
For a fixed N, just cast (row_number-1)/N + 1 to int:
val N = 2
val df_new = df.withColumn("group", ((row_number().over(w1)-1)/N+1).cast("int"))
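For the EDIT variation (chunking the whole DataFrame into buckets of N without calling .count()), the same cast trick works over a global window. A sketch only, assuming the rows are ordered by count descending (pick whatever ordering you need); note that an un-partitioned window moves all rows through a single task:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, desc}

val N = 100
// no partitionBy, so row numbers (and thus bucket ids) run across the whole DataFrame
val wGlobal = Window.orderBy(desc("count"))

val chunked = df.withColumn("group", ((row_number().over(wGlobal) - 1) / N + 1).cast("int"))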
This could work:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val bucketDef = List(2, 2, 4, 10, 10)
val bucketRunsum = bucketDef.scanLeft(1)(_ + _) // calc running sum
// maps a row-number to a bucket
val indexBucketMapping = bucketRunsum.zip(bucketRunsum.tail)
  .zipWithIndex
  .map { case ((start, end), index) => ((start, end), index + 1) } // make index start at 1
// gives List(((1,3),1), ((3,5),2), ((5,9),3), ((9,19),4), ((19,29),5))

// udf to assign a bucket to a given row-number
val calcBucket = udf((rnb: Long) => indexBucketMapping
  .find { case ((start, end), _) => start <= rnb && rnb < end }
  .map(_._2)                                  // get index
  .getOrElse(indexBucketMapping.last._2 + 1)  // falls into the last (open-ended) bucket
)

df
  .withColumn("group", calcBucket(row_number().over(Window.partitionBy($"feed").orderBy($"count".desc))))
Alternatively (without a UDF), construct a DataFrame which maps a row-number to a bucket and then join:
val bucketSizeDef = List(2, 2, 4, 10, 10)
val bucketDef = (1 +: bucketSizeDef).zipWithIndex
  .map { case (bs, index) => (bs, index + 1) } // make index start at 1
  .toDF("bucketSize", "group")
  .withColumn("i", sum($"bucketSize").over(Window.orderBy($"group")))
  .withColumn("i_to", coalesce(lead($"i", 1).over(Window.orderBy($"group")), lit(Long.MaxValue)))
  .drop($"bucketSize")
bucketDef.show()
gives:
+-----+---+-------------------+
|group| i| i_to|
+-----+---+-------------------+
| 1| 1| 3|
| 2| 3| 5|
| 3| 5| 9|
| 4| 9| 19|
| 5| 19| 29|
| 6| 29|9223372036854775807|
+-----+---+-------------------+
then join to df:
df
  .withColumn("rnb", row_number().over(Window.partitionBy($"feed").orderBy($"count".desc)))
  .join(broadcast(bucketDef), $"rnb" >= $"i" and $"rnb" < $"i_to")
  .drop("rnb", "i", "i_to")
I wish to produce the expected output shown below.
My code:
import numpy as np
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)

# UDF that maps each value to its sign (-1, 0 or 1)
sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()
I want to create another column (numbering) that increases by 1 whenever the sign is different from the previous row.
Expected output:
id values sign numbering
0 0 10 1 1
1 1 5 1 1
2 2 3 1 1
3 3 -1 -1 2
4 4 0 0 3
5 5 -10 -1 4
6 6 -4 -1 4
7 7 10 1 5
8 8 0 0 6
9 9 10 1 7
Here's a way you can do it using a custom function:
import pyspark.sql.functions as F

# compare the next value with previous
def f(x):
    c = 1
    l = [c]
    last_value = [x[0]]
    for i in x[1:]:
        if i == last_value[-1]:
            l.append(c)
        else:
            c += 1
            l.append(c)
        last_value.append(i)
    return l
# take sign column as a list
sign_list = sp_dataframe.select('sign').rdd.map(lambda x: x.sign).collect()
# create a new dataframe using the output
sp = spark.createDataFrame(pd.DataFrame(f(sign_list), columns=['numbering']))
Appending a list as a column to a dataframe is a bit tricky in PySpark. For this we'll need to create a dummy row_idx to join the dataframes.
# create dummy indexes
sp_dataframe = sp_dataframe.withColumn("row_idx", F.monotonically_increasing_id())
sp = sp.withColumn("row_idx", F.monotonically_increasing_id())
# join the dataframes
final_df = (sp_dataframe
.join(sp, sp_dataframe.row_idx == sp.row_idx)
.orderBy('id')
.drop("row_idx"))
final_df.show()
+---+------+----+---------+
| id|values|sign|numbering|
+---+------+----+---------+
| 0| 10| 1| 1|
| 1| 5| 1| 1|
| 2| 3| 1| 1|
| 3| -1| -1| 2|
| 4| 0| 0| 3|
| 5| -10| -1| 4|
| 6| -4| -1| 4|
| 7| 10| 1| 5|
| 8| 0| 0| 6|
| 9| 10| 1| 7|
+---+------+----+---------+
Hi, I have two different DataFrames:
scala> d1.show()
+--------+-------+
|   fecha|eventos|
+--------+-------+
|20180404|      3|
|20180405|      7|
|20180406|     10|
|20180409|      4|
    ....
scala> d1.count()
res3: Long = 60

scala> d2.show()
+--------+----------+
|   fecha|TotalEvent|
+--------+----------+
|       0|     23534|
|20180322|        10|
|20180326|        50|
|20180402|         6|
|20180403|       118|
|20180404|      1110|
    ...
scala> d2.count()
res7: Long = 74
But I would like to join them by fecha without losing data, and then create a new column with the calculation (TotalEvent - eventos)*100/TotalEvent.
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find a way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add a new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
("20180404", 3),
("20180405", 7),
("20180406", 10),
("20180409", 4)
).toDF("fecha", "eventos")
val df2 = Seq(
("0", 23534),
("20180322", 10),
("20180326", 50),
("20180402", 6),
("20180403", 118),
("20180404", 1110),
("20180405", 100),
("20180406", 100)
).toDF("fecha", "TotalEvent")
df2.
join(df1, Seq("fecha"), "left_outer").
withColumn( "KPI",
round( when($"eventos".isNull, 100.0).
otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
2
)
).show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
I would do this in several steps: first join, then compute the calculated column, then fill in the NA values:
// rename fecha in df2 to avoid ambiguous column names after the join
val df2a = df2.withColumnRenamed("fecha", "fecha2")
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3
  .withColumn("KPI", ($"TotalEvent" - $"eventos") / $"TotalEvent" * 100)
  .na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
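As an optional follow-up (just a sketch, assuming the kpi DataFrame built above), you can collapse the two date columns left by the outer join into a single fecha:
import org.apache.spark.sql.functions.coalesce

// take fecha where it exists, otherwise fall back to fecha2, then drop the duplicate
val kpiTidy = kpi
  .withColumn("fecha", coalesce($"fecha", $"fecha2"))
  .drop("fecha2")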
I solved the problem by mixing both of the suggestions I received.
val dfKPI = d1
  .join(d2, Seq("cliente", "fecha"), "outer")
  .orderBy("fecha")
  .withColumn("KPI", round(when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"), 2))