Sum column based on another column's values - Scala

I have a dataframe which represents a grayscale image:
+---+---+-----+
| w| h|color|
+---+---+-----+
| 0| 0|255.0|
| 0| 1|255.0|
| 0| 2|255.0|
| 0| 3|255.0|
| 0| 4|255.0|
| 0| 5|255.0|
| 0| 6|255.0|
| 0| 7|255.0|
| 0| 8|255.0|
| 0| 9|255.0|
| 0| 10|255.0|
| 0| 11|255.0|
| 1| 0|255.0|
| 1| 1|255.0|
| 1| 2|255.0|
| 1| 3|255.0|
| 1| 4|255.0|
| 1| 5|255.0|
| 1| 6|255.0|
| 1| 7|255.0|
+---+---+-----+
only showing top 20 rows
For each row I need to sum "color" over all rows whose "w" and "h" values lie in the range from the current value to the current value plus a number.
To better convey the idea, a possible (pseudo-code) solution would look like this:
val windowW = Window.rangeBetween(Window.currentRow, Window.currentRow + num1)
val windowH = Window.rangeBetween(Window.currentRow, Window.currentRow + num2)
df.withColumn("color_sum", sum(col("color")).over(col("w").windowW and col("h").windowH))
Could you please give me some hints on how to achieve this calculation?
Expected output for the very first row:
+---+---+-----+----------+
| w| h|color|sum(color)|
+---+---+-----+----------+
| 0| 0|255.0| 1020|
+---+---+-----+----------+
Where num1 and num2 both equal 1.
That means sum is taken from rows:
(0, 0), (0, 1), (1, 0), (1, 1)
For row (1, 1) sum would be taken from rows (1, 1), (1, 2), (2, 1), (2, 2).

You can't use window functions to apply an aggregate over 2-dimensional data with these constraints.
You can use a self-join (the join method) to find the matching rows instead:
df.as("df1")
.join(df.as("df2"),
((col("df1.w") - col("df2.w") <= 1) && col("df1.w") - col("df2.w") >= -1) &&
((col("df1.h") - col("df2.h") <= 1) && col("df1.h") - col("df2.h") >= -1),
"inner"
)
.groupBy("df1.w", "df1.h")
.agg(min("df1.color") as "color", sum("df2.color") as "sum")
.orderBy("w", "h")
.show()
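If num1 and num2 need to stay configurable, the same self-join can be wrapped in a small helper. This is only a minimal sketch along the lines of the answer above; the name windowSum is mine, not from the original answer:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sums "color" over the (num1 + 1) x (num2 + 1) block starting at each (w, h).
def windowSum(df: DataFrame, num1: Int, num2: Int): DataFrame =
  df.as("df1")
    .join(df.as("df2"),
      (col("df2.w") >= col("df1.w")) && (col("df2.w") <= col("df1.w") + num1) &&
        (col("df2.h") >= col("df1.h")) && (col("df2.h") <= col("df1.h") + num2),
      "inner")
    .groupBy("df1.w", "df1.h")
    .agg(min("df1.color") as "color", sum("df2.color") as "color_sum")
    .orderBy("w", "h")
Keep in mind that a self-join on a range condition can be expensive on a large image, since Spark has to compare row pairs rather than slide a window.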

Related

Calculate the number of columns with missing values per row in PySpark

Let's say we have the following data set:
columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0),(2, None, None),(3, None,9)]
df = spark.createDataFrame(values,columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
| 2|null|null|
| 3|null| 9|
+----+----+----+
I would like to calculate the number ("miss_nb") and the percentage ("miss_pt") of columns with missing values per row and get the following table:
+----+-------+-------+
| id|miss_nb|miss_pt|
+----+-------+-------+
| 1| 0| 0.00|
| 2| 2| 0.67|
| 3| 1| 0.33|
+----+-------+-------+
The number of columns should not be fixed (the solution should work for any list of columns).
How to do it?
Thanks!

If-If statement Scala Spark

I have a dataframe for which I have to create a new column based on values in the already existing columns. The catch is, I can't write a CASE statement, because a CASE only moves on to the next WHEN if the previous condition is not satisfied, whereas I need a value for every condition that matches. E.g. consider this dataframe:
+-+-----+-+
|A|B |C|
+-+-----+-+
|1|true |1|-----> Condition 1 and 2 is satisfied Here
|1|true |0|-----> Condition 1 is satisfied here
|1|false|1|
|2|true |1|
|2|true |0|
+-+-----+-+
Consider this CASE statement:
CASE WHEN A = 1 and B = 'true' then 'A'
WHEN A = 1 and B = 'true' and C=1 then 'B'
END
It gives me no row with value 'B', because those rows already satisfy the first WHEN.
Expected output:
+-+-----+-+----+
|A|B |C|D |
+-+-----+-+----+
|1|true |1|A |
|1|true |1|B |
|1|true |0|A |
|1|false|1|null|
|2|true |1|null|
|2|true |0|null|
+-+-----+-+----+
I know I can derive this in 2 separate dataframes and then union them. But I am looking for a more efficient solution.
Creating the dataframe:
val df1 = Seq((1, true, 1), (1, true, 0), (1, false, 1), (2, true, 1), (2, true, 0)).toDF("A", "B", "C")
df1.show()
// +---+-----+---+
// | A| B| C|
// +---+-----+---+
// | 1| true| 1|
// | 1| true| 0|
// | 1|false| 1|
// | 2| true| 1|
// | 2| true| 0|
// +---+-----+---+
The code:
val condition1 = ($"A" === 1) && ($"B" === true)
val condition2 = condition1 && ($"C" === 1)
// `when` without `otherwise` yields null where the condition does not hold
val arr1 = array(when(condition1, "A"), when(condition2, "B"))
// drop the second element when it is null, so explode does not duplicate the row
val arr2 = when(element_at(arr1, 2).isNull, slice(arr1, 1, 1)).otherwise(arr1)
val df2 = df1.withColumn("D", explode(arr2))
df2.show()
// +---+-----+---+----+
// | A| B| C| D|
// +---+-----+---+----+
// | 1| true| 1| A|
// | 1| true| 1| B|
// | 1| true| 0| A|
// | 1|false| 1|null|
// | 2| true| 1|null|
// | 2| true| 0|null|
// +---+-----+---+----+
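If you are on Spark 3.x, a slightly more general variant of the same idea is to build one when per condition, drop the nulls with the filter higher-order function, and fall back to a single-null array when nothing matched. This is only a sketch of that approach; the conditions list and the names below are mine, not from the answer above:
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession in scope named `spark`

val conditions = Seq(
  (($"A" === 1) && ($"B" === true), "A"),
  (($"A" === 1) && ($"B" === true) && ($"C" === 1), "B")
)

// one array element per condition; non-matching conditions produce null
val labels = array(conditions.map { case (c, label) => when(c, label) }: _*)
// keep only the labels that matched
val matched = filter(labels, x => x.isNotNull)
// rows with no match explode a single-null array, so they are preserved with D = null
val df3 = df1.withColumn(
  "D",
  explode(when(size(matched) === 0, array(lit(null).cast("string"))).otherwise(matched))
)
The result is the same as above; the advantage is only that adding a third or fourth condition means adding one entry to conditions rather than another nested when.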

How to remove outliers from multiple columns in pyspark using mean and standard deviation

I have the below data frame and I want to remove outliers from the defined columns, in this example price and income. Outliers should be removed for each group of data; in this example the groups are given by the 'cd' and 'segment' columns. Outliers should be removed based on 5 standard deviations.
data = [
('a', '1',20,10),
('a', '1',30,16),
('a', '1',50,91),
('a', '1',60,34),
('a', '1',200,23),
('a', '2',33,87),
('a', '2',86,90),
('a','2',89,35),
('a', '2',90,24),
('a', '2',40,97),
('a', '2',1,21),
('b', '1',45,96),
('b', '1',56,99),
('b', '1',89,23),
('b', '1',98,64),
('b', '2',86,42),
('b', '2',45,54),
('b', '2',67,95),
('b','2',86,70),
('b', '2',91,64),
('b', '2',2,53),
('b', '2',4,87)
]
data = (spark.createDataFrame(data, ['cd','segment','price','income']))
I have used the code below to remove outliers but this would work only for one column.
mean_std = (
    data
    .groupBy('cd', 'segment')
    .agg(
        *[f.mean(colName).alias('{}{}'.format('mean_', colName)) for colName in ['price']],
        *[f.stddev(colName).alias('{}{}'.format('stddev_', colName)) for colName in ['price']])
)

mean_columns = ['mean_price']
std_columns = ['stddev_price']

upper = mean_std
for col_1 in mean_columns:
    for col_2 in std_columns:
        if col_1 != col_2:
            name = col_1 + '_upper_limit'
            upper = upper.withColumn(name, f.col(col_1) + f.col(col_2) * 5)

lower = upper
for col_1 in mean_columns:
    for col_2 in std_columns:
        if col_1 != col_2:
            name = col_1 + '_lower_limit'
            lower = lower.withColumn(name, f.col(col_1) - f.col(col_2) * 5)

outliers = (data.join(lower,
                      how='left',
                      on=['cd', 'segment'])
            .withColumn('is_outlier_price',
                        f.when((f.col('price') > f.col('mean_price_upper_limit')) |
                               (f.col('price') < f.col('mean_price_lower_limit')), 1)
                        .otherwise(None))
            )
My final output should have a column for each variable stating whether it is 1 = remove or 0 = keep.
Really appreciate any help on this.
Your code works almost 100% fine. All you have to do is replace the single fixed column name with a list of column names and then loop over this list:
numeric_cols = ['price', 'income']

mean_std = \
    data \
    .groupBy('cd', 'segment') \
    .agg(
        *[F.mean(colName).alias('mean_{}'.format(colName)) for colName in numeric_cols],
        *[F.stddev(colName).alias('stddev_{}'.format(colName)) for colName in numeric_cols])
mean_std is now a dataframe with two columns (mean_... and stddev_...) per element of numeric_cols.
In the next step we calculate the lower and upper limit per element of numeric_cols:
mean_std_min_max = mean_std
for colName in numeric_cols:
    meanCol = 'mean_{}'.format(colName)
    stddevCol = 'stddev_{}'.format(colName)
    minCol = 'min_{}'.format(colName)
    maxCol = 'max_{}'.format(colName)
    mean_std_min_max = mean_std_min_max.withColumn(minCol, F.col(meanCol) - 5 * F.col(stddevCol))
    mean_std_min_max = mean_std_min_max.withColumn(maxCol, F.col(meanCol) + 5 * F.col(stddevCol))
mean_std_min_max now contains the two additional columns min_... and max_... per element of numeric_cols.
Finally the join, followed by the calculation of the is_outlier_... columns as before:
outliers = data.join(mean_std_min_max, how='left', on=['cd', 'segment'])
for colName in numeric_cols:
    isOutlierCol = 'is_outlier_{}'.format(colName)
    minCol = 'min_{}'.format(colName)
    maxCol = 'max_{}'.format(colName)
    meanCol = 'mean_{}'.format(colName)
    stddevCol = 'stddev_{}'.format(colName)
    outliers = outliers.withColumn(isOutlierCol,
                                   F.when((F.col(colName) > F.col(maxCol)) | (F.col(colName) < F.col(minCol)), 1)
                                    .otherwise(0))
    outliers = outliers.drop(minCol, maxCol, meanCol, stddevCol)
The last line of the loop is only to clean up and drop the intermediate columns. It might be helpful to comment it out.
The final result is:
+---+-------+-----+------+----------------+-----------------+
| cd|segment|price|income|is_outlier_price|is_outlier_income|
+---+-------+-----+------+----------------+-----------------+
| b| 2| 86| 42| 0| 0|
| b| 2| 45| 54| 0| 0|
| b| 2| 67| 95| 0| 0|
| b| 2| 86| 70| 0| 0|
| b| 2| 91| 64| 0| 0|
+---+-------+-----+------+----------------+-----------------+
only showing top 5 rows
You can use a list comprehension with F.when.
A very simplified example of your problem:
import pyspark.sql.functions as F
tst1 = sqlContext.createDataFrame(
    [(1, 2, 3, 4, 1, 10), (1, 3, 5, 7, 2, 11), (9, 9, 10, 6, 2, 9), (2, 4, 90, 9, 1, 2),
     (2, 10, 3, 4, 1, 7), (3, 5, 11, 5, 7, 8), (10, 9, 12, 6, 7, 9), (3, 6, 99, 8, 1, 9)],
    schema=['val1', 'val1_low_lim', 'val1_upper_lim', 'val2', 'val2_low_lim', 'val2_upper_lim'])
tst_res = tst1.select(tst1.columns + [
    (F.when((F.col(coln) < F.col(coln + '_upper_lim')) & (F.col(coln) > F.col(coln + '_low_lim')), 1)
      .otherwise(0)).alias(coln + '_valid')
    for coln in tst1.columns if "_lim" not in coln])
The results:
tst_res.show()
+----+------------+--------------+----+------------+--------------+----------+----------+
|val1|val1_low_lim|val1_upper_lim|val2|val2_low_lim|val2_upper_lim|val1_valid|val2_valid|
+----+------------+--------------+----+------------+--------------+----------+----------+
| 1| 2| 3| 4| 1| 10| 0| 1|
| 1| 3| 5| 7| 2| 11| 0| 1|
| 9| 9| 10| 6| 2| 9| 0| 1|
| 2| 4| 90| 9| 1| 2| 0| 0|
| 2| 10| 3| 4| 1| 7| 0| 1|
| 3| 5| 11| 5| 7| 8| 0| 0|
| 10| 9| 12| 6| 7| 9| 1| 0|
| 3| 6| 99| 8| 1| 9| 0| 1|
+----+------------+--------------+----+------------+--------------+----------+----------+

How can I add a column to a DataFrame which groups rows in chunks of N? Like NTILE, but with a fixed bucket size

Say I have a DataFrame like:
+------------+-----------+-----+
| feed|artist |count|
+------------+-----------+-----+
| y| Kanye West| 9|
| y| Radiohead| 6|
| y| Zero 7| 3|
| y| Puts Marie| 1|
| gwas| Drax| 7|
| gwas| Calibre| 4|
| gwas| Aphex Twin| 1|
| gwas| Jay Z| 1|
| x| DJ Krush| 2|
| x| Titeknots| 1|
+------------+-----------+-----+
I want to add a new column which chunks the rows into buckets of N rows for each partition (feed).
It seems like the inverse of NTILE to me. NTILE lets you choose the # of buckets but I want to choose the bucket-size instead.
Here's the desired result. Notice how each feed is chunked into groups of N = 2, including the x feed which has just one chunk of 2 rows. (Edit: each partition is ordered by count, so group 1 in each partition will be the rows with the highest value for count)
+------------+-----------+-----+-----+
| feed|artist |count|group|
+------------+-----------+-----+-----+
| y| Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y| Puts Marie| 1| 2|
| gwas| Drax| 7| 1|
| gwas| Calibre| 4| 1|
| gwas| Aphex Twin| 1| 2|
| gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
+------------+-----------+-----+-----+
As a bonus, I would like each bucket to be a different size. E.g. List(2, 2, 4, 10, 10, -1) would mean that the first bucket has 2 rows, the second has 2 rows, the third has 4 rows, etc., and the final bucket (-1) contains the remainder.
EDIT
(Another useful variation)
While implementing the answers, I realized that there's another variation which I would prefer:
Add a column to a DataFrame which chunks its rows into groups of N, without knowing the size of the DataFrame.
Example:
If N = 100 and the DataFrame has 800 rows, it will chunk it into 8 buckets of 100. If the DataFrame has 950 rows, it will chunk it into 9 buckets of 100 and 1 bucket of 50. It should not require a scan/call to .count().
The example DataFrames are analogous to the ones above.
(meta: should I make a new question for this variation? I feel like "NTILE with a fixed bucket size" is a more elegant problem and probably more common than my original use-case)
If I understand you correctly, this can be handled by using an SQL expression:
import org.apache.spark.sql.functions.{expr,row_number,desc}
import org.apache.spark.sql.expressions.Window
// set up WindowSpec
val w1 = Window.partitionBy("feed").orderBy(desc("count"))
val L = List(2, 2, 4, 10, 10, -1)
// dynamically create SQL expression from the List `L` to map row_number into group-id
var sql_expr = "CASE"
var running_total = 0
for(i <- 1 to L.size) {
  running_total += L(i-1)
  sql_expr += (if(L(i-1) > 0) s" WHEN rn <= $running_total THEN $i " else s" ELSE $i END")
}
println(sql_expr)
//CASE WHEN rn <= 2 THEN 1 WHEN rn <= 4 THEN 2 WHEN rn <= 8 THEN 3 WHEN rn <= 18 THEN 4 WHEN rn <= 28 THEN 5 ELSE 6 END
val df_new = df.withColumn("rn", row_number().over(w1)).withColumn("group", expr(sql_expr)).drop("rn")
df_new.show
+----+----------+-----+-----+
|feed| artist|count|group|
+----+----------+-----+-----+
|gwas| Drax| 7| 1|
|gwas| Calibre| 4| 1|
|gwas|Aphex Twin| 1| 2|
|gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
| y|Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y|Puts Marie| 1| 2|
+----+----------+-----+-----+
For a fixed N, just cast (row_number-1)/N + 1 to int:
val N = 2
val df_new = df.withColumn("group", ((row_number().over(w1)-1)/N+1).cast("int"))
This could work:
val bucketDef = List(2, 2, 4, 10, 10)
val bucketRunsum = bucketDef.scanLeft(1)(_ + _) // calc running sum

// maps a row-number to a bucket
val indexBucketMapping = bucketRunsum.zip(bucketRunsum.tail)
  .zipWithIndex
  .map { case ((start, end), index) => ((start, end), index + 1) } // make index start at 1
// gives List(((1,3),1), ((3,5),2), ((5,9),3), ((9,19),4), ((19,29),5))

// udf to assign a bucket to a given row-number
val calcBucket = udf((rnb: Long) => indexBucketMapping
  .find { case ((start, end), _) => start <= rnb && rnb < end }
  .map(_._2) // get index
  .getOrElse(indexBucketMapping.last._2 + 1) // is in last bucket
)

df
  .withColumn("group", calcBucket(row_number().over(Window.partitionBy($"feed").orderBy($"count"))))
Alternatively (without a UDF), construct a DataFrame which maps a row-number to a bucket and then join:
val bucketSizeDef = List(2, 2, 4, 10, 10)
val bucketDef = (1 +: bucketSizeDef).zipWithIndex.map { case (bs, index) => (bs, index + 1) }
  .toDF("bucketSize", "group")
  .withColumn("i", sum($"bucketSize").over(Window.orderBy($"group")))
  .withColumn("i_to", coalesce(lead($"i", 1).over(Window.orderBy($"group")), lit(Long.MaxValue)))
  .drop($"bucketSize")
bucketDef.show()
gives:
+-----+---+-------------------+
|group| i| i_to|
+-----+---+-------------------+
| 1| 1| 3|
| 2| 3| 5|
| 3| 5| 9|
| 4| 9| 19|
| 5| 19| 29|
| 6| 29|9223372036854775807|
+-----+---+-------------------+
then join to df:
df
  .withColumn("rnb", row_number().over(Window.partitionBy($"feed").orderBy($"count")))
  .join(broadcast(bucketDef), $"rnb" >= $"i" and $"rnb" < $"i_to")
  .drop("rnb", "i", "i_to")

How to use Sum on a groupBy result in Spark DataFrames?

Based on the following dataframe:
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
| 1| A| 10|
| 1| A| 5|
| 2| A| 56|
| 2| B| 13|
+---+-----+----+
I would like to obtain the sum of the column Amnt, grouped by ID and Categ.
+---+-----+-----+
| ID|Categ|Count|
+---+-----+-----+
| 1| A| 15 |
| 2| A| 56 |
| 2| B| 13 |
+---+-----+-----+
In SQL I would be doing something like
SELECT ID,
Categ,
SUM (Count)
FROM Table
GROUP BY ID,
Categ;
But how to do this in Scala?
I tried
DF.groupBy($"ID", $"Categ").sum("Count")
But this just changed the Count column name into sum(count) instead of actually giving me the sum of the counts.
Maybe you were summing the wrong column, but your groupBy/sum statement looks syntactically correct to me:
val df = Seq(
  (1, "A", 10),
  (1, "A", 5),
  (2, "A", 56),
  (2, "B", 13)
).toDF("ID", "Categ", "Amnt")
df.groupBy("ID", "Categ").sum("Amnt").show
// +---+-----+---------+
// | ID|Categ|sum(Amnt)|
// +---+-----+---------+
// | 1| A| 15|
// | 2| A| 56|
// | 2| B| 13|
// +---+-----+---------+
EDIT:
To alias the sum(Amnt) column (or, for multiple aggregations), wrap the aggregation expression(s) with agg. For example:
// Rename `sum(Amnt)` as `Sum`
df.groupBy("ID", "Categ").agg(sum("Amnt").as("Sum"))
// Aggregate `sum(Amnt)` and `count(Categ)`
df.groupBy("ID", "Categ").agg(sum("Amnt"), count("Categ"))