scala how to drop lines from df based on the column value - scala

i have Data frame with these values i need to filtered min date (groupby( id,count) and summary should change as equal to more
id secid count date summary
1 2 9 20170608 equal
1 3 9 20160608 equal
2 3 8 20170608 less
3 3 9 20160608 equal
I need to show
id secid count date summary
1 2 9 20170608 more
2 3 8 20170608 less
3 3 9 20160608 equal

You can use groupBy to group id and count together and then use when and otherwise to change your summary field to more in case you have more date for the same id and count.
//create your original DF
val df = Seq((1, 2, 9, 20170608, "equal"),
(1, 3, 9, 20160608, "equal"),
(2, 3, 8, 20170608, "less"),
(3, 3, 9, 20160608, "equal"),
(1, 2, 8, 20170608, "random"),
(1, 2, 8, 20170608, "random"))
.toDF("id", "secid", "count", "date", "summary")
//Create a UDF to find the length of datelist after grouping
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary.equals("equal"))
//apply groupby and other operations to get the result
df.groupBy("id", "count")
.agg(collect_list("date").as("datelist"),
max("date").as("date"),
first("secid").as("secid"),
first("summary").as("summary"))
.withColumn("summary",
when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
.drop("datelist")
.show()
// output
// +---+-----+--------+-----+-------+
// | id|count| date|secid|summary|
// +---+-----+--------+-----+-------+
// | 1| 9|20170608| 2| more|
// | 1| 8|20170608| 2| random|
// | 3| 9|20160608| 3| equal|
// | 2| 8|20170608| 3| less|
// +---+-----+--------+-----+-------+

Related

Custom order pyspark dataframe usign a column

I have a pyspark dataframe df :
I want to proprtize items based on Type column this order : AAIC>AAFC>TBIC>TBFC and among them uisng RANK column i.e items with lower rank prioritzed within above order groups.
Any values in Type column other than AAIC or AAFC TBIC or TBFC I want to relabel them as NON
ITEM
Type
RANK
1
AAIC
11
2
AAFC
8
3
TBIC
2
4
TBFC
1
5
XYZ
5
6
AAIC
7
7
JHK
10
8
SWE
3
9
TBIC
4
10
AAFC
9
11
AAFC
6
Desired pyspark dataframe df :-
ITEM
Type
RANK
NEW_RANK
6
AAIC
7
1
1
AAIC
11
2
11
AAFC
6
3
2
AAFC
8
4
10
AAFC
9
5
3
TBIC
2
6
9
TBIC
4
7
4
TBFC
1
8
8
NON
3
9
5
NON
5
10
7
NON
10
11
You may check this code:
import pyspark.sql.functions as F
from pyspark.sql import Window
inputData = [
(1, "AAIC", 11),
(2, "AAFC", 8),
(3, "TBIC", 2),
(4, "TBFC", 1),
(5, "XYZ", 5),
(6, "AAIC", 7),
(7, "JHK", 10),
(8, "SWE", 3),
(9, "TBIC", 4),
(10, "AAFC", 9),
(11, "AAFC", 6),
]
inputDf = spark.createDataFrame(inputData, schema=["item", "type", "rank"])
preprocessedDf = inputDf.withColumn(
"type",
F.when(
F.col("type").isin(["AAIC", "AAFC", "TBIC", "TBFC"]), F.col("type")
).otherwise(F.lit("NON")),
).withColumn(
"priority",
F.when(F.col("type") == F.lit("AAIC"), 1).otherwise(
F.when(F.col("type") == F.lit("AAFC"), 2).otherwise(
F.when(F.col("type") == F.lit("TBIC"), 3).otherwise(
F.when(F.col("type") == F.lit("TBFC"), 4).otherwise(F.lit(5))
)
)
),
)
windowSpec = Window.partitionBy().orderBy("priority", "rank")
preprocessedDf.withColumn("NEW_RANK", F.row_number().over(windowSpec)).drop(
"priority"
).show()
Priorities for codes are hardcoded which may be hard to maintain in case of more values. You may want to adjust this part if it needs to be more flexible
I am moving all records to one partition to calculate the correct row_order. Its a common problem, its hard to calculate consistent ids with given order in distributed manner. If your dataset is big, there may be need to think about something else, probably more complicated
output:
+----+----+----+--------+
|item|type|rank|NEW_RANK|
+----+----+----+--------+
| 6|AAIC| 7| 1|
| 1|AAIC| 11| 2|
| 11|AAFC| 6| 3|
| 2|AAFC| 8| 4|
| 10|AAFC| 9| 5|
| 3|TBIC| 2| 6|
| 9|TBIC| 4| 7|
| 4|TBFC| 1| 8|
| 8| NON| 3| 9|
| 5| NON| 5| 10|
| 7| NON| 10| 11|
+----+----+----+--------+

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id
Value
Weights
1
2
4
1
5
2
2
1
4
2
6
2
2
9
4
3
2
4
I need to groupBy by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number - the number inside the array will be repeated as many times as is specified in the column 'Weights' (casting to int is necessary, because array_repeat expects this column to be of int type. After this part the first value of 2 will be transformed into [2,2,2,2].
Then, explode will create a row for every element in the array. So, the line [2,2,2,2] will be transformed into 4 rows, each containing an integer 2.
Then you can calculate statistics, the results will have weights applied, as your dataframe is now transformed according to the weights.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+

Create separate columns from array column in Spark Dataframe in Scala when array is large [duplicate]

This question already has answers here:
Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]
(5 answers)
Closed 4 years ago.
I have two columns: one of type Integer and one of type linalg.Vector. I can convert linalg.Vector to array. Each array has 32 elements. I want to convert each element in the array to a column. So the input is like :
column1 column2
(3, 5, 25, ...., 12) 3
(2, 7, 15, ...., 10) 4
(1, 10, 12, ..., 35) 2
Output should be:
column1_1 column1_2 column1_3 ......... column1_32 column 2
3 5 25 ......... 12 3
2 7 15 ......... 10 4
1 1 0 12 ......... 12 2
Except, in my case there are 32 elements in the array. It is too many to use the method in question Convert Array of String column to multiple columns in spark scala
I tried a few ways and none of it worked. What is the right way to do this?
Thanks a lot.
scala> import org.apache.spark.sql.Column
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]
scala> def getColAtIndex(id:Int): Column = col(s"column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column
scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") //Here, instead of 2, you can give the value of n
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)
scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
| 3| 5| 25| 3|
| 2| 7| 15| 4|
| 1| 10| 12| 2|
+---------+---------+---------+-------+
This can be done best by writing a UserDefinedFunction like:
val getElementFromVectorUDF = udf(getElementFromVector(_: Vector, _: Int))
def getElementFromVector(vec: Vector, idx: Int) = {
vec(idx)
}
You can use it like this then:
df.select(
getElementFromVectorUDF($"column1", 0) as "column1_0",
...
getElementFromVectorUDF($"column1", n) as "column1_n",
)
I hope this helps.

Multiplying two columns from different data frames in spark

I have two dataframes representing the following csv data:
Store Date Weekly_Sales
1 05/02/2010 249
2 12/02/2010 455
3 19/02/2010 415
4 26/02/2010 194
Store Date Weekly_Sales
5 05/02/2010 400
6 12/02/2010 460
7 19/02/2010 477
8 26/02/2010 345
What i'm attempting to do is for each date, read the associated weekly sales for it in both dataframes and find the average of the two numbers. I'm not sure how to accomplish this.
Assuming that you want to have individual store data in the result data set, one approach would be to union the two dataframes and use Window function to calculate average weekly sales (along with the corresponding list of stores, if wanted), as follows:
val df1 = Seq(
(1, "05/02/2010", 249),
(2, "12/02/2010", 455),
(3, "19/02/2010", 415),
(4, "26/02/2010", 194)
).toDF("Store", "Date", "Weekly_Sales")
val df2 = Seq(
(5, "05/02/2010", 400),
(6, "12/02/2010", 460),
(7, "19/02/2010", 477),
(8, "26/02/2010", 345)
).toDF("Store", "Date", "Weekly_Sales")
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy($"Date")
df1.union(df2).
withColumn("Avg_Sales", avg($"Weekly_Sales").over(window)).
withColumn("Store_List", collect_list($"Store").over(window)).
orderBy($"Date", $"Store").
show
// +-----+----------+------------+---------+----------+
// |Store| Date|Weekly_Sales|Avg_Sales|Store_List|
// +-----+----------+------------+---------+----------+
// | 1|05/02/2010| 249| 324.5| [1, 5]|
// | 5|05/02/2010| 400| 324.5| [1, 5]|
// | 2|12/02/2010| 455| 457.5| [2, 6]|
// | 6|12/02/2010| 460| 457.5| [2, 6]|
// | 3|19/02/2010| 415| 446.0| [3, 7]|
// | 7|19/02/2010| 477| 446.0| [3, 7]|
// | 4|26/02/2010| 194| 269.5| [4, 8]|
// | 8|26/02/2010| 345| 269.5| [4, 8]|
// +-----+----------+------------+---------+----------+
You should first merge them using union function. Then grouping on Date column find the average ( using avg inbuilt function) as
import org.apache.spark.sql.functions._
df1.union(df2)
.groupBy("Date")
.agg(collect_list("Store").as("Stores"), avg("Weekly_Sales").as("average_weekly_sales"))
.show(false)
which should give you
+----------+------+--------------------+
|Date |Stores|average_weekly_sales|
+----------+------+--------------------+
|26/02/2010|[4, 8]|269.5 |
|12/02/2010|[2, 6]|457.5 |
|19/02/2010|[3, 7]|446.0 |
|05/02/2010|[1, 5]|324.5 |
+----------+------+--------------------+
I hope the answer is helpful

Adding a column of rowsums across a list of columns in Spark Dataframe

I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 10 27
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? Based off of this answer which is basically what I want but it is using the python API instead of scala (Add column sum as new column in PySpark dataframe) I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columstosum.head, columnstosum.tail: _*).sum)
This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.
You should try the following:
import org.apache.spark.sql.functions._
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
("a", 5, 7, 9, 12, 13),
("b", 6, 4, 3, 20, 17),
("c", 4, 9, 4, 6 , 9),
("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col
def sum_(*cols):
return reduce(add, cols, lit(0))
columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
select("*", sum_(*columnstosum))
Both will default to NA if there is a missing value in the row. You can use DataFrameNaFunctions.fill or coalesce function to avoid that.
I assume you have a dataframe df. Then you can sum up all cols except your ID col. This is helpful when you have many cols and you don't want to manually mention names of all columns like everyone mentioned above. This post has the same answer.
val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
Here's an elegant solution using python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will influence something similar in Spark ... anyone?.