How to remove Spark values that are out of sequence - scala

I need to remove some values from a dataframe that are not in the right place.
I have the following dataframe, for example:
+-----+-----+
|count|PHASE|
+-----+-----+
|    1|    3|
|    2|    3|
|    3|    6|
|    4|    6|
|    5|    8|
|    6|    4|
|    7|    4|
|    8|    4|
+-----+-----+
I need to remove the 6 and 8 values from the dataframe because of the following rules:
phase === 3 and lastPhase.isNull
phase === 4 and lastPhase.isin(2, 3)
phase === 6 and lastPhase.isin(4, 5)
phase === 8 and lastPhase.isin(6, 7)
This is a huge dataframe and those misplaced values can happen many times.
Could you help with that, please?
Expected output:
+-----+-----+------+
|count|PHASE|CHANGE|
+-----+-----+------+
|    1|    3|     3|
|    2|    3|     3|
|    3|    6|     3|
|    4|    6|     3|
|    5|    8|     3|
|    6|    4|     4|
|    7|    4|     4|
|    8|    4|     4|
+-----+-----+------+
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rows = Seq(
  Row(1, 3),
  Row(2, 3),
  Row(3, 6),
  Row(4, 6),
  Row(5, 8),
  Row(6, 4),
  Row(7, 4),
  Row(8, 4)
)
val schema = StructType(
  Seq(StructField("count", IntegerType), StructField("PHASE", IntegerType))
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(rows),
  schema
)
Thanks in advance!

If I correctly understood your question, you want to populate the column CHANGE as follows:
For a dataframe sorted by the count column, for each row, if the value of the PHASE column matches the defined set of rules, put that value in the CHANGE column. If the value doesn't match the rules, put the latest valid PHASE value in the CHANGE column.
To do so, you can use a user-defined aggregate function to populate the CHANGE column over a window ordered by the count column.
First, define an Aggregator object whose buffer holds the last valid phase, and implement your set of rules in its reduce function:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

object LatestValidPhase extends Aggregator[Integer, Integer, Integer] {
  // The buffer holds the last valid phase seen so far; start with null (no phase yet).
  def zero: Integer = null

  // Keep the incoming phase only if it is a valid transition from the last valid phase;
  // otherwise carry the last valid phase forward. The null checks avoid unboxing a null buffer.
  def reduce(lastPhase: Integer, phase: Integer): Integer = {
    if (lastPhase == null && phase == 3) {
      phase
    } else if (lastPhase != null && Set(2, 3).contains(lastPhase) && phase == 4) {
      phase
    } else if (lastPhase != null && Set(4, 5).contains(lastPhase) && phase == 6) {
      phase
    } else if (lastPhase != null && Set(6, 7).contains(lastPhase) && phase == 8) {
      phase
    } else {
      lastPhase
    }
  }

  def merge(b1: Integer, b2: Integer): Integer = {
    throw new NotImplementedError("should not be used as a general aggregation")
  }

  def finish(reduction: Integer): Integer = reduction

  def bufferEncoder: Encoder[Integer] = Encoders.INT
  def outputEncoder: Encoder[Integer] = Encoders.INT
}
Then transform it into an aggregate user-defined function that you apply over your window ordered by the count column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val latest_valid_phase = udaf(LatestValidPhase)
val window = Window.orderBy("count")
df.withColumn("CHANGE", latest_valid_phase(col("PHASE")).over(window))

Related

Selecting specific rows from different dataframes within a map scope

Hello, I am new to Spark and Scala, and I have three similar dataframes like the following:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
|    Chad|      1|      0|      5|
|Paraguay|      4|      6|      3|
|  Russia|      0|      0|      1|
+--------+-------+-------+-------+
df2 and df3 have exactly the same structure, just with different values.
I would like to apply a function to each row of df1, but I also need to select the same row (using Country as the key) from the other two dataframes, because I need the selected rows as input arguments for the function I want to apply.
I thought of using
df1.map { r =>
  val selectedRowDf2 = selectRow using r at column "Country" ...
  val selectedRowDf3 = selectRow using r at column "Country" ...
  r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3))
}
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
A possible approach is to prefix each dataframe's columns with a key to uniquely identify them, and then merge all the dataframes into a single dataframe using the country column. The desired operation can then be performed on each row of the merged dataframe.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def appendColWithKey(df: DataFrame, key: String) = {
  var newdf = df
  df.schema.foreach(s => {
    newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
  })
  newdf
}

val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")

val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))

val finaldf = tempdf
  .drop("key2_country")
  .drop("key3_country")
  .withColumnRenamed("key1_country", "country")

finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
|    Chad|           1|           0|           5|           1|           0|           5|           1|           0|           5|
|Paraguay|           4|           6|           3|           4|           6|           3|           4|           6|           3|
|  Russia|           0|           0|           1|           0|           0|           1|           0|           0|           1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
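From here you can apply your per-country function by mapping over finaldf. A rough sketch (the column types, the chosen date column and the combining logic are assumptions for illustration, not from the original post):
import spark.implicits._

val result = finaldf.map { r =>
  val country = r.getAs[String]("country")
  // read the same date column from each original dataframe's renamed copy
  val v1 = r.getAs[Int]("key1_1/22/20")
  val v2 = r.getAs[Int]("key2_1/22/20")
  val v3 = r.getAs[Int]("key3_1/22/20")
  (country, v1 + v2 + v3) // stand-in for the real functionToApply
}.toDF("country", "combined")
result.show()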

create a simple DF after reading a parquet file

I am a new Scala developer and I have run into some problems writing simple code in Spark Scala. I have this DF that I get after reading a parquet file:
ID Timestamp
1 0
1 10
1 11
2 20
3 15
And what I want is to create a DF result from the first DF (if the ID = 2 for example, the timestamp should be multiplied by two). So, I created a new class:
case class OutputData(id: bigint, timestamp:bigint)
And here is my code:
val tmp = spark.read.parquet("/user/test.parquet").select("id", "timestamp")
val outputData: OutputData = tmp.map(x: Row => {
  var time_result
  if (x.getString("id") == 2) {
    time_result = x.getInt(2) * 2
  }
  if (x.getString("id") == 1) {
    time_result = x.getInt(2) + 10
  }
  OutputData2(x.id, time_result)
})
case class OutputData2(id: bigint, timestamp:bigint)
Can you help me, please?
To make the implementation easier, you can cast your df to a Dataset using a case class, then process that Dataset with object notation instead of accessing the Row fields each time you want the value of some element. Apart from that, since your input and output have the same format, you can use the same case class instead of defining two.
The code looks like:
import spark.implicits._ // needed for .toDF and .as[]

// Sample input data
val df = Seq(
  (1, 0L),
  (1, 10L),
  (1, 11L),
  (2, 20L),
  (3, 15L)
).toDF("ID", "Timestamp")
df.show()

// Case class as helper
case class OutputData(ID: Integer, Timestamp: Long)

val newDF = df.as[OutputData].map(record => {
  // identify your id and apply logic based on that
  val newTime = if (record.ID == 2) record.Timestamp * 2 else record.Timestamp
  OutputData(record.ID, newTime) // return the same format with updated values
})
newDF.show()
The output of the above code:
// original
+---+---------+
| ID|Timestamp|
+---+---------+
|  1|        0|
|  1|       10|
|  1|       11|
|  2|       20|
|  3|       15|
+---+---------+
// new one
+---+---------+
| ID|Timestamp|
+---+---------+
|  1|        0|
|  1|       10|
|  1|       11|
|  2|       40|
|  3|       15|
+---+---------+

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of values in each column like this -
So far, I've tried this (in Spark 2.2.0) but it throws a stack trace -
val get_count: (String => Long) = (c: String) => {
  df.groupBy("id")
    .agg(sum(c) as "s")
    .select("s")
    .collect()(0)
    .getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives for accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
             (0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
  .toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
|  5| 10|  4|
+---+---+---+
We almost have what we need; we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
  .toDF("column", "sum")
  .show
+------+---+
|column|sum|
+------+---+
|    C1|  5|
|    C2| 10|
|    C3|  4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations contained in aggs. One would expect something simpler like df.agg(aggs: _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)
This is probably to ensure that at least one column is used, which is why you need to split aggs into aggs.head and aggs.tail.
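As a quick illustration (this snippet is not from the original answer; it only shows which call shapes compile with the aggs defined above):
// df.agg(aggs: _*)              // does not compile: agg has no overload taking a single Seq[Column]
df.agg(aggs.head, aggs.tail: _*) // compiles: the first Column is passed explicitly, the rest as varargs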
What I do is define a method to create a struct from the desired values:
def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives the list of columns to transpose (your last 3 columns in your case) and transforms them into structs with the column name as key and the column value as value.
Then use that method to create the structs and process them as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function to get the desired columns as structs, then deconstruct the struct to have a key column and a value column in each row.
Then you can group by the column name and sum the values.
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
|           1|  1|  0|  0|  1|
|           1|  1|  5|  0|  0|
|           0|  1|  0|  6|  0|
|           0|  1|  0|  4|  3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
|    c1|  5|
|    c3|  4|
|    c2| 10|
+------+---+

Apply UDF function to Spark window where the input paramter is a list of all column values in range

I would like to build a moving average over each row in a window, let's say the previous 10 rows. But if there are fewer than 10 rows available, I would like to insert a 0 in the resulting row (new column).
So what I'm trying to achieve is to use a UDF in an aggregate window with an input parameter List() (or whatever superclass) which holds the values of all available rows.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List(1, 2, 3, 4) if no more rows are available, and take this as the input parameter for the UDF function. The UDF function should return a calculated value, or 0 if fewer than 10 rows are available.
The above code terminates with: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition, without the need for a UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
  toDF("id", "amount")

val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)

df.
  withColumn("slidingCount", count($"amount").over(winSpec)).
  withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
    otherwise(avg($"amount").over(winSpec))
  ).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// |  1|  10.0|           1|       0.0|
// |  1|  11.0|           2|       0.0|
// |  1|  12.0|           3|       0.0|
// |  1|  13.0|           4|      11.5|
// |  1|  14.0|           4|      12.5|
// |  1|  15.0|           4|      13.5|
// |  1|  16.0|           4|      14.5|
// |  1|  17.0|           4|      15.5|
// |  2|  20.0|           1|       0.0|
// |  2|  21.0|           2|       0.0|
// |  2|  22.0|           3|       0.0|
// |  2|  23.0|           4|      21.5|
// |  2|  24.0|           4|      22.5|
// |  2|  25.0|           4|      23.5|
// |  2|  26.0|           4|      24.5|
// |  2|  27.0|           4|      25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf { (ls: Seq[Double]) =>
  val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)) {
    case ((a, i), next) => (a + (next - a) / i, i + 1)
  }
  if (count <= n) 0.0 else avg // Expand/Modify this for specific requirement
}

// To apply the UDF:
df.
  withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
  show
Note that, unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large when passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for your specific requirement, consider passing them to the UDF instead.
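A rough sketch of that alternative, passing the window sum and count into the UDF instead of the collected list (the helper name avgOrZero is made up for illustration):
val avgOrZero = udf { (s: Double, cnt: Long, n: Int) =>
  if (cnt < n) 0.0 else s / cnt
}

df.withColumn("average",
  avgOrZero(sum($"amount").over(winSpec), count($"amount").over(winSpec), lit(slidingWin))
).show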
In general, especially if the data at hand is already in DataFrame format, it'll perform and scale better to use the built-in DataFrame API and take advantage of Spark's execution-engine optimizations than to use a user-defined UDF/UDAF. You might be interested in reading this article re: the advantages of the DataFrame/Dataset API over UDF/UDAF.

Scala & Spark: Add value to every cell of every row

I have two DataFrames:
scala> df1.show()
+----+----+----+---+----+
|col1|col2|col3|   |colN|
+----+----+----+   +----+
|   2|null|   3|...|   4|
|   4|   3|   3|   |   1|
|   5|   2|   8|   |   1|
+----+----+----+---+----+
scala> df2.show() // has one row only (avg())
+----+----+----+---+----+
|col1|col2|col3|   |colN|
+----+----+----+   +----+
| 3.6|null| 4.6|...|   2|
+----+----+----+---+----+
and a constant val c : Double = 0.1.
Desired output is a df3: DataFrame given by
df3(i, j) = df1(i, j) * (1 - c) + df2(1, j) * c,
for i = 1..n and j = 1..m, with n = numberOfRow and m = numberOfColumn.
I already looked through the list of sql.functions and failed to implement it myself with some nested map operations (fearing performance issues). One idea I had was:
val cBc = spark.sparkContext.broadcast(c)
val df2Bc = spark.sparkContext.broadcast(averageObservation)
df1.rdd.map(row => {
  for (colIdx <- 0 until row.length) {
    val correspondingDf2value = df2Bc.value.head().getDouble(colIdx)
    row.getDouble(colIdx) * (1 - cBc.value) + correspondingDf2value * cBc.value
  }
})
Thank you in advance!
(cross)join combined with select is more than enough and will be much more efficient than mapping. Required imports:
import org.apache.spark.sql.functions.{broadcast, col, lit}
and expression:
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
join and select:
df1.crossJoin(broadcast(df2)).select(exprs: _*)
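For completeness, a minimal self-contained sketch of this approach with dummy two-column data (the values and column names below are made up for illustration; the real df1/df2 come from the question):
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val c: Double = 0.1
val df1 = Seq((2.0, 3.0), (4.0, 3.0), (5.0, 2.0)).toDF("col1", "col2")
val df2 = Seq((3.6, 2.6)).toDF("col1", "col2") // single row, e.g. the column averages

// each output cell mixes the df1 value (weight 1 - c) with the df2 value (weight c)
val exprs = df1.columns.map { x => (df1(x) * (1 - c) + df2(x) * c).alias(x) }
df1.crossJoin(broadcast(df2)).select(exprs: _*).show()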