Fill null values in a column with the frequency of another column - Scala

In a Spark Structured Streaming context, I have this dataframe:
+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|null     |
|BR300 |1632899155|null     |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|null     |
|BR22  |1632900176|null     |
+------+----------+---------+
I would like to replace the null values with the frequency (count) of the brand in the batch, in order to obtain a dataframe like this one:
+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|2        |
|BR300 |1632899155|2        |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|2        |
|BR22  |1632900176|2        |
+------+----------+---------+
I am using Spark 2.4.3 and SQLContext, with Scala.

With "count" over window function:
val df = Seq(
("BR1", 1632899456, Some(4)),
("BR1", 1632901256, Some(4)),
("BR300", 1632901796, None),
("BR300", 1632899155, None),
("BR90", 1632901743, Some(1)),
("BR1", 1632899933, Some(4)),
("BR1", 1632899756, Some(4)),
("BR22", 1632900776, None),
("BR22", 1632900176, None)
).toDF("brand", "Timestamp", "frequency")
val brandWindow = Window.partitionBy("brand")
val result = df.withColumn("frequency", when($"frequency".isNotNull, $"frequency").otherwise(count($"brand").over(brandWindow)))
Result:
+-----+----------+---------+
|brand|Timestamp |frequency|
+-----+----------+---------+
|BR1  |1632899456|4        |
|BR1  |1632901256|4        |
|BR1  |1632899933|4        |
|BR1  |1632899756|4        |
|BR22 |1632900776|2        |
|BR22 |1632900176|2        |
|BR300|1632901796|2        |
|BR300|1632899155|2        |
|BR90 |1632901743|1        |
+-----+----------+---------+
Solution with GroupBy:
val countDF = df.select("brand").groupBy("brand").count()
df.alias("df")
  .join(countDF.alias("cnt"), Seq("brand"))
  .withColumn("frequency", when($"df.frequency".isNotNull, $"df.frequency").otherwise($"cnt.count"))
  .select("df.brand", "df.Timestamp", "frequency")

I'm a Java programmer, so here is a plain-Java take on it: loop through the frequency column, find the first null and its brand, count how many times that brand occurs in the table, write that count into the null rows for that brand, and then move on to the next brand with nulls. Here is my Java solution (I only wrote it in a text editor and haven't tested it, but I hope it works):
// this is your table + its dimensions:
// column 0 = brand, column 1 = timestamp, column 2 = frequency (null if missing)
Object[][] table = new Object[9][3];

for (int i = 0; i < table.length; i++) {
    if (table[i][2] == null) {
        String brand = (String) table[i][0];
        // count how many times this brand occurs in the whole table
        int repeatCounter = 0;
        for (int n = 0; n < table.length; n++) {
            if (brand.equals(table[n][0])) {
                repeatCounter++;
            }
        }
        // write the count into every null row of this brand
        for (int n = 0; n < table.length; n++) {
            if (brand.equals(table[n][0]) && table[n][2] == null) {
                table[n][2] = repeatCounter;
            }
        }
    }
}

Related

How to apply an empty condition to sql select by using "and" in Spark?

I have a UuidConditionSet; when the if condition is not met, I want to apply an empty string to my select statement (or just ignore this UuidConditionSet), but I got this error. How can I solve this problem?
mismatched input 'FROM' expecting <EOF>(line 10, pos 3)
This is the select:
(SELECT
item,
amount,
date
from my_table
where record_type = 'myType'
and ( date_format(date, "yyyy-MM-dd") >= '2020-02-27'
and date_format(date, "yyyy-MM-dd") <= '2020-02-28' )
and ()
var UuidConditionSet = ""
var UuidCondition = Seq.empty[String]
if(!UuidList.mkString.isEmpty) {
UuidCondition = for {
Uuid <- UuidList
UuidConditionSet = s"${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} = '".concat(eventUuid).concat("'")
} yield UuidConditionSet
UuidConditionSet = UuidCondition.reduce(_.concat(" or ").concat(_))
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and ( date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}' )
| and ($UuidConditionSet)
You can use pattern matching on the list UuidList to check the size and return an empty string if the list is empty. Also, you can use IN instead of multiple ORs here.
Try this:
val UuidCondition = UuidList match {
  case l if (l.size > 0) => {
    l.map(u => s"'$u'").mkString(
      s"and ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} in (",
      ",",
      ")"
    )
  }
  case _ => ""
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}'
| $UuidCondition
"""

Spark dataframe Column content modification

I have a dataframe as shown below (df.show()):
+--------+--------+---------+---------+---------+
| Col11  | Col22  | Expend1 | Expend2 | Expend3 |
+--------+--------+---------+---------+---------+
| Value1 | value1 | 123     | 2264    | 56      |
| Value1 | value2 | 124     | 2255    | 23      |
+--------+--------+---------+---------+---------+
Can I transform the above data frame to the below using some SQL?
+--------+---------+-------------+---------------+------------+
| Col11  | Col22   | Expend1     | Expend2       | Expend3    |
+--------+---------+-------------+---------------+------------+
| Value1 | value1  | Expend1:123 | Expend2: 2264 | Expend3:56 |
| Value1 | value2  | Expend1:124 | Expend2: 2255 | Expend3:23 |
+--------+---------+-------------+---------------+------------+
You can use the idea of foldLeft here
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.sparkContext.parallelize(Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
)).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")

// List the columns to operate on
val cols = List("Expend1", "Expend2", "Expend3")
val newDF = cols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, concat(lit(name + ":"), col(name)))
}
newDF.show()
Output:
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
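The same transformation can also be written as a single projection instead of repeated withColumn calls, mapping over the same cols list (a small sketch):
import org.apache.spark.sql.functions._

// One select: keep the key columns and rebuild each Expend column as "name:value"
val newDF2 = df.select(
  (col("Col11") +: col("Col22") +: cols.map(c => concat(lit(c + ":"), col(c)).as(c))): _*
)
newDF2.show()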
You can also do it with a simple SQL select statement; if you want, you can use a udf as well. In Spark SQL the string concatenation would look like:
Ex -> select Col11, Col22, concat('Expend1:', Expend1) as Expend1, .... from table
val df = Seq(("Value1", "value1", "123", "2264", "56"), ("Value1", "value2", "124", "2255", "23") ).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")
val cols = df.columns.filter(!_.startsWith("Col")) // It will only fetch other than col% prefix columns
val getCombineData = udf { (colName:String, colvalue:String) => colName + ":"+ colvalue}
var in = df
for (e <- cols) {
in = in.withColumn(e, getCombineData(lit(e), col(e)) )
}
in.show
// results
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
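For completeness, the select-statement idea can be expressed directly on the DataFrame with selectExpr (a sketch using the same df):
val sqlDF = df.selectExpr(
  "Col11",
  "Col22",
  "concat('Expend1:', Expend1) as Expend1",
  "concat('Expend2:', Expend2) as Expend2",
  "concat('Expend3:', Expend3) as Expend3"
)
sqlDF.show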

How to use withColumn on a Spark DataFrame in Scala with a while loop

This is my function that applies a rule; the columns mdp_codcat, mdp_idregl and usedRef change according to the data in the array bRef.
def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataFrame: DataFrame): DataFrame = {
  var matchRule = false
  var i = 0
  while (i < bRef.value.size && !matchRule) {
    if ((bRef.value(i).sensop.isEmpty || bRef.value(i).sensop.equals(col("signe")))
      && (bRef.value(i).cdopcz.isEmpty || Lib.matchCdopcz(strTail(col("cdopcz")).toString(), bRef.value(i).cdopcz))
      && (bRef.value(i).libope.isEmpty || Lib.matchRule(col("lib_ope").toString(), bRef.value(i).libope))
      && (bRef.value(i).qualib.isEmpty || Lib.matchRule(col("qualif_lib_ope").toString(), bRef.value(i).qualib))) {
      matchRule = true
      dataFrame.withColumn("mdp_codcat", lit(bRef.value(i).codcat))
      dataFrame.withColumn("mdp_idregl", lit(bRef.value(i).idregl))
      dataFrame.withColumn("usedRef", lit("SDC"))
    } else {
      dataFrame.withColumn("mdp_codcat", lit("NOT_CATEGORIZED"))
      dataFrame.withColumn("mdp_idregl", lit("-1"))
      dataFrame.withColumn("usedRef", lit(""))
    }
    i += 1
  }
  dataFrame
}
dataFrame : "cdenjp", "cdguic", "numcpt", "mdp_codcat", "mdp_idregl" , mdp_codcat","mdp_idregl","usedRef" if match add mdp_idregl, mdp_idregl,mdp_idregl with value bRef
Example - my dataframe :
val DF = Seq(("tt", "aa","bb"),("tt1", "aa1","bb2"),("tt1", "aa1","bb2")).toDF("t","a","b)
+---+---+---+---+
| t| a| b| c|
+---+---+---+---+
| tt| aa| bb| cc|
|tt1|aa1|bb2|cc3|
+---+---+---+---+
file.text content:
,aa,bb,cc
,aa1,bb2,cc3
tt4,aa4,bb4,cc4
tt1,aa1,,cc6
case class TOTO(a: String, b: String, c: String, d: String)
val text = sc.textFile("file:///home/X176616/file")
val bRef = textFromCsv.map(row => row.split(",", -1))
  .map(c => TOTO(c(0), c(1), c(2), c(3))).collect().sortBy(_.a)
def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataFrame: DataFrame): DataFrame
  dataframe.withColumn("mdp_codcat_new", "NOT_FOUND") // first init to not found, change in the while if a rule matches
  var matchRule = false
  var i = 0
  while (i < bRef.value.size && !matchRule) {
    if ((bRef.value(i).a.isEmpty || bRef.value(i).a.equals(signe))
      && (bRef.value(i).b.isEmpty || Lib.matchCdopcz(col(b), bRef.value(i).b))
      && (bRef.value(i).c.isEmpty || Lib.matchRule(col(c), bRef.value(i).c))) {
      matchRule = true
      dataframe.withColumn("mdp_codcat_new", bRef.value(i).d)
      dataframe.withColumn("mdp_mdp_idregl_new", bRef.value(i).e)
    }
    i += 1
  }
Finally the df, if the condition is true:
bRef.value(i).a.isEmpty || bRef.value(i).a.equals(signe))
  && (bRef.value(i).b.isEmpty || Lib.matchCdopcz(b.substring(1).toInt.toString, bRef.value(i).b))
  && (bRef.value(i).c.isEmpty || Lib.matchRule(c, bRef.value(i).c)
+---+---+---+---+-----------+----------+
|  t|  a|  b|  c|mdp_codcat |mdp_idregl|
+---+---+---+---+-----------+----------+
| tt| aa| bb| cc|cc         |other     |
| ab|aa1|bb2|cc3|cc4        |toto      |  <- from bRef if the condition is true in the while
| cd|aa1|bb2|cc3|cc4        |titi      |
|  b|a1 |b2 |c3 |NOT_FOUND  |NOT_FOUND |  <- NOT_FOUND if the condition is false
+---+---+---+---+-----------+----------+
You cannot create a dataframe schema that depends on a runtime value. I would try to do it in a simpler way. First I'd create the three columns with a default value:
dataFrame
  .withColumn("mdp_codcat", lit(""))
  .withColumn("mdp_idregl", lit(""))
  .withColumn("usedRef", lit(""))
Then you can use a udf with your broadcasted value:
def mdp_codcat(bRef: Broadcast[Array[RefRglSDC]]) = udf { (field: String) =>
  {
    // Your while and if stuff
    // return your updated data
  }
}
And apply each udf to each field:
dataframe.withColumn("mdp_codcat_new", mdp_codcat(bRef)(col("mdp_codcat")))
Maybe it can help
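A more concrete sketch of that idea, assuming RefRglSDC exposes the fields used in the question (sensop, cdopcz, libope, qualib, codcat, idregl) and that the Lib matchers accept plain strings: evaluate all rules per row in a single struct-returning udf and then unpack the struct into the three columns.
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def withMdpCodcat(bRef: Broadcast[Array[RefRglSDC]])(dataFrame: DataFrame): DataFrame = {
  // Runs on the executors for each row; returns (codcat, idregl, usedRef)
  val categorize = udf { (signe: String, cdopcz: String, libOpe: String, qualifLibOpe: String) =>
    bRef.value
      .find { r =>
        (r.sensop.isEmpty || r.sensop == signe) &&
        (r.cdopcz.isEmpty || Lib.matchCdopcz(cdopcz, r.cdopcz)) &&
        (r.libope.isEmpty || Lib.matchRule(libOpe, r.libope)) &&
        (r.qualib.isEmpty || Lib.matchRule(qualifLibOpe, r.qualib))
      }
      .map(r => (r.codcat, r.idregl, "SDC"))
      .getOrElse(("NOT_CATEGORIZED", "-1", ""))
  }

  dataFrame
    .withColumn("rule", categorize(col("signe"), col("cdopcz"), col("lib_ope"), col("qualif_lib_ope")))
    .withColumn("mdp_codcat", col("rule._1"))
    .withColumn("mdp_idregl", col("rule._2"))
    .withColumn("usedRef", col("rule._3"))
    .drop("rule")
}
This keeps the rule matching on the executors, row by row, instead of trying to drive withColumn from a while loop on the driver.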

Update dataframe column by comparing with existing data in another column using Levenshtein algorithm

How can I update the m_name column using the Levenshtein algorithm to replace the nulls?
+--------------------+--------------------+-------------------+
| original_name| m_name| created|
+--------------------+--------------------+-------------------+
| New York| New York|2017-08-01 09:33:40|
| new york| null|2017-08-01 15:15:06|
| New York city| null|2017-08-01 15:15:06|
| california| California|2017-09-01 09:33:40|
| California,000IU...| null|2017-09-01 01:40:00|
| Californiya| California|2017-09-01 11:38:21|
For every "original_name" value should be taken first nearest "m_name" value founded by algorithm based on Levenshtein distance (edit distance).
similarity(s1,s2) = [max(len(s1), len(s2)) − editDistance(s1,s2)] / max(len(s1), len(s2))
"ideal" final result should be like that
+--------------------+--------------------+-------------------+
| original_name| m_name| created|
+--------------------+--------------------+-------------------+
| New York| New York|2017-08-01 09:33:40|
| new york| New York|2017-08-01 15:15:06|
| New York city| New York|2017-08-01 15:15:06|
| california| California|2017-09-01 09:33:40|
| California,000IU...| California|2017-09-01 01:40:00|
| Californiya| California|2017-09-01 11:38:21|
Credit for the edit-distance implementation goes to Rosetta Code: Levenshtein_distance.
You can do the following (commented for clarity and explanation)
// collect the m_name values into a unique set, filter out nulls and broadcast the list for use in the udf function
import org.apache.spark.sql.functions._
val collectedList = df.select(collect_set("m_name")).rdd.collect().flatMap(row => row.getAs[Seq[String]](0).filterNot(_ == "null")).toList
val broadcastedList = sc.broadcast(collectedList)

// Levenshtein distance formula
import scala.math.{min => mathmin}
def minimum(i1: Int, i2: Int, i3: Int) = mathmin(mathmin(i1, i2), i3)
def editDistance(s1: String, s2: String) = {
  val dist = Array.tabulate(s2.length + 1, s1.length + 1) { (j, i) => if (j == 0) i else if (i == 0) j else 0 }
  for (j <- 1 to s2.length; i <- 1 to s1.length)
    dist(j)(i) = if (s2(j - 1) == s1(i - 1)) dist(j - 1)(i - 1)
                 else minimum(dist(j - 1)(i) + 1, dist(j)(i - 1) + 1, dist(j - 1)(i - 1) + 1)
  dist(s2.length)(s1.length)
}

// udf that computes the Levenshtein distance to every broadcasted m_name and picks the closest match for original_name
def levenshteinUdf = udf((str1: String) => {
  val distances = for (str2 <- broadcastedList.value) yield (str2, editDistance(str1.toLowerCase, str2.toLowerCase))
  distances.minBy(_._2)._1
})

// call the udf function only when m_name is null
df.withColumn("m_name", when(col("m_name").isNull || col("m_name") === "null", levenshteinUdf(col("original_name"))).otherwise(col("m_name"))).show(false)
which should give you
+-------------------+----------+-------------------+
|original_name |m_name |created |
+-------------------+----------+-------------------+
|New York |New York |2017-08-01 09:33:40|
|new york |New York |2017-08-01 15:15:06|
|New York city |New York |2017-08-01 15:15:06|
|california |California|2017-09-01 09:33:40|
|California,000IU...|California|2017-09-01 01:40:00|
|Californiya |California|2017-09-01 11:38:21|
+-------------------+----------+-------------------+
Note: I didn't use your similarity(s1,s2) = [max(len(s1), len(s2)) − editDistance(s1,s2)] / max(len(s1), len(s2)) logic as it was giving wrong output.
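For reference, that similarity formula can be written as a small helper on top of the same editDistance (a sketch; using it would mean taking maxBy(similarity) instead of minBy(distance) in the udf):
// similarity(s1, s2) = (max(len(s1), len(s2)) - editDistance(s1, s2)) / max(len(s1), len(s2))
def similarity(s1: String, s2: String): Double = {
  val maxLen = math.max(s1.length, s2.length)
  if (maxLen == 0) 1.0 else (maxLen - editDistance(s1, s2)).toDouble / maxLen
}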

Spark Dataframe access of previous calculated row

I have the following data:
+-----+-----+----+
|Col1 |t0   |t1  |
+-----+-----+----+
| A   |null |20  |
| A   |20   |40  |
| B   |null |10  |
| B   |10   |20  |
| B   |20   |120 |
| B   |120  |140 |
| B   |140  |320 |
| B   |320  |340 |
| B   |340  |360 |
+-----+-----+----+
And what I want is something like this:
+-----+-----+----+----+
|Col1 |t0   |t1  |grp |
+-----+-----+----+----+
| A   |null |20  |1A  |
| A   |20   |40  |1A  |
| B   |null |10  |1B  |
| B   |10   |20  |1B  |
| B   |20   |120 |2B  |
| B   |120  |140 |2B  |
| B   |140  |320 |3B  |
| B   |320  |340 |3B  |
| B   |340  |360 |3B  |
+-----+-----+----+----+
Explanation:
The extra column is based on Col1 and the difference between t1 and t0.
When the difference between the two is too high, a new number is generated (in the dataset above, when the difference is greater than 50).
I build t0 with:
val windowSpec = Window.partitionBy($"Col1").orderBy("t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
Can someone help me with how to do this?
I searched but didn't find a good approach.
I'm a little bit lost because I need the value of the previously calculated row of grp...
Thanks
I solved it myself
val grp = (coalesce(
  ($"t" - lag($"t", 1).over(windowSpec)),
  lit(0)
) > 50).cast("bigint")

df = df.withColumn("grp", sum(grp).over(windowSpec))
With this I don't need both columns (t0 and t1) anymore, but can use only t1 (or t) without computing t0.
(I still need to add the value of Col1, but the most important part, the number, is done and works fine.)
I got the solution from:
Spark SQL window function with complex condition
thanks for your help
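To also get the Col1 suffix shown in the expected output, one option (a sketch building on the grp flag above; the +1 makes the numbering start at 1) is:
df = df.withColumn("grp", concat((sum(grp).over(windowSpec) + 1).cast("string"), $"Col1"))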
You can use a udf function to generate the grp column:
def testUdf = udf((col1: String, t0: Int, t1: Int) => (t1 - t0) match {
  case x: Int if x > 50 => 2 + col1
  case _ => 1 + col1
})
Call the udf function as
df.withColumn("grp", testUdf($"Col1", $"t0", $"t1"))
The udf function above won't work properly due to the null values in t0, which can be replaced by 0:
df.na.fill(0)
I hope this is the answer you are searching for.
Edited
Here's the complete solution using a udaf. The process is complex. You've already got the easy answer, but it might help somebody who wants to use it.
First, define the udaf:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class Boendal extends UserDefinedAggregateFunction {

  def inputSchema = new StructType().add("Col1", StringType).add("t0", IntegerType).add("t1", IntegerType).add("rank", IntegerType)
  def bufferSchema = new StructType().add("buff", StringType).add("buffer1", IntegerType)
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, "")
    buffer.update(1, 0)
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val buff = buffer.getString(0)
      val col1 = input.getString(0)
      val t0 = input.getInt(1)
      val t1 = input.getInt(2)
      val rank = input.getInt(3)

      var value = 1
      if ((t1 - t0) < 50)
        value = 1
      else
        value = (t1 - t0) / 50

      val lastValue = buffer(1).asInstanceOf[Integer]
      // if (!buff.isEmpty) {
      if (value < lastValue)
        value = lastValue
      // }
      buffer.update(1, value)

      var finalString = ""
      if (buff.isEmpty)
        finalString = rank + ";" + value + col1
      else
        finalString = buff + "::" + rank + ";" + value + col1

      buffer.update(0, finalString)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val buff1 = buffer1.getString(0)
    val buff2 = buffer2.getString(0)
    buffer1.update(0, buff1 + buff2)
  }

  def evaluate(buffer: Row): String = {
    buffer.getString(0)
  }
}
Then some udfs
def rankUdf = udf((grp: String)=> grp.split(";")(0))
def removeRankUdf = udf((grp: String) => grp.split(";")(1))
And finally call the udaf and udfs
val windowSpec = Window.partitionBy($"Col1").orderBy($"t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
  .withColumn("rank", rank() over windowSpec)

df = df.na.fill(0)

val boendal = new Boendal

val df2 = df.groupBy("Col1").agg(boendal($"Col1", $"t0", $"t1", $"rank").as("grp2")).withColumnRenamed("Col1", "Col2")
  .withColumn("grp2", explode(split($"grp2", "::")))
  .withColumn("rank2", rankUdf($"grp2"))
  .withColumn("grp2", removeRankUdf($"grp2"))

df = df.join(df2, df("Col1") === df2("Col2") && df("rank") === df2("rank2"))
  .drop("Col2", "rank", "rank2")
df.show(false)
Hope it helps