reverse effect of explode function

reverse effect of explode function - scala

In scala with spark-2.4, I would like to filter the value inside the arrays in a column.
From
+---+------------+
| id| letter|
+---+------------+
| 1|[x, xxx, xx]|
| 2|[yy, y, yyy]|
+---+------------+
To
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
I thought of using explode + filter
val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")
res.withColumn("tmp", explode(col("letter"))).filter(length(col("tmp")) < 3).drop(col("letter")).show()
And I'm getting
+---+---+
| id|tmp|
+---+---+
| 1| x|
| 1| xx|
| 2| yy|
| 2| y|
+---+---+
How do I zip/groupBy back by id ?
Or maybe there is a better, more optimised solution ?

You can filter the array without explode() in Spark 2.4:
res.withColumn("letter", expr("filter(letter, x -> length(x) < 3)")).show()
Output:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+

In Spark 2.4+, higher order functions are the way to go (filter), alternatively use collect_list :
res.withColumn("tmp",explode(col("letter")))
.filter(length(col("tmp")) < 3)
.drop(col("letter"))
// aggregate back
.groupBy($"id")
.agg(collect_list($"tmp").as("letter"))
.show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
As this introduces a shuffle, it's better to use UDF for that:
def filter_arr(maxLength:Int)= udf((arr:Seq[String]) => arr.filter(str => str.size<=maxLength))
res
.select($"id",filter_arr(maxLength = 2)($"letter").as("letter"))
.show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+

Related

How to get the set of rows which contains null values from dataframe in scala using filter

I'm new to spark and have a question regarding filtering dataframe based on null condition.
I have gone through many answers which has solution like
df.filter(($"col2".isNotNULL) || ($"col2" !== "NULL") || ($"col2" !== "null") || ($"col2".trim !== "NULL"))
But in my case, I can not write hard coded column names as my schema is not fixed. I am reading csv file and depending upon the columns in it, I have to filter my dataframe for null values and want it in another dataframe. In short, any column which has null value, that complete row should come under a different dataframe.
for example :
Input DataFrame :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
| n4| 4|n4#c2.com|[c2,2,d2]|
| n6| 6|n6#c2.com|[c2,2,d2]|
Output :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
Thank you in advance.

Try this-
val df1 = spark.sql("select col1, col2 from values (null, 1), (2, null), (null, null), (1,2) T(col1, col2)")
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* |1 |2 |
* +----+----+
*/
df1.show(false)
df1.filter(df1.columns.map(col(_).isNull).reduce(_ || _)).show(false)
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* +----+----+
*/

Thank you so much for your answers. I tried below logic and it worked for me.
var arrayColumn = df.columns;
val filterString = String.format(" %1$s is null or %1$s == '' "+ arrayColumn(0));
val x = new StringBuilder(filterString);
for(i <- 1 until arrayColumn.length){
if (x.toString() != ""){
x ++= String.format("or %1$s is null or %1$s == '' ", arrayColumn(i))
}
}
val dfWithNullRows = df.filter(x.toString());

To deal with null values and dataframes spark has some useful functions.
I will show some dataframes examples with distinct number of columns.
val schema = StructType(List(StructField("id", IntegerType, true), StructField("obj",DoubleType, true)))
val schema1 = StructType(List(StructField("id", IntegerType, true), StructField("obj",StringType, true), StructField("obj",IntegerType, true)))
val t1 = sc.parallelize(Seq((1,null),(1,1.0),(8,3.0),(2,null),(3,1.4),(3,2.5),(null,3.7))).map(t => Row(t._1,t._2))
val t2 = sc.parallelize(Seq((1,"A",null),(2,"B",null),(3,"C",36),(null,"D",15),(5,"E",25),(6,null,7),(7,"G",null))).map(t => Row(t._1,t._2,t._3))
val tt1 = spark.createDataFrame(t1, schema)
val tt2 = spark.createDataFrame(t2, schema1)
tt1.show()
tt2.show()
// To clean all rows with null values
val dfWithoutNull = tt1.na.drop()
dfWithoutNull.show()
val df2WithoutNull = tt2.na.drop()
df2WithoutNull.show()
// To fill null values with another value
val df1 = tt1.na.fill(-1)
df1.show()
// to get new dataframes with the null values rows
val nullValues = tt1.filter(row => row.anyNull == true)
nullValues.show()
val nullValues2 = tt2.filter(row => row.anyNull == true)
nullValues2.show()
output
// input dataframes
+----+----+
| id| obj|
+----+----+
| 1|null|
| 1| 1.0|
| 8| 3.0|
| 2|null|
| 3| 1.4|
| 3| 2.5|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
| 3| C| 36|
|null| D| 15|
| 5| E| 25|
| 6|null| 7|
| 7| G|null|
+----+----+----+
// Dataframes without null values
+---+---+
| id|obj|
+---+---+
| 1|1.0|
| 8|3.0|
| 3|1.4|
| 3|2.5|
+---+---+
+---+---+---+
| id|obj|obj|
+---+---+---+
| 3| C| 36|
| 5| E| 25|
+---+---+---+
// Dataframe with null values replaced
+---+----+
| id| obj|
+---+----+
| 1|-1.0|
| 1| 1.0|
| 8| 3.0|
| 2|-1.0|
| 3| 1.4|
| 3| 2.5|
| -1| 3.7|
+---+----+
// Dataframes which the rows have at least one null value
+----+----+
| id| obj|
+----+----+
| 1|null|
| 2|null|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
|null| D| 15|
| 6|null| 7|
| 7| G|null|
+----+----+----+

Column comparison in spark scala

I have 2 dataframes like this.
scala> df1.show
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df2.show
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
How can i take take the max of the column-count after join?
Expected output.
+---+---------+
| id| Count|
+---+---------+
| 1|20.565656|
| 2|40.7767 |
+---+---------+

You can do this:
df1.union(df2).groupBy("ID").max("Count").show()
+---+----------+
| ID|max(Count)|
+---+----------+
| 1| 20.565656|
| 2| 40.7767|
+---+----------+

After joining both dataframes, create an UDF with 2 count columns as input and in the UDF return the greatest value between those columns.
Always its a good practice to use UDF when we need to derive a single column based on multiple columns.

scala> df.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2|30.676776|
+---+---------+
scala> df1.show()
+---+-----------+
| ID| Count|
+---+-----------+
| 1|10.00998787|
| 2| 40.7767|
+---+-----------+
scala> df.alias("x").join(df1.alias("y"), List("ID"))
.select(col("ID"), col("x.count").alias("Xcount"),col("y.count").alias("Ycount"))
.withColumn("Count", when(col("Xcount") >= col("Ycount"), col("Xcount")).otherwise(col("Ycount")))
.drop("Xcount", "YCount")
.show()
+---+---------+
| ID| Count|
+---+---------+
| 1|20.565656|
| 2| 40.7767|
+---+---------+

Fill null values in dataframe column with next value

I have to fill the first null values with immediate value of the same column in dataframe. This logic applies only on first consecutive null values only of the column.
I have a dataframe with similar to below
//I replaced null to 0 in value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I am expecting as below
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // Change the value 0 to 16 at value column
|16 |exB |22 | // Change the value 0 to 16 at value column
|16 |exC |19 | // Change the value 0 to 16 at value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not be change here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.

You can use Window function for this purpose
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.orderBy($"col2")
.show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
Expression df.orderBy($"col2") is needed only to show final results in right order. You can skip it if you don't care about final order.
UPDATE
To get exactly what you need you should you a little bit more complicated code
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+

I think you need to take the 1st not null or non-zero value based on col2 's order. Please find the script below. I have created a table in spark's memory to write sql.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.registerTempTable("table_df")
spark.sql("with cte as(select *,row_number() over(order by col2) rno from table_df) select case when value = 0 and rno<(select min(rno) from cte where value != 0) then (select value from cte where rno=(select min(rno) from cte where value != 0)) else value end value,col2,col3 from cte").show(df.count.toInt,false)
Please let me know if you have any questions.

I added a new column with incremental id to your DF
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
filter DF to have non-zero values
val df_2 = df_1.filter("value != 0")
create a variable "limit" to limit first N row that we need and variable Nvar for the first non-zero value
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
create DF with a column with the same name ("value") with a condition
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
Field1 Field2
AA
12 BB
This command does not provide an expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1 Field2
Anonymous AA
12 BB

You can also try this.
This might handle both blank/empty/null
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+

Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
An empty string is not null or NaN, so you'll have to use a case statement for that.
Fill seems to not work well when giving a text value into a numeric column.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+

You can try using below code when you have n number of columns in dataframe.
Note: When you are trying to write data into formats like parquet, null data types are not supported. we have to type cast it.
val df = Seq(
(1, ""),
(2, "Ram"),
(3, "Sam"),
(4,"")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
//Replace all blank space in the dataframe with null
val colName = inputDf.columns //*This will give you array of string*
val data = inputDf.na.replace(colName,Map(""->"null"))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+

Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records which have first 3 characters of column 'c2' either 'MSL' or 'HCP'.
So the output should be like below.
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can any one please help on this?
I knew that df.filter($"c2".rlike("MSL")) -- This is for selecting the records but how to exclude the records. ?
Version: Spark 1.6.2
Scala : 2.10

This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+

df.filter(not(
substring(col("c2"), 0, 3).isin("MSL", "HCP"))
)

I used below to filter rows from dataframe and this worked form me.Spark 2.2
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
option("header", "true").
option("delimiter", "|").
option("inferSchema", "true").
load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )

val df1 = df.filter(not(df("c2").rlike("MSL"))&&not(df("c2").rlike("HCP")))
This worked.

Categories

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

reverse effect of explode function - scala

You can filter the array without explode() in Spark 2.4: res.withColumn("letter", expr("filter(letter, x -> length(x) < 3)")).show() Output: +---+-------+ | id| letter| +---+-------+ | 1|[x, xx]| | 2|[yy, y]| +---+-------+

Related

How to get the set of rows which contains null values from dataframe in scala using filter

Column comparison in spark scala

Fill null values in dataframe column with next value

How to replace empty values in a column of DataFrame?

Spark dataframe filter

Categories

Resources