Scala dataframe: replace spaces with null value using regexp_replace - scala

I am trying to replace white-spaces with a null value using regexp_replace in Scala. However, all variations I have tried do not arrive at the expected output:
+---+-----+
| Id|col_1|
+---+-----+
| 0| null|
| 1| null|
+---+-----+
I had a go at it which looks like this:
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(
(0, " "),
(1, null),
(2, "hello"))).toDF("Id", "col_1")
val test = df.withColumn("col_1", regexp_replace(df("col_1"), "^\\s*", lit(Null)))
test.filter("col_1 is null").show()

regexp_replace cannot produce a null here: it always returns a string in which the matched substring is replaced by the replacement string you provide. Instead, you can use regexp_extract for a regex equality check in a when/otherwise clause, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(0, " "),
(1, null),
(2, "hello"),
(3, "")
).toDF("Id", "col_1")
df.withColumn("col_1",
when($"col_1" === regexp_extract($"col_1", "(^\\s*$)", 1), null).
otherwise($"col_1")
).show
// +---+-----+
// | Id|col_1|
// +---+-----+
// | 0| null|
// | 1| null|
// | 2|hello|
// | 3| null|
// +---+-----+
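As an alternative sketch (not part of the answer above), the blank check can also be written with rlike, which tests the pattern directly instead of comparing the column to an extracted match. The local SparkSession and the name `cleaned` are assumptions made so that the example runs on its own; in spark-shell, `spark` already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Assumption: a local session, created only so the example is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("blank-to-null").getOrCreate()
import spark.implicits._

val df = Seq((0, " "), (1, null), (2, "hello"), (3, "")).toDF("Id", "col_1")

// rlike yields null for a null input, so null rows fall through to otherwise()
// and simply stay null; blank or whitespace-only rows are replaced with null.
val blankToNull = when(col("col_1").rlike("^\\s*$"), lit(null).cast("string")).otherwise(col("col_1"))
val cleaned = df.withColumn("col_1", blankToNull)
cleaned.show()
```

Only the row with "hello" keeps its value; the other three rows end up null.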

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+
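If the goal is to keep the map-based approach but add the ids with withColumn, one possible sketch is to wrap each map in a UDF. The helper `idMap`, the UDF names, the 1-based numbering, and the -1 fallback for unknown values are all assumptions for illustration; the distinct values are sorted on the driver so that the ids are deterministic, which assumes each column has few distinct values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session, created only so the example is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("column-ids").getOrCreate()
import spark.implicits._

val df = Seq(("a1", "b1"), ("a1", "b2"), ("a1", "b2"), ("a2", "b3")).toDF("A", "B")

// Hypothetical helper: collect the distinct values of a column, sort them,
// and number them starting from 1.
def idMap(column: String): Map[String, Int] = {
  val distinctValues = df.select(column).distinct().as[String].collect().sorted
  distinctValues.zipWithIndex.map { case (value, index) => value -> (index + 1) }.toMap
}

val mapColA = idMap("A")
val mapColB = idMap("B")

// Each UDF looks up its column value in the corresponding map; unknown values become -1.
val toIdA = udf((a: String) => mapColA.getOrElse(a, -1))
val toIdB = udf((b: String) => mapColB.getOrElse(b, -1))

val withIds = df.withColumn("idA", toIdA(col("A"))).withColumn("idB", toIdB(col("B")))
withIds.show()
```

With sorted values, a1 and a2 map to 1 and 2, and b1, b2, b3 map to 1, 2, 3, which matches the expected result in the question.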

Concatenate list of columns except when any of them is null

I have a dataframe for which I want to add a new column, which is a concatenation of all the items from the columns in listOfFixedColumns using "_". I want to set the new column's value to null if any of the columns in listOfFixedColumns is null.
+----+----+---------+
|   a|   b|unique_id|
+----+----+---------+
| foo| bar|  foo_bar|
|null| bar|     null|
| baz|null|     null|
|null|null|     null|
+----+----+---------+
I tried this which gets me only the concatenated column values
val listOfFixedColumns = List("A", "B", ..) // dynamic list of columns names as strings
df.withColumn("unique_id", concat_ws("_", listOfFixedColumns.map(c => col(c)): _*))
but I am not able to figure out how to take care of the null cases:
+----+----+---------+
|   a|   b|unique_id|
+----+----+---------+
| foo| bar|  foo_bar|
|null| bar|      bar| <-- needs a fix
| baz|null|      baz| <-- needs a fix
|null|null|     null|
+----+----+---------+
Do I have to use UDFs for this? I am a Scala beginner and any help will be appreciated.
You can use the isNull method of the Column class together with the OR operator to detect whether any of the columns is null, then use the result as the condition in when:
import org.apache.spark.sql.functions.{col, concat_ws, when}
val df = Seq(
("foo", "bar", "foo_bar"),
(null, "bar", null),
("baz", null, null),
(null, null, null)
).toDF("A", "B", "C")
val listOfFixedColumns = List("A", "B", "C")
val hasNull = listOfFixedColumns
.map(col(_).isNull)
.reduce(_ || _)
val concatNonEmpty = concat_ws("_", listOfFixedColumns.map(col): _*)
df.withColumn("unique_id", when(!hasNull, concatNonEmpty).otherwise(null)).show
// +----+----+-------+---------------+
// | A| B| C| unique_id|
// +----+----+-------+---------------+
// | foo| bar|foo_bar|foo_bar_foo_bar|
// |null| bar| null| null|
// | baz|null| null| null|
// |null|null| null| null|
// +----+----+-------+---------------+
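Another option, sketched here under the assumption that every listed column is a string column, relies on the fact that concat (unlike concat_ws) returns null as soon as any argument is null. Interleaving the separator by hand then gives the required behaviour without a separate null check. The two-column DataFrame and the name `uniqueId` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

// Assumption: a local session, created only so the example is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("concat-nulls").getOrCreate()
import spark.implicits._

val df = Seq[(String, String)](("foo", "bar"), (null, "bar"), ("baz", null), (null, null)).toDF("A", "B")

val listOfFixedColumns = List("A", "B")

// Builds concat(concat(A, "_", B), "_", C)... for however many columns are listed.
// A null anywhere in the chain makes the whole expression null.
val uniqueId = listOfFixedColumns.map(col).reduce((left, right) => concat(left, lit("_"), right))

val result = df.withColumn("unique_id", uniqueId)
result.show()
```

Only the row where both A and B are set gets a value (foo_bar); the other rows get null.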

Scala Spark collect_list() vs array()

What is the difference between collect_list() and array() in Spark using Scala?
I see both used all over the place, but the use cases are not clear enough for me to tell them apart.
Even though both array and collect_list return an ArrayType column, the two methods are very different.
Method array combines "column-wise" a number of columns into an array, whereas collect_list aggregates "row-wise" on a single column typically by group (or Window partition) into an array, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "a", "b"),
(1, "c", "d"),
(2, "e", "f")
).toDF("c1", "c2", "c3")
df.
withColumn("arr", array("c2", "c3")).
show
// +---+---+---+------+
// | c1| c2| c3| arr|
// +---+---+---+------+
// | 1| a| b|[a, b]|
// | 1| c| d|[c, d]|
// | 2| e| f|[e, f]|
// +---+---+---+------+
df.
groupBy("c1").agg(collect_list("c2")).
show
// +---+----------------+
// | c1|collect_list(c2)|
// +---+----------------+
// | 1| [a, c]|
// | 2| [e]|
// +---+----------------+
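The Window partition case mentioned above is not shown in the examples, so here is a minimal sketch of it. Unlike groupBy, the window version keeps every input row and attaches the collected array to each of them. The session setup and the column name `c2_list` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Assumption: a local session, created only so the example is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("collect-list-window").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", "b"), (1, "c", "d"), (2, "e", "f")).toDF("c1", "c2", "c3")

// Collect c2 over all rows that share the same c1; no rows are collapsed.
val byC1 = Window.partitionBy("c1")
val withList = df.withColumn("c2_list", collect_list("c2").over(byC1))
withList.show()
```

Both rows with c1 = 1 carry an array holding a and c, and the row with c1 = 2 carries an array holding e. The order of elements inside each array is not guaranteed unless the window is ordered.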

How to parallelize processing of a dataframe in Apache Spark with combinations over a column

I'm looking for a solution to build an aggregation over all combinations of a column's values. For example, I have a data frame as below:
val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")
+---+-----+
| id|value|
+---+-----+
| A| 1|
| B| 2|
| C| 3|
| A| 4|
| B| 5|
+---+-----+
I want an aggregation for every combination of the values in the column "id". Below is a solution I found, but it cannot use the parallelism of Spark; it works only on the driver node or on a single executor. Is there a better solution that gets rid of the for loop?
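import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, lit}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._
val list = df.select($"id").distinct().orderBy($"id").as[String].collect()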
val combinations = (1 to list.length flatMap (x => list.combinations(x))) filter(_.length >1)
val schema = StructType(
StructField("indexvalue", IntegerType, true) ::
StructField("segment", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
for (x <- combinations) {
initialDF = initialDF.union(df.filter($"id".isin(x: _*))
.agg(expr("sum(value)").as("indexvalue"))
.withColumn("segment",lit(x.mkString("+"))))
}
initialDF.show()
+----------+-------+
|indexvalue|segment|
+----------+-------+
| 12| A+B|
| 8| A+C|
| 10| B+C|
| 15| A+B+C|
+----------+-------+
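One way the loop might be avoided, sketched here as an assumption rather than a tested production answer, is to turn the combinations themselves into a DataFrame, explode each combination into one row per member id, and then do a single join and groupBy. The names `membership` and `result` are illustrative. The distinct ids are still collected to the driver, and the number of combinations grows exponentially with the number of ids, so this only suits a small set of ids.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, sum}

// Assumption: a local session, created only so the example is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("combinations").getOrCreate()
import spark.implicits._

val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")

// Distinct ids are collected and sorted on the driver (assumed to be few).
val ids = df.select("id").distinct().as[String].collect().sorted

// All combinations of two or more ids, as (segment label, member ids).
val combinations = (2 to ids.length).flatMap(n => ids.combinations(n)).map(c => (c.mkString("+"), c.toList))

// One row per (segment, member id), so that one join and one groupBy replace the loop.
val memberRows = combinations.toDF("segment", "members")
val membership = memberRows.select(col("segment"), explode(col("members")).as("id"))
val joined = membership.join(df, "id")
val result = joined.groupBy("segment").agg(sum("value").as("indexvalue"))
result.show()
```

For the sample data this yields the same four sums as the loop: 12 for A+B, 8 for A+C, 10 for B+C and 15 for A+B+C.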

Comparing two dataframes in Spark

I'm comparing two dataframes in Spark using except().
For example: df.except(df2)
This returns all the records from df that are not present in df2. However, I would also like to list which fields do not match.
For example:
df:
------------------
id,name,age,city
101,kp,28,CHN
------------------
df2:
-----------------
id,name,age,city
101,kp,28,HYD
----------------
Expected output:
df3
--------------------------
id,name,age,city,diff
101,kp,28,CHN,City is not matching
--------------------------------
How can I achieve this?
Use intersect to get the rows common to both DataFrames, then build your non-matching logic on top of that.
intersect returns a new Dataset containing only the rows that are present in both this Dataset and another Dataset:
df.intersect(df2)
The RDD API has an equivalent, intersection(otherRDD), which returns a new RDD containing the elements present in both the source RDD and the argument.
The result of intersection contains no duplicates, even if one of the inputs contained duplicates.
Here is a newer attempt at the above. I could not find an elegant way to do it with except, so this uses a JOIN instead. It is the best I can do.
I believe it does what you need, and it takes into account that a record may exist in only one of the two data sets.
Run under Databricks.
case class Person(personid: Int, personname: String, cityid: Int)
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
val df1 = Seq(
Person(0, "AgataZ", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(9999, "Maria", 2),
Person(5, "John", 2),
Person(6, "Patsy", 2),
Person(7, "Gloria", 222),
Person(3333, "Maksym", 0)).toDF
val df2 = Seq(
Person(0, "Agata", 0),
Person(1, "Iweta", 0),
Person(2, "Patryk", 2),
Person(5, "John", 2),
Person(6, "Patsy", 333),
Person(7, "Gloria", 2),
Person(4444, "Hans", 3)).toDF
val joined = df1.join(df2, df1("personid") === df2("personid"), "outer")
val newNames = Seq("personId1", "personName1", "personCity1", "personId2", "personName2", "personCity2")
val df_Renamed = joined.toDF(newNames: _*)
// Some deliberate variation shown in approach for learning
val df_temp = df_Renamed
  .filter(
    $"personCity1" =!= $"personCity2" || $"personName1" =!= $"personName2" ||
    $"personName1".isNull || $"personName2".isNull ||
    $"personCity1".isNull || $"personCity2".isNull)
  .select($"personId1", $"personName1".alias("Name"), $"personCity1", $"personId2", $"personName2".alias("Name2"), $"personCity2")
  .withColumn("PersonID", when($"personId1".isNotNull, $"personId1").otherwise($"personId2"))
val df_final = df_temp
  .withColumn("nameChange ?", when($"Name".isNull or $"Name2".isNull or $"Name" =!= $"Name2", "Yes").otherwise("No"))
  .withColumn("cityChange ?", when($"personCity1".isNull or $"personCity2".isNull or $"personCity1" =!= $"personCity2", "Yes").otherwise("No"))
  .drop("PersonId1")
  .drop("PersonId2")
df_final.show()
gives:
+------+-----------+------+-----------+--------+------------+------------+
| Name|personCity1| Name2|personCity2|PersonID|nameChange ?|cityChange ?|
+------+-----------+------+-----------+--------+------------+------------+
| Patsy| 2| Patsy| 333| 6| No| Yes|
|Maksym| 0| null| null| 3333| Yes| Yes|
| null| null| Hans| 3| 4444| Yes| Yes|
|Gloria| 222|Gloria| 2| 7| No| Yes|
| Maria| 2| null| null| 9999| Yes| Yes|
|AgataZ| 0| Agata| 0| 0| Yes| No|
+------+-----------+------+-----------+--------+------------+------------+
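To get closer to the single diff column asked for in the question, the same join idea can be combined with one when per compared field. This is a minimal sketch under stated assumptions: the rows are matched on id, the list `compared` and the message wording are illustrative, and an inner join is used, so records that exist on only one side are dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

// Assumption: a local session, created only so the example is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("field-diff").getOrCreate()
import spark.implicits._

val df = Seq((101, "kp", 28, "CHN")).toDF("id", "name", "age", "city")
val df2 = Seq((101, "kp", 28, "HYD")).toDF("id", "name", "age", "city")

// Fields to compare once the rows are matched on id.
val compared = Seq("name", "age", "city")
val joined = df.as("l").join(df2.as("r"), col("l.id") === col("r.id"))

// One message per differing field. <=> is null-safe equality; when() without
// otherwise() yields null, and concat_ws skips nulls.
val messages = compared.map(c => when(!(col(s"l.$c") <=> col(s"r.$c")), lit(s"$c is not matching")))
val withDiff = joined.select(col("l.*"), concat_ws(", ", messages: _*).as("diff"))

// Keep only the rows where at least one field differs.
val result = withDiff.filter(col("diff") =!= "")
result.show(false)
```

For the sample rows this produces a single record for id 101 with the diff text "city is not matching". If several fields differ, their messages are joined with commas.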