How to count the number of missing values in each row of a data frame - Spark Scala?

I want to count the number of missing values in each row of a data frame in spark scala.
Code:
val samplesqlDF = spark.sql("SELECT * FROM sampletable")
samplesqlDF.show()
Input Dataframe:
+------+-----+--------+-----------+
| name | age | degree | Place     |
+------+-----+--------+-----------+
| Ram  |     | MCA    | Bangalore |
|      | 25  |        |           |
|      | 26  | BE     |           |
| Raju | 21  | Btech  | Chennai   |
+------+-----+--------+-----------+
The output dataframe (row-level count) should be as follows:
+------+-----+--------+-----------+----------+
| name | age | degree | Place     | rowcount |
+------+-----+--------+-----------+----------+
| Ram  |     | MCA    | Bangalore | 1        |
|      | 25  |        |           | 3        |
|      | 26  | BE     |           | 2        |
| Raju | 21  | Btech  | Chennai   | 0        |
+------+-----+--------+-----------+----------+
I am a beginner in Scala and Spark. Thanks in advance.

Looks like you want to get the null count in a dynamic way. Check this out:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
df.show(false)
// Add a <column>_null indicator column (1 if null, 0 otherwise) for every column
val df2 = df.columns.foldLeft(df)((acc, c) => acc.withColumn(c + "_null", when(col(c).isNull, 1).otherwise(0)))
df2.createOrReplaceTempView("student")
// Build "name_null+age_null+... as null_count" and select it alongside the original columns
val sql_str_null = df.columns.map(x => x + "_null").mkString(" ", "+", " as null_count ")
val sql_str_full = df.columns.mkString("select ", ",", " , " + sql_str_null + " from student")
spark.sql(sql_str_full).show(false)
Output:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
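For reference, the SQL string built above expands to something like the following (whitespace approximate), so only the original columns plus null_count appear in the final select:
println(sql_str_full)
// select name,age,degree,Place , name_null+age_null+degree_null+Place_null as null_count from student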

Another possibility, which also checks for empty strings ("") and does not use foldLeft, just to demonstrate the point:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,""),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","place")
// Count, per row, the columns that are null or ""
val null_counter = Seq("name", "age", "degree", "place").map(x => when(col(x) === "" || col(x).isNull, 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("nulls_cnt", null_counter)
df2.show(false)
returns:
+----+----+------+---------+---------+
|name|age |degree|place |nulls_cnt|
+----+----+------+---------+---------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null | |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+---------+
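If you prefer not to hard-code the column names, the same expression can be built dynamically from df.columns (a minimal sketch of the same technique):
// Same null/"" counter, but derived from whatever columns the dataframe has
val null_counter_dyn = df.columns.map(c => when(col(c) === "" || col(c).isNull, 1).otherwise(0)).reduce(_ + _)
val df3 = df.withColumn("nulls_cnt", null_counter_dyn)
df3.show(false)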

A simplified version of the one suggested by #stack0114106 is:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),
    (null,"26","BE",null),("Raju","21","Btech","Chennai"))
  .toDF("name","age","degree","Place")
  .withColumn("null_count", lit(0))
// Walk over the columns, incrementing null_count whenever the current column is null
val df2 = df.columns.foldLeft(df)((df, c) =>
  df.withColumn("null_count",
    when(col(c).isNull, $"null_count" + 1).otherwise($"null_count")
  )
)
df2.show(false)
The output is:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+

Related

Create new columns from values of other columns in Scala Spark

I have an input dataframe:
inputDF=
+--------------------------+-----------------------------+
| info (String) | chars (Seq[String]) |
+--------------------------+-----------------------------+
|weight=100,height=70 | [weight,height] |
+--------------------------+-----------------------------+
|weight=92,skinCol=white | [weight,skinCol] |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |
+--------------------------+-----------------------------+
How do I get this dataframe as an output? I do not know in advance what strings the chars column contains.
outputDF=
+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String) | chars (Seq[String]) | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70 | [weight,height] | 100 | 70 | null |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white | [weight,skinCol] | 92 |null |white |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |null |null |white |gray |
+--------------------------+-----------------------------+-------+-------+-------+-------+
I would also like to save the following Seq[String] as a variable, but without using the .collect() function on the dataframes.
val aVariable: Seq[String] = Seq("weight", "height", "skinCol", "hairCol")
You can create another dataframe by pivoting on the keys of the info column, then join it back using an id column:
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)
val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)
// Split each key=value pair into its own row, then pivot the keys into columns
val pivotDf = df
  .withColumn("tmp", split(col("info"), ","))
  .withColumn("tmp", explode(col("tmp")))
  .withColumn("val1", split(col("tmp"), "=")(0))
  .withColumn("val2", split(col("tmp"), "=")(1)).select("id", "val1", "val2")
  .groupBy("id").pivot("val1").agg(first(col("val2")))
df.join(pivotDf, Seq("id"), "left").drop("id").show(false)
+--------------------------+------------------+-------+------+-------+------+
|info |chars |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70 |[weight, height] |null |70 |null |100 |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray |null |white |null |
|weight=92,skinCol=white |[weight, skinCol] |null |null |white |92 |
+--------------------------+------------------+-------+------+-------+------+
For your second question, you can get those values in a dataframe like this:
df.withColumn("tmp", explode(split(col("info"), ",")))
.withColumn("values", split(col("tmp"), "=")(0)).select("values").distinct().show()
+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+
But you cannot get them into a Seq variable without using collect; that is simply impossible, because materializing values into a local Scala collection requires bringing them to the driver.
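If bringing the distinct keys to the driver is acceptable after all, a minimal sketch would look like this (the variable name aVariable is taken from the question):
// Collects the distinct keys to the driver as a local Seq[String]
val aVariable: Seq[String] = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .select(split(col("tmp"), "=")(0).as("values"))
  .distinct()
  .as[String]
  .collect()
  .toSeq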

Combine multiple spark dataframe rows into one based on other columns i.e. applying CDC

I have data in the below form in my Spark dataframe.
+----+------+------+------+-----------+-------------------------+
| id | name | age | city | operation | update_time |
+----+------+------+------+-----------+-------------------------+
| 1 | jon | 12 | NULL | INSERT | 2021-10-11T16:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 1 | NULL | NULL | NY | UPDATE | 2021-10-11T17:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 1 | jack | NULL | NULL | UPDATE | 2021-10-11T18:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 2 | sam | 11 | TN | INSERT | 2021-10-11T18:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 3 | tim | NULL | CA | INSERT | 2021-10-11T16:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 3 | NULL | 33 | MT | UPDATE | 2021-10-11T17:11:00.378 |
+----+------+------+------+-----------+-------------------------+
I am looking for dataframe functions that can help me transform the data into the form below, but I didn't find anything. The closest I can think of is joins, but those work across multiple dataframes, and here I have only one. So how do I collapse the rows into one and include all the updated column values in that single row?
+----+------+-----+------+-------------------------+
| id | name | age | city | update_time |
+----+------+-----+------+-------------------------+
| 1 | jack | 12 | NY | 2021-10-11T18:11:00.378 |
+----+------+-----+------+-------------------------+
| 2 | sam | 11 | TN | 2021-10-11T18:11:00.378 |
+----+------+-----+------+-------------------------+
| 3 | tim | 33 | MT | 2021-10-11T17:11:00.378 |
+----+------+-----+------+-------------------------+
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameWindowFunc extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("DataFrame-WindowFunction")
    .getOrCreate()

  import spark.implicits._

  case class Data(
      id: Int,
      name: Option[String],
      age: Option[Int],
      city: Option[String],
      operation: String,
      updateTime: String
  )

  val sourceData = Seq(
    Data(1, Some("jon"), Some(12), None, "INSERT", "2021-10-11T16:11:00.378"),
    Data(1, None, None, Some("NY"), "UPDATE", "2021-10-11T17:11:00.378"),
    Data(1, Some("jack"), None, None, "UPDATE", "2021-10-11T18:11:00.378"),
    Data(
      2,
      Some("sam"),
      Some(11),
      Some("TN"),
      "INSERT",
      "2021-10-11T18:11:00.378"
    ),
    Data(3, Some("tim"), None, Some("CA"), "INSERT", "2021-10-11T16:11:00.378"),
    Data(3, None, Some(33), Some("MT"), "UPDATE", "2021-10-11T17:11:00.378")
  ).toDF

  sourceData.show(false)
  // +---+----+----+----+---------+-----------------------+
  // |id |name|age |city|operation|updateTime             |
  // +---+----+----+----+---------+-----------------------+
  // |1  |jon |12  |null|INSERT   |2021-10-11T16:11:00.378|
  // |1  |null|null|NY  |UPDATE   |2021-10-11T17:11:00.378|
  // |1  |jack|null|null|UPDATE   |2021-10-11T18:11:00.378|
  // |2  |sam |11  |TN  |INSERT   |2021-10-11T18:11:00.378|
  // |3  |tim |null|CA  |INSERT   |2021-10-11T16:11:00.378|
  // |3  |null|33  |MT  |UPDATE   |2021-10-11T17:11:00.378|
  // +---+----+----+----+---------+-----------------------+

  // Order by id and newest update first, then take the first non-null value
  // per column within each id (first(..., true) ignores nulls)
  val res = sourceData
    .orderBy(col("id"), col("updateTime").desc)
    .groupBy("id")
    .agg(
      first(col("name"), true).alias("name"),
      first(col("age"), true).alias("age"),
      first(col("city"), true).alias("city"),
      first(col("updateTime"), true).alias("update_time")
    )
    .orderBy(col("id"))

  res.show(false)
  // +---+----+---+----+-----------------------+
  // |id |name|age|city|update_time            |
  // +---+----+---+----+-----------------------+
  // |1  |jack|12 |NY  |2021-10-11T18:11:00.378|
  // |2  |sam |11 |TN  |2021-10-11T18:11:00.378|
  // |3  |tim |33 |MT  |2021-10-11T17:11:00.378|
  // +---+----+---+----+-----------------------+
}
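For comparison, here is a sketch of the same collapse using window functions on sourceData instead of relying on orderBy before groupBy; last(..., ignoreNulls = true) carries the newest non-null value per column within each id (variable names fillWin, latestWin and collapsed are mine):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val fillWin = Window
  .partitionBy("id")
  .orderBy("updateTime")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val latestWin = Window.partitionBy("id").orderBy(col("updateTime").desc)

val collapsed = sourceData
  // carry the most recent non-null value of each attribute across the id's rows
  .withColumn("name", last("name", ignoreNulls = true).over(fillWin))
  .withColumn("age", last("age", ignoreNulls = true).over(fillWin))
  .withColumn("city", last("city", ignoreNulls = true).over(fillWin))
  // keep only the newest row per id
  .withColumn("rn", row_number().over(latestWin))
  .where(col("rn") === 1)
  .select(col("id"), col("name"), col("age"), col("city"), col("updateTime").as("update_time"))
collapsed.show(false)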

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However the countries are either iso3 or iso2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can have either USA or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 & iso_3 columns.
Join your actual DataFrame with this DataFrame and apply your logic to that data.
Check the code below for a sample.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
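For completeness, a minimal lookup DataFrame matching the sample output above could be built inline in the spark-shell like this (values taken from the output shown; a full ISO country list would normally be loaded from a file):
import spark.implicits._
// Small hand-built lookup table: country_name with its ISO-2 and ISO-3 codes
val countryDF = Seq(
  ("United States of America", "US", "USA"),
  ("India", "IN", "IND"),
  ("Indonesia", "ID", "IDN"),
  ("Iraq", "IQ", "IRQ")
).toDF("country_name", "iso_2", "iso_3")
val df = Seq("USA", "US", "IN", "IND", "ID", "IDN", "IQ", "IRQ").toDF("country")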

Converting dataframe column into onehotencoder like columns

I am trying to find a way to convert a specific column into one-hot-encoded columns. For example:
+-------+----+
|Content|type|
+-------+----+
|alpha  |A   |
|beta   |B   |
|gamma  |C   |
|theta  |A   |
|zeta   |C   |
|neta   |B   |
+-------+----+
And what I am trying to do is the following:
+-------+------+------+------+
|Content|type_A|type_B|type_C|
+-------+------+------+------+
|alpha  |1     |0     |0     |
|beta   |0     |1     |0     |
|gamma  |0     |0     |1     |
|theta  |1     |0     |0     |
|zeta   |0     |0     |1     |
|neta   |0     |1     |0     |
+-------+------+------+------+
I think pivot is what you are looking for:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")
// Pivot the type values into columns, count occurrences, and fill missing combinations with 0
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)
Output:
+-------+---+---+---+
|Content|A |B |C |
+-------+---+---+---+
|neta |0 |1 |0 |
|beta |0 |1 |0 |
|gamma |0 |0 |1 |
|theta |1 |0 |0 |
|zeta |0 |0 |1 |
|alpha |1 |0 |0 |
+-------+---+---+---+
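If you need the exact type_A / type_B / type_C column names from the question, a small follow-up sketch renames the pivoted columns:
// Prefix every pivoted column (all except Content) with "type_"
val renamed = result.columns.foldLeft(result) { (acc, c) =>
  if (c == "Content") acc else acc.withColumnRenamed(c, s"type_$c")
}
renamed.show(false)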

SPARK-SCALA: Update End date for a ID with the new start_date for the updated respective ID

Using Spark Scala, I want to create a new column end_date for an id, holding the start_date value of the updated record for the same id.
Consider the following Data frame:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | b | 1/1/2018 |
| 3 | c | 1/1/2018 |
| 4 | d | 1/1/2018 |
| 1 | e | 10/1/2018|
+---+-----+----------+
Here, the initial start_date of id=1 is 1/1/2018 with value a, and on 10/1/2018 (start_date) the value of id=1 became e. So I have to populate a new column end_date, setting it to 10/1/2018 for the first record of id=1 and to NULL for all other records.
Result should be like below:
+---+-----+----------+---------+
| id|Value|start_date|end_date |
+---+-----+----------+---------+
| 1 | a | 1/1/2018 |10/1/2018|
| 2 | b | 1/1/2018 |NULL |
| 3 | c | 1/1/2018 |NULL |
| 4 | d | 1/1/2018 |NULL |
| 1 | e | 10/1/2018|NULL |
+---+-----+----------+---------+
I am using Spark 2.3.
Can anyone help me out here, please?
With the window function "lead":
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = List(
  (1, "a", "1/1/2018"),
  (2, "b", "1/1/2018"),
  (3, "c", "1/1/2018"),
  (4, "d", "1/1/2018"),
  (1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")
// For each id, end_date is the start_date of the next record (ordered by start_date)
val idWindow = Window.partitionBy($"id")
  .orderBy($"start_date")
val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))
result.show(false)
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
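Note that start_date is a string here, so the window is ordered lexicographically; if that is a concern, a small variant parses it to a real date first (the format string M/d/yyyy is assumed from the sample data):
// Parse the string to a date column for correct chronological ordering, then drop the helper
val parsed = df.withColumn("start_dt", to_date($"start_date", "M/d/yyyy"))
val dateWindow = Window.partitionBy($"id").orderBy($"start_dt")
val result2 = parsed
  .withColumn("end_date", lead($"start_date", 1).over(dateWindow))
  .drop("start_dt")
result2.show(false)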