How to count the number of missing values in each row of a data frame - Spark Scala?

I want to count the number of missing values in each row of a data frame in spark scala.
Code:
val samplesqlDF = spark.sql("SELECT * FROM sampletable")
samplesqlDF.show()
Input Dataframe:
+------+-----+--------+-----------+
| name | age | degree | Place     |
+------+-----+--------+-----------+
| Ram  |     | MCA    | Bangalore |
|      | 25  |        |           |
|      | 26  | BE     |           |
| Raju | 21  | Btech  | Chennai   |
+------+-----+--------+-----------+
The output dataframe (row-level count) should be as follows:
+------+-----+--------+-----------+----------+
| name | age | degree | Place     | rowcount |
+------+-----+--------+-----------+----------+
| Ram  |     | MCA    | Bangalore | 1        |
|      | 25  |        |           | 3        |
|      | 26  | BE     |           | 2        |
| Raju | 21  | Btech  | Chennai   | 0        |
+------+-----+--------+-----------+----------+
I am a beginner in Scala and Spark. Thanks in advance.

Looks like you want to get the null count in a dynamic way. Check this out:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
df.show(false)
// Add a <column>_null indicator column (1 if null, 0 otherwise) for every column
val df2 = df.columns.foldLeft(df)((acc, c) => acc.withColumn(c + "_null", when(col(c).isNull, 1).otherwise(0)))
df2.createOrReplaceTempView("student")
// Build "name_null+age_null+... as null_count" and select it alongside the original columns
val sql_str_null = df.columns.map(x => x + "_null").mkString(" ", "+", " as null_count ")
val sql_str_full = df.columns.mkString("select ", ",", " , " + sql_str_null + " from student")
spark.sql(sql_str_full).show(false)
Output:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
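For reference, the SQL string built above expands to something like the following (whitespace approximate), so only the original columns plus null_count appear in the final select:
println(sql_str_full)
// select name,age,degree,Place , name_null+age_null+degree_null+Place_null as null_count from student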

Another possibility, which also checks for empty strings ("") and does not use foldLeft, just to demonstrate the point:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,""),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","place")
// Count, per row, the columns that are null or ""
val null_counter = Seq("name", "age", "degree", "place").map(x => when(col(x) === "" || col(x).isNull, 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("nulls_cnt", null_counter)
df2.show(false)
returns:
+----+----+------+---------+---------+
|name|age |degree|place |nulls_cnt|
+----+----+------+---------+---------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null | |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+---------+
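If you prefer not to hard-code the column names, the same expression can be built dynamically from df.columns (a minimal sketch of the same technique):
// Same null/"" counter, but derived from whatever columns the dataframe has
val null_counter_dyn = df.columns.map(c => when(col(c) === "" || col(c).isNull, 1).otherwise(0)).reduce(_ + _)
val df3 = df.withColumn("nulls_cnt", null_counter_dyn)
df3.show(false)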

A simplified version of the one suggested by #stack0114106 is:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),
    (null,"26","BE",null),("Raju","21","Btech","Chennai"))
  .toDF("name","age","degree","Place")
  .withColumn("null_count", lit(0))
// Walk over the columns, incrementing null_count whenever the current column is null
val df2 = df.columns.foldLeft(df)((df, c) =>
  df.withColumn("null_count",
    when(col(c).isNull, $"null_count" + 1).otherwise($"null_count")
  )
)
df2.show(false)
The output is:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+

Related

Create new columns from values of other columns in Scala Spark

I have an input dataframe:
inputDF=
+--------------------------+-----------------------------+
| info (String) | chars (Seq[String]) |
+--------------------------+-----------------------------+
|weight=100,height=70 | [weight,height] |
+--------------------------+-----------------------------+
|weight=92,skinCol=white | [weight,skinCol] |
+--------------------------+-----------------------------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |
+--------------------------+-----------------------------+
How do I get this dataframe as an output? I do not know in advance what strings the chars column contains.
outputDF=
+--------------------------+-----------------------------+-------+-------+-------+-------+
| info (String) | chars (Seq[String]) | weight|height |skinCol|hairCol|
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=100,height=70 | [weight,height] | 100 | 70 | null |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|weight=92,skinCol=white | [weight,skinCol] | 92 |null |white |null |
+--------------------------+-----------------------------+-------+-------+-------+-------+
|hairCol=gray,skinCol=white| [hairCol,skinCol] |null |null |white |gray |
+--------------------------+-----------------------------+-------+-------+-------+-------+
I would also like to save the following Seq[String] as a variable, but without using the .collect() function on the dataframes.
val aVariable: Seq[String] = Seq("weight", "height", "skinCol", "hairCol")
You can create another dataframe by pivoting on the keys of the info column, then join it back using an id column:
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq(
  ("weight=100,height=70", Seq("weight", "height")),
  ("weight=92,skinCol=white", Seq("weight", "skinCol")),
  ("hairCol=gray,skinCol=white", Seq("hairCol", "skinCol"))
)
val df = spark.sparkContext.parallelize(data).toDF("info", "chars")
  .withColumn("id", monotonically_increasing_id() + 1)
// Split each key=value pair into its own row, then pivot the keys into columns
val pivotDf = df
  .withColumn("tmp", split(col("info"), ","))
  .withColumn("tmp", explode(col("tmp")))
  .withColumn("val1", split(col("tmp"), "=")(0))
  .withColumn("val2", split(col("tmp"), "=")(1)).select("id", "val1", "val2")
  .groupBy("id").pivot("val1").agg(first(col("val2")))
df.join(pivotDf, Seq("id"), "left").drop("id").show(false)
+--------------------------+------------------+-------+------+-------+------+
|info |chars |hairCol|height|skinCol|weight|
+--------------------------+------------------+-------+------+-------+------+
|weight=100,height=70 |[weight, height] |null |70 |null |100 |
|hairCol=gray,skinCol=white|[hairCol, skinCol]|gray |null |white |null |
|weight=92,skinCol=white |[weight, skinCol] |null |null |white |92 |
+--------------------------+------------------+-------+------+-------+------+
For your second question, you can get those values in a dataframe like this:
df.withColumn("tmp", explode(split(col("info"), ",")))
.withColumn("values", split(col("tmp"), "=")(0)).select("values").distinct().show()
+-------+
| values|
+-------+
| height|
|hairCol|
|skinCol|
| weight|
+-------+
But you cannot get them into a Seq variable without using collect; that is simply impossible, because materializing values into a local Scala collection requires bringing them to the driver.
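If bringing the distinct keys to the driver is acceptable after all, a minimal sketch would look like this (the variable name aVariable is taken from the question):
// Collects the distinct keys to the driver as a local Seq[String]
val aVariable: Seq[String] = df
  .withColumn("tmp", explode(split(col("info"), ",")))
  .select(split(col("tmp"), "=")(0).as("values"))
  .distinct()
  .as[String]
  .collect()
  .toSeq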

Combine multiple spark dataframe rows into one based on other columns i.e. applying CDC

I have data in the below form in my Spark dataframe.
+----+------+------+------+-----------+-------------------------+
| id | name | age | city | operation | update_time |
+----+------+------+------+-----------+-------------------------+
| 1 | jon | 12 | NULL | INSERT | 2021-10-11T16:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 1 | NULL | NULL | NY | UPDATE | 2021-10-11T17:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 1 | jack | NULL | NULL | UPDATE | 2021-10-11T18:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 2 | sam | 11 | TN | INSERT | 2021-10-11T18:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 3 | tim | NULL | CA | INSERT | 2021-10-11T16:11:00.378 |
+----+------+------+------+-----------+-------------------------+
| 3 | NULL | 33 | MT | UPDATE | 2021-10-11T17:11:00.378 |
+----+------+------+------+-----------+-------------------------+
I am looking for dataframe functions that can help me transform the data into the form below, but I didn't find anything. The closest I can think of is joins, but those work across multiple dataframes, and here I have only one. So how do I collapse the rows into one and include all the updated column values in that single row?
+----+------+-----+------+-------------------------+
| id | name | age | city | update_time |
+----+------+-----+------+-------------------------+
| 1 | jack | 12 | NY | 2021-10-11T18:11:00.378 |
+----+------+-----+------+-------------------------+
| 2 | sam | 11 | TN | 2021-10-11T18:11:00.378 |
+----+------+-----+------+-------------------------+
| 3 | tim | 33 | MT | 2021-10-11T17:11:00.378 |
+----+------+-----+------+-------------------------+
package spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameWindowFunc extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("DataFrame-WindowFunction")
    .getOrCreate()

  import spark.implicits._

  case class Data(
      id: Int,
      name: Option[String],
      age: Option[Int],
      city: Option[String],
      operation: String,
      updateTime: String
  )

  val sourceData = Seq(
    Data(1, Some("jon"), Some(12), None, "INSERT", "2021-10-11T16:11:00.378"),
    Data(1, None, None, Some("NY"), "UPDATE", "2021-10-11T17:11:00.378"),
    Data(1, Some("jack"), None, None, "UPDATE", "2021-10-11T18:11:00.378"),
    Data(
      2,
      Some("sam"),
      Some(11),
      Some("TN"),
      "INSERT",
      "2021-10-11T18:11:00.378"
    ),
    Data(3, Some("tim"), None, Some("CA"), "INSERT", "2021-10-11T16:11:00.378"),
    Data(3, None, Some(33), Some("MT"), "UPDATE", "2021-10-11T17:11:00.378")
  ).toDF

  sourceData.show(false)
  // +---+----+----+----+---------+-----------------------+
  // |id |name|age |city|operation|updateTime             |
  // +---+----+----+----+---------+-----------------------+
  // |1  |jon |12  |null|INSERT   |2021-10-11T16:11:00.378|
  // |1  |null|null|NY  |UPDATE   |2021-10-11T17:11:00.378|
  // |1  |jack|null|null|UPDATE   |2021-10-11T18:11:00.378|
  // |2  |sam |11  |TN  |INSERT   |2021-10-11T18:11:00.378|
  // |3  |tim |null|CA  |INSERT   |2021-10-11T16:11:00.378|
  // |3  |null|33  |MT  |UPDATE   |2021-10-11T17:11:00.378|
  // +---+----+----+----+---------+-----------------------+

  // Order by id and newest update first, then take the first non-null value
  // per column within each id (first(..., true) ignores nulls)
  val res = sourceData
    .orderBy(col("id"), col("updateTime").desc)
    .groupBy("id")
    .agg(
      first(col("name"), true).alias("name"),
      first(col("age"), true).alias("age"),
      first(col("city"), true).alias("city"),
      first(col("updateTime"), true).alias("update_time")
    )
    .orderBy(col("id"))

  res.show(false)
  // +---+----+---+----+-----------------------+
  // |id |name|age|city|update_time            |
  // +---+----+---+----+-----------------------+
  // |1  |jack|12 |NY  |2021-10-11T18:11:00.378|
  // |2  |sam |11 |TN  |2021-10-11T18:11:00.378|
  // |3  |tim |33 |MT  |2021-10-11T17:11:00.378|
  // +---+----+---+----+-----------------------+
}
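For comparison, here is a sketch of the same collapse using window functions on sourceData instead of relying on orderBy before groupBy; last(..., ignoreNulls = true) carries the newest non-null value per column within each id (variable names fillWin, latestWin and collapsed are mine):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val fillWin = Window
  .partitionBy("id")
  .orderBy("updateTime")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val latestWin = Window.partitionBy("id").orderBy(col("updateTime").desc)

val collapsed = sourceData
  // carry the most recent non-null value of each attribute across the id's rows
  .withColumn("name", last("name", ignoreNulls = true).over(fillWin))
  .withColumn("age", last("age", ignoreNulls = true).over(fillWin))
  .withColumn("city", last("city", ignoreNulls = true).over(fillWin))
  // keep only the newest row per id
  .withColumn("rn", row_number().over(latestWin))
  .where(col("rn") === 1)
  .select(col("id"), col("name"), col("age"), col("city"), col("updateTime").as("update_time"))
collapsed.show(false)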

Scala group by with mapped keys

I have a DataFrame that has a list of countries and the corresponding data. However the countries are either iso3 or iso2.
dfJSON
.select("value.country")
.filter(size($"value.country") > 0)
.groupBy($"country")
.agg(count("*").as("cnt"));
Now this country field can have either USA or US as the country code. I need to map both USA / US ==> "United States" and then do a groupBy. How do I do this in Scala?
Create another DataFrame with country_name, iso_2 & iso_3 columns.
Join your actual DataFrame with this DataFrame and apply your logic to that data.
Check the code below for a sample.
scala> countryDF.show(false)
+-------------------+-----+-----+
|country_name |iso_2|iso_3|
+-------------------+-----+-----+
|Afghanistan |AF |AFG |
|Åland Islands |AX |ALA |
|Albania |AL |ALB |
|Algeria |DZ |DZA |
|American Samoa |AS |ASM |
|Andorra |AD |AND |
|Angola |AO |AGO |
|Anguilla |AI |AIA |
|Antarctica |AQ |ATA |
|Antigua and Barbuda|AG |ATG |
|Argentina |AR |ARG |
|Armenia |AM |ARM |
|Aruba |AW |ABW |
|Australia |AU |AUS |
|Austria |AT |AUT |
|Azerbaijan |AZ |AZE |
|Bahamas |BS |BHS |
|Bahrain |BH |BHR |
|Bangladesh |BD |BGD |
|Barbados |BB |BRB |
+-------------------+-----+-----+
only showing top 20 rows
scala> df.show(false)
+-------+
|country|
+-------+
|USA |
|US |
|IN |
|IND |
|ID |
|IDN |
|IQ |
|IRQ |
+-------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.show(false)
+-------+------------------------+
|country|country_name |
+-------+------------------------+
|USA |United States of America|
|US |United States of America|
|IN |India |
|IND |India |
|ID |Indonesia |
|IDN |Indonesia |
|IQ |Iraq |
|IRQ |Iraq |
+-------+------------------------+
scala> df
.join(countryDF,(df("country") === countryDF("iso_2") || df("country") === countryDF("iso_3")),"left")
.select(df("country"),countryDF("country_name"))
.groupBy($"country_name")
.agg(collect_list($"country").as("country_code"),count("*").as("country_count"))
.show(false)
+------------------------+------------+-------------+
|country_name |country_code|country_count|
+------------------------+------------+-------------+
|Iraq |[IQ, IRQ] |2 |
|India |[IN, IND] |2 |
|United States of America|[USA, US] |2 |
|Indonesia |[ID, IDN] |2 |
+------------------------+------------+-------------+
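For completeness, a minimal lookup DataFrame matching the sample output above could be built inline in the spark-shell like this (values taken from the output shown; a full ISO country list would normally be loaded from a file):
import spark.implicits._
// Small hand-built lookup table: country_name with its ISO-2 and ISO-3 codes
val countryDF = Seq(
  ("United States of America", "US", "USA"),
  ("India", "IN", "IND"),
  ("Indonesia", "ID", "IDN"),
  ("Iraq", "IQ", "IRQ")
).toDF("country_name", "iso_2", "iso_3")
val df = Seq("USA", "US", "IN", "IND", "ID", "IDN", "IQ", "IRQ").toDF("country")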

Converting dataframe column into onehotencoder like columns

I am trying to find a way to convert a specific column into one-hot-encoded columns. For example:
+-------+----+
|Content|type|
+-------+----+
|alpha  |A   |
|beta   |B   |
|gamma  |C   |
|theta  |A   |
|zeta   |C   |
|neta   |B   |
+-------+----+
And what I am trying to do is the following:
+-------+------+------+------+
|Content|type_A|type_B|type_C|
+-------+------+------+------+
|alpha  |1     |0     |0     |
|beta   |0     |1     |0     |
|gamma  |0     |0     |1     |
|theta  |1     |0     |0     |
|zeta   |0     |0     |1     |
|neta   |0     |1     |0     |
+-------+------+------+------+
I think pivot is what you are looking for:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")
// Pivot the type values into columns, count occurrences, and fill missing combinations with 0
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)
Output:
+-------+---+---+---+
|Content|A |B |C |
+-------+---+---+---+
|neta |0 |1 |0 |
|beta |0 |1 |0 |
|gamma |0 |0 |1 |
|theta |1 |0 |0 |
|zeta |0 |0 |1 |
|alpha |1 |0 |0 |
+-------+---+---+---+
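If you need the exact type_A / type_B / type_C column names from the question, a small follow-up sketch renames the pivoted columns:
// Prefix every pivoted column (all except Content) with "type_"
val renamed = result.columns.foldLeft(result) { (acc, c) =>
  if (c == "Content") acc else acc.withColumnRenamed(c, s"type_$c")
}
renamed.show(false)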

SPARK-SCALA: Update End date for a ID with the new start_date for the updated respective ID

Using Spark Scala, I want to create a new column end_date for an id, holding the start_date value of the updated record for the same id.
Consider the following Data frame:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
| 1 | a | 1/1/2018 |
| 2 | b | 1/1/2018 |
| 3 | c | 1/1/2018 |
| 4 | d | 1/1/2018 |
| 1 | e | 10/1/2018|
+---+-----+----------+
Here, the initial start_date of id=1 is 1/1/2018 with value a, and on 10/1/2018 (start_date) the value of id=1 became e. So I have to populate a new column end_date, setting it to 10/1/2018 for the first record of id=1 and to NULL for all other records.
Result should be like below:
+---+-----+----------+---------+
| id|Value|start_date|end_date |
+---+-----+----------+---------+
| 1 | a | 1/1/2018 |10/1/2018|
| 2 | b | 1/1/2018 |NULL |
| 3 | c | 1/1/2018 |NULL |
| 4 | d | 1/1/2018 |NULL |
| 1 | e | 10/1/2018|NULL |
+---+-----+----------+---------+
I am using Spark 2.3.
Can anyone help me out here, please?
With the window function "lead":
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = List(
  (1, "a", "1/1/2018"),
  (2, "b", "1/1/2018"),
  (3, "c", "1/1/2018"),
  (4, "d", "1/1/2018"),
  (1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")
// For each id, end_date is the start_date of the next record (ordered by start_date)
val idWindow = Window.partitionBy($"id")
  .orderBy($"start_date")
val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))
result.show(false)
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
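Note that start_date is a string here, so the window is ordered lexicographically; if that is a concern, a small variant parses it to a real date first (the format string M/d/yyyy is assumed from the sample data):
// Parse the string to a date column for correct chronological ordering, then drop the helper
val parsed = df.withColumn("start_dt", to_date($"start_date", "M/d/yyyy"))
val dateWindow = Window.partitionBy($"id").orderBy($"start_dt")
val result2 = parsed
  .withColumn("end_date", lead($"start_date", 1).over(dateWindow))
  .drop("start_dt")
result2.show(false)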