how to find total count of nulls present in a DataFrame using Scala - scala

I'm looking for a command in Scala to find the sum of nulls in all the columns present in a DataFrame.
In python I can get it using the command:
Df.isnull().sum()
Can you let me know the scala command for the same?

AFAIK, there is no spark function to do that but you can count the number of null values in each column and then compute the sum of all these values.
Here is sample data:
val df = spark.range(5).select(
when('id % 2 === 0, 'id) as "mod2",
when('id %3 === 0, 'id) as "mod3"
)
df.show
+----+----+
|mod2|mod3|
+----+----+
| 0| 0|
|null|null|
| 2|null|
|null| 3|
| 4|null|
+----+----+
And here is a solution:
val result = df.select(
df.columns
// null count per column
.map(c => sum(isnull(col(c)) cast "int"))
// computing the sum
.reduce(_+_) as "total_nulls"
)
result.show
+-----------+
|total_nulls|
+-----------+
| 5|
+-----------+
NB: use result.head.getAs[Long](0) to get the total out of the result dataframe.

Related

How to transform this dataset to the following dataset

Input
+------+------+------+------+
|emp_name|emp_area| dept|zip|
+------+------+------+------+
|ram|USA|"Sales"|805912|
|sham|USA|"Sales"|805912|
|ram|Canada|"Marketing"|805912|
|ram|USA|"Sales"|805912|
|sham|USA|"Marketing"|805912|
+------+------+------+------
Desired output
feature |Top1 name |Top 1 value1|Top2 name|top 2 value|
emp_name ram |3|sham |2
emp_area Usa |4|canada |1
dept sales|3|Marketing|3
zip 805912|5|NA|NA
I started with dynamically generating the count for each one of them but unable to store them in a dataset
val features=ds.columns.toList
for (e <- features) {
val ds1=ds.groupBy(e).count().sort(desc("count")).limit(5).withColumnRenamed("count", e+"_count")
}
Now how to collect all the values into one dataframe and transform to the output?
Here's a slightly verbose approach. You can map each column to a dataframe with one row, which corresponds to the row in the desired output. Add NA columns if necessary. Convert the column names to the desired ones, and finally do a unionAll to combine the dataframes (one row each).
import org.apache.spark.sql.expressions.Window
val top = 2
val result = ds.columns.map(
c => ds.groupBy(c).count()
.withColumn("rn", row_number().over(Window.orderBy(desc("count"))))
.filter(s"rn <= $top")
.groupBy().pivot("rn")
.agg(first(col(c)), first(col("count")))
.select(lit(c), col("*"))
).map(df =>
if (df.columns.size != 1 + top*2)
df.select(List(col("*")) ::: (1 to (top*2+1 - df.columns.size)).toList.map(x => lit("NA")): _*)
else df
).map(df =>
df.toDF(List("feature") ::: (1 to top).toList.flatMap(x => Seq(s"top$x name", s"top$x value")): _*)
).reduce(_ unionAll _)
result.show
+--------+---------+----------+---------+----------+
| feature|top1 name|top1 value|top2 name|top2 value|
+--------+---------+----------+---------+----------+
|emp_name| ram| 3| sham| 2|
|emp_area| USA| 4| Canada| 1|
| dept| Sales| 3|Marketing| 2|
| zip| 805912| 5| NA| NA|
+--------+---------+----------+---------+----------+

Pyspark: Delete rows on column condition after groupBy

This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by on col id which has both Y and N in the val and then remove the row where the column val contains "N".
Please help me resolve this issue as i am beginner to pyspark
you can first identify the problematic rows with a filter for val=="Y" and then join this dataframe back to the original one. Finally you can filter for Null values and for the rows you want to keep, e.g. val==Y. Pyspark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
df_new = spark.createDataFrame([
(1, "Y"), (1, "N"), (1,"X"), (1,"Z"),
(2,"a"), (2,"b"), (3,"N")
], ("id", "val"))
df_Y = df_new.filter(col("val")=="Y").withColumnRenamed("val","val_Y").withColumnRenamed("id","id_Y")
df_new = df_new.join(df_Y, df_new["id"]==df_Y["id_Y"],how="left")
df_new.filter((col("val_Y").isNull()) | ((col("val_Y")=="Y") & ~(col("val")=="N"))).select("id","val").show()
The result would be your preferred:
+---+---+
| id|val|
+---+---+
| 1| X|
| 1| Y|
| 1| Z|
| 3| N|
| 2| a|
| 2| b|
+---+---+

Pass Array[seq[String]] to UDF in spark scala

I am new to UDF in spark. I have also read the answer here
Problem statement: I'm trying to find pattern matching from a dataframe col.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2 as abs exits in 2nd row and 3rd row
Now I want to do this pattern matching for every row in column $text and add new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a udf passing $text column as Array[Seq[String]. But I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) //convert column to Array[Seq[String]
val valsum = udf((txt:Array[Seq[String],pattern:String)=> {txt.count(_ == pattern) } )
df.withColumn("newCol", valsum( lit(txt) ,df(text)) )).show()
Any help would be appreciated
You will have to know all the elements of text column which can be done using collect_list by grouping all the rows of your dataframe as one. Then just check if element in text column in the collected array and count them as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array : mutable.WrappedArray[String])=> array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
.withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
.withColumn("count", valsum($"text", $"array"))
.drop("grouping", "array")
.show(false)
You should have following output
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
I hope this is helpful.

How to combine where and groupBy in Spark's DataFrame?

How can I use aggregate functions in a where clause in Apache Spark 1.6?
Consider the following DataFrame
+---+------+
| id|letter|
+---+------+
| 1| a|
| 2| b|
| 3| b|
+---+------+
How can I select all rows where letter occurs more than once, i.e. the expected output would be
+---+------+
| id|letter|
+---+------+
| 2| b|
| 3| b|
+---+------+
This does obviously not work:
df.where(
df.groupBy($"letter").count()>1
)
My example its about count, but I'd like to be able to use other aggregate functions (the results thereof) as well.
EDIT:
Just for counting,I just came up with the following solution:
df.groupBy($"letter").agg(
collect_list($"id").as("ids")
)
.where(size($"ids") > 1)
.withColumn("id", explode($"ids"))
.drop($"ids")
You can use left semi join:
df.join(
broadcast(df.groupBy($"letter").count.where($"count" > 1)),
Seq("letter"),
"leftsemi"
)
or window functions:
import org.apache.spark.sql.expressions.Window
df
.withColumn("count", count($"*").over(Window.partitionBy("letter")))
.where($"count" > 1)
In Spark 2.0 or later you can Bloom filter but it is not available in 1.x

Spark dataframe filter both nulls and spaces

I have a spark dataframe for which I need to filter nulls and spaces for a particular column.
Lets say dataframe has two columns. col2 has both nulls and also blanks.
col1 col2
1 abc
2 null
3 null
4
5 def
I want to apply a filter out the records which have col2 as nulls or blanks.
Can any one please help on this.
Version:
Spark1.6.2
Scala 2.10
The standard logical operators are defined on Spark Columns:
scala> val myDF = Seq((1, "abc"),(2,null),(3,null),(4, ""),(5,"def")).toDF("col1", "col2")
myDF: org.apache.spark.sql.DataFrame = [col1: int, col2: string]
scala> myDF.show
+----+----+
|col1|col2|
+----+----+
| 1| abc|
| 2|null|
| 3|null|
| 4| |
| 5| def|
+----+----+
scala> myDF.filter(($"col2" =!= "") && ($"col2".isNotNull)).show
+----+----+
|col1|col2|
+----+----+
| 1| abc|
| 5| def|
+----+----+
Note: depending on your Spark version you will need !== or =!= (the latter is the more current option).
If you had n conditions to be met I would probably use a list to reduce the boolean columns together:
val conds = List(myDF("a").contains("x"), myDF("b") =!= "y", myDF("c") > 2)
val filtered = myDF.filter(conds.reduce(_&&_))