Drop rows in Spark which don't follow the schema - Scala

Currently, the schema for my table is:
root
|-- product_id: integer (nullable = true)
|-- product_name: string (nullable = true)
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
I want to apply the schema below to the above table and delete all rows which do not conform to it:
val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))

Use the "DROPMALFORMED" mode option while loading the data; it ignores corrupted records.
spark.read.format("json")
  .option("mode", "DROPMALFORMED")
  .schema(productsSchema)
  .load("sample.json")

If the data does not match the schema, Spark will put null as the value in that column. We just have to filter out the rows that are null in all columns.
I used filter to drop the rows where every column is `null`.
scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0
scala> schema.printTreeString
root
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
|-- product_id: long (nullable = true)
|-- product_name: string (nullable = true)
scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Load the JSON data; rows that do not match the schema come back with null in every column.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]
scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
|null    |null         |null      |null        |
+--------+-------------+----------+------------+
scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Drop rows where every column is null.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA      |AAD          |1         |sampleA     |
|AAB     |AADB         |2         |sampleBB    |
|CC      |CCC          |3         |sampleCC    |
|dd      |null         |3         |sampledd    |
+--------+-------------+----------+------------+
scala>
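If you only want rows that fully match the target schema (every column parsed successfully), a stricter variant of the same idea is to require all columns to be non-null. A minimal sketch, reusing the df loaded above:

// Same filter as above, but joined with "and": keep a row only if every column is non-null.
df.filter(df.columns.map(c => s"$c is not null").mkString(" and ")).show(false)

// Equivalent shorthand: na.drop("any") drops a row if any column is null.
df.na.drop("any").show(false)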

Also check out the na.drop functions on DataFrame; you can drop rows based on null values, on a minimum number of non-null values per row, and also based on nulls in specific columns.
scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]
scala> res7.show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
//drops a row if any null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
//with `minNonNulls = 3`, drops a row only if it has fewer than 3 non-null values
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
//drops nothing here, since every row has at least 2 non-null values
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
//drops rows that have a null in the `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+

Related

How to convert rdd object to dataframe in Scala

I read data from Elasticsearch and save it into an RDD.
val es_rdd = sc.esRDD("indexname/typename",query="?q=*")
The RDD has the following example data:
(uniqueId,Map(field -> value))
(uniqueId2,Map(field2 -> value2))
How can I convert this RDD (String, Map) to a DataFrame (String, String, String)?
You can use explode to achieve it.
import spark.implicits._
import org.apache.spark.sql.functions._
val rdd = sc.range(1, 10).map(s => (s, Map(s -> s)))
val ds = spark.createDataset(rdd)
val df = ds.toDF()
df.printSchema()
df.show()
df.select('_1,explode('_2)).show()
output:
root
|-- _1: long (nullable = false)
|-- _2: map (nullable = true)
| |-- key: long
| |-- value: long (valueContainsNull = false)
+---+--------+
| _1| _2|
+---+--------+
| 1|[1 -> 1]|
| 2|[2 -> 2]|
| 3|[3 -> 3]|
| 4|[4 -> 4]|
| 5|[5 -> 5]|
| 6|[6 -> 6]|
| 7|[7 -> 7]|
| 8|[8 -> 8]|
| 9|[9 -> 9]|
+---+--------+
+---+---+-----+
| _1|key|value|
+---+---+-----+
| 1| 1| 1|
| 2| 2| 2|
| 3| 3| 3|
| 4| 4| 4|
| 5| 5| 5|
| 6| 6| 6|
| 7| 7| 7|
| 8| 8| 8|
| 9| 9| 9|
+---+---+-----+
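For the RDD from the question, the same explode approach can be applied directly. A rough sketch, assuming es_rdd is an RDD[(String, Map[String, String])] (if the map values are not strings, convert them to strings first):
import spark.implicits._
import org.apache.spark.sql.functions.explode

// Name the tuple fields, then explode the map into key/value columns.
val esDf = es_rdd.toDF("uniqueId", "fields")
esDf.select($"uniqueId", explode($"fields")).show(false) // columns: uniqueId, key, value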
In the end I read directly into Spark SQL format using the following call to Elasticsearch:
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", "?q=*")
.option("pushdown", "true")
.load("indexname/typename")

spark aggregation count on condition

I'm trying to group a data frame, and when aggregating the rows with a count, I want to apply a condition on the rows before counting.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count the values in column _2 where the value = 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when to get this aggregation. A PySpark solution is shown here, followed by the Scala equivalent.
from pyspark.sql.functions import col, count, when
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
import spark.implicits._
import org.apache.spark.sql.functions.{count, when}
val test = Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2" === "X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative, in Scala, it can be done like this:
import org.apache.spark.sql.functions.{col, lit, sum, when}
val counter1 = test.select(col("_1"),
  when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
which gives:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
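If you prefer plain SQL, the same conditional count can be written with CASE WHEN. A sketch, registering test as a temp view first:
test.createOrReplaceTempView("test")
spark.sql("""
  SELECT _1, count(CASE WHEN _2 = 'X' THEN 1 END) AS cnt
  FROM test
  GROUP BY _1
  ORDER BY _1
""").show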

Drop duplicates if reverse is present between two columns

I have a dataframe (around 20,000,000 rows) and I'd like to drop duplicates based on two columns if those columns have the same values, or even if those values are in reverse order.
For example the original dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 1| B|
| 2| 1| C|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
| 4| 3| G|
+----+----+----+
where the schema of the columns is as follows:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
The desired dataframe should look like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| A|
| 1| 2| D|
| 3| 5| E|
| 3| 4| F|
+----+----+----+
The dropDuplicates() method only removes duplicates when the values are in the same order.
I followed the accepted answer to the question "Pandas: remove reverse duplicates from dataframe", but it took too much time.
You can use this. Note: in 'col3', 'D' will be removed instead of 'C', because 'C' is positioned before 'D'. Hope this helps.
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack2.csv', header = 'True')
df2 = df.select(F.least(df.col1,df.col2).alias('col1'),F.greatest(df.col1,df.col2).alias('col2'),df.col3)
df2.dropDuplicates(['col1','col2']).show()
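Since the rest of this thread is Scala, here is roughly the same least/greatest trick in Scala. A sketch, assuming df is the DataFrame with the string columns col1, col2 and col3 from the question (least/greatest compare strings lexicographically, which is fine for the single-character values here but would need a numeric cast otherwise):
import spark.implicits._
import org.apache.spark.sql.functions.{greatest, least}

// Normalise the pair so (2, 1) and (1, 2) produce the same key, then de-duplicate on it.
val deduped = df
  .select(least($"col1", $"col2").as("col1"), greatest($"col1", $"col2").as("col2"), $"col3")
  .dropDuplicates("col1", "col2")
deduped.show()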

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
Field1     Field2
           AA
12         BB
This command does not produce the expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1     Field2
Anonymous  AA
12         BB
You can also try this. It should handle blank/empty strings as well as null:
df.show()
+------+------+
|Field1|Field2|
+------+------+
|      |    AA|
|    12|    BB|
|    12|  null|
+------+------+
df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
An empty string is not null or NaN, so you'll have to use a case statement for that.
Fill does not seem to work when given a text value for a numeric column.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
You can try using the code below when you have any number of columns in the dataframe.
Note: when writing data to formats like Parquet, null-typed columns are not supported; we have to cast them.
import spark.implicits._
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val df = Seq(
  (1, ""),
  (2, "Ram"),
  (3, "Sam"),
  (4, "")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
|  1|    |   null|
|  2| Ram|   null|
|  3| Sam|   null|
|  4|    |   null|
+---+----+-------+
//Replace all blank values across all columns (here replaced with the string "null")
val colName = inputDf.columns //*This will give you array of string*
val data = inputDf.na.replace(colName,Map(""->"null"))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
|  1|null|   null|
|  2| Ram|   null|
|  3| Sam|   null|
|  4|null|   null|
+---+----+-------+
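An alternative that also works for any number of columns is to fold over them and replace empty values with a default using when/otherwise. A rough sketch, assuming df is your input DataFrame and the columns you want to clean are string-typed (numeric columns would be coerced to string by the when/otherwise):
import org.apache.spark.sql.functions.{col, lit, trim, when}

// Replace empty or whitespace-only values with "Anonymous" in every column.
val cleaned = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(trim(col(c)) === "", lit("Anonymous")).otherwise(col(c)))
}
cleaned.show(false)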

How to write a nested query?

I have the following table:
+-----+---+----+
|type | t |code|
+-----+---+----+
| A| 25| 11|
| A| 55| 42|
| B| 88| 11|
| A|114| 11|
| B|220| 58|
| B|520| 11|
+-----+---+----+
And what I want:
+-----+---+----+
|t1 | t2|code|
+-----+---+----+
| 25| 88| 11|
| 114|520| 11|
+-----+---+----+
There are two types of events, A and B.
Event A is the start, event B is the end.
I want to connect each start with the next end that has the same code.
It's quite easy in SQL to do this:
SELECT a.t AS t1,
(SELECT b.t FROM events AS b WHERE a.code == b.code AND a.t < b.t LIMIT 1) AS t2, a.code AS code
FROM events AS a
But I have a problem implementing this in Spark because it looks like this kind of nested query isn't supported...
I tried it with:
df.createOrReplaceTempView("events")
val sqlDF = spark.sql(/* SQL-query above */)
The error I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in:
Do you have any other ideas to solve that problem?
It's quite easy in SQL to do this
And so is in Spark SQL, luckily.
val events = ...
scala> events.show
+----+---+----+
|type| t|code|
+----+---+----+
| A| 25| 11|
| A| 55| 42|
| B| 88| 11|
| A|114| 11|
| B|220| 58|
| B|520| 11|
+----+---+----+
// assumed that t is int
scala> events.printSchema
root
|-- type: string (nullable = true)
|-- t: integer (nullable = true)
|-- code: integer (nullable = true)
val eventsA = events.
where($"type" === "A").
as("a")
val eventsB = events.
where($"type" === "B").
as("b")
val solution = eventsA.
join(eventsB, "code").
where($"a.t" < $"b.t").
select($"a.t" as "t1", $"b.t" as "t2", $"a.code").
orderBy($"t1".asc, $"t2".asc).
dropDuplicates("t1", "code").
orderBy($"t1".asc)
That should give you the requested output.
scala> solution.show
+---+---+----+
| t1| t2|code|
+---+---+----+
| 25| 88| 11|
|114|520| 11|
+---+---+----+
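If you would rather stay in SQL, the correlated subquery can also be rewritten as a self-join plus an aggregate, which Spark SQL does support. A sketch over the same events dataset:
events.createOrReplaceTempView("events")
spark.sql("""
  SELECT a.t AS t1, min(b.t) AS t2, a.code
  FROM events a
  JOIN events b
    ON a.code = b.code AND a.t < b.t
  WHERE a.`type` = 'A' AND b.`type` = 'B'
  GROUP BY a.t, a.code
  ORDER BY t1
""").show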