Casting the Dataframe columns with validation in spark - scala

I need to cast the columns of a DataFrame, which currently hold all values as strings, to the data types defined in a schema.
While casting, corrupt records (values that do not match the target data type) must be put into a separate column.
Example DataFrame:
+---+----------+-----+
|id |name |class|
+---+----------+-----+
|1 |abc |21 |
|2 |bca |32 |
|3 |abab | 4 |
|4 |baba |5a |
|5 |cccca | |
+---+----------+-----+
JSON schema of the file:
{"definitions":{},"$schema":"http://json-schema.org/draft-07/schema#","$id":"http://example.com/root.json","type":["object","null"],"required":["id","name","class"],"properties":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}}
Here row 4 is a corrupt record, because the class column is of type integer and "5a" cannot be cast to it.
Only this record should be reported as corrupt, not row 5 (whose class is empty).

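Just check that the value is NOT NULL before casting and NULL after casting:
import org.apache.spark.sql.functions.when
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

df
  // the cast yields NULL when the string is not a valid integer
  .withColumn("class_integer", $"class".cast("integer"))
  .withColumn(
    "class_corrupted",
    // keep the original value only if it was present but failed to cast
    when($"class".isNotNull and $"class_integer".isNull, $"class"))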
Repeat for each column / cast you need.
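To repeat the check for several columns at once, one option is to drive the casts from a list of (column, target type) pairs and gather the names of the offending columns into a single column. This is only a minimal sketch under a few assumptions: the names targetTypes and corrupt_columns are made up for illustration, blank strings are treated like NULL so that row 5 of the example is not flagged, and ANSI mode is switched off, because with spark.sql.ansi.enabled set to true an invalid cast raises an error instead of returning NULL.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, lit, trim, when}

// Local session for the sketch; ANSI mode is off so that a failed cast yields NULL
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-with-validation")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("1", "abc", "21"), ("2", "bca", "32"), ("3", "abab", "4"),
  ("4", "baba", "5a"), ("5", "cccca", "")
).toDF("id", "name", "class")

// Hypothetical mapping of column name -> target type (would be derived from the JSON schema)
val targetTypes = Seq("id" -> "integer", "class" -> "integer")

// Column name if the value is present (non-null, non-blank) but does not cast; NULL otherwise
val corruptFlags = targetTypes.map { case (name, dataType) =>
  val present = col(name).isNotNull && (trim(col(name)) =!= "")
  when(present && col(name).cast(dataType).isNull, lit(name))
}

// Cast the listed columns and pass the others through unchanged
val typeOf = targetTypes.toMap
val castedColumns = df.columns.toSeq.map { name =>
  typeOf.get(name).map(t => col(name).cast(t).as(name)).getOrElse(col(name))
}

// concat_ws skips NULL inputs, so corrupt_columns is "" for clean rows
val allColumns = castedColumns :+ concat_ws(",", corruptFlags: _*).as("corrupt_columns")
val result = df.select(allColumns: _*)
result.show(false)
```

Because every expression in the select refers to the original string columns, the validity check is not affected by the casts performed in the same select.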

Related

Counting distinct values for a given column partitioned by a window function, without using approx_count_distinct()

I have the following dataframe:
val df1 = Seq(("Roger","Rabbit", "ABC123"), ("Roger","Rabit", "ABC123"),("Roger","Rabbit", "ABC123"), ("Trevor","Philips","XYZ987"), ("Trevor","Philips","XYZ987")).toDF("first_name", "last_name", "record")
+----------+---------+------+
|first_name|last_name|record|
+----------+---------+------+
|Roger |Rabbit |ABC123|
|Roger |Rabit |ABC123|
|Roger |Rabbit |ABC123|
|Trevor |Philips |XYZ987|
|Trevor |Philips |XYZ987|
+----------+---------+------+
I want to group the records in this dataframe by the column record, and then look for anomalies in the fields first_name and last_name, which should remain constant for all records with the same record value.
The best approach I found so far is using approx_count_distinct:
val wind_person = Window.partitionBy("record")
df1.withColumn("unique_fields",concat($"first_name",$"last_name"))
.withColumn("anomaly",approx_count_distinct($"unique_fields") over wind_person)
.show(false)
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabit |ABC123|RogerRabit |2 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
+----------+---------+------+-------------+-------+
An anomaly is detected when the anomaly column is greater than 1.
The problem is that approx_count_distinct only gives an approximation, and I am not sure how confident we can be that it will always return an accurate count.
Some extra information:
The Dataframe may contain over 500M records
The Dataframe is previously repartitioned based on record column
For each different value of record, no more than 15 rows will be there
Is it safe to rely on approx_count_distinct for 100% accuracy in this scenario, or are there better window functions in Spark to achieve this?
You can use collect_set of unique_fields over the window wind_person and take its size, which is equivalent to the distinct count of that field:
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
.withColumn("anomaly", size(collect_set($"unique_fields").over(wind_person)))
.show
//+----------+---------+------+-------------+-------+
//|first_name|last_name|record|unique_fields|anomaly|
//+----------+---------+------+-------------+-------+
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Roger |Rabit |ABC123|RogerRabit |2 |
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//+----------+---------+------+-------------+-------+
You can get the exact countDistinct over a Window using some dense_rank operations:
val df2 = df1.withColumn(
"unique_fields",
concat($"first_name",$"last_name")
).withColumn(
"anomaly",
dense_rank().over(Window.partitionBy("record").orderBy("unique_fields")) +
dense_rank().over(Window.partitionBy("record").orderBy(desc("unique_fields")))
- 1
)
df2.show
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
| Roger| Rabit|ABC123| RogerRabit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
+----------+---------+------+-------------+-------+
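One caveat with concatenating the two fields: concat returns NULL if either input is NULL, and different pairs can produce the same string (for example "ab" + "c" and "a" + "bc"). A possible variation, sketched here under the assumption of a local SparkSession, collects a struct of the two columns instead, so the fields stay separate. The extra Q1 rows are made-up data added only to show the collision case.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, size, struct}

val spark = SparkSession.builder().master("local[*]").appName("exact-distinct-over-window").getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("Roger", "Rabbit", "ABC123"), ("Roger", "Rabit", "ABC123"), ("Roger", "Rabbit", "ABC123"),
  ("Trevor", "Philips", "XYZ987"), ("Trevor", "Philips", "XYZ987"),
  // made-up rows: both pairs concatenate to "abc" but are different pairs
  ("ab", "c", "Q1"), ("a", "bc", "Q1")
).toDF("first_name", "last_name", "record")

val byRecord = Window.partitionBy("record")

// Exact number of distinct (first_name, last_name) pairs per record
val flagged = df1.withColumn(
  "anomaly",
  size(collect_set(struct($"first_name", $"last_name")).over(byRecord))
)
flagged.show(false)
```

With at most 15 rows per record value, the collected sets stay small, so the memory cost per window partition should be modest.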

delete records from dataframe where any of the column is null or empty

Is there any method where we can delete the records from a dataframe where any of the column values is null or empty?
+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type |city |state|population|
+---+-------+--------+-------------------+-----+----------+
|1 |704 |STANDARD| |PR |30100 |
|2 |704 | |PASEO COSTA DEL SUR|PR | |
|3 |76166 |UNIQUE |CINGULAR WIRELESS |TX |84000 |
+---+-------+--------+-------------------+-----+----------+
I want output to be:
+---+-------+------+-----------------+-----+----------+
|id |zipcode|type |city |state|population|
+---+-------+------+-----------------+-----+----------+
|3 |76166 |UNIQUE|CINGULAR WIRELESS|TX |84000 |
+---+-------+------+-----------------+-----+----------+
Try this:
df
  .na.replace(df.columns, Map("" -> null)) // replace empty strings with null
  .na.drop() // drop rows containing nulls or NaNs
  .show()
Try this:
df_name.na.drop()
.show(false)
Hope it helps...
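An alternative that avoids rewriting the values is to build one filter condition across all columns. The sketch below is an assumption-laden illustration rather than a drop-in answer: it treats whitespace-only strings as empty and casts every column to string for the blank check, which may or may not match what "empty" means for your data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

val spark = SparkSession.builder().master("local[*]").appName("drop-null-or-empty").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "704", "STANDARD", "", "PR", "30100"),
  (2, "704", "", "PASEO COSTA DEL SUR", "PR", ""),
  (3, "76166", "UNIQUE", "CINGULAR WIRELESS", "TX", "84000")
).toDF("id", "zipcode", "type", "city", "state", "population")

// A row is kept only if every column is non-null and non-blank after trimming
val allPresent = df.columns
  .map(c => col(c).isNotNull && (trim(col(c).cast("string")) =!= ""))
  .reduce(_ && _)

val cleaned = df.filter(allPresent)
cleaned.show(false)
```

The condition is derived from df.columns, so it keeps working when columns are added or renamed.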

Convert every value of a dataframe

I need to modify the values of every column of a dataframe so that they are all enclosed in double quotes after mapping, while the dataframe still retains its original structure with the headers.
I tried mapping the values by changing the rows to sequences but it loses its headers in the output dataframe.
With this read in as input dataframe:
|prodid|name |city|
+------+-------+----+
|1 |Harshit|VNS |
|2 |Mohit |BLR |
|2 |Mohit |RAO |
|2 |Mohit |BTR |
|3 |Rohit |BOM |
|4 |Shobhit|KLK |
I tried the following code.
val columns = df.columns
df.map{ row =>
row.toSeq.map{col => "\""+col+"\"" }
}.toDF(columns:_*)
But it throws an error stating that there is only 1 header, i.e. value, in the mapped dataframe.
This is the actual result (if I remove ".toDF(columns:_*)"):
| value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
|prodid|name |city |
+------+---------+------+
|"1" |"Harshit"|"VNS" |
|"2" |"Mohit" |"BLR" |
|"2" |"Mohit" |"RAO" |
|"2" |"Mohit" |"BTR" |
|"3" |"Rohit" |"BOM" |
|"4" |"Shobhit"|"KLK" |
Note: There are only 3 headers in this example, but my original data has many more, so manually typing each one is not an option in case the file header changes. How do I get this modified dataframe?
Edit: What if I need quotes on all values except the integers? The output would then be something like:
|prodid|name |city |
+------+---------+------+
|1 |"Harshit"|"VNS" |
|2 |"Mohit" |"BLR" |
|2 |"Mohit" |"RAO" |
|2 |"Mohit" |"BTR" |
|3 |"Rohit" |"BOM" |
|4 |"Shobhit"|"KLK" |
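Might be easier to use select instead:
import org.apache.spark.sql.functions.{col, format_string}
import org.apache.spark.sql.types.{IntegerType, StructField}
import spark.implicits._ // for toDF; assumes a SparkSession named spark

val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")

df.select(df.schema.fields.map {
  // leave integer columns untouched
  case StructField(name, IntegerType, _, _) => col(name)
  // wrap every other value in double quotes, keeping the column name
  case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
}: _*).show()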
Output:
+------+---------+-----+
|prodid| name| city|
+------+---------+-----+
| 1|"Harshit"|"VNS"|
| 2| "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle these too, or alternatively quote only StringType columns.
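The "quote only StringType" variant mentioned above could look roughly like the following. It is a sketch assuming a local SparkSession; it uses concat with literal quote characters rather than format_string, which is just one of several ways to build the quoted value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().master("local[*]").appName("quote-string-columns").getOrCreate()
import spark.implicits._

val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")

// Quote string columns only; every other type passes through unchanged
val quoted = df.select(df.schema.fields.map { field =>
  if (field.dataType == StringType)
    concat(lit("\""), col(field.name), lit("\"")).as(field.name)
  else
    col(field.name)
}: _*)
quoted.show()
```

Because the column list comes from df.schema, the headers are preserved and nothing has to be typed by hand when the file layout changes.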

PySpark sql CASE fails

I've encountered strange behaviour when working with the PySpark sqlContext. The problem is best illustrated by the code below.
I am checking the value of COLUMN in a simple CASE statement. However, WHEN is not triggered even though the condition evaluates to TRUE, and the statement always jumps to ELSE. Am I doing something wrong with the syntax here?
dataTest = spark.sql("""SELECT
COLUMN > 1,
CASE COLUMN
WHEN COLUMN > 1 THEN 1
ELSE COLUMN
END AS COLUMN_2,
COLUMN
FROM TABLE
""")
dataTest.sort(col("COLUMN").desc()).show(5, False)
+---------------+-------------+---------+
|COLUMN >1 |COLUMN_2 |COLUMN |
+---------------+-------------+---------+
|true |14 |14 |
|true |5 |5 |
|true |4 |4 |
|true |3 |3 |
|true |2 |2 |
+---------------+-------------+---------+
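Your CASE syntax is off. CASE COLUMN WHEN ... is the simple form, which compares COLUMN for equality against each WHEN value; here that value is the boolean result of COLUMN > 1, which is not the same as testing the condition itself. Use the searched form instead: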
SELECT
COLUMN > 1,
CASE WHEN COLUMN > 1 THEN 1
ELSE COLUMN
END AS COLUMN_2,
COLUMN
FROM TABLE
Notice there's no COLUMN between CASE and WHEN keywords.
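The same searched CASE can also be expressed through the DataFrame API with when and otherwise. The sketch below is written in Scala with made-up sample values and a hypothetical view name t; pyspark.sql.functions offers when and otherwise as well, so the PySpark version is analogous.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().master("local[*]").appName("searched-case").getOrCreate()
import spark.implicits._

// Made-up sample values standing in for COLUMN
val data = Seq(14, 5, 1, 0).toDF("c")

// DataFrame API equivalent of: CASE WHEN c > 1 THEN 1 ELSE c END
val viaApi = data.withColumn("c_2", when($"c" > 1, 1).otherwise($"c"))

// The same logic as SQL, using the searched CASE form
data.createOrReplaceTempView("t")
val viaSql = spark.sql("SELECT c, CASE WHEN c > 1 THEN 1 ELSE c END AS c_2 FROM t")

viaApi.show()
viaSql.show()
```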

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

I am trying to populate the grouper column as shown below. In the table, X signifies the start of a new record, so each X, Y, Z sequence needs to be grouped together. In MySQL, I would accomplish this with:
select @x:=1;
update table set grouper=if(column_1='X',@x:=@x+1,@x);
I am trying to see if there is a way to do this without a loop, using .withColumn or something similar.
What I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
Example DF
A simple window function with the built-in row_number() function should give you the desired output:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._ // for toDF; assumes a SparkSession named spark

val df = Seq(
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z"),
  Tuple1("X"),
  Tuple1("Y"),
  Tuple1("Z")
).toDF("column_1")

// number the occurrences of each distinct value: 1st X, 2nd X, 1st Y, ...
val windowSpec = Window.partitionBy("column_1").orderBy("column_1")
df.withColumn("grouper", row_number().over(windowSpec)).orderBy("grouper", "column_1").show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the final orderBy is only there to match the expected output for display. On a real cluster, sorting the whole result like that is usually unnecessary.
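The row_number approach relies on every group containing each value exactly once. If groups can have different lengths, a closer analogue of the MySQL counter is a running sum that increases by one at every X. The sketch below assumes the rows carry an explicit ordering column (called pos here, a made-up name), since a DataFrame has no inherent row order. Note that a window without partitionBy moves all rows into a single partition, so this may not suit very large data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

val spark = SparkSession.builder().master("local[*]").appName("running-grouper").getOrCreate()
import spark.implicits._

// Sample data with uneven groups; pos is a hypothetical column holding the row order
val df = Seq("X", "Y", "Z", "X", "Y", "X", "Y", "Z").zipWithIndex.toDF("column_1", "pos")

// All rows from the start up to and including the current one, in pos order
val running = Window.orderBy("pos").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// The counter goes up by 1 at every X, so each X starts a new group
val grouped = df.withColumn(
  "grouper",
  sum(when($"column_1" === "X", 1).otherwise(0)).over(running)
)
grouped.orderBy("pos").show(false)
```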