Escaping Characters in Spark.SQL() with Wildcard Inside of Query - pyspark

Sample Query:
df = spark.sql("""
select distinct key,
coalesce(get_json_object(col2, '$.value'),
         case when col2 like '%value\\u0022: false%' then 'false'
              when col2 like '%value\\u0022: true%' then 'true'
         end) as col2flag
from Table
""")
In Impala, the payload contains the literal characters \u0022 around the value that is needed. In Impala SQL this unicode escape is handled by adding an extra backslash.
When this DataFrame is pulled via PySpark, the values produced by the case statement are null when they are expected to be true. I have tried the above query with both one backslash and two.

Try setting spark.sql.parser.escapedStringLiterals to true, then use four backslashes to match a literal \ in the strings. See this example:
spark.sql("select * from df1").show(truncate=False)
+---+---------------------+
|id |str |
+---+---------------------+
|1 |value\x2 : true |
|2 |test value\u2 : false|
|3 |val: true |
+---+---------------------+
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
spark.sql(" select * from df1 where str like '%value\\\\u%' ").show(truncate=False)
+---+---------------------+
|id |str |
+---+---------------------+
|2 |test value\u2 : false|
+---+---------------------+
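Applied to the original query, that would look roughly like this (a sketch, assuming the same table and column names and that the payload stores the literal characters \u0022):
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")

# Four backslashes in the Python source become two backslashes in the SQL text,
# which LIKE then treats as a single literal backslash before "u0022".
df = spark.sql("""
select distinct key,
       coalesce(get_json_object(col2, '$.value'),
                case when col2 like '%value\\\\u0022: false%' then 'false'
                     when col2 like '%value\\\\u0022: true%' then 'true'
                end) as col2flag
from Table
""")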

Related

Counting distinct values for a given column partitioned by a window function, without using approx_count_distinct()

I have the following dataframe:
val df1 = Seq(("Roger","Rabbit", "ABC123"), ("Roger","Rabit", "ABC123"),("Roger","Rabbit", "ABC123"), ("Trevor","Philips","XYZ987"), ("Trevor","Philips","XYZ987")).toDF("first_name", "last_name", "record")
+----------+---------+------+
|first_name|last_name|record|
+----------+---------+------+
|Roger |Rabbit |ABC123|
|Roger |Rabit |ABC123|
|Roger |Rabbit |ABC123|
|Trevor |Philips |XYZ987|
|Trevor |Philips |XYZ987|
+----------+---------+------+
I want to group the records in this dataframe by the record column, and then look for anomalies in the first_name and last_name fields, which should remain constant for all rows with the same record value.
The best approach I found so far is using approx_count_distinct:
val wind_person = Window.partitionBy("record")
df1.withColumn("unique_fields",cconcat($"first_name",$"last_name"))
.withColumn("anomaly",capprox_count_distinct($"unique_fields") over wind_person)
.show(false)
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabbit |ABC123|RogerRabbit |2 |
|Roger |Rabit |ABC123|RogerRabit |2 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
|Trevor |Philips |XYZ987|TrevorPhilips|1 |
+----------+---------+------+-------------+-------+
An anomaly is detected when the anomaly column is greater than 1.
The problem is that with approx_count_distinct we get just an approximation, and I am not sure how confident we can be that it will always return an accurate count.
Some extra information:
The Dataframe may contain over 500M records
The Dataframe is previously repartitioned based on record column
For each different value of record, no more than 15 rows will be there
Is it safe to use approx_count_distinct in this scenario with 100% accuracy, or are there better window functions in Spark to achieve this?
You can collect_set the unique_fields column over the window wind_person and take its size, which is equivalent to the distinct count of that field:
df1.withColumn("unique_fields", concat($"first_name", $"last_name"))
.withColumn("anomaly", size(collect_set($"unique_fields").over(wind_person)))
.show
//+----------+---------+------+-------------+-------+
//|first_name|last_name|record|unique_fields|anomaly|
//+----------+---------+------+-------------+-------+
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Roger |Rabit |ABC123|RogerRabit |2 |
//|Roger |Rabbit |ABC123|RogerRabbit |2 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//|Trevor |Philips |XYZ987|TrevorPhilips|1 |
//+----------+---------+------+-------------+-------+
You can get an exact countDistinct over a window by combining two dense_rank operations: for any row, its dense_rank ordered ascending plus its dense_rank ordered descending, minus 1, equals the number of distinct values in the partition:
val df2 = df1.withColumn(
"unique_fields",
concat($"first_name",$"last_name")
).withColumn(
"anomaly",
dense_rank().over(Window.partitionBy("record").orderBy("unique_fields")) +
dense_rank().over(Window.partitionBy("record").orderBy(desc("unique_fields")))
- 1
)
df2.show
+----------+---------+------+-------------+-------+
|first_name|last_name|record|unique_fields|anomaly|
+----------+---------+------+-------------+-------+
| Roger| Rabit|ABC123| RogerRabit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Roger| Rabbit|ABC123| RogerRabbit| 2|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
| Trevor| Philips|XYZ987|TrevorPhilips| 1|
+----------+---------+------+-------------+-------+

delete records from dataframe where any of the column is null or empty

Is there any method where we can delete the records from a dataframe where any of the column values is null or empty?
+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type |city |state|population|
+---+-------+--------+-------------------+-----+----------+
|1 |704 |STANDARD| |PR |30100 |
|2 |704 | |PASEO COSTA DEL SUR|PR | |
|3 |76166 |UNIQUE |CINGULAR WIRELESS |TX |84000 |
+---+-------+--------+-------------------+-----+----------+
I want output to be:
+---+-------+------+-----------------+-----+----------+
|id |zipcode|type |city |state|population|
+---+-------+------+-----------------+-----+----------+
|3  |76166  |UNIQUE|CINGULAR WIRELESS|TX   |84000     |
+---+-------+------+-----------------+-----+----------+
Try this:
df
.na.replace(df.columns,Map("" -> null)) // convert empty strings with null
.na.drop() // drop nulls and NaNs
.show()
Try this:
df_name.na.drop() // drops rows containing null or NaN values (empty strings must be converted to null first, as in the answer above)
.show(false)
Hope it helps...
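For a PySpark dataframe, a minimal sketch of the same idea (names are illustrative; "empty" here means the empty string):
from pyspark.sql import functions as F

# Turn empty strings into nulls column by column, then drop any row that still contains a null.
cleaned = df.select([
    F.when(F.col(c) == "", None).otherwise(F.col(c)).alias(c)
    for c in df.columns
]).na.drop()
cleaned.show(truncate=False)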

Spark Dataframes: Add Conditional column to dataframe

I want to add a conditional column Flag to dataframe A. Flag should be 1 when the following two conditions are satisfied, otherwise 0:
num from dataframe A is between numStart and numEnd from dataframe B.
If the above condition is satisfied, check whether include is 1.
DataFrame A (it's a very big dataframe, containing millions of rows):
+----+------+-----+------------------------+
|num |food |price|timestamp |
+----+------+-----+------------------------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|
+----+------+-----+------------------------+
DataFrame B (it's a very small DF, containing only 100 rows):
+----------+-----------+-------+
|numStart |numEnd |include|
+----------+-----------+-------+
|0 |200 |1 |
|250 |1050 |0 |
|2000 |3000 |1 |
|10001 |15001 |1 |
+----------+-----------+-------+
Expected output:
+----+------+-----+------------------------+----------+
|num |food |price|timestamp |Flag |
+----+------+-----+------------------------+----------+
|1275|tomato|1.99 |2018-07-21T00:00:00.683Z|0 |
|145 |carrot|0.45 |2018-07-21T00:00:03.346Z|1 |
|2678|apple |0.99 |2018-07-21T01:00:05.731Z|1 |
|6578|banana|1.29 |2018-07-20T01:11:59.957Z|0 |
|1001|taco |2.59 |2018-07-21T01:00:07.961Z|0 |
+----+------+-----+------------------------+----------+
You can left-join dfB to dfA based on the condition you described in (i), then build a Flag column using withColumn and the coalesce function to "default" to 0:
Records for which a match was found would use the include value of the matching dfB record
Records for which there was no match would have include=null, and per your requirement such records should get Flag=0, so we use coalesce, which falls back to the literal lit(0) when include is null
Lastly, get rid of the dfB columns which are of no interest to you:
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming "spark" is your SparkSession
dfA.join(dfB, $"num".between($"numStart", $"numEnd"), "left")
.withColumn("Flag", coalesce($"include", lit(0)))
.drop(dfB.columns: _*)
.show()
// +----+------+-----+--------------------+----+
// | num| food|price| timestamp|Flag|
// +----+------+-----+--------------------+----+
// |1275|tomato| 1.99|2018-07-21T00:00:...| 0|
// | 145|carrot| 0.45|2018-07-21T00:00:...| 1|
// |2678| apple| 0.99|2018-07-21T01:00:...| 1|
// |6578|banana| 1.29|2018-07-20T01:11:...| 0|
// |1001| taco| 2.59|2018-07-21T01:00:...| 0|
// +----+------+-----+--------------------+----+
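For reference, here is a rough PySpark sketch of the same join-and-coalesce approach (dfA and dfB as above; names are assumed):
from pyspark.sql import functions as F

# Left join on the range condition, default missing include values to 0, then drop dfB's columns.
result = (dfA.join(dfB, dfA["num"].between(dfB["numStart"], dfB["numEnd"]), "left")
             .withColumn("Flag", F.coalesce(F.col("include"), F.lit(0)))
             .drop("numStart", "numEnd", "include"))
result.show()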
Join the two dataframes together on the first condition while keeping all rows in dataframe A (i.e. with a left join, see code below). After the join, the include column can be renamed to Flag, and any null values in it (from rows that had no match) are set to 0. The two extra columns, numStart and numEnd, are dropped.
The code can thus be written as follows:
A.join(B, $"num" >= $"numStart" && $"num" <= $"numEnd", "left")
.withColumnRenamed("include", "Flag")
.drop("numStart", "numEnd")
.na.fill(Map("Flag" -> 0))

Convert every value of a dataframe

I need to modify the values of every column of a dataframe so that they are all enclosed in double quotes after mapping, while the dataframe still retains its original structure with the headers.
I tried mapping the values by converting the rows to sequences, but the output dataframe loses its headers.
With this read in as input dataframe:
|prodid|name |city|
+------+-------+----+
|1 |Harshit|VNS |
|2 |Mohit |BLR |
|2 |Mohit |RAO |
|2 |Mohit |BTR |
|3 |Rohit |BOM |
|4 |Shobhit|KLK |
I tried the following code.
val columns = df.columns
df.map{ row =>
row.toSeq.map{col => "\""+col+"\"" }
}.toDF(columns:_*)
But it throws an error stating that there is only 1 column, i.e. value, in the mapped dataframe.
This is the actual result (if I remove .toDF(columns:_*)):
| value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
|prodid|name |city |
+------+---------+------+
|"1" |"Harshit"|"VNS" |
|"2" |"Mohit" |"BLR" |
|"2" |"Mohit" |"RAO" |
|"2" |"Mohit" |"BTR" |
|"3" |"Rohit" |"BOM" |
|"4" |"Shobhit"|"KLK" |
Note: there are only 3 columns in this example, but my original data has a lot of columns, so manually typing each and every one of them is not an option, especially since the file header may change. How do I get this modified dataframe?
Edit: I need the quotes on all values except the integers, so the output is something like:
|prodid|name |city |
+------+---------+------+
|1 |"Harshit"|"VNS" |
|2 |"Mohit" |"BLR" |
|2 |"Mohit" |"RAO" |
|2 |"Mohit" |"BTR" |
|3 |"Rohit" |"BOM" |
|4 |"Shobhit"|"KLK" |
Might be easier to use select instead:
val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
.toDF("prodid", "name", "city")
df.select(df.schema.fields.map {
case StructField(name, IntegerType, _, _) => col(name)
case StructField(name, _, _, _) => format_string("\"%s\"", col(name)) as name
}:_*).show()
Output:
+------+---------+-----+
|prodid| name| city|
+------+---------+-----+
| 1|"Harshit"|"VNS"|
| 2| "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle those too, or alternatively just quote the StringType columns (see the sketch below).
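As a rough illustration of that last option in PySpark (a sketch, not taken from the original answer), you could quote only the StringType columns and leave everything else untouched:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Quote only the string columns; numeric columns (IntegerType, LongType, DoubleType, ...) pass through unchanged.
quoted = df.select([
    F.format_string('"%s"', F.col(field.name)).alias(field.name)
    if isinstance(field.dataType, StringType) else F.col(field.name)
    for field in df.schema.fields
])
quoted.show()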

Data Type validation in pyspark

We are building a data ingestion framework in pyspark and wondering what the best way is to handle datatype exceptions. Basically, we want to have a reject table capturing all the data that does not conform to the schema.
stringDf = sparkSession.createDataFrame(
[
("11/25/1991","1"),
("11/24/1991", None),
("11/30/1991","a")
],
['dateAsString','intAsString']
)
Here is my stringDf with two columns.
+------------+-----------+
|dateAsString|intAsString|
+------------+-----------+
| 11/25/1991| 1|
| 11/24/1991| null|
| 11/30/1991| a|
+------------+-----------+
I would like to create a new column to the data frame called dataTypeValidationErrors to capture all the errors that might be present in this dataset. What is the best way to achieve this using pyspark?
+------------+-----------+------------------------+
|dateAsString|intAsString|dataTypeValidationErrors|
+------------+-----------+------------------------+
| 11/25/1991| 1|None |
| 11/24/1991| null|None |
| 11/30/1991| a|Not a valid Number |
+------------+-----------+------------------------+
You can just try to cast the column to the desired DataType. If there is a mismatch or error, null will be returned. In these cases you need to check whether the original value was null to begin with; if it wasn't, the cast failed and there was an error.
Use pyspark.sql.functions.when() to test if the casted column is null and the original value was not null.
If this is True, then use the string literal "Not a valid Number" as the column value. Otherwise return the string "None".
For example:
import pyspark.sql.functions as f
stringDf.withColumn(
"dataTypeValidationErrors",
f.when(
f.col("intAsString").cast("int").isNull() & f.col("intAsString").isNotNull(),
f.lit("Not a valid Number")
).otherwise(f.lit("None"))
)\
.show()
#+------------+-----------+------------------------+
#|dateAsString|intAsString|dataTypeValidationErrors|
#+------------+-----------+------------------------+
#| 11/25/1991| 1| None|
#| 11/24/1991| null| None|
#| 11/30/1991| a| Not a valid Number|
#+------------+-----------+------------------------+
You could also expand this to multiple columns:
Suppose you had one more row with an invalid dateAsString value:
stringDf = spark.createDataFrame(
[
("11/25/1991","1"),
("11/24/1991", None),
("11/30/1991","a"),
("13.14.15", "b")
],
['dateAsString','intAsString']
)
Use a dictionary to define the conversion for each column:
conversions = {
'dateAsString':lambda c: f.from_unixtime(f.unix_timestamp(c,"MM/dd/yyyy")).cast("date"),
'intAsString':lambda c: f.col(c).cast('int')
}
stringDf.withColumn(
"dataTypeValidationErrors",
f.concat_ws(", ",
*[
f.when(
v(k).isNull() & f.col(k).isNotNull(),
f.lit(k + " not valid")
).otherwise(f.lit(None))
for k, v in conversions.items()
]
)
)\
.show(truncate=False)
#+------------+-----------+---------------------------------------------+
#|dateAsString|intAsString|dataTypeValidationErrors |
#+------------+-----------+---------------------------------------------+
#|11/25/1991 |1 | |
#|11/24/1991 |null | |
#|11/30/1991 |a |intAsString not valid |
#|13.14.15 |b |dateAsString not valid, intAsString not valid|
#+------------+-----------+---------------------------------------------+
Or if you just want to know if there was an error on a row, without needing to know specifics:
from functools import reduce  # reduce is not a builtin in Python 3

stringDf.withColumn(
"dataTypeValidationErrors",
f.when(
reduce(
lambda a, b: a|b,
(v(k).isNull() & f.col(k).isNotNull() for k, v in conversions.items())
),
f.lit("Validation Error")
).otherwise(f.lit("None"))
)\
.show(truncate=False)
#+------------+-----------+------------------------+
#|dateAsString|intAsString|dataTypeValidationErrors|
#+------------+-----------+------------------------+
#|11/25/1991 |1 |None |
#|11/24/1991 |null |None |
#|11/30/1991 |a |Validation Error |
#|13.14.15 |b |Validation Error |
#+------------+-----------+------------------------+