Deequ satisfies function not behaving as expected - pyspark

I am using pydeequ to run some checks on data; however, it is not behaving as expected. One of my columns should only contain values between 0 and 1. The data looks like this:
| col1      |
|-----------|
| 0.5635412 |
| 0.123     |
| 1.0       |
check = Check(spark, CheckLevel.Warning, "DQ Check")
result = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(check
        .satisfies("col1 BETWEEN 0 AND 1", "range check", lambda x: x == 1)) \
    .run()
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)
The result returns a failure with the message:
Value: 0.5635412 does not meet the constraint requirement!
Can anyone advise on where I have gone wrong?

I realised there were a couple of null values in the data that I hadn't expected. A NULL makes the predicate col1 BETWEEN 0 AND 1 evaluate to NULL rather than true, so those rows count as failing the constraint.
I updated the code to:
check = Check(spark, CheckLevel.Warning, "DQ Check")
result = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(check
        .satisfies("col1 BETWEEN 0 AND 1 OR col1 IS NULL", "range check", lambda x: x == 1)) \
    .run()
result_df = VerificationResult.checkResultsAsDataFrame(spark, result)

how to create range column based on a column value?

I have sample data in a table which contains distance_travelled_in_meter, where the values are of Integer type, as follows:
distance_travelled_in_meter |
--------------------------- |
500                         |
1221                        |
990                         |
575                         |
I want to create a range column based on the value of the column distance_travelled_in_meter, bucketing the values into 500-meter intervals.
The result dataset is as follows:
distance_travelled_in_meter | range
--------------------------- | ---------
500                         | 1-500
1221                        | 1000-1500
990                         | 500-1000
575                         | 500-1000
For the value 500, the range is 1-500 since it is within 500 meters; 1221 falls in 1000-1500, and so on.
I tried using spark.sql.functions.sequence, but it takes start and stop column values and builds an array of every value from start to stop, which is not the bucketed range I described above.
I'm using Spark 2.4.2 with Scala 2.11.12.
Any help is much appreciated.
You can chain multiple when expressions that you generate dynamically using something like this:
import org.apache.spark.sql.functions.{col, lit, when}

val maxDistance = 1221 // you can get this from the dataframe

// One (lowerBound, upperBound) pair per 500-meter bucket, e.g. (0, 500), (500, 1000), (1000, 1500).
val ranges = (0 until maxDistance by 500).map(x => (x, x + 500))

// Fold the buckets into a nested when/otherwise expression. Note that between() is inclusive
// on both ends, so a boundary value such as 500 matches the higher bucket here.
val rangeExpr = ranges.foldLeft(lit(null)) {
  case (acc, (lowerBound, upperBound)) =>
    when(
      col("distance_travelled_in_meter").between(lowerBound, upperBound),
      lit(s"$lowerBound-$upperBound")
    ).otherwise(acc)
}

val df1 = df.withColumn("range", rangeExpr)
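As a side note (not part of the answer above), for fixed 500-meter buckets the label can also be derived arithmetically instead of chaining when expressions. This is only a sketch under that assumption, reusing the distance_travelled_in_meter column from the question; it labels the first bucket 0-500 rather than 1-500, and boundary values land in the lower bucket.

import org.apache.spark.sql.functions.{col, concat, floor, lit}

// Sketch only: compute the lower bound of each 500-meter bucket, e.g. 575 -> 500, 1221 -> 1000,
// then build the "lower-upper" label from it.
val lower = floor((col("distance_travelled_in_meter") - 1) / 500) * 500
val df2 = df.withColumn("range", concat(lower.cast("string"), lit("-"), (lower + 500).cast("string")))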

Scala - Spark: Get the column names of the columns that contain null values

Here's the situation: I've got a DataFrame, and I want to get the names of the columns that contain one or more null values.
So far, what I've done:
val columnsContainsNull = df.select(mandatoryColumns: _*)
  .columns
  .map(column => if (df.select(column).filter(col(column).isNull).count > 0) column)
When I execute that code, it becomes incredibly slow for a reason I don't know.
Do you have any idea how I can make it work, and how I can optimize it?
In your code, you are executing one action for every column, which slows the execution down, even more so with wider data.
You can calculate it using the following method:
Register your DF as a table using createOrReplaceTempView
df.createOrReplaceTempView("TEMP_TABLE")
Then execute the following SQL statement on it.
SELECT
  SUM(case when col1 is null then 1 else 0 end) as col1_isnull,
  SUM(case when col2 is null then 1 else 0 end) as col2_isnull,
  ....
  ....
  SUM(case when coln is null then 1 else 0 end) as coln_isnull
FROM
  TEMP_TABLE
If you have a lot of columns, you can also generate the statement programmatically using:
val query = df.columns.map { c =>
  s"sum(case when $c is null then 1 else 0 end) as ${c}"
}.mkString("SELECT ", ",", " FROM TEMP_TABLE")
Then execute the query:
val nullCounts = spark.sql(query)
You should have a dataframe that looks like:
+----+----+----+----+
|col1|col2|....|colN|
+----+----+----+----+
| 0| 1| | 1|
+----+----+----+----+
Then, you can extract the column names that have null values using the following:
val paired = nullCounts.first

nullCounts.columns
  .zipWithIndex
  .map { case (k, v) => (k, paired.getLong(v)) }
  .filter(_._2 > 0)
  .map(_._1)
You should get an Array[String] containing the names of the columns that have NULL values in them:
// Array[String] = Array(col2)
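For completeness, here is a hedged sketch of the same idea using only the DataFrame API, so no temp view or generated SQL is needed; the name columnsWithNull is mine, not from the answer.

import org.apache.spark.sql.functions.{col, sum, when}

// Sketch: one aggregation job that counts the nulls of every column at once.
val nullCountRow = df
  .select(df.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*)
  .first()

// Keep only the columns whose null count is greater than zero.
val columnsWithNull = df.columns.filter(c => nullCountRow.getAs[Long](c) > 0)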

scala fast range lookup on 2 columns

I have a Spark dataframe that I am broadcasting as an Array[Array[String]].
My requirement is to do a range lookup on 2 columns.
Right now I have something like ->
val cols = data.filter(_(0).toLong <= ip).filter(_(1).toLong >= ip).take(1) match {
  case Array(t) => t
  case _ => Array()
}
The following data is stored as an Array[Array[String]] (except for the header row, which is shown below only for reference) and passed to the filter function shown above.
sample data file ->
startIPInt | endIPInt | lat | lon
676211200 | 676211455 | 37.33053 | -121.83823
16777216 | 16777342 | -34.9210644736842 | 138.598709868421
17081712 | 17081712 | 0 | 0
sample value to search ->
ip = 676211325
Based on which startIPInt-endIPInt range the IP falls into, I want the rest of that mapping row.
This lookup takes 1-2 seconds each time, and I am not even sure the second filter condition is getting executed (in debug mode it only ever seems to execute the first condition). Can someone suggest a faster and more reliable lookup here?
Thanks!
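There is no answer in this thread, but as a sketch of one possible direction (my own assumption, not from the question): if the IP ranges do not overlap, you can sort the broadcast array by startIPInt once and binary-search it for each lookup instead of scanning the whole array twice.

// Sketch only: assumes non-overlapping ranges. `data` is the broadcast Array[Array[String]]
// from the question, with startIPInt at index 0 and endIPInt at index 1.
val sorted: Array[Array[String]] = data.sortBy(_(0).toLong)
val starts: Array[Long] = sorted.map(_(0).toLong)

def lookup(ip: Long): Option[Array[String]] = {
  val i = java.util.Arrays.binarySearch(starts, ip)
  val idx = if (i >= 0) i else -i - 2 // last range whose startIPInt is at or below ip
  if (idx >= 0 && sorted(idx)(1).toLong >= ip) Some(sorted(idx)) else None
}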

How to sum values column wise for Array of Arrays in Scala?

I am a newbie to Spark and Scala and am trying to solve the below problem, but couldn't. Please help me with this; I appreciate your help.
The requirement is to sum the values column-wise.
The below code generates:
val first = vlist.map(_.select("value"))
first.map(_.show())
Output:
first: Array[org.apache.spark.sql.DataFrame] =
Array([value: array<double>], [value: array<double>])
+--------------------+
| value|
+--------------------+
|[-0.047363, 0.187...|
|[0.043701, -0.114...|
|[-0.006439, 0.031...|
|[0.168945, 0.0639...|
|[0.049805, 0.0664...|
|[-0.054932, -0.11...|
|[0.094727, -0.118...|
|[0.136719, 0.1484...|
|[-0.12793, 0.2812...|
|[-0.071289, -0.07...|
|[0.115234, -0.012...|
|[0.253906, 0.0385...|
|[-0.062988, 0.031...|
|[0.110352, 0.2480...|
|[0.042725, 0.2949...|
|[-0.074219, 0.112...|
|[0.072754, -0.092...|
|[-0.063965, 0.058...|
|[0.083496, -0.007...|
|[0.043945, 0.1767...|
+--------------------+
only showing top 20 rows
+--------------------+
| value|
+--------------------+
|[0.045654, -0.145...|
|[0.053467, 0.0120...|
|[0.033203, -0.089...|
|[-0.08252, 0.0224...|
|[0.182617, -0.044...|
|[0.136719, 0.1484...|
|[0.112793, -0.130...|
|[0.096191, -0.028...|
|[-0.007141, 0.004...|
|[0.115234, -0.012...|
|[0.130859, 0.0084...|
|[-0.020874, 0.021...|
|[-0.267578, 0.084...|
|[-0.015015, 0.193...|
|[0.036865, 0.0201...|
|[0.205078, 0.0042...|
|[-0.013733, -0.07...|
|[0.175781, 0.2128...|
|[-0.061279, -0.06...|
|[0.058838, 0.3574...|
+--------------------+
The next step should be to sum all the values column-wise, so I should ideally end up with one row.
I tried the below code:
first.toList.transpose.map(_.sum)
Output:
<console>:183: error: No implicit view available from
org.apache.spark.sql.DataFrame => scala.collection.GenTraversableOnce[B].
first.toList.transpose.map(_.sum)
Also, I tried splitting the values into separate columns (I took only 4 columns for testing purposes) and applying an agg function, which did not work either:
var table = first
for (i <- 0 to 3) {
  table = table.map(_.withColumn("vec_" + i, $"value"(i)))
}
var inter = table.map(_.drop("value"))
inter.map(_.show())
var exprs = inter.map(_.columns.map(_ -> "sum").toMap)
inter.agg(exprs)
Output:
table: Array[org.apache.spark.sql.DataFrame] =
Array([value: array<double>], [value: array<double>])
inter: Array[org.apache.spark.sql.DataFrame] =
Array([vec_0: double,
vec_1: double ... 2 more fields],
[vec_0: double,
vec_1: double ... 2 more fields])
+---------+---------+---------+---------+
| vec_0| vec_1| vec_2| vec_3|
+---------+---------+---------+---------+
|-0.047363| 0.1875| 0.002258| 0.173828|
| 0.043701|-0.114258| 0.067383|-0.060547|
|-0.006439| 0.031982| 0.012878| 0.020264|
| 0.168945| 0.063965|-0.084473| 0.173828|
| 0.049805| 0.066406| 0.03833| 0.02356|
|-0.054932|-0.117188| 0.027832| 0.074707|
| 0.094727|-0.118652| 0.118164| 0.253906|
| 0.136719| 0.148438| 0.114746| 0.069824|
| -0.12793| 0.28125| 0.01532|-0.046631|
|-0.071289| -0.07373| 0.199219|-0.069824|
| 0.115234|-0.012512|-0.022949| 0.194336|
| 0.253906| 0.038574|-0.030396| 0.248047|
|-0.062988| 0.031494|-0.302734| 0.030396|
| 0.110352| 0.248047| -0.00769|-0.031494|
| 0.042725| 0.294922| 0.019653| 0.030884|
|-0.074219| 0.112793| 0.094727| 0.071777|
| 0.072754|-0.092773|-0.174805|-0.022583|
|-0.063965| 0.058838| 0.086914| 0.320312|
| 0.083496|-0.007294|-0.026489| -0.05957|
| 0.043945| 0.176758| 0.094727|-0.083496|
+---------+---------+---------+---------+
only showing top 20 rows
+---------+---------+---------+---------+
| vec_0| vec_1| vec_2| vec_3|
+---------+---------+---------+---------+
| 0.045654|-0.145508| 0.15625| 0.166016|
| 0.053467| 0.012024| -0.0065| 0.008545|
| 0.033203|-0.089844|-0.294922| 0.115234|
| -0.08252| 0.022461|-0.149414| 0.099121|
| 0.182617|-0.044922| 0.138672| 0.011658|
| 0.136719| 0.148438| 0.114746| 0.069824|
| 0.112793|-0.130859| 0.066895| 0.138672|
| 0.096191|-0.028687|-0.108398| 0.145508|
|-0.007141| 0.004486| 0.02063| 0.010803|
| 0.115234|-0.012512|-0.022949| 0.194336|
| 0.130859| 0.008423| 0.033447|-0.058838|
|-0.020874| 0.021851|-0.083496|-0.072266|
|-0.267578| 0.084961| 0.109863| 0.086914|
|-0.015015| 0.193359| 0.014832| 0.07373|
| 0.036865| 0.020142| 0.22168| 0.155273|
| 0.205078| 0.004211| 0.084473| 0.091309|
|-0.013733|-0.074219| 0.017334|-0.016968|
| 0.175781| 0.212891|-0.071289| 0.084961|
|-0.061279|-0.068359| 0.120117| 0.191406|
| 0.058838| 0.357422| 0.128906|-0.162109|
+---------+---------+---------+---------+
only showing top 20 rows
res4164: Array[Unit] = Array((), ())
exprs: Array[scala.collection.immutable.Map[String,String]] = Array(Map(vec_0 -> sum, vec_1 -> sum, vec_2 -> sum, vec_3 -> sum), Map(vec_0 -> sum, vec_1 -> sum, vec_2 -> sum, vec_3 -> sum))
<console>:189: error: value agg is not a member of Array[org.apache.spark.sql.DataFrame]
inter.agg(exprs)
^
Please help me with this.
I am sure there should be an easy way to do this. Thanks in advance.
Adding sample input and output.
Sample input:
first: Array[org.apache.spark.sql.DataFrame] =
Array([value: array<double>], [value: array<double>])
value
1,2,3,4,5,6,7,8
1,2,3,4,5,6,7,8
value
1,2,3,4,5,6,7,8
1,2,3,4,5,6,7,8
Sample output:
first: Array[org.apache.spark.sql.DataFrame] =
Array([value: array<double>], [value: array<double>])
value
2,4,6,8,10,12,14,16
value
2,4,6,8,10,12,14,16
The below code worked. Thanks to the people who spent time on this; I appreciate it.
val first = vlist.map(_.select("value"))
first.map(_.show())

var table = first
for (i <- 0 to 3) {
  table = table.map(_.withColumn("vec_" + i, $"value"(i)))
}
var inter = table.map(_.drop("value"))
inter.map(_.show())

//var exprs = inter.map(_.columns.map(_ -> "sum").toMap)
//inter.agg(exprs)

val tab = inter.map(_.groupBy().sum())
tab.map(_.show())
first: Array[org.apache.spark.sql.DataFrame] = Array([value: array<double>], [value: array<double>])
table: Array[org.apache.spark.sql.DataFrame] = Array([value: array<double>], [value: array<double>])
inter: Array[org.apache.spark.sql.DataFrame] = Array([vec_0: double, vec_1: double ... 2 more fields], [vec_0: double, vec_1: double ... 2 more fields])
tab: Array[org.apache.spark.sql.DataFrame] = Array([sum(vec_0): double, sum(vec_1): double ... 2 more fields], [sum(vec_0): double, sum(vec_1): double ... 2 more fields])
+------------------+------------------+------------------+------------------+
| sum(vec_0)| sum(vec_1)| sum(vec_2)| sum(vec_3)|
+------------------+------------------+------------------+------------------+
|2.5046410000000003|2.1487149999999997|1.0884870000000002|3.5877090000000003|
+------------------+------------------+------------------+------------------+
+------------------+------------------+----------+------------------+
| sum(vec_0)| sum(vec_1)|sum(vec_2)| sum(vec_3)|
+------------------+------------------+----------+------------------+
|0.9558040000000001|0.9843780000000002| 0.545025|0.9979860000000002|
+------------------+------------------+----------+------------------+
res325: Array[Unit] = Array((), ())
Well, that's great if you have solved your problem. After converting the "value" column into a dataframe of separate columns as mentioned above, just do the following.
val finalDf = df.groupBy().sum()
finalDf is the dataframe containing the sum of the values column-wise.
You could also try the aggregation methods; there is a function named sum which does the aggregation column-wise.
df.agg(sum("col1"), sum("col2"), ...)
Hope this helps.
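If there are many vec_ columns, here is a sketch of building those sum() expressions programmatically rather than listing them by hand (it assumes all remaining columns are numeric):

import org.apache.spark.sql.functions.{col, sum}

// Sketch: generate one sum() aggregation per column of the dataframe.
val sumExprs = df.columns.map(c => sum(col(c)).as(s"sum($c)"))
val totals = df.agg(sumExprs.head, sumExprs.tail: _*)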
If I understand the question correctly, you can use posexplode to explode the array together with its index, and then group and sum by that index.
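A sketch of that posexplode idea, not spelled out in the answer above; it returns one row per array position rather than a single array row.

import org.apache.spark.sql.functions.{col, posexplode, sum}

// Sketch: explode each array element together with its position, then sum per position.
val columnSums = df
  .select(posexplode(col("value")).as(Seq("pos", "element")))
  .groupBy("pos")
  .agg(sum("element").as("col_sum"))
  .orderBy("pos")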

Spark columnar performance

I'm a relative beginner with Spark. I have a wide dataframe (1000 columns) to which I want to add columns based on whether a corresponding column has missing values,
so
+----+
|   A|
+----+
|   1|
|null|
|   3|
+----+
becomes
+----+-----+
|   A|A_MIS|
+----+-----+
|   1|    0|
|null|    1|
|   3|    0|
+----+-----+
This is part of a custom ml transformer but the algorithm should be clear.
override def transform(dataset: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
  var ds = dataset
  dataset.columns.foreach(c => {
    if (dataset.filter(col(c).isNull).count() > 0) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}
Loop over the columns; if a column has more than 0 nulls, create a new column.
The dataset passed in is cached (using the .cache method) and the relevant config settings are the defaults.
This is running on a single laptop for now, and takes on the order of 40 minutes for the 1000 columns, even with a minimal number of rows.
I thought the problem was due to hitting a database, so I tried with a parquet file instead, with the same result. Looking at the jobs UI, it appears to be doing file scans in order to do the count.
Is there a way I can improve this algorithm to get better performance, or tune the caching in some way? Increasing spark.sql.inMemoryColumnarStorage.batchSize just got me an OOM error.
Remove the condition:
if (dataset.filter(col(c).isNull).count() > 0)
and leave only the internal expression. As written, Spark requires #columns data scans.
If you want to prune columns, compute the statistics once, as outlined in Count number of non-NaN entries in each column of Spark dataframe with Pyspark, and use a single drop call.
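A minimal sketch of the first suggestion (my own rendering of it), which adds the flag for every column and so never scans the data just to decide whether a flag is needed:

import org.apache.spark.sql.functions.{col, when}

// Sketch: add the *_MIS flag unconditionally; no per-column count() actions.
val flagged = dataset.columns.foldLeft(dataset.toDF()) { (df, c) =>
  df.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
}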
Here's the code that fixes the problem.
override def transform(dataset: Dataset[_]): DataFrame = {
  var ds = dataset
  val rowCount = dataset.count()
  // count(c) counts only non-null values, so a column gets a _MIS flag only when it
  // has at least one null and at least one non-null value.
  val exprs = dataset.columns.map(count(_))
  val colCounts = dataset.agg(exprs.head, exprs.tail: _*).toDF(dataset.columns: _*).first()
  dataset.columns.foreach(c => {
    if (colCounts.getAs[Long](c) > 0 && colCounts.getAs[Long](c) < rowCount) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}