PySpark Get Count of True Values in Boolean Columns by Row - pyspark

Please let me know if this is a duplicate. I've been searching all over to try to figure out hwo to do this without writing a user defined function. I have a bunch of boolean columns, each a different quality assurance flag, in a PySpark data frame. All I need to do is create a new column with the number of these columns with a True value, the count of QA checks each row is failing. However, I cannot, for the life of me, figure out an efficient way of doing this. Any ideas, references or links are greatly appreciated!
For instance, for one record with the above columns with the following values...
...I want to create a new column with a count of 2.
Have any good ideas?

Two methods come to mind without using user defined functions .
I'm assuming you have a python list with the boolean column names.
qa_tests = ['qa_flg_xy_equal', 'qa_flg_out_of_bounds_x'] and so forth.
plan a - build in local python a column that is the sum of all boolean columns cast as integers and then put it in spark.
from pyspark.sql.functions import col, lit
from functools import reduce
sum_bools = reduce(lambda acc, v: acc + col(v).cast("integer"), qa_tests, lit(0))
sum_bools is just an automatic way of writing lit(0) + col("qa_flg_xy_equal").cast("integer") + col("qa_flg_out_of_bounds_x").cast("integer") + ...
Here is how sum_bools is defnied:
>> Column<'((0 + CAST(qa_flg_xy_equal AS INT)) + CAST(qa_flg_out_of_bounds_x AS INT))'>
rest of code:
df.withColumn("tests_passed", sum_bools).show(truncate=0)
+---+---------------+----------------------+------------+
|id |qa_flg_xy_equal|qa_flg_out_of_bounds_x|tests_passed|
+---+---------------+----------------------+------------+
|1 |true |false |1 |
|2 |false |false |0 |
|3 |true |true |2 |
+---+---------------+----------------------+------------+
plan b - we can use Array columns to collect all booleans into one array value, filter only the true and check the size of the array after the filter.
No need to keep all the verbose steps below, you can write it in one withColumn of course.
from pyspark.sql.functions import array, filter, size
df \
.withColumn("qa_results", array(*qa_tests)) \
.withColumn("passed_results", filter(
"qa_results",
lambda test: test
)) \
.withColumn("passed_results_count", size("passed_results"))
+---+---------------+----------------------+--------------+--------------+--------------------+
| id|qa_flg_xy_equal|qa_flg_out_of_bounds_x| qa_results|passed_results|passed_results_count|
+---+---------------+----------------------+--------------+--------------+--------------------+
| 1| true| false| [true, false]| [true]| 1|
| 2| false| false|[false, false]| []| 0|
| 3| true| true| [true, true]| [true, true]| 2|
+---+---------------+----------------------+--------------+--------------+--------------------+

Related

How to set a dynamic where clause using pyspark

I have a dataset within which there are multiple groups. I have a rank column which incrementally counts counts each entry per group. An example of this structure is shown below:
+-----------+---------+---------+
| equipment| run_id|run_order|
+-----------+---------+---------+
|1 |430032589| 1|
|1 |430332632| 2|
|1 |430563033| 3|
|1 |430785715| 4|
|1 |431368577| 5|
|1 |431672148| 6|
|2 |435497596| 1|
|1 |435522469| 7|
Each group (equipment) has a different amount of runs. Shown above equipment 1 has 7 runs whilst equipment 2 has 1 run. I would like to select the first and last n runs per equipment. To select the first n runs is straightforward:
df.select("equipment", "run_id").distinct().where(df.run_order <= n).orderBy("equipment").show()
The distinct is in the query because each row is equivalent to a timestep and therefore each row will log sensor readings associated with that timestep. Therefore there will be many rows with the same equipment, run_id and run_order, which should be preserved in the end result and not aggregated.
As the number of runs is unique to each equipment I can't do an equivalent select query with a where clause (I think) to get the last n runs:
df.select("equipment", "run_id").distinct().where(df.rank >= total_runs - n).orderBy("equipment").show()
I can run a groupBy to get the highest run_order for each equipment
+-----------+----------------+
| equipment| max(run_order) |
+-----------+----------------+
|1 | 7|
|2 | 1|
But I am unsure if there is a way I can construct a dynamic where clause that works like this. So that I get the last n runs (including all timestep data for each run).
You can add a column of the max rank for each equipment and do a filter based on that column:
from pyspark.sql import functions as F, Window
n = 3
df2 = df.withColumn(
'max_run',
F.max('run_order').over(Window.partitionBy('equipment'))
).where(F.col('run_order') >= F.col('max_run') - n)

How to merge two rows in Spark SQL?

I need to merge rows in the same dataframe based on a key column "id". In the sample data frame, 1 row has data for id,name and age. The other row has id,name, and salary. Rows with same key 'id' have to be merged a single record in the final data frame. If there is just one record, should show them as well with null values [Smith, and Jake] as in example below.
The computation needs to happen on real time data, spark native function based solution would be ideal. I have tried filtering the records based on age and city columns to separate data frames and them perform a left join on ID. But its not very efficient. Looking for any alternate suggestions. Thanks in advance!
Sample Dataframe
val inputDF= Seq(("100","John", Some(35),None)
,("100","John", None,Some("Georgia")),
("101","Mike", Some(25),None),
("101","Mike", None,Some("New York")),
("103","Mary", Some(22),None),
("103","Mary", None,Some("Texas")),
("104","Smith", Some(25),None),
("105","Jake", None,Some("Florida")))
.toDF("id","name","age","city")
Input Dataframe
+---+-----+----+--------+
|id |name |age |city |
+---+-----+----+--------+
|100|John |35 |null |
|100|John |null|Georgia |
|101|Mike |25 |null |
|101|Mike |null|New York|
|103|Mary |22 |null |
|103|Mary |null|Texas |
|104|Smith|25 |null |
|105|Jake |null|Florida |
+---+-----+----+--------+
Expected Output Dataframe
+---+-----+----+---------+
| id| name| age| city|
+---+-----+----+---------+
|100| John| 35| Georgia|
|101| Mike| 25| New York|
|103| Mary| 22| Texas|
|104|Smith| 25| null|
|105| Jake|null| Florida|
+---+-----+----+---------+
Use first or last standard functions with ignoreNulls flag on.
first standard function
val q = inputDF
.groupBy("id", "name")
.agg(first("age", ignoreNulls = true) as "age", first("city", ignoreNulls = true) as "city")
.orderBy("id")
last standard function
val q = inputDF
.groupBy("id","name")
.agg(last("age", true) as "age", last("city") as "city")
.orderBy("id")

Spark - group and aggregate only several smallest items

In short
I have cartesian-product (cross-join) of two dataframes and function which gives some score for given element of this product. I want now to get few "best matched" elements of the second DF for every member of the first DF.
In details
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which for pair of the elements from the first and second dataset gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of two sets and calculate score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want only to retain only few (say 3) of "matches", those having the best score (say, least score).
The question is
How to get the "matches" sorted and reduced to top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by inner field.
Is there a way to ensure optimization in case of large input DFs - e.g. choosing minimums directly while aggregating. I know it could be done easily if I wrote the code without spark - keeping small array or priority queue for every id1 and adding element where it should be, possibly dropping out some previously added.
E.g. it's ok that cross-join is costly operation, but I want to avoid wasting memory on the results most of which I'm going to drop in the next step. My real use case deals with DFs with less than 1 mln entries so cross-join is yet viable but as we want to select only 10-20 top matches for each id1 it seems to be quite desirable not to keep unnecessary data between steps.
For start we need to take only the first n rows. To do this we are partitioning the DF by 'id1' and sorting the groups by the res. We use it to add row number column to the DF, like that we can use where function to take the first n rows. Than you can continue doing the same code your wrote. Grouping by 'id1' and collecting the list. Only now you already have the highest rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val n = 3
val w = Window.partitionBy($"id1").orderBy($"res".desc)
val res = r.withColumn("rn", row_number.over(w)).where($"rn" <= n).groupBy("id1").agg(collect_set(struct("id2", "res")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
val sortTakeUDF = udf{(xs: Seq[Row], n: Int)} => xs.sortBy(_.getAs[Int]("res")).reverse.take(n).map{case Row(x: String, y:Int)}}
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "res")), lit(n)).as("matches"))
In here we create a udf that take the array column and an integer value n. The udf sorts the array by your 'res' and returns only the first n elements.

How to convert numerical values to a categorical variable using pyspark

pyspark dataframe which have a range of numerical variables.
for eg
my dataframe have a column value from 1 to 100.
1-10 - group1<== the column value for 1 to 10 should contain group1 as value
11-20 - group2
.
.
.
91-100 group10
how can i achieve this using pyspark dataframe
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use floor() function to find the integral part of a number. For eg; floor(15.5) will be 15. We need to find the integral part of the Var/10 and add 1 to it, because the indexing starts from 1, as opposed to 0. Finally, we have need to prepend group to the value. Concatenation can be achieved with concat() function, but keep in mind that since the prepended word group is not a column, so we need to put it inside lit() which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+

Spark2 Dataframe/RDD process in groups

I have the following table stored in Hive called ExampleData:
+--------+-----+---|
|Site_ID |Time |Age|
+--------+-----+---|
|1 |10:00| 20|
|1 |11:00| 21|
|2 |10:00| 24|
|2 |11:00| 24|
|2 |12:00| 20|
|3 |11:00| 24|
+--------+-----+---+
I need to be able to process the data by Site. Unfortunately partitioning it by Site doesn't work (there are over 100k sites, all with fairly small amounts of data).
For each site, I need to select the Time column and Age column separately, and use this to feed into a function (which ideally I want to run on the executors, not the driver)
I've got a stub of how I think I want it to work, but this solution would only run on the driver, so it's very slow. I need to find a way of writing it so it will run an executor level:
// fetch a list of distinct sites and return them to the driver
//(if you don't, you won't be able to loop around them as they're not on the executors)
val distinctSites = spark.sql("SELECT site_id FROM ExampleData GROUP BY site_id LIMIT 10")
.collect
val allSiteData = spark.sql("SELECT site_id, time, age FROM ExampleData")
distinctSites.foreach(row => {
allSiteData.filter("site_id = " + row.get(0))
val times = allSiteData.select("time").collect()
val ages = allSiteData.select("ages").collect()
processTimesAndAges(times, ages)
})
def processTimesAndAges(times: Array[Row], ages: Array[Row]) {
// do some processing
}
I've tried broadcasting the distinctSites across all nodes, but this did not prove fruitful.
This seems such a simple concept and yet I have spent a couple of days looking into this. I'm very new to Scala/Spark, so apologies if this is a ridiculous question!
Any suggestions or tips are greatly appreciated.
RDD API provides a number of functions which can be used to perform operations in groups starting with low level repartition / repartitionAndSortWithinPartitions and ending with a number of *byKey methods (combineByKey, groupByKey, reduceByKey, etc.).
Example :
rdd.map( tup => ((tup._1, tup._2, tup._3), tup) ).
groupByKey().
forEachPartition( iter => doSomeJob(iter) )
In DataFrame you can use aggregate functions,GroupedData class provides a number of methods for the most common functions, including count, max, min, mean and sum
Example :
val df = sc.parallelize(Seq(
(1, 10.3, 10), (1, 11.5, 10),
(2, 12.6, 20), (3, 2.6, 30))
).toDF("Site_ID ", "Time ", "Age")
df.show()
+--------+-----+---+
|Site_ID |Time |Age|
+--------+-----+---+
| 1| 10.3| 10|
| 1| 11.5| 10|
| 2| 12.6| 20|
| 3| 2.6| 30|
+--------+-----+---+
df.groupBy($"Site_ID ").count.show
+--------+-----+
|Site_ID |count|
+--------+-----+
| 1| 2|
| 3| 1|
| 2| 1|
+--------+-----+
Note : As you have mentioned that solution is very slow ,You need to use partition ,in your case range partition is good option.
http://dev.sortable.com/spark-repartition/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/