Filtering PySpark dataframe rows

I have a complicated data structure that I managed to flatten and the output has the following structure:
'name'
------
['a','b','c']
[]
[null]
null
['f']
[null,'d']
The desired output after filtering the above data frame:
'name'
------
['a','b','c']
['f']
I know that rows that are just null can be filtered out using df.where(col('name').isNotNull()). I tried using
filtered = udf(lambda row: int(not all(x is None for x in row)), IntegerType())
but that didn't produce the results I was hoping for. How do I filter out rows that are an empty list or contain at least one null?

The filtered function below can be used as your UDF:
filtered = lambda x: not bool([y for y in x if y is None]) if x else False
>>> filtered(['a','b','c'])
True
>>> filtered([])
False
>>> filtered([None])
False
>>> filtered(None)
False
>>> filtered(['f'])
True
>>> filtered([None,'d'])
False
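A minimal sketch of wiring this predicate into the DataFrame itself (assuming the DataFrame is named df and the column holds arrays of strings):
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Wrap the predicate above as a Spark UDF that returns a boolean
filtered_udf = F.udf(filtered, BooleanType())

# Keep only rows whose 'name' array is non-empty and contains no nulls
result = df.filter(filtered_udf(F.col('name')))
result.show()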

Related

How to create a new Pyspark column with a subvalue of array of udts objects?

I have a column in my pyspark dataframe that has the following structure:
keywords_embedding : [
0:
vectorType: "dense"
length: 100
values: [1,2,3,4,5]
1:
vectorType: "dense"
length: 100
values: [4,5,6,7,8]
.
.
.
]
I need to create a new column that contains only the subvalue array 'values' from each position. I tried the following code:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Define the UDF
def extract_values(obj_array):
    return [obj.values for obj in obj_array]

# Define the return type of the UDF
extract_values_udf = F.udf(extract_values, ArrayType(ArrayType(DoubleType())))

# Extract the values from each object in the array
df_teste = df_teste.withColumn(
    'values_array',
    extract_values_udf(F.col('keywords_embedding'))
)

# View the results
df_teste.display()
But it gives me this error:
PickleException: expected zero arguments for construction of ClassDict
(for numpy.dtype)
I don't understand why it gives me this error. How can I achieve this result?
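No answer is shown here, but this particular PickleException usually means a UDF is returning numpy values where Spark expects plain Python types; with ML vector columns, obj.values is a numpy array. A hedged sketch of the common workaround, converting to Python floats inside the UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Convert each numpy value to a plain Python float before returning,
# so Spark can serialize the result as ArrayType(ArrayType(DoubleType()))
def extract_values(obj_array):
    return [[float(v) for v in obj.values] for obj in obj_array]

extract_values_udf = F.udf(extract_values, ArrayType(ArrayType(DoubleType())))
df_teste = df_teste.withColumn('values_array', extract_values_udf(F.col('keywords_embedding')))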

Spark computing percentile on a cell containing a list of doubles

I have a DataFrame that has a column of cells containing a list of doubles. Here is an example:
c1 c2 c3
-------------------------------------------
a a [0.0,1.0, 6.0,3.3 ...]
a b [1.0, 2.0, 3.4, ... ]
aa c [1.0, 2.2, 3.5, ... ]
...
This DataFrame was generated by reading in multiple CSV files, which then were passed through collect_list as well as sort_array. For example:
val df = orig.groupBy ("c1","c2").agg( sort_array(collect_list("c3")).as("c3") )
For each cell in column c3, I would like to compute a percentile over the list it contains, so the resulting DataFrame would hold only a single value in c3.
I would appreciate any pointers on this.
The following seems to have done the trick. Note that the correctness of the function is not really relevant here, but rather how it is invoked:
def computePercentile(data: WrappedArray[Double], tile: Int): Double = { ... }
val test = orig.select("c3").rdd
  .map { case Row(vals: WrappedArray[Double]) => (vals, computePercentile(vals, 95)) }
  .toDF("c3", "c1percent")
A second approach was a variation using a UDF.
val percentUDF = org.apache.spark.sql.functions.udf((vals: WrappedArray[Double]) => computePercentile(vals, 95))
...
val result = orig.groupBy ("c1","c2").agg(percentUDF(sort_array(collect_list("c3"))).as("c3"))
The resulting table is what I wanted:
c1 c2 c3
------------------------------
a a 0.111
a b 0.222
aa c 1.123
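Since the surrounding thread is about PySpark, here is a rough PySpark sketch of the same UDF-based approach (using numpy's percentile as a stand-in for computePercentile, which is an assumption here):
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Compute the 95th percentile of the collected, sorted list inside a UDF
percentile95_udf = F.udf(
    lambda xs: float(np.percentile(xs, 95)) if xs else None, DoubleType()
)

result = orig.groupBy("c1", "c2").agg(
    percentile95_udf(F.sort_array(F.collect_list("c3"))).alias("c3")
)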

Squared Sums with aggregateByKey PySpark

I have the data set a,
a = sc.parallelize([((1,2),(10,20,[1,3])),((1,2),(30,40,[1]))])
and I need the following:
(1,2) is going to be the key
Since I want to calculate the streaming standard deviation of the first two values, I need to evaluate the pure sums and sums of squares for each of these values. In other words, I need
sumx = (10 + 30), sumx^2 = (10^2 + 30^2) for the first value,
and
sumx = (20 + 40), sumx^2 = (20^2 + 40^2) for the second value.
For the final value (the lists), I just want to concatenate them.
The final result needs to be:
[((1,2), (40, 1000, 60, 2000, [1,3]))]
Here is my code:
a.aggregateByKey((0.0,0.0,0.0,0.0,[]),\
(lambda x,y: (x[0]+y[0],x[0]*x[0]+y[0]*y[0],x[1]+y[1],x[1]*x[1]+y[1]*y[1],x[2]+y[2])),\
(lambda rdd1,rdd2: (rdd1[0]+rdd2[0],rdd1[1]+rdd2[1],rdd1[2]+rdd1[2],rdd1[3]+rdd2[3],rdd1[4]+rdd2[4]))).collect()
Unfortunately it returns the following error:
"TypeError: unsupported operand type(s) for +: 'float' and 'list'"
Any thoughts?
You can use HiveContext to solve this:
from pyspark.sql.context import HiveContext
hivectx = HiveContext(sc)
a = sc.parallelize([((1,2),(10,20,[1,3])),((1,2),(30,40,[1]))])
# Convert this to a dataframe
b = a.toDF(['col1','col2'])
# Explode col2 into individual columns
c = b.map(lambda x: (x.col1,x.col2[0],x.col2[1],x.col2[2])).toDF(['col1','col21','col22','col23'])
c.registerTempTable('mydf')
sql = """
select col1,
sum(col21) as sumcol21,
sum(POW(col21,2)) as sum2col21,
sum(col22) as sumcol22,
sum(POW(col22,2)) as sum2col22,
collect_set(col23) as col23
from mydf
group by col1
"""
d = hivectx.sql(sql)
# Get back your original dataframe
e = d.map(lambda x:(x.col1,(x.sumcol21,x.sum2col21,x.sumcol22,x.sum2col22,[item for sublist in x.col23 for item in sublist]))).toDF(['col1','col2'])
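As an aside, the original TypeError comes from mixing up the two layouts inside the seqFunc: x is the five-slot accumulator while y is the raw three-element value, so x[2] (a float) ends up being added to y[2] (a list). A minimal sketch of fixing the aggregateByKey directly, without going through a DataFrame:
# Zero value: (sum_x, sum_x2, sum_y, sum_y2, concatenated_list)
zero = (0.0, 0.0, 0.0, 0.0, [])

def seq_func(acc, val):
    # acc has five slots; val is the original (x, y, list) triple
    x, y, lst = val
    return (acc[0] + x, acc[1] + x * x, acc[2] + y, acc[3] + y * y, acc[4] + lst)

def comb_func(p, q):
    return (p[0] + q[0], p[1] + q[1], p[2] + q[2], p[3] + q[3], p[4] + q[4])

a.aggregateByKey(zero, seq_func, comb_func).collect()
# [((1, 2), (40.0, 1000.0, 60.0, 2000.0, [1, 3, 1]))]  (list order can vary with partitioning)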

Replace missing values with mean - Spark Dataframe

I have a Spark Dataframe with some missing values. I would like to perform a simple imputation by replacing the missing values with the mean for that column. I am very new to Spark, so I have been struggling to implement this logic. This is what I have managed to do so far:
a) To do this for a single column (let's say Col A), this line of code seems to work:
df.withColumn("new_Col", when($"ColA".isNull, df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise($"ColA"))
b) However, I have not been able to figure out how to do this for all the columns in my dataframe. I was trying out the map function, but I believe it loops through each row of a dataframe.
c) There is a similar question on SO - here. And while I liked the solution (using Aggregated tables and coalesce), I was very keen to know if there is a way to do this by looping through each column (I come from R, so looping through each column using a higher order functional like lapply seems more natural to me).
Thanks!
Spark >= 2.2
You can use org.apache.spark.ml.feature.Imputer (which supports both mean and median strategy).
Scala:
import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(c => s"${c}_imputed"))
  .setStrategy("mean")

imputer.fit(df).transform(df)
Python:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
)
imputer.fit(df).transform(df)
Spark < 2.2
Here you are:
import org.apache.spark.sql.functions.mean
df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)
where
df.columns.map(mean(_)): Array[Column]
computes an average for each column,
df.select(_: _*).first.toSeq: Seq[Any]
collects the aggregated values and converts the row to Seq[Any] (I know it is suboptimal but this is the API we have to work with),
df.columns.zip(_).toMap: Map[String,Any]
creates a Map[String, Any] which maps each column name to its average, and finally:
df.na.fill(_): DataFrame
fills the missing values using:
fill: Map[String, Any] => DataFrame
from DataFrameNaFunctions.
To ignore NaN entries you can replace:
df.select(df.columns.map(mean(_)): _*).first.toSeq
with:
import org.apache.spark.sql.functions.{col, isnan, when}
df.select(df.columns.map(
  c => mean(when(!isnan(col(c)), col(c)))
): _*).first.toSeq
For imputing the median (instead of the mean) in PySpark < 2.2:
## Filter numeric columns
num_cols = [col_type[0] for col_type in filter(lambda dtype: dtype[1] in {"bigint", "double", "int"}, df.dtypes)]

## Compute a dict with <col_name, median_value>
median_dict = dict()
for c in num_cols:
    median_dict[c] = df.stat.approxQuantile(c, [0.5], 0.001)[0]
Then apply na.fill:
df_imputed = df.na.fill(median_dict)
For PySpark, this is the code I used:
mean_dict = { col: 'mean' for col in df.columns }
col_avgs = df.agg( mean_dict ).collect()[0].asDict()
col_avgs = { k[4:-1]: v for k, v in col_avgs.items() }
df.fillna( col_avgs ).show()
The four steps are:
1. Create the dictionary mean_dict mapping column names to the aggregate operation (mean).
2. Calculate the mean for each column, and save it as the dictionary col_avgs.
3. The column names in col_avgs start with avg( and end with ), e.g. avg(col1). Strip the parentheses out.
4. Fill the columns of the dataframe with the averages using col_avgs.

Unexpected behavior of filtering RDD with var

I have encountered a weird bug in my code, and while debugging I was able to narrow down the problem. When I filter a var RDD with a var variable and then store the filtered result in the same RDD, the RDD is updated correctly.
The thing is that after I update the var variable I used for filtering, the RDD is automatically filtered again!
example code:
var filter = 5
var a1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
a1 = a1.filter(t => !t.equals(filter))
a1.foreach(println) // result is: 1-9 without 5
filter = filter + 1
a1.foreach(println) // result is: 1-9 without 6
Why is that happening? What is the rule to follow so that this does not cause bugs in my code?
Spark transformations are lazily evaluated. When you call a1.filter, you get back a FilteredRDD; you don't actually have the result of the computation at that point in time. Only when you request an action on the transformation, such as foreach, is the transformation actually invoked.
Besides the lazy evaluation, the lambda expression captures the variable, not the value. This means that when you update filter, the variable captured by the lambda changes from 5 to 6, so filtering again excludes the updated value instead.
This is because a1 contains the complete DAG, and foreach is an action which triggers the DAG to compute the result.
scala> var a1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
a1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:21
scala> a1.toDebugString
res5: String = (4) ParallelCollectionRDD[4] at parallelize at <console>:21 []
scala> a1 = a1.filter(t => !t.equals(filter))
a1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at filter at <console>:25
scala> a1.toDebugString
res6: String =
(4) MapPartitionsRDD[5] at filter at <console>:25 []
| ParallelCollectionRDD[4] at parallelize at <console>:21 []
So whenever you print the RDD using foreach, it takes the current filter value into the closure and computes the DAG to give you the result.
filter = 6
a1.foreach(println) // will filter 6
filter = 9
a1.foreach(println) // will filter 9
Try these and see what happens:
var filter = 5
var a1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
a1 = sc.parallelize(a1.filter(t => !t.equals(filter)).collect())
a1.foreach(println)
filter = filter + 1
a1.foreach(println)
And this also:
var filter = 5
var a1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
a1 = a1.filter(t => !t.equals(filter)).cache()
a1.foreach(println)
filter = filter + 1
a1.foreach(println)
Hope these will make you think more!