Checking 200 million records using PySpark takes a lot of time - pyspark

I have an Oracle table with about 200 million records. I must check these records and find any record in which at least one field has a problem. I have a list of fields that must not be NULL; I have to find the records where those fields are NULL and save them to the Error_table for reporting. My code is the following:
def detect_errors(df, id, part, spark, repeated_id):
    list_repetitived_id = list(repeated_id.keys())
    entity_name = part.split('_')[0]
    # Create an expected schema for df_error
    columns = StructType([StructField('date_time', StringType(), True),
                          StructField('entity_name', StringType(), True),
                          StructField('id', StringType(), True),
                          StructField('error_type', StringType(), True),
                          StructField('record', StringType(), True)])
    df_error = spark.createDataFrame(data=[], schema=columns)
    columns_partition = StructType([StructField('date_group', StringType(), True),
                                    StructField('ac_id', StringType(), True),
                                    StructField('partition_id', StringType(), True)])
    df_partitions = spark.createDataFrame(data=[], schema=columns_partition)
    # Writing repeated ids in the error table:
    if len(list_repetitived_id) > 0:
        records_repetitived_id = df.filter(col(id).isin(list_repetitived_id))
        # Adding this line for Partition State
        df_partitions = records_repetitived_id.select("date", "ac_id", "partition_id")\
            .withColumnRenamed('DATE', 'date_group').union(df_partitions)
        error_type = 'Repetitived_' + id
        df_error = filling_df_error(records_repetitived_id, id, entity_name, error_type).union(df_error)
    # Extract all columns of the table which must not be null
    list_fields = [i for i in df.columns if gf.config['company'][part][i]['check']]
    df_check = df.filter(~col(id).isin(list_repetitived_id))
    for i in list_fields:
        df_null = df_check.filter(col(i).isNull())
        list_id = list(df_null.select(id).toPandas()[id])
        df_check = df_check.filter(~col(id).isin(list_id))
        if df_null.first() != None:
            df_partitions = df_null.select("date", "ac_id", "partition_id")\
                .withColumnRenamed('DATE', 'date_group').union(df_partitions)
            error_type = 'Null_Value' + '_' + i
            df_error = filling_df_error(df_null, id, entity_name, error_type).union(df_error)
    list_id = list(df_error.select('id').toPandas()['id'])
    df = df.filter(~col(id).isin(list_id))
    return df_error, df, df_partitions
In the above code, df_partitions is a Spark dataframe holding the state of each partition (finished or not).
The code runs without any error, but it takes a lot of time.
DAG of the code:
Would you please guide me on how to improve the code so that it runs faster for 200 million records?
Any help is really appreciated.

If the intention is to check if a column (from a list of columns) is null, we can use when().otherwise() to flag the records and create a new dataframe using the flagged records.
Here's an example.
from functools import reduce
from pyspark.sql import functions as func

# sample data matching the table shown below
data_ls = [(1, 1, 1, 1, 1), (1, None, 1, None, None), (1, 0, None, None, 1)]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['col1', 'col2', 'col3', 'col4', 'col5'])
# +----+----+----+----+----+
# |col1|col2|col3|col4|col5|
# +----+----+----+----+----+
# | 1| 1| 1| 1| 1|
# | 1|null| 1|null|null|
# | 1| 0|null|null| 1|
# +----+----+----+----+----+
# list of columns that should not be null
cols_to_check = ['col1', 'col3', 'col5']
# condition to check if any of the columns in `cols_to_check` has null
when_null_condition = reduce(lambda x, y: x | y, [func.col(k).isNull() for k in cols_to_check])
# Column<'(((col1 IS NULL) OR (col3 IS NULL)) OR (col5 IS NULL))'>
data_sdf. \
    withColumn('null_flag',
               func.when(when_null_condition, func.lit(1)).
               otherwise(func.lit(0))
               ). \
    show()
# +----+----+----+----+----+---------+
# |col1|col2|col3|col4|col5|null_flag|
# +----+----+----+----+----+---------+
# | 1| 1| 1| 1| 1| 0|
# | 1|null| 1|null|null| 1|
# | 1| 0|null|null| 1| 1|
# +----+----+----+----+----+---------+
Use the null_flag = 1 filter to create your error_table dataframe.
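As an illustration (not from the original answer), the same combined condition can also be used directly to split the data into error and clean records in a single pass, instead of filtering column by column; df_error and df_clean are just example names:
# Sketch only: reuse the `when_null_condition` built above with reduce()
df_error = data_sdf.filter(when_null_condition)     # rows with at least one null among cols_to_check
df_clean = data_sdf.filter(~when_null_condition)    # rows that pass the check

# df_error could then be appended to an error table; the table name here is an assumption
# df_error.write.mode('append').saveAsTable('error_table')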

Related

PySpark - GroupBy and aggregation with multiple conditions

I want to group and aggregate data with several conditions. The dataframe contains a product id, fault codes, date and a fault type. Here, I prepared a sample dataframe:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, DateType
from datetime import datetime, date
data = [("prod_001","fault_01",date(2020, 6, 4),"minor"),
("prod_001","fault_03",date(2020, 7, 2),"minor"),
("prod_001","fault_09",date(2020, 7, 14),"minor"),
("prod_001","fault_01",date(2020, 7, 14),"minor"),
("prod_001",None,date(2021, 4, 6),"major"),
("prod_001","fault_02",date(2021, 6, 22),"minor"),
("prod_001","fault_09",date(2021, 8, 1),"minor"),
("prod_002","fault_01",date(2020, 6, 13),"minor"),
("prod_002","fault_05",date(2020, 7, 11),"minor"),
("prod_002",None,date(2020, 8, 1),"major"),
("prod_002","fault_01",date(2021, 4, 15),"minor"),
("prod_002","fault_02",date(2021, 5, 11),"minor"),
("prod_002","fault_03",date(2021, 5, 13),"minor"),
]
schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("fault_code", StringType(), True),
    StructField("date", DateType(), True),
    StructField("fault_type", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
display(df)
In general I would like to group by product_id and then aggregate the fault_codes into lists over the dates. The special part here is that the aggregation into a list should continue until the fault_type changes from minor to major; the row tagged major then adopts the last state of the aggregation (see screenshot). Within one product_id, the list aggregation should then start over with the next fault_code that is flagged as minor.
see target output here
In some other posts I found the following code snippet, which I already tried. Unfortunately I didn't manage the full aggregation with all the conditions yet.
df.sort("product_id", "date").groupby("product_id", "date").agg(F.collect_list("fault_code"))
Edit:
Got a little bit closer with Window.partitionBy(), but I am still not able to restart the collect_list() once the fault_type changes to major with the following code:
df_test = df.sort("product_id", "date").groupby("product_id", "date", "fault_type").agg(F.collect_list("fault_code")).withColumnRenamed('collect_list(fault_code)', 'fault_code_list')
window_function = Window.partitionBy("product_id").rangeBetween(Window.unboundedPreceding, Window.currentRow).orderBy("date")
df_test = df_test.withColumn("new_version_v2", F.collect_list("fault_code_list").over(Window.partitionBy("product_id").orderBy("date"))) \
.withColumn("new_version_v2", F.flatten("new_version_v2"))
Does someone know how to do that?
Your edit is close. This is not that simple, and I only came up with a solution that works but is not so neat.
lagw = Window.partitionBy('product_id').orderBy('date')
grpw = Window.partitionBy(['product_id', 'grp']).orderBy('date').rowsBetween(Window.unboundedPreceding, 0)

df = (df.withColumn('grp', F.sum(
          (F.lag('fault_type').over(lagw).isNull()
           | (F.lag('fault_type').over(lagw) == 'major')
          ).cast('int')).over(lagw))
      .withColumn('fault_code', F.collect_list('fault_code').over(grpw)))

df.orderBy(['product_id', 'grp']).show()
# +----------+----------------------------------------+----------+----------+---+
# |product_id| fault_code| date|fault_type|grp|
# +----------+----------------------------------------+----------+----------+---+
# | prod_001|[fault_01] |2020-06-04| minor| 1|
# | prod_001|[fault_01, fault_03] |2020-07-02| minor| 1|
# | prod_001|[fault_01, fault_03, fault_09] |2020-07-14| minor| 1|
# | prod_001|[fault_01, fault_03, fault_09, fault_01]|2020-07-14| minor| 1|
# | prod_001|[fault_01, fault_03, fault_09, fault_01]|2021-04-06| major| 1|
# | prod_001|[fault_02] |2021-06-22| minor| 2|
# | prod_001|[fault_02, fault_09] |2021-08-01| minor| 2|
# | prod_002|[fault_01] |2020-06-13| minor| 1|
# | prod_002|[fault_01, fault_02] |2020-07-11| minor| 1|
...
Explanation:
First I create the grp column to categorize the consecutive "minor" rows plus the following "major" row. I use sum and lag to check whether the previous row was "major": if so, I increment, otherwise I keep the same value as the previous row.
# If cond is True, sum 1, if False, sum 0.
F.sum((cond).cast('int'))
df.orderBy(['product_id', 'date']).select('product_id', 'date', 'fault_type', 'grp').show()
+----------+----------+----------+---+
|product_id| date|fault_type|grp|
+----------+----------+----------+---+
| prod_001|2020-06-04| minor| 1|
| prod_001|2020-07-02| minor| 1|
| prod_001|2020-07-14| minor| 1|
| prod_001|2020-07-14| minor| 1|
| prod_001|2021-04-06| major| 1|
| prod_001|2021-06-22| minor| 2|
| prod_001|2021-08-01| minor| 2|
| prod_002|2020-06-13| minor| 1|
| prod_002|2020-07-11| minor| 1|
...
Once this grp is generated, I can partition by product_id and grp to apply collect_list.
One possible approach is using a Pandas UDF with applyInPandas.
Define a "normal" Python function whose input is a Pandas dataframe and whose output is another Pandas dataframe. The dataframe's size doesn't matter.
def grp(df):
    df['a'] = 'AAA'
    df = df[df['fault_code'] == 'fault_01']
    return df[['product_id', 'a']]
Test this function with an actual Pandas dataframe.
The only thing to remember is that this dataframe is just a subset of your actual dataframe.
grp(df.where('product_id == "prod_001"').toPandas())
product_id a
0 prod_001 AAA
3 prod_001 AAA
Apply this function to the Spark dataframe with applyInPandas
(df
.groupBy('product_id')
.applyInPandas(grp, schema)
.show()
)
+----------+---+
|product_id| a|
+----------+---+
| prod_001|AAA|
| prod_001|AAA|
| prod_002|AAA|
| prod_002|AAA|
| prod_002|AAA|
| prod_002|AAA|
| prod_002|AAA|
+----------+---+
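One thing to note: schema in the applyInPandas call above is not shown in the snippet; it has to describe the columns that grp returns. A minimal sketch, assuming the two output columns shown above:
# Illustrative only: a DDL string describing grp()'s output columns
schema = 'product_id string, a string'

# An equivalent StructType would also work:
# from pyspark.sql.types import StructType, StructField, StringType
# schema = StructType([StructField('product_id', StringType(), True),
#                      StructField('a', StringType(), True)])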

PySpark Working with Delta tables - For Loop Optimization with Union

I'm currently working in Databricks and have a Delta table with 20+ columns. I basically need to take a value from one column in each row, send it to an API which returns two values/columns, and then merge those values plus the other 26 columns back into the original Delta table. So the input is 28 columns and the output is 28 columns. Currently my code looks like:
from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit
from functools import reduce

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning", "true")
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.conf.set("spark.sql.inMemorycolumnarStorage.compressed", "true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning", "true")

output = spark.sql("select * from delta.`table`").cache()
SeriesAppend = []

for i in output.collect():
    # small mapping fix
    if i['col1'] == 'val1':
        var0 = 'a'
    elif i['col1'] == 'val2':
        var0 = 'b'
    elif i['col1'] == 'val3':
        var0 = 'c'
    elif i['col1'] == 'val4':
        var0 = 'd'

    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)

    # subscription info
    headers = {header}
    body = [{
        'text': i['col2']
    }]

    if len(i['col2']) < 500:
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        dumps = json.dumps(response[0])
        loads = json.loads(dumps)
        json_rdd = sc.parallelize(loads)
        json_df = spark.read.json(json_rdd)
        json_df = json_df.withColumn('col1', lit(i['col1']))
        json_df = json_df.withColumn('col2', lit(i['col2']))
        json_df = json_df.withColumn('col3', lit(i['col3']))
        ...
        SeriesAppend.append(json_df)
    else:
        pass

Series_output = reduce(DataFrame.unionAll, SeriesAppend)
SAMPLE DF with only 3 columns:
df = spark.createDataFrame(
[
("a", "cat","owner1"), # create your data here, be consistent in the types.
("b", "dog","owner2"),
("c", "fish","owner3"),
("d", "fox","owner4"),
("e", "rat","owner5"),
],
["col1", "col2", "col3"]) # add your column names here
I really just need to write the response plus the other column values to a delta table, so dataframes are not necessarily required, but I haven't found a faster way than the above. Right now, I can run 5 inputs, which returns 15 in 25.3 seconds without the unionAll. With the inclusion of the union, it turns into 3 minutes.
The final output would look like:
df = spark.createDataFrame(
[
("a", "cat","owner1","MI", 48003), # create your data here, be consistent in the types.
("b", "dog","owner2", "MI", 48003),
("c", "fish","owner3","MI", 48003),
("d", "fox","owner4","MI", 48003),
("e", "rat","owner5","MI", 48003),
],
["col1", "col2", "col3", "col4", "col5"]) # add your column names here
How can I make this faster in spark?
As mentioned in my comments, you should use a UDF to distribute the workload to the workers instead of collecting and letting a single machine (the driver) run it all. That is simply the wrong approach and it is not scalable.
# This is your main function, pure Python, and you can unittest it in any way you want.
# The most important thing about this function is:
# - everything must be encapsulated inside the function, no global variable works here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'

    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)

    # subscription info
    headers = {header}  # !!! `header` must be available **inside** this function, global won't work
    body = [{
        'text': col2
    }]

    if len(col2) < 500:
        # !!! same as `header`, `constructed_url` must be available **inside** this function, global won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        return (response.col4, response.col5)
    else:
        return None

# Now you wrap the function above into a Spark UDF.
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same for the output: I'm using only a tuple with 2 elements, but you can make it as many items as you wish.
df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2')).show()
# Output
# +----+----+------+------------------+
# |col1|col2| col3| temp|
# +----+----+------+------------------+
# | a| cat|owner1|[foo_cat, bar_cat]|
# | b| dog|owner2|[foo_dog, bar_dog]|
# | c|fish|owner3| null|
# | d| fox|owner4| null|
# | e| rat|owner5| null|
# +----+----+------+------------------+
# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
.withColumn('col4', F.col('temp')[0])
.withColumn('col5', F.col('temp')[1])
.drop('temp')
.show()
)
# Output
# +----+----+------+-------+-------+
# |col1|col2| col3| col4| col5|
# +----+----+------+-------+-------+
# | a| cat|owner1|foo_cat|bar_cat|
# | b| dog|owner2|foo_dog|bar_dog|
# | c|fish|owner3| null| null|
# | d| fox|owner4| null| null|
# | e| rat|owner5| null| null|
# +----+----+------+-------+-------+
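A small variation (not from the original answer): the same req() function could be wrapped with a struct return type instead of an array, so that col4 and col5 come back as named fields; a minimal sketch:
from pyspark.sql import functions as F, types as T

# Sketch only: reuse req() from above, but declare the return type as a struct
req_udf = F.udf(req, T.StructType([
    T.StructField('col4', T.StringType(), True),
    T.StructField('col5', T.StringType(), True),
]))

(df
 .withColumn('temp', req_udf('col1', 'col2'))
 .select('*', 'temp.col4', 'temp.col5')  # struct fields can be pulled out by name
 .drop('temp')
 .show())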

How to create a pyspark DataFrame inside of a loop?

How to create a pyspark DataFrame inside of a loop? In each iteration of this loop I am printing two values, print(a1, a2). Now I want to store all these values in a pyspark dataframe.
Initially, before the loop, you could create an empty dataframe with your preferred schema. Then, in each iteration of the loop, create a new df with the same schema and union it with your original dataframe. Refer to the code below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('a1', StringType(), True),
    StructField('a2', StringType(), True)
])
df = spark.createDataFrame([], schema)

for i in range(1, 5):
    a1 = i
    a2 = i + 1
    newRow = spark.createDataFrame([(a1, a2)], schema)
    df = df.union(newRow)

print(df.show())
This gives me the below result where the values are appended to the df in each loop.
+---+---+
| a1| a2|
+---+---+
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|
+---+---+
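A union inside a loop grows the query plan with every iteration, which can get slow when there are many iterations. An alternative sketch (not part of the original answer) is to collect the values in a plain Python list and create the dataframe once at the end:
# Sketch only: accumulate rows in Python, then build the dataframe in a single call
rows = []
for i in range(1, 5):
    a1 = i
    a2 = i + 1
    rows.append((a1, a2))

df = spark.createDataFrame(rows, ['a1', 'a2'])  # column types are inferred here
df.show()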

Manually create a pyspark dataframe

I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
[
StructField("time_epocs", DecimalType(), True),
StructField("lat", DecimalType(), True),
StructField("long", DecimalType(), True),
]
)
df_in_test = spark.createDataFrame(rdd, schema)
This gives an error when I try to display the dataframe, so I am not sure how to do this.
However, the Spark documentation seems to be a bit convoluted to me, and I got similar errors when I tried to follow those instructions.
Does anyone know how to do this?
Simple dataframe creation:
df = spark.createDataFrame(
[
(1, "foo"), # create your data here, be consistent in the types.
(2, "bar"),
],
["id", "label"] # add your column names here
)
df.printSchema()
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
df.show()
+---+-----+
| id|label|
+---+-----+
| 1| foo|
| 2| bar|
+---+-----+
According to official doc:
when schema is a list of column names, the type of each column will be inferred from data. (example above ↑)
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data. (examples below ↓)
# Example with a datatype string
df = spark.createDataFrame(
[
(1, "foo"), # Add your data here
(2, "bar"),
],
"id int, label string", # add column names and types here
)
# Example with pyspark.sql.types
from pyspark.sql import types as T
df = spark.createDataFrame(
[
(1, "foo"), # Add your data here
(2, "bar"),
],
T.StructType( # Define the whole schema within a StructType
[
T.StructField("id", T.IntegerType(), True),
T.StructField("label", T.StringType(), True),
]
),
)
df.printSchema()
root
|-- id: integer (nullable = true) # type is forced to Int
|-- label: string (nullable = true)
Additionally, you can create your dataframe from a Pandas dataframe; the schema will be inferred from the Pandas dataframe's types:
import pandas as pd
import numpy as np
pdf = pd.DataFrame(
{
"col1": [np.random.randint(10) for x in range(10)],
"col2": [np.random.randint(100) for x in range(10)],
}
)
df = spark.createDataFrame(pdf)
df.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
To elaborate/build off of #Steven's answer:
field = [
StructField("MULTIPLIER", FloatType(), True),
StructField("DESCRIPTION", StringType(), True),
]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will create a blank dataframe.
We can now simply add a row to it:
l = [(2.3, "this is a sample description")]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = multiplier_df.union(multiplier_df_temp)
This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
You can also pass createDataFrame a RDD and schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
schema = StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df from my Quinn project allows for the best of both worlds - it's concise and fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
[("jose", "a"), ("li", "b"), ("sam", "c")],
[("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF doesn't offer any advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
With formatting
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
(1, "foo"),
(2, "bar"),
],
StructType(
[
StructField("id", IntegerType(), False),
StructField("txt", StringType(), False),
]
),
)
print(df.dtypes)
df.show()
Extending #Steven's Answer:
data = [(i, 'foo') for i in range(1000)] # random data
columns = ['id', 'txt'] # add your columns label here
df = spark.createDataFrame(data, columns)
Note: When schema is a list of column-names, the type of each column will be inferred from data.
If you want to specifically define schema then do this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Outputs:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]
For beginners, a full example importing data from a file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
ShortType,
StringType,
StructType,
StructField,
TimestampType,
)
import os
here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()
schema = StructType(
[
StructField("id", ShortType(), nullable=False),
StructField("string", StringType(), nullable=False),
StructField("datetime", TimestampType(), nullable=False),
]
)
# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)
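The comment above also mentions constructing rows manually as an alternative to reading a file; a minimal sketch using the same schema (the sample values are made up):
from datetime import datetime

# Sketch only: build the same dataframe from in-memory rows instead of a CSV file
rows = [
    (1, "foo", datetime(2021, 1, 1, 12, 0, 0)),
    (2, "bar", datetime(2021, 1, 2, 12, 0, 0)),
]
df = spark.createDataFrame(rows, schema=schema)
df.show()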

How to handle the null/empty values on a dataframe Spark/Scala

I have a CSV file and I am processing its data.
I am working with data frames, and I calculate average, min, max, mean, sum of each column based on some conditions. The data of each column could be empty or null.
I have noticed that in some cases I got a null value as max or sum instead of a number, or max() returned a number that is less than what min() returns.
I do not want to replace the null/empty values with other values.
The only thing I have done is to use these 2 options in CSV:
.option("nullValue", "null")
.option("treatEmptyValuesAsNulls", "true")
Is there any way to handle this issue? Has anyone faced this problem before? Is it a problem of data types?
I run something like this:
data.agg(mean("col_name"), stddev("col_name"),count("col_name"),
min("col_name"), max("col_name"))
Otherwise I can consider that it is a problem in my code.
I have done some research on this question, and the results show that the mean, max, and min functions ignore null values. Below are the experiment code and results.
Environment: Scala, Spark 1.6.1, Hadoop 2.6.0
import org.apache.spark.sql.{Row}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
val row1 =Row("1", 2.4, "2016-12-21")
val row2 = Row("1", None, "2016-12-22")
val row3 = Row("2", None, "2016-12-23")
val row4 = Row("2", None, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")
val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))
val schema = StructType(StructField("key", StringType, false) ::
StructField("value", DoubleType, true) ::
StructField("d", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(theRdd, schema)
df.show()
df.agg(mean($"value"), max($"value"), min($"value")).show()
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()
Output:
+---+-----+----------+
|key|value| d|
+---+-----+----------+
| 1| 2.4|2016-12-21|
| 1| null|2016-12-22|
| 2| null|2016-12-23|
| 2| null|2016-12-23|
| 3| 3.0|2016-12-22|
| 3| 2.0|2016-12-22|
+---+-----+----------+
+-----------------+----------+----------+
| avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667| 3.0| 2.0|
+-----------------+----------+----------+
+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
| 1| 2.4| 2.4| 2.4|
| 2| null| null| null|
| 3| 2.5| 3.0| 2.0|
+---+----------+----------+----------+
From the output you can see that the mean, max, and min functions on column 'value' of group key='1' return '2.4' instead of null, which shows that the null values were ignored in these functions. However, if the column contains only null values then these functions return null.
Contrary to one of the comments it is not true that nulls are ignored. Here is an approach:
max(coalesce(col_name,Integer.MinValue))
min(coalesce(col_name,Integer.MaxValue))
This will still have an issue if there were only null values: you will need to convert Min/MaxValue back to null, or to whatever you want to use to represent "no valid/non-null entries".
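A rough PySpark sketch of that coalesce idea (the snippet above is pseudo-Scala; this assumes a dataframe df with a nullable numeric column named value, and the sentinel values are just placeholders):
from pyspark.sql import functions as F

INT_MIN, INT_MAX = -2147483648, 2147483647  # stand-ins for Integer.MinValue / Integer.MaxValue

# Sketch only: replace nulls with sentinels before aggregating
agg_df = df.agg(
    F.max(F.coalesce(F.col('value'), F.lit(INT_MIN))).alias('max_value'),
    F.min(F.coalesce(F.col('value'), F.lit(INT_MAX))).alias('min_value'),
)

# Map the sentinels back to null for the all-null case mentioned above
agg_df = agg_df.select(
    F.when(F.col('max_value') == INT_MIN, None).otherwise(F.col('max_value')).alias('max_value'),
    F.when(F.col('min_value') == INT_MAX, None).otherwise(F.col('min_value')).alias('min_value'),
)
agg_df.show()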
To add to other answers:
Remember that null and NaN are different things to Spark:
NaN is not a number and numeric aggregations on a column with NaN in it result in NaN
null is a missing value and numeric aggregations on a column with null ignore it as if the row wasn't even there
import numpy as np
from pyspark.sql import functions as F

df_ = spark.createDataFrame([(1, np.nan), (None, 2.0), (3, 4.0)], ("a", "b"))
df_.show()
+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|2.0|
|   3|4.0|
+----+---+

df_.agg(F.mean("a"), F.mean("b")).collect()
[Row(avg(a)=2.0, avg(b)=nan)]