I am very new to PySpark. I am trying to compare (in an if statement) a DataFrame column value to a Python variable and then take the necessary action based on the result. However, I cannot seem to get it right.
If "filename" column in the DF has more than 1 unique value, Or if the distinct DF "filename" value is not equal to the variable in_file, then "Pass", else "Fail"
in_file = "my_file.txt"
if df.select(df.filename).distinct().count() > 1 or df.select(df.filename).distinct().collect()[0][0] != in_file:
print("Fail")
else:
print("Pass")
What is the best way to do this kind of comparison between a DF column and a Python variable in PySpark so that the program can branch based on the result of the comparison?
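A minimal sketch of one way to branch on this check, assuming the same df and in_file as above: collect the distinct filenames once and then compare in plain Python, so only a single distinct job is run.

# Sketch: pull the distinct filenames to the driver once, then branch in Python.
distinct_files = [row.filename for row in df.select("filename").distinct().collect()]

if len(distinct_files) == 1 and distinct_files[0] == in_file:
    print("Pass")
else:
    print("Fail")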
In PySpark, I tried to do this:
df = df.select(F.col("id"),
F.col("mp_code"),
F.col("mp_def"),
F.col("mp_desc"),
F.col("mp_code_desc"),
F.col("zdmtrt06_zstation").alias("station"),
F.to_timestamp(F.col("date_time"), "yyyyMMddHHmmss").alias("date_time_utc"))
df = df.groupBy("id", "mp_code", "mp_def", "mp_desc", "mp_code_desc", "station").min(F.col("date_time_utc"))
But I have an issue:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
Here is an extract from the PySpark documentation:
GroupedData.min(*cols)
    Computes the min value for each numeric column for each group.
    New in version 1.3.0.
    Parameters: cols : str
In other words, the min function does not support column arguments. It only works with column names (strings) like this:
df.groupBy("x").min("date_time_utc")
# you can also specify several column names
df.groupBy("x").min("y", "z")
Note that if you want to use a column object, you have to use agg:
df.groupBy("x").agg(F.min(F.col("date_time_utc")))
I'm trying to add exploded columns to a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))

schema = StructType().add("a", MapType(StringType(), IntegerType()))

events = jsonToDataFrame("""
{
  "a": {
    "b": 1,
    "c": 2
  }
}
""", schema)

display(
  events.withColumn("a", explode("a").alias("x", "y"))
)
However, I'm hitting the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got a
Any ideas?
In the end, I used the following:
display(
  events.select(explode("a").alias("x", "y"), *[c for c in events.columns])
)
This approach uses select to specify the columns to return.
The first argument explodes the data:
explode("a").alias("x", "y")
The second argument specifies that all existing columns should be included in the select:
*[c for c in events.columns]
Note that I'm prefixing the list with * - this sends each column name as a separate parameter.
Simpler Method
The API docs specify:
Parameters:
    cols : str, Column, or list
        column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
We can simplify the first approach by passing in "*" to select all the columns:
display(
  events.select("*", explode("a").alias("x", "y"))
)
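For readers outside Databricks, here is a self-contained sketch of the same idea using plain PySpark and .show() instead of display (the SparkSession setup and the createDataFrame call are my own assumptions, not from the original):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.types import StructType, MapType, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType().add("a", MapType(StringType(), IntegerType()))
events = spark.createDataFrame([({"b": 1, "c": 2},)], schema)

# explode("a") emits one row per map entry; alias("x", "y") names the key and value columns.
events.select("*", explode("a").alias("x", "y")).show()
# Produces two rows: (x=b, y=1) and (x=c, y=2), each alongside the original "a" column.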
I would like to parse and get the value of a specific key from a PySpark SQL DataFrame column with the below format.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100 MB. I tried explode as well, but it gives separate rows for each array element; I only need the value of the specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
Function to get a specific key's value
def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
getValue = udf(getValueFunc, StringType())
....
# register the name rank udf template
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
For Spark 2.4.0+, you can use Spark SQL's filter() function to find the first array element that matches key == searchkey and then retrieve its value. Below is a Spark SQL snippet template (searchkey as a variable) to do the first part mentioned above.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function, assign the value (a StructType) to a temporary column f1, and then use the coalesce() function to retrieve the non-null value.
from pyspark.sql.functions import expr

df.withColumn('f1', expr(stmt)) \
  .selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value), '---') AS event_category") \
  .show()
Let me know if you have any problem running the above code.
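If you need this for several keys at once (the question mentions 40 columns), a sketch reusing the same filter()/coalesce() pattern in one select could look like the following; the helper name extract_key and the second key "label" are illustrative, not from the original:

from pyspark.sql.functions import expr

def extract_key(array_col, key):
    # Build the same filter()/coalesce() expression as above for a single key.
    return expr(
        '''coalesce(filter({0}, x -> x.key == "{1}")[0].value.string_value,
                    string(filter({0}, x -> x.key == "{1}")[0].value.int_value),
                    "---")'''.format(array_col, key)
    ).alias(key)

df.select(extract_key("event_params", "category"),
          extract_key("event_params", "label")).show()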
I have duplicate rows in a PySpark data frame that may contain the same data or have missing values.
The code that I wrote is very slow and does not work as a distributed system.
Does anyone know how to retain single unique values from duplicate rows in a PySpark Dataframe which can run as a distributed system and with fast processing time?
I have written complete PySpark code and this code works correctly.
But the processing time is really slow and it's not possible to use it on a Spark cluster.
# Columns of duplicate rows of DF
dup_columns = df.columns

for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)

    # Match duplicates using std name and create RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname']))
                              .where(sf.col("stdaddress") == row_value['stdaddress']))
                           .rdd.map(fill_duplicates))

    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress") == row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())

    # Creating DF using the previous RDD
    # This DF stores the values of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)

    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator()
                          if value[column] != ""])
            if len(col_value) >= 1:
                # Non-null or non-empty value of the column is stored here.
                # This value is the single distinct non-duplicate value.
                col_value = col_value[0]
                #print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column, sf.when((sf.col("stdname") == row_value['stdname'])
                                                        & (sf.col("stdaddress") == row_value['stdaddress']),
                                                        col_value)
                                        .otherwise(df_dedup[column])))
                #print(col_value)
        except:
            print("None")
There are no error messages, but the code runs very slowly. I want a solution that fills empty rows in the PySpark DF with the unique value; it could even fill them with the mode of the values.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
# distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
"""
I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls; see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver (not per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))