Extract and Replace values from duplicates rows in PySpark Data Frame - pyspark

I have duplicate rows of the may contain the same data or having missing values in the PySpark data frame.
The code that I wrote is very slow and does not work as a distributed system.
Does anyone know how to retain single unique values from duplicate rows in a PySpark Dataframe which can run as a distributed system and with fast processing time?
I have written complete Pyspark code and this code works correctly.
But the processing time is really slow and its not possible to use it on a Spark Cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
print(row_value)
# Match duplicates using std name and create RDD
fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname'] ))
.where(sf.col("stdaddress")== row_value['stdaddress']))
.rdd.map(fill_duplicates))
# Creating feature names for the same RDD
fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
(sf.col("stdaddress")== row_value['stdaddress'])))
.rdd.map(fill_duplicated_columns_extract)).first())
# Creating DF using the previous RDD
# This DF stores value of a single set of matching duplicate rows
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
col_value = ([str(value[column]) for value in
df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
if len(col_value) >= 1:
# non null or empty value of a column store here
# This value is a no duplicate distinct value
col_value = col_value[0]
#print(col_value)
# The non-duplicate distinct value of the column is stored back to
# replace any rows in the PySpark DF that were empty.
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
#print(col_value)
except:
print("None")
'''
There are no error messages but the code is running very slow. I want a solution that fills rows with unique values in PySpark DF that are empty. It can fill the rows with even mode of the value

"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
# distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
"""

Related

How debug spark dropduplicate and join function calls?

There is some table with duplicated rows. I am trying to reduce duplicates and stay with latest my_date (if there are
rows with same my_date it is no matter which one to use)
val dataFrame = readCsv()
.dropDuplicates("my_id", "my_date")
.withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
.groupBy(dataFrame("my_id").alias("g_my_id"))
.agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
.drop("g_my_id", "g_my_date_int")
But after this code I when grab distinct my_id I get about 3000 less than in source table. What a reason can be?
how to debug this situation?
After doing drop duplicates do a except of this data frame with the original data frame this should give some insight on the rows which are additionally getting dropped . Most probably there are certain null or empty values for those columns which are being considered duplicates.

Pyspark remove columns with 10 null values

I am new to PySpark.
I have read a parquet file. I only want to keep columns that have atleast 10 values
I have used describe to get the count of not-null records for each column
How do I now extract the column names that have less than 10 values and then drop those columns before writing to a new file
df = spark.read.parquet(file)
col_count = df.describe().filter($"summary" == "count")
You can convert it into a dictionary and then filter out the keys(column names) based on their values (count < 10, the count is a StringType() which needs to be converted to int in the Python code):
# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')
# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()
# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]
# drop bad columns
df_new = df.drop(*bad_cols)
Some notes:
use #pault's approach if the information can not be retrieved directly from df.describe() or df.summary() etc.
you need to drop() instead of select() columns since describe()/summary() only include numeric and string columns, selecting columns from a list processed by df.describe() will lose columns of TimestampType(), ArrayType() etc

IndexOutOfBoundsException when writing dataframe into CSV

So, I'm trying to read an existing file, save that into a DataFrame, once that's done I make a "union" between that existing DataFrame and a new one I have already created, both have the same columns and share the same schema.
ALSO I CANNOT GIVE SIGNIFICANT NAME TO VARS NOR GIVE ANYMORE DATA BECAUSE OF RESTRICTIONS
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done I get the "start_ts" (a timestamp on Epoch format) that's duplicate in the union between the above dataframes (df4 and dfExist) and also I get rid of some characters I don't want
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this "start_ts" duplicate to filter the DataFrame and create 2 new DataFrames selecting the items of this duplicate timestamp, and the items that are not like this duplicate timestamp
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
And then I save in 2 different lists the avg_value and the Number_of_values
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I make some maths with the values on the lists (THIS IS WHERE I'M GETTING ISSUES)
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then save the duplicate timestamp (df7), the avg and the number of values into a list as a single item, this list transforms into a DataFrame and then I transform I get a new DataFrame with the columns how are supposed to be.
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
$"_tmp".getItem(0).as("start_ts"),
$"_tmp".getItem(1).as("avg_value"),
$"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally I join the DataFrame without duplicates with the new DataFrame which have the duplicate timestamp and the avg of the duplicate avg values and the sum of number of values.
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in SPARK it gives me the error, I supposed it was related to empty lists (since it's giving me the error when making some maths with the values of the lists) but If I delete the line where I write to CSV, the code runs perfectly, also I saved the lists and values of the math calcs into files and they are not empty.
My supposition, is that, is deleting the file before reading it (because of how spark distribute tasks between workers) and that's why the list is empty therefore I'm getting this error when trying to make maths with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? also I've been only 1 month with scala/spark so any code recommendation will be nice as well.
Thanks beforehand.
This error comes because of the Data. Any of your list does not contains columns as expected. When you refer to that index, the List gives this error to you
It was a problem related to reading files, I made a check (df.rdd.isEmpty) and wether the DF was empty I was getting this error. Made this as an if/else statement to check if the DF is empty, and now it works fine.

Filter columns having count equal to the input file rdd Spark

I'm filtering Integer columns from the input parquet file with below logic and been trying to modify this logic to add additional validation to see if any one of the input columns have count equals to the input parquet file rdd count. I would want to filter out such column.
Update
The number of columns and names in the input file will not be static, it will change every time we get the file.
The objective is to also filter out column for which the count is equal to the input file rdd count. Filtering integer columns is already achieved with below logic.
e.g input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
//Get array of structfields
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer"))
//Get the column names
val z = df.select(columns.map(x => col(x.name)): _*)
//Get array of string
val m = z.columns
New Logic be like
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt
I do not want to pass the column name explicitly to the new condition, since the column having count equal to input file will change ( val d = .. above)
How do we write logic for this ?
According to my understanding of your question, your are trying filter in columns with integer as dataType and whose distinct count is not equal to the count of rows in another input parquet file. If my understanding is correct, you can add column count filter in your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("string") && df.select(x.name).distinct().count() != cnt)
Rest of the codes should follow as it is.
I hope the answer is helpful.
Jeanr and Ramesh suggested the right approach and here is what I did to get the desired output, it worked :)
cnt = (inputfiledf.count())
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
var=cat_col[j]
count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd = 0.01).alias(c) for c in
df.columns))
You can use get every different element of each column with
df.stats.freqItems([list with column names], [percentage of frequency (default = 1%)])
This returns you a dataframe with the different values, but if you want a dataframe with just the count distinct of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The part of the count, taken from here: check number of unique values in each column of a matrix in spark