How debug spark dropduplicate and join function calls? - scala

There is some table with duplicated rows. I am trying to reduce duplicates and stay with latest my_date (if there are
rows with same my_date it is no matter which one to use)
val dataFrame = readCsv()
.dropDuplicates("my_id", "my_date")
.withColumn("my_date_int", $"my_date".cast("bigint"))
import org.apache.spark.sql.functions.{min, max, grouping}
val aggregated = dataFrame
.groupBy(dataFrame("my_id").alias("g_my_id"))
.agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame.join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
.drop("g_my_id", "g_my_date_int")
But after this code I when grab distinct my_id I get about 3000 less than in source table. What a reason can be?
how to debug this situation?

After doing drop duplicates do a except of this data frame with the original data frame this should give some insight on the rows which are additionally getting dropped . Most probably there are certain null or empty values for those columns which are being considered duplicates.

Related

How to Compare columns of two tables using Spark?

I am trying to compare two tables() by reading as DataFrames. And for each common column in those tables using concatenation of a primary key say order_id with other columns like order_date, order_name, order_event.
The Scala Code I am using
val primary_key=order_id
for (i <- commonColumnsList){
val column_name = i
val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
//Get those records which aren common in both old/new tables
matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
//Get those records which aren't common in both old/new tables
nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
//Total Null/Non-Null Counts in both old and new tables.
nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
//Put the result for a given column in a Seq variable, later convert it to Dataframe.
tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
(nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code is working fine for a medium set of Data, but as the number of Columns and Records increases in my New & Old Table, the execution time is increasing. Any sort of advice is appreciated.
Thank you in Advance.
You can do the following:
1. Outer join the old and new dataframe on priamary key
joined_df = df_old.join(df_new, primary_key, "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over columns and compare columns using spark functions (.isNull for not matched, == for matched etc)
for (col <- df_new.columns){
val matchCount = df_joined.filter(df_new[col].isNotNull && df_old[col].isNotNull).count()
val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't it might be a good idea so save the joined df to disk in order to avoid a shuffle each time

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows of the may contain the same data or having missing values in the PySpark data frame.
The code that I wrote is very slow and does not work as a distributed system.
Does anyone know how to retain single unique values from duplicate rows in a PySpark Dataframe which can run as a distributed system and with fast processing time?
I have written complete Pyspark code and this code works correctly.
But the processing time is really slow and its not possible to use it on a Spark Cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
print(row_value)
# Match duplicates using std name and create RDD
fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname'] ))
.where(sf.col("stdaddress")== row_value['stdaddress']))
.rdd.map(fill_duplicates))
# Creating feature names for the same RDD
fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
(sf.col("stdaddress")== row_value['stdaddress'])))
.rdd.map(fill_duplicated_columns_extract)).first())
# Creating DF using the previous RDD
# This DF stores value of a single set of matching duplicate rows
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
col_value = ([str(value[column]) for value in
df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
if len(col_value) >= 1:
# non null or empty value of a column store here
# This value is a no duplicate distinct value
col_value = col_value[0]
#print(col_value)
# The non-duplicate distinct value of the column is stored back to
# replace any rows in the PySpark DF that were empty.
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
#print(col_value)
except:
print("None")
'''
There are no error messages but the code is running very slow. I want a solution that fills rows with unique values in PySpark DF that are empty. It can fill the rows with even mode of the value
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
try:
# distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
df_dedup = (df_dedup
.withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
& (sf.col("stdaddress")== row_value['stdaddress'])
,col_value)
.otherwise(df_dedup[column])))
"""

IndexOutOfBoundsException when writing dataframe into CSV

So, I'm trying to read an existing file, save that into a DataFrame, once that's done I make a "union" between that existing DataFrame and a new one I have already created, both have the same columns and share the same schema.
ALSO I CANNOT GIVE SIGNIFICANT NAME TO VARS NOR GIVE ANYMORE DATA BECAUSE OF RESTRICTIONS
val dfExist = spark.read.format("csv").option("header", "true").option("delimiter", ",").schema(schema).load(filePathAggregated3)
val df5 = df4.union(dfExist)
Once that's done I get the "start_ts" (a timestamp on Epoch format) that's duplicate in the union between the above dataframes (df4 and dfExist) and also I get rid of some characters I don't want
val df6 = df5.select($"start_ts").collect()
val df7 = df6.diff(df6.distinct).distinct.mkString.replace("[", "").replace("]", "")
Now I use this "start_ts" duplicate to filter the DataFrame and create 2 new DataFrames selecting the items of this duplicate timestamp, and the items that are not like this duplicate timestamp
val itemsNotDup = df5.filter(!$"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
val items = df5.filter($"start_ts".like(df7)).select($"start_ts",$"avg_value",$"Number_of_val")
And then I save in 2 different lists the avg_value and the Number_of_values
items.map(t => t.getAs[Double]("avg_value")).collect().foreach(saveList => listDataDF += saveList.toString)
items.map(t => t.getAs[Long]("Number_of_val")).collect().foreach(saveList => listDataDF2 += saveList.toString)
Now I make some maths with the values on the lists (THIS IS WHERE I'M GETTING ISSUES)
val newAvg = ((listDataDF(0).toDouble*listDataDF2(0).toDouble) - (listDataDF(1).toDouble*listDataDF2(1).toDouble)) / (listDataDF2(0) + listDataDF2(1)).toInt
val newNumberOfValues = listDataDF2(0).toDouble + listDataDF2(1).toDouble
Then save the duplicate timestamp (df7), the avg and the number of values into a list as a single item, this list transforms into a DataFrame and then I transform I get a new DataFrame with the columns how are supposed to be.
listDataDF3 += df7 + ',' + newAvg.toString + ',' + newNumberOfValues.toString + ','
val listDF = listDataDF3.toDF("value")
val listDF2 = listDF.withColumn("_tmp", split($"value", "\\,")).select(
$"_tmp".getItem(0).as("start_ts"),
$"_tmp".getItem(1).as("avg_value"),
$"_tmp".getItem(2).as("Number_of_val")
).drop("_tmp")
Finally I join the DataFrame without duplicates with the new DataFrame which have the duplicate timestamp and the avg of the duplicate avg values and the sum of number of values.
val finalDF = itemsNotDup.union(listDF2)
finalDF.coalesce(1).write.mode(SaveMode.Overwrite).format("csv").option("header","true").save(filePathAggregated3)
When I run this code in SPARK it gives me the error, I supposed it was related to empty lists (since it's giving me the error when making some maths with the values of the lists) but If I delete the line where I write to CSV, the code runs perfectly, also I saved the lists and values of the math calcs into files and they are not empty.
My supposition, is that, is deleting the file before reading it (because of how spark distribute tasks between workers) and that's why the list is empty therefore I'm getting this error when trying to make maths with those values.
I'm trying to be as clear as possible but I cannot give much more details, nor show any of the output.
So, how can I avoid this error? also I've been only 1 month with scala/spark so any code recommendation will be nice as well.
Thanks beforehand.
This error comes because of the Data. Any of your list does not contains columns as expected. When you refer to that index, the List gives this error to you
It was a problem related to reading files, I made a check (df.rdd.isEmpty) and wether the DF was empty I was getting this error. Made this as an if/else statement to check if the DF is empty, and now it works fine.

Spark scala - Output number of rows after sql insert operation

I have simple question, which I can't implement. Let's say I have following code:
...
val df = sparkSession.sql("INSERT OVERWRITE TABLE $DB_NAME.$TABLE_NAME PARTITION (partition) SELECT test, 1 as partition")
val rowCount = df.rdd.count()
printf(s"$rowCount rows has been inserted")
...
How to get the number of rows inserted by spark sql?
Before and after insert statement take the count of available rows and subtract it after insertion you will get the count if rows inserted.

Best way to gain performance when doing a join count using spark and scala

i have a requirement to validate an ingest operation , bassically, i have two big files within HDFS, one is avro formatted (ingested files), another one is parquet formatted (consolidated file).
Avro file has this schema:
filename, date, count, afield1,afield2,afield3,afield4,afield5,afield6,...afieldN
Parquet file has this schema:
fileName,anotherField1,anotherField1,anotherField2,anotherFiel3,anotherField14,...,anotherFieldN
If i try to load both files in a DataFrame and then try to use a naive join-where, the job in my local machine takes more than 24 hours!, which is unaceptable.
ingestedDF.join(consolidatedDF).where($"filename" === $"fileName").count()
¿Which is the best way to achieve this? ¿dropping colums from the DataFrame before doing the join-where-count? ¿calculating the counts per dataframe and then join and sum?
PD
I was reading about map-side-joint technique but it looks that this technique would work for me if there was a small file able to fit in RAM, but i cant assure that, so, i would like to know which is the prefered way from the community to achieve this.
http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
I would approach this problem by stripping down the data to only the field I'm interested in (filename), making a unique set of the filename with the source it comes from (the origin dataset).
At this point, both intermediate datasets have the same schema, so we can union them and just count. This should be orders of magnitude faster than using a join on the complete data.
// prepare some random dataset
val data1 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.8).map(i => (s"file$i", i, "rubbish"))
val data2 = (1 to 100000).filter(_ => scala.util.Random.nextDouble<0.7).map(i => (s"file$i", i, "crap"))
val df1 = sparkSession.createDataFrame(data1).toDF("filename", "index", "data")
val df2 = sparkSession.createDataFrame(data2).toDF("filename", "index", "data")
// select only the column we are interested in and tag it with the source.
// Lets make it distinct as we are only interested in the unique file count
val df1Filenames = df1.select("filename").withColumn("df", lit("df1")).distinct
val df2Filenames = df2.select("filename").withColumn("df", lit("df2")).distinct
// union both dataframes
val union = df1Filenames.union(df2Filenames).toDF("filename","source")
// let's count the occurrences of filename, by using a groupby operation
val occurrenceCount = union.groupBy("filename").count
// we're interested in the count of those files that appear in both datasets (with a count of 2)
occurrenceCount.filter($"count"===2).count