How can I choose which duplicate rows to drop? - scala

I'm trying to merge a new dataset into an old dataset. I have a Seq[String] of primary keys for each table type, and an old dataframe and a new dataframe with the same schema.
If the primary key column values match, I want to replace the row in the old dataframe with the row from the new dataframe; if they don't match, I want to add the new row.
I have this so far:
val finalFrame: DataFrame = oldDF.withColumn("old/new", lit("1"))
  .union(newDF.withColumn("old/new", lit("2")))
  .dropDuplicates(primaryKeySet)
I add a literal column of 1's and 2's to keep track of which rows are which, union the frames together, and drop the duplicates based on the Seq[String] of primary key column names. The problem with this solution is that it doesn't let me specify which duplicates are dropped from the table. If I could specify that the duplicates marked "1" are dropped, that would be optimal, but I'm open to alternate solutions.

Pounded my head on it a little longer and figured out a trick. My primary keys were in a sequence, so they couldn't be passed directly to partitionBy in a window function, so I did this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, row_number}
val windowFunction = Window.partitionBy(primaryKeySet.head, primaryKeySet.tail: _*).orderBy(desc("old/new"))
val duplicateFreeFinalDF = finalFrame.withColumn("rownum", row_number().over(windowFunction)).where("rownum = 1").drop("rownum").drop("old/new")
Essentially I just used vararg expansion so partitionBy would accept my list, then a row_number window function to make sure I keep the most recent copy whenever there is a duplicate.
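For reference, here is a minimal self-contained sketch of the same technique. The sample dataframes, the id key, and the local SparkSession setup are illustrative assumptions, not part of the original question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, lit, row_number}

object MergeByPrimaryKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data: "id" is the only primary key column.
    val oldDF = Seq((1, "stale"), (2, "keep")).toDF("id", "value")
    val newDF = Seq((1, "fresh"), (3, "added")).toDF("id", "value")
    val primaryKeySet = Seq("id")

    // Tag each side, union, and keep the row with the highest tag per key.
    val tagged = oldDF.withColumn("old/new", lit("1"))
      .union(newDF.withColumn("old/new", lit("2")))
    val w = Window.partitionBy(primaryKeySet.head, primaryKeySet.tail: _*).orderBy(desc("old/new"))
    val merged = tagged
      .withColumn("rownum", row_number().over(w))
      .where("rownum = 1")
      .drop("rownum", "old/new")

    merged.show() // expected rows: (1, fresh), (2, keep), (3, added)
  }
}
The new row wins for id 1 because "2" sorts after "1" and the window is ordered descending on the tag column.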

Related

Spark Scala, grabbing the max value of 1 column, but keep all columns

I have a dataframe with 3 columns (customer, associations, timestamp).
I want to grab the latest customer by looking at timestamps.
Attempt
val rdd = readRdd.select(col("value"))
val val_columns = Seq("value.timestamp").map(x => last(col(x)).alias(x))
rdd.orderBy("value.timestamp")
  .groupBy("value.customer")
  .agg(val_columns.head, val_columns.tail: _*)
  .show()
I believe the above code is working, but I'm trying to figure out how to include all columns (i.e. associations). If I understand correctly, adding it into the groupBy would mean I'm grabbing the latest combination of customer and associations combined, but I only want the latest based on the customer column, not on multiple columns together.
Edit:
I might be onto something by adding:
val val_columns = Seq("value.lastRefresh", "value.associations")
  .map(x => last(col(x)).alias(x))
Curious on thoughts.
If you want to return the latest customer data by the timestamp column, you can just order your dataframe by value.timestamp and apply limit(1):
import org.apache.spark.sql.functions._
df.orderBy(desc("value.timestamp")).limit(1).show()
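A small self-contained sketch of that approach follows. The sample data, the nested value struct, and the local SparkSession setup are assumptions made to mirror the question's schema:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, struct}

object LatestByTimestamp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("latest-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical flat data wrapped into a "value" struct to mirror the question's schema.
    val df = Seq(
      ("a", "x", 100L),
      ("a", "y", 200L),
      ("b", "z", 150L)
    ).toDF("customer", "associations", "timestamp")
      .select(struct(col("customer"), col("associations"), col("timestamp")).alias("value"))

    // Latest single record across the whole dataframe, as in the answer above.
    df.orderBy(desc("value.timestamp")).limit(1).show(false)
  }
}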

How to drop duplicates records that have the same value on a specific column and retain the one with the highest timestamp using pyspark

I have tried the code below. The idea of the script is to order the records by id and timestamp, arranging them in descending order by processed_timestamp, but when I run the query it does not arrange the records in descending order by processed_timestamp. It even drops the latest record and retains the older record, which should not be the case.
df2 = df_concat.orderBy("id", "processed_timestamp", f.col("processed_timestamp").desc()).dropDuplicates(["id"])
I also tried the approach below, but when I tried to convert it back to a dataframe, the table schema is now different and the records now reside in a single column, separated by a comma. It also drops the latest record and retains the older record, which should not be the case.
def selectRowByTimeStamp(x, y):
    if x.processed_timestamp > y.processed_timestamp:
        return x
    return y
dataMap = df_concat.rdd.map(lambda x: (x.id, x))
newdata = dataMap.reduceByKey(selectRowByTimeStamp)
I'm not sure if I'm correctly understanding how the above code works.
If it were not for a simple mistake, your code would work as expected.
You should not use the column name "processed_timestamp" twice:
df2 = df_concat.orderBy(
"id", f.col("processed_timestamp").desc()
).dropDuplicates(["id"])
Your original code sorts the DataFrame by processed_timestamp in ascending order because the plain column name, which defaults to an ascending sort, comes before the descending expression.

Scala - How to append a column to a DataFrame preserving the original column name?

I have a basic DataFrame containing all the data, and several derivative DataFrames that I've been creating from the basic DF through groupings, joins, etc.
Every time I want to append a column to the last DataFrame containing the most relevant data I have to do something like this:
val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop($"new_date")
As you may see, I have to change the original column name to new_date_.
But I want the column name to remain the same.
However, if I don't change the name, the column gets dropped. So renaming is just a not-too-pretty workaround.
How can I preserve the original column name when appending the column?
As far as I know you cannot create two columns with the same name in a single DataFrame transformation. I rename the new column back to the old column's name, like this:
val theMostRelevantFinalDf = olderDF
  .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
  .drop($"new_date")
  .withColumnRenamed("new_date_", "new_date")
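For completeness, a self-contained sketch of that pattern. The sample date value and the SparkSession setup are illustrative, not taken from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_utc_timestamp, unix_timestamp}
import org.apache.spark.sql.types.{StringType, TimestampType}

object PreserveColumnName {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rename-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input with a string date column.
    val olderDF = Seq("2021-01-01 12:00:00").toDF("new_date")

    // Derive the transformed column under a temporary name, drop the original,
    // then rename the temporary column back so the original name is preserved.
    val theMostRelevantFinalDf = olderDF
      .withColumn("new_date_", to_utc_timestamp(unix_timestamp(col("new_date")).cast(TimestampType), "UTC").cast(StringType))
      .drop($"new_date")
      .withColumnRenamed("new_date_", "new_date")

    theMostRelevantFinalDf.show(false)
  }
}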

How to remove a column from a dataframe which doesn't have any values (scala)

Problem statement
I have a table called employee from which I am creating a dataframe. There are some columns which don't have any records. I want to remove those columns from the dataframe. I also don't know how many columns of the dataframe have no records in them.
You cannot remove a column from the dataframe in place, AFAIK!
What you can do is make another dataframe from the old dataframe by selecting only the column names that you actually want!
Example:
If the old DF schema is (id, name, badColumn, email), then:
val newDf = oldDF.select("id", "name", "email")
Or there is one more thing you can use: the .drop() function on a dataframe, which takes column names, drops them, and returns a new dataframe.
You can find out about it here: https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.sql.Dataset#drop(col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
I hope this solves your use case!
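Since the question also says the empty columns are not known in advance, here is a hedged sketch of one way to find and drop them, assuming "no record" means every value in the column is null. The sample employee data and the local SparkSession setup are illustrative assumptions:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

object DropEmptyColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("drop-empty-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical employee data where badColumn has no values at all.
    val oldDF = Seq(
      (1, "alice", Option.empty[String], "a@x.com"),
      (2, "bob", Option.empty[String], "b@x.com")
    ).toDF("id", "name", "badColumn", "email")

    // Count the non-null values in every column in a single pass.
    val nonNullCounts = oldDF
      .select(oldDF.columns.map(c => count(when(col(c).isNotNull, true)).alias(c)): _*)
      .head()

    // Columns whose non-null count is zero are considered empty.
    val emptyColumns = oldDF.columns.filter(c => nonNullCounts.getAs[Long](c) == 0L)

    // drop() accepts multiple column names and returns a new dataframe.
    val newDf = oldDF.drop(emptyColumns: _*)
    newDf.show()
  }
}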

Is it inefficient to manually iterate Spark SQL data frames and create column values?

In order to run a few ML algorithms, I need to create extra columns of data. Each of these columns involves some fairly intense calculations that involve keeping moving averages and recording information as you go through each row (and updating it along the way). I've done a mock run with a simple Python script and it works, and I am currently looking to translate it to a Scala Spark script that could be run on a larger data set.
The issue is that, for these to be highly efficient in Spark SQL, it seems preferable to use the built-in syntax and operations (which are SQL-like). Encoding the logic in a SQL expression seems to be a very thought-intensive process, so I'm wondering what the downsides will be if I just manually create the new column values by iterating through each row, keeping track of variables, and inserting the column values at the end.
You can convert the dataframe into an RDD, then use map on it and process each row as you wish. If you only need to add a new column, you can use withColumn; however, that adds one column at a time and its expression applies to the entire dataframe. If you want several computed columns added at once, then inside the map method you can do the following (a consolidated sketch follows these steps):
a. Gather the new values based on your calculations.
b. Append these new column values to the row, as below:
val newColumns: Seq[Any] = Seq(newcol1, newcol2)
Row.fromSeq(row.toSeq ++ newColumns)
Here, row is the current Row inside the map method.
c. Create the schema for the new columns, as below:
val newColumnsStructType = StructType(Seq(StructField("newcolName1", IntegerType), StructField("newColName2", IntegerType)))
d. Add to the old schema
val newSchema = StructType(mainDataFrame.schema ++ newColumnsStructType)
e. Create new dataframe with new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
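Putting steps a through e together, here is a hedged, self-contained sketch. The input data, the derived-column logic, and names such as mainDataFrame, squared, and half are illustrative assumptions rather than code from the answer:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

object AddComputedColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("add-columns-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one integer measurement per row.
    val mainDataFrame = Seq(1, 2, 3, 4).toDF("measurement")

    // a/b. Compute new values per row and append them to the existing row.
    val newRDD = mainDataFrame.rdd.map { row =>
      val measurement = row.getAs[Int]("measurement")
      val squared: Int = measurement * measurement // first derived column
      val half: Double = measurement / 2.0         // second derived column
      Row.fromSeq(row.toSeq ++ Seq(squared, half))
    }

    // c/d. Extend the old schema with fields for the new columns.
    val newColumnsStructType = StructType(Seq(
      StructField("squared", IntegerType),
      StructField("half", DoubleType)
    ))
    val newSchema = StructType(mainDataFrame.schema ++ newColumnsStructType)

    // e. Build the new dataframe from the mapped RDD and the extended schema.
    val newDataFrame = spark.createDataFrame(newRDD, newSchema)
    newDataFrame.show()
  }
}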