Spark: group only part of the rows in a DataFrame - scala

From a given DataFrame, I'd like to group only a few rows together and keep the other rows in the same dataframe.
My current solution is:
val aggregated = mydf.filter(col("check").equalTo("do_aggregate")).groupBy(...).agg()
val finalDF = aggregated.unionByName(mydf.filter(col("check").notEqual("do_aggregate")))
However, I'd like to find a more elegant and performant way.

Use a derived column to group by, depending on the check.
mydf.groupBy(when(col("check").equalTo("do_aggregate"), ...).otherwise(monotonically_increasing_id())).agg(...)
If you have a unique key in the dataframe, use that instead of monotonically_increasing_id.
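For illustration, here is a minimal sketch of that idea, assuming a hypothetical grouping column "key" and a hypothetical value column "amount" to sum; the string prefixes simply keep the aggregated groups and the pass-through rows from ever colliding.
import org.apache.spark.sql.functions._
// rows flagged "do_aggregate" share one group per "key"; every other row
// gets its own group via a unique id, so it passes through unchanged
val grp = when(col("check").equalTo("do_aggregate"), concat(lit("agg_"), col("key")))
  .otherwise(concat(lit("row_"), monotonically_increasing_id().cast("string")))
val finalDF = mydf
  .groupBy(grp.as("grp"))
  .agg(
    first(col("check")).as("check"),
    first(col("key")).as("key"),
    sum(col("amount")).as("amount"))
  .drop("grp")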

Related

Spark Scala, grabbing the max value of 1 column, but keep all columns

I have a dataframe with 3 columns (customer, associations, timestamp).
I want to grab the latest customer by looking at timestamps.
Attempt
val rdd = readRdd.select(col("value"))
val val_columns = Seq("value.timestamp").map(x => last(col(x)).alias(x))
rdd.orderBy("value.timestamp")
  .groupBy("value.customer")
  .agg(val_columns.head, val_columns.tail: _*)
  .show()
I believe the above code is working, but I'm trying to figure out how to include all columns (i.e. associations). If I understand correctly, adding associations into the groupBy would mean I'm grabbing the latest combination of customer and associations combined, but I only want to grab the latest row per customer and not look at multiple columns together.
Edit:
I might be onto something by adding:
val val_columns = Seq("value.lastRefresh", "value.associations")
.map(x => last(col(x)).alias(x))
Curious on thoughts.
If you want to return the latest customer data by the timestamp column, you can just order your dataframe by value.timestamp and apply limit(1):
import org.apache.spark.sql.functions._
df.orderBy(desc("value.timestamp")).limit(1).show()
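As a quick self-contained check (this toy dataframe and its nested value struct are only stand-ins for the customer/associations/timestamp data described in the question), limit(1) keeps every column of the winning row:
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell
import org.apache.spark.sql.functions.{col, desc, struct}
val df = Seq(
  ("cust1", "assocA", 100L),
  ("cust1", "assocB", 200L),
  ("cust2", "assocC", 150L)
).toDF("customer", "associations", "timestamp")
  .select(struct(col("customer"), col("associations"), col("timestamp")).as("value"))
// the single most recent row, with all of its columns intact
df.orderBy(desc("value.timestamp")).limit(1).show(false)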

Compare two dataframes row count. Assign dataframe with high row count to a new dataframe object

I have two physical nodes that are not synchronised.
Both nodes produce captured data. (The two-node setup was put in place for resilience.)
I am facing the following challenge:
the nodes produce two nearly identical files (the timestamps may differ, and there is no unique identifier that could be used to remove duplicates). Both dataframes share the same schema.
Is there a way, using PySpark, to write something like:
df3 = case
  when df1.count() < df2.count() then df2
  when df1.count() > df2.count() then df1
  else df1
Resolved the case by defining a "compare" function:
def compare(df1, df2):
    # return whichever dataframe has the higher row count (df1 on a tie)
    if df1.count() > df2.count():
        return df1
    if df1.count() < df2.count():
        return df2
    else:
        return df1
It seems that working with dataframes as objects can be achieved via functions.

save dropped duplicates in pyspark RDD

From here, Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame, we learned how to drop duplicated observations based on some specific variables. What if I want to save those duplicate observations in the form of an RDD, how shall I do that? I guess rdd.subtract() may not be efficient if the RDD contains billions of observations. So besides using rdd.subtract(), is there any other way to do this?
If you need both datasets, one having only the distinct values and the other having the duplicates, you should use subtract; that will provide an accurate result. In case you need only the duplicates, you can use SQL to get them.
df.createOrReplaceTempView('mydf')
df2 = spark.sql("""
    select * from (
        select *,
               row_number() over (partition by <<list of columns used to identify duplicates>>
                                  order by <<any column/s not used to identify duplicates>>) as row_num
        from mydf
    ) t
    where row_num > 1
""").drop('row_num')

How to pass a group of RelationalGroupedDataset to a function?

I am reading a CSV as a DataFrame as shown below:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("D:/ModelData.csv")
Then I group by three columns as below, which returns a RelationalGroupedDataset:
df.groupBy("col1", "col2","col3")
And I want each grouped dataframe to be sent through the function below:
def ModelFunction(daf: DataFrame) = {
//do some calculation
}
For example, if col1 has 2 unique values (0, 1), col2 has 2 unique values (1, 2) and col3 has 3 unique values (1, 2, 3), then I would like to pass each combination's grouping to the model function. For col1=0, col2=1, col3=1 I will have a dataframe, and I want to pass that to ModelFunction, and so on for each combination of the three columns.
I tried
df.groupBy("col1", "col2","col3").ModelFunction();
But it throws an error.
Any help is appreciated.
The short answer is that you cannot do that. You can only apply aggregate functions to a RelationalGroupedDataset (either ones you write as a UDAF or the built-in ones in org.apache.spark.sql.functions).
The way I see it you have several options:
Option 1: The amount of data for each unique combination is small enough and not skewed too much compared to other combinations.
In this case you can do:
val grouped = df.groupBy("col1", "col2", "col3").agg(collect_list(struct(all other columns)))
grouped.as[some case class to represent the data including the combination].map(your own logistic regression function)
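For illustration, a concrete version of option 1. The featureA/featureB columns are hypothetical stand-ins for the real data columns of ModelData.csv (loaded as strings, since no schema was supplied), and the body of the map is only a placeholder for the actual model:
import spark.implicits._  // assumes a SparkSession named spark
import org.apache.spark.sql.functions.{collect_list, struct}
// hypothetical shape of one collected group
case class Features(featureA: String, featureB: String)
case class ModelInput(col1: String, col2: String, col3: String, rows: Seq[Features])
val grouped = df
  .groupBy("col1", "col2", "col3")
  .agg(collect_list(struct($"featureA", $"featureB")).as("rows"))
  .as[ModelInput]
val results = grouped.map { input =>
  // the per-group model logic goes here; returning the key plus the row
  // count is just a stand-in
  (input.col1, input.col2, input.col3, input.rows.size)
}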
Option 2: If the total number of combinations is small enough you can do:
val values = df.select("col1", "col2", "col3").distinct().collect()
and then loop through them creating a new dataframe from each combination by doing a filter.
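For example, a sketch of option 2 that reuses the ModelFunction signature from the question:
import org.apache.spark.sql.functions.col
val combinations = df.select("col1", "col2", "col3").distinct().collect()
combinations.foreach { row =>
  // one small dataframe per (col1, col2, col3) combination
  val subset = df.filter(
    col("col1") === row.get(0) &&
    col("col2") === row.get(1) &&
    col("col3") === row.get(2))
  ModelFunction(subset)
}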
Option 3: Write your own UDAF
This would probably not be good enough, as the data comes in a stream without the ability to iterate over it; however, if you have an implementation of logistic regression that fits this model, you can try to write a UDAF to do it. See for example: How to define and use a User-Defined Aggregate Function in Spark SQL?

Add list as column to Dataframe in pyspark

I have a list of integers and a sqlcontext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe maintaining the order. I feel like this should be really simple but I can't find an elegant solution.
You cannot simply add a list as a dataframe column, since a list is a local object and a dataframe is distributed. You can try one of the following approaches:
convert the dataframe to a local collection with collect() or toLocalIterator() and, for each row, add the corresponding value from the list, OR
convert the list to a dataframe with an extra column (holding keys from the original dataframe) and then join the two.