PySpark group by collect_list over a window - pyspark

I have a data frame with multiple columns. I'm trying to aggregate a few columns using collect_list grouped on id, over a window function. I'm trying something like this:
exprs = [(collect_list(x).over(window)).alias(f"{x}_list") for x in cols]
df = df.groupBy('id').agg(*exprs)
I'm getting the below error:
expression is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get
If I do the same for a single column instead of multiple columns, it works.

I found a way around this. I guess window functions won't work inside groupBy(...).agg(*exprs). So I modified the above to:
for col_name in cols:
    df = df.withColumn(col_name + "_list", collect_list(col(col_name)).over(window_spec))
This served my purpose.
Thank you.
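To illustrate why the withColumn form behaves differently from a groupBy, here is a plain-Python sketch (not Spark code) of what collect_list(...).over(window) computes. It assumes the window is Window.partitionBy('id') with no ordering, so each row sees its whole partition; the sample rows and the helper name are made up for illustration.

```python
from collections import defaultdict


def collect_list_over_partition(rows, key, column):
    """Model of collect_list(column).over(Window.partitionBy(key)).

    Every input row is kept and gains a '<column>_list' field holding
    all values of `column` from the rows that share the same key.
    """
    values_by_key = defaultdict(list)
    for row in rows:
        values_by_key[row[key]].append(row[column])
    return [
        {**row, f"{column}_list": list(values_by_key[row[key]])}
        for row in rows
    ]


# Hypothetical sample data: two rows for id 1, one row for id 2.
rows = [
    {"id": 1, "a": "x"},
    {"id": 1, "a": "y"},
    {"id": 2, "a": "z"},
]
result = collect_list_over_partition(rows, "id", "a")
```

The result has as many rows as the input, each carrying its partition's list, whereas groupBy('id').agg(collect_list('a')) would return one row per id.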

Related

PySpark: finding the mean of a variable excluding the top 1 percentile of data

I have a dataset that is grouped by multiple variables, and for each group we compute aggregates such as mean, standard deviation, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like
df_final=df.groupby(groupbyElement).agg(mean('value').alias('Mean'),stddev('value').alias('Stddev'),expr('percentile(value, array(0.99))')[0].alias('99_percentile'),mean(when(col('value')<=col('99_percentile'),col('value'))))
But it seems Spark cannot use an aggregate's alias that is defined in the same agg statement.
I even tried this:
df_final=df.groupby(groupbyElement).agg(mean('value').alias('Mean'),stddev('value').alias('Stddev'),mean(when(col('value')<=expr('percentile(value, array(0.99))')[0],col('value'))))
But it throws below error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone will be able to answer this.
Update:
I tried doing it the other way
Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
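from pyspark.sql.functions import col, expr, mean, stddev, when

df_final = (
    # First aggregation: one 99th-percentile threshold per group, joined back onto every row.
    df.join(
        df.groupby(groupbyElement)
          .agg(expr('percentile(value, array(0.99))')[0].alias('99_percentile')),
        on=groupbyElement,  # the variable itself (a column name or list of names), not the string "groupbyElement"
        how="left"
    )
    # Second aggregation: the usual statistics plus the mean of values at or below the threshold.
    .groupby(groupbyElement)
    .agg(mean('value').alias('Mean'),
         stddev('value').alias('Stddev'),
         mean(when(col('value') <= col('99_percentile'), col('value'))))
)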
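The two-pass logic of that answer can be sketched in plain Python (not Spark code): first compute one threshold per group, then average only the values at or below it. The percentile helper assumes linear interpolation between neighbouring ranks, which is how I understand Spark's exact percentile to work; treat that, the helper names, and the sample data as assumptions.

```python
from collections import defaultdict
from statistics import mean


def percentile(values, p):
    """Percentile with linear interpolation between neighbouring ranks (assumed semantics)."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def trimmed_mean_by_group(rows, p=0.99):
    """rows: iterable of (group, value) pairs.

    Returns {group: mean of the group's values that are <= its p-th percentile}.
    """
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    # Pass 1: one threshold per group (the inner aggregate).
    thresholds = {g: percentile(v, p) for g, v in groups.items()}
    # Pass 2: mean of the values at or below the threshold (the outer aggregate).
    return {
        g: mean([x for x in v if x <= thresholds[g]])
        for g, v in groups.items()
    }


# Hypothetical data: group "a" holds 1..100, group "b" holds two values.
rows = [("a", v) for v in range(1, 101)] + [("b", 5), ("b", 7)]
trimmed = trimmed_mean_by_group(rows)
```

Because the threshold has to exist before the filtered mean can be computed, the two aggregates cannot be nested in a single agg call, which is what the AnalysisException is pointing out.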

save dropped duplicates in pyspark RDD

From "Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame", we learned how to drop duplicated observations based on some specific variables. What if I want to save those duplicate observations as an RDD? How should I do that? I guess rdd.subtract() may not be efficient if the RDD contains billions of observations. So besides rdd.subtract(), is there any other way I can use?
If you need both the datasets, one having only the distinct values and the other having the duplicates, you should use subtract. That will provide an accurate result. In case you need only the duplicates, you can use sql to get that.
df.createOrReplaceTempView('mydf')
df2 = spark.sql("select * from (select *, row_number() over(partition by <<list of columns used to identify duplicates>> order by <<any column/s not used to identify duplicates>>) as row_num from mydf) t where row_num > 1").drop('row_num')
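What the row_num > 1 filter selects can be sketched in plain Python (not Spark code): within each group of rows sharing the key columns, the first row is the one that would be kept, and every later row is a duplicate. Here input order stands in for the ORDER BY clause; the helper name and sample rows are hypothetical.

```python
def split_duplicates(rows, key_columns):
    """Split rows into (first_occurrences, duplicates) based on key_columns."""
    seen = set()
    firsts, duplicates = [], []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            duplicates.append(row)  # corresponds to row_num > 1
        else:
            seen.add(key)
            firsts.append(row)      # corresponds to row_num == 1
    return firsts, duplicates


# Hypothetical sample: "name" identifies duplicates, "ts" does not.
rows = [
    {"name": "a", "ts": 1},
    {"name": "b", "ts": 2},
    {"name": "a", "ts": 3},
    {"name": "a", "ts": 4},
]
firsts, duplicates = split_duplicates(rows, ["name"])
```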

Comparing columns in two data frame in spark

I have two dataframes with different numbers of columns.
I need to compare three fields between them to check whether they are equal.
I tried the following approach, but it's not working.
if(df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) || df_table_stats("hashcount").equals(df_aud("HASH_CNT")) || round(df_table_stats("hashsum"),0).equals(round(df_aud("HASH_TTL"),0)))
{
println("Job executed succefully")
}
df_table_stats("rec_cnt") returns a Column rather than the actual value, hence the condition evaluates to false.
Also, please explain the difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.
Use SQL and inner join both dataframes with your conditions.
Per my comment, the expressions you're using are simple column references; they don't actually return data. Assuming you MUST use Spark for this, you'd want a method that actually returns the data, known in Spark as an action. For this case you can use take to return the first Row of data and extract the desired columns:
val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
//repeat for the other values you need to capture
However, Spark definitely is overkill if this is all you're using it for. You could use a simple JDBC library for Scala like ScalikeJDBC to do these queries and capture the primitives in the results.

How to pass a group of RelationalGroupedDataset to a function?

I am reading a csv as a Data Frame by below:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("D:/ModelData.csv")
Then I group by three columns as below which returns a RelationalGroupedDataset
df.groupBy("col1", "col2","col3")
And I want each grouped data frame to be send through the below function
def ModelFunction(daf: DataFrame) = {
//do some calculation
}
For example, if col1 has 2 unique values (0,1), col2 has 2 unique values (1,2), and col3 has 3 unique values (1,2,3), then I would like to pass each combination's group to the model function. For col1=0, col2=1, col3=1 I will have a dataframe that I want to pass to ModelFunction, and so on for each combination of the three columns.
I tried
df.groupBy("col1", "col2","col3").ModelFunction();
But it throws an error.
Any help is appreciated.
The short answer is that you cannot do that. You can only apply aggregate functions to a RelationalGroupedDataset (either ones you write as a UDAF or the built-in ones in org.apache.spark.sql.functions).
The way I see it you have several options:
Option 1: The amount of data for each unique combination is small enough and not skewed too much compared to other combinations.
In this case you can do:
val grouped = df.groupBy("col1", "col2","col3").agg(collect_list(struct(all other columns)))
grouped.as[some case class to represent the data including the combination].map(your own logistic regression function)
Option 2: If the total number of combinations is small enough you can do:
val values = df.select("col1", "col2", "col3").distinct().collect()
and then loop through them creating a new dataframe from each combination by doing a filter.
Option 3: Write your own UDAF
This would probably not be good enough, as the data comes in a stream without the ability to iterate over it. However, if you have an implementation of logistic regression that fits this model, you can try to write a UDAF to do this. See for example: How to define and use a User-Defined Aggregate Function in Spark SQL?
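Option 2 above can be sketched in plain Python (not Spark code): collect the distinct key combinations, then filter the data once per combination and hand each subset to the function. The helper name, the sample rows, and the stand-in model function are hypothetical.

```python
def run_per_combination(rows, key_columns, model_function):
    """Apply model_function to the subset of rows for each distinct key combination."""
    def key_of(row):
        return tuple(row[c] for c in key_columns)

    # Equivalent of select(...).distinct().collect(): distinct keys, in first-seen order.
    combinations = list(dict.fromkeys(key_of(row) for row in rows))

    results = {}
    for combination in combinations:
        # Equivalent of df.filter(...): one full pass over the data per combination.
        subset = [row for row in rows if key_of(row) == combination]
        results[combination] = model_function(subset)
    return results


# Hypothetical sample data and a stand-in "model" that just counts rows.
rows = [
    {"col1": 0, "col2": 1, "col3": 1, "x": 10},
    {"col1": 0, "col2": 1, "col3": 1, "x": 20},
    {"col1": 1, "col2": 2, "col3": 3, "x": 30},
]
counts = run_per_combination(rows, ["col1", "col2", "col3"], len)
```

Each combination costs one full scan of the data, which is why this approach only makes sense when the number of combinations is small.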

Append a column to Data Frame in Apache Spark 1.3

Is it possible to add a column to a Data Frame, and what would be the most efficient, neat way to do it?
More specifically, the column may serve as row IDs for the existing Data Frame.
In a simplified case, reading from a file and not tokenizing it, I can think of something like the code below (in Scala), but it fails with errors (at line 3), and in any case doesn't look like the best possible route:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence from 1 to numRows) to any given data frame, so that row order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sc.textFile(file).
zipWithIndex().
map { case (d, i) => i.toString + delimiter + d }.
map(_.split(delimiter)).
map(s => Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in the Spark API are withColumn and withColumnRenamed. According to the Scala docs, the former "Returns a new DataFrame by adding a column". In my opinion, this is a somewhat confusing and incomplete definition. Both of these functions can only operate on columns of the data frame they are called on, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing data frame into the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone or from other data frames).
As was commented above, a workaround may be to use a join. This would be pretty messy, although possible: attaching unique keys with zipWithIndex, as above, to both data frames or columns might work. Although efficiency is ...
It's clear that appending a column to a data frame is not an easy operation in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need a value that is not related to the existing columns of the dataframe.
This is similar to @NehaM's answer but simpler.
I took help from the above answer. However, I find it incomplete if we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns tuples of (Row, Long), each containing a row and its corresponding index. We can use these to create new Rows according to our needs.
val rdd = df.rdd.zipWithIndex()
.map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure ).show
I hope this will be helpful.
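The shape of that zipWithIndex transformation can be sketched in plain Python (not Spark code): each row gets its zero-based index prepended as a string, and the schema gains a matching leading field. The helper name and sample data are hypothetical.

```python
def add_row_number(rows, schema):
    """Prepend a string row index to every row and a 'Row number' field to the schema."""
    new_schema = ["Row number"] + list(schema)
    # enumerate plays the role of zipWithIndex: indices start at 0.
    new_rows = [[str(index)] + list(row) for index, row in enumerate(rows)]
    return new_rows, new_schema


# Hypothetical two-column data set.
schema = ["name", "age"]
rows = [("alice", 30), ("bob", 25)]
numbered_rows, numbered_schema = add_row_number(rows, schema)
```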
You can use row_number with a Window function, as below, to get a distinct ID for each row in a dataframe.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id for the same purpose; the generated IDs are unique and increasing, but not guaranteed to be consecutive:
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.