How would I use an ANY condition to filter if any rows of a group have a 0 value? - scala

Say I have this dataframe...
var df = Seq(("Steve",1),("Steve",0),("Michael",3),("Michael",2),("Katherine",4),("Katherine",0),("Devin",0),("Devin",0)).toDF("name","score")
I want to return the unique names where NONE of their scores are equal to zero. So in this case, the only name that would be returned would be Michael, since both of his scores are above zero.
Thanks so much!

When you want a condition that applies across several rows, you need to use either groupBy or Window functions.
In your case, you can group by the "name" column, aggregate the set of scores for each name, and then filter out all the records where the set of scores contains 0. Your code would be:
import org.apache.spark.sql.functions.{col, collect_set, array_contains, not}
df.groupBy("name")
.agg(collect_set(col("score")).as("all_scores"))
.filter(not(array_contains(col("all_scores"), 0)))
.select("name")
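If you prefer the Window route mentioned above, here is a minimal sketch of the same filter (assuming the df defined in the question): compute the minimum score per name and keep only the names whose minimum is above zero.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

// Minimum score per name as a window aggregate; a name passes only if its minimum score is > 0.
val w = Window.partitionBy("name")
df.withColumn("min_score", min(col("score")).over(w))
  .filter(col("min_score") > 0)
  .select("name")
  .distinct()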

Related

How to do aggregation on dataframe to get distinct count of column

How do I apply a where condition on a dataframe? For example, I need to groupBy one column and count the distinct values in another column based on a certain where condition, and I need to do this where condition for multiple columns.
I tried the way below. Please let me know how I can do this.
case class testRdd(name:String,id:Int,price:Int)
val Cols = testRdd.toDF().groupBy("id").agg(countDistinct("name").when(col("price")>0,1).otherwise(0))
This will not work. Or is there a way to do something like the following? Thanks in advance
testRdd.toDF().groupBy("id").agg(if(col("price")>0)countDistinct("name"))
Here is an alternative approach to #Robin's answer, namely introducing an additional boolean column to group on:
df.groupBy($"id",when($"price">0,true).otherwise(false).as("positive_price"))
.agg(
countDistinct($"name")
)
.where($"positive_price")
.show
testRDD.select("name","id").where($"price">0).distinct.groupBy($"id").agg( count("name")).show
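A hedged one-pass sketch of the same idea, assuming a DataFrame df with the name/id/price columns above: countDistinct ignores NULLs, so wrapping the name in when(...) counts distinct names only for rows with a positive price.
import org.apache.spark.sql.functions.{col, countDistinct, when}

// Rows with price <= 0 produce NULL inside when(), which countDistinct skips.
df.groupBy("id")
  .agg(countDistinct(when(col("price") > 0, col("name"))).as("distinct_names_positive_price"))
  .show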

How to filter out entries from List[Map[String,String]]?

I want to filter out those entries that have operation_id equal to "0".
val operations_seen_qty = parsed.flatMap(_.lift("operation_id")).toSet.size.toString
parsed is List[Map[String,String]].
How can I do it?
This is my draft, but I think it does the opposite and selects only those entries whose operation_id equals "0":
val operations_seen_qty = parsed.flatMap(_.lift("operation_id")).filter(p=>p.equals("0")).toSet.size.toString
The final objective is to count the number of unique operation_id values that are not equal to "0".
If I understand correctly, you only want to retain those entries whose operation_id is NOT equal to "0". In this case, the function in the filter should be p => !p.equals("0") or p => p != "0".
filter retains the entries that fulfill the predicate; what you did is exactly the opposite.
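A minimal sketch of the corrected version, assuming parsed: List[Map[String, String]] as in the question:
// Keep only operation_id values that are not "0", then count the unique ones.
val operations_seen_qty = parsed
  .flatMap(_.get("operation_id")) // extract operation_id where present
  .filter(_ != "0")               // drop the "0" entries
  .toSet
  .size
  .toString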

Take first n records from dataframe grouped by unique id

I have my Dataset like this
As you can see, it is ordered by rating and userId. I need to get a new DataFrame with only the top 2 results for each unique user_id. I've tried
dataframe.groupBy("user_id").agg(someUdfFunction)
I tried to use the rank function but it does not seem to work, and I tried to filter the dataframe but got no result. How could I accomplish this?
Try:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val window = Window.partitionBy("userId").orderBy($"rating".desc)
dataframe.withColumn("r", row_number.over(window)).where($"r" <= n)
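For the top 2 per user asked about in the question, a hedged end-to-end sketch (assuming the userId and rating columns above, and spark.implicits._ in scope for the $ syntax):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each userId by descending rating, keep the top 2, then drop the helper column.
val window = Window.partitionBy("userId").orderBy($"rating".desc)
val top2 = dataframe
  .withColumn("r", row_number().over(window))
  .where($"r" <= 2)
  .drop("r")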

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The leading * unpacks the generator so each column's expression is passed to agg as a separate argument, so the return value will be 1 row x N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list with column names], [minimum frequency (default = 1%)])
This returns a dataframe with those frequent values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark

How to use orderby() with descending order in Spark window functions?

I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
  val w = Window.partitionBy(top_keys(1), top_keys.drop(1): _*)
    .orderBy(top_value)
  val rankCondition = "rn < " + top_x.toString
  val dfTop = df.withColumn("rn", row_number().over(w))
    .where(rankCondition).drop("rn")
  return dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
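Applied to the window definition from the question, a hedged sketch (keeping the asker's partitionBy call as-is; only the ordering changes):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col

// Convert the column name to a Column and call .desc to sort in descending order.
val w = Window.partitionBy(top_keys(1), top_keys.drop(1): _*)
  .orderBy(col(top_value).desc)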
Say, for example, we need to order by a column called Date in descending order in the Window function; use the $ symbol before the column name, which enables the asc or desc syntax (this requires import spark.implicits._).
Window.orderBy($"Date".desc)
After specifying the column name in double quotes preceded by $, add .desc, which will sort in descending order.
In the Java API:
Column col = new Column("ts");
col = col.desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);