How to do aggregation on dataframe to get distinct count of column - scala

How do I apply where condition on dataframe ,example I need to groupBy on one column and count the distinct values in the column based on certain where condition.I need to do this where condition for multiple columns
I tried the below way.Please let me know how Can I do this.
case class testRdd(name:String,id:Int,price:Int)
val Cols = testRdd.toDF().groupBy("id").agg( countDistinct("name").when(col("price")>0,1).otherwise(0)
This will not work,or Is there a way to do something like ? Thanks in advance
testRdd.toDF().groupBy("id").agg(if(col("price")>0)countDistinct("name"))

Here is an alternative approach to #Robin's answer, namely introducing an additional boolean column to group
df.groupBy($"id",when($"price">0,true).otherwise(false).as("positive_price"))
.agg(
countDistinct($"name")
)
.where($"positive_price")
.show

testRDD.select("name","id").where($"price">0).distinct.groupBy($"id").agg( count("name")).show

Related

How would I use an ANY condition to filter if any rows of a group have a 0 value?

Say I have this dataframe...
var df = Seq(("Steve",1),("Steve",0),("Michael",3),("Michael",2),("Katherine",4),("Katherine",0),("Devin",0),("Devin",0)).toDF("name","score")
I want to return the unique names where NONE of their scores are equals to zero. So in this case, the only name that would be returned would be Michael, since both of his scores above zero.
Thanks so much!
When you want a condition to apply on several rows, you need to use either groupBy or Window functions
In your case, you can group by column "name", aggregate the list of scores for each name and then filter out all the records where list of score contains 0. Your code would be:
import org.apache.spark.sql.functions.{col, collect_set, array_contains, not}
df.groupBy("name")
.agg(collect_set(col("score")).as("all_scores"))
.filter(not(array_contains(col("all_scores"), 0)))
.select("name")

Dataframe column substring based on the value during join

I have a dataframe with column having values like "COR//xxxxxx-xx-xxxx" or "xxxxxx-xx-xxxx"
I need to compare this column with another column in a different dataframe based on the column value.
If column value have "COR//xxxxx-xx-xxxx", I need to use substring("column", 4, length($"column")
If the column value have "xxxxx-xx-xxxx", I can compare directly without using substring.
For example:
val DF1 = DF2.join(DF3, upper(trim($"column1".substr(4, length($"column1")))) === upper(trim(DF3("column1"))))
I am not sure how to add the condition while joining. Could anyone please let me know how can we achieve this in Spark dataframe?
You can try adding a new column based on the conditions and join on the new column. Something like this.
val data = List("COR//xxxxx-xx-xxxx", "xxxxx-xx-xxxx")
val DF2 = ps.sparkSession.sparkContext.parallelize(data).toDF("column1")
val DF4 = DF2.withColumn("joinCol", when(col("column1").like("%COR%"),
expr("substring(column1, 6, length(column1)-1)")).otherwise(col("column1")) )
DF4.show(false)
The new column will have values like this.
+------------------+-------------+
|column1 |joinCol |
+------------------+-------------+
|COR//xxxxx-xx-xxxx|xxxxx-xx-xxxx|
|xxxxx-xx-xxxx |xxxxx-xx-xxxx|
+------------------+-------------+
You can now join based on the new column added.
val DF1 = DF4.join(DF3, upper(trim(DF4("joinCol"))) === upper(trim(DF3("column1"))))
Hope this helps.
Simply create a new column to use in the join:
DF2.withColumn("column2",
when($"column1" rlike "COR//.*",
$"column1".substr(lit(4), length($"column1")).
otherwise($"column1"))
Then use column2 in the join. It is also possible to add the whole when clause directly in the join but it would look very messy.
Note that to use a constant value in substr you need to use lit. And if you want to remove the whole "COR//" part, use 6 instead of 4.

Create new column based on equality between existing columns

While it seems a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol to a dataframe, the value of which is determined by comparing two existing columns (both String type) of the dataframe, eCol1 and eCol2
something like:
df(nCol) = {
if df(eCol1) == df(eCol2) then 1
else 0
}
I believe it could be done with the help of user-defined functions (UDFs). But isn't there tidier way for such a trivial task?
You need to work with Dataframe DSL when/otherwise, to test equality use ===:
df
.withColumn("newCol", when(df(eCol1) === df(eCol2),1).otherwise(0))

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
var=cat_col[j]
count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd = 0.01).alias(c) for c in
df.columns))
You can use get every different element of each column with
df.stats.freqItems([list with column names], [percentage of frequency (default = 1%)])
This returns you a dataframe with the different values, but if you want a dataframe with just the count distinct of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The part of the count, taken from here: check number of unique values in each column of a matrix in spark

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns, the first one is a cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one records per customer in df2 to df1 with all the features as 0. But as the two dataframe have two different schema, I cannot do a union. What is the best way to do that?
I use a full outer join but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
val features = df1.columns.toSet - "cust_id" // Remove "cust_id" column
val newDF = features.foldLeft(df2)(
(df, colName) => df.withColumn(colName, lit(0))
)
df1.unionAll(newDF)