Spark: replace all values smaller than X by their sum - pyspark

I have a dataframe that has a type and a sub-type (broadly speaking).
Say something like:
What I'd like to do is, for each type, sum all values that are smaller than X (say 100 here) and replace them with a single row whose sub-type is "Other".
I.e.
Using a window over (Type), I guess I could split into two dfs (< 100, >= 100), sum the first one, pick one row and hack it into the single "Other" row, then union the result with the >= 100 one. But that seems a rather clumsy way to do it?
(Apologies, I don't have access to pyspark right now to write some code.)

The approach I would propose takes into account the need for a key that makes the aggregation valid for every row; otherwise you would 'lose' the rows with Value >= 100.
The idea is therefore to add a column that identifies the rows to be aggregated and gives every other row a unique key. Afterwards, you have to clean up the result to match the expected output.
Here is what I propose:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df \
    .withColumn("to_agg",
                F.when(F.col("Value") < 100, "Other")
                .otherwise(F.concat(F.col("Type"), F.lit("-"), F.col("Sub-Type")))
                ) \
    .withColumn("sum_other",
                F.sum(F.col("Value")).over(Window.partitionBy("Type", "to_agg"))) \
    .withColumn("Sub-Type",
                F.when(F.col("to_agg") == "Other", F.col("to_agg"))
                .otherwise(F.col("Sub-Type"))) \
    .withColumn("Value", F.col("sum_other")) \
    .drop("to_agg", "sum_other") \
    .dropDuplicates(["Type", "Sub-Type"]) \
    .orderBy(F.col("Type").asc(), F.col("Value").desc())
Note: the solution using a groupBy is also valid and is simpler, but then you only keep the columns used in the statement plus the aggregation result. That's the reason I prefer a window function here: it lets you keep all the other columns from the original dataset.

You could simply replace Sub-Type with "Other" for all rows with Value < 100 and then group by and sum:
from pyspark.sql import functions as F

(
    df
    .withColumn('Sub-Type', F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    .groupby('Type', 'Sub-Type')
    .agg(
        F.sum('Value').alias('Value')
    )
)
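As a usage sketch (assuming the expression above is assigned to a variable, hypothetically named result), you could then order the output the same way as the window answer above:

result = result.orderBy(F.col('Type').asc(), F.col('Value').desc())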

Related

Randomly replace RDD values with null in Scala Spark

I have a CSV file containing almost 15000 records. Each line contains 3 fields separated by a tab (\t). I actually want to randomly replace the second column's value with null: maybe keep 8000 values as they are and set 7000 to null.
Any help with Scala (Spark), please?
This is what it looks like:
read the data as a dataframe
generate a new column, say rnd, which is a random number between 0 and 1
set col2 = col2 when rnd < 0.5 (if you want to make ~50% of the values null), else null
import org.apache.spark.sql.functions.{lit, rand, when}
import spark.implicits._

spark.read.option("header", "true").option("sep", "\t").csv(<your_path>)
  .withColumn("rnd", rand())  // uniform random number in [0, 1)
  .withColumn("col2", when($"rnd" < 0.5, $"col2").otherwise(lit(null).cast(<col2_datatype_here>)))
  // optionally .drop("rnd") once col2 has been nulled out
@amelie, notice the $ in front of "rnd" in the when condition in my answer.
You were supposed to do a column comparison, not a value comparison.
PS: Not able to post a comment as I am a Stack Overflow newbie, hence a separate answer.

How to use Max function with Where conditions

I am writing code to select the maximum value from a column, excluding two other large values. The maximum I want will always be the 3rd-largest value. The two largest values are placeholders, ints in year-month format: 999912 and 999901.
I have tried using Max and Filter together with no luck.
val maxSurvey = s.max("SurveyMonth").filter(survey("SurveyMonth") =!= "999912" && survey("SurveyMonth") =!= "999901")
I expect the current result to be 201902.
You need to select the max, but your code is wrong in the filter too: if SurveyMonth is an int, why compare it with a String?
After the changes, your code will look like:
val maxSurvey = s.filter('SurveyMonth =!= 999912 && 'SurveyMonth =!= 999901).select(max('SurveyMonth))
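If you need the result as a plain value rather than a one-row dataframe, something like this should work (a sketch, assuming SurveyMonth is an integer column):

val maxMonth = maxSurvey.first().getInt(0)  // expected to be 201902 here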

Variable substitution in scala

I have two dataframes in Scala, both holding data from two different tables but with the same structure (srcdataframe and tgttable). I have to join these two on a composite primary key, select a few columns, and append two columns; the code for this is below:
for (i <- 2 until numCols) {
  srcdataframe.as("A")
    .join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
      $"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
    .filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
    .select($"A.INSTANCE_ID",
            $"A.CONTRACT_LINE_ID",
            "$" + "\"" + "A." + srcColnm(i) + "\"" + "," + "$" + "\"" + "B." + srcColnm(i) + "\"")
    .withColumn("MisMatchedCol", lit("\"" + srcColnm(i) + "\""))
    .withColumn("LastRunDate", current_timestamp.cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch");
  hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t")
}
Here is what I am trying to do:
Inner join of srcdataframe and tgttable based on instance_id and contract_line_id.
Select only instance_id, contract_line_id, the mismatched column values, the hardcoded mismatched column name, and a timestamp.
srcColnm is an array of strings which contains the non-primary-key columns to be compared.
However, I am not able to resolve the variables inside the dataframe operations in the for loop. I tried looking for solutions here and here. I learned that it may be because Spark substitutes the variables only at compile time; in this case I'm not sure how to resolve it.
Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside of the for loop, as it's an expensive operation. Here is slightly changed code; the main difference that solves your problem is in the select:
val df = srcdataframe.as("A")
  .join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")

for (columnName <- srcColnm) {
  df.filter(col("A." + columnName) =!= col("B." + columnName))
    .select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
    .withColumn("MisMatchedCol", lit(columnName))
    .withColumn("LastRunDate", current_timestamp().cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch")
  // Hive insert command goes here
}
Regarding the problem in select:
$ is shorthand for the col() function; it selects a column in the dataframe by name. The problem in the select is that the first two arguments, col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID"), are Columns ($ replaced by col() here for clarity).
However, the next two arguments are strings. It is not possible to mix the two: either all arguments should be Columns or all should be strings. Since you used "A." + srcColnm(i) to build up the column name, $ can't be used; however, you could have used col("A." + srcColnm(i)).
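For example, keeping the aliased join from the question, a select built entirely from Column objects would look something like this (a sketch, with srcColnm(i) as in the question):

import org.apache.spark.sql.functions.col

srcdataframe.as("A")
  .join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
    $"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
  .select(col("A.INSTANCE_ID"), col("A.CONTRACT_LINE_ID"),
          col("A." + srcColnm(i)), col("B." + srcColnm(i)))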

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this but, as stated above, distinct element counting is expensive. The single * unpacks the generator so that each column's expression is passed as a separate argument to agg, and the return value will be 1 row x N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct

distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
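As a usage sketch of the .toPandas() step mentioned above:

pdf = distvals.toPandas()  # a single row with one approximate distinct count per column
print(pdf.T)               # transpose for easier reading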
You can get the frequently occurring items of each column with
df.stat.freqItems([list of column names], [minimum frequency, default = 1%])
This returns a dataframe with those values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark

How to use orderby() with descending order in Spark window functions?

I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*)
    .orderBy(top_value)
  val rankCondition = "rn < " + top_x
  val dfTop = df.withColumn("rn", row_number().over(w))
    .where(rankCondition).drop("rn")
  dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
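Applied to the function from the question, only the orderBy line changes (a sketch, assuming the rest stays as posted):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*)
    .orderBy(col(top_value).desc)  // descending instead of the default ascending
  df.withColumn("rn", row_number().over(w))
    .where("rn < " + top_x)
    .drop("rn")
}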
For example, if we need to order by a column called Date in descending order in the window function, we can use the $ symbol before the column name, which lets us use the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, append .desc, which sorts in descending order.
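Note that the $"..." shorthand needs the session implicits in scope (a small sketch, assuming a SparkSession named spark):

import org.apache.spark.sql.expressions.Window
import spark.implicits._  // enables the $"Date" column syntax

val w = Window.orderBy($"Date".desc)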
Column col = new Column("ts");
col = col.desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);