I am writing code to select the maximum value from a column, excluding two other large values. The maximum I want will always be the 3rd largest value; the two largest values are placeholders, stored as ints in year-month format: 999912 and 999901.
I have tried using max and filter together with no luck.
val maxSurvey = s.max("SurveyMonth").filter(survey("SurveyMonth") =!= "999912" && survey("SurveyMonth") =!= "999901")
I expect the result to be 201902.
You need to filter first and then select the max. Your filter is also wrong: SurveyMonth holds ints, so compare it against ints rather than strings.
After those changes your code will look like:
val maxSurvey = s.filter('SurveyMonth =!= 999912 && 'SurveyMonth =!= 999901).select(max('SurveyMonth))
I need to join two Spark dataframes (df1, df2) based on two conditions:
df1.lob == df2.lob
df1.qtr >= df2.qtr; if there is no match (meaning df1.qtr is smaller than every qtr in df2), then use df2.qtr = 10.
Naively I might try to write this as
df1.join(df2, on=[(df1.lob == df2.lob) & (df1.qtr >= df2.qtr)], how="left")
But I think this will match any qtr in df2 that is less than or equal to df1.qtr. I need it to match the largest df2.qtr that is not bigger than df1.qtr, so it should prioritize an exact match and otherwise fall back to the next biggest one.
I'm also not sure how to handle the case where df1.qtr is smaller than every qtr in df2.
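No answer is recorded here, but one common pattern is a left join on the loose condition followed by a window that keeps only the best candidate per df1 row. Below is a rough sketch, not a definitive implementation: the lob/qtr names and the fallback value 10 come from the question, while _row_id and matched_qtr are made-up helper names.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Tag each df1 row so we can keep exactly one match per row after the join.
df1_id = df1.withColumn("_row_id", F.monotonically_increasing_id())

# Left join keeps df1 rows even when they have no candidate in df2.
joined = df1_id.alias("a").join(
    df2.alias("b"),
    on=[F.col("a.lob") == F.col("b.lob"), F.col("a.qtr") >= F.col("b.qtr")],
    how="left",
)

# Among the candidates for a row, keep the largest b.qtr that is still <= a.qtr.
w = Window.partitionBy("_row_id").orderBy(F.col("b.qtr").desc())

result = (
    joined
    .withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    # No candidate at all -> fall back to qtr = 10, as described above.
    .withColumn("matched_qtr", F.coalesce(F.col("b.qtr"), F.lit(10)))
    .drop("_rn", "_row_id")
)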
I have a dataframe that has a type and a sub-type (broadly speaking).
Say something like:
What I'd like to do is, for each type, sum all values that are smaller than X (say 100 here) and replace them with a single row whose sub-type is "Other".
I.e.
Using a window over(Type), I guess I could make two dataframes (< 100 and >= 100), sum the first one, pick one row and hack it to get the single "Other" row, and union the result with the >= 100 one. But that seems a rather clumsy way to do it?
(Apologies, I don't have access to pyspark right now to write some code.)
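For what it's worth, the split-and-union approach described in the question might look roughly like the sketch below (assuming the frame has just the Type, Sub-Type and Value columns); the answers that follow avoid the explicit split and union.

from pyspark.sql import functions as F

# Rows large enough to keep as-is.
keep = df.filter(F.col("Value") >= 100)

# Rows below the threshold, collapsed into one "Other" row per Type.
other = (
    df.filter(F.col("Value") < 100)
      .groupBy("Type")
      .agg(F.sum("Value").alias("Value"))
      .withColumn("Sub-Type", F.lit("Other"))
)

# unionByName matches columns by name, so the differing column order is fine.
result = keep.unionByName(other)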
The approach I would propose takes into account the need for a key on which to apply an aggregation that is valid for each row; otherwise you would 'lose' the rows with Value >= 100.
The idea is therefore to add a column that identifies the rows to be aggregated and gives every other row a unique key. Afterwards, you have to clean up the result according to the expected output.
Here is what I propose:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = (
    df
    # "Other" for rows to be aggregated, a unique key for everything else
    .withColumn("to_agg",
                F.when(F.col("Value") < 100, "Other")
                 .otherwise(F.concat(F.col("Type"), F.lit("-"), F.col("Sub-Type"))))
    # sum of Value within each (Type, to_agg) group
    .withColumn("sum_other",
                F.sum(F.col("Value")).over(Window.partitionBy("Type", "to_agg")))
    # aggregated rows get the Sub-Type "Other", the rest keep their original one
    .withColumn("Sub-Type",
                F.when(F.col("to_agg") == "Other", F.col("to_agg"))
                 .otherwise(F.col("Sub-Type")))
    .withColumn("Value", F.col("sum_other"))
    .drop("to_agg", "sum_other")
    .dropDuplicates(["Type", "Sub-Type"])
    .orderBy(F.col("Type").asc(), F.col("Value").desc())
)
Note: the groupBy solution is also valid and simpler, but you end up with only the columns used in the statement plus the aggregation result. That is why I prefer a window function here: it lets you keep all the other columns from the original dataset.
You could simply replace Sub-Type with "Other" for all rows with Value < 100 and then group by and sum:
from pyspark.sql import functions as F

(
    df
    .withColumn('Sub-Type', F.when(F.col('Value') < 100, 'Other').otherwise(F.col('Sub-Type')))
    .groupby('Type', 'Sub-Type')
    .agg(
        F.sum('Value').alias('Value')
    )
)
Data
I want to apply a groupby on Column1 and calculate, for each value of Column1, the percentage of passed and failed as well as the count.
Example output I am looking for:
Using pyspark I am running the code below, but I am only getting the percentages:
levels = ["passed", "failed","blocked"]
exprs = [avg((col("Column2") == level).cast("double")*100).alias(level)
for level in levels]
df = sparkSession.read.json(hdfsPath)
result1 = df.select('Column1','Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages and the counts into a single column.
from pyspark.sql.functions import avg, col, count, concat, lit, sum as sum_
levels = ["passed", "failed","blocked"]
# percentage aggregations
pct_exprs = [avg((col("Column2") == level).cast("double")*100).alias('{}_pct'.format(level))
for level in levels]
# count aggregations
count_exprs = [sum_((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
               for level in levels]
# combine all aggregations
exprs = pct_exprs + count_exprs
# string formatting select expressions
select_exprs = [
concat(
col('{}_pct'.format(level)).cast('string'),
lit('('),
col('{}_count'.format(level)).cast('string'),
lit(')')
).alias('{}_viz'.format(level))
for level in levels
]
df = sparkSession.read.json(hdfsPath)
result1 = (
df
.select('Column1','Column2')
.groupBy("Column1")
.agg(*exprs)
.select('Column1', *select_exprs)
)
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.
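If the result is small enough to inspect, handing it over to pandas (as suggested above) is straightforward; a minimal sketch, assuming the result1 frame from the snippet above fits comfortably on the driver:

# Collect the aggregated rows to the driver as a pandas DataFrame for display.
result_pd = result1.toPandas()
print(result_pd.to_string(index=False))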
I'm filtering integer columns from the input parquet file with the logic below, and I have been trying to modify this logic to add an additional validation: check whether any of the input columns has a count equal to the input parquet file's rdd count, and filter out any such column.
Update
The number of columns and their names in the input file are not static; they change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file's rdd count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the array of integer StructFields
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select only the integer columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get their names as an array of strings
val m = z.columns
The new logic would be something like (pseudocode):
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns where the column's count is not equal to cnt
I do not want to pass a column name explicitly to the new condition, since the column whose count equals the input file count will change (val d = ... above).
How do we write the logic for this?
According to my understanding of your question, you are trying to keep the columns with an integer dataType whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME") < cnt)
I need a window function that partitions by some keys (=column names), orders by another column name and returns the rows with top x ranks.
This works fine for ascending order:
def getTopX(df: DataFrame, top_x: String, top_key: String, top_value: String): DataFrame = {
  val top_keys: List[String] = top_key.split(", ").map(_.trim).toList
  val w = Window.partitionBy(top_keys.head, top_keys.tail: _*)
    .orderBy(top_value)
  val rankCondition = "rn < " + top_x
  val dfTop = df.withColumn("rn", row_number().over(w))
    .where(rankCondition).drop("rn")
  dfTop
}
But when I try to change it to orderBy(desc(top_value)) or orderBy(top_value.desc) in line 4, I get a syntax error. What's the correct syntax here?
There are two versions of orderBy, one that works with strings and one that works with Column objects (API). Your code is using the first version, which does not allow for changing the sort order. You need to switch to the column version and then call the desc method, e.g., myCol.desc.
Now, we get into API design territory. The advantage of passing Column parameters is that you have a lot more flexibility, e.g., you can use expressions, etc. If you want to maintain an API that takes in a string as opposed to a Column, you need to convert the string to a column. There are a number of ways to do this and the easiest is to use org.apache.spark.sql.functions.col(myColName).
Putting it all together, we get
.orderBy(org.apache.spark.sql.functions.col(top_value).desc)
For example, if we need to order by a column called Date in descending order in the Window function, put the $ symbol before the column name (this requires import spark.implicits._), which lets us use the asc or desc syntax.
Window.orderBy($"Date".desc)
After specifying the column name in double quotes, append .desc to sort in descending order.
Column col = new Column("ts");
col = col.desc();
WindowSpec w = Window.partitionBy("col1", "col2").orderBy(col);