I have seen multiple posts, but in those the aggregation is done over multiple columns. I want the aggregation based only on the column OPTION_CD, according to the following condition:
IFNULL(STRING_AGG(OPTION_CD, '' ORDER BY OPTION_CD), '')
I also have filter conditions attached to the dataframe query, which give me the error 'DataFrame' object has no attribute '_get_object_id'.
What I understand is that if the OPTION_CD column is null then place a blank, else append the OPTION_CD values into one row separated by a blank. Following is the sample table:
First there is filtering to get only 1 and 2 from COL1, then the result should be like this:
Following is the query that I am writing on my dataframe
df_result = df.filter((df.COL1 == 1)|(df.COL1 == 2)).select(df.COL1,df.COL2,(when(df.OPTION_CD == "NULL", " ").otherwise(df.groupBy(df.OPTION_CD))).agg(
collect_list(df.OPTION_CD)))
But I am not getting the desired results. Can anyone help with this? I am using PySpark.
Your question is not expressed very clearly, but I will try to answer it.
You need to understand that a dataframe column can have only one data type for all the rows. If your initial data are integers, then you cannot check for equality with an empty string; you should check against a null value instead.
Also, collect_list returns an array of integers here, so you cannot have [7, 5] in one row and an empty string in another. In any case, does this work for you?
from pyspark.sql.functions import col, collect_list
listOfTuples = [(1, 3, 1),(2, 3, 2),(1, 4, 5),(1, 4, 7),(5, 5, 8),(4, 1, 3),(2,4,None)]
df = spark.createDataFrame(listOfTuples , ["A", "B", "option"])
df.show()
>>>
+---+---+------+
| A| B|option|
+---+---+------+
| 1| 3| 1|
| 2| 3| 2|
| 1| 4| 5|
| 1| 4| 7|
| 5| 5| 8|
| 4| 1| 3|
| 2| 4| null|
+---+---+------+
dfFinal = df.filter((df.A == 1)|(df.A == 2)).groupby(['A','B']).agg(collect_list(df['option']))
dfFinal.show()
>>>
+---+---+--------------------+
| A| B|collect_list(option)|
+---+---+--------------------+
| 1| 3| [1]|
| 1| 4| [5, 7]|
| 2| 3| [2]|
| 2| 4| []|
+---+---+--------------------+
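If you also want the blank-separated string from the IFNULL(STRING_AGG(...)) expression in the question rather than an array, here is a possible sketch (not part of the original answer): cast the values to string and join the collected list with concat_ws.
from pyspark.sql.functions import collect_list, concat_ws

# concat_ws joins the array elements with the separator; an empty array yields '',
# which mirrors IFNULL(STRING_AGG(OPTION_CD, ''), '')
dfString = (df.filter((df.A == 1) | (df.A == 2))
              .groupby('A', 'B')
              .agg(concat_ws(' ', collect_list(df['option'].cast('string'))).alias('options')))
dfString.show()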
Let's say we have the following data set:
columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0),(2, None, None),(3, None,9)]
df = spark.createDataFrame(values,columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
| 2|null|null|
| 3|null| 9|
+----+----+----+
I would like to calculate the number ("miss_nb") and percentage ("miss_pt") of columns with missing values per row, and get the following table:
+----+-------+-------+
| id|miss_nb|miss_pt|
+----+-------+-------+
| 1| 0| 0.00|
| 2| 2| 0.67|
| 3| 1| 0.33|
+----+-------+-------+
The number of columns should not be fixed (it can be any list of columns).
How can I do this?
Thanks!
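A possible approach (a sketch, not from the original thread): sum the isNull() flags over all non-id columns for each row, then divide by the number of those columns.
from pyspark.sql import functions as F

data_cols = [c for c in df.columns if c != 'id']   # every column except the id
n_cols = len(data_cols)

# miss_nb: number of nulls per row; miss_pt: that number divided by the column count
df_miss = (df
    .withColumn('miss_nb', sum(F.col(c).isNull().cast('int') for c in data_cols))
    .withColumn('miss_pt', F.round(F.col('miss_nb') / F.lit(n_cols), 2))
    .select('id', 'miss_nb', 'miss_pt'))
df_miss.show()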
I am trying to create a sum column by taking the sum of the row values of a set of columns in a dataframe. So I used the following method to do it.
import org.apache.spark.sql.functions.{col, typedLit}

val temp_data = spark.createDataFrame(Seq(
(1, 5),
(2, 4),
(3, 7),
(4, 6)
)).toDF("A", "B")
val cols = List(col("A"), col("B"))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
So this methods works fine and produce the expected output. However, I want to create the cols variable without specifying the column names explicitly. Therefore I've used typedLit as follows.
val cols2 = temp_data.columns.map(x=>typedLit(x)).toList
When I look at cols and cols2, they look identical.
cols: List[org.apache.spark.sql.Column] = List(A, B)
cols2: List[org.apache.spark.sql.Column] = List(A, B)
However, when I use cols2 to create my sum column, it doesn't work the way I expect it to work.
temp_data.withColumn("sum", cols2.reduce(_ + _)).show
+---+---+----+
| A| B| sum|
+---+---+----+
| 1| 5|null|
| 2| 4|null|
| 3| 7|null|
| 4| 6|null|
+---+---+----+
Does anyone have any idea what I'm doing wrong here? Why doesn't the second method work like the first method?
lit or typedLit is not a replacement for Column. What your code does is create a list of string literals - "A" and "B":
temp_data.select(cols2: _*).show
+---+---+
| A| B|
+---+---+
| A| B|
| A| B|
| A| B|
| A| B|
+---+---+
and asks for their sums - hence the result is undefined.
You might use TypedColumn here:
import org.apache.spark.sql.TypedColumn
import spark.implicits._  // provides the Encoder[Int] needed by .as[Int]

val typedSum: TypedColumn[Any, Int] = cols.map(_.as[Int]).reduce {
  (x, y) => (x + y).as[Int]
}
temp_data.withColumn("sum", typedSum).show
but it doesn't provide any practical advantage over standard Column here.
You are trying with typedLit, which is not right, and as the other answer mentioned you don't need to use TypedColumn either. You can simply map over the dataframe's columns to convert them to a list of Column objects.
Change your cols2 statement to the below and try:
val cols = temp_data.columns.map(f=> col(f))
temp_data.withColumn("sum", cols.reduce(_ + _)).show
You will get the output below.
+---+---+---+
| A| B|sum|
+---+---+---+
| 1| 5| 6|
| 2| 4| 6|
| 3| 7| 10|
| 4| 6| 10|
+---+---+---+
Thanks
Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However, the answers to the above question are only for pandas. Is there a solution for a PySpark dataframe?
Thanks
You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count=1 has the same value in all rows.
from pyspark.sql.functions import countDistinct, col

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# collect the columns with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
You can use the approx_count_distinct function (link) to count the number of distinct elements in a column. In case there is just one distinct value, remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting it into a dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
# store the columns which have just 1 distinct value
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having just one distinct value:
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+
I am generating a query string dynamically as follows and passing it to selectExpr().
queryString=''''category_id as cat_id','category_department_id as cat_dpt_id','category_name as cat_name''''
df.selectExpr(queryString)
As per the documentation:
selectExpr(*expr) :
Projects a set of SQL expressions and returns a new DataFrame.
This is a variant of select() that accepts SQL expressions.
The issue is that the variable "queryString" is being treated as a single string instead of three separate column expressions (and rightly so). Following is the error:
: org.apache.spark.sql.catalyst.parser.ParseException:
.........
== SQL ==
'category_id as cat_id', 'category_department_id as cat_dpt_id', 'category_name as cat_name'
------------------------^^^
Is there any way I can pass the dynamically generated "queryString" as an argument to selectExpr()?
If possible, while generating your query string, try to put the individual column expressions in a list right away instead of concatenating them into one string (see the sketch at the end of this answer).
If that is not possible, you can split your query string into separate column expressions which can be passed to selectExpr.
import numpy as np
import pandas as pd

# generate some dummy data
data = pd.DataFrame(np.random.randint(0, 5, size=(5, 3)), columns=list("abc"))
df = spark.createDataFrame(data)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 1| 4|
| 1| 2| 1|
| 3| 3| 2|
| 3| 2| 2|
| 2| 0| 2|
+---+---+---+
# create example query string
query_string="'a as aa','b as bb','c as cc'"
# split and pass
column_expr = query_string.replace("'", "").split(",")
df.selectExpr(column_expr).show()
+---+---+---+
| aa| bb| cc|
+---+---+---+
| 1| 1| 4|
| 1| 2| 1|
| 3| 3| 2|
| 3| 2| 2|
| 2| 0| 2|
+---+---+---+
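For the first option, a minimal sketch on the same dummy dataframe: build the expressions as a list from the start and unpack it into selectExpr.
# build the expressions as a list instead of one concatenated string
exprs = ["a as aa", "b as bb", "c as cc"]
df.selectExpr(*exprs).show()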
I have:
+---+-------+-------+
| id| var1| var2|
+---+-------+-------+
| a|[1,2,3]|[1,2,3]|
| b|[2,3,4]|[2,3,4]|
+---+-------+-------+
I want:
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
The solution provided by How to split a list to multiple columns in Pyspark?
df1.select('id', df1.var1[0], df1.var1[1], ...).show()
works, but some of my arrays are very long (max 332).
How can I write this so that it takes account of all length arrays?
This solution will work for your problem, no matter the number of initial columns and the size of your arrays. Moreover, if a column has arrays of different sizes (e.g. [1,2], [3,4,5]), it will result in the maximum number of columns, with null values filling the gaps (see the illustration after the output below).
from pyspark.sql import functions as F

df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])

# all array columns (everything except the id)
columns = df.drop('id').columns
# size of each array per row, then the maximum size per column
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()
# expand each array column into max_dict[col] individual columns
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
>>>
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
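As an illustration of the null-filling behaviour mentioned above (a hypothetical df2, not from the original answer):
# arrays of unequal size: positions beyond an array's length come back as null
df2 = spark.createDataFrame([('a', [1, 2]), ('b', [3, 4, 5])], ["id", "var1"])
max_size = df2.agg(F.max(F.size('var1'))).collect()[0][0]
df2.select('id', *[df2['var1'][i] for i in range(max_size)]).show()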