pyspark: drop columns that have same values in all rows - pyspark

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks

You can apply the countDistinct() aggregation function on each column to get count of distinct values per column. Column with count=1 means it has only 1 value in all rows.
# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the cols with count=1 in an array
cols_to_drop = [col for col in df.columns if col_counts[col] == 1 ]
# drop the selected column
df.drop(*cols_to_drop).show()

You can use approx_count_distinct function (link) to count the number of distinct elements in a column. In case there is just one distinct, the remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Couting number of distinct elements and converting it into dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
#Storing those keys in the list which have just 1 distinct key.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having distinct values
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+

Related

Calculate number of columns with missing values per each row in PySpark

Let see we have the following data set
columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0),(2, None, None),(3, None,9)]
df = spark.createDataFrame(values,columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
| 2|null|null|
| 3|null| 9|
+----+----+----+
I would like to calculate number ("miss_nb") and percents ("miss_pt") of columns with missing values per rows and get the following table
+----+-------+-------+
| id|miss_nb|miss_pt|
+----+-------+-------+
| 1| 0| 0.00|
| 2| 2| 0.67|
| 3| 1| 0.33|
+----+-------+-------+
The number of columns should be any (non-fixed list).
How to do it?
Thanks!

How to compare value of one row with all the other rows in PySpark on grouped values

Problem statement
Consider the following data (see code generation at the bottom)
+-----+-----+-------+--------+
|index|group|low_num|high_num|
+-----+-----+-------+--------+
| 0| 1| 1| 1|
| 1| 1| 2| 2|
| 2| 1| 3| 3|
| 3| 2| 1| 3|
+-----+-----+-------+--------+
Then for a given index, I want to count how many times that one indexes high_num is greater than low_num for all low_num in the group.
For instance, consider the second row with index: 1. Index: 1 is in group: 1 and the high_num is 2. high_num on index 1 is greater than the high_num on index 0, equal to low_num, and smaller than the one on index 2. So the high_num of index: 1 is greater than low_num across the group once, so then I want the value in the answer column to say 1.
Dataset with desired output
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
Dataset generation code
from pyspark.sql import SparkSession
spark = (
SparkSession
.builder
.getOrCreate()
)
## Example df
## Note the inclusion of "desired" which is the desired output.
df = spark.createDataFrame(
[
(0, 1, 1, 1, 0),
(1, 1, 2, 2, 1),
(2, 1, 3, 3, 2),
(3, 2, 1, 3, 1)
],
schema=["index", "group", "low_num", "high_num", "desired"]
)
Pseudocode that might have solved the problem
A pseusocode might look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w_spec = Window.partitionBy("group").rowsBetween(
Window.unboundedPreceding, Window.unboundedFollowing)
## F.collect_list_when does not exist
## F.current_col does not exist
## Probably wouldn't work like this anyways
ddf = df.withColumn("Counts",
F.size(F.collect_list_when(
F.current_col("high_number") > F.col("low_number"), 1
).otherwise(None).over(w_spec))
)
You can do a filter on the collect_list, and check its size:
import pyspark.sql.functions as F
df2 = df.withColumn(
'desired',
F.expr('size(filter(collect_list(low_num) over (partition by group), x -> x < high_num))')
)
df2.show()
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+

Improve the efficiency of Spark SQL in repeated calls to groupBy/count. Pivot the outcome

I have a Spark DataFrame consisting of columns of integers. I want to tabulate each column and pivot the outcome by the column names.
In the following toy example, I start with this DataFrame df
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 1| 1| 1| 0| 2|
| 1| 1| 1| 1| 1|
| 2| 2| 2| 3| 3|
| 0| 0| 0| 0| 1|
| 1| 1| 1| 0| 0|
| 3| 3| 3| 2| 2|
| 0| 1| 1| 1| 0|
+---+---+---+---+---+
Each cell can only contain one of {0, 1, 2, 3}. Now I want to tabulate the counts in each column. Ideally, I would have a column for each label (0, 1, 2, 3), and a row for each column. I do:
val output = df.columns.map(cs => df.select(cs).groupBy(cs).count().orderBy(cs).
withColumnRenamed(cs, "severity").
withColumnRenamed("count", "counts").withColumn("window", lit(cs))
)
I get an Array of DataFrames, one for each row of the df. Each of these dataframes has 4 rows (one for each outcome). Then I do:
val longOutput = output.reduce(_ union _) // flatten the array to produce one dataframe
longOutput.show()
to collapse the Array.
+--------+------+------+
|severity|counts|window|
+--------+------+------+
| 0| 2| a|
| 1| 3| a|
| 2| 1| a|
| 3| 1| a|
| 0| 1| b|
| 1| 4| b|
| 2| 1| b|
| 3| 1| b|
...
And finally, I pivot on the original column names
longOutput.cache()
val results = longOutput.groupBy("window").pivot("severity").agg(first("counts"))
results.show()
+------+---+---+---+---+
|window| 0| 1| 2| 3|
+------+---+---+---+---+
| e| 2| 2| 2| 1|
| d| 3| 2| 1| 1|
| c| 1| 4| 1| 1|
| b| 1| 4| 1| 1|
| a| 2| 3| 1| 1|
+------+---+---+---+---+
However the reduction piece took 8 full seconds on the toy example. It ran for over 2 hours on my actual data which had 1000 columns and 400,000 rows before I terminated it. I am running locally on a machine with 12 cores and 128G of RAM. But clearly, what I'm doing is slow on even a small amount of data, so machine size is not in itself the problem. The column groupby/count took only 7 minutes on the full data set. But then I can't do anything with that Array[DataFrame].
I tried several ways of avoiding union. I tried writing out my array to disk, but that failed due to a memory problem after several hours of effort. I also tried to adjust memory allowances on Zeppelin
So I need a way of doing the tabulation that does not give me an Array of DataFrames, but rather a simple data frame.
The problem with your code is that you trigger one spark job per column and then a big union. In general, it's much faster to try and keep everything within the same one.
In your case, instead of dividing the work, you could explode the dataframe to do everything in one pass like this:
df
.select(array(df.columns.map(c => struct(lit(c) as "name", col(c) as "value") ) : _*) as "a")
.select(explode('a))
.select($"col.name" as "name", $"col.value" as "value")
.groupBy("name")
.pivot("value")
.count()
.show()
This first line is the only one that's a bit tricky. It creates an array of tuples where each column name is mapped to its value. Then we explode it (one line per element of the array) and finally compute a basic pivot.

Split large array columns into multiple columns - Pyspark

I have:
+---+-------+-------+
| id| var1| var2|
+---+-------+-------+
| a|[1,2,3]|[1,2,3]|
| b|[2,3,4]|[2,3,4]|
+---+-------+-------+
I want:
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
The solution provided by How to split a list to multiple columns in Pyspark?
df1.select('id', df1.var1[0], df1.var1[1], ...).show()
works, but some of my arrays are very long (max 332).
How can I write this so that it takes account of all length arrays?
This solution will work for your problem, no matter the number of initial columns and the size of your arrays. Moreover, if a column has different array sizes (eg [1,2], [3,4,5]), it will result in the maximum number of columns with null values filling the gap.
from pyspark.sql import functions as F
df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])
columns = df.drop('id').columns
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
>>>
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+

PySpark Aggregation and Group By

I have seen multiple posts but the aggregation is done on multiple columns , but I want the aggregation based on col OPTION_CD, based on the following condition:
If have conditions attached to the dataframe query, which is giving me the error 'DataFrame' object has no attribute '_get_object_id'
IF NULL(STRING AGG(OPTION_CD,'' order by OPTION_CD),'').
What I can understand is that if OPTION_CD col is null then place a blank else append the OPTION_CD in one row separated by a blank.Following is the sample table :
first there is filteration to get only 1 and 2 from COl 1, then the result should be like this :
Following is the query that I am writing on my dataframe
df_result = df.filter((df.COL1 == 1)|(df.COL1 == 2)).select(df.COL1,df.COL2,(when(df.OPTION_CD == "NULL", " ").otherwise(df.groupBy(df.OPTION_CD))).agg(
collect_list(df.OPTION_CD)))
But not getting the desired results. Can anyone help in this? I am using pyspark.
You do not express your question clearly enough but I will make a try to answer it.
You need to understand that a dataframe column can have only one data type for all the rows. If you initial data are integers, then you can not check for string equality with the empty string but rather with Null value.
Also collect list returns an array of integers, so you cannot have [7 , 5] in one row and "'" in another row. In any way does this work for you?
from pyspark.sql.functions import col, collect_list
listOfTuples = [(1, 3, 1),(2, 3, 2),(1, 4, 5),(1, 4, 7),(5, 5, 8),(4, 1, 3),(2,4,None)]
df = spark.createDataFrame(listOfTuples , ["A", "B", "option"])
df.show()
>>>
+---+---+------+
| A| B|option|
+---+---+------+
| 1| 3| 1|
| 2| 3| 2|
| 1| 4| 5|
| 1| 4| 7|
| 5| 5| 8|
| 4| 1| 3|
| 2| 4| null|
+---+---+------+
dfFinal = df.filter((df.A == 1)|(df.A == 2)).groupby(['A','B']).agg(collect_list(df['option']))
dfFinal.show()
>>>
+---+---+--------------------+
| A| B|collect_list(option)|
+---+---+--------------------+
| 1| 3| [1]|
| 1| 4| [5, 7]|
| 2| 3| [2]|
| 2| 4| []|
+---+---+--------------------+