Get the number of unique values in a PySpark column

I have a PySpark dataframe with a column URL in it. All I want to know is how many distinct values are there. I just need the number of total distinct values.
I have tried the following
df.select("URL").distinct().show()
This gives me the list of all unique values, but I only want to know how many there are overall. I want something like: col(URL) has x distinct values.

Use distinct().count() to get the count of distinct values.
df.select("URL").distinct().count()
Example:
#sample data
df=spark.createDataFrame([(1,),(2,),(1,)],['id'])
df.show()
#+---+
#| id|
#+---+
#| 1|
#| 2|
#| 1|
#+---+
#to list out the distinct values (show() prints up to 20 rows by default)
df.select('id').distinct().show()
#+---+
#| id|
#+---+
#| 1|
#| 2|
#+---+
#to get count of distinct values
df.select('id').distinct().count()
#2
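If you only need the number and not the distinct rows themselves, an aggregation also works; a minimal sketch with countDistinct() (and approx_count_distinct() as a faster approximation for very large data), reusing the sample dataframe above:
from pyspark.sql.functions import countDistinct, approx_count_distinct
df.agg(countDistinct('id').alias('distinct_ids')).show()
#+------------+
#|distinct_ids|
#+------------+
#|           2|
#+------------+
df.agg(approx_count_distinct('id')).first()[0]
#2 (approximate in general, exact for such a small sample)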

Related

Index with groupby PySpark

I'm trying to translate the pandas code below to PySpark, but I'm having trouble with these two points:
Is there an index in a Spark DataFrame?
How can I group on level=0 like that?
I didn't find anything good in the documentation. If you have a hint, I'll be really grateful!
df.set_index('var1', inplace=True)
df['varGrouped'] = df.groupby(level=0)['var2'].min()
df.reset_index(inplace=True)
pandas_df.groupby(level=0) groups pandas_df by the first index field (in the case of multi-index data). Since there is only one index field in the provided code, your code is a simple group by the var1 field. The same can be replicated in PySpark with a groupBy() on var1 and taking the min of var2 (a groupBy-and-join sketch is shown at the end of this answer).
However, the aggregation result is stored in a new column within the same dataframe, so the number of rows doesn't decrease. This can be replicated with the min window function.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd
data_sdf. \
    withColumn('grouped_var', func.min('var2').over(wd.partitionBy('var1')))
withColumn helps you add/replace columns.
Here's an example using sample data.
data_sdf.show()
# +---+---+
# | a| b|
# +---+---+
# | 1| 2|
# | 1| 3|
# | 2| 5|
# | 2| 4|
# +---+---+
data_sdf. \
    withColumn('grouped_res', func.min('b').over(wd.partitionBy('a'))). \
    show()
# +---+---+-----------+
# | a| b|grouped_res|
# +---+---+-----------+
# | 1| 2| 2|
# | 1| 3| 2|
# | 2| 5| 4|
# | 2| 4| 4|
# +---+---+-----------+
Is there an index in a Spark DataFrame?
I think the pandas index doesn't exist in Spark, since Spark is not designed for row-level manipulation.
How can I group on level=0 like that?
Instead of grouping by level, you group directly by the columns that identify that level of granularity.
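For comparison, a minimal sketch of the groupBy-and-join route mentioned above, assuming the same data_sdf with columns a and b as in the example; it yields the same grouped_res values as the window version, just with an extra join (and possibly a different row order):
grouped_sdf = data_sdf.groupBy('a').agg(func.min('b').alias('grouped_res'))
data_sdf.join(grouped_sdf, on='a', how='left').show()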

How do I coalesce rows in pyspark?

In PySpark, there's the concept of coalesce(colA, colB, ...) which will, per row, take the first non-null value it encounters from those columns. However, I want coalesce(rowA, rowB, ...) i.e. the ability to, per column, take the first non-null value it encounters from those rows. I want to coalesce all rows within a group or window of rows.
For example, given the following dataset, I want to coalesce rows per category, ordered ascending by date.
+---------+-----------+------+------+
| category| date| val1| val2|
+---------+-----------+------+------+
| A| 2020-05-01| null| 1|
| A| 2020-05-02| 2| null|
| A| 2020-05-03| 3| null|
| B| 2020-05-01| null| null|
| B| 2020-05-02| 4| null|
| C| 2020-05-01| 5| 2|
| C| 2020-05-02| null| 3|
| D| 2020-05-01| null| 4|
+---------+-----------+------+------+
What I should get as the output is...
+---------+-----------+------+------+
| category| date| val1| val2|
+---------+-----------+------+------+
| A| 2020-05-01| 2| 1|
| B| 2020-05-01| 4| null|
| C| 2020-05-01| 5| 2|
| D| 2020-05-01| null| 4|
+---------+-----------+------+------+
First, I'll give the answer. Then, I'll point out the important bits.
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, first
df = ... # dataframe from question description
window = (
    Window
    .partitionBy("category")
    .orderBy(col("date").asc())
)
window_unbounded = (
    window
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
cols_to_merge = [c for c in df.columns if c not in ["category", "date"]]
merged_cols = [first(c, True).over(window_unbounded).alias(c) for c in cols_to_merge]
df_merged = (
    df
    .select([col("category"), col("date")] + merged_cols)
    .withColumn("rank_col", dense_rank().over(window))
    .filter(col("rank_col") == 1)
    .drop("rank_col")
)
The row-wise analogue to coalesce is the aggregation function first. Specifically, we use first with ignorenulls = True so that we find the first non-null value.
When we use first, we have to be careful about the ordering of the rows it's applied to. Because groupBy doesn't allow us to maintain order within the groups, we use a Window.
The window itself must be unbounded on both ends rather than the default unbounded preceding to current row, else we'll end up with the first aggregation potentially running on subsets of our groups.
After we aggregate over the window, we alias the column back to its original name to keep the column names consistent.
We use a single select statement of cols rather than a for loop with df.withColumn(col, ...) because the select statement greatly reduces the query plan depth. Should you use the looped withColumn, you might hit a stack overflow error if you have too many columns.
Finally, we run a dense_rank over our window (this time using the window with the default frame) and filter down to only the first-ranked rows. We use dense_rank here, but we could use any ranking function, whatever fits our needs.
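For completeness, a minimal sketch of building the sample dataframe from the question, so the snippet above is runnable end to end (build df first, then run the code above; dates are kept as strings here, which still sort correctly in ISO format):
df = spark.createDataFrame(
    [('A', '2020-05-01', None, 1),
     ('A', '2020-05-02', 2, None),
     ('A', '2020-05-03', 3, None),
     ('B', '2020-05-01', None, None),
     ('B', '2020-05-02', 4, None),
     ('C', '2020-05-01', 5, 2),
     ('C', '2020-05-02', None, 3),
     ('D', '2020-05-01', None, 4)],
    ['category', 'date', 'val1', 'val2'])
df_merged.show()
# one row per category with val1/val2 coalesced, matching the expected output in the question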

How do I filter bad or corrupted rows from a Spark dataframe after casting

df1
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80|
| 03| spark| 1|
| 04| 300| 1|
+-------+-------+-----+
After casting Score to int and hits to float, I get the dataframe below:
df2
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80.0|
| 03| Null| 1.0|
| 04| 300| 1.0|
+-------+-------+-----+
Now I want to extract only the bad records; bad records are the ones where a null was produced by the casting.
I want to do the operations only on the existing dataframe. Please help me out if there is any built-in way to get the bad records after casting.
Please also consider that this is a sample dataframe; the solution should work for any number of columns and any scenario.
I tried separating the null records from both dataframes and comparing them. I have also thought of adding another column with the number of nulls and then comparing the two dataframes: if the number of nulls is greater in df2 than in df1, those are the bad ones. But I think these solutions are pretty old school.
I would like to know a better way to resolve it.
You can use a custom function/UDF to convert the string to an integer and map non-integer values to a specific number, e.g. -999999999.
Later you can filter on -999999999 to identify the originally non-integer records.
def udfInt(value):
    if value is None:
        return None
    elif value.isdigit():
        return int(value)
    else:
        return -999999999

spark.udf.register('udfInt', udfInt)

df.selectExpr("*", "udfInt(Score) AS new_Score").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 01| 100|null| 100|
#| 02| null| 80| null|
#| 03|spark| 1|-999999999|
#| 04| 300| 1| 300|
#+---+-----+----+----------+
Filter on -999999999 to identify the non-integer (bad) records:
df.selectExpr("*","udfInt(Score) AS new_Score").filter("new_score == -999999999").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 03|spark| 1|-999999999|
#+---+-----+----+----------+
In the same way, you can have a customized UDF for float conversion; for example:
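A minimal sketch of such a float version (udfFloat is just an illustrative name), mirroring the integer UDF above:
def udfFloat(value):
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return -999999999.0

spark.udf.register('udfFloat', udfFloat)

df.selectExpr("*", "udfFloat(hits) AS new_hits").filter("new_hits == -999999999.0").show()
# shows the rows whose hits value cannot be parsed as a float (empty for the sample data above)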

pyspark: drop columns that have same values in all rows

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?
So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows while keeping other columns intact.
However the answers in the above question are only for pandas. Is there a solution for pyspark dataframe?
Thanks
You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count=1 has only one value across all rows.
from pyspark.sql.functions import col, countDistinct

# apply countDistinct on each column
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
# select the cols with count=1 in a list
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]
# drop the selected columns
df.drop(*cols_to_drop).show()
You can use the approx_count_distinct function (link) to count the number of distinct elements in a column. In case there is just one distinct value, remove the corresponding column.
Creating the DataFrame
from pyspark.sql.functions import approx_count_distinct
myValues = [(1,2,2,0),(2,2,2,0),(3,2,2,0),(4,2,2,0),(3,1,2,0)]
df = sqlContext.createDataFrame(myValues,['value1','value2','value3','value4'])
df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 1| 2| 2| 0|
| 2| 2| 2| 0|
| 3| 2| 2| 0|
| 4| 2| 2| 0|
| 3| 1| 2| 0|
+------+------+------+------+
Counting the number of distinct elements and converting the result into a dictionary.
count_distinct_df=df.select([approx_count_distinct(x).alias("{0}".format(x)) for x in df.columns])
count_distinct_df.show()
+------+------+------+------+
|value1|value2|value3|value4|
+------+------+------+------+
| 4| 2| 1| 1|
+------+------+------+------+
dict_of_columns = count_distinct_df.toPandas().to_dict(orient='list')
dict_of_columns
{'value1': [4], 'value2': [2], 'value3': [1], 'value4': [1]}
#Storing in a list the keys which have just 1 distinct value.
distinct_columns=[k for k,v in dict_of_columns.items() if v == [1]]
distinct_columns
['value3', 'value4']
Drop the columns having only one distinct value:
df=df.drop(*distinct_columns)
df.show()
+------+------+
|value1|value2|
+------+------+
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 3| 1|
+------+------+

Spark SQL DataFrame - distinct() vs dropDuplicates()

I was looking at the DataFrame API and I can see two different methods doing the same thing: removing duplicates from a data set.
I can understand dropDuplicates(colNames) will remove duplicates considering only the subset of columns.
Is there any other differences between these two methods?
The main difference is the handling of a subset of columns, which is great!
When using distinct() you need a prior .select() to pick the columns on which you want to apply the deduplication, and the returned dataframe contains only those selected columns, while dropDuplicates(colNames) returns all the columns of the initial dataframe after removing duplicate rows as per those columns.
Let's assume we have the following spark dataframe
+---+------+---+
| id| name|age|
+---+------+---+
| 1|Andrew| 25|
| 1|Andrew| 25|
| 1|Andrew| 26|
| 2| Maria| 30|
+---+------+---+
distinct() does not accept any arguments which means that you cannot select which columns need to be taken into account when dropping the duplicates. This means that the following command will drop the duplicate records taking into account all the columns of the dataframe:
df.distinct().show()
+---+------+---+
| id| name|age|
+---+------+---+
| 1|Andrew| 26|
| 2| Maria| 30|
| 1|Andrew| 25|
+---+------+---+
Now in case you want to drop the duplicates considering ONLY id and name you'd have to run a select() prior to distinct(). For example,
>>> df.select(['id', 'name']).distinct().show()
+---+------+
| id| name|
+---+------+
| 2| Maria|
| 1|Andrew|
+---+------+
But in case you wanted to drop the duplicates only over a subset of columns like above but keep ALL the columns, then distinct() is not your friend.
dropDuplicates() will drop the duplicates detected over the provided set of columns, but it will also return all the columns appearing in the original dataframe.
df.dropDuplicates().show()
+---+------+---+
| id| name|age|
+---+------+---+
| 1|Andrew| 26|
| 2| Maria| 30|
| 1|Andrew| 25|
+---+------+---+
dropDuplicates() is thus more suitable when you want to drop duplicates over a selected subset of columns, but also want to keep all the columns:
df.dropDuplicates(['id', 'name']).show()
+---+------+---+
| id| name|age|
+---+------+---+
| 2| Maria| 30|
| 1|Andrew| 25|
+---+------+---+
For more details refer to the article distinct() vs dropDuplicates() in Python
From the javadoc, there is no difference between distinct() and dropDuplicates().
dropDuplicates
public DataFrame dropDuplicates()
Returns a new DataFrame that contains only the unique rows from this
DataFrame. This is an alias for distinct.
dropDuplicates() was introduced in 1.4 as a replacement for distinct(), as you can use its overloaded methods to get unique rows based on a subset of columns.
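As a quick sanity check, a minimal sketch using the sample dataframe from the first answer (id, name, age): over all columns the two calls return the same rows, and only dropDuplicates() accepts a subset.
df = spark.createDataFrame(
    [(1, 'Andrew', 25), (1, 'Andrew', 25), (1, 'Andrew', 26), (2, 'Maria', 30)],
    ['id', 'name', 'age'])
df.distinct().count()                      #3
df.dropDuplicates().count()                #3, identical to distinct()
df.dropDuplicates(['id', 'name']).count()  #2, dedup on a subset while keeping all columns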