How to find duplicated columns with all values in spark dataframe? - scala

I'm preprocessing my data(2000K+ rows), and want to count the duplicated columns in a spark dataframe, for example:
id | col1 | col2 | col3 | col4 |
----+--------+-------+-------+-------+
1 | 3 | 999 | 4 | 999 |
2 | 2 | 888 | 5 | 888 |
3 | 1 | 777 | 6 | 777 |
In this case, the col2 and col4's values are the same, which is my interest, so let the count +1.
I had tried toPandas(), transpose, and then duplicateDrop() in pyspark, but it's too slow.
Is there any function could solve this?
Any idea will be appreciate, thank you.

So you want to count the number of duplicate values based on the columns col2 and col4? This should do the trick below.
val dfWithDupCount = df.withColumn("isDup", when($"col2" === "col4", 1).otherwise(0))
This will create a new dataframe with a new boolean column saying that if col2 is equal to col4, then enter the value 1 otherwise 0.
To find the total number of rows, all you need to do is do a group by based on isDup and count.
import org.apache.spark.sql.functions._
val groupped = df.groupBy("isDup").agg(sum("isDup")).toDF()
display(groupped)
Apologies if I misunderstood you. You could probably use the same solution if you were trying to match any of the columns together, but that would require nested when statements.

Related

Filtering rows based on a complex condition in a dataframe with spark scala

I have to filter rows from a dataframe based on some conditions input by a user. I read a csv file to get each condition as a string. Then use the code below to do the filtering.
dataframe.filter(expr(condition))
This works for simple conditions that do not contain '&', '|'. Ex: "col1 != 0", "col2 > 2".
Now the requirement is to make it work for complex conditions as well.
Following is an example
dataframe
col1 | col2 | col3 | col4
-----+------+------+------
1.0 | 10 | 100 | x
0.2 | 20 | 200 | y
1.3 | 30 | 300 | x
complex condition: "(col1 < 1.0) & (col2 + col3) >= 200) & (col4 != 'x')"
Is there a way to do it?
You can transform & and | into and and or. If the complex condition is deeply nested, you can just do it recursively.
Here is the list of Spark SQL functions for your reference:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':1}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is a some logic I will have to apply to how the JSON is merged.
My knee jerk thought is that I'd like to provide a lambda / udf to merge the data column, but not sure how to think about that with during a join.
Alternatively, I could break the properties from the JSON into columns, something like this, that might be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
Not sure how efficient this is, or if there is a more spark-esque way to do this, but incorporating accepted answer, this appears to work for question update:
for i in range(0, (len(validations) - 1)):
# set dfs
df1 = validations[i]['df']
df2 = validations[(i+1)]['df']
# joins here...
# update new_df
new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
.unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
.groupBy("id")\
.agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
.select("id", f.explode("a_list").alias("a_values"), "count")\
.groupBy("id")\
.agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column say Key that will have value 1 for the first n rows and 2 for the next n rows. The result will be like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possibile to do the same with PySpark?. Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd=sc.parallelize([[4,12345], [6,12567], [10,12890], [13,12450]])
# group size for key
n=2
# add rownumber and then label in batches of size n
rdd=rdd.zipWithIndex().map(lambda (x, rownum): x+[int(rownum/n)+1])
# convert to dataframe
df=rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)

Calculate frequency of column in data frame using spark sql

I'm trying to get the Frequency of distinct values in a Spark dataframe column, something like "value_counts" from Python Pandas. By frequency I mean, the highest occurring value in a table column (such as rank 1 value, rank 2, rank 3 etc. In the expected output, 1 has occurred 9 times in column a, so it has topmost frequency.
I'm using Spark SQL but it is not working out, may be because of the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is with query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName | column | min | max | frequency1 |
_____________+_________+______+_______+____________+
ParquetRDD | a | 1 | 30 | 9 |
_____________+_________+______+_______+____________+
ParquetRDD | b | 2 | 21 | 5 |
How do I solve this ? please help.
I could solve the issue with below with using count($field) instead of approx_count_distinct($field). Then I used Rank analytical function to get the first rank of value. It worked.

How to remove records with their count per group below a threshold?

Here's the DataFrame:
id | sector | balance
---------------------------
1 | restaurant | 20000
2 | restaurant | 20000
3 | auto | 10000
4 | auto | 10000
5 | auto | 10000
How to find the count of each sector type and remove the records with sector type count below a specific LIMIT?
The following:
dataFrame.groupBy(columnName).count()
gives me the number of times a value appears in that column.
How to do it in Spark and Scala using DataFrame API?
You can use SQL Window to do so.
import org.apache.spark.sql.expressions.Window
yourDf.withColumn("count", count("*")
.over(Window.partitionBy($"colName")))
.where($"count">2)
// .drop($"count") // if you don't want to keep count column
.show()
For your given dataframe
import org.apache.spark.sql.expressions.Window
dataFrame.withColumn("count", count("*")
.over(Window.partitionBy($"sector")))
.where($"count">2)
.show()
You should see results like this:
id | sector | balance | count
------------------------------
3 | auto | 10000 | 3
4 | auto | 10000 | 3
5 | auto | 10000 | 3
Don't know if it is the best way. But this worked for me.
def getRecordsWithColumnFrequnecyLessThanLimit(dataFrame: DataFrame, columnName: String, limit: Integer): DataFrame = {
val g = dataFrame.groupBy(columnName)
.count()
.filter("count<" + limit)
.select(columnName)
.rdd
.map(r => r(0)).collect()
dataFrame.filter(dataFrame(columnName) isin (g:_*))
}
Since it's a dataframe you can use SQL query like
select sector, count(1)
from TABLE
group by sector
having count(1) >= LIMIT