Show all matched rows in a PySpark DataFrame

I want to show all the filtered rows whose string value matches.
# Many Stack Overflow questions and answers leave out the imports and setup code needed to run PySpark, so the complete code is shared below for future readers.
# Import libraries
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
import numpy as np
# Initiate the session
spark = SparkSession.builder.getOrCreate()
# Reuse the existing SparkContext instead of calling SparkContext() directly
sc = SparkContext.getOrCreate()
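# Create dataframe 1 (the data rows are missing from the post)
sdataframe_temp = spark.createDataFrame([
    # ...
], ['a', 'b', 'c'])
# Create dataframe 2 (the data rows are missing from the post)
sdataframe_temp2 = spark.createDataFrame([
    # ...
], ['a', 'b', 'c'])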
# Combine dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
# Filter the rows and select the columns
sdataframe_temp\
.filter(sdataframe_union_1_2['c'] == 'yes')\
.select(['a', 'b'])\
.show()  # I wish to stick with the DataFrame API if possible.
Current output:
| a| b|
| 2| 2|
Expected output:
| a| b|
| 2| 2|
| 4| 6|
| 5| 7|
Can anyone suggest a fix or an improvement?

Here's a way using unionByName:
df = (sdataframe_temp.unionByName(sdataframe_temp2)
      .where("c == 'yes'")
      .select('a', 'b'))
df.show()
| a| b|
| 2| 2|
| 4| 6|
| 5| 7|

You should change the last statement of your code. To use the col function, import it from pyspark.sql.functions:
from pyspark.sql.functions import col
sdataframe_union_1_2\
.filter(col('c') == 'yes')\
.select(['a', 'b'])\
.show()
Or, without col:
sdataframe_union_1_2\
.filter(sdataframe_union_1_2['c'] == 'yes')\
.select(['a', 'b'])\
.show()
You have to select from sdataframe_union_1_2, but you are selecting from sdataframe_temp. That is why you get only one record.


How to create a PySpark DataFrame inside a loop?

How can I create a PySpark DataFrame inside a loop? In each iteration of the loop I print two values with print(a1, a2). Now I want to store all these values in a PySpark DataFrame.
Before the loop, you can create an empty DataFrame with your preferred schema. Then, in each iteration, create a new DataFrame with the same schema and union it with the original one. See the code below.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField('a1', StringType(), True),
    StructField('a2', StringType(), True)
])
# Start from an empty DataFrame with the target schema
df = spark.createDataFrame([], schema)
for i in range(1, 5):
    a1 = i
    a2 = i + 1
    # Build a one-row DataFrame and append it to the running result
    newRow = spark.createDataFrame([(a1, a2)], schema)
    df = df.union(newRow)
df.show()
This gives the result below, where the values are appended to df on each iteration.
| a1| a2|
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 5|

Remove all records that are duplicated in a Spark DataFrame

I have a Spark DataFrame with multiple columns. I want to find and remove the rows that have duplicated values in one column (the other columns can differ).
I tried dropDuplicates(col_name), but it only drops the extra copies and still keeps one record per value. What I need is to remove every row whose value appears more than once.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove rows with a duplicated id:
import org.apache.spark.sql.expressions.Window
df
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  .where($"cnt" === 1).drop("cnt")
This can be done by grouping by the column (or columns) to look for duplicates in and then aggregate and filter the results.
Example dataframe df:
| id|num|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
  .agg(first($"num").as("num"), count($"id").as("count"))
  .filter($"count" === 1)
  .select("id", "num")
This will give you:
| id|num|
| 1| 1|
| 2| 2|
| 3| 3|
Alternatively, it can be done with a join. It will be slower, but if there are many columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
object DataFrameExt {
  implicit class DataFrameMethods(df: DataFrame) {
    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn("my_super_secret_count", count("*").over(Window.partitionBy(cols: _*)))
        .where(col("my_super_secret_count") === 1)
        .drop("my_super_secret_count")
    }
  }
}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

I find it hard to understand the difference between these two methods from pyspark.sql.functions as the documentation on PySpark official website is not very informative. For example the following code:
import pyspark.sql.functions as F
The results are:
So what is the difference between the two, and when should I use one rather than the other?
The doc says:
col: Returns a Column based on the given column name.
lit: Creates a Column of literal value.
Say we have a DataFrame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
| A|
| a|
| b|
| c|
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
| A|new|
| a| a|
| b| b|
| c| c|
So col grabs an existing column with the given name, F.col("A") is equivalent to df.A or df["A"] here.
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
| A|new|
| a| A|
| b| A|
| c| A|
By contrast, lit creates a constant column with the given string as its value in every row.
Both of them return a Column object, but the content and meaning are different.
To put it succinctly, col is typically used to refer to an existing column in a DataFrame, whereas lit is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two columns of IntegerType, col_a and col_b.
If I wanted a column total that is the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val with the value "Hello" in every row of df:
df.withColumn('fixed_val', lit('Hello'))

How to combine where and groupBy in Spark's DataFrame?

How can I use aggregate functions in a where clause in Apache Spark 1.6?
Consider the following DataFrame
| id|letter|
| 1| a|
| 2| b|
| 3| b|
How can I select all rows where letter occurs more than once, i.e. the expected output would be
| id|letter|
| 2| b|
| 3| b|
This obviously does not work:
My example is about count, but I'd like to be able to use the results of other aggregate functions as well.
Just for counting, I came up with the following solution:
df.groupBy($"letter").agg(collect_list($"id").as("ids"))
  .where(size($"ids") > 1)
  .withColumn("id", explode($"ids"))
  .drop("ids")
You can use a left semi join:
df.join(
  broadcast(df.groupBy($"letter").count.where($"count" > 1)), Seq("letter"), "leftsemi")
or window functions:
import org.apache.spark.sql.expressions.Window
df
  .withColumn("count", count($"*").over(Window.partitionBy("letter")))
  .where($"count" > 1)
In Spark 2.0 or later you can also use a Bloom filter, but it is not available in 1.x.

Pyspark Dataframe Apply function to two columns

Say I have two PySpark DataFrames df1 and df2.
df1= 'a'
      1
      2
      5
df2= 'b'
      3
      6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find the y that achieves min(abs(x-y)) over all y in df2['b'] (note: you can assume that only one y achieves the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def dist(x, y):
    # Absolute difference between the two values
    return abs(x - y)
udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Then I tried
sql_sc.CreateDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
1. As I'm new to Spark, is my way of constructing the output DataFrame efficient? (My approach is to create a distance matrix for all the a and b values first, and then find the minimum.)
2. What's wrong with the last line of my code, and how can I fix it?
Starting with your second question: you can apply a udf only to an existing DataFrame. I think you were looking for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
| a| b|distance|
| 1| 3| 2|
| 1| 6| 5|
| 2| 3| 1|
| 2| 6| 4|
| 5| 3| 2|
| 5| 6| 1|
But there is a more efficient way to compute this distance, using the built-in abs function:
>>> from pyspark.sql.functions import abs, min  # these shadow Python's built-in abs and min
>>> df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
Then you can find the matching values by computing:
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
| a| b|
| 5| 6|
| 1| 3|
| 2| 3|