Pyspark Autonumber over a partitioning column - pyspark

I have a column in my data frame that is sensitive. I need to replace the sensitive value with a number, but have to do it so that the distinct counts of the column in question stays accurate. I was thinking of a sql function over a window partition. But couldn't find a way.
A sample dataframe is below.
df = (sc.parallelize([
{"sensitive_id":"1234"},
{"sensitive_id":"1234"},
{"sensitive_id":"1234"},
{"sensitive_id":"2345"},
{"sensitive_id":"2345"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"}
]).toDF()
.cache()
)
I would like to create a dataframe like below.
What is a way to get this done.

You are looking for dense_rank function :
df.withColumn(
"non_sensitive_id",
F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
).show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+

This is another way of doing this, may not be very efficient because join() will involve a shuffle -
Creating the DataFrame -
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
df = sqlContext.createDataFrame([(1234,),(1234,),(1234,),(2345,),(2345,),(6789,),(6789,),(6789,),(6789,)],['sensitive_id'])
Creating a DataFrame of distinct elements and labeling them 1,2,3... and finally joining the two dataframes.
df_distinct = df.select('sensitive_id').distinct().withColumn('non_sensitive_id', row_number().over(Window.orderBy('sensitive_id')))
df = df.join(df_distinct, ['sensitive_id'],how='left').orderBy('sensitive_id')
df.show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+

Related

after joining two dataframes pick all columns from one dataframe on basis of primary key

I've two dataframes, I need to update records in df1 based on new updates available in df2 in pyspark.
DF1:
df1=spark.createDataFrame([(1,2),(2,3),(3,4)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 3| 4|
+---+----+
DF2:
df2=spark.createDataFrame([(1,4),(2,5)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
+---+----+
then, I'm trying to join the two dataframes.
join_con=(df1["id"] == df2["id"])
jdf=df1.join(df2,join_con,"left")
+---+----+----+----+
| id|val1| id|val1|
+---+----+----+----+
| 1| 2| 1| 4|
| 3| 4|null|null|
| 2| 3| 2| 5|
+---+----+----+----+
Now, I want to pick all columns from df2 if df2["id"] is not null, otherwise pick all columns of df1.
something like:
jdf.filter(df2.id is null).select(df1["*"])
union
jdf.filter(df2.id is not null).select(df2["*"])
so resultant DF can be:
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
+---+----+
Can someone please help with this?
Your selection expression can be a coalesce between the column in df2 followed by df1.
from pyspark.sql import functions as F
df1=spark.createDataFrame([(1,2),(2,3),(3,4), (4, 1),],["id","val1"])
df2=spark.createDataFrame([(1,4),(2,5), (4, None),],["id","val1"])
selection_expr = [F.when(df2["id"].isNotNull(), df2[c]).otherwise(df1[c]).alias(c) for c in df2.columns]
jdf.select(selection_expr).show()
"""
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
| 4|null|
+---+----+
"""
Try with coalesce function as this function gets first non null values.
expr=zip(df2.columns,df1.columns)
e1=[coalesce(df2[f[0]],df1[f[1]]).alias(f[0]) for f in expr]
jdf.select(*e1).show()
#+---+----+
#| id|val1|
#+---+----+
#| 1| 4|
#| 2| 5|
#| 3| 4|
#+---+----+

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter out what was in table_a to only the IDs that are in the table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any other imports other than pyspark.sql.functions import col
You can join the two tables and specify how = 'left_semi'
A left semi-join returns values from the left side of the relation that has a match with the right.
result_table = table_a.join(table_b, (table_a.AID == table_b.BID), \
how = "left_semi").drop("BID")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
In case you have duplicates or Multiple values in the second dataframe and you want to take only distinct values, below approach can be useful to tackle such use cases -
Create the Dataframe
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
get all the unique values of val column in dataframe two and take in a set/list variable
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1][0]
print(df_lookup_var)
df = df.withColumn("case_col", F.when((F.col("col1").isin([1,2])), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too:
table_a.where( col(AID).isin(table_b.BID.tolist() ) )

Pyspark get count in aggregate table

I have a table that looks like this:
+-------------+-----+
| PULocationID| fare|
+-------------+-----+
| 1| 5|
| 1| 15|
| 2| 2|
+-------------+-----+
I want to get a table that looks like this:
+-------------+----------+------+
| PULocationID| avg_fare | count|
+-------------+----------+------+
| 1| 10| 2|
| 2| 2| 1|
+-------------+----------+------+
Here is what I'm trying:
result_table = trips.groupBy("PULocationID") \
.agg(
{"total_amount": "avg"},
{"PULocationID": "count"}
)
If I take out the count line, it works fine getting the avg column. But I need to get the count also of how many rows had that particular PULocationID
NOTE: I can't add any other imports other than pyspark.sql.functions import col
Thanks for the help!
I was so close, I was just formatting it as two dictionaries instead of one.
result_table = trips.groupBy("PULocationID") \
.agg(
{"total_amount": "avg","PULocationID":"count"}
)
This should be the working solution for you - use avg() and count()
df = spark.createDataFrame([(1,5),(1,15),(2,2)],[ "PULocationID","fare"])
df.show()
df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"), F.count("PULocationID").alias("count"))
df_group.show()
**Input**
+------------+----+
|PULocationID|fare|
+------------+----+
| 1| 5|
| 1| 15|
| 2| 2|
+------------+----+
Output
+------------+--------+-----+
|PULocationID|avg_fare|count|
+------------+--------+-----+
| 1| 10.0| 2|
| 2| 2.0| 1|
+------------+--------+-----+

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added columns.
The userid is unique and the existing data frame will not have the Dataframe 2 user ids.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achive this by passing a max index value and start index for second Dataframe from given index value.
If the userid has some ordering, then you can use the rownumber function. Even if it does not have, then you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT : After discussions in comment.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null columns as the index
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%%Define a window to arrange the newly added rows at the last and order them by userid
#%% The user id, even though random strings, can be ordered
w= Window.orderBy(F.col('index').asc_nulls_last(),F.col('userid'))# if possible add a partition column here, otherwise all your data will come in one partition, consider salting
#%% For the newly added rows, define index as the maximum value + increment of number of rows in main dataframe
df_final = df_merge.withColumn("index_new",F.when(~F.col('index').isNull(),F.col('index')).otherwise((F.last(F.col('index'),ignorenulls=True).over(w))+F.sum(F.lit(1)).over(w)))
#%% If number of rows in main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+

Compare two dataframes and update the values

I have two dataframes like following.
val file1 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file1.csv")
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
+---+-------+-----+-----+-------+
val file2 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file2.csv")
file2.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 70| 5| 0|
+---+-------+-----+-----+-------+
Now I am comparing two dataframes and filtering out the mismatch values like this.
val columns = file1.schema.fields.map(_.name)
val selectiveDifferences = columns.map(col => file1.select(col).except(file2.select(col)))
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
+-----+
|mark1|
+-----+
| 10|
+-----+
I need to add the extra row into the dataframe, 1 for the mismatch value from the dataframe 2 and update the version number like this.
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
| 3| Teju | 70| 5| 1|
+---+-------+-----+-----+-------+
I am struggling to achieve the above step and it is my expected output. Any help would be appreciated.
You can get your final dataframe by using except and union as following
val count = file1.count()
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
file1.union(file2.except(file1)
.withColumn("version", lit(1)) //changing the version
.withColumn("id", (row_number.over(Window.orderBy("id")))+lit(count)) //changing the id number
)
lit, row_number and window functions are used to generate the id and versions
Note : use of window function to generate the new id makes the process inefficient as all the data would be collected in one executor for generating new id