Spark adding indexes to dataframe and append other dataset that doesn't have index - scala

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added rows.
The userid is unique, and the existing data frame will never contain any of the user ids from the second data frame.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by taking the max index value of the first data frame and starting the index of the second data frame from that value?

If userid can be ordered, you can use the row_number function. If it cannot, you can add an id with monotonically_increasing_id() instead. For now I assume that userid can be ordered. Note that this approach renumbers every row, so existing index values are not preserved. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT : After discussions in comment.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null column as the index of the new rows
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%%Define a window to arrange the newly added rows at the last and order them by userid
#%% The user id, even though random strings, can be ordered
w= Window.orderBy(F.col('index').asc_nulls_last(),F.col('userid'))# if possible add a partition column here, otherwise all your data will come in one partition, consider salting
#%% For the newly added rows, define index as the maximum value + increment of number of rows in main dataframe
df_final = df_merge.withColumn("index_new",F.when(~F.col('index').isNull(),F.col('index')).otherwise((F.last(F.col('index'),ignorenulls=True).over(w))+F.sum(F.lit(1)).over(w)))
#%% If number of rows in main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
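The question asks whether the new rows can simply continue from the current maximum index. The arithmetic behind that idea can be sketched in plain Python (no Spark involved; `append_with_index` is a hypothetical helper, not part of any library). In Spark the same idea would presumably mean computing the maximum of the index column once and adding it to a row number over the new rows, but that translation is an assumption and is not shown here.

```python
# Plain-Python sketch of "start the new index at max(existing index) + 1".
# The data mirrors the tables in the question; the helper name is hypothetical.
existing = [(f"user{i}", i) for i in range(1, 11)]
new_userids = ["user11", "user21", "user41", "user51", "user64"]


def append_with_index(existing, new_userids):
    """Return existing rows plus new rows numbered after the current maximum."""
    # default=0 lets the numbering start at 1 when there are no existing rows
    max_index = max((idx for _, idx in existing), default=0)
    appended = [
        (userid, max_index + offset)
        for offset, userid in enumerate(new_userids, start=1)
    ]
    return existing + appended


result = append_with_index(existing, new_userids)
for userid, idx in result[-5:]:
    print(userid, idx)
```

Existing rows keep their index untouched, and the order of the new rows decides which of them gets which number.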

Related

Unable to merge the data from two tables

I have two tables. The first one contains all the records and the second one contains only a few. I am trying to use a left join to get all the records of the first table, together with the data columns of the records that match in the second table.
Table A
id_step|id_workflow|id_action|
-------+-----------+---------+
1| 1| 11|
6| 1| 11|
7| 1| 11|
8| 1| 12|
9| 1| 12|
10| 1| 12|
Table B
id_step|id_client|id_process|id_workflow|is_approved|action_by|action_date|
-------+---------+----------+-----------+-----------+---------+-----------+
1| 10680| 10| 1|true | | |
I am looking to get the below result
Expected Output
id_step|id_client|id_workflow|id_action|is_approved|action_by|action_date|
-------+---------+---------- +---------+-----------+---------+-----------+
1| 10680| 1| 11|true | | |
6| 10680| 1| 11|pending | | |
7| 10680| 1| 11|pending | | |
8| 10680| 1| 11|pending | | |
9| 10680| 1| 11|pending | | |
10| 10680| 1| 11|pending | | |
I tried this query:
select
x.id_step,
y.id_client,
x.id_workflow,
x.id_action,
(case when y.is_approved is null then 'pending' else y.is_approved::text end ) as is_approved,
y.action_date,
y.action_by
from nw_adsys_wfx_config as x
left join nw_adsys_cli_wfx_process as y
on x.id_step = y.id_step
where x.id_workflow = 1
and y.id_process = 10
but I am getting this output:
id_step|id_client|id_workflow|id_action|is_approved|action_date|action_by|
-------+---------+-----------+---------+-----------+-----------+---------+
1| 10680| 1| 11|true | | |
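The query contains a left join, but its where clause also filters on a column of the joined table, which defeats the intent of the left join: for rows of x without a match, y.id_process is NULL, so the condition y.id_process = 10 discards them and the query behaves like an inner join.
Instead, you can move that condition into the join: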
...
from nw_adsys_wfx_config as x
left join nw_adsys_cli_wfx_process as y
on x.id_step = y.id_step
AND y.id_process = 10
where x.id_workflow = 1
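The difference between the two placements of the condition can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: the shortened table names `config` and `process` and the reduced column set are made up for illustration, and SQLite stands in for the asker's database. Note that id_client stays NULL for the unmatched rows in either variant; the join alone does not fill it in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE config  (id_step INTEGER, id_workflow INTEGER, id_action INTEGER);
CREATE TABLE process (id_step INTEGER, id_client INTEGER,
                      id_process INTEGER, is_approved TEXT);
INSERT INTO config VALUES (1,1,11),(6,1,11),(7,1,11),(8,1,12),(9,1,12),(10,1,12);
INSERT INTO process VALUES (1, 10680, 10, 'true');
""")

# Condition on the right-hand table in WHERE: unmatched rows are filtered out.
where_filter = conn.execute("""
SELECT x.id_step, COALESCE(y.is_approved, 'pending')
FROM config AS x
LEFT JOIN process AS y ON x.id_step = y.id_step
WHERE x.id_workflow = 1 AND y.id_process = 10
ORDER BY x.id_step
""").fetchall()

# Same condition moved into ON: every row of the left table survives.
on_filter = conn.execute("""
SELECT x.id_step, COALESCE(y.is_approved, 'pending')
FROM config AS x
LEFT JOIN process AS y ON x.id_step = y.id_step AND y.id_process = 10
WHERE x.id_workflow = 1
ORDER BY x.id_step
""").fetchall()

print(where_filter)  # one row
print(on_filter)     # six rows, five of them 'pending'
```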

Create a new column that marks customers

My goal is to aggregate over customerID (count), create a new column, and mark the customers who often return an article. How can I do that? (Using Databricks, PySpark.)
train.select("itemID","customerID","returnShipment").show(10)
+------+----------+--------------+
|itemID|customerID|returnShipment|
+------+----------+--------------+
| 186| 794| 0|
| 71| 794| 1|
| 71| 794| 1|
| 32| 850| 1|
| 32| 850| 1|
| 57| 850| 1|
| 2| 850| 1|
| 259| 850| 1|
| 603| 850| 1|
| 259| 850| 1|
+------+----------+--------------+
You can define a threshold value and then compare this threshold value to the sum of returnShipments for each customerID:
from pyspark.sql import functions as F
threshold=5
train.groupBy("customerID")\
.sum("returnShipment") \
.withColumn("mark", F.col("sum(returnShipment)") > threshold) \
.show()
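To see what the threshold comparison yields on the ten sample rows from the question, here is the same logic in plain Python (no Spark; `mark_frequent_returners` is a hypothetical helper used only for this illustration):

```python
from collections import defaultdict

# (itemID, customerID, returnShipment) rows copied from the question
rows = [
    (186, 794, 0), (71, 794, 1), (71, 794, 1),
    (32, 850, 1), (32, 850, 1), (57, 850, 1), (2, 850, 1),
    (259, 850, 1), (603, 850, 1), (259, 850, 1),
]
threshold = 5


def mark_frequent_returners(rows, threshold):
    """Sum returnShipment per customer and flag totals above the threshold."""
    returns = defaultdict(int)
    for _item_id, customer_id, returned in rows:
        returns[customer_id] += returned
    return {customer: total > threshold for customer, total in returns.items()}


marks = mark_frequent_returners(rows, threshold)
print(marks)
```

Customer 794 has 2 returns and customer 850 has 7, so with a threshold of 5 only 850 is marked.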

Pyspark Autonumber over a partitioning column

I have a column in my data frame that is sensitive. I need to replace each sensitive value with a number, but in such a way that the distinct counts of that column stay accurate. I was thinking of a SQL function over a window partition, but couldn't find a way.
A sample dataframe is below.
df = (sc.parallelize([
{"sensitive_id":"1234"},
{"sensitive_id":"1234"},
{"sensitive_id":"1234"},
{"sensitive_id":"2345"},
{"sensitive_id":"2345"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"},
{"sensitive_id":"6789"}
]).toDF()
.cache()
)
I would like to create a dataframe like below.
What is a way to get this done?
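You are looking for the dense_rank function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# No partition column: all rows are ranked in a single window
df.withColumn(
"non_sensitive_id",
F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
).show()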
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+
Here is another way of doing this. It may not be very efficient, because join() involves a shuffle.
Creating the DataFrame:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
df = sqlContext.createDataFrame([(1234,),(1234,),(1234,),(2345,),(2345,),(6789,),(6789,),(6789,),(6789,)],['sensitive_id'])
Creating a DataFrame of distinct elements and labeling them 1,2,3... and finally joining the two dataframes.
df_distinct = df.select('sensitive_id').distinct().withColumn('non_sensitive_id', row_number().over(Window.orderBy('sensitive_id')))
df = df.join(df_distinct, ['sensitive_id'],how='left').orderBy('sensitive_id')
df.show()
+------------+----------------+
|sensitive_id|non_sensitive_id|
+------------+----------------+
| 1234| 1|
| 1234| 1|
| 1234| 1|
| 2345| 2|
| 2345| 2|
| 6789| 3|
| 6789| 3|
| 6789| 3|
| 6789| 3|
+------------+----------------+
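Both answers boil down to the same mapping: number the distinct values 1, 2, 3, ... in sorted order and look each row up in that mapping. A plain-Python sketch of this idea (no Spark; `surrogate_ids` is a hypothetical name) shows why the distinct count is preserved:

```python
values = ["1234"] * 3 + ["2345"] * 2 + ["6789"] * 4


def surrogate_ids(values):
    """Map each distinct value to its 1-based rank among the sorted distinct values."""
    lookup = {v: rank for rank, v in enumerate(sorted(set(values)), start=1)}
    return [lookup[v] for v in values], lookup


masked, lookup = surrogate_ids(values)
print(masked)
# Equal inputs share one number and different inputs never collide,
# so the number of distinct values is unchanged.
print(len(set(values)) == len(set(masked)))
```

The lookup table is itself sensitive, since it allows the numbers to be mapped back to the original values.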

Sum over another column returning 'col should be column error'

I'm trying to add a new column that shows the sum of a double column (thing_to_sum) for each ID in the id column. However, this currently throws a 'col should be Column' error:
df = df.withColumn('sum_column', (df.groupBy('id').agg({'thing_to_sum': 'sum'})))
Example Data Set:
| id | thing_to_sum | sum_column |
|----|--------------|------------|
| 1 | 5 | 7 |
| 1 | 2 | 7 |
| 2 | 4 | 4 |
Any help on this would be greatly appreciated.
Also any reference on the most efficient way to do this would also be appreciated.
You can register any DataFrame as a temporary table to query it via SQLContext.sql.
myValues = [(1,5),(1,2),(2,4),(2,3),(2,1)]
df = sqlContext.createDataFrame(myValues,['id','thing_to_sum'])
df.show()
+---+------------+
| id|thing_to_sum|
+---+------------+
| 1| 5|
| 1| 2|
| 2| 4|
| 2| 3|
| 2| 1|
+---+------------+
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, thing_to_sum, sum(thing_to_sum) over (partition by id) as sum_column from table_view'
)
df1.show()
+---+------------+----------+
| id|thing_to_sum|sum_column|
+---+------------+----------+
| 1| 5| 7|
| 1| 2| 7|
| 2| 4| 8|
| 2| 3| 8|
| 2| 1| 8|
+---+------------+----------+
I think I found the solution to my own question, but advice would still be appreciated:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
sum_calc = F.sum(df.thing_to_sum).over(Window.partitionBy("id"))
df = df.withColumn("sum_column", sum_calc)
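The windowed sum from the SQL answer can be checked outside Spark with Python's built-in sqlite3 module. This is only a sketch under assumptions: SQLite stands in for Spark SQL, and window functions require SQLite 3.25 or newer, which most current Python builds bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_view (id INTEGER, thing_to_sum INTEGER)")
conn.executemany(
    "INSERT INTO table_view VALUES (?, ?)",
    [(1, 5), (1, 2), (2, 4), (2, 3), (2, 1)],
)

# SUM(...) OVER (PARTITION BY id) keeps every row and repeats the group total,
# unlike GROUP BY, which collapses each group to a single row.
rows = conn.execute("""
SELECT id, thing_to_sum,
       SUM(thing_to_sum) OVER (PARTITION BY id) AS sum_column
FROM table_view
ORDER BY id, thing_to_sum DESC
""").fetchall()

for row in rows:
    print(row)
```

This is also why the original attempt fails: groupBy(...).agg(...) returns a whole DataFrame with one row per id, whereas withColumn expects a Column expression such as the windowed sum.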

Compare two dataframes and update the values

I have two dataframes like following.
val file1 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file1.csv")
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
+---+-------+-----+-----+-------+
val file2 = spark.read.format("csv").option("sep", ",").option("inferSchema", "true").option("header", "true").load("file2.csv")
file2.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 70| 5| 0|
+---+-------+-----+-----+-------+
Now I am comparing two dataframes and filtering out the mismatch values like this.
val columns = file1.schema.fields.map(_.name)
val selectiveDifferences = columns.map(col => file1.select(col).except(file2.select(col)))
selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})
+-----+
|mark1|
+-----+
| 10|
+-----+
For each mismatch, I need to add an extra row to the first dataframe that holds the values from the second dataframe, with a new id and an updated version number, like this.
file1.show()
+---+-------+-----+-----+-------+
| id| name|mark1|mark2|version|
+---+-------+-----+-----+-------+
| 1| Priya | 80| 99| 0|
| 2| Teju | 10| 5| 0|
| 3| Teju | 70| 5| 1|
+---+-------+-----+-----+-------+
I am struggling to achieve the above step and it is my expected output. Any help would be appreciated.
You can get your final dataframe by using except and union as follows:
val count = file1.count()
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
file1.union(file2.except(file1)
.withColumn("version", lit(1)) //changing the version
.withColumn("id", (row_number().over(Window.orderBy("id")))+lit(count)) //changing the id number
)
The lit and row_number functions, together with a window, are used to generate the new id and version values.
Note: using a window function without partitioning to generate the new id makes the process inefficient, because all the rows being numbered are moved into a single partition.