Unable to merge the data from two tables - PostgreSQL

I have two tables. The first one contains all the records, and the second table contains only a few records. I am trying to use a left join to get all the records of the first table, together with the data columns of the records that match in the second table.
Table A
id_step|id_workflow|id_action|
-------+-----------+---------+
1| 1| 11|
6| 1| 11|
7| 1| 11|
8| 1| 12|
9| 1| 12|
10| 1| 12|
Table B
id_step|id_client|id_process|id_workflow|is_approved|action_by|action_date|
-------+---------+----------+-----------+-----------+---------+-----------+
1| 10680| 10| 1|true | | |
I am looking to get the result below.
Expected Output
id_step|id_client|id_workflow|id_action|is_approved|action_by|action_date|
-------+---------+-----------+---------+-----------+---------+-----------+
1| 10680| 1| 11|true | | |
6| 10680| 1| 11|pending | | |
7| 10680| 1| 11|pending | | |
8| 10680| 1| 12|pending | | |
9| 10680| 1| 12|pending | | |
10| 10680| 1| 12|pending | | |
I tried this query:
select
x.id_step,
y.id_client,
x.id_workflow,
x.id_action,
(case when y.is_approved is null then 'pending' else y.is_approved::text end ) as is_approved,
y.action_date,
y.action_by
from nw_adsys_wfx_config as x
left join nw_adsys_cli_wfx_process as y
on x.id_step = y.id_step
where x.id_workflow = 1
and y.id_process = 10
but I am getting this output:
id_step|id_client|id_workflow|id_action|is_approved|action_date|action_by|
-------+---------+-----------+---------+-----------+-----------+---------+
1| 10680| 1| 11|true | | |

The query contains a left join, but the where clause also filters on the right-hand table (y.id_process = 10), which defeats the intent of the left join: for the rows of x that have no match in y, y.id_process is NULL, so the filter discards them and the join behaves like an inner join.
Instead, move that condition into the join. The unmatched rows of x are then kept with NULL values in the y columns, and your CASE expression returns 'pending' for them:
...
from nw_adsys_wfx_config as x
left join nw_adsys_cli_wfx_process as y
on x.id_step = y.id_step
AND y.id_process = 10
where x.id_workflow = 1

Related

Create a new column that marks customers

My goal is to aggregate over customerID (count), create a new column, and mark the customers that often return articles. How can I do that? (using Databricks, PySpark)
train.select("itemID","customerID","returnShipment").show(10)
+------+----------+--------------+
|itemID|customerID|returnShipment|
+------+----------+--------------+
| 186| 794| 0|
| 71| 794| 1|
| 71| 794| 1|
| 32| 850| 1|
| 32| 850| 1|
| 57| 850| 1|
| 2| 850| 1|
| 259| 850| 1|
| 603| 850| 1|
| 259| 850| 1|
+------+----------+--------------+
You can define a threshold value and then compare it to the sum of returnShipment for each customerID:
from pyspark.sql import functions as F

threshold = 5
df.groupBy("customerID") \
    .sum("returnShipment") \
    .withColumn("mark", F.col("sum(returnShipment)") > threshold) \
    .show()
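For the sample in the question (where the DataFrame is called train), customer 794 has 0+1+1 = 2 returns and customer 850 has 7, so with threshold = 5 only customer 850 gets marked. A slightly tidier sketch of the same idea, aliasing the aggregate so the column name stays readable (the alias total_returns is just illustrative):
from pyspark.sql import functions as F

threshold = 5
train.groupBy("customerID") \
    .agg(F.sum("returnShipment").alias("total_returns")) \
    .withColumn("mark", F.col("total_returns") > threshold) \
    .show()
# Expected for the sample rows above, roughly:
# +----------+-------------+-----+
# |customerID|total_returns| mark|
# +----------+-------------+-----+
# |       794|            2|false|
# |       850|            7| true|
# +----------+-------------+-----+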

Pyspark combine different rows based on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports.
Any help would be appreciated.
You need to group by the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))
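If you also want the aggregated column to be called count, as in the expected output, you can alias the aggregate; a minimal sketch, assuming the DataFrame above is df (grouped_df is just an illustrative name):
import pyspark.sql.functions as F

# sum the count column for each sport and rename the result back to "count"
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))
grouped_df.show()
For Alpine Skiing this gives 4 + 18 + 1 + 38 = 61, matching the expected output.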

Scala Spark Incrementing a column based on another column in dataframe without for loops

I have a dataframe like the one below. I want a new column called cutofftype that, instead of the current monotonically increasing number, resets to 1 every time the ID column changes.
df = df.orderBy("ID","date").withColumn("cutofftype",monotonically_increasing_id()+1)
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 5|
| 54500| 2016-05-09| 6|
| 54500| 2016-05-16| 7|
| 54500| 2016-05-23| 8|
| 54500| 2016-06-06| 9|
| 54500| 2016-06-13| 10|
+------+---------------+----------+
The target is shown below:
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 1|
| 54500| 2016-05-09| 2|
| 54500| 2016-05-16| 3|
| 54500| 2016-05-23| 4|
| 54500| 2016-06-06| 5|
| 54500| 2016-06-13| 6|
+------+---------------+----------+
I know this can be done with for loops; I want to do it without for loops. Is there a way out?
This is a simple partition-by problem; you should use a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("ID").orderBy("date")
df.withColumn("cutofftype", row_number().over(w)).show()
+-----+----------+----------+
| ID| date|cutofftype|
+-----+----------+----------+
|54500|2016-05-02| 1|
|54500|2016-05-09| 2|
|54500|2016-05-16| 3|
|54500|2016-05-23| 4|
|54500|2016-06-06| 5|
|54500|2016-06-13| 6|
|54441|2016-06-20| 1|
|54441|2016-06-27| 2|
|54441|2016-07-04| 3|
|54441|2016-07-11| 4|
+-----+----------+----------+

Sum over another column returning 'col should be Column' error

I'm trying to add a new column that shows, for each row, the sum of a double column (the thing_to_sum column) over all rows with the same value in the id column. This, however, is currently throwing a 'col should be Column' error.
df = df.withColumn('sum_column', (df.groupBy('id').agg({'thing_to_sum': 'sum'})))
Example Data Set:
| id | thing_to_sum | sum_column |
|----|--------------|------------|
| 1 | 5 | 7 |
| 1 | 2 | 7 |
| 2 | 4 | 4 |
Any help on this would be greatly appreciated.
Also any reference on the most efficient way to do this would also be appreciated.
You can register any DataFrame as a temporary table to query it via SQLContext.sql.
myValues = [(1,5),(1,2),(2,4),(2,3),(2,1)]
df = sqlContext.createDataFrame(myValues,['id','thing_to_sum'])
df.show()
+---+------------+
| id|thing_to_sum|
+---+------------+
| 1| 5|
| 1| 2|
| 2| 4|
| 2| 3|
| 2| 1|
+---+------------+
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select id, thing_to_sum, sum(thing_to_sum) over (partition by id) as sum_column from table_view'
)
df1.show()
+---+------------+----------+
| id|thing_to_sum|sum_column|
+---+------------+----------+
| 1| 5| 7|
| 1| 2| 7|
| 2| 4| 8|
| 2| 3| 8|
| 2| 1| 8|
+---+------------+----------+
I think I found the solution to my own question, but advice would still be appreciated:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
sum_calc = F.sum(df.thing_to_sum).over(Window.partitionBy("id"))
df = df.withColumn("sum_column", sum_calc)
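For comparison, the groupBy idea from the original attempt can also work, but the aggregate has to be joined back onto the rows; withColumn expects a Column, not a DataFrame, which is what triggers the 'col should be Column' error. A small sketch, assuming the same df as above (sums and df_with_sum are just illustrative names):
from pyspark.sql import functions as F

# aggregate per id, then join the per-id sums back onto the original rows
sums = df.groupBy('id').agg(F.sum('thing_to_sum').alias('sum_column'))
df_with_sum = df.join(sums, on='id', how='left')
df_with_sum.show()
The window-function version avoids the extra join, so it is usually the more direct way to express this.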

PySpark: counting rows based on current row value

I have a DataFrame with a column "Speed". Can I efficiently add a column containing, for each row, the number of rows in the DataFrame whose "Speed" is within +/-2 of that row's "Speed"?
results = spark.createDataFrame([[1],[2],[3],[4],[5],
                                 [4],[5],[4],[5],[6],
                                 [5],[6],[1],[3],[8],
                                 [2],[5],[6],[10],[12]],
                                ['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order the window by Speed and look at the range [0, +2]
w = Window.orderBy('Speed').rangeBetween(0, 2)
# Count, for each row, the rows whose Speed falls in [Speed, Speed + 2]
results = results.withColumn('count+2', F.count('Speed').over(w)).orderBy('Speed')
results.show()
+-----+-------+
|Speed|count+2|
+-----+-------+
|    1|      6|
|    1|      6|
|    2|      7|
|    2|      7|
|    3|     10|
|    3|     10|
|    4|     11|
|    4|     11|
|    4|     11|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    6|      4|
|    6|      4|
|    6|      4|
|    8|      2|
|   10|      2|
|   12|      1|
+-----+-------+
Note: the window function counts the studied row itself. You can correct this by subtracting 1 from the count:
results = results.withColumn('count+2', F.count('Speed').over(w) - 1).orderBy('Speed')
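If the intent is a symmetric window, i.e. counting rows whose Speed lies within +/-2 in both directions, the same approach works with a symmetric range; a minimal sketch, assuming the results DataFrame above (count_pm2 is just an illustrative column name):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count rows with Speed in [Speed - 2, Speed + 2], excluding the row itself
w_sym = Window.orderBy('Speed').rangeBetween(-2, 2)
results = results.withColumn('count_pm2', F.count('Speed').over(w_sym) - 1).orderBy('Speed')
results.show()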