Sum over another column returning 'col should be column error' - pyspark

I'm trying to add a new column that shows the sum of a double column (the thing_to_sum column) for each respective value in the id column. The code below, however, is currently throwing the error 'col should be column':
df = df.withColumn('sum_column', (df.groupBy('id').agg({'thing_to_sum': 'sum'})))
Example Data Set:
| id | thing_to_sum | sum_column |
|----|--------------|------------|
| 1  | 5            | 7          |
| 1  | 2            | 7          |
| 2  | 4            | 4          |
Any help on this would be greatly appreciated.
Any pointers on the most efficient way to do this would also be appreciated.

You can register any DataFrame as a temporary table to query it via SQLContext.sql.
myValues = [(1,5),(1,2),(2,4),(2,3),(2,1)]
df = sqlContext.createDataFrame(myValues,['id','thing_to_sum'])
df.show()
+---+------------+
| id|thing_to_sum|
+---+------------+
| 1| 5|
| 1| 2|
| 2| 4|
| 2| 3|
| 2| 1|
+---+------------+
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select id, thing_to_sum, sum(thing_to_sum) over (partition by id) as sum_column from table_view'
)
df1.show()
+---+------------+----------+
| id|thing_to_sum|sum_column|
+---+------------+----------+
| 1| 5| 7|
| 1| 2| 7|
| 2| 4| 8|
| 2| 3| 8|
| 2| 1| 8|
+---+------------+----------+
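Note that in Spark 2.x and later the same thing is usually written with createOrReplaceTempView and a SparkSession rather than registerTempTable/sqlContext (a sketch, assuming a SparkSession named spark):
# register the DataFrame as a view and query it through the SparkSession
df.createOrReplaceTempView('table_view')
df1 = spark.sql(
    'select id, thing_to_sum, sum(thing_to_sum) over (partition by id) as sum_column from table_view'
)
df1.show()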

I think I found the solution to my own question, but advice would still be appreciated:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
sum_calc = F.sum(df.thing_to_sum).over(Window.partitionBy("id"))
df = df.withColumn("sum_column", sum_calc)
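On the efficiency question: the same result can also be obtained without a window by aggregating once per id and joining the totals back onto the original rows (a minimal sketch using the df above; which approach is faster depends on your data and its skew):
from pyspark.sql import functions as F

# aggregate once per id, then join the totals back onto the original rows
sums = df.groupBy('id').agg(F.sum('thing_to_sum').alias('sum_column'))
df_with_sum = df.join(sums, on='id', how='left')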

Related

Unable to merge the data from two tables

I have two tables: the first one contains all the records, and the second table contains only a few records. I am trying to use a left join to get all the records of the first table, plus the data columns for the records that match in the second table.
Table A
id_step|id_workflow|id_action|
-------+-----------+---------+
1| 1| 11|
6| 1| 11|
7| 1| 11|
8| 1| 12|
9| 1| 12|
10| 1| 12|
Table B
id_step|id_client|id_process|id_workflow|is_approved|action_by|action_date|
-------+---------+----------+-----------+-----------+---------+-----------+
1| 10680| 10| 1|true | | |
I am looking to get the below result
Expected Output
id_step|id_client|id_workflow|id_action|is_approved|action_by|action_date|
-------+---------+-----------+---------+-----------+---------+-----------+
1| 10680| 1| 11|true | | |
6| 10680| 1| 11|pending | | |
7| 10680| 1| 11|pending | | |
8| 10680| 1| 11|pending | | |
9| 10680| 1| 11|pending | | |
10| 10680| 1| 11|pending | | |
I tried this query:
select
    x.id_step,
    y.id_client,
    x.id_workflow,
    x.id_action,
    (case when y.is_approved is null then 'pending' else y.is_approved::text end) as is_approved,
    y.action_date,
    y.action_by
from nw_adsys_wfx_config as x
left join nw_adsys_cli_wfx_process as y
    on x.id_step = y.id_step
where x.id_workflow = 1
    and y.id_process = 10
but I am getting this output:
id_step|id_client|id_workflow|id_action|is_approved|action_date|action_by|
-------+---------+-----------+---------+-----------+-----------+---------+
1| 10680| 1| 11|true | | |
The query contains a left join, but the where clause also filters on the right-hand table (y.id_process = 10), which defeats the intent of the left join: rows from the first table with no match have y.id_process = null and are filtered out.
Instead, you can move that condition into the join:
...
from nw_adsys_wfx_config as x
left join nw_adsys_cli_wfx_process as y
    on x.id_step = y.id_step
    and y.id_process = 10
where x.id_workflow = 1
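For the PySpark readers: the same fix applies with the DataFrame API, where the extra predicate belongs in the join condition rather than in a later filter (a sketch with hypothetical DataFrames dfA and dfB standing in for the two tables):
# dfA and dfB are hypothetical DataFrames standing in for the two tables
joined = dfA.join(
    dfB,
    (dfA.id_step == dfB.id_step) & (dfB.id_process == 10),  # extra predicate lives in the join
    'left'
).where(dfA.id_workflow == 1)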

How to solve the following issue with Apache Spark with an optimal solution

I need to solve the following problem without GraphFrames. Please help.
Input Dataframe
|-----------+-----------+--------------|
| ID | prev | next |
|-----------+-----------+--------------|
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 2 | null |
| 9 | 9 | null |
|-----------+-----------+--------------|
output dataframe
|-----------+------------|
| bill_id | item_id |
|-----------+------------|
| 1 | [1, 2, 3] |
| 9 | [9] |
|-----------+------------|
This is probably quite inefficient, but it works. It is inspired by how GraphFrames computes connected components: repeatedly join the DataFrame with itself on the prev column until prev stops getting any lower, then group.
df = sc.parallelize([(1, 1, 2), (2, 1, 3), (3, 2, None), (9, 9, None)]).toDF(['ID', 'prev', 'next'])
df.show()
+---+----+----+
| ID|prev|next|
+---+----+----+
| 1| 1| 2|
| 2| 1| 3|
| 3| 2|null|
| 9| 9|null|
+---+----+----+
converged = False
count = 0
while not converged:
    step = df.join(df.selectExpr('ID as prev', 'prev as lower_prev'), 'prev', 'left').cache()
    print('step', count)
    step.show()
    converged = step.where('prev != lower_prev').count() == 0
    df = step.selectExpr('ID', 'lower_prev as prev')
    print('df', count)
    df.show()
    count += 1
step 0
+----+---+----+----------+
|prev| ID|next|lower_prev|
+----+---+----+----------+
| 2| 3|null| 1|
| 1| 2| 3| 1|
| 1| 1| 2| 1|
| 9| 9|null| 9|
+----+---+----+----------+
df 0
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
step 1
+----+---+----------+
|prev| ID|lower_prev|
+----+---+----------+
| 1| 3| 1|
| 1| 1| 1|
| 1| 2| 1|
| 9| 9| 9|
+----+---+----------+
df 1
+---+----+
| ID|prev|
+---+----+
| 3| 1|
| 1| 1|
| 2| 1|
| 9| 9|
+---+----+
import pyspark.sql.functions as F
df.groupBy('prev').agg(F.collect_set('ID').alias('item_id')).withColumnRenamed('prev', 'bill_id').show()
+-------+---------+
|bill_id| item_id|
+-------+---------+
| 1|[1, 2, 3]|
| 9| [9]|
+-------+---------+
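One caveat: collect_set does not guarantee the order of the resulting array, so if you want the item_id lists sorted as in the expected output you can wrap it in sort_array (a small sketch on the same df):
import pyspark.sql.functions as F

# sort_array gives each item_id list a deterministic ascending order
result = (df.groupBy('prev')
            .agg(F.sort_array(F.collect_set('ID')).alias('item_id'))
            .withColumnRenamed('prev', 'bill_id'))
result.show()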

Pyspark get count in aggregate table

I have a table that looks like this:
+-------------+-----+
| PULocationID| fare|
+-------------+-----+
| 1| 5|
| 1| 15|
| 2| 2|
+-------------+-----+
I want to get a table that looks like this:
+-------------+----------+------+
| PULocationID| avg_fare | count|
+-------------+----------+------+
| 1| 10| 2|
| 2| 2| 1|
+-------------+----------+------+
Here is what I'm trying:
result_table = trips.groupBy("PULocationID") \
    .agg(
        {"total_amount": "avg"},
        {"PULocationID": "count"}
    )
If I take out the count line, it works fine and returns the avg column, but I also need the count of how many rows had that particular PULocationID.
NOTE: I can't add any imports other than from pyspark.sql.functions import col.
Thanks for the help!
I was so close, I was just formatting it as two dictionaries instead of one.
result_table = trips.groupBy("PULocationID") \
    .agg(
        {"total_amount": "avg", "PULocationID": "count"}
    )
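One thing to be aware of: with the dictionary form, Spark names the output columns after the aggregate expressions (typically avg(total_amount) and count(PULocationID)), so you may want to rename them afterwards, which needs no extra imports (a sketch):
# the dictionary form generates names like avg(total_amount); rename them afterwards
result_table = (trips.groupBy("PULocationID")
    .agg({"total_amount": "avg", "PULocationID": "count"})
    .withColumnRenamed("avg(total_amount)", "avg_fare")
    .withColumnRenamed("count(PULocationID)", "count"))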
This should be the working solution for you - use avg() and count()
import pyspark.sql.functions as F

df = spark.createDataFrame([(1, 5), (1, 15), (2, 2)], ["PULocationID", "fare"])
df.show()
df_group = df.groupBy("PULocationID").agg(F.avg("fare").alias("avg_fare"), F.count("PULocationID").alias("count"))
df_group.show()
Input
+------------+----+
|PULocationID|fare|
+------------+----+
| 1| 5|
| 1| 15|
| 2| 2|
+------------+----+
Output
+------------+--------+-----+
|PULocationID|avg_fare|count|
+------------+--------+-----+
| 1| 10.0| 2|
| 2| 2.0| 1|
+------------+--------+-----+

Scala Spark Incrementing a column based on another column in dataframe without for loops

I have a dataframe like the one below. I want a new column called cutofftype which, instead of the current monotonically increasing number, should reset to 1 every time the ID column changes.
df = df.orderBy("ID","date").withColumn("cutofftype",monotonically_increasing_id()+1)
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 5|
| 54500| 2016-05-09| 6|
| 54500| 2016-05-16| 7|
| 54500| 2016-05-23| 8|
| 54500| 2016-06-06| 9|
| 54500| 2016-06-13| 10|
+------+---------------+----------+
Target is this as below :
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 1|
| 54500| 2016-05-09| 2|
| 54500| 2016-05-16| 3|
| 54500| 2016-05-23| 4|
| 54500| 2016-06-06| 5|
| 54500| 2016-06-13| 6|
+------+---------------+----------+
I know this can be done with for loops, but I want to do it without them. Is there a way?
This is a simple partition-by problem; you should use a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy("ID").orderBy("date")
df.withColumn("cutofftype", row_number().over(w)).show()
+-----+----------+----------+
| ID| date|cutofftype|
+-----+----------+----------+
|54500|2016-05-02| 1|
|54500|2016-05-09| 2|
|54500|2016-05-16| 3|
|54500|2016-05-23| 4|
|54500|2016-06-06| 5|
|54500|2016-06-13| 6|
|54441|2016-06-20| 1|
|54441|2016-06-27| 2|
|54441|2016-07-04| 3|
|54441|2016-07-11| 4|
+-----+----------+----------+
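Since the rest of these questions use PySpark, the equivalent there would look roughly like this (a sketch, assuming df has the ID and date columns shown above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number restarts at 1 within each ID partition, ordered by date
w = Window.partitionBy("ID").orderBy("date")
df = df.withColumn("cutofftype", F.row_number().over(w))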

DataFrame reduce by

I need help converting multiple rows into a single row by keys; advice on using group by is appreciated. Using pyspark version 2.
l = [(1, 1, '', 'add1'),
     (1, 1, 'name1', ''),
     (1, 2, '', 'add2'),
     (1, 2, 'name2', ''),
     (2, 1, '', 'add21'),
     (2, 1, 'name21', ''),
     (2, 2, '', 'add22'),
     (2, 2, 'name22', '')]
df = sqlContext.createDataFrame(l, ['Key1', 'Key2','Name', 'Address'])
df.show()
+----+----+------+-------+
|Key1|Key2| Name|Address|
+----+----+------+-------+
| 1| 1| | add1|
| 1| 1| name1| |
| 1| 2| | add2|
| 1| 2| name2| |
| 2| 1| | add21|
| 2| 1|name21| |
| 2| 2| | add22|
| 2| 2|name22| |
+----+----+------+-------+
I am stuck looking for output like
+----+----+------+-------+
|Key1|Key2| Name|Address|
+----+----+------+-------+
| 1| 1| name1 | add1|
| 1| 2| name2 | add2|
| 2| 1| name21| add21|
| 2| 2| name22| add22|
+----+----+------+-------+
Group by Key1 and Key2, and take the maximum value of Name and Address. This works because an empty string sorts before any non-empty string, so max picks the populated value in each group:
import pyspark.sql.functions as F

df.groupBy(['Key1', 'Key2']).agg(
    F.max(df.Name).alias('Name'),
    F.max(df.Address).alias('Address')
).show()
+----+----+------+-------+
|Key1|Key2| Name|Address|
+----+----+------+-------+
| 1| 1| name1| add1|
| 2| 2|name22| add22|
| 1| 2| name2| add2|
| 2| 1|name21| add21|
+----+----+------+-------+
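groupBy does not guarantee row order, so if you want the output sorted like the expected table you can add an orderBy to the same aggregation (a small sketch):
import pyspark.sql.functions as F

# same aggregation, sorted by the grouping keys for readability
(df.groupBy(['Key1', 'Key2'])
   .agg(F.max(df.Name).alias('Name'), F.max(df.Address).alias('Address'))
   .orderBy('Key1', 'Key2')
   .show())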