I'm using DataFrames in PySpark. I have a table like Table 1 below, and I need to obtain Table 2, where:
num_category - how many different categories there are for each id
sum(count) - the sum of the third column in Table 1 for each id.
Example:
Table 1
id |category | count
1 | 4 | 1
1 | 3 | 2
1 | 1 | 2
2 | 2 | 1
2 | 1 | 1
Table 2
id |num_category| sum(count)
1 | 3 | 5
2 | 2 | 2
I tried:
table1 = data.groupBy("id","category").agg(count("*"))
cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
Error:
1 table1 = data.groupBy("id","category").agg(count("*"))
---> 2 cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
TypeError: 'DataFrame' object is not callable
The TypeError comes from the name count: once count = table1.groupBy(...) has been executed, count refers to a DataFrame and shadows the count function, so count("*") tries to call a DataFrame. You don't need the intermediate DataFrames or the join anyway; you can do multiple column aggregations on a single grouped DataFrame:
data.groupby('id').agg({'category':'count','count':'sum'}).withColumnRenamed('count(category)',"num_category").show()
+---+------------+----------+
| id|num_category|sum(count)|
+---+------------+----------+
|  1|           3|         5|
|  2|           2|         2|
+---+------------+----------+
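If you prefer explicit column expressions and aliases, a minimal sketch along the same lines (using the data DataFrame from the question; countDistinct also keeps num_category correct if an (id, category) pair were ever to appear more than once):
from pyspark.sql import functions as F

table2 = (
    data.groupBy("id")
        .agg(
            F.countDistinct("category").alias("num_category"),  # distinct categories per id
            F.sum("count").alias("sum(count)"),                 # total of the count column per id
        )
)
table2.show()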
I want to pivot a column and then rank the data from the pivoted column. Here is sample data:
| id | objective | metric | score |
|----|-----------|-------------|-------|
| 1 | Sales | Total Sales | 10 |
| 1 | Marketing | Total Reach | 4 |
| 2 | Sales | Total Sales | 2 |
| 2 | Marketing | Total Reach | 11 |
| 3 | Sales | Total Sales | 9 |
This would be my expected output after pivot + rank:
| id | Sales | Marketing |
|----|--------|-----------|
| 1 | 1 | 2 |
| 2 | 3 | 1 |
| 3 | 2 | 3 |
The ranking is based on sum(score) for each objective. An objective can also have multiple metrics, but that isn't included in the sample for simplicity.
I have been able to successfully pivot and sum the scores like so:
pivot = (
    spark.table('scoring_table')
    .select('id', 'objective', 'metric', 'score')
    .groupBy('id')
    .pivot('objective')
    .agg(
        sf.sum('score').alias('score')
    )
)
This then lets me see the total score per objective, but I'm unsure how to rank these. I have tried the following after aggregation:
.withColumn('rank', rank().over(Window.partitionBy('id', 'objective').orderBy(sf.col('score').desc())))
However, objective is no longer available at this point, as it has been pivoted away. I then tried this instead:
.withColumn('rank', rank().over(Window.partitionBy('id', 'Sales', 'Marketing').orderBy(sf.col('score').desc())))
But the score column is also no longer available at that point. How can I rank these scores after pivoting the data?
You just need to order by the score after pivot:
from pyspark.sql import functions as F, Window
df2 = df.groupBy('id').pivot('objective').agg(F.sum('score')).fillna(0)
df3 = df2.select(
    'id',
    *[F.rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)
df3.show()
+---+---------+-----+
| id|Marketing|Sales|
+---+---------+-----+
| 2| 1| 3|
| 1| 2| 1|
| 3| 3| 2|
+---+---------+-----+
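One caveat: Window.orderBy without a partitionBy pulls all rows into a single partition, so for large data you may prefer to rank before pivoting. A rough sketch under the same assumed column names (note that an id missing a given objective ends up with a null rank here rather than the lowest rank):
from pyspark.sql import functions as F, Window

# Rank ids within each objective by their total score (partitioned window).
w = Window.partitionBy('objective').orderBy(F.desc('score'))
ranked = (
    df.groupBy('id', 'objective')
      .agg(F.sum('score').alias('score'))
      .withColumn('rank', F.rank().over(w))
)
# Pivot the precomputed rank into one column per objective.
ranked.groupBy('id').pivot('objective').agg(F.first('rank')).show()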
I have a DataFrame of users' preferences:
+-------+-----+-----+-----+
|user_id|Movie|Music|Books|
+-------+-----+-----+-----+
| 100 | 0 | 1 | 2 |
| 101 | 3 | 1 | 4 |
+-------+-----+-----+-----+
How do I 1) compute the total sum of each row (user) and 2) divide each value by that sum, so that I get normalized preference values:
+-------+------+-------+-------+
|user_id| Movie| Music | Books |
+-------+------+-------+-------+
| 100   | 0    | 0.33..| 0.66..|
| 101   |0.37..| 0.12..| 0.5   |
+-------+------+-------+-------+
# get column names that need to be normalized
cols = [col for col in df.columns if col != 'user_id']
# sum the columns by row
rowsum = sum([df[x] for x in cols])
# select user_id and normalize other columns by rowsum
df.select('user_id', *((df[x] / rowsum).alias(x) for x in cols)).show()
+-------+-----+------------------+------------------+
|user_id|Movie| Music| Books|
+-------+-----+------------------+------------------+
| 100| 0.0|0.3333333333333333|0.6666666666666666|
| 101|0.375| 0.125| 0.5|
+-------+-----+------------------+------------------+
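As an aside, if a user's row could be all zeros, the division above yields null for that user (division by a zero total). A small sketch of a guard, assuming the same df and column layout as above:
from pyspark.sql import functions as F

cols = [c for c in df.columns if c != 'user_id']
total = sum(df[c] for c in cols)  # row-wise total as a Column expression
# Map a zero total to 0.0 explicitly instead of the null produced by dividing by zero.
df.select(
    'user_id',
    *[F.when(total != 0, df[c] / total).otherwise(F.lit(0.0)).alias(c) for c in cols]
).show()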
I have a dataframe such as:
id | value | date1 | date2
-------------------------------------
1 | 20 | 2015-09-01 | 2018-03-01
1 | 30 | 2019-04-04 | 2015-03-02
1 | 40 | 2014-01-01 | 2016-06-09
2 | 15 | 2014-01-01 | 2013-06-01
2 | 25 | 2019-07-18 | 2016-07-07
and want to return, for each id, the sum(value) over rows where date1 < max(date2) for that id. In the above example we would get:
id | sum_value
-----------
1 | 60
2 | 15
since for id 1 the max(date2) is 2018-03-01, the first and third rows fit the condition date1 < max(date2), and therefore the result is the sum of 20 and 40.
I have tried the code below, but max can't be used outside of an aggregation.
df.withColumn('sum_value',F.when(F.col('date1')<F.max(F.col('date2')), value).otherwise(0))
.groupby(['id'])
Do you have any suggestions? The table has 2 billion rows, so I am looking for options other than re-joining.
You can use a window function. A direct translation of your requirements would be:
from pyspark.sql.functions import col, max as _max, sum as _sum
from pyspark.sql import Window
df.withColumn("max_date2", _max("date2").over(Window.partitionBy("id")))\
.where(col("date1") < col("max_date2"))\
.groupBy("id")\
.agg(_sum("value").alias("sum_value"))\
.show()
#+---+---------+
#| id|sum_value|
#+---+---------+
#| 1| 60.0|
#| 2| 15.0|
#+---+---------+
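If you would rather keep every id even when none of its rows satisfy the condition (such ids would get sum_value 0 instead of disappearing with the where filter), a variant closer to your original when/otherwise attempt, assuming the same df, could fold the condition into the aggregation:
from pyspark.sql.functions import col, max as _max, sum as _sum, when
from pyspark.sql import Window

w = Window.partitionBy("id")
result = (
    df.withColumn("max_date2", _max("date2").over(w))           # per-id maximum of date2
      .groupBy("id")
      .agg(_sum(when(col("date1") < col("max_date2"), col("value"))
                .otherwise(0)).alias("sum_value"))               # conditional sum, 0 when no row qualifies
)
result.show()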
I have a table as such (tbl):
+----+------+-----+
| pk | attr | val |
+----+------+-----+
| 0 | ohif | 4 |
| 1 | foha | 56 |
| 2 | slns | 2 |
| 3 | faso | 11 |
+----+------+-----+
And another table in n-to-1 relationship with tbl (tbl2):
+----+-----+
| pk | rel |
+----+-----+
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
| 4 | 2 |
| 5 | 3 |
| 6 | 1 |
| 7 | 2 |
+----+-----+
(tbl2.rel -> tbl.pk.)
I would like to select only the rows from tbl which are in relationship with at least n rows from tbl2.
I.e., for n = 2, I want this table:
+----+------+-----+
| pk | attr | val |
+----+------+-----+
| 0 | ohif | 4 |
| 1 | foha | 56 |
| 2 | slns | 2 |
+----+------+-----+
This is the solution I came up with:
SELECT DISTINCT ON (tbl.pk) tbl.*
FROM (
SELECT tbl.pk
FROM tbl
RIGHT OUTER JOIN tbl2 ON tbl2.rel = tbl.pk
GROUP BY tbl.pk
HAVING COUNT(tbl2.*) >= 2 -- n
) AS tbl_candidates
LEFT OUTER JOIN tbl ON tbl_candidates.pk = tbl.pk
Can it be done without selecting the candidates with a subquery and re-joining the table with itself?
I'm on Postgres 10. A standard SQL solution would be better, but a Postgres solution is acceptable.
OK, just join once, as below:
select t1.pk, t1.attr, t1.val
from tbl t1
join tbl2 t2 on t1.pk = t2.rel
group by t1.pk, t1.attr, t1.val
having count(1) >= 2  -- n
order by t1.pk;
pk | attr | val
----+------+-----
0 | ohif | 4
1 | foha | 56
2 | slns | 2
(3 rows)
Or join once and use a CTE (WITH clause), as below:
with tmp as (
    select rel from tbl2 group by rel having count(1) >= 2
)
select b.* from tmp t join tbl b on t.rel = b.pk order by b.pk;
pk | attr | val
----+------+-----
0 | ohif | 4
1 | foha | 56
2 | slns | 2
(3 rows)
Is the SQL clearer?
Say I have a table with two id columns (id1, id2):
+-----+----+
| id1 | id2|
+-----+----+
| 1 | 3 |
| 2 | 4 |
| 3 | 1 |
| 4 | 2 |
| 5 | 0 |
+-----+----+
How do I remove the (a,b) ~ (b,a) rows so that I get
+-----+----+
| id1 | id2|
+-----+----+
| 1 | 3 |
| 2 | 4 |
| 5 | 0 |
+-----+----+
It doesn't matter whether I end up with (1,3) and (2,4), with (3,1) and (4,2), or with a combination of the two.
I am using PostgreSQL 9.2.
SELECT LEAST(id1, id2), GREATEST(id1, id2)
FROM t
GROUP BY GREATEST(id1, id2), LEAST(id1, id2);
This might not be the most elegant of solutions, but it will work:
DELETE FROM foo
WHERE (id1, id2) IN (
    SELECT f1.id1, f1.id2
    FROM foo f1
    JOIN foo f2 ON (
        f1.id1 = f2.id2 AND
        f1.id2 = f2.id1 AND
        (f1.id1, f1.id2) > (f2.id1, f2.id2))
);
The greater-than condition in the join ensures that only one row of each mirrored pair is removed: for (1,3) and (3,1), only (3,1) satisfies (f1.id1, f1.id2) > (f2.id1, f2.id2), so only that row is deleted.