I'm using DataFrames in PySpark. I have a table like Table 1 below, and I need to obtain Table 2, where:
num_category - how many different categories there are for each id
sum(count) - the sum of the third column in Table 1 for each id.
Example:
Table 1
id |category | count
1 | 4 | 1
1 | 3 | 2
1 | 1 | 2
2 | 2 | 1
2 | 1 | 1
Table 2
id |num_category| sum(count)
1 | 3 | 5
2 | 2 | 2
I tried:
table1 = data.groupBy("id","category").agg(count("*"))
cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
Error:
1 table1 = data.groupBy("id","category").agg(count("*"))
---> 2 cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
TypeError: 'DataFrame' object is not callable
The TypeError comes from the name count: once count = table1.groupBy(...) has been executed, count refers to a DataFrame and shadows the count function, so count("*") tries to call a DataFrame. You don't need the intermediate DataFrames or the join anyway; you can do multiple column aggregations on a single grouped DataFrame:
data.groupby('id').agg({'category':'count','count':'sum'}).withColumnRenamed('count(category)',"num_category").show()
+---+------------+----------+
| id|num_category|sum(count)|
+---+------------+----------+
|  1|           3|         5|
|  2|           2|         2|
+---+------------+----------+
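If you prefer explicit column expressions and aliases, a minimal sketch along the same lines (using the data DataFrame from the question; countDistinct also keeps num_category correct if an (id, category) pair were ever to appear more than once):
from pyspark.sql import functions as F

table2 = (
    data.groupBy("id")
        .agg(
            F.countDistinct("category").alias("num_category"),  # distinct categories per id
            F.sum("count").alias("sum(count)"),                 # total of the count column per id
        )
)
table2.show()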
I want to pivot a column and then rank the data from the pivoted column. Here is sample data:
| id | objective | metric | score |
|----|-----------|-------------|-------|
| 1 | Sales | Total Sales | 10 |
| 1 | Marketing | Total Reach | 4 |
| 2 | Sales | Total Sales | 2 |
| 2 | Marketing | Total Reach | 11 |
| 3 | Sales | Total Sales | 9 |
This would be my expected output after pivot + rank:
| id | Sales | Marketing |
|----|--------|-----------|
| 1 | 1 | 2 |
| 2 | 3 | 1 |
| 3 | 2 | 3 |
The ranking is based on sum(score) for each objective. An objective can also have multiple metrics, but that isn't included in the sample for simplicity.
I have been able to successfully pivot and sum the scores like so:
pivot = (
    spark.table('scoring_table')
    .select('id', 'objective', 'metric', 'score')
    .groupBy('id')
    .pivot('objective')
    .agg(
        sf.sum('score').alias('score')
    )
)
This then lets me see the total score per objective, but I'm unsure how to rank these. I have tried the following after aggregation:
.withColumn('rank', rank().over(Window.partitionBy('id', 'objective').orderBy(sf.col('score').desc())))
However, objective is no longer available at this point, as it has been pivoted away. I then tried this instead:
.withColumn('rank', rank().over(Window.partitionBy('id', 'Sales', 'Marketing').orderBy(sf.col('score').desc())))
But the score column is also no longer available at that point. How can I rank these scores after pivoting the data?
You just need to order by the score after pivot:
from pyspark.sql import functions as F, Window
df2 = df.groupBy('id').pivot('objective').agg(F.sum('score')).fillna(0)
df3 = df2.select(
    'id',
    *[F.rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)
df3.show()
+---+---------+-----+
| id|Marketing|Sales|
+---+---------+-----+
| 2| 1| 3|
| 1| 2| 1|
| 3| 3| 2|
+---+---------+-----+
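One caveat: Window.orderBy without a partitionBy pulls all rows into a single partition, so for large data you may prefer to rank before pivoting. A rough sketch under the same assumed column names (note that an id missing a given objective ends up with a null rank here rather than the lowest rank):
from pyspark.sql import functions as F, Window

# Rank ids within each objective by their total score (partitioned window).
w = Window.partitionBy('objective').orderBy(F.desc('score'))
ranked = (
    df.groupBy('id', 'objective')
      .agg(F.sum('score').alias('score'))
      .withColumn('rank', F.rank().over(w))
)
# Pivot the precomputed rank into one column per objective.
ranked.groupBy('id').pivot('objective').agg(F.first('rank')).show()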
I have a DataFrame of users' preferences:
+-------+-----+-----+-----+
|user_id|Movie|Music|Books|
+-------+-----+-----+-----+
| 100 | 0 | 1 | 2 |
| 101 | 3 | 1 | 4 |
+-------+-----+-----+-----+
How do I 1) compute the total sum of each row (user) and 2) divide each value by that sum, so that I get normalized preference values:
+-------+------+-------+-------+
|user_id| Movie| Music | Books |
+-------+------+-------+-------+
| 100   | 0    | 0.33..| 0.66..|
| 101   |0.37..| 0.12..| 0.5   |
+-------+------+-------+-------+
# get column names that need to be normalized
cols = [col for col in df.columns if col != 'user_id']
# sum the columns by row
rowsum = sum([df[x] for x in cols])
# select user_id and normalize other columns by rowsum
df.select('user_id', *((df[x] / rowsum).alias(x) for x in cols)).show()
+-------+-----+------------------+------------------+
|user_id|Movie| Music| Books|
+-------+-----+------------------+------------------+
| 100| 0.0|0.3333333333333333|0.6666666666666666|
| 101|0.375| 0.125| 0.5|
+-------+-----+------------------+------------------+
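As an aside, if a user's row could be all zeros, the division above yields null for that user (division by a zero total). A small sketch of a guard, assuming the same df and column layout as above:
from pyspark.sql import functions as F

cols = [c for c in df.columns if c != 'user_id']
total = sum(df[c] for c in cols)  # row-wise total as a Column expression
# Map a zero total to 0.0 explicitly instead of the null produced by dividing by zero.
df.select(
    'user_id',
    *[F.when(total != 0, df[c] / total).otherwise(F.lit(0.0)).alias(c) for c in cols]
).show()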
I have a dataframe such as:
id | value | date1 | date2
-------------------------------------
1 | 20 | 2015-09-01 | 2018-03-01
1 | 30 | 2019-04-04 | 2015-03-02
1 | 40 | 2014-01-01 | 2016-06-09
2 | 15 | 2014-01-01 | 2013-06-01
2 | 25 | 2019-07-18 | 2016-07-07
and want to return, for each id, the sum(value) over rows where date1 < max(date2) for that id. In the above example we would get:
id | sum_value
-----------
1 | 60
2 | 15
since for id 1 the max(date2) is 2018-03-01, the first and third rows fit the condition date1 < max(date2), and therefore the result is the sum of 20 and 40.
I have tried the code below, but max can't be used outside of an aggregation.
df.withColumn('sum_value',F.when(F.col('date1')<F.max(F.col('date2')), value).otherwise(0))
.groupby(['id'])
Do you have any suggestions? The table has 2 billion rows, so I am looking for options other than re-joining.
You can use a window function. A direct translation of your requirements would be:
from pyspark.sql.functions import col, max as _max, sum as _sum
from pyspark.sql import Window
df.withColumn("max_date2", _max("date2").over(Window.partitionBy("id")))\
.where(col("date1") < col("max_date2"))\
.groupBy("id")\
.agg(_sum("value").alias("sum_value"))\
.show()
#+---+---------+
#| id|sum_value|
#+---+---------+
#| 1| 60.0|
#| 2| 15.0|
#+---+---------+
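If you would rather keep every id even when none of its rows satisfy the condition (such ids would get sum_value 0 instead of disappearing with the where filter), a variant closer to your original when/otherwise attempt, assuming the same df, could fold the condition into the aggregation:
from pyspark.sql.functions import col, max as _max, sum as _sum, when
from pyspark.sql import Window

w = Window.partitionBy("id")
result = (
    df.withColumn("max_date2", _max("date2").over(w))           # per-id maximum of date2
      .groupBy("id")
      .agg(_sum(when(col("date1") < col("max_date2"), col("value"))
                .otherwise(0)).alias("sum_value"))               # conditional sum, 0 when no row qualifies
)
result.show()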
I have a table as such (tbl):
+----+------+-----+
| pk | attr | val |
+----+------+-----+
| 0 | ohif | 4 |
| 1 | foha | 56 |
| 2 | slns | 2 |
| 3 | faso | 11 |
+----+------+-----+
And another table in n-to-1 relationship with tbl (tbl2):
+----+-----+
| pk | rel |
+----+-----+
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
| 4 | 2 |
| 5 | 3 |
| 6 | 1 |
| 7 | 2 |
+----+-----+
(tbl2.rel -> tbl.pk.)
I would like to select only the rows from tbl which are in relationship with at least n rows from tbl2.
I.e., for n = 2, I want this table:
+----+------+-----+
| pk | attr | val |
+----+------+-----+
| 0 | ohif | 4 |
| 1 | foha | 56 |
| 2 | slns | 2 |
+----+------+-----+
This is the solution I came up with:
SELECT DISTINCT ON (tbl.pk) tbl.*
FROM (
SELECT tbl.pk
FROM tbl
RIGHT OUTER JOIN tbl2 ON tbl2.rel = tbl.pk
GROUP BY tbl.pk
HAVING COUNT(tbl2.*) >= 2 -- n
) AS tbl_candidates
LEFT OUTER JOIN tbl ON tbl_candidates.pk = tbl.pk
Can it be done without selecting the candidates with a subquery and re-joining the table with itself?
I'm on Postgres 10. A standard SQL solution would be better, but a Postgres solution is acceptable.
OK, just join once, as below:
select t1.pk, t1.attr, t1.val
from tbl t1
join tbl2 t2 on t1.pk = t2.rel
group by t1.pk, t1.attr, t1.val
having count(1) >= 2  -- n
order by t1.pk;
pk | attr | val
----+------+-----
0 | ohif | 4
1 | foha | 56
2 | slns | 2
(3 rows)
Or join once and use a CTE (WITH clause), as below:
with tmp as (
    select rel from tbl2 group by rel having count(1) >= 2
)
select b.* from tmp t join tbl b on t.rel = b.pk order by b.pk;
pk | attr | val
----+------+-----
0 | ohif | 4
1 | foha | 56
2 | slns | 2
(3 rows)
Is the SQL clearer?
Say I have a table with two id columns (id1, id2):
+-----+----+
| id1 | id2|
+-----+----+
| 1 | 3 |
| 2 | 4 |
| 3 | 1 |
| 4 | 2 |
| 5 | 0 |
+-----+----+
How do I remove the (a,b) ~ (b,a) rows so that I get
+-----+----+
| id1 | id2|
+-----+----+
| 1 | 3 |
| 2 | 4 |
| 5 | 0 |
+-----+----+
It doesn't matter whether I end up with (1,3) and (2,4), with (3,1) and (4,2), or with a combination of the two.
I am using PostgreSQL 9.2.
SELECT LEAST(id1, id2), GREATEST(id1, id2)
FROM t
GROUP BY GREATEST(id1, id2), LEAST(id1, id2);
This might not be the most elegant of solutions, but it will work:
DELETE FROM foo
WHERE (id1, id2) IN (
    SELECT f1.id1, f1.id2
    FROM foo f1
    JOIN foo f2 ON (
        f1.id1 = f2.id2 AND
        f1.id2 = f2.id1 AND
        (f1.id1, f1.id2) > (f2.id1, f2.id2))
);
The greater-than condition in the join ensures that only one row of each mirrored pair is removed: for (1,3) and (3,1), only (3,1) satisfies (f1.id1, f1.id2) > (f2.id1, f2.id2), so only that row is deleted.