How to show only certain IDs in PySpark with aggregated values?

I have two tables with different IDs and want to join them. How can I display only three specific IDs?
Table 1
sp_date | id | value |
-------------+-----+--------+
2021-05-07 | 15 | 1 |
2021-05-07 | 13 | 3 |
2021-05-07 | 15 | 4 |
2021-04-08 | 13 | 2 |
2021-04-08 | 10 | 8 |
Table 2
sp_date | id | value |
-------------+-----+--------+
2021-05-07 | 17 | 2 |
2021-04-08 | 12 | 7 |
2021-03-01 | 17 | 3 |
Only the IDs 15 and 13 from table 1 and the ID 17 from table 2 should be displayed with their values. id_1 represents ID 15, id_2 represents ID 13, and id_3 represents ID 17. Values that belong to the same ID and the same date should be summed.
Output should be:
date | id_1 | id_2 | id_3 |
-------------+-------+------+-------+
2021-05-07 | 5 | 3 | 2 |
2021-04-08 | | 2 | |
2021-03-01 | | | 3 |
Pyspark Code
p_df = table 1
e_df = table 2
p_df = p_df.join(e_df, on=[
p_df['sp_date'] == e_df['sk_date']])
p_df = p_df.groupBy(
'sp_date', 'sk_date'
).agg(
F.sum(F.when(F.col('id') == 15, F.col('value'))).alias('id_1'),
F.sum(F.when(F.col('id') == 13, F.col('value'))).alias('id_2'),
F.sum(F.when(F.col('id') == 17, F.col('value'))).alias('id_3'),
).orderBy(
F.desc('sp_date'), F.desc('sk_date')
)
With this code I get values that are too high for the respective columns. What am I doing wrong?

If you want to select three specific IDs from the tables, your steps should be as follows:
Filter before the join:
# assumes: from pyspark.sql import functions as F
p_df1 = p_df.filter(F.col('id').isin(13, 15, 17))
e_df1 = e_df.filter(F.col('id').isin(13, 15, 17))
Join on the date column:
p_df = p_df1.join(e_df1, on=[p_df1['sp_date'] == e_df1['sk_date']])
Aggregate: group by date, pivot by id, and use sum(value) as the aggregate function.
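As a cross-check on the "values that are too high" symptom, here is a minimal sketch that replays the sample data in SQLite through Python's sqlite3 module. SQLite is only a stand-in for Spark here, and the table names p and e are made up for the example. It assumes both tables expose the date as sp_date, as in the tables shown in the question. The first query shows how a join on the date repeats the table 2 row once per matching table 1 row. The second stacks the filtered rows with UNION ALL and then applies the same conditional sums as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p (sp_date TEXT, id INTEGER, value INTEGER);  -- table 1
    CREATE TABLE e (sp_date TEXT, id INTEGER, value INTEGER);  -- table 2
    INSERT INTO p VALUES
        ('2021-05-07', 15, 1), ('2021-05-07', 13, 3), ('2021-05-07', 15, 4),
        ('2021-04-08', 13, 2), ('2021-04-08', 10, 8);
    INSERT INTO e VALUES
        ('2021-05-07', 17, 2), ('2021-04-08', 12, 7), ('2021-03-01', 17, 3);
""")

# Join on the date: the single id-17 row for 2021-05-07 is repeated once
# for each of the three matching rows from table 1, so its sum is inflated.
joined_id3 = conn.execute("""
    SELECT SUM(e.value)
    FROM p JOIN e ON p.sp_date = e.sp_date
    WHERE p.id IN (15, 13) AND e.id = 17 AND p.sp_date = '2021-05-07'
""").fetchone()[0]

# Stack the filtered rows instead of joining, then aggregate per date.
stacked = conn.execute("""
    SELECT sp_date,
           SUM(CASE WHEN id = 15 THEN value END) AS id_1,
           SUM(CASE WHEN id = 13 THEN value END) AS id_2,
           SUM(CASE WHEN id = 17 THEN value END) AS id_3
    FROM (
        SELECT sp_date, id, value FROM p WHERE id IN (15, 13)
        UNION ALL
        SELECT sp_date, id, value FROM e WHERE id = 17
    )
    GROUP BY sp_date
    ORDER BY sp_date DESC
""").fetchall()

print(joined_id3)  # 6 instead of the expected 2
for row in stacked:
    print(row)
```

The stacked query reproduces the expected output from the question, with None where the table has an empty cell. In PySpark the same shape would presumably be a unionByName of the two filtered DataFrames followed by the groupBy/agg from the question, but that variant was not run here.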

Related

Get the max value for each column in a table

I have a table for player stats like so:
player_id | game_id | rec | rec_yds | td | pas_att | pas_yds | ...
--------------------------------------------------------
1 | 3 | 1 | 5 | 0 | 3 | 20 |
2 | 3 | 0 | 8 | 1 | 7 | 20 |
3 | 3 | 3 | 9 | 0 | 0 | 0 |
4 | 3 | 5 | 15 | 0 | 0 | 0 |
I want to return the max values for every column in the table except player_id and game_id.
I know I can return the max of one single column by doing something like so:
SELECT MAX(rec) FROM stats
However, this table has almost 30 columns, so I would just be repeating the query below, for all 30 stats, just replacing the name of the stat.
SELECT MAX(rec) as rec FROM stats
This would get tedious really quickly and won't scale.
Is there any way to kind of loop over columns, get every column in the table and return the max value like so:
player_id | game_id | rec | rec_yds | td | pas_att | pas_yds | ...
--------------------------------------------------------
4 | 3 | 5 | 15 | 1 | 7 | 20 |

You can get the maximum of multiple columns in a single query:
SELECT
MAX(rec) AS rec_max,
MAX(rec_yds) AS rec_yds_max,
MAX(td) AS td_max,
MAX(pas_att) AS pas_att_max,
MAX(pas_yds) AS pas_yds_max
FROM stats
However, there is no way to dynamically get an arbitrary number of columns. You could dynamically build the query by loading all column names of the table, then apply conditions such as "except player_id and game_id", but that cannot be done within the query itself.
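Building the query dynamically could look roughly like the sketch below. It uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for whatever database you actually use, so the way the column names are loaded (PRAGMA table_info) is SQLite-specific, and max_per_column is a hypothetical helper name. Identifiers cannot be passed as bound parameters, so the table name must come from trusted code, not from user input.

```python
import sqlite3


def max_per_column(conn, table, exclude=()):
    """Return {column: MAX(column)} for every column of `table` not in `exclude`."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    wanted = [c for c in columns if c not in exclude]
    # Build "MAX(col) AS col" for each remaining column and run a single query.
    select_list = ", ".join(f'MAX("{c}") AS "{c}"' for c in wanted)
    row = conn.execute(f'SELECT {select_list} FROM "{table}"').fetchone()
    return dict(zip(wanted, row))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (player_id, game_id, rec, rec_yds, td, pas_att, pas_yds)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 3, 1, 5, 0, 3, 20),
        (2, 3, 0, 8, 1, 7, 20),
        (3, 3, 3, 9, 0, 0, 0),
        (4, 3, 5, 15, 0, 0, 0),
    ],
)

result = max_per_column(conn, "stats", exclude=("player_id", "game_id"))
print(result)
```

With the four sample rows from the question this gives rec=5, rec_yds=15, td=1, pas_att=7 and pas_yds=20. On other databases the column list would typically come from information_schema.columns instead of the PRAGMA.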

Flattening the Left Join outcome in PostgreSQL

I have two tables, eventtags and filtervalues, structured like this:
eventtags:
event_id, key_id, value_id, event_date
filtervalues:
value_id, key,value, counts_seen
Let's say I have 2 events, each reporting multiple key/value pairs in the eventtags table:
event_id | key_id | value_id | event_date
---------+--------+----------+-----------
1 | 20 | 32 | xx-xx-xxxx
1 | 21 | 34 | xx-xx-xxxx
2 | 20 | 35 | yy-yy-yyyy
2 | 21 | 39 | yy-yy-yyyy
The corresponding filtervalues table has the following data:
values_id | key | value | counts_seen
----------+-------+-------+----------
32 | type | staff | 52
34 | tag | tag1 | 13
35 | type | user | 10
39 | tag | tag2 | 35
Now based on this I tried below query to consolidate the data from two tables
SELECT t.event_id as Event_Id,
DATE (t.event_date) as Event_Date,
v.key as Keys,
v.value as Values
FROM eventtags t
LEFT JOIN filtervalues as v ON t.value_id = v.id
This results in something like this
Event_Id | Keys | Values | Event_Date
---------+--------+----------+-----------
1 | type | staff | xx-xx-xxxx
1 | tag | tag1 | xx-xx-xxxx
2 | type | user | yy-yy-yyyy
2 | tag | tag2 | yy-yy-yyyy
I want the data to be in the below format
Event_Id | type | tag | Event_Date
---------+--------+---------+-----------
1 | staff | tag1 | xx-xx-xxxx
2 | user | tag2 | yy-yy-yyyy
What changes do I need to make on the query above to obtain this format?
Note: I cannot use pivots since the system I'm working on doesn't support them.
Any help is much appreciated

Try this for your scenario without pivot(crosstab):
SELECT t.event_id as Event_Id,
max(v.value) filter (where v.key='type') as "type",
max(v.value) filter (where v.key='tag') as "tag",
DATE (t.event_date) as Event_Date
FROM eventtags t
LEFT JOIN filtervalues as v ON t.value_id = v.id
group by t.event_id,t.event_date
This works only on PostgreSQL 9.4 and above, where the aggregate FILTER clause was introduced.

Full Outer Join on two columns is omitting rows

Some background, I am making a table in Postgres 9.5 that counts the number of actions performed by a user and grouping these actions by month using date_trunc(). The counts for each individual action are divided into separate tables, following this format:
Feedback table:
id | month | feedback_counted
----+---------+-------------------
1 | 2 | 3
1 | 3 | 10
1 | 4 | 7
1 | 5 | 2
Comments table:
id | month | comments_counted
----+---------+-------------------
1 | 4 | 12
1 | 5 | 4
1 | 6 | 57
1 | 7 | 12
Ideally, I would like to do a FULL OUTER JOIN of these tables ON the "id" and "month" columns at the same time and produce this result:
Combined table:
id | month | feedback_counted | comments_counted
----+---------+--------------------+-------------------
1 | 2 | 3 |
1 | 3 | 10 |
1 | 4 | 7 | 12
1 | 5 | 2 | 4
1 | 6 | | 57
1 | 7 | | 12
However, my current query does not capture the feedback dates, displaying it like such:
Rollup table:
id | month | feedback_counted | comments_counted
----+---------+--------------------+-------------------
| | |
| | |
1 | 4 | 7 | 12
1 | 5 | 2 | 4
1 | 6 | | 57
1 | 7 | | 12
This is my current statement, note that it uses date_trunc in place of month. I add the action counts later, the main issue is somewhere here.
CREATE TABLE rollup_table AS
SELECT c.id, c.date_trunc
FROM comments_counted c FULL OUTER JOIN feedback_counted f
ON c.id = f.id AND c.date_trunc = f.date_trunc
GROUP BY c.id, c.date_trunc, f.id, f.date_trunc;
I'm a bit of a novice with SQL and am not sure how to fix this, any help would be appreciated.

Replace ON c.id = f.id AND c.month = f.month with USING(id, month).
SELECT id, month, feedback_counted, comments_counted
FROM comments c
FULL OUTER JOIN feedback f
USING(id, month);
id | month | feedback_counted | comments_counted
----+-------+------------------+------------------
1 | 2 | 3 |
1 | 3 | 10 |
1 | 4 | 7 | 12
1 | 5 | 2 | 4
1 | 6 | | 57
1 | 7 | | 12
(6 rows)

USING() is shorthand for an equality ON condition on columns that share a name in both tables, and it additionally merges each pair into a single output column. It does not help on its own, though, if you keep selecting c.id and c.month as in your query. Those qualified columns come from the comments table only, so they are NULL for every row that exists only in feedback. That's why the feedback-only rows show up blank after the full outer join. If you keep ON, you have to merge the two sides yourself.
Here is a way that at least works for me.
SELECT COALESCE(c.id, f.id) AS id,
COALESCE(c.month, f.month) AS month,
feedback_counted,
comments_counted
FROM comments c
FULL OUTER JOIN feedback f
ON c.id = f.id AND c.month = f.month;

Typo3 TCA custom table

I have the following situation: one offer has n dates and n options, so I have two additional tables for the offer. There is also a third one, the price, which depends on both the date and the offer. It looks like this:
| | date 1 | date 2 | date 3 |
| offer 1 | price 11 | price 12 | price 13 |
| offer 2 | price 21 | price 22 | price 23 |
| offer 3 | price 31 | price 32 | price 33 |
Is there any way to create a custom TCA field to enter all of these price values at once?
Basically, I need one table with input fields that also stores the uid of the date and of the offer as references.

Make more than one table. Tables with a dynamic column count are horrible to maintain.
Table Offer:
uid | Name | Desc
1 | offer1 | This is some cool shit
2 | offer2 | dsadsad
3 | offer3 | sdadsdsadsada
Table Date:
uid | date
1 | 12.02.2014
2 | 12.03.2014
3 | 20.03.2014
Table Prices:
uid | date | offer | price
1 | 1 | 1 | price11
2 | 1 | 2 | price21
3 | 1 | 3 | price31
4 | 2 | 1 | price12
5 | 2 | 2 | price22
6 | 2 | 3 | price32
7 | 3 | 1 | price13
8 | 3 | 2 | price23
9 | 3 | 3 | price33
And then it's straightforward...

Grouping in t-sql with latest dates

I have a table like this
Event ID | Contract ID | Event date | Amount |
----------------------------------------------
1 | 1 | 2009-01-01 | 100 |
2 | 1 | 2009-01-02 | 20 |
3 | 1 | 2009-01-03 | 50 |
4 | 2 | 2009-01-01 | 80 |
5 | 2 | 2009-01-04 | 30 |
For each contract I need to fetch the latest event and amount associated with the event and get something like this
Event ID | Contract ID | Event date | Amount |
----------------------------------------------
3 | 1 | 2009-01-03 | 50 |
5 | 2 | 2009-01-04 | 30 |
I can't figure out how to group the data correctly. Any ideas?
Thanks in advance.

SQL Server 2005/2008:
with cte_ranked as (
select *
, row_number() over (
partition by ContractId order by EventDate desc) as [rank]
from [table])
select *
from cte_ranked
where [rank] = 1;
SQL Server 2000:
select t.*
from [table] as t
join (
select max(EventDate) as MaxDate
, ContractId
from [table]
group by ContractId) as mt
on t.ContractId = mt.ContractId
and t.EventDate = mt.MaxDate