Logical Grouping in BigQuery - group-by

I am trying to group data in BigQuery in a way that goes beyond simple aggregation. However, I am not sure if what I'm trying to do is possible.
The idea behind the data:
One employee will be logged in and can perform multiple transactions. hits.eventInfo captures all of this data, but the only field that separates the transactions from one another is a flight_search field, which is fired to look up a person's records before a transaction (I also thought about using the resetting hitNumber as a transaction separator, but it's not always a clean reset per transaction).
My question is: is it possible to group by fullVisitorId+VisitId, date, and this logic, where all of the ARRAY_AGG calls reset each time the flight_search field is fired? Currently, all the transactional data is going into one array instead of separate arrays per transaction. It's then impossible to tell which fields go with which transaction. Further, taking the max is supposed to give me the last updates in each transaction, but it just gives me the last transaction because they are all together.
An example of my query is below. I have to use ARRAY_AGG or something like it, since the subqueries can only return one value.
WITH eventData AS (
  SELECT
    CONCAT(fullVisitorId, ' ', CAST(VisitId AS string)) AS sessionId,
    date AS date,
    hit.hour AS checkinHour,
    hit.minute AS checkinMin,
    (SELECT ARRAY_AGG(hit.eventInfo.eventAction)
     FROM UNNEST(hits) hit
     WHERE hit.eventInfo.eventCategory = 'pnr') AS pnr,
    (SELECT ARRAY_AGG(STRUCT(hit.eventInfo.eventAction)) AS val
     FROM UNNEST(hits) hit
     WHERE hit.eventInfo.eventCategory = 'submit_checkin') AS names
  FROM
    `web-analytics.192016109.ga_sessions_20191223`,
    UNNEST(hits) AS hit
  ## group by sessionId, date, hit.eventInfo.eventCategory = 'flight_search'
)
SELECT
  sessionId,
  date,
  MAX(checkinHour) AS chkHr,
  MAX(checkinMin) AS chkMin,
  # end of transaction
  MAX(pnr[ORDINAL(ARRAY_LENGTH(pnr))]) AS pnr,
  names.eventAction AS pax_name
FROM
  eventData,
  UNNEST(names) AS names
GROUP BY
  sessionId,
  date,
  pax_name
Technically, if I add a GROUP BY here, everything will break, because I'll then be asked to group by hour, minute and then hits, which is an array...
Example test data
This is the original eventData as it is fed from Google Analytics into BigQuery. I have simplified the displayed eventCategories. This is where the inner query is sourcing from. A transaction is completed after the submit_checkin event happens. As we can see, though, there is one pnr (identifier) but multiple people are checked in for that pnr.
This is a sample of what the output from eventData looks like. As you can see, the pnrs are grouped in one array and the names are in one array. It's not directly possible to see which were together in which transaction.
Lastly, here is the whole query output. I wrote on the picture what the expected result is.

If you want to see which information was tracked in the same hit, you should keep the relation between them. But it seems they are not in the same hit, with eventCategory being 'pnr' one time and 'submit_checkin' the other time.
I'm not sure it's intentional, but you're also cross joining the table with hits ... and then you're ARRAY_AGG()-ing the hits array per hit again. That seems wrong.
If you're staying on session scope then there is no need to group anything, because the table already comes with 1 row = 1 session.
This query uses window functions to prepare a transaction id:
SELECT
  fullVisitorId,
  visitStartTime,
  date,
  ARRAY(
    SELECT AS STRUCT
      hitNumber,
      IF(eventInfo.eventCategory = 'flight_search'
         AND LAG(eventInfo.eventCategory) OVER (ORDER BY hitNumber ASC) = 'submit_checkin',
         1, 0) AS breakInfo,
      eventInfo,
      hour,
      minute
    FROM UNNEST(hits) hit
    WHERE hit.eventInfo.eventCategory IN ('pnr', 'submit_checkin', 'flight_search')
    ORDER BY hitNumber ASC
  ) AS myhits1,
  ARRAY(
    SELECT AS STRUCT
      *,
      SUM(breakInfo) OVER (ORDER BY hitNumber) AS arrayId
    FROM (
      SELECT
        hitNumber,
        IF(eventInfo.eventCategory = 'flight_search'
           AND LAG(eventInfo.eventCategory) OVER (ORDER BY hitNumber ASC) = 'submit_checkin',
           1, 0) AS breakInfo,
        eventInfo,
        hour,
        minute
      FROM UNNEST(hits) hit
      WHERE hit.eventInfo.eventCategory IN ('pnr', 'submit_checkin', 'flight_search')
      ORDER BY hitNumber ASC
    )
  ) AS myhits2
FROM
  `web-analytics.192016109.ga_sessions_20191223`
This gives you a number to use as an id to group by. You only need to feed the output of the array function into yet another sub-query that finally groups it into arrays using ARRAY_AGG() and GROUP BY arrayId.
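A minimal sketch of that final step, assuming the preparing query above is pasted unchanged into a CTE I've named prepared (the CTE name and output aliases are mine; the struct fields come from myhits2 above):
WITH prepared AS (
  SELECT ...  -- the preparing query from above, unchanged
)
SELECT
  fullVisitorId,
  visitStartTime,
  date,
  h.arrayId AS transactionId,
  -- one array per transaction, ordered by hit number
  ARRAY_AGG(STRUCT(h.hitNumber, h.eventInfo, h.hour, h.minute) ORDER BY h.hitNumber) AS transactionHits
FROM prepared, UNNEST(myhits2) AS h
GROUP BY fullVisitorId, visitStartTime, date, h.arrayId
Each output row should then represent one transaction, with its pnr and submit_checkin hits kept together in transactionHits.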

Related

From group by to window function postgres

My goal is to compare Event Engagement Rate = (Event Subscribed / Event Participated) for the newsletter source vs the email source by using window functions.
I managed to do it by using GROUP BY:
select source, sum(event_subscribed/event_participated) as "engagement rate"
from events
group by source
I keep failing with the window function, getting too many rows and wrong engagement rates.
Thanks for your help.
First you need to understand the difference between an aggregate function and the window variant of the same function. The aggregate function builds groups as specified in the GROUP BY clause and reduces the result to a single row per group. The window version also builds groups, as specified in the PARTITION BY clause; however, it does not reduce the number of rows, but instead generates the same result in each of the rows within the group (see the example after the quoted documentation below). To reduce the window version to a single row per group, use a DISTINCT ON expression.
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the “first row” of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first.
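To illustrate the difference, a minimal sketch using the column names from the question:
-- Aggregate: one row per source
select source, sum(event_subscribed/event_participated) as rate
from events
group by source;

-- Window: the same sum repeated on every row of its partition
select source, sum(event_subscribed/event_participated) over (partition by source) as rate
from events;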
Your query becomes:
select distinct on (source)
       source
     , sum(event_subscribed/event_participated) over (partition by source) as "engagement rate"
from events
order by source;

How to select max date value while selecting max value

I have the following sample from a table of students' results, with dates, for a school entry exam:
First student: passed the exam - this is the most common record found for most students.
Second student: failed the first entry and passed the second time, based on the date.
Third student: had a failed input entry that was corrected, based on the version.
I need the results to look like the picture above, so we take into account the latest date and the highest version!
My basic query thus far is
select studentid
,examdate --(Date)
,result -- (charvar)
from StudentEntryExam
How should I approach this issue?
demo:db<>fiddle
SELECT DISTINCT ON (studentid)
*
FROM mytable
ORDER BY studentid, examdate DESC, version DESC
DISTINCT ON returns the first record of an ordered group. In this case the groups are the studentids. You must find the correct order so that the required record comes first. So, you need to order by studentid, of course. Then you need the most recent examdate first, which can be achieved with DESC order. If there are two records on the same date, you need to order the highest version first as well, using the DESC modifier, too.
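For comparison, a minimal equivalent sketch using a window function instead of the Postgres-specific DISTINCT ON (table and column names follow the query above):
SELECT studentid, examdate, result
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY studentid
                            ORDER BY examdate DESC, version DESC) AS rn
  FROM mytable
) ranked
WHERE rn = 1;
Both forms keep exactly one row per student: the one with the latest examdate, ties broken by the highest version.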

PostgreSQL: How to do a SELECT clause with a condition iterating through a range of values?

Hi everyone. This is my first post on Stack Overflow, so sorry if it is clumsy in any way.
I work in Python and make PostgreSQL requests to a Google BigQuery database. The data structure looks like this:
sample of data
where time is represented in nanoseconds and is not regularly spaced (it is captured in real time).
What I want to do is select, say, the mean price over a minute, for each minute in a time range that I would like to give as a parameter.
This time range is currently a list of timestamps that I build externally, and I make sure they are separated by one minute each:
[1606170420000000000, 1606170360000000000, 1606170300000000000, 1606170240000000000, 1606170180000000000, ...]
My question is: how can I extract this list of mean prices given that list of time intervals?
Ideally I'd expect something like
SELECT AVG(price) OVER( PARTITION BY (time BETWEEN time_intervals[i] AND time_intervals[i+1] for i in range(len(time_intervals))) )
FROM table_name
but I know that doesn't make sense...
My temporary solution is to aggregate many SELECT ... UNION DISTINCT clauses, one for each minute interval. But as you can imagine, this is not very efficient... (I need up to 60*24 = 1440 samples.)
Now there may very well already be an answer to that question, but since I'm not even sure how to formulate it, I have found nothing yet. Every link and/or tip would be of great help.
Many thanks in advance.
First of all, your sample data appears to be at nanosecond resolution, and you are looking for averages at minute (sixty-second) resolution.
Please try this:
select div(time, 60000000000) as minute,
       pair,
       avg(price) as avg_price
from your_table
group by minute, pair
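If you want a readable timestamp instead of the raw minute bucket, a small hedged variation (assuming time is nanoseconds since the epoch, as stated in the question; your_table is the placeholder name from above):
-- div(time, 60000000000) is minutes since epoch; * 60 converts to seconds
select timestamp_seconds(div(time, 60000000000) * 60) as minute_start,
       pair,
       avg(price) as avg_price
from your_table
group by minute_start, pair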
If you want to control the intervals as you said in your comment, then please try something like this (I do not have access to BigQuery):
with time_ivals as (
  select tick,
         lead(tick) over (order by tick) as next_tick
  from unnest(
    [1606170420000000000, 1606170360000000000,
     1606170300000000000, 1606170240000000000,
     1606170180000000000, ...]) as tick
)
select t.tick, y.pair, avg(y.price) as avg_price
from time_ivals t
join your_table y
  on y.time >= t.tick
 and y.time < t.next_tick
group by t.tick, y.pair;

How can I select only data within a specific window in KSQL?

I have a table with a tumbling window, e.g.
CREATE TABLE total_transactions_per_1_days AS
SELECT
sender,
count(*) AS count,
sum(amount) AS total_amount,
histogram(recipient) AS recipients
FROM
completed_transactions
WINDOW TUMBLING (
SIZE 1 DAYS
)
Now I need to select only data from the current window, i.e. windowstart <= current time and windowend >= current time. Is it possible? I could not find any example.
Depends what you mean when you say 'select data' ;)
ksqlDB supports two main query types, (see https://docs.ksqldb.io/en/latest/concepts/queries/).
If what you want is a pull query, i.e. a traditional SQL query where you want to pull back the current window as a one-time result, then what you want may be possible, though pull queries are a recent feature and not fully featured yet. As of version 0.10 you can only look up a known key. For example, if sender is the key of the table, you could run a query like:
SELECT * FROM total_transactions_per_1_days
WHERE sender = some_value
AND WindowStart <= UNIX_TIMESTAMP()
AND WindowEnd >= UNIX_TIMESTAMP();
This would require the table to have processed data with a timestamp close to the current wall clock time for it to pull back data, i.e. if the system was lagging, or if you were processing historic or delayed data, this would not work.
Note: the above query will work on ksqlDB v0.10. Your success on older versions may vary.
There are plans to extend the functionality of pull queries. So keep an eye for updates to ksqlDB.
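If instead a continuous feed of results suits your use case, the other query type mentioned above, a push query, may fit; a minimal sketch against the same table:
-- Push query: emits a new row every time the aggregate changes,
-- rather than pulling a one-time snapshot
SELECT * FROM total_transactions_per_1_days EMIT CHANGES;
You would then filter or window the emitted results on the client side as they arrive.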

Compare 2 Tables When 1 Is Null in PostgreSQL

I am kind of new to PostgreSQL and I have difficulty getting the result that I want.
In order to get the appropriate result, I need to make multiple joins, and I also have difficulty with counting and grouping them in one query.
The table names are as follows: pers_person, pers_position, and acc_transaction.
What I want to accomplish is:
To see who was absent on which date, by comparing pers_person with acc_transaction for any record; if there is any record it's fine... but if the record is null, the person was definitely absent.
I want to count the absences per pers_person: how many times in a month this person is absent.
Also, the person's hire_date should be considered; the person might be hired in November, and in the October report this person should be filtered out.
The pers_position table is for giving position information about that person.
SELECT tr.create_time::date AS Date, pers.pin, tr.dept_name, tr.name, tr.last_name, pos.name, Count(*)
FROM acc_transaction AS tr
RIGHT JOIN pers_person AS pers ON tr.pin = pers.pin
LEFT JOIN pers_position AS pos ON pers.position_id = pos.id
WHERE tr.event_no = 0
  AND DATE_PART('month', DATE) = 10
  AND DATE_PART('month', pr.hire_date::date) <= 10
  AND pr.pin IS DISTINCT FROM tr.pin
GROUP BY DATE
ORDER BY DATE
This is the report for October.
Pin is the ID number.
I'd start by:
changing the RIGHT JOIN to a LEFT JOIN, as they work the same in reverse but it's confusing to hold both in mind;
removing the pers_position table for now, as it is used for added information rather than for changing any returned result;
renaming the unknown alias pr, which I'd assume is meant to be pers (?);
removing the strange WHERE conditions that this leads to:
"pers.pin IS DISTINCT FROM pers.pin" (a field is never distinct from itself)
"AND DATE_PART('month', DATE)=10" (always true when run in October, always false otherwise)
Giving the resulting query:
SELECT tr.create_time::date AS Date, pers.pin, tr.dept_name, tr.name, tr.last_name, Count(*)
FROM pers_person as pers
LEFT JOIN acc_transaction AS tr ON tr.pin = pers.pin
WHERE tr.event_no = 0
AND DATE_PART('month', pers.hire_date::date)<=10
GROUP BY DATE
ORDER BY DATE
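A hedged side note on the null check the question describes: with a LEFT JOIN, a condition on the right table in WHERE (like tr.event_no = 0) silently discards the NULL rows that would reveal an absence. A minimal sketch of the absent-person lookup, assuming the same tables and that "absent in October" means no matching October transaction at all (the month filter mirrors the original query):
-- Conditions on acc_transaction moved into ON, so persons without
-- any matching transaction survive the LEFT JOIN as NULL rows
SELECT pers.pin
FROM pers_person AS pers
LEFT JOIN acc_transaction AS tr
       ON tr.pin = pers.pin
      AND tr.event_no = 0
      AND DATE_PART('month', tr.create_time::date) = 10
WHERE tr.pin IS NULL
  AND DATE_PART('month', pers.hire_date::date) <= 10;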
In the end, I don't know if this answers the question, since the title says "Compare 2 Tables When 1 Is Null in PostgreSQL" and the content of the question says nothing about it.