Druid query to get "latest" value from third column

I have a table in Druid, something like
Timestamp || UserId || Action
And I need to get the latest Action for each UserId. In MySQL I would do something like
Select * from users u1 inner join (
select UserId, max(Timestamp) as maxt from users group by UserId
) u2
on u1.UserId = u2.UserId and u1.Timestamp = u2.maxt
But Druid can't do joins and supports only very basic sub-selects.
I know the "right" answer is probably to denormalize the data at ingestion time, but unfortunately that's not an option as I don't "own" the ingestion part.
The only solution I have come up with so far is to retrieve all the results for both queries in Java code and do the join manually, but I imagine I will run into memory constraints as the dataset grows.
I looked at materialized views, but that feature is still incubating and would require a Hadoop cluster, so it isn't really viable.
I tried to do something like
Select * from users u1 where concat(Timestamp, UserId) in (
select concat(UserId, max(Timestamp)) from users group by UserId
)
But it didn't like that either.
Any suggestions?

LATEST(expr)
Returns the latest value of expr, which must be numeric. If expr
comes from a relation with a timestamp column (like a Druid
datasource) then "latest" is the value last encountered with the
maximum overall timestamp of all values being aggregated. If expr
does not come from a relation with a timestamp, then it is simply the
last value encountered.
https://druid.apache.org/docs/0.20.0/querying/sql.html
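Applied to the question's table, that suggests a query like the following (a sketch, not tested against your cluster; it assumes Action is a string column, so it uses the two-argument string form LATEST(expr, maxBytesPerString) documented on the same page, with 1024 as an arbitrary buffer size):
SELECT UserId, LATEST(Action, 1024) AS latest_action
FROM users
GROUP BY UserId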

Related

Pivot function without manually typing values in `for in`?

The documentation provides an example of using the pivot() function.
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN ('prop', 'rudder', 'wing')
);
I would like to use pivot() without having to manually specify each value of partname. I want all parts. I tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname);
That gave an error. Then tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN (select distinct partname from part)
);
That also threw an error.
How can I tell Redshift to include all values of partname in the pivot?
I don't think this can be done in a simple single query, as it would mean the query compiler would need to work without knowing how many output columns will be produced. I don't think it can do that.
You can do this in multiple queries - use one query to create the list of partnames, then use that list to "generate" a second query that populates the IN list. So something needs to issue these queries and generate the second one. This can be code external to Redshift (lots of options) or a stored procedure in Redshift. This code, wherever it lives, should be aware that Redshift has a maximum column limit of 1,600.
The Redshift docs are fairly good on the topic of dynamic SQL in stored procedures. The EXECUTE statement is what fires off the generated query inside a stored procedure. See: https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html
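A minimal sketch of the stored-procedure route (hypothetical names; it assumes the quoted IN list fits in a VARCHAR and materializes the pivot as a view so the dynamic column set has somewhere to live):
CREATE OR REPLACE PROCEDURE build_part_pivot()
AS $$
DECLARE
    in_list VARCHAR(65535);
BEGIN
    -- Build the quoted, comma-separated list of partnames.
    SELECT INTO in_list
        LISTAGG(DISTINCT QUOTE_LITERAL(partname), ', ')
    FROM part;
    -- Generate and run the second query.
    EXECUTE 'CREATE OR REPLACE VIEW part_pivot AS '
         || 'SELECT * FROM (SELECT partname, price FROM part) PIVOT ('
         || 'AVG(price) FOR partname IN (' || in_list || '))';
END;
$$ LANGUAGE plpgsql;
After CALL build_part_pivot(); a plain SELECT * FROM part_pivot; returns one column per partname.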

PostgreSQL how to GROUP BY single field from returned table

So I have a complicated query; to simplify, let it be something like
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
SUM(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
Such a query works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique, and although it actually is, in this case I would need to GROUP BY all the fields returned from table t.
I can do that, but I don't believe it will work efficiently on big data.
I can't JOIN with activities directly in the first query, as a person can have several contacts, which would lead the query to count the hours of an activity several times, once for every joined contact.
Is there a Postgres way to make this query work? Maybe force Postgres to treat t.id as unique, or some other solution that achieves the same thing the Postgres way?
This query will not work on either database system: there is an aggregate function in the inner query, but you are not grouping by anything (unless you use window functions). Of course, MySQL is a special case: you can allow this by disabling sql_mode=only_full_group_by. So MySQL permits the usage because of a server parameter, but you cannot do that in PostgreSQL.
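For reference, the MySQL escape hatch looks like this (shown only to illustrate what MySQL is doing; it is not recommended):
-- MySQL only: drop ONLY_FULL_GROUP_BY for the current session
SET SESSION sql_mode = (SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));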
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed conceptually imprecise to me.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a window function, invoked with partition by. Here is a really dumbed-down version in your query:
SELECT
    t.*,
    SUM(a.hours) OVER (PARTITION BY t.id) AS spent_hours
FROM t
JOIN activities AS a ON a.person_id = t.id
This means you want all records in table t, not one record per t.id, but each row will also carry the sum of the hours across all rows with that value of id.
For example the sum column would look like this:
Name    Hours   Sum Hours
-----   -----   ---------
Smith      20         120
Jones      30          30
Smith     100         120
Whereas a group by would have had Smith once and could not have displayed the hours column in detail.
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.
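For completeness, a sketch of how the full query could look with the window function (it assumes the inner aggregate was meant to be a grouped count per person, and uses SELECT DISTINCT to collapse the one-row-per-activity duplication the join introduces):
SELECT DISTINCT
    t.*,
    SUM(a.hours) OVER (PARTITION BY t.id) AS spent_hours
FROM (
    SELECT
        person.id,
        person.name,
        person.age,
        COUNT(contacts.id) AS contact_count
    FROM person
    JOIN contacts ON contacts.person_id = person.id
    GROUP BY person.id, person.name, person.age
) AS t
JOIN activities AS a ON a.person_id = t.id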

How to extend dynamic schema with views in Hasura and Postgres?

So I have been trying, and struggling for a few days, to extend the schema with a custom group-by, using something like the following.
I have a table with a few fields like id, country, ip, created_at.
I am trying to get the data as groups: for example, grouped by date, by hour within a date, by country, or by country with DISTINCT ip.
I am honestly a beginner with SQL, but I tried to play around and get what I want. Here are some examples.
SELECT Hour(created_at) AS date,
COUNT(*) AS count
FROM session where CAST(created_at AS date) = '2021-04-05'
GROUP BY Hour(created_at)
ORDER BY date;
SELECT country,
count(*) AS count
FROM (SELECT * FROM session where CAST(created_at AS date) <= '2021-05-12' GROUP BY created_at) AS T1
GROUP BY country;
SELECT country, COUNT(*) as count
FROM (SELECT DISTINCT ip, country FROM session) AS T1
GROUP BY country;
SELECT DATE(created_at) AS date,
COUNT(*) AS count
FROM session
GROUP BY DATE(created_at)
ORDER BY date;
Now I am struggling with two things.
How do I turn the date into a variable? I mean, if I want to group for a particular date range, or today's data hourly, or per quarter (more of a configurable gap), how do I add variables in Hasura's Raw SQL?
Also, with this approach do I have to add a view for each one of them? Like this:
CREATE
OR REPLACE VIEW "public"."unique_session_counts_date" AS
SELECT
date(session.created_at) AS date,
count(*) AS count
FROM
session
GROUP BY
(date(session.created_at))
ORDER BY
(date(session.created_at));
Is there a way to make it more generalized? What I mean is, if it were in Node.js I could have done something like
return rawQuery(
`
select ${field} x, count(*) y
from ${table}
where website_id=$1
and created_at between $2 and $3
${domainFilter}
${urlFilter}
group by 1
order by 2 desc
`,
params,
);
In this case, based on whatever field and where clause I send, one query would do the trick for me. Can I do something similar in Hasura?
Thank you so much in advance.
How do I turn the date into a variable? I mean, if I want to group for a particular date range, or today's data hourly, or per quarter (more of a configurable gap), how do I add variables in Hasura's Raw SQL?
My first thought is this: if you're thinking about passing in variables via GraphQL, for example, the query would look something like:
query MyQuery {
unique_session_counts_date(where: {created_at: {_gte: "<start date here>", _lte: "<end date here>"}}) {
<...any fields, rollups, etc here...>
}
}
The underlying view/query would follow the group by and order by that you've detailed. Then you'd be able to submit the GraphQL query and just pass in the pertinent parameters, like the $1, $2, and $3 in the rawQuery call.
Also for this approach I have to add schema for each one of them?
The schema? The view? I don't think a view specifically would be required; if a multilevel select or similar query can handle it and performs well, then a view wouldn't particularly be needed.
That's my first stab at the problem. I'm going to try to work through it in a few hours via a Twitch stream at HasuraHQ; if you can join, I'm happy to walk through it live.
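One more option worth mentioning: Hasura can track a Postgres function whose arguments become GraphQL variables, which gets close to the parameterized rawQuery pattern. A rough sketch with hypothetical names (Hasura requires the function to be STABLE and to return SETOF a tracked table):
-- A table type for the result, tracked in Hasura.
CREATE TABLE session_counts_result (
    bucket timestamptz,
    count  bigint
);
CREATE OR REPLACE FUNCTION session_counts_by_day(start_at timestamptz, end_at timestamptz)
RETURNS SETOF session_counts_result AS $$
    SELECT date_trunc('day', created_at) AS bucket,
           count(*) AS count
    FROM session
    WHERE created_at BETWEEN start_at AND end_at
    GROUP BY 1
    ORDER BY 1
$$ LANGUAGE sql STABLE;
Once tracked, start_at and end_at show up as arguments on the generated query field, so the date range is a variable rather than something baked into a view.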

How can I get the total run time of a query in redshift, with a query?

I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The gui dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with sql selects?
To add to Alex's answer: the stl_query table has the inconvenience that if the query sat in a queue before running, the queue time is included in its elapsed time, so that elapsed time isn't a very good indicator of the query's performance.
To get the actual execution time, check stl_wlm_query for total_exec_time.
select total_exec_time
from stl_wlm_query
where query = <query_id>
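For example (the times in stl_wlm_query are in microseconds, and 12345 stands in for your query id):
select query,
       total_queue_time / 1000000.0 as queue_seconds,
       total_exec_time  / 1000000.0 as exec_seconds
from stl_wlm_query
where query = 12345;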
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of those scripts stripped down to give you query run times in milliseconds. Play with the filters, ordering, etc. to show the results you are looking for:
select userid,
       label,
       stl_query.query,
       trim(database) as database,
       trim(querytxt) as qrytext,
       starttime,
       endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (
    select query, trim(split_part(event, ':', 1)) as event
    from STL_ALERT_EVENT_LOG
    group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100

PostgreSQL - must appear in the GROUP BY clause or be used in an aggregate function

I am getting this error in pg production mode, but it's working fine in sqlite3 development mode.
ActiveRecord::StatementInvalid in ManagementController#index
PG::Error: ERROR: column "estates.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT "estates".* FROM "estates" WHERE "estates"."Mgmt" = ...
^
: SELECT "estates".* FROM "estates" WHERE "estates"."Mgmt" = 'Mazzey' GROUP BY user_id
@myestate = Estate.where(:Mgmt => current_user.Company).group(:user_id).all
If user_id is the PRIMARY KEY then you need to upgrade PostgreSQL; newer versions will correctly handle grouping by the primary key.
If user_id is neither unique nor the primary key for the 'estates' relation in question, then this query doesn't make much sense, since PostgreSQL has no way to know which value to return for each column of estates where multiple rows share the same user_id. You must use an aggregate function that expresses what you want, like min, max, avg, string_agg, array_agg, etc or add the column(s) of interest to the GROUP BY.
Alternately you can rephrase the query to use DISTINCT ON and an ORDER BY if you really do want to pick a somewhat arbitrary row, though I really doubt it's possible to express that via ActiveRecord.
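In raw SQL, that DISTINCT ON rewrite would look something like this (a sketch; it assumes you want the row with the highest id per user_id):
SELECT DISTINCT ON (user_id) *
FROM estates
WHERE "Mgmt" = 'Mazzey'
ORDER BY user_id, id DESC;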
Some databases - including SQLite and MySQL - will just pick an arbitrary row. This is considered incorrect and unsafe by the PostgreSQL team, so PostgreSQL follows the SQL standard and considers such queries to be errors.
If you have:
col1 col2
fred 42
bob 9
fred 44
fred 99
and you do:
SELECT col1, col2 FROM mytable GROUP BY col1;
then it's obvious that you should get the row:
bob 9
but what about the result for fred? There is no single correct answer to pick, so the database will refuse to execute such unsafe queries. If you wanted the greatest col2 for any col1 you'd use the max aggregate:
SELECT col1, max(col2) AS max_col2 FROM mytable GROUP BY col1;
I recently moved from MySQL to PostgreSQL and encountered the same issue. Just for reference, the best approach I've found is to use DISTINCT ON as suggested in this SO answer:
Elegant PostgreSQL Group by for Ruby on Rails / ActiveRecord
This will let you get one record for each unique value in your chosen column that matches the other query conditions:
MyModel.where(:some_col => value).select("DISTINCT ON (unique_col) *")
I prefer DISTINCT ON because I can still get all the other column values in the row. DISTINCT alone will only return the value of that specific column.
After often receiving the error myself, I realised that Rails (I am using Rails 4) automatically adds an 'order by id' at the end of your grouping query. This often results in the error above, so make sure you append your own .order(:group_by_column) at the end of your Rails query. Hence you will have something like this:
@problems = Problem.select('problems.username, sum(problems.weight) as weight_sum').group('problems.username').order('problems.username')
@myestate1 = Estate.where(:Mgmt => current_user.Company)
@myestate = @myestate1.select("DISTINCT(user_id)")
This is what I did.