BigQuery internal error on a GROUP by query (trigrams) - group-by

I tried the following query with LIMIT 100 and got a "Resources exceeded during query execution" error (otichyproject1:job_1mpw4aDtTHmbduBdKSBu5ty1DXY), so I tried writing the output to a new table with large results allowed. It ran much longer, but failed with an 'internal error' (otichyproject1:job_6pFUlj2AzdROUyAU8nZ9dGdo3ms).
SELECT
ngram, decade, SUM(freq) totalfreq, SUM(books) totalbooks
FROM
trigram.trigrams3
GROUP BY
ngram, decade
The table trigrams3 is derived from the public trigram dataset and should be smaller (although a COUNT on the trigrams gives weird results).
Any ideas on how to make this work?

First, let's see how big the result set is:
SELECT COUNT(*)
FROM (
SELECT ngram, decade, SUM(freq) totalfreq, SUM(books) totalbooks
FROM [otichyproject1:trigram.trigrams3]
GROUP EACH BY ngram, decade
)
837,369,607 - almost a billion rows to output, which is why we need "allowLargeResults".
Note that I used "GROUP EACH". "EACH" shouldn't be needed as it's on its way out, but it improves running time for me here.
The same query with LIMIT 100 also works with "EACH":
SELECT
ngram, decade, SUM(freq) totalfreq, SUM(books) totalbooks
FROM
trigram.trigrams3
GROUP EACH BY
ngram, decade
LIMIT 100
And the query that outputs all results to a new table runs in only 20 seconds when I try it with "EACH" and "allowLargeResults":
SELECT ngram, decade, SUM(freq) totalfreq, SUM(books) totalbooks
FROM [otichyproject1:trigram.trigrams3]
GROUP EACH BY ngram, decade
So the short answer to this question is: Keep using "GROUP EACH" (for now).

In TSQL, How do I add a count column that counts the number of rows in my query?

This can be done a number of ways, which I will explain at the end. For now, I have been given a work assignment that includes the following (simplified):
"Create a record each week to track the current status that has the following: account numbers (unique within each report), a random number (provided), their status (Green, Orange, or Blue), and make sure the record also has a column which tells me how many records their are this week."
I do not need code to generate a random number.
Columns: Account, RanNum, Status, NumberOfRowsThisWeek
How do I handle adding a column that determines the number of rows in my query and produces that number, static, within each row of that column?
I may try to tweak the request and apply a rising number. How would I go about doing it in this case?
Edit: SQL Server 2014
You are not telling us which database you are using.
In SQL Server, at least in the newer versions, you have window (analytic) functions available, and they are also available in most other popular RDBMSs.
You could do what you want in SQL Server by adding this to your select:
,count(*) over (partition by 1) as [NrOfRows]
An analytic function runs the "standard" query first and then applies the windowing over the result set.
The count above counts the rows in the result set, partitioned by the constant 1, which is of course the same for all rows, so it gives the full row count.
It is perhaps not standard in all databases to allow a constant in that way; this variant may work better in some, and I know it works in SQL Server:
,count(*) over (partition by (select 1 n)) as [NrOfRows]
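Put in context (WeeklyStatus below is just a placeholder name for whatever table holds the weekly data), the full select could look like this; an empty OVER () clause counts every row of the result set in the same way:
select Account,
       RanNum,
       Status,
       count(*) over () as NumberOfRowsThisWeek
from WeeklyStatus;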
It sounds like you want to do some kind of simple count() / group by query:
select Account, RanNum, Status, count(*) as NumberOfRowsThisWeek
from tablename
group by Account, RanNum, Status
You may need to do
select t.Account, t.RanNum, t.Status, c.NumberOfRowsThisWeek
from tablename t
join (
    select Account, Status, count(*) as NumberOfRowsThisWeek
    from tablename
    group by Account, Status
) c on c.Account = t.Account and c.Status = t.Status
because the random number will confuse the group by by making every row unique, so the count is taken without it and then joined back.

How can I get the total run time of a query in redshift, with a query?

I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The gui dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with sql selects?
To add to Alex's answer: the stl_query table has the inconvenience that, if the query spent time in a queue before it ran, the queue time is included in the difference between starttime and endtime, so that is not a very good indicator of the query's actual performance.
To understand the actual runtime of the query, check stl_wlm_query for total_exec_time.
select total_exec_time
from stl_wlm_query
where query = <query_id>
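To put the wall-clock time from stl_query next to the queue/execution split from stl_wlm_query, something like this should work (a rough sketch; as far as I know the stl_wlm_query times are reported in microseconds, hence the division, and <query_id> is a placeholder):
select q.query,
       datediff(milliseconds, q.starttime, q.endtime) as wall_clock_ms,
       w.total_queue_time / 1000 as queue_ms,
       w.total_exec_time / 1000 as exec_ms
from stl_query q
join stl_wlm_query w on w.query = q.query
where q.query = <query_id>;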
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of said scripts stripped out to give you query run times in milliseconds. Play with the filters, ordering etc to show the results you are looking for:
select userid, label, stl_query.query,
       trim(database) as database,
       trim(querytxt) as qrytext,
       starttime, endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (
    select query, trim(split_part(event, ':', 1)) as event
    from STL_ALERT_EVENT_LOG
    group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%' )
-- and database = ''
order by starttime desc
limit 100

Most efficient way to retrieve rows of related data: subquery, or separate query with GROUP BY?

I have a very simple PostgreSQL query to retrieve the latest 50 news articles:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Now I also want to retrieve the latest 10 comments for each article as well. I can think of two ways to accomplish retrieving them and I'm not sure which one is best in the context of PostgreSQL:
Option 1:
Do a subquery directly for the comments in the original query and cast the result to an array:
SELECT headline, author_name, body,
ARRAY(
SELECT id, message, author_name
FROM news_comments
WHERE news_id = n.id
ORDER BY date DESC
LIMIT 10
) AS comments
FROM news n
ORDER BY publish_date DESC
LIMIT 50
Obviously, in this case, application logic would need to be aware of which index in the array is which column; that's no problem.
The one problem I see with the method is not knowing how the query planner would execute it. Would this effectively turn into 51 queries?
Option 2:
Use the original very simple query:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Then via application logic, gather all of the news ids and use them in a separate query; row_number() would have to be used there to limit the number of results per news article:
SELECT *
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY news_id
               ORDER BY date DESC
           ) AS rn
    FROM (
        SELECT *
        FROM news_comments
        WHERE news_id IN (123, 456, 789)
    ) s
) s
WHERE rn <= 10
This approach is obviously more complicated, and I'm not sure whether it would have to retrieve all comments for the scoped news articles first and then chop off the ones where the row number is greater than 10.
Which option is best? Or is there an even better solution I have overlooked?
For context, this is a news aggregator site I've developed myself, I currently have about 40,000 news articles across several categories, with about 500,000 comments, so I'm looking for the best solution to help me keep growing.
You should investigate the execution plans for your statements using at least EXPLAIN ANALYZE. This executes the statement and shows you the plan chosen by the optimizer along with actual run times and other statistics.
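For example, prefixing either candidate query with EXPLAIN (ANALYZE, BUFFERS) shows the chosen plan together with actual row counts, timings and buffer usage:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50;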
Another solution would be to use a LATERAL subquery to retrieve the 10 latest comments for each news article as separate rows, but then again you need to investigate and compare plans to choose the approach that works best for you:
SELECT
n.id, n.headline, n.author_name, n.body,
c.id, c.message, c.author_name
FROM news n
LEFT JOIN LATERAL (
SELECT id, message, author_name
FROM news_comments nc
WHERE n.id = nc.news_id
ORDER BY nc.date DESC
LIMIT 10
) c ON TRUE
ORDER BY publish_date DESC
LIMIT 50
When your query contains LATERAL, the subquery is evaluated once for each row retrieved from news, using the cross-reference in its WHERE clause, and the rows it returns are joined to that news row.
This approach saves the time your application logic would spend dealing with the arrays coming out of option 1, while not having to issue the separate queries of option 2, saving (in this case) the time needed to open separate transactions, establish connections, retrieve rows, etc.
It would also be good to look for performance improvements by creating indexes and by looking into the planner cost constants and planner method configuration parameters, which you can experiment with to understand the choices the planner has made. More on the subject here.
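As a sketch of such an index (column names taken from the queries above), a composite index matching the LATERAL subquery's filter and sort could look like this:
CREATE INDEX news_comments_news_id_date_idx
    ON news_comments (news_id, date DESC);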

Amazon redshift query planner

I'm facing a situation with Amazon Redshift that I haven't been able to explain to myself yet. The query planner seems unable to handle the same table appearing in the subqueries of two derived tables in a join.
I have essentially four tables, Source_A, Source_B, Target_1 and Target_2, and a query like:
SELECT sa.a, sa.b, sb.c, sb.d FROM
(
SELECT a, b FROM Source_A WHERE date > (SELECT max(date) FROM Target_1)
) sa
INNER JOIN
(
SELECT c, d FROM Source_B WHERE date > (SELECT max(date) FROM Target_2)
) sb
ON sa.a = sb.c
The query works fine as long as Target_1 and Target_2 are different tables. If I change the query so that Target_2 = Target_1, something happens. After the change, the query starts to take about 10 times longer, and when I look at the performance monitor I can see that all this extra time is spent with only the leader node active.
When I take EXPLAIN output of both options I see practically no difference; all the steps are the same. But there is one difference: the EXPLAIN itself takes seconds for one and almost half an hour for the other, the one where the target tables are the same.
So to summarise, what I think I have observed is that if a join uses the same table in the subquery of each derived table, the query planner goes nuts.

Stuck on a query and need support to improve the performance (if any) for the execution !(PostgreSQL)? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I am new to PostgreSQL and I am learning by working through a few examples!
I have been solving queries in PostgreSQL and managed a few, but got stuck at one point!
Given the sample data in the SQLFiddle below, I tried:
--6. find the most sold product with sales_id, product_name, quantity and sum(price)
select array_agg(s.sale_id),p.product_name,s.quantity,sum(s.price)
from products p
join sales s
on p.product_id=s.product_id;
but it fails with:
ERROR: column "p.product_name" must appear in the GROUP BY clause or be used in an aggregate function:
This is the SQL Fiddle with sample data.
I'm using PostgreSQL 9.2.
For all that it looks simple, this is quite an interesting problem.
The unsolved #6
There are two stages to this:
find the most sold product; and
display the required detail on that product
The question is badly written; it fails to specify whether you want the product with the greatest number of sales, or the greatest dollar sales value. I will assume the former, but it's easy to adapt the following queries to sort by total price instead.
UPDATE: #user2561626 found the simple solution I was sure I was overlooking but couldn't think of: http://sqlfiddle.com/#!12/dbe7c/118 . Use the output of SUM in ORDER BY, then LIMIT the result set.
The following are the complicated and roundabout ways I tried because I couldn't think of the simple way:
One way is to use a subquery with an ORDER BY and LIMIT to sort products by total number of sales, then pick the top one. You then join on that inner query to generate the desired product summary. In this case I join on sales twice, once in the inner query and once in the outer where I calculate more detail for just one product. It's possibly more efficient to join on it just once in the inner query and do more work, but that'll involve creating and discarding a bigger result set, so it's the sort of thing you'd tune based on your data distribution.
SELECT
array_agg(s.sale_id) AS sales_ids,
(SELECT p.product_name FROM products p WHERE p.product_id = pp.product_id) AS product_name,
sum(s.quantity) AS total_quantity,
sum(s.price) AS total_price
FROM
(
-- Find the product with the largest number of sales
-- If multiple products have the same sales an arbitrary candidate
-- is selected; extend the ORDER BY if you want to control which
-- one gets picked.
SELECT
s2.product_id, sum(s2.quantity) AS total_quantity
FROM sales s2
GROUP BY s2.product_id
ORDER BY 2 DESC
LIMIT 1
) AS pp
INNER JOIN sales s ON (pp.product_id = s.product_id)
GROUP BY s.product_id, pp.product_id;
I'm honestly not too sure how to phrase this in purely standard SQL (i.e. no LIMIT clause). You can use a CTE or multiple scans in subqueries to find the greatest number of sales and the product Id with the greatest number of sales, but that'll give you multiple results if you have more than one product with equal sales.
I can't help but feel I've totally forgotten the simple and obvious way to do this.
Comments on others:
--1. write the query to find the products which are not sold
select *
from products
where product_id not in (select distinct PRODUCT_ID from sales );
Your solution is subtly incorrect, because there's no NOT NULL constraint on product_id in sales. It builds a list then filters on the list, but the list could contain NULL, and 2 NOT IN (1, NULL) is NULL, which in WHERE is treated as false.
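A quick way to see that behaviour for yourself:
SELECT 2 NOT IN (1, NULL);  -- returns NULL, which WHERE treats as not true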
It is much better to re-phrase this as WHERE NOT EXISTS (SELECT 1 FROM sales s WHERE s.product_id = products.product_id).
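Spelled out against the question's tables, that rewrite would look roughly like this:
SELECT p.*
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM sales s
    WHERE s.product_id = p.product_id
);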
With #2 it's again better to use EXISTS, but PostgreSQL can optimize it into the better form automatically since it's semantically the same; the NULL issue doesn't apply for IN, only NOT IN. So your query is fine.
Question #7 highlights that this is an awful schema. You should never store split-up year/month/day like this; a sale would just have a single timestamptz field, and to get the year you'd use date_trunc or extract. That's not your fault, it's bad table design in the question. The question could also be clearer; I think you've answered it correctly as written, but they don't say whether or not years with no sales should be shown - presumably they assume there aren't any. If there are, you'd have to do a left outer join over a generate_series of dates to zero-fill empty years.
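A rough sketch of that zero-fill, assuming the sales table keeps an integer year column as the question's schema suggests (adjust the year range to your data):
SELECT y.yr, coalesce(sum(s.price), 0) AS total_sales
FROM generate_series(2005, 2015) AS y(yr)
LEFT JOIN sales s ON s.year = y.yr
GROUP BY y.yr
ORDER BY y.yr;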
Question #8 is another bad question, frankly. "max price". Um. What? "Maximum price paid per item" would be "price/quantity". "Greatest total individual sale value for each product" would be what you wrote. The question seems to allow for either.
The query solution for Question #6 is:
select array_agg(s.sale_id),p.product_name,sum(s.quantity) as Quantity ,sum(s.price) as Total_Price
from sales s,products p
where s.product_id =p.product_id
group by p.product_id
order by sum(s.quantity) desc limit 1 ;
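If the "greatest dollar value" reading of the question were wanted instead, the same shape of query simply orders by the summed price:
select array_agg(s.sale_id), p.product_name, sum(s.quantity) as Quantity, sum(s.price) as Total_Price
from sales s, products p
where s.product_id = p.product_id
group by p.product_id
order by sum(s.price) desc limit 1;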
Comment On Others
Question#9: #Robin Hood's
select s.sale_id,p.product_name,s.quantity,s.price
from products p,sales s
where p.product_id=s.product_id and p.product_name LIKE 'S%';
LIKE 'S%' is case-sensitive, so that is how it works.
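If a case-insensitive match were wanted instead, PostgreSQL's ILIKE (or lower(product_name) LIKE 's%') would handle both cases, for example:
select s.sale_id, p.product_name, s.quantity, s.price
from products p, sales s
where p.product_id = s.product_id and p.product_name ILIKE 's%';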
Question#10: #Robin Hood's
The stored procedure is:
CREATE OR REPLACE FUNCTION get_details()
RETURNS TABLE(sale_id integer, product_name varchar, quantity integer, price int) AS
$BODY$
BEGIN
    -- return every sale together with its product details
    RETURN QUERY
    select s.sale_id, p.product_name, s.quantity, s.price
    from products p
    join sales s on p.product_id = s.product_id;
EXCEPTION WHEN no_data_found THEN
    RAISE NOTICE 'No data available';
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
Then select * from get_details(); will give you the result.
I need help with these questions as well! I just want to add these queries too.
--Question#9
--9. select product details with sales_id, product_name, quantity and price for products whose names start with the letter 's'
--This selects my product details
select s.sale_id,p.product_name,s.quantity,s.price
from products p,sales s
where p.product_id=s.product_id ;
--This isn't working to find the names which start with s. Is there any other way to solve this?
select s.sale_id,p.product_name,s.quantity,s.price
from products p,sales s
where p.product_id=s.product_id and product_name = 's%';
--10. write the stored procedure to extract all the sales and product details with sales_id, product_name, quantity and price, with exception handling and raising notices