How to speed up a SQL query and get records faster [closed] - postgresql

I have a large number of records in my database. I ran a SQL query with a WHERE condition, got the data, and also show a count of the records. It takes around 21 seconds, and I need to reduce that time.
SQL query:
select * from sample_rank where company_id = 1
Record count:
166270266
I added an index in the database, but the data still isn't returned quickly. How can I fix this, and how should I change the query?

Assuming, going by your comments, that you only need to select two of the columns in your table:
SELECT col1, col2 FROM sample_rank WHERE company_id = 1;
then the following covering index is probably the best you can do here:
CREATE INDEX idx ON sample_rank (company_id, col1, col2);
The above index completely covers the query, meaning that, if it is used, your SQL engine can satisfy the entire query plan from the index alone. I stress "if" because, depending on the cardinality of the data, the above index might not make the query faster. For example, if you only have two company_id values, 1 and 2, with 50% of the records having each value, then your SQL engine might decide that a full table scan is faster than using the index at all. As suggested in the comments, running EXPLAIN on your actual query would reveal more information.
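If you want to check whether that index is actually used, a minimal sketch (reusing the table from the question and the hypothetical col1/col2 names from above):
-- Shows the chosen plan plus actual timings; ideally you see an Index Only Scan on idx.
EXPLAIN (ANALYZE, BUFFERS)
SELECT col1, col2 FROM sample_rank WHERE company_id = 1;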

Related

Get latest rows in PostgreSQL table ordered by Date: Index or Sort table?

I had a hard time titling this question, but I hope it's appropriate.
I have a table of transactions, and each transaction has a Date column (of type Date).
I want to run a query that gets the latest 100 transactions by date (simple enough with an ORDER BY query).
My question is: in order to make this an extremely cheap operation, would it make sense to sort my entire table so that I just need to select the top 100 rows every time, or do I simply create an index on the date column? I'm not sure if the first option is even possible and/or good SQL DB practice.
You would add an index on the column with the date and query:
SELECT * FROM tab
ORDER BY datecol DESC
LIMIT 100;
The problem with your other idea is that there is no well-defined order in a table. Every UPDATE changes this "order", and even if you don't modify anything, a sequential scan need not start at the beginning of the table.
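For completeness, a minimal sketch of the index that supports this query (table and column names taken from the query above):
CREATE INDEX tab_datecol_idx ON tab (datecol);
-- PostgreSQL can scan a b-tree index backwards, so ORDER BY datecol DESC LIMIT 100
-- only needs to read the last 100 index entries instead of sorting the whole table.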

Doubts on GROUP BY with alias [closed]

SELECT
first_name || ' ' || last_name full_name,
SUM (amount) amount
FROM
payment
INNER JOIN customer USING (customer_id)
GROUP BY
full_name
ORDER BY amount;
Can I know how this query works with GROUP BY full_name? By right, the query should give an error because of the order of SQL execution (https://stackoverflow.com/a/3841804/15279872). The output of the above query can be seen at this link (https://www.postgresqltutorial.com/postgresql-group-by/) under Section 3: Using PostgreSQL GROUP BY clause with the JOIN clause.
Here is what the documentation says about GROUP BY in PostgreSQL:
In the SQL-92 standard, an ORDER BY clause can only use output column
names or numbers, while a GROUP BY clause can only use expressions
based on input column names. PostgreSQL extends each of these clauses
to allow the other choice as well (but it uses the standard's
interpretation if there is ambiguity). PostgreSQL also allows both
clauses to specify arbitrary expressions. Note that names appearing in
an expression will always be taken as input-column names, not as
output-column names.
SQL:1999 and later use a slightly different definition which is not
entirely upward compatible with SQL-92. In most cases, however,
PostgreSQL will interpret an ORDER BY or GROUP BY expression the same
way SQL:1999 does.
The “output column name” mentioned above can be an alias, so PostgreSQL allows aliases in the GROUP BY clause.
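A minimal illustration of that rule, using a hypothetical table t:
-- Hypothetical table, for illustration only.
CREATE TABLE t (first_name text, last_name text, amount numeric);
-- full_name is an output-column alias, not an input column of t;
-- PostgreSQL still accepts it in GROUP BY.
SELECT first_name || ' ' || last_name AS full_name, SUM(amount) AS total
FROM t
GROUP BY full_name;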

PostgreSQL performance tuning with table partitions

I am solving a performance issue on a system based on a PostgreSQL 9.6 database. Intro:
A 12-year-old system, similar to a banking system, whose most-queried primary table is called transactions.
CREATE TABLE jrn.transactions (
ID BIGSERIAL,
type_id VARCHAR(200),
account_id INT NOT NULL,
date_issued DATE,
date_accounted DATE,
amount NUMERIC,
..
)
In the transactions table we store all transactions within a bank account. The field type_id determines the type of a transaction. It also serves as the C# EntityFramework discriminator column. Values are like:
card_payment, cash_withdrawl, cash_in, ...
14 types of transaction are known.
In general, there are 4 types of queries (nos. 3 and 4 are by far the most frequent):
select single transaction like: SELECT * FROM jrn.transactions WHERE id = 3748734
select single transaction with JOIN to other transaction like: SELECT * FROM jrn.transactions AS m INNER JOIN jrn.transactions AS r ON m.refund_id = r.id WHERE m.id = 3748734
select transactions 0-100, 100-200, .. of a given type, like: SELECT * FROM jrn.transactions WHERE account_id = 43784 AND type_id = 'card_payment' LIMIT 100
several aggregate queries, like: SELECT SUM(amount), MIN(date_issued), MAX(date_issued) FROM jrn.transactions WHERE account_id = 3748734 AND date_issued >= '2017-01-01'
In the last few months we had unexpected row count growth; the table now has 120M rows.
We are thinking of table partitioning, following the PostgreSQL docs: https://www.postgresql.org/docs/10/static/ddl-partitioning.html
Options:
partition table by type_id into 14 partitions
add column year and partition table by year (or year_month) into 12 (or 144) partitions.
I am now restoring data into our test environment, and I am going to test both options.
What do you consider the most appropriate partitioning rule for such a situation? Any other options?
Thanks for any feedback / advice etc.
Partitioning won't be very helpful with these queries, since they won't perform a sequential scan, unless you forgot an index.
The only good reason I see for partitioning would be if you want to delete old rows efficiently; then partitioning by date would be best.
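If efficient removal of old rows is a goal, a rough sketch of range partitioning by date follows (note: declarative partitioning as in the linked documentation requires PostgreSQL 10 or later; on 9.6 you would need inheritance-based partitioning. The table name and bounds below are made up for illustration):
CREATE TABLE jrn.transactions_part (
    id BIGSERIAL,
    type_id VARCHAR(200),
    account_id INT NOT NULL,
    date_issued DATE NOT NULL,
    date_accounted DATE,
    amount NUMERIC
) PARTITION BY RANGE (date_issued);

CREATE TABLE jrn.transactions_y2017 PARTITION OF jrn.transactions_part
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
-- Removing a whole year later is then just a cheap DROP TABLE of that partition.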
Based on your queries, you should have these indexes (apart from the primary key index):
CREATE INDEX ON jrn.transactions (account_id, date_issued);
CREATE INDEX ON jrn.transactions (refund_id);
The following index might be a good idea if you can sacrifice some insert performance to make the third query as fast as possible (you might want to test):
CREATE INDEX ON jrn.transactions (account_id, type_id);
What you have here is almost a perfect case for column-based storage as you may get it using a SAP HANA Database. However, as you explicitly have asked for a Postgres answer and I doubt that a HANA database will be within the budget limit, we will have to stick with Postgres.
Your two queries, no. 3 and no. 4, go in quite different directions, so there won't be "the single answer" to your problem - you will always have to balance somehow between these two use cases. Yet, I would try to use two different techniques to approach each of them individually.
From my perspective, the biggest problem is the query no. 4, which creates quite a high load on your postgres server just because it is summing up values. Moreover, you are just summing up values over and over again, which most likely won't change often (or even at all), as you have said that UPDATEs nearly do not happen at all. I furthermore assume two more things:
transactions is INSERT-only, i.e. DELETE statements almost never happen (besides perhaps in cases of some exceptional administrative intervention).
The values of column date_issued when INSERTing typically are somewhere "close to today" - so you usually won't INSERT stuff way in the past.
Out of this, to prevent aggregating values over and over again unnecessarily, I would introduce yet another table: let's call it transactions_aggr, which is built up like this:
create table transactions_aggr (
account_id INT NOT NULL,
date_issued DATE,
sumamount NUMERIC,
primary key (account_id, date_issued)
)
which will give you a table of per-day preaggregated values.
To determine which values are already preaggregated, I would add another boolean-typed column to transactions, which indicates which of the rows are contained in transactions_aggr and which are not (yet). Query no. 4 would then have to be changed so that it reads only non-preaggregated rows from transactions, while the rest comes from transactions_aggr. To facilitate that you could define a view like this:
-- combine the per-day preaggregated sums with the not-yet-aggregated raw rows
select account_id, date_issued, sum(amount) as sumamount
from (
    select account_id, date_issued, sumamount as amount
    from transactions_aggr as aggr
    union all
    select account_id, date_issued, amount
    from transactions as t
    where t.aggregated = false
) as combined
group by account_id, date_issued;
Needless to say, putting an index on transactions.aggregated (perhaps in conjunction with account_id) could greatly help to improve performance here.
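One cheap way to do that (a sketch, assuming the flag column is called aggregated) is a partial index that only covers the rows still waiting to be aggregated, so it stays small even as the table grows:
CREATE INDEX transactions_not_aggregated_idx
    ON jrn.transactions (account_id, date_issued)
    WHERE aggregated = false;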
Updating transactions_aggr can be done using multiple approaches:
You could use this as a one-time activity and only pre-aggregate the current set of ~120m rows once. This would at least reduce the load on your machine doing aggregations significantly. However, over time you will run into the same problem again. Then you may just re-execute the entire procedure, simply dropping transactions_aggr as a whole and re-create it from scratch (all the original data still is there in transactions).
You have a nice period somewhere during the week/month/at night where few or no queries are coming in. Then you can open a transaction, read all transactions WHERE aggregated = false and add them with UPDATEs to transactions_aggr. Keep in mind to then set aggregated to true (this should be done in the same transaction). The tricky part, however, is that you must pay attention to what reading queries will "see" of this transaction: depending on your accuracy requirements during the timeframe of this "update job", you may have to consider switching the transaction isolation level to READ COMMITTED to prevent ghost reads.
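For reference, a sketch of that update step collapsed into a single statement (assuming the flag column aggregated and the transactions_aggr table defined above; INSERT ... ON CONFLICT requires PostgreSQL 9.5+, which a 9.6 system satisfies):
WITH flagged AS (
    -- Flag the not-yet-aggregated rows and hand them to the INSERT below in
    -- the same statement, so no row can slip in between two separate steps.
    UPDATE jrn.transactions
       SET aggregated = true
     WHERE aggregated = false
    RETURNING account_id, date_issued, amount
)
INSERT INTO transactions_aggr (account_id, date_issued, sumamount)
SELECT account_id, date_issued, sum(amount)
FROM flagged
GROUP BY account_id, date_issued
ON CONFLICT (account_id, date_issued)
DO UPDATE SET sumamount = transactions_aggr.sumamount + EXCLUDED.sumamount;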
On the matter of your query no. 3, you could then try to really go for the approach of partitioning based on type_id. However, I find your query a little strange, as you are performing a LIMIT/OFFSET without having specified an ordering (there is no ORDER BY clause in place; NB: you are not saying that you would be using database cursors). This may lead to the effect that the implicit order, which is currently used, changes if you enable partitioning on the table. So be careful about side effects this may cause in your program.
And one more thing: Before really doing the partition split, I would first check on the data distribution concerning type_id by issuing
select type_id, count(*) from transactions group by type_id
Otherwise it might turn out that, for example, 90% of your data is card_payment - leaving you with a heavily uneven distribution amongst your partitions, while the biggest performance-hogging queries would still go to this single "large partition".
Hope this helps a little - and good luck!

Performance of like 'query%' on multimillion rows, postgresql

We have a table with 10 million rows. We need to find the first few rows matching LIKE 'user%'.
This query is fast if it matches at least 2 rows (it returns results in 0.5 sec). If it doesn't find 2 rows matching that criterion, it takes at least 10 sec. 10 seconds is huge for us (since we are using this for auto-suggestions, users will not wait that long to see the suggestions).
Query: select distinct(name) from user_sessions where name like 'user%' limit 2;
In the above query, the name column is of type citext and it is indexed.
Whenever you're working on performance, start by explaining your query. That'll show the query optimizer's plan, and you can get a sense of how long it's spending on various pieces. In particular, check for any full table scans, which mean the database is examining every row in the table.
Since the query is fast when it finds something and slow when it doesn't, it sounds like you are indeed hitting a full table scan. I believe you that it's indexed, but since you're doing a like, the standard string index can't be used efficiently. You'll want to check out varchar_pattern_ops (or text_pattern_ops, depending on the column type of name). You create that this way:
CREATE INDEX pattern_index_on_users_name ON users (name varchar_pattern_ops);
After creating an index, run EXPLAIN on the query to make sure it's being used. text_pattern_ops doesn't work with the citext extension, so in this case you'll have to index and search on lower(name) to get good case-insensitive performance:
CREATE INDEX pattern_index_on_users_lower_name ON users (lower(name) text_pattern_ops);
SELECT * FROM users WHERE lower(name) LIKE 'user%' LIMIT 2;
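One quick way to verify that the index is picked up (a sketch; actual plans depend on your data, and the users table name follows the examples above):
EXPLAIN
SELECT DISTINCT name FROM users WHERE lower(name) LIKE 'user%' LIMIT 2;
-- You want to see an Index Scan or Bitmap Index Scan on the pattern_ops index
-- here rather than a Seq Scan on users.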

Stuck on a query and need support to improve the performance (if any) of the execution (PostgreSQL)? [closed]

I am new to PostgreSQL and I am learning by working through a few examples!
I am solving some queries in PostgreSQL; I got through a few but got stuck at one point!
Given the sample data in the SQLFiddle below, I tried:
--6. Find the most sold product with sales_id, product_name, quantity and sum(price)
select array_agg(s.sale_id),p.product_name,s.quantity,sum(s.price)
from products p
join sales s
on p.product_id=s.product_id;
but it fails with:
ERROR: column "p.product_name" must appear in the GROUP BY clause or be used in an aggregate function:
This is the SQL Fiddle with sample data.
I'm using PostgreSQL 9.2.
For all that it looks simple, this is quite an interesting problem.
The unsolved #6
There are two stages to this:
find the most sold product; and
display the required detail on that product
The question is badly written; it fails to specify whether you want
the product with the greatest number of sales, or the greatest
dollar sales value. I will assume the former, but it's easy to adapt the following queries to sort by total price instead.
UPDATE: #user2561626 found the simple solution I was sure I was overlooking but couldn't think of: http://sqlfiddle.com/#!12/dbe7c/118 . Use the output of SUM in ORDER BY, then LIMIT the result set.
The following are the complicated and roundabout ways I tried because I couldn't think of the simple way:
One way is to use a subquery with an ORDER BY and LIMIT to sort products by total number of sales, then pick the top one. You then join on that inner query to generate the desired product summary. In this case I join on sales twice, once in the inner query and once in the outer where I calculate more detail for just one product. It's possibly more efficient to join on it just once in the inner query and do more work, but that'll involve creating and discarding a bigger result set, so it's the sort of thing you'd tune based on your data distribution.
SELECT
array_agg(s.sale_id) AS sales_ids,
(SELECT p.product_name FROM products p WHERE p.product_id = pp.product_id) AS product_name,
sum(s.quantity) AS total_quantity,
sum(s.price) AS total_price
FROM
(
-- Find the product with the largest number of sales
-- If multiple products have the same sales an arbitrary candidate
-- is selected; extend the ORDER BY if you want to control which
-- one gets picked.
SELECT
s2.product_id, sum(s2.quantity) AS total_quantity
FROM sales s2
GROUP BY s2.product_id
ORDER BY 2 DESC
LIMIT 1
) AS pp
INNER JOIN sales s ON (pp.product_id = s.product_id)
GROUP BY s.product_id, pp.product_id;
I'm honestly not too sure how to phrase this in purely standard SQL (i.e. no LIMIT clause). You can use a CTE or multiple scans in subqueries to find the greatest number of sales and the product Id with the greatest number of sales, but that'll give you multiple results if you have more than one product with equal sales.
I can't help but feel I've totally forgotten the simple and obvious way to do this.
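For reference, a sketch of the CTE approach mentioned above; unlike LIMIT 1 it returns all products tied for the top spot (column names as in the fiddle):
WITH totals AS (
    SELECT product_id, sum(quantity) AS total_quantity
    FROM sales
    GROUP BY product_id
)
SELECT product_id, total_quantity
FROM totals
WHERE total_quantity = (SELECT max(total_quantity) FROM totals);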
Comments on others:
--1. Write the query to find the products which are not sold
select *
from products
where product_id not in (select distinct PRODUCT_ID from sales );
Your solution is subtly incorrect, because there's no NOT NULL constraint on product_id in sales. It builds a list then filters on the list, but the list could contain NULL, and 2 NOT IN (1, NULL) is NULL, which in WHERE is treated as false.
It is much better to re-phrase this as WHERE NOT EXISTS (SELECT 1 FROM sales s WHERE s.product_id = products.product_id).
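Spelled out in full, the rewritten query #1 would look roughly like this (same tables as in the fiddle):
SELECT *
FROM products p
WHERE NOT EXISTS (
    SELECT 1 FROM sales s WHERE s.product_id = p.product_id
);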
With #2 it's again better to use EXISTS, but PostgreSQL can optimize it into the better form automatically since it's semantically the same; the NULL issue doesn't apply for IN, only NOT IN. So your query is fine.
Question #7 highlights that this is an awful schema. You should never store split-up year/month/day like this; a sale would just have a single timestamptz field, and to get the year you'd use date_trunc or extract. That's not your fault, it's bad table design in the question. The question could also be clearer; I think you've answered it correctly as written, but they don't say whether or not years with no sales should be shown - presumably they assume there aren't any. If there are, you'd have to do a left outer join over a generate_series of dates to zero-fill empty years.
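To illustrate, with a hypothetical single sale_ts timestamptz column the per-year grouping would then look something like:
SELECT extract(year FROM sale_ts) AS sale_year, sum(price) AS total_sales
FROM sales
GROUP BY sale_year
ORDER BY sale_year;
-- Zero-filling years with no sales would additionally need a left join
-- against a generate_series() of years, as mentioned above.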
Question #8 is another bad question, frankly. "max price". Um. What? "Maximum price paid per item" would be "price/quantity". "Greatest total individual sale value for each product" would be what you wrote. The question seems to allow for either.
The query solution for Question #6 is:
select array_agg(s.sale_id), p.product_name, sum(s.quantity) as quantity, sum(s.price) as total_price
from sales s, products p
where s.product_id = p.product_id
group by p.product_id
order by sum(s.quantity) desc
limit 1;
Comments on others:
Question #9: #Robin Hood's
select s.sale_id,p.product_name,s.quantity,s.price
from products p,sales s
where p.product_id=s.product_id and p.product_name LIKE 'S%';
The 'S%' pattern is case-sensitive, so that is how it works.
Question #10: #Robin Hood's
The stored procedure is:
CREATE OR REPLACE FUNCTION get_details()
RETURNS TABLE(sale_id integer, product_name varchar, quantity integer, price int) AS
$BODY$
BEGIN
    RETURN QUERY
    SELECT s.sale_id, p.product_name, s.quantity, s.price
    FROM products p
    JOIN sales s ON p.product_id = s.product_id;
    -- RETURN QUERY does not raise NO_DATA_FOUND on an empty result,
    -- so check FOUND instead of relying on an EXCEPTION block.
    IF NOT FOUND THEN
        RAISE NOTICE 'No data available';
    END IF;
END
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
Run select * from get_details(); and you will get the result.
I need help with these questions as well! I just want to add these queries too.
--Question#9
--9. Select product details with sales_id, product_name, quantity and price for those products whose names start with the letter ‘s’
--This selects my product details
select s.sale_id,p.product_name,s.quantity,s.price
from products p,sales s
where p.product_id=s.product_id ;
--This isn't working to find those names which start with 's'.. is there any other way to solve this?
select s.sale_id,p.product_name,s.quantity,s.price
from products p,sales s
where p.product_id=s.product_id and product_name = 's%';
--10. Write the stored procedure to extract all the sales and product details with sales_id, product_name, quantity and price, with exception handling and raising notices