When I use SELECT * FROM table, PostgreSQL returns the data ordered by id. But when I use SELECT DISTINCT * FROM table, PostgreSQL returns the same data set (there are no duplicates), yet the order has changed, which is beyond my understanding.
How does PostgreSQL sort the data when DISTINCT * is used without an ORDER BY clause?
If you put DISTINCT into a query, PostgreSQL sorts the result set by all result columns in order to eliminate duplicates. The sort order is “implementation defined” unless you add an explicit ORDER BY clause.
Two remarks:
Without the DISTINCT, the table happens to be returned in id order because you inserted the rows in that order and have performed no updates or deletes, and because there are no concurrent sequential scans on the table. You can never rely on the order of a result set unless you use ORDER BY.
DISTINCT can be very expensive on large result sets. Use it only if you are certain you need it.
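For example (a minimal sketch, assuming a table tab with an id column), an explicit ORDER BY makes the result order deterministic with or without DISTINCT:
SELECT DISTINCT * FROM tab ORDER BY id;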
I would like to understand when the OFFSET and LIMIT clauses execute in a PostgreSQL query. Given a query of the form
select
a.*,
(/* some subquery here */) as sub_query_result
from some_table a
where -- some condition
offset :offset
limit :limit
My understanding is that the table will first be filtered by the WHERE clause, and then the remaining rows will be projected into the form defined by the select list.
Do the OFFSET and LIMIT clauses execute after all operations in the select list have occurred? Or are the WHERE, OFFSET, and LIMIT clauses applied first, followed by the select part of the query?
I am hoping it applies the WHERE, OFFSET, and LIMIT clauses first, so that if I had a result set of, say, 10,000 rows and only wanted the 2nd page of 1,000, it would execute the subquery only 1,000 times, for example.
A query with LIMIT but without ORDER BY makes little sense. From the documentation:
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows.
When the ORDER BY clause is present, expressions in the select list (including subqueries or functions) must be evaluated for as many rows as are needed to determine the proper order. In the best case the number of computed rows may be limited to LIMIT + OFFSET, if that sum is less than the number of filtered rows. This means that (simplifying somewhat) the greater the OFFSET, the longer the query runs:
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.
In some cases the planner may optimize, e.g. when it recognizes an expression as immutable, but in general you should expect a subquery to be executed at least LIMIT + OFFSET times. In Postgres 9.5 or earlier the number of computed rows may be even larger if the ordering is not based on an index.
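If the select-list subquery is expensive, one workaround (a sketch, using hypothetical tables orders and order_items) is to apply the filtering, ORDER BY, OFFSET, and LIMIT in a derived table first, so the subquery is evaluated only for the rows of the requested page:
select o.*,
       (select count(*)
        from order_items i
        where i.order_id = o.id) as item_count
from (
    select *
    from orders
    order by id
    offset :offset
    limit :limit
) o;
Since PostgreSQL does not pull up a subquery that contains LIMIT, the count subquery in the outer select list runs only once per returned row.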
I have a query:
select distinct(donorig_cdn), cerhue_num_rfa, max(cerhue_dt)
from t_certif_hue
group by donorig_cdn, cerhue_num_rfa
order by donorig_cdn
It returns some repeated IDs with different cerhue_num_rfa values.
How do I return only one line per repeated ID, with the cerhue_num_rfa that matches the latest date (cerhue_dt), so that I end up with only 10 results instead of 15?
Postgres has SELECT DISTINCT ON to the rescue. It returns only the first row found for each value of the given expression. So all you need is an order that ensures the latest entry comes first. There is no need for grouping.
SELECT DISTINCT ON (donorig_cdn) donorig_cdn, cerhue_num_rfa, cerhue_dt
FROM t_certif_hue
ORDER BY donorig_cdn, cerhue_dt DESC;
We have a query in which a list of parameter values is supplied in the IN clause. Some time back this query failed to execute because the data in the IN clause grew so large that the resulting query exceeded Redshift's 16 MB query-size limit. We then tried processing the data in batches so as to stay under the 16 MB limit.
My question is: what are the factors/pitfalls to keep in mind when supplying such a large list to the IN clause of a query, and is there an alternative way to deal with such large data in an IN clause?
If you have control over how you generate your code, you could split it up as follows.
First statement to be submitted: drop and recreate the filter table:
drop table if exists myfilter;
create table myfilter (filter_text varchar(max));
The second step is to populate the filter table in batches of a suitable size, e.g. 1,000 values at a time:
insert into myfilter
values ({{myvalue1}}), ({{myvalue2}}), ({{myvalue3}}); -- etc., up to 1000 rows per statement
Repeat the above step until all of your values have been inserted.
Then use that filter table as follows:
select * from master_table
where some_value in (select filter_text from myfilter);
drop table myfilter;
A large IN list is not best practice in itself; it is better to use a join for large lists:
construct a virtual table in a subquery,
join your target table to that virtual table,
like this:
with your_list as (
    select 'first_value' as search_value
    union select 'second_value'
    ...
)
select ...
from target_table t1
join your_list t2
    on t1.col = t2.search_value
How do you do LIMIT in DB2 for iSeries?
I have a table with more than 50,000 records and I want to return records 0 to 10,000, and records 10,000 to 20,000.
I know that in MySQL you write LIMIT 0,10000 at the end of the query for records 0 to 10,000, and LIMIT 10000,10000 for records 10,000 to 20,000.
So how is this done in DB2? What's the code and syntax?
(full query example is appreciated)
Using FETCH FIRST [n] ROWS ONLY:
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db29.doc.perf/db2z_fetchfirstnrows.htm
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 20 ROWS ONLY;
To get ranges, you'd have to use ROW_NUMBER() (available since V5R4) within the WHERE clause (taken from here: http://www.justskins.com/forums/db2-select-how-to-123209.html):
SELECT code, name, address
FROM (
SELECT row_number() OVER ( ORDER BY code ) AS rid, code, name, address
FROM contacts
WHERE name LIKE '%Bob%'
) AS t
WHERE t.rid BETWEEN 20 AND 25;
I developed this method:
You NEED a table that has a unique value that can be ordered.
If you want rows 10,000 to 25,000 and your table has 40,000 rows, first you need to get the starting point and total rows:
int start = 40000 - 10000;
int total = 25000 - 10000;
Then pass these values into the query:
SELECT * FROM
(SELECT * FROM schema.mytable
ORDER BY userId DESC fetch first {start} rows only ) AS mini
ORDER BY mini.userId ASC fetch first {total} rows only
Support for OFFSET and LIMIT was recently added to DB2 for i 7.1 and 7.2. You need the following DB PTF group levels to get this support:
SF99702 level 9 for IBM i 7.2
SF99701 level 38 for IBM i 7.1
See here for more information: OFFSET and LIMIT documentation, DB2 for i Enhancement Wiki
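With those PTF group levels installed, the paging from the question can be written directly (a sketch, assuming a hypothetical sort column id):
select *
from mytable
order by id
limit 10000 offset 10000; -- rows 10,001 to 20,000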
Here's the solution I came up with:
select FIELD from TABLE where FIELD > LASTVAL order by FIELD fetch first N rows only;
By initializing LASTVAL to 0 (or '' for a text field) and then setting it to the last value in the most recent batch of records, this steps through the table in chunks of N records.
#elcool's solution is a smart idea, but you need to know the total number of rows (which can even change while you are executing the query!). So I propose a modified version, which unfortunately needs 3 subqueries instead of 2:
select * from (
select * from (
select * from MYLIB.MYTABLE
order by MYID asc
fetch first {last} rows only
) I
order by MYID desc
fetch first {length} rows only
) II
order by MYID asc
where {last} should be replaced with the row number of the last record I need and {length} should be replaced with the number of rows I need, calculated as last row - first row + 1.
E.g. if I want rows from 10 to 25 (totally 16 rows), {last} will be 25 and {length} will be 25-10+1=16.
Try this:
SELECT * FROM
(
    SELECT T.*, ROW_NUMBER() OVER (ORDER BY ID) AS R -- ID stands in for your ordering column
    FROM MYTABLE T
) AS NUMBERED
WHERE R BETWEEN 10000 AND 20000;
The LIMIT clause allows you to limit the number of rows returned by a query. It is an extension of the SELECT statement with the following syntax:
SELECT select_list
FROM table_name
ORDER BY sort_expression
LIMIT n [OFFSET m];
In this syntax:
n is the number of rows to be returned.
m is the number of rows to skip before returning the n rows.
Another, shorter version of the LIMIT clause is as follows:
LIMIT m, n;
This syntax means skipping m rows and returning the next n rows from the result set.
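For example (a sketch against a hypothetical employees table), the following two queries are equivalent; both skip the first 10 rows and return the next 10:
SELECT * FROM employees ORDER BY id LIMIT 10 OFFSET 10;
SELECT * FROM employees ORDER BY id LIMIT 10, 10;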
A table may store rows in an unspecified order. If you don't use the ORDER BY clause with the LIMIT clause, the order of the returned rows is also unspecified. Therefore, it is good practice to always use the ORDER BY clause with the LIMIT clause.
See Db2 LIMIT for more details.
You should also consider the OPTIMIZE FOR n ROWS clause. More details on all of this in the DB2 LUW documentation in the Guidelines for restricting SELECT statements topic:
The OPTIMIZE FOR clause declares the intent to retrieve only a subset of the result or to give priority to retrieving only the first few rows. The optimizer can then choose access plans that minimize the response time for retrieving the first few rows.
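For example (a sketch reusing the EMP table from the FETCH FIRST answer above), the clause is simply appended to the query:
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 20 ROWS ONLY
OPTIMIZE FOR 20 ROWS;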
There are 2 solutions to paginate efficiently on a DB2 table:
1 - the technique using the ROW_NUMBER() function and the OVER clause, presented in another answer ("SELECT row_number() OVER ( ORDER BY ... )"). On some big tables I have sometimes noticed performance degradation.
2 - the technique using a scrollable cursor. The implementation depends on the language used. This technique seems more robust on big tables.
I presented the two techniques, implemented in PHP, during a seminar last year. The slides are available at this link:
http://gregphplab.com/serendipity/uploads/slides/DB2_PHP_Best_practices.pdf
Sorry, but this document is only in French.
There are these options available:
DB2 has several strategies to cope with this problem.
You can use the scrollable cursor feature.
In this case you can open a cursor and, instead of re-issuing the query, FETCH forward and backward.
This works great if your application can hold state, since it doesn't require DB2 to rerun the query every time (see the sketch after this list).
You can use the ROW_NUMBER() OLAP function to number rows and then return the subset you want.
This is ANSI SQL
You can use the ROWNUM pseudo-column, which does much the same as ROW_NUMBER() and is convenient if you have Oracle skills.
You can use LIMIT and OFFSET if you lean more toward the MySQL or PostgreSQL dialects.
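A rough sketch of the scrollable-cursor option in embedded-SQL style (all names are hypothetical, and the exact cursor options vary by DB2 platform and host language):
DECLARE c1 SCROLL CURSOR FOR
    SELECT code, name FROM contacts ORDER BY code;
OPEN c1;
-- position directly on row 10,000, then read forward one row at a time
FETCH RELATIVE 10000 FROM c1 INTO :code, :name;
FETCH NEXT FROM c1 INTO :code, :name;
CLOSE c1;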