Optimize "select * from table" query with 200 million entries - DB2 database - db2

I have a simple
select * from table query, but the table holds ~200 million entries. How can I optimize the query and still retrieve all the data from the table?
The database is db2.

Enable parallelism by setting the CURRENT DEGREE special register to ANY (SET CURRENT DEGREE = 'ANY'). This lets DB2 use multiple processors to satisfy the query faster. If you have any WHERE clause search conditions, indexing may help.

Related

PostgreSQL Query Performance Fluctuates

We have a system that loads data and then conducts data QC in PostgreSQL. The QC function's performance fluctuates drastically in one of our environments with no apparent pattern. I was able to track down the performance of the following simple query in the QC function:
WITH foo AS (SELECT full_address, jsonb_agg (gad_rec_id) gad_rec_ids
FROM azgiv.v_full_addresses
WHERE gad_gly_id = 495
GROUP BY full_address
HAVING count(1) > 1)
SELECT gad_nguid, gad_rec_id, foo.full_address
FROM azgiv.v_full_addresses JOIN foo
ON foo.full_address = v_full_addresses.full_address
AND v_full_addresses.gad_gly_id = 495;
When I ran into the slow-performance situation (Fig 2), I had to ANALYZE the table behind the view before the query plan changed to the fast one (Fig 1). v_full_addresses is a simple view of a partitioned table with a bunch of columns concatenated.
Here are two images of the query plans for the above query. I am a newbie when it comes to understanding query optimization, and any help is greatly appreciated.
If performance improves after you ANALYZE a table, that means that the database's knowledge about the distribution of the data is outdated.
The best remedy is to tell PostgreSQL to collect these statistics more often:
ALTER TABLE some_table SET (autovacuum_analyze_scale_factor = 0.02);
0.02 is five times lower than the default 0.1, so statistics will be gathered five times more often.
If the bad query plans are generated right after a bulk load, you must choose a different strategy. In this case the problem is that it takes up to a minute for auto-analyze to kick in and calculate new statistics.
In that case you should run an explicit ANALYZE at the end of the bulk load.
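To make the effect of the scale factor concrete, here is a minimal sketch of the arithmetic. It assumes the documented trigger rule (auto-analyze runs once the number of changed rows exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * row count) and the default autovacuum_analyze_threshold of 50. The function name and the one-million-row table size are hypothetical, chosen only for illustration.

```python
# Sketch of when auto-analyze is triggered, assuming the documented rule:
#   changed rows > autovacuum_analyze_threshold
#                  + autovacuum_analyze_scale_factor * row count
# analyze_trigger_rows is a hypothetical helper, not a PostgreSQL API.

def analyze_trigger_rows(reltuples, scale_factor=0.1, base_threshold=50):
    """Return the number of changed rows after which auto-analyze fires."""
    return base_threshold + scale_factor * reltuples


rows = 1_000_000  # hypothetical table size

default_trigger = analyze_trigger_rows(rows)                    # default 0.1
tuned_trigger = analyze_trigger_rows(rows, scale_factor=0.02)   # tuned 0.02

print(f"default: analyze after ~{default_trigger:.0f} changed rows")
print(f"tuned:   analyze after ~{tuned_trigger:.0f} changed rows")
```

For large tables the fixed threshold of 50 is negligible, so the ratio between the two values is very close to five.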

create 2 indexes on same column

I have a table with geometry column.
I have 2 indexes on this column:
create index idg1 on tbl using gist(geom)
create index idg2 on tbl using gist(st_geomfromewkb((geom)::bytea))
I have a lot of queries using the geom (geometry) field.
Which index is used (when and why)?
If there are two indexes on the same column (as shown here), can SELECT queries run slower than with just one index on the column?
The use of an index depends on how the index was defined and how the query is written. If you SELECT <cols> FROM tbl WHERE geom = <some_value>, then you will use the idg1 index. If you SELECT <cols> FROM tbl WHERE st_geomfromewkb(geom) = <some_value>, then you will use the idg2 index.
A good way to know which index will be used for a particular query is to call the query with EXPLAIN (i.e., EXPLAIN SELECT <cols> FROM tbl WHERE geom = <some_value>) -- this will print out the query plan, which access methods, which indexes, which joins, etc. will be used.
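The same check can be scripted. The sketch below is not PostgreSQL or PostGIS: it assumes SQLite (through Python's built-in sqlite3 module) as a stand-in, uses EXPLAIN QUERY PLAN instead of EXPLAIN, and replaces the st_geomfromewkb(...) expression with upper(), purely to illustrate that a plain-column predicate and an expression predicate pick different indexes. The helper name plan is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL/PostGIS, and upper(geom) is a
# hypothetical substitute for the st_geomfromewkb((geom)::bytea) expression.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, geom TEXT)")
conn.execute("CREATE INDEX idg1 ON tbl (geom)")          # plain column index
conn.execute("CREATE INDEX idg2 ON tbl (upper(geom))")   # expression index


def plan(sql):
    """Return the planner's description of how `sql` would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


plain_plan = plan("SELECT id FROM tbl WHERE geom = 'a'")
expr_plan = plan("SELECT id FROM tbl WHERE upper(geom) = 'A'")

print(plain_plan)  # expected to mention idg1
print(expr_plan)   # expected to mention idg2
```

The exact wording of the plan text varies between SQLite versions, so only the index name in it should be relied on.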
For your question regarding performance, the SELECT queries could run slower because there are more indexes to consider in the query planning phase. In terms of executing a given query plan, a SELECT query will not run slower because by then the query plan has been established and the decision of which index to use has been made.
You will certainly experience performance impact upon INSERT/UPDATE/DELETE of the table, as all indexes will need to be updated with respect to the changes in the table. As such, there will be extra I/O activity on disk to propagate the changes, slowing down the database, especially at scale.
Which index is used depends on the query.
Any query that has
WHERE geom && '...'::geometry
or
WHERE st_intersects(geom, '...'::geometry)
or similar will use the first index.
The second index will only be used for queries that have the expression st_geomfromewkb((geom)::bytea) in them.
This is completely useless: it converts the geometry to EWKB format and back. You should find and rewrite all queries that have this weird construct, then you should drop that index.
Having two indexes on a single column does not slow down your queries significantly (planning will take a bit longer, but I doubt if you can measure that). You will have a performance penalty for every data modification though, which will take almost twice as long as with a single index.

Concurrent select queries split by row ids Vs one query

When running a SELECT against a single table l (no joins) with billions of rows, is it a good idea to run concurrent queries by splitting the query into multiple queries over distinct subsets/ranges of an indexed column, say the integer primary key id?
Or does Postgres internally do this already, leading to no significant gain in speed for the end user?
I have two use cases:
getting the total count of rows
getting the list of ids
Edit-1: The query has a WHERE clause on several columns, one of which is not indexed while the others are indexed
SELECT id
FROM l
WHERE indexed_column-1='A'
AND indexed_column-2='B'
AND not_indexed_column-1='C'
Postgres has parallelization built in since version 9.6. (Improved in current versions.) It will be much more efficient than manually splitting a SELECT on a big table.
You can set the number of max_parallel_workers to your needs to optimize.
Since you are only interested in the id column, it may help to have an index on (id) (which already exists if id is the PK) and to fulfill the prerequisites for an index-only scan.
In the case where you want to count the number of rows, you can just let PostgreSQL's internal query parallelization do the work. It will be faster, and the result will be consistent.
In the case where you want to get the list of primary keys, it depends on the WHERE conditions of the query. If you are selecting only a few rows, parallel query will do nicely.
If you want all ids of the table, PostgreSQL will probably not choose a parallel plan, because the cost of exchanging so many values between the worker processes will outweigh the advantages of parallelization. In that case, you may be faster with parallel sessions as you envision.
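If you do go the parallel-sessions route, the id space has to be cut into non-overlapping ranges, one per session. The following is a minimal sketch of that split only. The helper name id_ranges, the id bounds 1..10 and the three-way split are hypothetical, and actually running the generated statements in separate connections is left out. Note that separate sessions do not share a snapshot, so the combined result may not be consistent if the table changes while they run.

```python
# Sketch: split an inclusive id range into contiguous, non-overlapping chunks,
# one per session. id_ranges is a hypothetical helper for illustration.

def id_ranges(min_id, max_id, parts):
    """Split [min_id, max_id] into at most `parts` contiguous ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = min_id
    for i in range(parts):
        # The first `extra` chunks get one additional id each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more parts than ids: skip empty chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


chunks = id_ranges(1, 10, 3)  # hypothetical bounds and session count
queries = [
    f"SELECT id FROM l WHERE id BETWEEN {lo} AND {hi}" for lo, hi in chunks
]

for q in queries:
    print(q)
```

In practice the bounds would come from min(id) and max(id) of the table, and gaps in the id sequence can make the chunks uneven in row count.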
This 4-column composite index would probably be faster than using parallelism:
INDEX(indexed_column-1, indexed_column-2, -- first, in either order
not_indexed_column-1, id)

Oracle order by query very slow

I am facing a problem when doing order by on a table.
My select query works fine, but when I add an ORDER BY (even on the primary key) it just goes on and on with no results. Finally I need to kill the session. The table has 20K records.
Any suggestions for this?
Query is as:
SELECT * FROM Users ORDER BY ID;
I do not know about the query plan, as I am new to Oracle.
For the unordered query, is SQL Developer retrieving and displaying all 20K rows, or just the first 50? Your comparison might not be fair.
What is the size of those 20K rows: select bytes/1024/1024 MB from user_segments where segment_name = 'USERS'; I've seen many cases where a few megabytes of data use many gigabytes of storage. Maybe the data was very large before and somebody just deleted it (this doesn't remove the space). Or maybe somebody inserted those rows 1 at a time with an APPEND hint, and each row is taking an entire block.
Your query might be waiting for more temp tablespace for sorting; look at DBA_RESUMABLE.

SQL: Order of output

I was checking the docs of postgresql for Recursive queries where I got an example.
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t
Is the above statement the same as 100 SELECT statements? From the docs:
Recursive queries are typically used to deal with hierarchical or tree-structured data.
If I want to sort the hierarchical structure based on some criteria, is it advisable to use a recursive query? E.g. SQL Query: Fetch ordered rows from a table - II and the accepted answer. Should the data be retrieved from the DB and then sorted in memory, or will a RECURSIVE query be more efficient?
The answer depends on your schema design, hardware/OS, configuration, and volume of data loaded. Run it both ways with explain and explain analyze and pick the fastest over several typical queries.
Even if I had enough information to guess your schema and exemplar data, any answer good for me may not be good for you.
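As a small illustration of "run it both ways", the recursive query quoted in the question can be executed and compared against the same computation done in application memory. This sketch assumes SQLite (via Python's built-in sqlite3 module) as a stand-in for PostgreSQL, since it accepts the same WITH RECURSIVE syntax. It only checks that both approaches agree; it says nothing about which is faster on a real schema.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the query text is the one
# quoted from the PostgreSQL docs in the question.
conn = sqlite3.connect(":memory:")

recursive_sum = conn.execute(
    """
    WITH RECURSIVE t(n) AS (
        VALUES (1)
        UNION ALL
        SELECT n + 1 FROM t WHERE n < 100
    )
    SELECT sum(n) FROM t
    """
).fetchone()[0]

# The same result computed in application memory instead of in the database.
in_memory_sum = sum(range(1, 101))

print(recursive_sum, in_memory_sum)
```

On PostgreSQL, prefixing each candidate query with EXPLAIN ANALYZE against representative data is what would settle the performance question.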