I have the following huge tables.
Table_1 with 500000 (0.5m) rows
Table_8 with 20000000 (20m) rows
Table_13 with 4000000 (4m) rows
Table_6 with 500000 (0.5m) rows
Table_15 with 200000 (0.2m) rows
I need to pull out many records (recent events) to show on a Google map by joining about 28 tables.
How can I speed up a SQL SELECT query over such huge tables?
I searched Google and learned about clustered and non-clustered indexes. Following the advice from the DTA (Database Engine Tuning Advisor), I built those clustered and non-clustered indexes, but the query still takes a long time.
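For context, the indexes were created with statements of this general shape (Table_8 is one of the tables listed above, but the column names here are only placeholders, not my real schema):

-- Placeholder example only; the real columns differ.
CREATE CLUSTERED INDEX IX_Table_8_EventTime
    ON dbo.Table_8 (EventTime);

CREATE NONCLUSTERED INDEX IX_Table_8_DeviceId
    ON dbo.Table_8 (DeviceId)
    INCLUDE (EventTime, Latitude, Longitude);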
I have 2 views and 1 stored procedure, as shown here:
https://gist.github.com/LaminGitHub/57b314b34599b2a47e65
Please kindly give me some ideas.
Best regards
We have a transaction table of sales to customers with over 2,000 million (2 billion) rows on Redshift. Each month's transactions add about 5 million rows. For MIS reporting (the monthly 5 million rows only), I need to check whether a customer is new based on mobile number, i.e. whether the mobile number already exists among the 2,000 million rows, ideally without joining against the full table so my query remains efficient.
What I have tried:
newtable = SELECT DISTINCT mobile_no, 'old' AS category FROM table

maintable = SELECT maintable.*, COALESCE(nq.category, 'new') AS category
            FROM maintable
            LEFT JOIN newtable AS nq ON nq.mobile_no = maintable.mobile_no;
This is very slow and takes over 50 minutes. I also tried
SELECT EXISTS (SELECT 1 FROM newtable WHERE mobile_no = maintable.mobile_no LIMIT 1) AS category, but this gives an 'out of memory' error.
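Written as a single query, the approach above looks like this (full_history is a placeholder for the 2,000 million-row transaction table):

-- Tag each row of the monthly table as 'old' if its mobile number already
-- appears in the full history, otherwise 'new'.
SELECT m.*,
       COALESCE(h.category, 'new') AS category
FROM maintable m
LEFT JOIN (
    SELECT DISTINCT mobile_no, 'old' AS category
    FROM full_history        -- placeholder name for the 2,000M-row table
) h ON h.mobile_no = m.mobile_no;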
Amazon Redshift is a data warehouse, so by design it is optimized for analytic throughput rather than low-latency queries. If you will be doing analysis on the data and expect faster results, you might want to explore other products AWS offers, such as EMR, to run your queries faster.
Here is a reference on what each service is intended for: https://aws.amazon.com/big-data/datalakes-and-analytics/
I am experienced in fine-tuning Oracle, but in Postgres I am unable to improve performance.
Problem statement: I need to aggregate rows from one Postgres table that has a large number of columns (110) and 175 million rows, over a one-month range. Apart from the aggregation, the query has a very simple WHERE clause:
WHERE time BETWEEN '2019-03-15' AND '2019-04-15'
  AND org_name IN ('xxx', 'yyy', ... 15 elements)
There are individual btree indexes on the table for each column: idx_time on "time" and idx_org_name on "org_name", but no composite index.
I tried creating a new index on ('org_name', 'time'), but my manager does not want to change anything.
How can I make it run faster? It takes 15 minutes now (with a smaller set of org_name values it takes 6 minutes). Most of the time is spent on data access from the table.
Is parallel execution possible?
thanks, Jay
QUERY EXPLAIN ANALYZE:
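For illustration (the table name events is an assumption): if a schema change were allowed, the composite index discussed above would look like the first statement below; the second is a session setting that lets the planner consider parallel aggregation.

-- Composite index matching the WHERE clause (table name assumed).
CREATE INDEX idx_org_name_time ON events (org_name, "time");

-- Parallel aggregation exists since PostgreSQL 9.6; this lets the planner
-- consider more parallel workers for the current session.
SET max_parallel_workers_per_gather = 4;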
I am relatively new to using Postgres, but I am wondering what the workaround could be here.
I have a table with about 20 columns and 250 million rows, and an index created for the timestamp column time (but no partitions).
Queries sent to the table have been running endlessly and failing (although the "view first/last 100 rows" function in pgAdmin works), even simple SELECT * queries.
For example, if I want to LIMIT a selection of the data to 10 rows:
SELECT * from mytable
WHERE time::timestamp < '2019-01-01'
LIMIT 10;
Such a query hangs. What can be done to optimize queries on a table this large? When the table was smaller (~100 million rows), queries would always complete. What should one do in this case?
If time is of data type timestamp or the index is created on (time::timestamp), the query should be fast as lightning.
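Concretely, those two options would look roughly like this (assuming the table is called mytable, as in the example above):

-- Option 1: drop the cast so a plain index on "time" can be used.
SELECT *
FROM mytable
WHERE "time" < TIMESTAMP '2019-01-01'
LIMIT 10;

-- Option 2: keep the cast, but index the expression itself
-- (only possible if the cast is immutable for the column's actual type).
CREATE INDEX idx_mytable_time_ts ON mytable ((time::timestamp));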
Please show the CREATE TABLE and CREATE INDEX statements, and the EXPLAIN output for the query, for more details.
"A query that doesn't complete" usually means the database is swapping to disk, especially since you mention that with 100M rows it manages to complete. That's because the index for 100M rows still fits in your memory, but an index twice that size doesn't.
LIMIT won't help you here, as the database probably decides to read the index first, and that's what kills it.
You could try to increase available memory, but partitioning would actually be the best solution here.
Partitioning means smaller tables. Smaller tables mean smaller indexes. Smaller indexes have a better chance of fitting into your memory.
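A minimal sketch of range partitioning by month (PostgreSQL 10+ declarative partitioning; names follow the example above and the remaining columns are omitted):

-- Partitioned parent table; only the partition key column is shown.
CREATE TABLE mytable_part (
    "time" timestamp NOT NULL
    -- ... the other ~20 columns go here ...
) PARTITION BY RANGE ("time");

-- One partition per month: queries filtered on "time" touch only the
-- partitions (and the much smaller indexes) they need.
CREATE TABLE mytable_2018_12 PARTITION OF mytable_part
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

CREATE INDEX ON mytable_2018_12 ("time");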
I want to load a table with a size of more than 1 TB from S3 into Redshift.
I cannot use DISTSTYLE ALL because it is a big table.
I cannot use DISTSTYLE EVEN because I want to use this table in joins, which then become a performance issue.
The columns of my table are:
id INTEGER, name VARCHAR(10), another_id INTEGER, workday INTEGER, workhour INTEGER, worktime_number INTEGER
Our Redshift cluster has 20 nodes.
So I tried using workday as the distribution key, but the table is badly skewed.
There are 7 unique work days and 24 unique work hours.
How can I avoid the skew in such cases?
How do we avoid skewing the table when the row counts per unique key value are uneven (say hour 1 has 1 million rows, hour 2 has 1.5 million rows, hour 3 has 2 million rows, and so on)?
Distribute your table using DISTSTYLE EVEN and use either a SORTKEY or a COMPOUND SORTKEY. The sort key will help your query performance. Try this first.
DISTSTYLE/DISTKEY determines how your data is distributed. Among the columns used in your queries, it is advisable to choose the column that causes the least amount of skew as the DISTKEY. A column with many distinct values, such as a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types or days of the week.
You might need to recreate your table with different DISTKEY / SORTKEY combinations and try out which one will work best based on your typical queries.
For more info https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-key.html
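For example, a possible layout for the table described above (the table name is a placeholder, and the sort key columns are just one option; choose columns that match your typical filters):

CREATE TABLE work_events (           -- table name is a placeholder
    id              INTEGER,
    name            VARCHAR(10),
    another_id      INTEGER,
    workday         INTEGER,
    workhour        INTEGER,
    worktime_number INTEGER
)
DISTSTYLE EVEN
COMPOUND SORTKEY (workday, workhour);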
Here is the architecture that I recommend:
1) Load into a staging table with DISTSTYLE EVEN, sorted by something that is already sorted in your loaded S3 data; this means you will not have to vacuum the staging table.
2) Set up a production table with the sort/dist style you need for your queries. After each COPY from S3, load that new data into the production table and vacuum.
3) You may wish to have 2 mirror production tables and flip-flop between them using a late-binding view.
It's a bit complex to do this, and you may need some professional help; there may be specifics to your use case.
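A rough sketch of that flow (bucket, IAM role, and table names are placeholders):

-- 1) Load into the staging table (DISTSTYLE EVEN), data already sorted in S3.
COPY staging_table
FROM 's3://my-bucket/prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;

-- 2) Append the new rows to the production table, then clean up.
INSERT INTO production_table
SELECT * FROM staging_table;

TRUNCATE staging_table;
VACUUM production_table;
ANALYZE production_table;

-- 3) Optional: point consumers at a late-binding view so the production
--    table can be swapped out without breaking dependencies.
CREATE OR REPLACE VIEW current_production AS
SELECT * FROM production_table
WITH NO SCHEMA BINDING;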
As of writing this (just after re:Invent 2018), Redshift has automatic distribution available, which is a good starting point.
The following utilities will come in handy:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminScripts
As indicated in the answers posted earlier, try a few combinations by replicating the same table with different DIST keys if you don't like what automatic distribution is doing. After the tables are created, run the admin utilities from the git repo above (preferably create views for the SQL scripts in the Redshift DB).
Also, if you have good clarity on your query usage patterns, you can use the following SQL to check how well the sort keys are performing.
/** Queries on tables that are not utilizing SORT KEYs **/
SELECT t."database", t.table_id, t."schema", t."schema" || '.' || t."table" AS "table",
       t.size, NVL(s.num_qs, 0) AS num_qs
FROM svv_table_info t
LEFT JOIN (
    SELECT tbl, COUNT(DISTINCT query) AS num_qs
    FROM stl_scan s
    WHERE s.userid > 1
      AND s.perm_table_name NOT IN ('Internal Worktable', 'S3')
    GROUP BY tbl
) s ON s.tbl = t.table_id
WHERE t.sortkey1 IS NULL
ORDER BY 5 DESC;
/** INTERLEAVED SORT KEY **/
-- check skew
SELECT tbl AS tbl_id, stv_tbl_perm.name AS table_name,
       col, interleaved_skew, last_reindex
FROM svv_interleaved_columns, stv_tbl_perm
WHERE svv_interleaved_columns.tbl = stv_tbl_perm.id
  AND interleaved_skew IS NOT NULL;
Of course, there is always room for improvement in the SQL above, depending on the specific stats you may want to look at or drill down into.
Hope this helps.
I have 2 tables in PostgreSQL, one of which has 16 million rows and the other around 3,000. They both share 2 common IDs, but the larger table has thousands of repetitions of the same ID.
I'm trying to do a LEFT JOIN with a few conditions as follows:
SELECT LT.Col1, LT.Col2, LT.Col3, ST.Col1, ST.Col2
FROM large_table as LT
LEFT JOIN small_table as ST
ON LT.id1 = ST.id1 AND LT.id2 = ST.id2
WHERE LT.Col1 > 30
AND LT.Col2 > 2
AND LT.Col3 BETWEEN '11:00:00'::time AND '21:00:00'::time
I have created multi-column indexes on id1 and id2 for each table, but the query just runs and runs. I'm using pgAdmin 4 on a MacBook Pro with 16 GB RAM and a 2.9 GHz quad-core i7. I've checked the computer's performance and it's not struggling. Does anybody have any advice on how to speed up the query? Am I just asking too much of it?
Since this is a left outer join, your best bet is to use indexes on large_table that reduce the number of rows early on.
Unfortunately, none of your conditions checks for equality, so a combined index is useless.
You could create indexes on the three columns of large_table and see if PostgreSQL uses them (e.g. with a bitmap index scan, combining the results).
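For example (index names are arbitrary):

CREATE INDEX idx_lt_col1 ON large_table (Col1);
CREATE INDEX idx_lt_col2 ON large_table (Col2);
CREATE INDEX idx_lt_col3 ON large_table (Col3);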
Those indexes that PostgreSQL chooses not to use can be dropped again.
You might try creating a combined index on the tuple (id1, id2) in both tables, then joining with ON (LT.id1, LT.id2) = (ST.id1, ST.id2).
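A sketch of that suggestion (index names are placeholders):

CREATE INDEX idx_lt_id1_id2 ON large_table (id1, id2);
CREATE INDEX idx_st_id1_id2 ON small_table (id1, id2);

SELECT LT.Col1, LT.Col2, LT.Col3, ST.Col1, ST.Col2
FROM large_table AS LT
LEFT JOIN small_table AS ST
       ON (LT.id1, LT.id2) = (ST.id1, ST.id2)
WHERE LT.Col1 > 30
  AND LT.Col2 > 2
  AND LT.Col3 BETWEEN '11:00:00'::time AND '21:00:00'::time;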