I'm running queries on a Redshift cluster using DataGrip that take upwards of 10 hours to run, and unfortunately they often fail. Alas, DataGrip doesn't maintain the connection to the database long enough for me to see the error message the queries fail with.
Is there a way to retrieve these error messages later, e.g. using internal Redshift tables? Alternatively, is there a way to make DataGrip keep the connection open for long enough?
Yes, you can!
Query the stl_connection_log table to find the pid of your session: look at the recordtime column for when your connection was initiated; the dbname, username, and duration columns help to narrow it down.
select * from stl_connection_log order by recordtime desc limit 100
Once you have the pid, query the stl_query table to confirm you are looking at the right query.
select * from stl_query where pid='XXXX' limit 100
Then, check the stl_error table for your pid. This will tell you the error you are looking for.
select * from stl_error where pid='XXXX' limit 100
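If you want to tie the error directly back to the query text, a sketch along these lines should work, assuming the pid was not reused by another session in between (it only uses the system tables mentioned above):
select q.query, trim(q.querytxt) as querytxt, e.recordtime, trim(e.error) as error
from stl_error e
join stl_query q
  on q.pid = e.pid
 and e.recordtime between q.starttime and q.endtime
order by e.recordtime desc
limit 100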
If I’ve made a bad assumption please comment and I’ll refocus my answer.
I have an append-only table with more than ~80M records in TimescaleDB; records are inserted into the table once a minute. There is also an index on a non-unique column and the start time: (ds_id, start_time).
When I try to run the simple query
select * from observation where ds_id in (27525, 27567, 28787, 27099)
it takes longer than 1 minute to return the output.
I also tried analyzing the table; as it is append-only, there is no scope for vacuuming it.
So I am confused about why this simple select query takes so much time. My guess is that it is slow because of the huge number of records.
Please help me understand the issue and how to fix it.
query plan: https://explain.depesz.com/s/M8H7
Thanks in advance.
Note: ds_id (FK) and start_time (insertion time) are the columns used for fetching results. Also, I am sorry for not providing the table structure and details, as they are confidential. :(
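For reference, a plan like the linked one can be reproduced (and re-checked after any change) with standard PostgreSQL EXPLAIN; a sketch, using the table and ids from the question:
explain (analyze, buffers)
select * from observation where ds_id in (27525, 27567, 28787, 27099);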
Is there a way to fetch the tables that are operated on by a given query? For example, the query below operates on the table 'abc':
select * from abc
After a query has executed successfully, can we fetch the tables that it actually operated on in Redshift?
Harsha - yes, and in a number of ways. The most straightforward is to query the stl_scan system table, which lists every table scan together with the query number that generated it. The question for you is how you want to identify the query you just ran: by text, or by the current session id? stl_scan will contain a lot of data on a busy cluster, so you want to find only the rows you care about. If you go by current session, you can use "where pid = (SELECT pg_backend_pid())" to get the queries run by the current session, but pid isn't in stl_scan, so you will need to join with stl_query, which has both pid and query number. You will also want a "where starttime > getdate() - interval '1 hour'" clause so you aren't searching all of history for information about a query you just ran.
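Putting that together, a sketch along these lines should list the tables scanned by queries from the current session (the one-hour window and the worktable filter are just narrowing assumptions you can adjust):
select distinct q.query, trim(s.perm_table_name) as table_name
from stl_scan s
join stl_query q on q.query = s.query
where q.pid = (select pg_backend_pid())
  and q.starttime > getdate() - interval '1 hour'
  and s.perm_table_name not like 'Internal Worktable%'
order by q.query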
I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The GUI dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with SQL selects?
To add to Alex's answer, I want to point out that the stl_query table has the inconvenience that, if the query spent time in a queue before running, the queue time is included in the elapsed runtime, so that runtime won't be a very good indicator of the query's performance.
To get the actual execution time of the query, check stl_wlm_query for total_exec_time.
select total_exec_time
from stl_wlm_query
where query='query_id'
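To see the queue time and the execution time side by side, something like the sketch below can help; both columns are reported in microseconds, hence the division (replace the placeholder query id with your own):
select q.query, trim(q.querytxt) as querytxt,
       w.total_queue_time / 1000000.0 as queue_seconds,
       w.total_exec_time  / 1000000.0 as exec_seconds
from stl_query q
join stl_wlm_query w on w.query = q.query
where q.query = 12345  -- placeholder query id
order by q.starttime desc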
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of those scripts, stripped down to give you query run times in milliseconds. Play with the filters, ordering, etc. to show the results you are looking for:
select
    userid,
    label,
    stl_query.query,
    trim(database) as database,
    trim(querytxt) as qrytext,
    starttime,
    endtime,
    datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
    aborted,
    decode(alrt.event,
           'Very selective query filter', 'Filter',
           'Scanned a large number of deleted rows', 'Deleted',
           'Nested Loop Join in the query plan', 'Nested Loop',
           'Distributed a large number of rows across the network', 'Distributed',
           'Broadcasted a large number of rows across the network', 'Broadcast',
           'Missing query planner statistics', 'Stats',
           alrt.event) as event
from stl_query
left outer join (
    select query, trim(split_part(event, ':', 1)) as event
    from stl_alert_event_log
    group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100
At my work, I needed to build a new join table in a postgresql database that involved doing a lot of computations on two existing tables. The process was supposed to take a long time so I set it up to run over the weekend before I left on Friday. Now, I want to check to see if the query finished or not.
How can I check whether an INSERT command has finished, without being at the computer I ran it on? (No, I don't know how many rows it was supposed to add.)
Select * from pg_stat_activity where state not ilike 'idle%' and query ilike 'insert%'
This will return all non-idle sessions whose query begins with insert; if your query does not show up in this list, then it is no longer running.
pg_stat_activity doc
You can have a look at the pg_stat_activity view, which contains all database connections, including the active query, the owner, etc.
At https://gist.github.com/rgreenjr/3637525 there is a copy-able example how such a query could look like.
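For instance, a sketch like the one below lists any still-running insert together with how long it has been executing (these are standard pg_stat_activity columns on PostgreSQL 9.2+):
select pid, usename, state, now() - query_start as runtime, query
from pg_stat_activity
where state <> 'idle'
  and query ilike 'insert%'
order by query_start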
I am working on a project that requires me to take a live Twitter feed and store records from it in a PostgreSQL database. The project requires that the tweets' location data be stored for searching under PostGIS. I'm using a perl script to get the Twitter feed (using AnyEvent::Twitter::Stream and the Twitter API). Every 5000 tweets, the script fork()s and the child process issues the SQL to insert the rows. I'm using AutoCommit => 0 to speed up the inserts.
The problem is that the child process isn't done storing the 5000 tweets before the next 5000 come in, so I get numerous postgres processes. I need to figure out how to speed up the database inserts enough to allow the child process to exit before the next one is started.
The tasks the child process does now (for each tweet) are:
Insert a record in the tweets table, using ST_GeomFromEWKT to convert the latitude/longitude data to GIS coordinates
Ensure that the author of the tweet and any users mentioned in the tweet are in the users table
Insert mentions of users and hashtags in the relevant tables
Any advice on diagnosing the speed or speeding up the process would be most helpful. This eventually has to work in real time, so temporary tables and text files are not good options. The server is a dual-Xeon HP server running Debian with 8G of RAM.
In the Postgres docs there is a comment on speeding up inserts by (mis)using the INSERT INTO ... SELECT form.
It seems to make a significant difference; have you tried that?
Useful tip for faster INSERTs:
You can use the INSERT INTO tbl <query> syntax to accelerate the speed of inserts by batching them together. For example...
INSERT INTO my_table SELECT 1, 'a' UNION SELECT 2, 'b' UNION SELECT 3, 'c' UNION ...
If you batch up many sets of values per INSERT statement, and batch up multiple INSERT statements per transaction, you can achieve significantly faster insertion performance. I managed to achieve almost 8x faster inserts on a PostgreSQL 8.1 / Win2K installation by batching up 100 (small) inserts using this technique.
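On PostgreSQL 8.2 and later, the same batching effect can be had more directly with a multi-row VALUES list; a minimal sketch, with made-up table and column names:
-- my_table and its columns (id, val) are placeholders
INSERT INTO my_table (id, val)
VALUES (1, 'a'), (2, 'b'), (3, 'c');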
Otherwise, if you cannot get Postgres up to the required speed, you may want to check the IO performance on the HP box.
Also, check whether there are many indexes that have to be updated after each insert. Maybe you even need to say goodbye to some of your constraints (FK constraints); that would allow you to insert the records in any order, with no need to wait for a user to be created before inserting the tweet.
I would also check whether it's possible to look up the users in the DB while you collect the tweets.
Last but not least, you should implement a queue for the batches of 5000 tweets rather than simply firing them off at the DB.
I've benchmarked the performance of creating points, and ST_GeomFromEWKT is the slowest method. Try using ST_MakePoint in a prepared statement to minimize the overhead:
use DBI;

# $dbh is an existing DBI connection (AutoCommit => 0, as in the question).
# Prepare the insert once, outside the loop.
my $sth = $dbh->prepare(
    "INSERT INTO mytable (geom) " .
    "SELECT ST_SetSRID(ST_MakePoint(?, ?), 4326) AS geom"
);

# Inside the for-loop over the 5000 points, execute the prepared statement.
$sth->execute($longitude, $latitude);