Insert into PostgreSQL/PostGIS database is too slow - perl

I am working on a project that requires me to take a live Twitter feed and store records from it in a PostgreSQL database. The project requires that the tweets' location data be stored for searching under PostGIS. I'm using a perl script to get the Twitter feed (using AnyEvent::Twitter::Stream and the Twitter API). Every 5000 tweets, the script fork()s and the child process issues the SQL to insert the rows. I'm using AutoCommit => 0 to speed up the inserts.
The problem is that the child process isn't done storing the 5000 tweets before the next 5000 come in, so I get numerous postgres processes. I need to figure out how to speed up the database inserts enough to allow the child process to exit before the next one is started.
The tasks the child process does now (for each tweet) are:
Insert a record in the tweets table, using ST_GeomFromEWKT to convert the latitude/longitude data to GIS coordinates
Ensure that the author of the tweet and any users mentioned in the tweet are in the users table
Insert mentions of users and hashtags in the relevant tables
Any advice on diagnosing the speed or speeding up the process would be most helpful. This eventually has to work in real time, so temporary tables and text files are not good options. The server is a dual-Xeon HP server running Debian with 8G of RAM.

In the PostgreSQL docs there is a comment on speeding up inserts by misusing the INSERT ... SELECT syntax.
This seems to make a significant difference; have you tried it?
Useful tip for faster INSERTs:
You can use the INSERT INTO tbl <query> syntax to accelerate the speed of inserts by batching them together. For example...
INSERT INTO my_table SELECT 1, 'a' UNION SELECT 2, 'b' UNION SELECT 3, 'c' UNION ...
If you batch up many sets of values per INSERT statement and batch up multiple INSERT statements per transaction, you can achieve significantly faster insertion performance. I managed to achieve almost 8x faster inserts on a PostgreSQL 8.1 / Win2K installation by batching up 100 (small) rows per INSERT using this technique.
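For illustration, here is a minimal sketch of the same idea using multi-row VALUES inside one transaction (the table and column names are made up, not from your schema):
BEGIN;
-- one INSERT carrying many rows; repeat several of these per transaction
INSERT INTO tweets (tweet_id, body, geom) VALUES
  (1, 'first',  ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326)),
  (2, 'second', ST_SetSRID(ST_MakePoint( -0.13, 51.51), 4326)),
  (3, 'third',  ST_SetSRID(ST_MakePoint(  2.35, 48.86), 4326));
-- ... more multi-row INSERTs ...
COMMIT;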
Otherwise, if you cannot get PostgreSQL up to the required speed, check the I/O performance on the HP box.
Also, check whether there are many indexes to be updated after each insert. Maybe you even need to say goodbye to some of your constraints (FK constraints). That would let you insert the records in any order, with no need to wait for a user to be created before inserting the tweet.
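If you go that route, dropping a foreign key before the bulk load and re-creating it afterwards looks roughly like this (the constraint, table and column names here are guesses, not your actual schema):
-- drop the FK before the bulk load, re-create it afterwards
ALTER TABLE tweets DROP CONSTRAINT tweets_author_fkey;
-- ... bulk inserts ...
ALTER TABLE tweets ADD CONSTRAINT tweets_author_fkey
    FOREIGN KEY (author_id) REFERENCES users (id);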
I would also check whether you can look up the users in the database while you are still collecting the tweets.
Last but not least, you should implement a queue for the batches of 5000 tweets rather than simply firing them off at the database.

I've benchmarked performance of creating points, and ST_GeomFromEWKT is the slowest method. Try using ST_MakePoint in a prepared statement to minimize any overhead:
use DBI;
# Prepare the insert once, outside the loop
my $sth = $dbh->prepare(
    "INSERT INTO mytable (geom) " .
    "SELECT ST_SetSRID(ST_MakePoint(?, ?), 4326) AS geom");
# In a for-loop of 5000 points, do the insert
$sth->execute($longitude, $latitude);
# With AutoCommit => 0, commit once after the whole batch
$dbh->commit;

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
    SELECT a.*, b.* FROM A_archive a
    JOIN B_archive b ON a.transaction_id = b.transaction_id
    UNION
    SELECT x.*, y.* FROM X x
    JOIN Y y ON x.transaction_id = y.transaction_id
)
INSERT INTO
Another_Table
(
    columns
)
SELECT * FROM data
ON CONFLICT (transaction_id)
DO UPDATE ...
This whole thing runs in a production environment with nearly 140 million records.
Because of this, the production database takes almost 10 hours to process the data and then fails.
I also have a distributed job scheduler in AWS that runs this query inside a function to retrieve the latest records every 5 hours. The archive tables store closed invoice data. The Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication. It will save the enormous amount of data shuffling and comparison required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
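As a rough sketch of the VIEW idea (the extra column names here are placeholders, not your real columns):
CREATE VIEW combined_invoices AS
SELECT a.transaction_id, a.some_column, b.other_column
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION ALL
SELECT x.transaction_id, x.some_column, y.other_column
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id;
Queries against combined_invoices with a selective WHERE clause can then use the indexes on the underlying tables directly.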
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields), and then removes any duplicates. If you say 100 million records are involved, then that's a LOT of sorting going on that will most likely involve swapping out to disk.
Keep in mind that you only need to do this if there are expected duplicates
in the result from the JOIN between A and B
in the result from the JOIN between X and Y
in the combined result from the two above
If none of those are expected, just use UNION ALL
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
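Going further on that idea, here is a rough sketch of the split into two separate INSERTs (column lists abbreviated, and assuming the X-Y load runs last so it overrules the archive rows):
-- load the archive pair first
INSERT INTO Another_Table (columns)
SELECT a.*, b.* FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
ON CONFLICT (transaction_id) DO UPDATE SET ...;

-- then load X-Y, so its rows overrule the archive rows on conflict
INSERT INTO Another_Table (columns)
SELECT x.*, y.* FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
ON CONFLICT (transaction_id) DO UPDATE SET ...;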
Also, as mentioned by O.Jones, archive tables or not, they should at least come with a (preferably clustered) index on the transaction_id fields you're JOINing on (the same goes for Another_Table, by the way).
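For example, plain b-tree indexes on the join columns (the index names are arbitrary):
CREATE INDEX idx_a_archive_txid ON A_archive (transaction_id);
CREATE INDEX idx_b_archive_txid ON B_archive (transaction_id);
-- and similarly on X, Y and Another_Table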
All that said, processing 100M records in one transaction IS going to take some time; it's simply a lot of data being moved around. But 10 hours does sound excessive indeed.

PostgreSQL: UPDATE large table

I have a large PostgreSQL table of 29 million rows. The size (according to the stats tab in pgAdmin) is almost 9 GB. The table is PostGIS-enabled with an empty geometry column.
I want to UPDATE the geometry column using ST_GeomFromText, reading from X and Y coordinate columns (SRID: 27700) stored in the same table. However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
To overcome this, should I UPDATE the 29 million rows in batches/stages? How can I do 1 million rows (the FIRST 1 million), then do the next 1 million rows until I reach 29 million?
Or are there other more efficient ways of updating large tables like this?
I should add, the table is hosted in AWS.
My UPDATE query is:
UPDATE schema.table
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')',27700);
You did not give any server specs; writing 9 GB can be pretty fast on recent hardware.
You should be OK with one long UPDATE, unless you have concurrent writes to this table.
A common trick to overcome this problem (a very long transaction locking writes to the table) is to split the UPDATE into ranges based on the primary key, run in separate transactions.
/* Use PK or any attribute with a known distribution pattern */
UPDATE schema.table SET ... WHERE id BETWEEN 0 AND 1000000;
UPDATE schema.table SET ... WHERE id BETWEEN 1000001 AND 2000000;
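If you would rather generate those ranges than type them out, one possible sketch in psql (assuming an integer primary key id and roughly 29 million rows) is:
SELECT format(
    'UPDATE schema.table SET geom = ST_SetSRID(ST_Point(eastingcolumn, northingcolumn), 27700)
     WHERE id BETWEEN %s AND %s',
    lo, lo + 999999)
FROM generate_series(1, 29000000, 1000000) AS lo
\gexec
Each generated UPDATE is sent as its own statement, so every batch commits separately.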
For a high level of concurrent writes, people use more subtle tricks (SELECT FOR UPDATE / NOWAIT, lightweight locks, retry logic, etc.).
From my original question:
However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
Turns out our Amazon AWS instance database was running out of space, stopping my original ST_GeomFromText query from completing. Freeing up space fixed it.
On an important note, as suggested by @mlinth, ST_Point ran my query far quicker than ST_GeomFromText (24 minutes vs 2 hours).
My final query being:
UPDATE schema.tablename
SET geom = ST_SetSRID(ST_Point(eastingcolumn,northingcolumn),27700);

How can I get the total run time of a query in redshift, with a query?

I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The gui dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with sql selects?
To add to Alex's answer: the stl_query table has the inconvenience that if the query sat in a queue before running, the queue time is included in the run time, so the runtime won't be a very good indicator of the query's performance.
To understand the actual runtime of the query, check on stl_wlm_query for the total_exec_time.
select total_exec_time
from stl_wlm_query
where query='query_id'
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of said scripts stripped out to give you query run times in milliseconds. Play with the filters, ordering etc to show the results you are looking for:
select userid,
       label,
       stl_query.query,
       trim(database) as database,
       trim(querytxt) as qrytext,
       starttime,
       endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (
       select query, trim(split_part(event, ':', 1)) as event
       from STL_ALERT_EVENT_LOG
       group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%' )
-- and database = ''
order by starttime desc
limit 100

Caching various statistics in a special one row table?

Expecting hundreds of millions of rows and a write-heavy application.
We need to return SELECT COUNT(*) FROM orders and SELECT SUM(amount) FROM orders quite frequently, and both of them are too slow to be run on every request.
We are thinking about adding a special table called stats with just a single row. It has total_orders and total_amount, which we would increase every time we add a new order. Is this kind of SQL "cache" table a practical solution? What does it mean in terms of write performance?
Another option is to use Memcached or Redis, but they can get out of sync and are not persistent. Any other ideas?
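For what it's worth, a minimal sketch of the single-row stats table described above, maintained by a trigger (assuming PostgreSQL; total_orders and total_amount come from the question, the rest is illustrative):
CREATE TABLE stats (total_orders bigint NOT NULL, total_amount numeric NOT NULL);
INSERT INTO stats VALUES (0, 0);

CREATE FUNCTION bump_stats() RETURNS trigger AS $$
BEGIN
    UPDATE stats
    SET total_orders = total_orders + 1,
        total_amount = total_amount + NEW.amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_bump_stats
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION bump_stats();
The write-performance cost is that every INSERT into orders now also takes a row-level lock on the single stats row, so concurrent order inserts serialize on that update.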

Oracle order by query very slow

I am facing a problem when doing order by on a table.
My select query is working fine, but when I do ORDER BY (even on the primary key) it just goes on and on with no results. Finally I need to kill the session. The table has 20K records.
Any suggestions for this?
Query is as:
SELECT * FROM Users ORDER BY ID;
I do not know about the query plan, as I am new to Oracle.
For the unordered query, is SQL Developer retrieving and displaying all 20K rows, or just the first 50? Your comparison might not be fair.
What is the size of those 20K rows: select bytes/1024/1024 MB from user_segments where segment_name = 'USERS'; I've seen many cases where a few megabytes of data use many gigabytes of storage. Maybe the data was very large before and somebody just deleted it (this doesn't remove the space). Or maybe somebody inserted those rows 1 at a time with an APPEND hint, and each row is taking an entire block.
Your query might be waiting for more temp tablespace for sorting, look at DBA_RESUMABLE.
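Since the question mentions not knowing how to look at the query plan, a minimal way to get it (run from SQL Developer or SQL*Plus) is:
EXPLAIN PLAN FOR SELECT * FROM Users ORDER BY ID;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);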