What about expected performance in Pentaho? - postgresql

I am using Pentaho to create ETL's and I am very focused on performance. I develop an ETL process that copy 163.000.000 rows from Sql server 2088 to PostgreSQL and it takes 17h.
I do not know how good or bad is this performance. Do you know how to measure if the time that takes some process is good? At least as a reference to know if I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of ETL process it load 2M rows. I calculate how long will take to load all the rows. The expected result is 6 hours, but then the performance decrease and it takes 17h.
I have been investigating in goole and I do not find any time references neither any explanations about performance.

Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours, this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exact same columns (CREATE TABLE dummy LIKE table)
No indexes, No constraints (I think it is the default, double check the created table)
BEFORE INSERT trigger on it which returns NULL and drop the row.
The rows will be processed, just not inserted.
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These do not have any crash-resistance because they don't use the journal, but for importing data it's OK.... if it crashes during the insert you're gonna wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
With the table loaded with data, add indexes and constraints one by one, find out if you got, say, a CHECK that uses a slow SQL function, or a FK into a table which has no index, that kind of stuff. Just check how long it takes to create the constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy due to the large volume of data, due to too-low checkpointing settings in the config. Or some issue like that, probably random IO writes related. You put the WAL on a fast SSD, right?

17H is too much. Far too much. For 200 Million rows, 6 hours is even a lot.
Hints for optimization:
Check the memory size: edit the spoon.bat, find the line containing -Xmx and change it to half your machine memory size. Details varies with java version. Example for PDI V7.1.
Check if the query from the source database is not too long (because too complex, or server memory size, or ?).
Check the target commit size (try 25000 for PostgresSQL), the Use batch update for inserts in on, and also that the index and constraints are disabled.
Play with the Enable lazy conversion in the Table input. Warning, you may produce difficult to identify and debug errors due to data casting.
In the transformation property you can tune the Nr of rows in rowset (click anywhere, select Property, then the tab Miscelaneous). On the same tab check the transformation is NOT transactional.

Related

Postgres running slow after indexing finished

My postgres was running really slow lately, an aggregation for a month it usually ended up taking more than 1 minute (to be more exact the last one took 7 mins and 23 secs).
Last friday i recreated the servers (master and replica) and reimported the database.
First thing I noticed is that from 133gb now the database is 42gb (the actual data is around 12gb, i guess the rest are the indexes).
Everything was fast as hell for a day, after that the indexing finished (26gb on indexes) and now I'm back to square 1.
A count on ~5 million rows takes 3 mins 42 secs.
Made the autovacuum more aggressive and it looks like it's doing it's job now but the DB is still slow.
I am using the db for an API so it's constantly growing. Atm i have 2 tables one that has around 5 mil rows and the other 28 mil.
So if the master has a lot of activity and let's say that i'm expecting some performance loss, i don't expect it from the replica.
What's curios is that after a restart it's really fast for an hour or so.
Also another thing that i noticed was that on every query I do the IO is 100% while the memory and cpu are almost not used at all.
Any help would be greatly appreciated.
Update
Same database on a smaller machine works like a charm.
Same queries, same indexes.
The only difference is the traffic, not writing or updating that much.
Also i forgot to mention one thing, one of my indexes is clustered.
The live machine is a 5 core with 64gb and 3k IO.
The test machine is a 2 core with 4gb and an SSD.
Update
Found my issue.
Apparently the autovacuum can't get a lock and by the time it gets it the dead tuples increased.
Made the autovacuum more aggresive for now and deleted a bunch of unused indexes.
Still don't know how to fix the lock issue tho.
Update
Looks like something is increasing the estimated row count.
Since my last update here the row count increased by 2 mil.
I guess that by tomorrow the row count will be again around 12 mil and the count will be slow as hell again.
Could this be related to autovacuum?
Update
Well found my issue.
Looks like postgres is losing a lot of speed on a write intensive database.
Had a column that was used as a flag and updated a lot of times per day.
Everything looks really good after the flag and update was removed.
Any clue on how to fix this issue on a write intensive table?
May be the following pointers help:
Are you really sure you want to do a 5mil row Aggregation for an API? Everytime ? Can't you split the data into chunks such that only a small number of chunks actually get most of the new rows (and so the aggregation of all the previous chunks can be reused for the next Query)? Time is one such measure, serial numbers could be another, etc. If so, partitioning the data is an obvious solution you should investigate, it really has a good chance of giving you sub-second query times (assuming you store aggregations for previous chunks smartly).
A hunch about that first hour magic is that although this data fits RAM, concurrent querying pushes that data-set out and then its purely disk I/O... and in that case, CPU / RAM being idle isn't a surprise.
Finally, I think this setup is asking for a re-design where there is only so much you could do with a single SQL, and in that expecting sub-second Query times for data that is not within RAM for a 5mil data-set is probably being too optimistic!
(Nonetheless, do post your findings, if possible)

Postgresql Truncation speed

We're using Postgresql 9.1.4 as our db server. I've been trying to speed up my test suite so I've stared profiling the db a bit to see exactly what's going on. We are using database_cleaner to truncate tables at the end of tests. YES I know transactions are faster, I can't use them in certain circumstances so I'm not concerned with that.
What I AM concerned with, is why TRUNCATION takes so long (longer than using DELETE) and why it takes EVEN LONGER on my CI server.
Right now, locally (on a Macbook Air) a full test suite takes 28 minutes. Tailing the logs, each time we truncate tables... ie:
TRUNCATE TABLE table1, table2 -- ... etc
it takes over 1 second to perform the truncation. Tailing the logs on our CI server (Ubuntu 10.04 LTS), take takes a full 8 seconds to truncate the tables and a build takes 84 minutes.
When I switched over to the :deletion strategy, my local build took 20 minutes and the CI server went down to 44 minutes. This is a significant difference and I'm really blown away as to why this might be. I've tuned the DB on the CI server, it has 16gb system ram, 4gb shared_buffers... and an SSD. All the good stuff. How is it possible:
a. that it's SO much slower than my Macbook Air with 2gb of ram
b. that TRUNCATION is so much slower than DELETE when the postgresql docs state explicitly that it should be much faster.
Any thoughts?
This has come up a few times recently, both on SO and on the PostgreSQL mailing lists.
The TL;DR for your last two points:
(a) The bigger shared_buffers may be why TRUNCATE is slower on the CI server. Different fsync configuration or the use of rotational media instead of SSDs could also be at fault.
(b) TRUNCATE has a fixed cost, but not necessarily slower than DELETE, plus it does more work. See the detailed explanation that follows.
UPDATE: A significant discussion on pgsql-performance arose from this post. See this thread.
UPDATE 2: Improvements have been added to 9.2beta3 that should help with this, see this post.
Detailed explanation of TRUNCATE vs DELETE FROM:
While not an expert on the topic, my understanding is that TRUNCATE has a nearly fixed cost per table, while DELETE is at least O(n) for n rows; worse if there are any foreign keys referencing the table being deleted.
I always assumed that the fixed cost of a TRUNCATE was lower than the cost of a DELETE on a near-empty table, but this isn't true at all.
TRUNCATE table; does more than DELETE FROM table;
The state of the database after a TRUNCATE table is much the same as if you'd instead run:
DELETE FROM table;
VACCUUM (FULL, ANALYZE) table; (9.0+ only, see footnote)
... though of course TRUNCATE doesn't actually achieve its effects with a DELETE and a VACUUM.
The point is that DELETE and TRUNCATE do different things, so you're not just comparing two commands with identical outcomes.
A DELETE FROM table; allows dead rows and bloat to remain, allows the indexes to carry dead entries, doesn't update the table statistics used by the query planner, etc.
A TRUNCATE gives you a completely new table and indexes as if they were just CREATEed. It's like you deleted all the records, reindexed the table and did a VACUUM FULL.
If you don't care if there's crud left in the table because you're about to go and fill it up again, you may be better off using DELETE FROM table;.
Because you aren't running VACUUM you will find that dead rows and index entries accumulate as bloat that must be scanned then ignored; this slows all your queries down. If your tests don't actually create and delete all that much data you may not notice or care, and you can always do a VACUUM or two part-way through your test run if you do. Better, let aggressive autovacuum settings ensure that autovacuum does it for you in the background.
You can still TRUNCATE all your tables after the whole test suite runs to make sure no effects build up across many runs. On 9.0 and newer, VACUUM (FULL, ANALYZE); globally on the table is at least as good if not better, and it's a whole lot easier.
IIRC Pg has a few optimisations that mean it might notice when your transaction is the only one that can see the table and immediately mark the blocks as free anyway. In testing, when I've wanted to create bloat I've had to have more than one concurrent connection to do it. I wouldn't rely on this, though.
DELETE FROM table; is very cheap for small tables with no f/k refs
To DELETE all records from a table with no foreign key references to it, all Pg has to do a sequential table scan and set the xmax of the tuples encountered. This is a very cheap operation - basically a linear read and a semi-linear write. AFAIK it doesn't have to touch the indexes; they continue to point to the dead tuples until they're cleaned up by a later VACUUM that also marks blocks in the table containing only dead tuples as free.
DELETE only gets expensive if there are lots of records, if there are lots of foreign key references that must be checked, or if you count the subsequent VACUUM (FULL, ANALYZE) table; needed to match TRUNCATE's effects within the cost of your DELETE .
In my tests here, a DELETE FROM table; was typically 4x faster than TRUNCATE at 0.5ms vs 2ms. That's a test DB on an SSD, running with fsync=off because I don't care if I lose all this data. Of course, DELETE FROM table; isn't doing all the same work, and if I follow up with a VACUUM (FULL, ANALYZE) table; it's a much more expensive 21ms, so the DELETE is only a win if I don't actually need the table pristine.
TRUNCATE table; does a lot more fixed-cost work and housekeeping than DELETE
By contrast, a TRUNCATE has to do a lot of work. It must allocate new files for the table, its TOAST table if any, and every index the table has. Headers must be written into those files and the system catalogs may need updating too (not sure on that point, haven't checked). It then has to replace the old files with the new ones or remove the old ones, and has to ensure the file system has caught up with the changes with a synchronization operation - fsync() or similar - that usually flushes all buffers to the disk. I'm not sure whether the the sync is skipped if you're running with the (data-eating) option fsync=off .
I learned recently that TRUNCATE must also flush all PostgreSQL's buffers related to the old table. This can take a non-trivial amount of time with huge shared_buffers. I suspect this is why it's slower on your CI server.
The balance
Anyway, you can see that a TRUNCATE of a table that has an associated TOAST table (most do) and several indexes could take a few moments. Not long, but longer than a DELETE from a near-empty table.
Consequently, you might be better off doing a DELETE FROM table;.
--
Note: on DBs before 9.0, CLUSTER table_id_seq ON table; ANALYZE table; or VACUUM FULL ANALYZE table; REINDEX table; would be a closer equivalent to TRUNCATE. The VACUUM FULL impl changed to a much better one in 9.0.
Brad, just to let you know. I've looked fairly deeply into a very similar question.
Related question: 30 tables with few rows - TRUNCATE the fastest way to empty them and reset attached sequences?
Please also look at this issue and this pull request:
https://github.com/bmabey/database_cleaner/issues/126
https://github.com/bmabey/database_cleaner/pull/127
Also this thread: http://archives.postgresql.org/pgsql-performance/2012-07/msg00047.php
I am sorry for writing this as an answer, but I didn't find any comment links, maybe because there are too much comments already there.
I've encountered similar issue lately, i.e.:
The time to run test suite which used DatabaseCleaner varied widely between different systems with comparable hardware,
Changing DatabaseCleaner strategy to :deletion provided ~10x improvement.
The root cause of the slowness was a filesystem with journaling (ext4) used for database storage. During TRUNCATE operation the journaling daemon (jbd2) was using ~90% of disk IO capacity. I am not sure if this is a bug, an edge case or actually normal behaviour in these circumstances. This explains however why TRUNCATE was a lot slower than DELETE - it generated a lot more disk writes. As I did not want to actually use DELETE I resorted to setting fsync=off and it was enough to mitigate this issue (data safety was not important in this case).
A couple of alternate approaches to consider:
Create a empty database with static "fixture" data in it, and run the tests in that. When you are done, just just drop the database, which should be fast.
Create a new table called "test_ids_to_delete" that contains columns for table names and primary key ids. Update your deletion logic to insert the ids/table names in this table instead, which will be much faster than running deletes. Then, write a script to run "offline" to actually delete the data, either after a entire test run has finished, or overnight.
The former is a "clean room" approach, while latter means there will be some test data will persist in database for longer. The "dirty" approach with offline deletes is what I'm using for a test suite with about 20,000 tests. Yes, there are sometimes problems due to having "extra" test data in the dev database but at times. But sometimes this "dirtiness" has helped us find and fixed bug because the "messiness" better simulated a real-world situation, in a way that clean-room approach never will.

Is killing a "CLUSTER ON index" dangerous for database?

All the question is in the title,
if we kill a cluster query on a 100 millions row table, will it be dangerous for database ?
the query is running for 2 hours now, and i need to access the table tomorrow morning (12h left hopefully).
I thought it would be far quicker, my database is running on raid ssd and Bi-Xeon Processor.
Thanks for your wise advice.
Sid
No, you can kill the cluster operation without any risk. Before the operation is done, nothing has changed to the original table- and indexfiles. From the manual:
When an index scan is used, a temporary copy of the table is created
that contains the table data in the index order. Temporary copies of
each index on the table are created as well. Therefore, you need free
space on disk at least equal to the sum of the table size and the
index sizes.
When a sequential scan and sort is used, a temporary sort file is also
created, so that the peak temporary space requirement is as much as
double the table size, plus the index sizes.
As #Frank points out, it is perfectly fine to do so.
Assuming you want to run this query in the future and assuming you have the luxury of a service window and can afford some downtime, I'd tweak some settings to boost the performance a bit.
In your configuration:
turn off fsync, for higher throughput to the file system
Fsync stands for file system sync. With fsync on, the database waits for the file system to commit on every page flush.
maximize your maintenance_work_mem
It's ok to just take all memory available, as it will not be allocated during production hours. I don't know how big your table and the index you are working on are, things will run faster when they can be fully loaded in main memory.

Why is count(*) taking extremely long in one PostgreSQL database but not another?

I have two Postgres databases. In one I have two tables, each with about 8,000,000 rows, and a count on either of them takes about a second. In another database, also Postgres, there are tables that are 1,000,000 rows, and a count takes 10s, and one table thats about 6,000,000 rows, and count takes 3min to run. What factors determine how long this will take? They are on different machines, but the database that takes longer is on a faster machine.
I've read about how postgres count is slow in general, but this seems odd to me. I can't really use a workaround, because I am using django, and it does a count in the admin, which is taking forever and making it dificult to use.
Any information on this would be helpful.
Speed of counting depends not just on the number of rows in the table but on the time taken to read the data from disk. The time depends on many things:
Number of rows in the table - as you already mentioned.
The number of records per page (if each record takes more space you need to read more pages to read the same number of rows).
If pages are only partly full you have to read more pages.
If the tables is already cached in memory (having more memory available helps here).
If the table is indexed with a small index (the index can be counted instead).
Hardware differences.
etc....
Indexes, caches, disk speed, for starters all have an impact.
Is the "slow table" properly vacuumed?
Do not use VACUUM FULL, it only creates table and index bloat. VACUUM is absolutely enough. VACUUM ANALYZE would even be better.
And make sure autovacuum is turned on and properly configured

Cassandra random read speed

We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine amounting to about 100 bytes of data. Then I read it back as fast as I could by row key. I can read it back at 160,000/second. Great.
Then I put in a million similar records all with keys in the form of X.Y where X in (1..10) and Y in (1..100,000) and I queried for a random record. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec)
Finally I put ten million records in from 1.1 up through 10.1000000 and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second and my disk is thrashing around like crazy.
I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first and then as they cache, it speeds right up to 20,000 queries per second and my disk stops going crazy.
I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k per second, but I can't get anywhere near that with only 10mil records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6gigs of ram so I don't think it's the machine.
Here's my code to fetch records which I'm spawning into 8 threads to ask for one value from one column via row key:
ColumnPath cp = new ColumnPath();
cp.Column_family = "Standard1";
cp.Column = utf8Encoding.GetBytes("site");
string key = (1+sRand.Next(9)) + "." + (1+sRand.Next(1000000));
ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);
Thanks for any insights
purely random reads is about worst-case behavior for the caching that your OS (and Cassandra if you set up key or row cache) tries to do.
if you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev to perform random reads but with some keys hotter than others. this will be more representative of most real-world workloads.
Add more Cassandra nodes and give them lots of memory (-Xms / -Xmx). The more Cassandra instances you have, the data will be partitioned across the nodes and much more likely to be in memory or more easily accessed from disk. You'll be very limited with trying to scale a single workstation class CPU. Also, check the default -Xms/-Xmx setting. I think the default is 1GB.
It looks like you haven't got enough RAM to store all the records in memory.
If you swap to disk then you are in trouble, and performance is expected to drop significantly, especially if you are random reading.
You could also try benchmarking some other popular alternatives, like Redis or VoltDB.
VoltDB can certainly handle this level of read performance as well as writes and operates using a cluster of servers. As an in-memory solution you need to build a large enough cluster to hold all of your data in RAM.