I'm quite new to PostgreSQL. I'm implementing a C program to transfer a large amount of data into a PostgreSQL database.
In order to develop the program further I have to test a lot (especially performance), which means performing several import runs in succession.
I want to make sure that the caches are clean again each time I start the program.
Which items do I have to keep in mind - besides shared buffers - in order to achieve this?
Thanks a lot in advance
EDIT:
We are using Suse Linux Enterprise 12 and PostgreSQL 9.4.
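For context, the kind of reset I have in mind between runs looks roughly like this (run as root; the systemd service name on SLES 12 is an assumption and may differ on your installation):
# restart the server so shared buffers start out empty
systemctl restart postgresql
# flush dirty pages, then drop the OS page cache so the table files are cold on disk as well
sync
echo 3 > /proc/sys/vm/drop_caches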
Related
I've been working on optimizing tables in our database. One of our tables requires monthly vacuuming because of cleanup processes. The table size can get up to 25 GB. As this table is used by production users, we can't afford downtime every month to run VACUUM FULL.
I found that pg_squeeze and pg_repack can be used for this purpose, but I'm not able to understand the difference between the two. Can someone please explain what the difference is and which will be more suitable for me to use?
Thank you.
The main difference is that pg_squeeze operates entirely inside the database, with no need for an external binary. It also has a background worker that will schedule table rewrites automatically if certain criteria are met.
So you could say that pg_repack is less invasive (for example, installation requires no restart of the database), but pg_squeeze is more feature complete.
Disclaimer: I work for the company who wrote pg_squeeze.
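To make the contrast concrete, here is roughly what using each one looks like. The database and table names are made up, pg_squeeze additionally needs shared_preload_libraries = 'pg_squeeze' (hence the restart mentioned above), and the exact arguments of squeeze.squeeze_table vary between extension versions:
# pg_repack: the rewrite is driven by an external binary
psql -d mydb -c "CREATE EXTENSION pg_repack;"
pg_repack --table=public.big_table mydb

# pg_squeeze: everything happens through SQL inside the database
psql -d mydb -c "CREATE EXTENSION pg_squeeze;"
psql -d mydb -c "SELECT squeeze.squeeze_table('public', 'big_table');"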
Thanks for the question, I was just searching for the same thing.
There's a point that might influence one's decision: extension support when running on a managed database service. For example, AWS RDS supports pg_repack but not pg_squeeze.
I'm about to upgrade a fairly large PostgreSQL cluster from 9.3 to 11.
The upgrade
The cluster is approximately 1.2 TB in size. The database server has a disk system consisting of a fast hardware RAID 10 array of 8 DC-grade SSDs, along with 192 GB of RAM and 64 cores. I am performing the upgrade by first replicating the data to a new server with streaming replication, then upgrading that one to 11.
I tested the upgrade using pg_upgrade with the --link option; this takes less than a minute. I also tested the upgrade the regular way (without --link) with many jobs; that takes several hours (4+).
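For reference, the two invocations I compared look roughly like this (binary and data directory paths are placeholders for my setup):
# hard-link upgrade: finishes in under a minute
pg_upgrade -b /usr/lib/postgresql/9.3/bin -B /usr/lib/postgresql/11/bin \
           -d /data/pg93 -D /data/pg11 --link

# regular copy upgrade with parallel jobs: takes several hours
pg_upgrade -b /usr/lib/postgresql/9.3/bin -B /usr/lib/postgresql/11/bin \
           -d /data/pg93 -D /data/pg11 --jobs=16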
Questions
Now the obvious choice for me is of course to use the --link option. However, all this makes me wonder: are there any downsides (performance- or functionality-wise) to using it over the regular, slower method? I do not know the internal workings of PostgreSQL's data structures, but I have a feeling there could be a performance difference after the upgrade between rewriting the data entirely and just using hard links - whatever that means?
Considerations
The only drawback of --link I can find in the documentation is not being able to access the old data directory after the upgrade is performed (https://www.postgresql.org/docs/11/pgupgrade.htm). However, that is only a safety concern, not a performance drawback, and it doesn't really apply in my case of replicating the data first.
The only other thing I can think of is reclaiming space, with whatever performance upsides that might have. However, as I understand it, that can also be achieved by running VACUUM FULL (or CLUSTER?) after the database has been upgraded with --link? Also, as I understand it, reclaiming space is not very impactful performance-wise on an SSD.
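Concretely, what I have in mind there is something like this (database, table, and index names are placeholders):
# rewrite every table in the database, reclaiming all free space
psql -d mydb -c "VACUUM FULL;"
# or rewrite a single table, physically ordered by one of its indexes
psql -d mydb -c "CLUSTER big_table USING big_table_pkey;"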
I'd appreciate it if anyone can help shed some light on this.
There is absolutely no downside to using hard links (with the exception you noted, that the old cluster is dead and has to be removed).
A hard link is in no way different from a normal file.
A “file” in UNIX is in reality an “inode”, a structure containing file metadata. An entry in a directory is a (hard) link to that inode.
If you create another hard link to the inode, the same file will be in two different directories, but that has no impact whatsoever on the behavior of the file.
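You can see this for yourself with a throwaway file (file names are arbitrary):
echo "some data" > original
ln original copy        # create a second hard link to the same inode
ls -li original copy    # both directory entries show the same inode number
rm original             # removing one link leaves the data intact
cat copy                # still prints "some data"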
Of course you must make sure that you don't start both the old and the new server. Instant data corruption would ensue. That's why you should remove the old cluster as soon as possible.
I have a fairly busy DB2 on Windows server - 9.7, fix pack 11.
About 60% of the CPU time used by all queries in the package cache is being used by the following two statements:
CALL SYSIBM.SQLSTATISTICS(?,?,?,?,?,?)
CALL SYSIBM.SQLPRIMARYKEYS(?,?,?,?)
I'm fairly decent with physical tuning and have spent a lot of time on SQL tuning on this system as well. The applications are all custom, and educating developers is something I also spend time on.
I get the impression that these two stored procedures are something that perhaps ODBC calls? Reading their descriptions, they also seem like things that are completely unnecessary to do the work being done. The application doesn't need to know the primary key of a table to be able to query it!
Is there anything I can tell my developers to do that will either eliminate/reduce the execution of these or cache the information so that they're not executing against the database millions of times and eating up so much CPU? Or alternately anything I can do at the database level to reduce their impact?
6.5 years later, and I have the answer to my own question. This is a side effect of using an ORM. Part of what it does is to discover the database schema. Rails also has a similar workload; in Rails, you can avoid this by using the schema cache. This becomes particularly important at scale. Not sure if there are equivalents for other ORMs, but I hope so!
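In the Rails case, for example, the schema cache is just a dump file the application loads at boot instead of interrogating the catalog; in recent Rails versions it is generated with a one-liner (the task name may vary slightly by version):
# writes db/schema_cache.yml, which the app loads instead of querying the schema at startup
bin/rails db:schema:cache:dump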
How can the dump and load be done faster in Progress?
I need to automate the dump-and-load process so that I can run it on a weekly basis.
Generally one wouldn't need to do a weekly D&L, as the server engine does a decent job of managing its data. A D&L should only be done when there's an evident concern about performance, when changing versions, or when making a significant organizational change in the data extents.
Having said that, a binary D&L is usually the fastest, particularly if you can make it multi-threaded.
OK, dumping and loading across platforms to build a training system is probably a legitimate use case. (If it were Linux to Linux you could just back up and restore -- you may be able to do that Linux to UNIX if the byte ordering is the same...)
The binary format is portable across platforms and versions of Progress. You can binary dump a Progress version 8 HPUX database and load it into a Windows OpenEdge 11 db if you'd like.
To do a binary dump use:
proutil dbname -C dump tablename
That will create tablename.bd. You can then load that table with:
proutil dbname -C load tablename
Once all of the data has been loaded you need to remember to rebuild the indexes:
proutil dbname -C idxbuild all
You can run many simultaneous proutil commands. There is no need to go one table at a time. You just need to have the db up and running in multi-user mode. Take a look at this for a longer explanation: http://www.greenfieldtech.com/downloads/files/DB-20_Bascom%20D+L.ppt
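As a rough sketch, a parallel binary D&L driven from a shell script could look like this (database and table names are placeholders, following the dump and load commands shown above):
# dump several tables from the source db in parallel
proutil olddb -C dump customer &
proutil olddb -C dump order &
proutil olddb -C dump orderline &
wait
# load them into the target db in parallel, then rebuild all indexes once at the end
proutil newdb -C load customer &
proutil newdb -C load order &
proutil newdb -C load orderline &
wait
proutil newdb -C idxbuild all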
It is helpful to split your database up into multiple storage areas (and they should be type 2 areas) for best results. Check out: http://dbappraise.com/ppt/sos.pptx for some ideas on that.
There are a lot of tuning options available for binary dump & load. Details depend on what version of Progress you are running. Many of them probably aren't really useful anyway but you should look at the presentations above and the documentation and ask questions.
I have Postgres running on Windows and I'm trying to investigate the strange behaviour: There are 17 postgres processes, 8 out of those 17 consume ~300K memory each.
Does anybody know what such behaviour is caused by?
Does anybody know about a tool to investigate the problem?
8 out of those 17 consume ~300K memory each.
Are you 110% sure? Windows doesn't know how much memory is used from the shared buffers. Each process could be using just a few KB of its own while sharing the shared memory with the other processes.
What problem do you have? Using memory is not a problem; memory is made to be used. And at 300 KB each, that's just a few MB altogether, even if each process really is using 300 KB.
And don't forget, PostgreSQL is a multi-process system. That's also why it scales so easily on multi-core and multi-processor systems.
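If you want a first impression of what those processes share, and assuming you can reach the server with psql, something like this helps:
# total shared memory that all the backends map together
psql -c "SHOW shared_buffers;"
# how many of the processes are client backends (one per connection)
psql -c "SELECT count(*) FROM pg_stat_activity;"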
See pgAdmin: http://www.pgadmin.org/
A tool for analyzing output from postgresql can be found at http://pgfouine.projects.postgresql.org/
pgFouine is a PostgreSQL log analyzer used to generate detailed reports from a PostgreSQL log file. pgFouine can help you to determine which queries you should optimize to speed up your PostgreSQL based application.
I don't think you can find out why you have a lot of processes running, but if you feel that it might be because of database usage, this tool might help you find a cause.
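Basic usage is roughly the following, assuming query logging is already enabled in postgresql.conf (option names may differ between pgFouine versions, so check its documentation):
# analyze a PostgreSQL log file and produce an HTML report of the costliest queries
php pgfouine.php -file postgresql.log > report.html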