We've recently experienced a flood of logs on Heroku Postgres, along with major performance degradation in our app. Thousands of messages like the following were logged with no clear indication of the cause.
I was wondering if a background process (like VACUUM) could generate such a huge influx of messages, or if this might be something specific to Heroku. I'd appreciate any help in identifying possible origins.
The log lines contain only what appear to be hex codes (parts redacted for brevity and security). Unfortunately I'm unable to see the value of log_line_prefix on Heroku.
02/12/2022 10:59:58.000 am
[BLUE] [25-27464] 632d303934312d346539652d6163934312d346539652d616331352d223a7b227374726174656779223a224b525f4f4e4c59222c226368696c644f626a65637469766557656967687473223a6e756c6c2c226b6579526573756c7457656967687473223a6e756c6c7d2c226b6579526573756c744869657261726368794e6f646573223a5b7b226964223a2231643231663765392d616534302d346136312d613265322d346338336466353466316162222c22737461747573223a22414354495645222c2263616c63756c617465644d657472696356616c7565436f6e66
02/12/2022 10:59:58.000 am
[BLUE] [25-27466] 352d643032362d346436372d3933373223a39362e38383038323734794e6f646573223a5b7b226964223a2265666331666361372d353430642d343732632d383162642d656330666430386337323631222c22737461747573223a22414354495645222c2263616c63756c617465644d657472696356616c7565436f6e66696775726174696f6e223a6e756c6c2c226d657472696356616c7565223a31353431392e302c22616363756d756c6174656456616c7565223a6e756c6c2c22616363756d756c61746564476f616c223a37353335352e32383535333035343336
02/12/2022 10:59:58.000 am
[BLUE] [25-27471] 756c6c2c2267726964457870726573331227d5d7d2c226162353833657373223a6e756c6c2c22737461747573223a22414354495645222c2270726f6772657373436f6e66696775726174696f6e223a7b227374726174656779223a224b525f4f4e4c59222c226368696c644f626a65637469766557656967687473223a6e756c6c2c226b6579526573756c7457656967687473223a6e756c6c7d2c226b6579526573756c744869657261726368794e6f646573223a5b7b226964223a2231623266383838632d353836322d346462382d613934312d3663636135313364
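For reference, the hex can be turned back into readable text directly in Postgres. The literal below is a short fragment copied from the first log line above; nothing more than that is asserted about the payload.
-- Decode a hex fragment from the first log line back to UTF-8 text.
-- This fragment decodes to the word 'strategy', which suggests the lines are
-- hex-encoded application data rather than anything internal to Postgres.
SELECT convert_from(decode('7374726174656779', 'hex'), 'UTF8');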
What recommendation can you give me on setting up a PostgreSQL database failover cluster? I have only 2 virtual machines.
Right now I'm reading this: https://wiki.clusterlabs.org/wiki/PgSQL_Replicated_Cluster
I have some questions about it:
Where in the configuration files is it specified when the second machine should take over as the active one?
How does the first machine know that the second machine is active?
Why doesn't the virtual IP address conflict?
When the main machine comes back up, how does the system know that it needs to replicate from the second server?
Sorry for my bad English.
It's been almost 2 months since you asked, but it seems you are in the same boat I was in a few weeks back. I have gone through your link and it explains that you need to use corosync + pacemaker + pcs. Frankly, I have no experience with any of them, but I used pgpool2 4.0.4 (the latest at the time of writing) with PostgreSQL 9.5.14 and 10.7 and was able to bring up 2 clusters in the last 2 months.
With pgpool you do not need any other tool/library; all configuration goes into one file, pgpool.conf, along with a few one-line passwords in pool_passwd and pcp.conf.
All the configuration the watchdog (the pgpool component that tracks the live/dead status of the cluster) needs ships with pgpool and merely has to be filled in.
You may find more information on pgpool2 here, and about the latest version here.
You may also refer to the linked tutorial (just read it first to get a gist of the whole process); it is super useful and quite detailed on how the whole process goes.
Also, let us know if you were able to set up a cluster with the technologies mentioned at your link.
Edit: you may find the extracted pgpool.conf configuration on my gist page.
I have kept only the settings which I changed. The rest have been left as defaults, or maybe I forgot to add 1-2 of them here.
Most of the comments in the file come straight from the standard documentation and are self-explanatory, but in a few places I have added my own comments. They cover:
The VIP configuration.
One place where I am using a different postgres password.
A note about recovery_1st_stage.
A note about the key file referred to by logdir.
Most importantly, sit back and read through the original links and the standard documentation to get a gist of what the whole thing/process is. It will be easier for you to modify it to your needs later.
I read both sets of documentation 3-4 times (slow learner) and then used a mix of both approaches.
There are also 4 files I created:
recovery_1st_stage
pgpool_remote_start.sh
failover.sh
promote_standby.sh
You will find guidance on these in both places: the standard documentation and the other tutorial. They are plain sh files with a bunch of ssh and psql commands.
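As a quick sanity check once everything is wired up, you can ask pgpool how it sees the backend nodes by connecting psql to the pgpool port and running its SHOW command. A minimal example, assuming pgpool-II 4.x and the default pgpool port 9999:
-- Run via psql connected to pgpool, not directly to PostgreSQL,
-- e.g. psql -h <pgpool-host> -p 9999 -U postgres -c 'SHOW POOL_NODES;'
-- Lists each backend with its status and role (primary/standby) as pgpool sees it.
SHOW POOL_NODES;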
I'm looking at increasing the number of days of historical events that are stored in the Tableau Server database from the default 183 to 365+ days, and I'm trying to understand what the performance impact on Tableau Server itself would be, since the database and backup sizes also start increasing. Would it cause the overall Tableau Server (running 2019.1.1) to slow to a crawl over time, or begin to have a noticeable impact with respect to performance?
I think the answer here depends on some unknowns, which makes it pretty subjective:
How much empty space is on your Postgres node.
How many events typically occur on your server in a 6-12 month period.
Maybe more important than a yes or a no (either of which should be taken with a grain of salt) are the things to consider prior to making the change.
Have you found value in the default 183 days? Is it worth the risk of extending to 365+? It sounds like you might be doing some high-level auditing and a longer period is becoming a requirement. If that's the case, the answer is that you have no choice but to go ahead with the change. See the steps below.
Make sure that the change is made in a non-prod environment first, ideally one with high traffic. Even though you will not get an exact replica, it would certainly be worth the practice of switching it over. Also, you want to make sure that the non-prod and prod environments match exactly.
Make sure the change is very well documented. For example, if you were to change departments without anyone having knowledge of a non-standard config setting, it might make for a difficult situation if Support is ever needed or if there is a question as to what might be causing slow behavior.
Things to consider after a change:
Monitor the size of backups.
Monitor the size of the historical table(s); see the Data-Dictionary for table names if not already known, and the query sketch after this list.
Be ready to roll back the config change if the above starts to inflate.
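For the table-size monitoring above, a generic Postgres query run read-only against the Tableau repository is enough; the historical table names come from the Data-Dictionary, so nothing below assumes a specific one:
-- Show the largest tables in the repository, including index and TOAST space.
-- The historical tables (names per the Data-Dictionary) should surface here as they grow.
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 15;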
Overall:
I have not personally seen much troubleshooting value from these tables covering more than a certain number of days (i.e., if there is trouble on the server it is usually investigated immediately, not by looking back 365+ days). Perhaps the value for you lies in determining the amount of usage/expansion on Tableau Server.
I have not seen this table get so large that it brings down a server or slows it down. Especially if the server is sized appropriately.
If you're regularly/heavily working with and examining the Postgres repository data, it may be wise to extract it at a low-traffic time of day. This will prevent excess load from outside sources during peak times. Remember that ad hoc querying of the Postgres repository is technically unsupported, which leads to awkward situations if things go awry.
About two weeks ago, I deployed some changes to my app (Flask + SQLAlchemy on top of Postgres) to Heroku. Soon afterwards, the response time of my dynos went up and responses started timing out. Before these problems started, the current version of the app had been running flawlessly for some 2-3 months.
Naturally, I suspected my changes and went through them, but none were relevant to this (changes in the front end, plain-text emails replaced with HTML ones, minor changes in the static data the app is using).
I have a copy of the app for testing purposes, so I cloned the latest backup of the production DB and started investigating (the clone was some 45GiB, compared to 56GiB of the original, but this seems to be a normal consequence of "bloating").
It turns out that even trivial requests take a ridiculous amount of time on production, while they work on the testing copy as they should. For example, select * from A where some_id in (three, int, values) takes under 0.5 sec on testing and some 12-15 sec on prod (A has 3M records and some_id is a foreign key to a much smaller table). Even select count(*) from A takes the same amount of time, so it's not indexing or anything like that.
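For concreteness, the comparison is simply running the same statement on both databases and looking at the plan and timing; a and some_id are the placeholder names from above, and the literal values are made up:
-- Run on both the production clone and production to compare plans and timings.
-- Table/column names are the placeholders used above; the id values are arbitrary.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM a WHERE some_id IN (1, 2, 3);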
This is not tied to a specific query or even a table, thus removing my doubts of my code as most of it was unchanged for months and worked fine until these problems started.
Looking further into this, I found that the logs contain load averages for the DB server, and my production one is showing load-avg 22 (I searched for postgres load-avg in Papertrail), and it seems to be almost constant (slowly rising over prolonged periods of time).
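A generic way to see what those backends are actually doing while the load is high (standard Postgres, nothing Heroku-specific):
-- Show non-idle backends, how long their current statement has been running,
-- and what, if anything, they are waiting on.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY running_for DESC NULLS LAST;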
I upgraded the production DB from Postgres 9.6 / Standard 2 plan (although my connection count was around 105/400 and the cache hit rate was 100%) to Postgres 10 / Standard 3 plan, but this didn't make the slightest improvement. The upgrade also meant some 30-60 min of downtime. Soon after bringing the app back up, the DB server's load was high again (sadly, I didn't check during the downtime). Also, the DB server's load doesn't seem to have spikes that would reflect the app's usage (the app is mostly used in the USA and EU, and the usual app load reflects that).
At this point, I am out of ideas (apart from contacting Heroku's support, which a colleague of mine will do) and would appreciate any suggestions on what to look at or try next.
I ended up upgrading from standard-2 to standard-7 and my DB's load dropped to around 0.3-0.4. I don't have an explanation of why it started so suddenly.
For years, at least 8, our company has been running a daily process that has never failed. Nothing on the client side has changed, but we recently upgraded to V7R1 on the System i. The very first run of the old process now fails with a "Cursor not open" message reported back to the client, and that's all that's in the job log as well. I have seen Error -501, SQLSTATE 24501 on occasion.
I got both IBM and DataDirect (provider of the ODBC driver) involved. IBM stated it was a client issue, DataDirect dug through logs and found that when requesting the next block of records from a cursor this error occurs. They saw no indication that the System i alerted the client that the cursor was closed.
In troubleshooting, I noticed that the ODBC driver has an option for WITH HOLD which by default is checked. If I uncheck it, this particular issue goes away, but it introduces another issue (infinite loops) which is even more serious.
There's no single common theme that causes these errors; the only pattern I see is doing some processing while looping through a fairly large result set. It doesn't seem to be related to timing, or to a particular table or table type. The outer loops sometimes run over large tables with many data types, sometimes over tiny tables with nothing but CHAR(10) and CHAR(8) data types.
I don't really expect an answer on here since this is a very esoteric situation, but there's always some hope.
There were other issues that IBM has already addressed by having us apply PTFs to take us to 36 for the database level. I am by no means a System i expert, just a Java programmer who has to deal with this issue that has nothing to do with Java at all.
Thanks
This is for anyone else out there who may run across a similar issue. It turns out it was a bug in the QRWTSRVR code that caused the issue. The driver opened up several connections within a single job and used the same name for cursors in at least 2 of those connections. Once one of those cursors was closed QRWTSRVR would mistakenly attempt to use the closed cursor and return the error. Here is the description from the PTF cover letter:
DESCRIPTION OF PROBLEM FIXED FOR APAR SE62670 :
A QRWTSRVR job with 2 cursors named C01 takes a MSGSQL0501
error when trying to fetch from the one that is open. The DB2
code is trying to use the cursor which is pseudo closed.
The PTF SI57756 fixed the issue. I do not know that this PTF will be generally released, but if you find this post because of a similar issue hopefully this will assist you in getting it corrected.
This is how I fix DB problems on the iSeries.
Start journaling the tables on the iSeries, or change the connection to the iSeries to use commit = *NONE.
For journaling I recommend using two journals, each with its own receiver:
One journal for tables with relatively few changes, like a table of US states or a table that gets fewer than 10 updates a month. This is so you can determine when the data was changed for an audit. Keep all the receivers for this journal online forever.
One journal for tables with many changes throughout the day. Delete the receivers for this journal when you can no longer afford the space they take up.
If journaling or commit *NONE doesn't fix it, you'll need to look at the sysixadv table; long-running queries can wreck an ODBC connection.
SELECT SYS_TNAME, TBMEMBER, INDEX_TYPE, LASTADV, TIMESADV, ESTTIME,
       REASON, "PAGESIZE", QUERYCOST, QUERYEST, TABLE_SIZE, NLSSNAME,
       NLSSDBNAME, MTIUSED, MTICREATED, LASTMTIUSE, QRYMICRO, EVIVALS,
       FIRSTADV, SYS_DNAME, MTISTATS, LASTMTISTA, DEPCNT
FROM sysixadv
ORDER BY ESTTIME DESC
Also try ORDER BY TIMESADV DESC.
Fix those queries, or perhaps create the advised indexes.
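If you do create one of the advised indexes, it is an ordinary SQL index; the library, table, and column names below are purely hypothetical placeholders for whatever sysixadv reports:
-- Hypothetical example only: create an index the advisor suggested.
-- Substitute the library/table/key columns reported by sysixadv.
CREATE INDEX MYLIB.ORDERS_CUSTID_IX
    ON MYLIB.ORDERS (CUSTOMER_ID);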
Which ODBC driver are you using?
If you're using the IBM i Access ODBC driver, then this problem may be fixed by APAR SE61342. The driver didn't always handle the return code from the server indicating that the result set was closed, so during the SQLCloseCursor function the driver would send a close command to the server, which would return an error since the server had already closed the cursor. Note that you don't have to be at SP11 to hit this condition; it is just easier to hit there, since I enabled pre-fetch in more cases in that fix pack. An easy test to see if that is the problem is to disable pre-fetch for the DSN or pass PREFETCH=0 on the connection string.
If you're using the DB2 Connect driver, I can't really offer much help, sorry.
So some of us devs are starting to take over the management of some of our SQL Server boxes as we upgrade to SQL Server 2008 R2. In the past, we've manually reduced the log file sizes by using:
USE [databaseName]
GO
DBCC SHRINKFILE('databaseName_log', 1)
BACKUP LOG databaseName WITH TRUNCATE_ONLY
DBCC SHRINKFILE('databaseName_log', 1)
and I'm sure you all know that TRUNCATE_ONLY has been deprecated.
So the solution I've found so far is to set the recovery model to SIMPLE, shrink, and then set it back... however, this one got away from us before we could get there.
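In other words, something along these lines, with the same placeholder names as above (just a sketch of the sequence described, not something you can run against a mirrored database):
-- Note: a mirrored database must stay in FULL recovery, so mirroring has to be removed first.
USE [databaseName];
GO
ALTER DATABASE [databaseName] SET RECOVERY SIMPLE;
GO
DBCC SHRINKFILE('databaseName_log', 1);
GO
ALTER DATABASE [databaseName] SET RECOVERY FULL;
GO
-- Switching to SIMPLE breaks the log backup chain; take a full backup before resuming log backups.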
Now we've got a full disk, and the mirroring that is going on is stuck in a half-completed, constantly erroring state where we can't alter any databases. We can't even open half of them in object explorer.
From reading about it, the way to keep this from happening in the future is to have a maintenance plan set up (whoops. :/), but while we can create one, we can't start it with no disk space and SQL Server stuck in its erroring state (Event Viewer shows it recording about 5 errors per second... this has been going on since last night).
Anyone have any experience with this?
So you've kind of got a perfect storm of bad circumstances here in that you've already reached the point where SQL Server is unable to start. Normally at this point it's necessary to detach a database and move it to free space, but if you're unable to do that you're going to have to start breaking things and rebuilding.
If you have a mirror and a backup that is up to date, you're going to need to blast one unlucky database on the disk to get the instance back online. Once you have enough space, then take emergency measures to break any mirrors necessary to get the log files back to a manageable size and get them shrunk.
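A sketch of that emergency step, with the placeholder database name from the question; removing mirroring is what then lets you change the recovery model and shrink the log:
-- Emergency only, and only with verified backups: remove mirroring for one database
-- so its log can be managed. The database name is a placeholder.
ALTER DATABASE [databaseName] SET PARTNER OFF;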
The above is very much emergency recovery and you've got to triple check that you have backups, transaction log backups, and logs anywhere you can so you don't lose any data.
Long term, to manage the mirror you need to make sure that your mirrors remain synchronized and that full and transaction log backups are being taken, and potentially reconfigure each database on the instance with a maximum file size so that the sum of all log files does not exceed the available volume space.
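A sketch of that last piece, capping a log file's size; the logical file name and the cap are placeholders to adjust per database and volume:
-- Cap the log file so that the sum of all log files fits on the volume.
-- Logical file name and maximum size are placeholders.
ALTER DATABASE [databaseName]
MODIFY FILE (NAME = N'databaseName_log', MAXSIZE = 8192MB);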
Also, I would double check that your system databases are not on the same volume as your database data and log files. That should help with being able to start the instance when you have a full volume somewhere.
Bear in mind, if you are having to shrink your log files on a regular basis then there's already a problem that needs to be addressed.
Update: If everything is on the C: drive, then consider reducing the size of the page file to get enough space to bring the instance online. I'm not sure what your setup is here.