Best way to keep the TYPO3 sys_log nice & clean? - typo3

I have this MySQL query:
DELETE FROM sys_log
WHERE sys_log.tstamp < UNIX_TIMESTAMP(ADDDATE(NOW(), INTERVAL -2 MONTH))
ORDER BY sys_log.tstamp ASC
LIMIT 10000
Is this good for keeping sys_log small if I run it as a cronjob?

Yes and No
It IS NOT if you care about your record history.
You can revert changes to records (content, pages, etc.) using the sys_history table. The sys_history and sys_log tables are related. When you truncate sys_log, you also lose the ability to roll back any changes to the system. Your clients may not like that.
It IS if you only care about the sys_log size.
Truncating the table via cron is fine.
In TYPO3 4.6 and up you can use the Table garbage collection scheduler task, as pgampe says. For TYPO3 versions below 4.5 you can use the tablecleaner extension. If you remove all records from sys_log older than [N] days, you will also retain your record history for [N] days. That seems to be the best solution to me.
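If you prefer the cronjob from the question, a minimal sketch of the "older than [N] days" variant could look like this (60 days and the LIMIT are just example values to adapt):
DELETE FROM sys_log
WHERE tstamp < UNIX_TIMESTAMP(NOW() - INTERVAL 60 DAY)
LIMIT 10000;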
And please try to fix what is filling your sys_log in the first place ;-)

There is a scheduler task for this.
It is called Table garbage collection (scheduler).
In TYPO3 4.7, it can only clean the sys_log table. Starting from TYPO3 6.0, it can also clean the sys_history table. You can configure the number of days and what tables to clean.
Extensions may register further tables to clean.

Yes, it is.
See also other suggestions by Jochen Weiland about keeping a TYPO3 installation clean and small.

Since TYPO3 9, the history is no longer stored using sys_log.
You can safely delete records from sys_log.
See Breaking Change #55298.
For versions before TYPO3 v9, sys_history referenced sys_log, so:
if you delete records from sys_log, you should make sure sys_history does not reference the records you want to delete, or delete those as well if that is intended (see the example DB queries below)
For versions before v9 (to delete only records in sys_log which are not referenced by sys_history):
DELETE FROM sys_log WHERE NOT EXISTS
(SELECT * FROM sys_history WHERE sys_history.sys_log_uid=sys_log.uid)
AND recuid=0 AND tstamp < $timestamp LIMIT $limit
Feel free to optimize this for your requirements.
What you can also do safely (without affecting sys_history) is deleting records with sys_log.error != 0.
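For example (a sketch; $timestamp and $limit are placeholders, just like in the query above):
DELETE FROM sys_log
WHERE error != 0 AND tstamp < $timestamp
LIMIT $limit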
Some more recommendations:
Set your debugging level to verbose (Warnings) on development but errors-only in production
Regularly look at the sys_log and eliminate problems. Once you have taken care of a problem, you can delete the specific error from sys_log (see sys_log.error != 0, sys_log.details). You can do this with a database command, or on newer TYPO3 versions use the "SYSTEM: log" module in the backend and its "Delete similar errors" button.
On a major version upgrade you can also consider truncating sys_log and sys_history altogether, combined with using the lowlevel cleaner to delete records with deleted=1. Talk to the editors (or someone close to them) first, though, as this will remove the entire history. Be sure that you really want to do that.
For the scheduler task "Table garbage collection" see the documentation: https://docs.typo3.org/c/typo3/cms-scheduler/master/en-us/Installation/BaseTasks/Index.html

Another common cause of large sys_log tables is issues/errors in one of the extensions used in the TYPO3 installation.
A common example when an old version of tx_solr is used:
Core: Error handler (FE): PHP Warning: Invalid argument supplied for foreach() in typo3conf/ext/solr/classes/class.tx_solr_util.php
Core: Error handler (FE): PHP Warning: array_reverse() expects parameter 1 to be array, null given in typo3conf/ext/solr/classes/class.tx_solr_util.php line 280
This set of records will pop up in sys_log every minute or so, which leads to millions of records in a short period of time.
Luckily, these kinds of records don't have any effect on the record history in sys_history and the associated rollback functionality, so it's safe to delete them.
If you have a large sys_log this will likely cause issues with LOCK timeouts, so you'll have to limit the delete query:
delete from sys_log where details LIKE 'Core:%' LIMIT 200000;

Related

PostgreSQL logical replication - ignore pre-existing data

Imagine dropping a subscription and recreating it from scratch. Is it possible to ignore existing data during the first synchronization?
Creating a subscription with (copy_data=false) is not an option because I do want to copy data, I just don't want to copy already existing data.
Example: There is a users table and a corresponding publication on the master. This table has 1 million rows and every minute a new row is added. Then we drop the subscription for a day.
If we recreate the subscription with (copy_data=true), replication will not start due to a conflict with already existing data. If we specify (copy_data=false), 1440 new rows will be missing. How can we synchronize the publisher and the subscriber properly?
You cannot do that, because PostgreSQL has no way of telling when the data were added.
You'd have to reconcile the tables by hand (or INSERT ... ON CONFLICT DO NOTHING).
Unfortunately PostgreSQL does not support nice skip options for conflicts yet, but I believe it will be enhanced in the future.
Based on @Laurenz Albe's answer, which recommends the use of the statement:
INSERT ... ON CONFLICT DO NOTHING.
I believe it would be better to use the following command, which will also take care of any possible updates to your data before you start the subscription again:
INSERT ... ON CONFLICT ... DO UPDATE SET ...
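A rough sketch of that reconciliation, assuming the publisher's rows have been loaded into a staging table (users_from_publisher is purely hypothetical) and that id is the primary key:
INSERT INTO users (id, name, created_at)
SELECT id, name, created_at
FROM users_from_publisher
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    created_at = EXCLUDED.created_at;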
Finally, I have to say that both are dirty solutions, as new rows may arrive between the execution of the above statement and the creation of the subscription, and these will be lost until you perform the custom sync again.
I have seen some other suggested solutions using the LSN number from the Postgresql log file...
For me, it is perhaps more elegant and safe to delete all the data from the destination table and create the subscription again!

TYPO3 : clear big mysql tables

On my TYPO3 6.2 website some SQL tables have become quite big:
tx_realurl_urlcache 557 MB
cf_cache_hash 15.5 MB
tx_kesearch_stat_search 15.4 MB
tx_kesearch_stat_word 19.6 MB
sys_refindex 18.1 MB
Please note that all the other tables (about 100 tables) combined are > 15 MB... so my question is simple:
-> Which ones could I delete? Is it safe or not?
I have had bad experiences with TYPO3 database cleanup in the past, so I would rather ask you for advice :)
TL;DR: the only tables which could be cleared are cache tables, but clearing them costs you performance and they will build up again soon.
You might clear these tables and they will probably build up again, but you will suffer.
tx_realurl_urlcache - this is where realurl stores the generated URLs; if you truncate it, URL decoding might break / some URLs might be unknown = your page breaks
cf_cache_* - can be truncated but will be rebuilt; meanwhile your server needs to rebuild the information, so it is slower.
tx_kesearch_stat_search / tx_kesearch_stat_word - these two belong to the kesearch extension and contain the index information of your page. Truncating them will break the search until the tables are rebuilt.
sys_refindex - here TYPO3 stores the references which help you avoid deleting files or records that are still in use. (Normally this index is rebuilt with a scheduler task to keep the data consistent.)
Do not delete the tables! You can truncate some of them.
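If you do truncate a caching framework table by hand, note that each cf_* table has a matching _tags table that should be truncated together with it. A minimal sketch, assuming the default database cache backend:
TRUNCATE TABLE cf_cache_hash;
TRUNCATE TABLE cf_cache_hash_tags;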
If you want to clean up some caches, just flush all TYPO3 caches in the backend, or use the 'clear all cache' button inside the TYPO3 install tool.

Move items between collections

I need to move a large quantity of items between two collections. I tried to change the tables "item" and "collection2item" (columns "owning_collection" and "item_id" respectively) directly in the database. Then I restarted Tomcat, cleaned the Cocoon cache and rebuilt the index, and it's still not working.
Is the metadata-export/metadata-import process safer or easier than the above for a mass move of items?
What else can I do?
Your process should be ok if you run the reindex with the -bf flags (just -f may be enough too).
Without the -f flag, the reindex (link goes to code as of DSpace 5.x) will check the last_modified value (in the item table) and only reindex items whose value in that column has changed since the last reindex. This also means that a reindex without -f should work if you also updated the last_modified timestamp.
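For example, a rough sketch of bumping that timestamp for the moved items; the WHERE clause and the collection id 42 are assumptions, so adapt them to your own selection:
UPDATE item SET last_modified = NOW()
WHERE item_id IN (SELECT item_id FROM collection2item WHERE collection_id = 42);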
Still not working?
If the reindex still doesn't happen, something else must be going wrong. Check your dspace.log -- are there any entries that look like "wrote item xyz to index"? If not then the items aren't being reindexed. Are there any error messages in the dspace.log around the time you do the reindex? Any error messages in the solr log file?
Also, make sure you always run the reindex (and all other dspace commands) as the same user that tomcat is running under, to avoid permissions problems. If you've ever run the commands as a different user, change the permissions of the solr data directory (probably [dspace]/solr/search/data) so that the tomcat user can create/write/delete files in it.
Overall recommendation
In most cases I'd go with batch metadata editing myself for moving items between collections; it avoids all these problems and will trigger a re-index of the affected items automatically.
The metadata import process is very reliable. It also provides a preview option that will allow you to see the changes before they are applied. After the items are updated, the proper re-indexing processes will run.
You only need to provide the item ids and the data fields you wish to edit.
If you prefer to build your CSV file by hand or from a SQL query, that will work as well. The name of the column at the top of your CSV will determine the fields to be updated.
https://wiki.duraspace.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-CSVFormat
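If you do go the SQL route, a query along these lines could produce the two columns such a CSV needs. The collection handle, the source collection id and the exact column semantics are assumptions here, so check them against the Batch Metadata Editing documentation linked above:
SELECT i.item_id AS "id",
       '123456789/NEW-COLLECTION-HANDLE' AS "collection"
FROM item i
JOIN collection2item c2i ON c2i.item_id = i.item_id
WHERE c2i.collection_id = 42;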

Postgresql replication without DELETE statement

We have a requirement that says we should keep a copy of all the items that were in our system at some point. The simplest way to explain it would be replication that ignores the DELETE statement (INSERT and UPDATE are ok).
Is this possible? Or maybe the better question would be: what is the best approach to tackle this kind of problem?
Make a copy/replica of the current database and use triggers via dblink from the current database to the replica. Use AFTER INSERT and AFTER UPDATE triggers to insert and update data in the replica.
So whenever a row is inserted or updated in the current database, the change is directly reflected in the replica.
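A rough sketch of that idea, assuming a hypothetical items(id integer primary key, name text) table on both sides, the dblink extension installed, and a placeholder connection string:
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE OR REPLACE FUNCTION items_push_to_replica() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        PERFORM dblink_exec(
            'host=replica dbname=archive user=repl password=secret',
            format('INSERT INTO items (id, name) VALUES (%s, %L)', NEW.id, NEW.name));
    ELSE  -- UPDATE
        PERFORM dblink_exec(
            'host=replica dbname=archive user=repl password=secret',
            format('UPDATE items SET name = %L WHERE id = %s', NEW.name, NEW.id));
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_push_to_replica_trg
AFTER INSERT OR UPDATE ON items
FOR EACH ROW EXECUTE PROCEDURE items_push_to_replica();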
I'm not sure that I understand the question completely, but I'll try to help:
First (contrary to @Sunit), I suggest avoiding triggers. Triggers introduce additional overhead and impact performance.
The solution I would use (and am actually using in a few of my projects with similar demands): don't use DELETE at all. Instead, add a boolean column called "deleted", set its default value to false, and instead of deleting a row, update this field to true. You'll also need to change your other queries (SELECT) to include something like WHERE deleted = false.
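A minimal sketch of that soft-delete approach, using a hypothetical items table:
ALTER TABLE items ADD COLUMN deleted boolean NOT NULL DEFAULT false;

-- instead of: DELETE FROM items WHERE id = 42;
UPDATE items SET deleted = true WHERE id = 42;

-- and existing reads become:
SELECT * FROM items WHERE deleted = false;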
Another option is to keep using DELETE as usual and allow deletes on both the primary and the replica, but to configure WAL archiving and store the WAL archives in a shared directory. This gives you point-in-time recovery, meaning you can restore another PostgreSQL instance to the state of your cluster at any moment in time (i.e. before the deletion). This way you have a trace of deleted records, but a pretty complicated procedure to get at them. Depending on how often deleted records will be looked up in the future (maybe they are never checked at all, but simply kept just in case), this approach may also help.

Use WAL files for PostgreSQL record version control?

I want to be able to track changes to records in a PostgreSQL database. I've considered using a version field and on-update rules or triggers such that previous versions of records are kept in the table (or in a separate table). This would have the advantage of making it possible to view the version history of a record with a simple select statement. However, I think this functionality is likely to be seldom used.
How could I satisfy the requirement of being able to construct a "version history" for a record using the WAL files? Reading the WAL and Point-in-Time recovery documentation at PostgreSQL.org has helped me understand how the state of the entire database can be rolled back to an arbitrary point in time, but not how to deal with update mistakes in particular records.
No, you cannot do this at this time. There is a large effort underway on the postgresql-hackers mailing list (the dev list) to rework WAL and build an interface to allow for logical replication in (possibly) PostgreSQL 9.3.
This is basically what you appear to be trying to do and, based on the discussions on that list, it is definitely not a trivial task.
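For reference, the trigger-based alternative mentioned in the question (keeping previous versions in a separate table) could look roughly like this sketch; the docs table and its columns are purely hypothetical:
CREATE TABLE docs (
    id integer PRIMARY KEY,
    body text,
    version integer NOT NULL DEFAULT 1
);

CREATE TABLE docs_history (
    id integer,
    body text,
    version integer,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION docs_keep_history() RETURNS trigger AS $$
BEGIN
    -- archive the previous row version on every update or delete
    INSERT INTO docs_history (id, body, version)
    VALUES (OLD.id, OLD.body, OLD.version);
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_history_trg
AFTER UPDATE OR DELETE ON docs
FOR EACH ROW EXECUTE PROCEDURE docs_keep_history();

The version history of a record is then a plain SELECT * FROM docs_history WHERE id = ... ORDER BY changed_at.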