On my TYPO3 6.2 website, some SQL tables have become quite big:
tx_realurl_urlcache: 557 MB
cf_cache_hash: 15.5 MB
tx_kesearch_stat_search: 15.4 MB
tx_kesearch_stat_word: 19.6 MB
sys_refindex: 18.1 MB
Please note that all the other tables (about 100 of them) combined add up to barely more than 15 MB, so my question is simple:
-> Which of these could I delete? Is that safe or not?
I have had bad experiences with TYPO3 database cleanup in the past, so I would rather ask for advice :)
TL;DR: the only tables you can safely clear are the cache tables, but clearing them costs you performance and they will fill up again soon anyway.
tx_realurl_urlcache - this is where realurl stores the generated URLs. If you truncate it, URL decoding may break and some URLs may become unknown, i.e. your pages break.
cf_cache_* - these can be truncated but will be rebuilt; in the meantime your server has to regenerate the cached information, so it is slower.
tx_kesearch_stat_search / tx_kesearch_stat_word - these two belong to the ke_search extension and contain the index information of your site. Truncating them will break the search until the tables are rebuilt.
sys_refindex - here TYPO3 stores the references that help you avoid deleting files or records that are still in use. (Normally this index is rebuilt with a scheduler task to keep the data consistent.)
Do not delete any of the tables themselves! You can, however, truncate some of them.
If you want to clean up the caches, just flush all TYPO3 caches in the backend, or use the 'Clear all cache' button in the TYPO3 Install Tool.
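If you prefer to do it directly in the database, here is a minimal sketch; I am assuming the default caching-framework table names, your cf_* tables may differ depending on configuration:
-- Safe to truncate: TYPO3 rebuilds cache tables on demand.
-- Each cf_* data table has a matching *_tags table; clear both.
TRUNCATE TABLE cf_cache_hash;
TRUNCATE TABLE cf_cache_hash_tags;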
I have PostgreSQL as my primary database and I would like to take advantage of Elasticsearch as a search engine for my Spring Boot application.
Problem: The queries are quite complex, and with millions of rows in each table, most of the search queries time out.
Partial solution: I used materialized views in PostgreSQL and have a job running that refreshes them every X minutes. But on systems with huge amounts of data and with other database transactions (especially writes) in progress, the views take a long time to refresh (about 10 minutes to refresh 5 views). I realized the current views are at their capacity and I cannot add more.
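For context, this is the pattern I mean (a minimal sketch; the table and view names are hypothetical):
-- Hypothetical materialized view flattening a complex search query.
CREATE MATERIALIZED VIEW product_search AS
SELECT p.id, p.name, c.name AS category
FROM products p
JOIN categories c ON c.id = p.category_id;

-- A unique index lets the view refresh without blocking readers.
CREATE UNIQUE INDEX ON product_search (id);

-- What the periodic job runs every X minutes:
REFRESH MATERIALIZED VIEW CONCURRENTLY product_search;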
That's when I started exploring other options just for search and landed on Elasticsearch, which works great with the amount of data I have. As a proof of concept, I used Logstash's JDBC input plugin, but it doesn't support the DELETE operation (bummer).
From here, soft delete is an option I cannot take, because (see the sketch after this list):
A) Almost all the tables in the PostgreSQL DB are updated every few minutes, and some of them have unique constraints on the "name" key, which with soft deletes would stay occupied until a clean-up job runs.
B) Many tables in my PostgreSQL DB are referenced with CASCADE DELETE, and it is not feasible for me to update the schema and JPA queries of 220 tables to check a soft-delete boolean.
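A minimal sketch of the clash in point A, using a hypothetical table:
-- A soft-deleted row still occupies the unique "name" key
-- until a clean-up job physically removes it.
CREATE TABLE items (
    id      serial PRIMARY KEY,
    name    text NOT NULL UNIQUE,
    deleted boolean NOT NULL DEFAULT false
);
INSERT INTO items (name) VALUES ('foo');
UPDATE items SET deleted = true WHERE name = 'foo';  -- soft delete
INSERT INTO items (name) VALUES ('foo');             -- fails: duplicate key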
The same question mentioned in the link above also suggests PgSync, which syncs PostgreSQL with Elasticsearch periodically. However, I cannot go with that either, since it is under the LGPL license, which is forbidden in our organization.
I'm starting to wonder if anyone else has encountered this strange limitation of Elasticsearch and RDBMSs.
I'm open to options other than Elasticsearch to solve my need; I just don't know what the right stack is. Any help is much appreciated!
We have an app that uses a Postgres database with about 50 tables. Each table contains about 3 million records on average. The tables get updated with new data every now and then. Now we want to implement a search feature in our app. The search needs to be performed on one table at a time (no joins needed).
I've read about Postgres full-text search support, and that looks promising. But it seems that Solr is super fast in comparison. Can I use my existing Postgres database with Solr? If tables get updated, would I need to re-index everything again?
It is definitely worth giving Solr a try. We moved many MySQL queries involving JOINs on multiple tables with sorting on different fields to Solr. We are very happy with Solr's search speed, sort speed, faceting capabilities and highly configurable text analysis/tokenization options.
If tables get updated would I need to re-index everything again?
No, you can run delta imports to only re-index your new and updated documents. See https://wiki.apache.org/solr/DataImportHandler.
Get started with https://lucene.apache.org/solr/4_1_0/tutorial.html and all the links in there.
Since nobody has leapt in, I'll answer.
I'm afraid it all depends. It depends on (at least)
how big the text is in each "document"
how flexible you want your searching to be
how much integration you need between database and text-search
how fast is fast enough
how much experience you have with both
When I've had a database that needed some text searching, I've just used PG's built-in options. If I didn't have superuser access to the DB, or was already running a big Java setup, then Solr might well have appealed.
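For illustration, a minimal sketch of PG's built-in full-text search; the table and column names are hypothetical, and the generated column needs Postgres 12+ (older versions can maintain the column with a trigger):
-- Precompute a tsvector column and index it with GIN.
ALTER TABLE articles ADD COLUMN body_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX articles_body_tsv_idx ON articles USING GIN (body_tsv);

-- Query, ranked by relevance.
SELECT id, title,
       ts_rank(body_tsv, to_tsquery('english', 'price & update')) AS rank
FROM articles
WHERE body_tsv @@ to_tsquery('english', 'price & update')
ORDER BY rank DESC;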
I am SysAdmin for a couple of large online shops and I'm researching Memcached as a possible caching solution.
The most frequently accessed queries are the ones that make up the dynamic product pages, so it would make sense to cache those. Staff regularly use an update program to load new prices into the tables. As I understand it, if I used Memcached, the changes would only become visible once the cached item expires, not as soon as my program has updated the database.
In the docs, I can see "Memcache::flush" which flushes ALL existing items, but is there a way to flush an individual object?
You can see in the docs that there is a delete command that removes a single item. There is also set, which adds or replaces a single item.
The most important part is to have a solid naming scheme for your keys. Presumably you have a CMS-type page that updates/inserts rows in your database (MySQL?). Just make sure you delete the corresponding Memcached record whenever you do an update in MySQL and you'll be fine.
I have this MySQL query:
DELETE FROM sys_log
WHERE sys_log.tstamp < UNIX_TIMESTAMP(ADDDATE(NOW(), INTERVAL -2 MONTH))
ORDER BY sys_log.tstamp ASC
LIMIT 10000
Is this good for keeping sys_log small, if I run it as a cronjob?
Yes and No
It IS NOT if you care about your record history.
You can revert changes to records (content, pages, etc.) using the sys_history table. The sys_history and sys_log tables are related: when you truncate sys_log, you also lose the ability to roll back any changes to the system. Your clients may not like that.
It IS if you only care about the sys_log size.
Truncating the table via cron is fine.
In TYPO3 4.6 and up, you can use the Table garbage collection scheduler task, as pgampe says. For TYPO3 versions below 4.5, you can use the tablecleaner extension. If you remove all records from sys_log older than [N] days, you will also retain your record history for [N] days. That seems to be the best solution to me.
And please try to fix what is filling your sys_log in the first place ;-)
There is a scheduler task for this.
It is called Table garbage collection (scheduler).
In TYPO3 4.7, it can only clean the sys_log table. Starting from TYPO3 6.0, it can also clean the sys_history table. You can configure the number of days and what tables to clean.
Extensions may register further tables to clean.
Yes, it is.
See also other suggestions by Jochen Weiland about keeping a TYPO3 installation clean and small.
Since TYPO3 9, the history is no longer stored using sys_log.
You can safely delete records from sys_log.
See Breaking Change #55298.
For versions before TYPO3 v9, sys_history referenced sys_log, so:
if you delete records from sys_log, you should make sure sys_history does not reference the records you want to delete, or delete those records as well, if intended (see the example DB query below)
For versions before v9 (to delete only records in sys_log which are not referenced by sys_history):
DELETE FROM sys_log WHERE NOT EXISTS
(SELECT * FROM sys_history WHERE sys_history.sys_log_uid=sys_log.uid)
AND recuid=0 AND tstamp < $timestamp LIMIT $limit
Feel free to optimize this for your requirements.
What you can also do safely (without affecting sys_history) is deleting records with sys_log.error != 0.
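For example (a minimal sketch; add a LIMIT clause if the table is large):
-- Error entries (error != 0) are not referenced by sys_history.
DELETE FROM sys_log WHERE error != 0;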
Some more recommendations:
Set your debugging level to verbose (Warnings) on development but errors-only in production
Regularly look at the sys log and eliminate problems. You can delete a specific error from sys_log once you have taken care of the problem (see sys_log.error != 0, sys_log.details). You can do this with a database command, or on newer TYPO3 versions use the "SYSTEM: log" module in the backend and its "Delete similar errors" button.
On a major version upgrade, you can also consider truncating sys_log and sys_history together, combined with running the lowlevel cleaner and deleting records with deleted=1. Be sure to talk to someone close to the editors first, though, as this will remove the entire history. Be sure that you really want to do that.
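A minimal sketch of that clean-up; as said above, this removes the entire history:
-- WARNING: irreversibly removes the record history and rollback ability.
TRUNCATE TABLE sys_log;
TRUNCATE TABLE sys_history;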
For the scheduler task "Table garbage collection" see the documentation: https://docs.typo3.org/c/typo3/cms-scheduler/master/en-us/Installation/BaseTasks/Index.html
Another common cause of large sys_log tables is issues/errors in one of the extensions used in the TYPO3 installation.
A common example when an old version of tx_solr is used:
Core: Error handler (FE): PHP Warning: Invalid argument supplied for foreach() in typo3conf/ext/solr/classes/class.tx_solr_util.php
Core: Error handler (FE): PHP Warning: array_reverse() expects parameter 1 to be array, null given in typo3conf/ext/solr/classes/class.tx_solr_util.php line 280
This set of records will pop up in sys_log every minute or so, which leads to millions of records in a short period of time.
Luckily, these kinds of records have no effect on the record history in sys_history and the associated rollback functionality, so it is safe to delete them.
If you have a large sys_log this will likely cause issues with LOCK timeouts, so you'll have to limit the delete query:
DELETE FROM sys_log WHERE details LIKE 'Core:%' LIMIT 200000;
I am considering log-shipping of Write-Ahead Logs (WAL) in PostgreSQL to create a warm-standby database. However, I have one table in the database that receives a huge number of INSERTs/DELETEs each day, and I don't care about protecting its data. To reduce the amount of WAL produced, I was wondering: is there a way to prevent any activity on one table from being recorded in the WAL?
Ran across this old question, which now has a better answer. Postgres 9.1 introduced "Unlogged Tables", which are tables that don't log their DML changes to WAL. See the docs for more info, but at least now there is a solution for this problem.
See Waiting for 9.1 - UNLOGGED tables by depesz, and the 9.1 docs.
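A minimal example (hypothetical table name):
-- DML on an unlogged table is not written to the WAL; the table is
-- truncated after a crash and is not replicated to standby servers.
CREATE UNLOGGED TABLE high_churn_data (
    id      bigserial PRIMARY KEY,
    payload text
);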
Unfortunately, I don't believe there is. The WAL logging operates on the page level, which is much lower than the table level and doesn't even know which page holds data from which table. In fact, the WAL files don't even know which pages belong to which database.
You might consider moving your high activity table to a completely different instance of PostgreSQL. This seems drastic, but I can't think of another way off the top of my head to avoid having that activity show up in your WAL files.
To offer one option to my own question: there are temp tables - "temporary tables are automatically dropped at the end of a session, or optionally at the end of the current transaction (see ON COMMIT below)" - which I think don't generate WAL. Even so, this might not be ideal, as the table creation and design will have to live in the code.
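A sketch of that approach (hypothetical table name):
-- Temporary tables are not WAL-logged; this one is emptied at every COMMIT.
CREATE TEMPORARY TABLE session_scratch (
    id      bigserial,
    payload text
) ON COMMIT DELETE ROWS;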
I'd consider memcached for use-cases like this. You can even spread the load over a bunch of cheap machines too.