I need to move a large number of items between two collections. I tried changing the tables "item" and "collection2item" directly in the database (columns "owning_collection" and "item_id", respectively). Then I restarted Tomcat, cleared the Cocoon cache, and rebuilt the index, but it's still not working.
Is the metadata-export/metadata-import process safer or easier than the above for a mass move of items?
What else can I do?
Your process should be ok if you run the reindex with the -bf flags (just -f may be enough too).
Without the -f flag, the reindex (link goes to code as of DSpace 5.x) will check the last_modified value (in the item table) and only reindex items whose value in that column has changed since the last reindex. This also means that a reindex without -f should work if you also updated the last_modified timestamp.
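For example, a hedged sketch of that timestamp update, assuming the PostgreSQL schema from the question and using 123 as a placeholder for the target collection id:

-- touch last_modified so a reindex without -f picks up the moved items
UPDATE item
SET last_modified = NOW()
WHERE owning_collection = 123;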
Still not working?
If the reindex still doesn't happen, something else must be going wrong. Check your dspace.log: are there any entries that look like "wrote item xyz to index"? If not, the items aren't being reindexed. Are there any error messages in dspace.log around the time you run the reindex? Any error messages in the Solr log file?
Also, make sure you always run the reindex (and all other dspace commands) as the same user that tomcat is running under, to avoid permissions problems. If you've ever run the commands as a different user, change the permissions of the solr data directory (probably [dspace]/solr/search/data) so that the tomcat user can create/write/delete files in it.
Overall recommendation
In most cases I'd go with batch metadata editing myself for moving items between collections; it avoids all these problems and triggers a re-index of the affected items automatically.
The metadata import process is very reliable. It also provides a preview option that will allow you to see the changes before they are applied. After the items are updated, the proper re-indexing processes will run.
You only need to provide the item ids and the data fields you wish to edit.
If you prefer to build your CSV file by hand or from a SQL query, that will work as well. The name of the column at the top of your CSV will determine the fields to be updated.
https://wiki.duraspace.org/display/DSDOC5x/Batch+Metadata+Editing#BatchMetadataEditing-CSVFormat
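For illustration, a minimal hand-built CSV might look like the sketch below; the item ids and the target collection handle are placeholders, and whether the "collection" column can be used to move items depends on your DSpace version (see the CSV format page above):

id,collection
101,123456789/42
102,123456789/42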
Related
We have a requirement that says we should keep a copy of every item that was in our system at some point. The simplest way to explain it would be replication, but ignoring DELETE statements (INSERT and UPDATE are OK).
Is this possible? Or maybe the better question is: what is the best approach to tackle this kind of problem?
Make a copy/replica of the current database and use triggers via dblink from the current database to the replica. Use AFTER INSERT and AFTER UPDATE triggers to insert and update the data in the replica.
That way, whenever a row is inserted or updated in the current database, the change is reflected directly in the replica.
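A rough sketch of what that could look like, assuming PostgreSQL with the dblink extension; the items(id, payload) table, the connection string, and the ON CONFLICT upsert (PostgreSQL 9.5+) are illustrative assumptions, not part of the question:

CREATE EXTENSION IF NOT EXISTS dblink;

-- push every insert/update to the replica over dblink
CREATE OR REPLACE FUNCTION mirror_items() RETURNS trigger AS $$
BEGIN
  PERFORM dblink_exec(
    'host=replica dbname=archive user=repl password=secret',
    format('INSERT INTO items (id, payload) VALUES (%s, %L)
            ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload',
           NEW.id, NEW.payload));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_mirror
AFTER INSERT OR UPDATE ON items
FOR EACH ROW EXECUTE PROCEDURE mirror_items();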
I'm not sure that I understand the question completely, but I'll try to help:
First (contrary to #Sunit), I suggest avoiding triggers. Triggers introduce additional overhead and impact performance.
The solution I would use (and am actually using in a few of my projects with similar demands) is to not use DELETE at all. Instead, add a bit (boolean) column called "Deleted" with a default value of 0 (false), and instead of deleting a row, update this field to 1 (true). You'll also need to change your other queries (SELECTs) to include something like "WHERE Deleted = 0".
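A minimal sketch of the idea, assuming PostgreSQL and a hypothetical orders table; with a boolean column the checks read as true/false rather than 1/0:

-- one-time schema change
ALTER TABLE orders ADD COLUMN deleted boolean NOT NULL DEFAULT false;

-- instead of: DELETE FROM orders WHERE id = 42;
UPDATE orders SET deleted = true WHERE id = 42;

-- existing reads gain a filter
SELECT * FROM orders WHERE deleted = false;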
Another option is to continue using DELETE as usual and to allow deleting records from both the primary and the replica, but to configure WAL archiving and store the WAL archives in a shared directory. This gives you point-in-time recovery, meaning you'll be able to restore another PostgreSQL instance to the state of your cluster at any moment in time (i.e. before the deletion). This way you'll have a trace of deleted records, but a fairly complicated procedure for reaching them. Depending on how often the deleted records will be checked in the future (maybe they are never checked at all, but simply kept for just-in-case tracking), this approach may also help.
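A hedged sketch of the relevant postgresql.conf settings; the archive directory is a placeholder, and on versions before 9.6 you would use wal_level = archive instead of replica:

wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'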
I have a web application using Entity Framework and an Azure SQL Database. I would like to know whether deleting a row in the database removes the information permanently, or simply marks it as deleted so that it can still be accessed if needed.
db.MyTable.Remove(objectInstance);
db.SaveChanges();
Is this something that can be configured, or do I need to implement this feature myself by adding a deleted attribute?
The reason I want this is to be able to perform analytics that include objects which might already have been deleted.
EF actually has nothing to do with this. Whether records are deleted permanently or not is up to the RDBMS; EF is just an ORM on top of it.
Options IMO:
You manage the records marked as deleted using an extra column
You can move the deleted records to another table or file, whichever is more convenient for you to run analytics on. That way your queries touch fewer records and run faster (see the sketch after this list).
You can go through the log files and execute the INSERTs again to get the deleted records.
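For the second option, a minimal T-SQL sketch; the MyTableArchive table and the Id key are hypothetical and assumed to mirror MyTable's columns:

-- copy the row into the archive, then remove it from the live table
INSERT INTO MyTableArchive
SELECT * FROM MyTable WHERE Id = @id;

DELETE FROM MyTable WHERE Id = @id;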
Hope these suggestions point you in the right direction.
I am SysAdmin for a couple of large online shops and I'm researching Memcached as a possible caching solution.
The most frequently accessed queries are the ones that build the dynamic product pages, so it would make sense to cache them. Staff regularly use an update program to load new prices into the tables. As I understand it, if I used Memcached the changes would only become visible after the cache expires, not as soon as my program has updated the tables.
In the docs, I can see "Memcache::flush" which flushes ALL existing items, but is there a way to flush an individual object?
As you can see in the docs, there is a delete command that removes a single item. There is also set, which adds or replaces a single item.
The most important part is to have a solid naming scheme for your keys. Presumably you have a CMS-type page to update/insert rows in your database (MySQL?). Just make sure you delete the memcache record whenever you do an update in MySQL and you'll be fine.
I have this MySQL query:
DELETE FROM sys_log
WHERE sys_log.tstamp < UNIX_TIMESTAMP(ADDDATE(NOW(), INTERVAL -2 MONTH))
ORDER BY sys_log.tstamp ASC
LIMIT 10000
Is this a good way to keep sys_log small if I run it as a cronjob?
Yes and No
It IS NOT if you care about your record history.
You can revert changes to records (content, pages, etc.) using the sys_history table. The sys_history and sys_log tables are related. When you truncate sys_log, you also lose the ability to roll back any changes to the system. Your clients may not like that.
It IS if you only care about the sys_log size.
Truncating the table via cron is fine.
In TYPO3 4.6 and up you can use the Table garbage collection scheduler task, as pgampe says. For TYPO3 versions below 4.5 you can use the tablecleaner extension. If you remove all records from sys_log older than [N] days, you will also retain your record history for [N] days. That seems to be the best solution to me.
And please try to fix what is filling your sys_log in the first place ;-)
There is a scheduler task for this.
It is called Table garbage collection (scheduler).
In TYPO3 4.7, it can only clean the sys_log table. Starting from TYPO3 6.0, it can also clean the sys_history table. You can configure the number of days and what tables to clean.
Extensions may register further tables to clean.
Yes, it is.
See also the other suggestions by Jochen Weiland about keeping a TYPO3 installation clean and small.
Since TYPO3 9, the history is no longer stored using sys_log.
You can safely delete records from sys_log.
See Breaking Change #55298.
For versions before TYPO3 v9, sys_history referenced sys_log, so:
if you delete records from sys_log, you should make sure sys_history does not reference the records you want to delete, or delete those as well if that is intended (see the example DB queries below)
For versions before v9 (to delete only records in sys_log which are not referenced by sys_history):
DELETE FROM sys_log WHERE NOT EXISTS
(SELECT * FROM sys_history WHERE sys_history.sys_log_uid=sys_log.uid)
AND recuid=0 AND tstamp < $timestamp LIMIT $limit
Feel free to optimize this for your requirements.
What you can also do safely (without affecting sys_history) is deleting records with sys_log.error != 0.
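For example, assuming MySQL; the LIMIT keeps individual delete statements small on a large table, and the value is arbitrary:

DELETE FROM sys_log WHERE error != 0 LIMIT 50000;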
Some more recommendations:
Set your debugging level to verbose (Warnings) on development but errors-only in production
Regularly look at the sys_log and eliminate the causes of the problems. Once you have taken care of a problem, you can delete the corresponding errors from sys_log (see sys_log.error != 0 and sys_log.details), either with a database command or, on newer TYPO3 versions, via the "SYSTEM: log" module in the backend and its "Delete similar errors" button.
On a major version upgrade you can also consider truncating sys_log and sys_history, together with using the lowlevel cleaner to delete records with deleted=1. Be sure to talk to someone close to the editors first, though, as this will remove the entire history; make sure you really want to do that.
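The truncation itself is just the following (MySQL syntax; again, this removes the record history completely):

TRUNCATE TABLE sys_log;
TRUNCATE TABLE sys_history;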
For the scheduler task "Table garbage collection" see the documentation: https://docs.typo3.org/c/typo3/cms-scheduler/master/en-us/Installation/BaseTasks/Index.html
Another common cause for large sys_log tables are issues/errors in one of the extensions used in the TYPO3 installation.
A common example when an old version of tx_solr is used:
Core: Error handler (FE): PHP Warning: Invalid argument supplied for foreach() in typo3conf/ext/solr/classes/class.tx_solr_util.php
Core: Error handler (FE): PHP Warning: array_reverse() expects parameter 1 to be array, null given in typo3conf/ext/solr/classes/class.tx_solr_util.php line 280
This set of records will pop up in sys_log every minute or so, which leads to millions of records in a short period of time.
Luckily, these kinds of records have no effect on the record history in sys_history or the associated rollback functionality, so it is safe to delete them.
If you have a large sys_log this will likely cause issues with LOCK timeouts, so you'll have to limit the delete query:
delete from sys_log where details LIKE 'Core:%' LIMIT 200000;
I created a set of partitioned tables in Postgres and started inserting a lot of rows via the master table. When the load process blew up on me, I realized I should have declared the id column BIGSERIAL (BIGINT with a sequence behind the scenes), but I had inadvertently set it to SERIAL (INTEGER). Now that I have a couple of billion rows loaded, I am trying to ALTER the column to BIGINT. The process seems to be working, but it is taking a long time, so in reality I don't know whether it is working or hung. I'd rather not restart the entire load process.
Any suggestions?
When you update a row to alter it in PostgreSQL, it writes out a new copy of the row and later does some cleanup to remove the original. This means that trying to fix the problem with updates can take longer than just loading all the data in from scratch again: it's more disk I/O than loading a fresh copy, plus some extra processing time. The only situation where you'd want to do an update instead of a reload is when the original load was very inefficient, for example if a slow client program was inserting the data and was the bottleneck in the process.
To figure out whether the process is still working, see if it's using CPU when you run top (UNIX-ish systems) or the Task Manager (Windows). On Linux, "top -c" will even show you what the PostgreSQL client processes are doing. Most likely it is still running rather than hung; you probably just expected it to take less time than the original load, which it won't.
Restart it (clarifying edit: restart the entire load process again).
Altering a column value requires a new row version, and all indexes pointing to the old version must be updated to point to the new one.
Additionally, see how much of the advice on populating databases you can follow.
Correction from #archnid:
Altering the type of the column will trigger a table rewrite, so the row versioning isn't a big problem, but it will still take lots of disk space temporarily. You can usually monitor progress by looking at which files in the database directory are being appended to...
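For reference, a hedged sketch of the type change plus a couple of non-blocking progress checks; the table name measurements is a placeholder, and the pg_stat_activity column names assume PostgreSQL 9.2 or later:

-- widen the column; with inheritance-based partitions this cascades
-- to the child tables by default
ALTER TABLE measurements ALTER COLUMN id TYPE bigint;

-- from another session: is the ALTER still running, and for how long?
SELECT pid, state, now() - xact_start AS runtime
FROM pg_stat_activity
WHERE query ILIKE 'ALTER TABLE%';

-- the database grows while the rewrite is in progress and shrinks again
-- once it finishes
SELECT pg_size_pretty(pg_database_size(current_database()));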