I have set up 4 CRON jobs to automatically reindex my Sphinx indexes as below:
*/5 * * * * /usr/bin/pgrep indexer || time /usr/local/sphinx/bin/indexer --rotate --config /usr/local/sphinx/etc/sphinx.conf ripples_delta
*/5 * * * * /usr/bin/pgrep indexer || time /usr/local/sphinx/bin/indexer --rotate --config /usr/local/sphinx/etc/sphinx.conf users_delta
30 23 * * * /usr/bin/pgrep indexer || time /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --merge users users_delta --merge-dst-range deleted 0 0 --rotate
0 0 * * * /usr/bin/pgrep indexer || time /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --merge ripples ripples_delta --merge-dst-range deleted 0 0 --rotate
The above shows pgrep, which I hope is being used in every instance to check to see if indexer is already running. My intention here is to prevent any potentially resource hungry overlaps.
The first two Cron jobs run every 5 minutes and update the Delta indexes for my two main indexes.
The second two run once per day (one at 11:30pm and the other at 12am) and merge the delta indexes into their main counterparts.
My understanding is that following these index merges I need to re-run the indexer on the deltas in order to remove all of the previously merged data and essentially clean them up, ready for the next day's indexing.
How can I ensure that this happens automatically upon completion of the merges? Obviously I could just add two more cron jobs but I need them to take place immediately after the relevant merge has finished.
Thanks in advance.
Another related issue: you should do
*/6 ... indexer --rotate users_delta ripples_delta
ie update both in one command. Then indexer builds both indexes, then performs the rotation.
With two parallel processes, the two rotations could end up stepping on each other.
(Also, with the pgrep check, the second of the two delta updates is unlikely to ever run: the first one will almost always have just been started.)
Also change the merge jobs to say
34 23 * ...
ie rather than "30", which would mean running at exactly the same time as the delta job. The delta is likely to have already started, so thanks to the pgrep check the merge would never run.
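Putting those two suggestions together, the crontab might end up looking something like this (paths taken from the question; the exact minute offsets are only an example):

# both deltas in a single indexer run every 5 minutes
*/5 * * * * /usr/bin/pgrep indexer || /usr/local/sphinx/bin/indexer --rotate --config /usr/local/sphinx/etc/sphinx.conf ripples_delta users_delta
# merges offset from the delta schedule so the pgrep guard does not skip them
34 23 * * * /usr/bin/pgrep indexer || /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --merge users users_delta --merge-dst-range deleted 0 0 --rotate
4 0 * * * /usr/bin/pgrep indexer || /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --merge ripples ripples_delta --merge-dst-range deleted 0 0 --rotate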
Perhaps an even better way is to create a small 'indexing' daemon.
eg
<?php
// Simple indexing daemon: loop forever, deciding each pass whether to
// merge a delta into its main index (once a day) or just refresh the deltas.
while (1) {
    if (filemtime('path_to_/ripples.sph') < time()-(24*3600)) {
        // main ripples index is over a day old: refresh the delta, merge it in,
        // bump the counter, then rebuild the (now empty) delta
        `indexer --rotate ripples_delta`;
        sleep(10);
        `indexer --merge ripples ripples_delta --rotate`;
        mysql_query("UPDATE sph_counter ... ");
        `indexer --rotate ripples_delta`;
    } elseif (filemtime('path_to_/users.sph') < time()-(24*3600)) {
        // same procedure for the users index
        `indexer --rotate users_delta`;
        sleep(10);
        `indexer --merge users users_delta --rotate`;
        mysql_query("UPDATE sph_counter ... ");
        `indexer --rotate users_delta`;
    } else {
        // nothing due for a merge: just refresh both deltas together
        `indexer --rotate ripples_delta users_delta`;
    }
    sleep(5*60);
    clearstatcache(); // so the next filemtime() calls are not served from cache
}
This way, you just leave this script running indefinitely (I've used screen for this, but a more robust solution is something like monit).
It makes sure that only one indexer process is ever running at a time and takes care of all the actions. If the indexing takes longer, it simply maintains a gap of 5 minutes between runs.
To be really clever, you could run a MySQL query to check whether the ripples or users tables have any updates, and not even bother running indexer if not.
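For instance, such a check might look like this from the shell (the sph_counter columns here are assumptions based on the usual main+delta counter table, so adjust them to your schema):

# count rows added since the last merge; skip the delta build when there are none
NEW_ROWS=$(mysql -N mydb -e "SELECT COUNT(*) FROM ripples WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)")
[ "$NEW_ROWS" -gt 0 ] && /usr/local/sphinx/bin/indexer --rotate --config /usr/local/sphinx/etc/sphinx.conf ripples_delta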
Create a small shell script that

1) indexes the delta
2) merges the delta back into the main
3) updates the database to update the counter flag (the main has changed, so deltas need to use the new counter)
4) reindexes the delta again

Being a shell script ensures they run in sequence.
Technically you could also leave out step 1), as the */5 job will always have run recently anyway.
You need a script to run step 3) anyway; Sphinx can't do that for you: http://sphinxsearch.com/bugs/view.php?id=517
A rough sketch of such a script follows.
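The sketch below assumes the paths from the question; the sph_counter statement is only a placeholder (point it at whatever counter table your sphinx.conf sources use), and MySQL credentials are assumed to come from ~/.my.cnf:

#!/bin/sh
set -e
INDEXER=/usr/local/sphinx/bin/indexer
CONF=/usr/local/sphinx/etc/sphinx.conf

# 1) index the delta (optional - the */5 cron job has usually just done this)
$INDEXER --rotate --config $CONF ripples_delta

# 2) merge the delta back into the main index
$INDEXER --config $CONF --merge ripples ripples_delta --merge-dst-range deleted 0 0 --rotate

# 3) bump the counter flag so future delta builds start from the new position
#    (table/column names are hypothetical)
mysql mydb -e "UPDATE sph_counter SET max_doc_id = (SELECT MAX(id) FROM ripples) WHERE counter_id = 1"

# 4) reindex the delta so it no longer contains the merged rows
$INDEXER --rotate --config $CONF ripples_delta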
For any periodic task I would suggest creating a lock file at the beginning of the script to avoid re-entrance, and checking whether it already exists at script start.
A sample script wrapper (which could be used for periodic MySQL backups as well) is here: http://astellar.com/2012/10/backups-running-at-the-same-time/
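A minimal wrapper in that spirit might look like this (the wrapped script path is just a placeholder; flock(1) is a more race-proof alternative to the manual check):

#!/bin/sh
LOCKFILE=/tmp/sphinx-reindex.lock

# bail out if a previous run is still in progress
if [ -e "$LOCKFILE" ]; then
    echo "previous run still in progress, exiting" >&2
    exit 0
fi
touch "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT

/path/to/reindex-script.sh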
On DB2 for z/OS version 10, an errant utility left MANY indexspaces in a status of "RW,RBDP" within a specific database. I can successfully use the REBUILD INDEXSPACE command to fix them one-by-one, but there are A LOT of them. So, I was hoping for some kind of wild-card or *ALL option, but that doesn't yet work for me.
Is there a way to do the equivalent of the following?
REBUILD INDEXSPACE (MYDB.*)
Thanks in advance!
You can't do an entire database at a time, but you could do some queries along with a LISTDEF to get a similar result.
First, find the indexes in question:
SELECT ' INCLUDE INDEX ' || RTRIM(CREATOR) || '.' || RTRIM(NAME)
FROM SYSIBM.SYSINDEXES
WHERE DBNAME = 'MYDB'
That will give you the list of indexes related to that database. Then, you can take the results as part of a larger LISTDEF. Here's some example JCL (honestly, I'm not sure how much of this is specific to my shop, so there may be some changes required):
//*****************************************************
//* RUN REBUILD INDEX UTILITY
//*****************************************************
//IXRBREST EXEC PGM=IEFBR14 DUMMY STEP FOR RESTART
//IXRBUTIL EXEC DB2UPROC,SYSTEM=DB2T,COND=(4,LT)
//STEPLIB DD DSN=DB2.PROD.SDSNLOAD,DISP=SHR
//DB2UPROC.SYSIN DD *
LISTDEF INDEXES
<insert generated list here>
REBUILD INDEX LIST INDEXES
SORTKEYS SORTDEVT SYSDA SHRLEVEL CHANGE
STATISTICS REPORT YES UPDATE ALL
MAXRO 240 LONGLOG CONTINUE DELAY 900 TIMEOUT TERM
DRAIN_WAIT 50 RETRY 6 RETRY_DELAY 30
That should get you the indexes that need rebuilding. If there are some that need to be rebuilt, and some that are fine, you could add SCOPE PENDING to the REBUILD INDEX utility, and it will only rebuild those in a pending state.
The alternative to REBUILD is to do a DROP/CREATE. This is necessary when the datasets are missing or damaged. Unless you specify DEFER on index create, you get a REBUILD for free. DB2 extracts the index from the tablespace VSAM Linear in all cases.
I am running a function in Postgres which takes 1123+ ms to execute.
The function calls other functions and has many queries to execute. How can I know which query is the culprit for the slow execution of the function?
I have seen that select * from pg_stat_activity; gives the output of the currently running processes.
Is it possible to get the individual query times while the Postgres function is running?
I know many will say to log the query times in the database with inserts, but is there any method in Postgres that gives me the time taken by each query?
Also, is there any way to do it without changing the Postgres config file? I don't want to restart Postgres. If not, other solutions are most welcome.
Thanks.
You can do this via the pg_stat_statements extension, though loading it does require a server restart.
After installing it, just SET pg_stat_statements.track = all (as a superuser), call your function, then SELECT * FROM pg_stat_statements.
Unless you have exclusive access to the server, you probably want to include a default of pg_stat_statements.track = none in your postgresql.conf, so that only your session is tracked.
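For illustration, the whole sequence could be run in a single psql session along these lines (my_func is a placeholder for the slow function, and pg_stat_statements must already be in shared_preload_libraries):

psql -d your_database <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SET pg_stat_statements.track = all;   -- also track statements nested inside functions
SELECT pg_stat_statements_reset();    -- start from a clean slate
SELECT my_func();                     -- the slow function under investigation
SELECT query, calls, total_time       -- total_exec_time on PostgreSQL 13 and later
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
SQL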
I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right? I'm surprised to not find much discussion of this on the web.
How should I go about doing this?
I think the top half of the answer here is what I'm looking for: https://stackoverflow.com/a/23963969/168143
Are there any dangers to this? If updating the view fails, will the transaction on the invoking update, insert, etc. be rolled back? (this is what I want... I think)
I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right?
Yes, PostgreSQL by itself will never call it automatically, you need to do it some way.
How should I go about doing this?
Many ways to achieve this. Before giving some examples, keep in mind that the REFRESH MATERIALIZED VIEW command locks the view in ACCESS EXCLUSIVE mode, so while it is working you can't even SELECT from the materialized view.
However, if you are on version 9.4 or newer, you can give it the CONCURRENTLY option:
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
This acquires an EXCLUSIVE lock and does not block SELECT queries, but it may have a bigger overhead (it depends on the amount of data changed; if few rows have changed, it might be faster). You still can't run two REFRESH commands on the same view concurrently, though.
Refresh manually
It is an option to consider. Especially in cases of data loading or batch updates (e.g. a system that only loads tons of information/data after long periods of time), it is common to have operations at the end to modify or process the data, so you can simply include a REFRESH operation at the end of it.
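For instance, a nightly load script might end with the refresh (table, file and view names here are made up):

psql -d your_database <<'SQL'
\copy my_table FROM '/path/to/todays_batch.csv' WITH (FORMAT csv)
REFRESH MATERIALIZED VIEW my_mv;   -- refresh once, at the end of the batch
SQL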
Scheduling the REFRESH operation
The first and widely used option is to use some scheduling system to invoke the refresh; for instance, you could configure something like this in a cron job:
*/30 * * * * psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv"
And then your materialized view will be refreshed every 30 minutes.
Considerations
This option is really good, especially with the CONCURRENTLY option, but only if you can accept the data not being 100% up to date all the time. Keep in mind that, with or without CONCURRENTLY, the REFRESH command still needs to run the entire underlying query, so you have to take the time that query needs into account when choosing the REFRESH schedule.
Refreshing with a trigger
Another option is to call the REFRESH MATERIALIZED VIEW in a trigger function, like this:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
RETURN NULL;
END;
$$;
Then, on any table whose changes should be reflected in the view, you do:
CREATE TRIGGER tg_refresh_my_mv AFTER INSERT OR UPDATE OR DELETE
ON table_name
FOR EACH STATEMENT EXECUTE PROCEDURE tg_refresh_my_mv();
Considerations
It has some critical pitfalls for performance and concurrency:
Any INSERT/UPDATE/DELETE operation will have to execute the query (which is possibly slow if you are considering a MV);
Even with CONCURRENTLY, one REFRESH still blocks another one, so any INSERT/UPDATE/DELETE on the involved tables will be serialized.
The only situation I can think of where this is a good idea is if the changes are really rare.
Refresh using LISTEN/NOTIFY
The problem with the previous option is that it is synchronous and imposes a big overhead on each operation. To ameliorate that, you can use a trigger like before, but one that only issues a NOTIFY:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, 'my_mv';
RETURN NULL;
END;
$$;
Then you can build an application that keeps a connection open and uses LISTEN to identify when a REFRESH is needed. One nice project you can use to test this is pgsidekick; with it you can LISTEN from a shell script, so you can schedule the REFRESH as:
pglisten --listen=refresh_mv --print0 | xargs -0 -n1 -I? psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY ?;"
Or use pglater (also part of pgsidekick) to make sure you don't call REFRESH too often. For example, you can use the following trigger function to request a REFRESH, but no more often than once per minute (60 seconds):
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, '60 REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv';
RETURN NULL;
END;
$$;
So it will not call REFRESH more often than 60 seconds apart, and if you NOTIFY many times within those 60 seconds, the REFRESH will be triggered only once.
Considerations
As with the cron option, this one is only good if you can bear a little stale data, but it has the advantage that the REFRESH is called only when really needed, so there is less overhead and the data is updated closer to when it is needed.
OBS: I haven't really tried the code and examples yet, so if someone finds a mistake or typo, or tries it and it works (or not), please let me know.
Now there is a PostgreSQL extension to keep materialized views updated: pg_ivm.
It only computes and applies the incremental changes, rather than recomputing the contents fully as REFRESH MATERIALIZED VIEW does. It has 2 approaches, IMMEDIATE and DEFERRED:
For IMMEDIATE, the views are updated in the same transaction that its base table is modified.
For DEFERRED, the views are updated after the transaction is committed.
Version 1.0 was released on 2022-04-28.
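For example, creating an incrementally maintained view looks roughly like this (table and view names are placeholders; check the pg_ivm documentation for the current API):

psql -d your_database <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_ivm;
-- create an incrementally maintained materialized view (IMMV);
-- it is then kept up to date automatically as the base table changes
SELECT create_immv('my_imv', 'SELECT id, name FROM my_table');
SQL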
Let me point out three things about pglater, the tool from the previous answer by MatheusOl.
First, the long_options array should include a "{0, 0, 0, 0}" element as its last entry, as pointed out at https://linux.die.net/man/3/getopt_long by the phrase "The last element of the array has to be filled with zeros." So it should read:
static struct option long_options[] = {
//......
{"help", no_argument, NULL, '?'},
{0, 0, 0, 0}
};
Second, on the malloc/free issue: one free() (for char *listen = malloc(...);) is missing. In any case, malloc caused the pglater process to crash on CentOS (but not on Ubuntu; I don't know why). So I recommend using char arrays and assigning the array names to the char pointers (both char * and char **). You may need to force a type conversion when you do that (the pointer assignment).
char block4[100];
...
password_prompt = block4;
...
char block1[500];
const char **keywords = (const char **)&block1;
...
char block3[300];
char *listen = block3;
sprintf(listen, "listen %s", id);
PQfreemem(id);
res = PQexec(db, listen);
Third, use the table below to calculate the timeout, where md is the mature_duration, i.e. the time difference between the latest refresh (lr) time point and the current time:
when md >= callback_delay (cd)    ==> timeout: 0
when md + PING_INTERVAL >= cd     ==> timeout: cd - md [= cd - (now - lr)]
when md + PING_INTERVAL < cd      ==> timeout: PING_INTERVAL
To implement this algorithm (the 3rd point), you should initialize 'lr' as follows:
res = PQexec(db, command);
latest_refresh = time(0);
if (PQresultStatus(res) == PGRES_COMMAND_OK) {
I have a cron job that runs every 2 minutes; it takes 10 records from a Postgres table, works on them, and then sets a flag when it is finished. I want to make sure that if the first run takes more than 2 minutes, the next one will work on different data in the DB, not on the same data.
Is there any way to handle this case?
This can be solved using a Database Transaction.
BEGIN;
SELECT
id,status,server
FROM
table_log
WHERE
(direction = '2' AND status_log = '1')
LIMIT 100
FOR UPDATE SKIP LOCKED;
What are we doing?
We are selecting all available rows, i.e. rows not locked by other cron jobs that might be running, and selecting them FOR UPDATE. This means the query grabs only unlocked rows, and all of its results stay locked for this cron job only.
How do I update my locked rows?
Simply use a for loop in your processing language (Python, Ruby, PHP) and concatenate the UPDATE statements; remember we are building one single batch to send.
UPDATE table_log SET status_log = '6' ,server = '1' WHERE id = '1';
Finally we use
COMMIT;
And all the locked rows will be updated and released. This prevents other queries from touching the same data at the same time. Hope it helps.
Turn your "finished" flag from binary to ternary ("needs work", "in process", "finished"). You also might want to store the pid of the "in process" process, in case it dies and you need to clean it up, and a timestamp for when it started.
Or use a queueing system that someone already wrote and debugged for you.
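If you do roll it yourself, claiming a batch with the ternary flag can be done in one statement; a sketch (the table and column names are made up, and SKIP LOCKED needs PostgreSQL 9.5+):

psql -d your_database <<'SQL'
UPDATE work_queue
SET    status = 'in process',
       worker_pid = 12345,          -- the claiming process's own pid
       started_at = now()
WHERE  id IN (
         SELECT id FROM work_queue
         WHERE  status = 'needs work'
         ORDER  BY id
         LIMIT  10
         FOR UPDATE SKIP LOCKED
       )
RETURNING id;
SQL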
I need to write a trigger that will check a table column to see if data is there or not. The trigger needs to run all the time and log a message every hour.
Basically it will run a select statement; if a result is found it sleeps for an hour, otherwise it logs a message and then sleeps for an hour.
What you want is a scheduled job. With pgAgent (http://www.pgadmin.org/docs/1.4/pgagent.html) you can create an hourly job that checks for that row and then logs as required.
Edit to add:
Curious if you've considered writing a SQL script that generates the log on the fly by reading the table instead of a job. If you have a timestamp field, it is quite possible to have a script that returns all hourly periods that don't have a corresponding entry within that time frame (assuming the time stamp isn't updated). Why store a second log when you can generate it directly against the data?
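A rough sketch of that idea (table and column names are made up; it lists the hourly periods over the last day with no matching row):

psql -d your_database <<'SQL'
SELECT g.hour_start
FROM   generate_series(date_trunc('hour', now()) - interval '24 hours',
                       date_trunc('hour', now()),
                       interval '1 hour') AS g(hour_start)
WHERE  NOT EXISTS (
         SELECT 1 FROM your_table t
         WHERE  t.created_at >= g.hour_start
         AND    t.created_at <  g.hour_start + interval '1 hour'
       );
SQL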
Triggers (in pg and in every DBMS I know) execute before or after events such as insert, update or delete. What you probably want here is a script launched via something like cron (if you are using a unix system) every hour, redirecting its output to the log file.
I have used something like this many times, and it looked like this (written in Python):
#!/usr/bin/python
import sys
import logging
import psycopg2

# connection parameters -- placeholders, fill in for your environment
dbname, user, hostname, passwd = "mydb", "me", "localhost", "secret"

try:
    conn = psycopg2.connect("dbname='"+dbname+"' user='"+user+"' host='"+hostname+"' password='"+passwd+"'")
except:
    # Get the most recent exception
    exceptionType, exceptionValue, exceptionTraceback = sys.exc_info()
    # Exit the script and print an error telling what happened.
    logging.debug("I am unable to connect to the database!\n ->%s" % (exceptionValue))
    exit(2)

cur = conn.cursor()
query = "SELECT whatever FROM wherever WHERE yourconditions"

try:
    cur.execute(query)
except psycopg2.ProgrammingError:
    print "Programming error, no result produced"

result = cur.fetchone()
if result is None:
    # do whatever you need; if result is None the data is not in your table column
    pass
I used to launch my script via cron every 10 minutes; you can easily configure it to launch every hour, redirecting its output to the log file of your choice.
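For example, a crontab entry along those lines (paths are placeholders):

0 * * * * /usr/bin/python /path/to/check_script.py >> /var/log/hourly_check.log 2>&1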
If you're working in a Windows environment, then you'll be looking for the equivalent of cron (e.g. the Task Scheduler).
I don't think a trigger can help you with this; triggers only fire on certain events (you could use a trigger to check, after every insert, whether the inserted data is the data you want to check for every hour, but that's not the same thing; doing it via a script is the best solution in my experience).