How to recreate tplog file and write it again in kdb

I accidentally removed the tplogs on the box. How can I create the log file again and connect to it in the tickerplant without affecting the data in memory?
I know .u.L shows the path of the log file and .u.i is the count of messages written to it.
It is strange that .u.i still gives me a count when the file has already been removed.
If I create and connect to a new tplog, will only the messages from that point on be written to it?

.u.i counts the messages as they flow to the log file; it has no awareness that you deleted the log file, so it won't reflect that. When a tickerplant starts up it calls .u.ld, which counts the messages in the log file to set .u.i/.u.j if the log file already exists on startup.
https://github.com/KxSystems/kdb-tick/blob/afc67eda6dfbb2ca89322f702db23ee68c2c7be3/tick.q#L29
You could use it to open a new file and reset .u.i/.u.j/.u.l/.u.L by calling .u.ld .z.D.
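A minimal sketch, assuming the standard kdb-tick tick.q linked above is loaded and you run this inside the tickerplant process:
.u.L              / path the tp thinks it is logging to (the file itself is gone)
.u.i              / in-memory message count; unaware that the file was deleted
.u.l:.u.ld .z.D   / re-create/open today's log (resets .u.i/.u.j/.u.L) and reassign the log handle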
*** If this is production where this has happened, I would break qa/uat and attempt it there first.
To answer your last question: if you re-create the tplog, only the messages written to the new log from that point on will be saved. If your end of day is dependent on reading the log file, then you will need to figure out an alternative with the rdb. If you are using an rdb/wdb for end of day and nothing goes wrong, the messages will be retained. If the rdb dies there will be no log to replay and data will be lost. The wdb will have most data already written to disk, but you would need to be careful in its startup if it died: if by default it removes the intra-day db and replays the log, it would delete the data and be unrecoverable.

This is somewhat related to your question, but it is possible to retroactively re-create a tickerplant log either from an in-memory table (if you still had it living in an RDB) or an on-disk table (if your system did manage to write from the RDB to the HDB). However, recreating it exactly as it would have been in the live situation could be tricky, especially if you have a lot of tables that would be in the log.
In-mem table
.[`:/path/to/myTPlog;();:;()];                                / initialise an empty log file on disk
l:hopen`:/path/to/myTPlog;                                    / open a handle to the new log
{l enlist(`upd;`tabName;value x)}each select from tabName;    / write each row as its own upd message
One issue here is that this writes one table as a full sequence, whereas in real time you more likely had multiple tables interleaved based on the live timestamps. You could try to piece the log together chronologically across the various tables in a lock-step time sequence, but that would require a bit more work and memory.
If you were happy to write entire tables to the log one-at-a-time then you could even write the entire table as one upd:
l enlist(`upd;`tabName;value flip select from tabName);       / one upd message containing the whole table
On-disk table
unenum:{#[x; where type'[flip x] within 20 77h; value]};      / de-enumerate any enumerated (sym) columns so the log is self-contained
{l enlist(`upd;`tabName;value x)}each delete date from unenum[select from tabName where date=.z.D-1];   / replay yesterday's partition row by row, dropping the virtual date column

Related

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30 MB, with hundreds of thousands of rows).
I'd like to feed the logs into PipelineDB (running on the same single machine), whose continuous views can create exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using copy from stdin, which works great.
However, a machine might occasionally power off unexpectedly (e.g. due to a power outage) during the copy of a log file, which means that once it is back online there is uncertainty about how much of the file has been inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations)
Note that each row in the log file has a unique identifier (a serial number created by my application), but I can't find an option in the docs to define a unique field in the stream. I assume that PipelineDB's design is not meant to handle unique fields in stream rows.
Nonetheless, are there any alternative solutions to this issue?
Exactly once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the line number that it has successfully written to PipelineDB. When beginning to ship a new logfile, check it against this table to determine which line number to start at.
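A minimal sketch of that bookkeeping, using hypothetical names (logfile_progress, path, last_line) and assuming the underlying PostgreSQL supports ON CONFLICT (9.5+):
-- plain (non-continuous) table tracking how far each logfile has been shipped
CREATE TABLE logfile_progress (
    path      text PRIMARY KEY,
    last_line bigint NOT NULL DEFAULT 0
);
-- before shipping a file: find the line number to resume from (0 if never seen)
SELECT coalesce((SELECT last_line FROM logfile_progress WHERE path = '/logs/metrics-1015.log'), 0);
-- after successfully copying up to line 123456: record the new high-water mark
INSERT INTO logfile_progress (path, last_line)
VALUES ('/logs/metrics-1015.log', 123456)
ON CONFLICT (path) DO UPDATE SET last_line = EXCLUDED.last_line;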
Secondly, you could separate your aggregations by logfile (by including a path or something in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.
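A rough sketch of that second approach, in PipelineDB 0.9-style syntax with made-up stream and column names (metrics_stream, logfile, value):
-- continuous view with the aggregates grouped by source logfile
CREATE CONTINUOUS VIEW metrics_by_file AS
    SELECT logfile::text, count(*) AS n, sum(value::numeric) AS total
    FROM metrics_stream
    GROUP BY logfile;
-- before re-shipping a file after a crash, discard whatever it already contributed
DELETE FROM metrics_by_file WHERE logfile = '/logs/metrics-1015.log';
-- read-time aggregation across all logfiles using combine()
CREATE VIEW metrics_overall AS
    SELECT combine(n) AS n, combine(total) AS total FROM metrics_by_file;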

DB2 AS400/IBM iSeries Triggers/On File Change

Looking for best practices to get DELTAs of data over time.
No timestamps available, cannot program timestamps!
GOAL: To get the differences in all files for all fields over time. I only need the primary key as output. I also need this for 15-minute intervals of data changes.
Example:
The customer file has 50 columns/fields; if any field changes, I want another file to record the primary key. Or anything else that records the occurrence of a change in the customer file.
Issue:
I am not sure if triggers are the way to go since there is a lot of overhead associated with triggers.
Can anyone suggest best practices for DB2 deltas over time with consideration to overhead and performance?
I'm not sure why you think there is a lot of overhead associated with triggers; they are very fast in my experience. But as David suggested, you can journal the files you want to track and then analyze the journal receivers.
To turn on Journaling you need to perform three steps:
Create a receiver using CRTJRNRCV
Create a journal for the receiver using CRTJRN
Start journaling on the files using STRJRNPF. You will need to keep *BEFORE and *AFTER images to detect a change on update, but you can omit *OPNCLS records to save some space.
Once you do this, you can also use commitment control to manage transactions! But you will now have to manage those receivers, as they use a lot of space. You can do that by using MNGRCV(*SYSTEM) on the CRTJRN command. I suspect that you will want to prevent the system from deleting the old receivers automatically, as that could cause you to miss some changes when the system changes receivers. But that means you will have to delete old receivers on your own when you are done with them. I suggest waiting a day or two to delete old receivers; that can be an overnight process.
To read the journal receiver, you will need to use RTVJRNE (Retrieve Journal Entries), which lets you retrieve journal entries into variables, or DSPJRN (Display Journal), which lets you send journal entries to the display, a printer file, or an *OUTFILE. The *OUTFILE can then be read using ODBC, SQL, or however else you want to process it. You can filter the journal entries you want to receive by file and by type.
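A rough sketch of the commands involved, with illustrative library and object names (MYLIB, CUSTJRN, CUSTOMER, etc.):
/* one-time setup: create a receiver and journal, then start journaling the physical file */
CRTJRNRCV JRNRCV(MYLIB/CUSTRCV001)
CRTJRN    JRN(MYLIB/CUSTJRN) JRNRCV(MYLIB/CUSTRCV001) MNGRCV(*SYSTEM) DLTRCV(*NO)
STRJRNPF  FILE(MYLIB/CUSTOMER) JRN(MYLIB/CUSTJRN) IMAGES(*BOTH) OMTJRNE(*OPNCLS)
/* every 15 minutes: dump record-level entries for the file to an *OUTFILE for analysis; */
/* add FROMTIME/TOTIME to restrict the extract to the interval that just ended           */
DSPJRN JRN(MYLIB/CUSTJRN) FILE((MYLIB/CUSTOMER)) ENTTYP(*RCD) OUTPUT(*OUTFILE) OUTFILE(MYLIB/CUSTCHG)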
Have you looked at journalling the files and evaluating the journal receivers?

PostgreSQL - Recovery of Functions' code following accidental deletion of data files

So, I am (well... I was) running PostgreSQL within a container (Ubuntu 14.04 LTS with all the recent updates; back-end storage is "dir" for convenience).
To cut a long story short, the container folder got deleted. Using extundelete and ext4magic, I have managed to extract some of the database's physical files (it appears as if most of the files are there, but I am not 100% sure what, if anything, is missing).
I have two copies of the database files: one from 9.5.3 (which appears to be more complete) and one from 9.6 (I upgraded the container very recently to 9.6, but that copy appears to be missing data files).
All I am after is to try to extract the SQL code that relates to the user-defined functions. Is anyone aware of an approach that I could try?
P.S.: The last backup is a bit dated (due to bad practices, really), so it would be a last resort if the task of extracting the needed information proves "reasonable" and "successful".
Regards,
G
Update - 20/4/2017
I was hoping for a "quick fix" by somehow extracting the function body text off the recovered data files... however, nothing's free in this life :)
Starting from the old-ish backup along with the recovered logs, we managed to cover a lot of ground into bringing the DB back to life.
Lessons learned:
1. Do implement a good backup/restore strategy
2. Do not store backups on the same physical machine
3. Hardware failure can be disruptive... Human error can be disastrous!
If you can reconstruct enough of a data directory to start Postgres in single-user mode, you might be able to dump pg_proc. But this seems unlikely.
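If you do manage to start it, the function source can be pulled straight from the system catalogs, for example:
-- start the recovered cluster in single-user mode: postgres --single -D /path/to/datadir yourdb
-- then dump the source of everything outside the system schemas
SELECT n.nspname, p.proname, p.prosrc
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema');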
Otherwise, if you're really lucky, you'll be able to find the relation for pg_proc and its corresponding pg_toast relation. The latter will often contain compressed text, so searching the raw files for fragments of variables you know appear in function bodies may not help you out.
Anything stored inline in pg_proc will be short functions, significantly less than 8k long. Everything else will be in the toast relation.
To decode that, you have to unpack the pages to get the TOAST chunks, then reassemble them and decompress them (if compressed).
If I had to do this, I would probably create a table with the exact same schema as pg_proc in a new postgres instance of the same version. I would then find the relfilenode(s) for pg_catalog.pg_proc and its toast table using the relfilenode map file (if it survived) or by pattern matching and guesswork. I would replace the empty relation files for the new table I created with the recovered ones, restart postgres, and if I was right, I'd be able to select from the tables.
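For reference, this is how to see which files on disk back a relation in a running instance (the same query against the lookalike table tells you which files to swap out):
-- where pg_proc and its TOAST table live, relative to the data directory
SELECT relname, relfilenode, pg_relation_filepath(oid)
FROM pg_class
WHERE relname IN ('pg_proc', 'pg_toast_1255');  -- pg_toast_1255 is pg_proc's TOAST table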
Not easy.
I suggest reading up on postgres's storage format as you'll need to understand it.
You may consider https://www.postgresql.org/support/professional_support/ . (Disclaimer, I work for one of the listed companies).
P.S.: The last backup is a bit dated (due to bad practices, really), so it would be a last resort if the task of extracting the needed information proves "reasonable" and "successful".
Backups are your first resort here.
If the 9.5 files are complete and undamaged (or enough so to dump the schema), then simply copying them into place, checking permissions, and starting the server will get you going. Don't trust the data, though; you'll need to check it all.
Although it is possible to partially recover given damaged files, it's a long complicated process and the fact that you are asking on Stack Overflow probably means it's not for you.

delete temporary files in postgresql

I have a huge database of about 800 GB. When I tried to run a query that groups certain variables and aggregates the results, it stopped after running for a couple of hours, and Postgres threw a message that the disk is full. Looking at the statistics, I realized that the DB has about 400 GB of temporary files. I believe these temp files were created while I was running the query. My question is: how do I delete these temp files? Also, how do I avoid such problems in future? Should I use cursors or for-loops so as not to process all the data at once? Thanks.
I'm using Postgres 9.2
The temporary files that get created in base/pgsql_tmp during query execution will get deleted when the query is done. You should not delete them by hand.
These files have nothing to do with temporary tables; they are used to store data for large hash or sort operations that would not fit in work_mem.
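You can check how much temporary file space has been used per database (these counters exist since 9.2), for example:
-- cumulative number and size of temp files written per database since the last stats reset
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
ORDER BY temp_bytes DESC;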
Make sure that the query is finished or canceled, try running CHECKPOINT twice in a row and see if the files are still there. If yes, that's a bug; did the PostgreSQL server crash when it ran out of disk space?
If you really have old files in base/pgsql_tmp that do not get deleted automatically, I think it is safe to delete them manually. But I'd file a bug with PostgreSQL in that case.
There is no way to avoid large temporary files if your execution plan needs to sort large result sets or needs to create large hashes. Cursors won't help you there. I guess that with for-loops you mean moving processing from the database to application code – doing that is usually a mistake and will only move the problem from the database to another place where processing is less efficient.
Change your query so that it doesn't have to sort or hash large result sets (check with EXPLAIN). I know that does not sound very helpful, but there's no better way. You'll probably have to do that anyway, or is a runtime of several hours acceptable for you?
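For example, with a hypothetical query (table and column names are made up), EXPLAIN ANALYZE shows whether the sort spilled to disk:
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, sum(amount)
FROM sales
GROUP BY customer_id
ORDER BY customer_id;
-- "Sort Method: external merge  Disk: ...kB" means the sort spilled to temporary files;
-- "Sort Method: quicksort  Memory: ...kB" means it fit in work_mem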

Question on checkpoints during large data loads?

This is a question about how PostgreSQL works. During large data loads using the 'COPY' command, I see multiple checkpoints occur where 100% of the log files (checkpoint_segments) are recycled.
I don't understand this, I guess. What does pgsql do when a single transaction requires more space than the available log files? It seems that it is wrapping around multiple times in the course of this load, which is a single transaction. What am I missing?
Everything is working, I just want to understand it better in case I can tune things, etc.
When a checkpoint happens, all dirty pages are written to disk. As these pages can no longer be lost, the log is no longer needed for them, so it is safe to recycle. Writing dirty pages to disk doesn't mean that data is committed. The database can see from the metadata stored with each row that it belongs to a transaction that has not committed yet, and it can still abort that transaction, in which case vacuum will eventually clean up those rows.
When loading large amounts of data it is advised to temporarily increase checkpoint_segments.
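For example, settings along these lines in postgresql.conf for the duration of the load (values are illustrative; tune them to your hardware and PostgreSQL version):
checkpoint_segments = 64            # default 3; allow more WAL between checkpoints (PostgreSQL < 9.5)
checkpoint_completion_target = 0.9  # spread checkpoint I/O over more of the checkpoint interval
# on PostgreSQL 9.5 and later, checkpoint_segments was replaced by max_wal_size, e.g. max_wal_size = 4GB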