Perl File::Tail synchronization - perl

I have the following situation:
I'm parsing some log files with a Perl daemon. The daemon writes data to a MySQL DB.
The log file can:
be rotated (solved by file size and some logic)
not exist yet (the 'ignore_nonexistant' parameter in File::Tail)
The daemon can:
be killed
die for some reason.
I'm using File::Tail to tail the file. For file rotation, a mechanism based on creation date or file size can help. But what mechanism should I use to start tailing from some position in the file? (Assume there are many such daemons and no write access to the filesystem.)
I've thought about keeping a position variable in the DB, but that won't help me.
Maybe some mechanism to pass a position parameter to the parent process?
I just don't want to reinvent the wheel.

File::Tail already detects rotation and continues reading from the new file.
To deal with the daemon dying and restarting, can you query the database for the last record written when the daemon restarts, and just skip logfile lines until you get to a later line?
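A rough sketch of that approach, assuming timestamped, sortable log lines and a table with a ts column (the table name, parse_line() and insert_row() are made-up placeholders; check the File::Tail docs for the exact meaning of the tail option):
use strict;
use warnings;
use DBI;
use File::Tail;

# Last timestamp we already stored (schema and column names are made up).
my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'pass', { RaiseError => 1 });
my ($last_ts) = $dbh->selectrow_array('SELECT MAX(ts) FROM parsed_lines');

my $tail = File::Tail->new(
    name               => '/var/log/app.log',
    ignore_nonexistant => 1,
    tail               => -1,   # re-emit the existing file contents first (verify against the docs)
);

while (defined(my $line = $tail->read)) {
    my ($ts, $rest) = parse_line($line);           # parse_line() is hypothetical
    next if defined $last_ts && $ts le $last_ts;   # skip lines already in the DB (assumes sortable timestamps)
    insert_row($dbh, $ts, $rest);                  # insert_row() is hypothetical
}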

Try http://search.cpan.org/dist/Log-Unrotate/.
You'll have to implement your own Log::Unrotate::Cursor class if you wish to store position files in the DB instead of the local filesystem, but that should be trivial.
We wrote Log::Unrotate and have used it in production for 5 years, and it tries really hard never to skip any data. (It tries so hard that it throws an exception if your cursor becomes invalid, for example if the log got rotated several times while the reader wasn't running for some reason. You may want to enable the autofix_cursor option to change this behavior.)
Also take a look at http://search.cpan.org/dist/File-LogReader/. I never used it but it's supposed to solve the same task.
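For reference, a minimal Log::Unrotate reader loop might look roughly like this; the constructor options follow the module's synopsis as I recall it, so double-check them (and the cursor-in-DB variant) against the docs, and process_line() is a made-up name:
use strict;
use warnings;
use Log::Unrotate;

my $reader = Log::Unrotate->new({
    log            => '/var/log/app.log',
    pos            => '/var/lib/myapp/app.pos',
    autofix_cursor => 1,   # optional; see the note above about invalid cursors
});

while (defined(my $line = $reader->read())) {
    process_line($line);   # process_line() is hypothetical
    $reader->commit();     # persist the position only after the line is safely handled
}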

Related

Update file descriptor pointing to /proc/self after fork() from python multiprocessing.Process

I'm working on a C++ program that uses boost::python to provide a python wrapper/API for the user. The program tracks and limits its own memory usage by opening /proc/self/statm using a file descriptor. Every timestep it seeks to the beginning of that file and reads the vmsize from it.
proc_self_statm_fd = open( "/proc/self/statm", O_RDONLY );
However, this causes a problem when calling fork(). In particular, when a user writes a python script that does something like this:
proc = multiprocessing.Process(name="bkg_process",target=bkg_process,daemon=True)
The problem is that the forked process gets the file descriptor pointing to /proc/self/statm from the parent process, not its own, and this reports the wrong memory usage. Even worse, if the parent process exits, the child process will fail when trying to read from the file descriptor.
What's the correct solution for this? It needs to be handled at the C++ level because we don't have control over the user's python scripts. Is there a way to have the class auto detect that a fork has happened and grab a new file descriptor? In the worst case I can have it re-open the file for every update. I'm worried that would add runtime overhead though.
You could store the PID in the class, and check it against the value of getpid() on each call, and then reopen the file if the PID has changed. getpid() is typically much cheaper than open - on some systems it doesn't even need a context switch (it just fetches the PID from a magic location in the process's own memory).
That said, you may also want to actually measure the cost of reopening the file each time - it may not actually be significant.
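The pattern is language-agnostic: cache the PID alongside the descriptor, compare it with getpid() on every read, and reopen on a mismatch. A rough sketch of the idea, written in Perl rather than C++ purely as illustration (on Perl, $$ is the current PID):
use strict;
use warnings;

my ($cached_pid, $statm_fh);

sub vmsize_pages {
    # Reopen /proc/self/statm if we're now in a different process than the
    # one that opened it (i.e. after a fork); otherwise reuse the handle.
    if (!defined $cached_pid || $cached_pid != $$) {
        open $statm_fh, '<', '/proc/self/statm' or die "open /proc/self/statm: $!";
        $cached_pid = $$;
    }
    seek $statm_fh, 0, 0;
    my ($size) = split ' ', scalar <$statm_fh>;   # first field: total program size, in pages
    return $size;
}

print vmsize_pages(), "\n";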

What is the best way to tell MongoDB has started?

I have some automated tests that I run in order to test a MongoDB-related library. In order to do that, I start a Mongo server with a temporary data directory and on an ephemeral port, connect to it, and run some tests.
This leads to a race condition, obviously. So in my first version of these tests, I paused for a fixed amount of time and waited to make sure mongod had time to start before the tests began.
This was frustrating (and inefficient), so I decided to monitor mongod's standard output and wait for a line matching the regular expression:
/\[initandlisten\] waiting for connections/
This got it working, so, good. Then I prepared to circle back and try to find a more robust way to do it. I recalled that a Java library called "embedmongo" runs MongoDB-based tests, and figured it must solve the problem. And it does this (GitHub):
protected String successMessage() {
    return "waiting for connections on port";
}
... and uses that to figure out whether the process has started correctly.
So, are we right? Is examining the mongod process output log (is it ever internationalized? could the wording of the message ever change?) the very best way to do this? Or is there something more robust that we're both missing?
What we do in a similar scenario is (a rough Perl sketch of the polling step follows this list):
Try to connect to the configured port (simply new Socket(host, port)) in a loop until it works (10 ms delay); this ensures that the mongo client, which starts an internal monitoring thread, does not throw exceptions due to "connection refused".
Connect to the MongoDB instance and query something. This is important, as all mongo client objects are lazily initialized. (A simple listDatabaseNames() on the client is enough, but make sure to actually read the result.)
All the while, check that the process has not terminated.
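As promised above, a rough sketch of the polling step in Perl (host, port, and timings are placeholders; the follow-up query in step 2 should go through whatever driver you use):
use strict;
use warnings;
use IO::Socket::INET;
use Time::HiRes qw(sleep);

sub wait_for_mongod {
    my ($host, $port, $timeout) = @_;
    my $deadline = time + $timeout;
    while (time < $deadline) {
        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => $port,
            Proto    => 'tcp',
            Timeout  => 1,
        );
        if ($sock) {
            close $sock;
            return 1;   # the port is accepting connections
        }
        sleep 0.01;     # roughly the 10 ms delay mentioned above
    }
    return 0;           # gave up; also check that mongod hasn't died in the meantime
}

# After this returns true, connect with your driver, run a real command
# (e.g. list the database names) and actually read the result, as step 2 says.
wait_for_mongod('127.0.0.1', 27017, 30) or die "mongod never came up";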
I just wrote a small untilMongod command which does just that, which you can use in bash scripting: https://github.com/FGM/untilMongod
Includes a bash + Node.JS example use case.

Files read immediately after watchman notify are empty

I'm integrating watchman via the socket/bser interface in a JVM program.
I'm seeing odd timing where:
A file is written to by the build system (a small text file)
I get a watchman notification on the bser interface
Thread A listening for bser subscription notifications puts the update onto a queue for a separate thread
Thread B reads the queue, reads the changed file, and then puts the file's data on the wire
However, somehow, Thread B is reading an empty file.
Which, I assume, is legitimately empty at some point; e.g. the I/O syscalls might be:
Clear the file contents
Write chunk 1
Write chunk 2
Close the file
And I assume my Thread B is reading the file between steps 1 and 2. Or maybe 1 and 4, if 4 is when the result is flushed.
My confusion is twofold:
1) I thought watchman's default 20 ms settle would account for things like this, and that I'd only see an update on thread A (let alone a read on thread B) after step 4, when the data is done being written to the file.
2) Even if watchman did tell me about the first syscall "too soon" (say, step 1), and I read the results while the file was still empty, there should be another syscall/watchman notification saying "by the way, the file has some content now".
FWIW, and oddly enough, I was seeing this very same behavior when using the Java WatchService API: I would get an inotify event but read the file "too soon", getting either empty or partial results, and then no follow-up inotify event when the rest of the data was available.
I assumed this was a fluke/nuance of the WatchService, so I solved it at the time by checking the file's mod time before reading it, and waiting until the mod time was more than 2 seconds old before assuming the file was "done" being written.
(Note that this also handled ~100mb+ files being written, where the build process might write a chunk of data every 100ms+, but with WatchService I was seeing 100s of inotify notifications for what was essentially a single continuous write.)
When I ported my WatchService code to watchman, I dropped this "ensureSettled" hack, because I assumed watchman's 20 ms settle period (which is way lower than the 2 s I was using, but hey, it's the default) plus its general robustness compared to the somewhat beta WatchService would mean it wouldn't be a problem.
But within ~a day of using the watchman-ported code, I'm seeing empty file reads, just like I was with the WatchService.
Any ideas about what I'm missing?
I can add back the ensureSettled hack, but at this point I'm curious about what is going on.
The docs aren't very clear on this, sorry!
Dispatching of subscription notifications is subject to the settle timeout, but since file updates are non-atomic it's likely that the default 20ms kicks in before the file contents are visible to you; under the covers, the kernel generates a series of notifications for the various mutations that you're doing, so if the truncate takes 20ms before you write (or perhaps flush) the data out, you'll likely get a notification "in the middle".
This stuff is also operating system dependent. Here's an example of a recently discovered and resolved issue: https://github.com/facebook/watchman/commit/bac383c751b248ae742a2a20df3e8272238c0ae2
It doesn't sound like quite the same thing you're experiencing, but it adds some color to this discussion.
If you already have code to manage the settling in your client, then it may be easier for you to add that back; we do this in watchman-make for example.
You may also wish to try setting https://facebook.github.io/watchman/docs/config.html#settle in a .watchmanconfig file in the root of the directory tree that you're watching and leave that to the watchman server. If/when you change this setting, you will need to delete and restart the watch.
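For example, a minimal .watchmanconfig at the root of the watched tree could look like the following; settle is in milliseconds, and 200 is just an illustrative value, not a recommendation:
{
  "settle": 200
}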
Which you choose depends on how you want to trade ease of configuration against volume of code you want to maintain and (perhaps) volume of support questions from your user base if the .watchmanconfig isn't correctly configured for them.
Note that you can use the command invocation from https://facebook.github.io/watchman/docs/cmd/log-level.html to see the debug logging for the kernel notifications as they come in in real time; this may be helpful for you in understanding exactly which notifications are coming in and when.
Just curious, are you using https://github.com/facebook/watchman/tree/master/java to talk to the watchman server?

Is it correct to call openlog before every syslog call without affecting performance?

There are two packages that call Sys::Syslog::openlog with different identities, and this causes messages to be logged randomly under either identity.
For example, if a script that imports package1 runs ahead of another script that imports package2, the identity set by the later script becomes the persistent one.
The syslog documentation says to call openlog before syslog, but I'm wondering whether it can be done before every syslog call without affecting performance.
What's the right thing to do?
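To make the pattern in question concrete, here is a minimal Sys::Syslog sketch (the package and identity names are made up); it re-establishes the identity immediately before each message, which is the per-call openlog approach being asked about:
use strict;
use warnings;
use Sys::Syslog qw(:standard);

sub log_as {
    my ($ident, $message) = @_;
    # Set the identity right before logging, so whichever package logged
    # last doesn't leave its ident "stuck" for everyone else.
    openlog($ident, 'ndelay,pid', 'local0');
    syslog('info', '%s', $message);
    closelog();
}

log_as('package1', 'message from the first package');
log_as('package2', 'message from the second package');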

How to preserve data between executions of program

I am running a Perl script on an HP-UX box. The script will execute every 15 minutes and will need to compare its results with the results of the last time it executed.
I will need to store two variables (IsOccuring and ErrorCount) between the executions. What is the best way to do this?
Edit clarification:
It only compares the most recent execution to the current execution.
It doesn't matter if the value is lost between reboots.
And touching the filesystem is pretty much off limits.
If you can't touch the file system, try using a shared memory segment. There are helper modules for that like IPC::ShareLite, or you can use the shmget and related functions directly.
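A rough sketch of that approach with IPC::ShareLite, using Storable's freeze/thaw for the serialization (the key value is arbitrary, and the hash contents stand in for whatever the script actually computes):
use strict;
use warnings;
use IPC::ShareLite;
use Storable qw(freeze thaw);

my $share = IPC::ShareLite->new(
    -key     => 1971,     # any agreed-upon integer key
    -create  => 'yes',
    -destroy => 'no',     # keep the segment around between runs
) or die "IPC::ShareLite: $!";

# Load the previous run's values, if any.
my $prev = eval { thaw($share->fetch) } || { IsOccuring => 0, ErrorCount => 0 };

# ... compare the current run against $prev->{IsOccuring} and $prev->{ErrorCount} ...

# Save this run's values for the next execution.
my %now = ( IsOccuring => 0, ErrorCount => 0 );   # filled in by your checks
$share->store( freeze(\%now) );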
You'll have to store them in a file. This sort of file is often kept in /tmp, but any place where the user running the cron job has access would do. Make sure your script can handle the case where the file is missing.
You could create a separate process running a "remember stuff" service over your choice of IPC mechanism. This sounds like a rather tortured solution to "I don't want to touch the disk" but if it's important enough to offset a couple of days of development work (realistically, if you are new to IPC, and HP-SUX continues to live up to its name) then by all means read man perlipc for a start.
Does it have to be completely re-executed? Can you just have it running in a loop and sleeping for 15 minutes between iterations? Then you don't have to worry about saving the values externally; the program never stops.
I definitely think IPC is the way to go here.
I'd save off the data in a file. Then, inside the script I'd load the last results if the file exists.
Use the Storable module to serialize Perl data structures, save them anywhere you want, and deserialize them during the next script execution.
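For completeness, the Storable-to-file approach from the last few answers might look roughly like this (note that it does touch the filesystem, which the question's edit rules out; the path is a placeholder):
use strict;
use warnings;
use Storable qw(store retrieve);

my $state_file = '/var/tmp/myscript.state';   # placeholder path

# Load the previous run's values, if the file exists.
my $state = -e $state_file
    ? retrieve($state_file)
    : { IsOccuring => 0, ErrorCount => 0 };

# ... compare the current run against $state->{IsOccuring} and $state->{ErrorCount} ...

# Persist this run's values for the next execution.
store($state, $state_file);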