Polling a constantly updating text file to cut selected information - perl

I have a log tail running in real time and saving to a text file named 'test.txt' at /home/pi/ using the 'script' command, also in real time. Now I want to set up a process that constantly polls that text file for changes and cuts out a specific recurring bit of data. For example, a section of the log would look like:
Feb 9 11:43:24 dnsmasq[887]: query[A] captive.g.aaplimg.com from 192.168.178.21
Feb 9 11:43:24 dnsmasq[887]: forwarded captive.g.aaplimg.com to 8.8.4.4
Feb 9 11:43:24 dnsmasq[887]: reply captive.g.aaplimg.com is 17.253.55.202
Feb 9 11:43:24 dnsmasq[887]: reply captive.g.aaplimg.com is 17.253.57.211
Feb 9 11:43:54 dnsmasq[887]: query[A] captive.g.aaplimg.com from 192.168.178.21
And I want to cut info only from the lines with query[A] (assuming that could be used as a marker) so that the output text looks like:
11:43 captive.g.aaplimg.com
But the problem is that there are different URLs attached to this line of the log, so for example a line with 'query[A]' could also look like:
Feb 9 11:49:56 dnsmasq[887]: query[A] www.googleapis.com from 192.168.178.21
Then I would want the output to be:
11:49 www.googleapis.com
But it needs to happen in real time, as the text file/log is updating, because I want this text file to be constantly polled and sent to a printer, also in real time (a long story).
I have been looking at awk + sed to cut out the info I need, but they're new to me, so I find the format a bit confusing, and I find it especially hard to figure out how to run it so it happens in real time.
Running on Debian Buster on a Pi.
Would love some help! Thanks

I assume you're looking for something like this:
tail -f my.log | perl -nle 'print"$1$2" if /(\d\d:\d\d):\d\d.*query\[A\]( \S+)/' > test.txt
The -f option makes tail constantly output new lines as the file my.log grows. It feeds the lines into the little Perl one-liner, which looks for query[A] (escaping the [ and ] chars with \ since they otherwise have special meaning in regexps) and, when it is found, prints the time in hours and minutes and the domain name, captured by the regexp into $1 and $2.
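One caveat if the output really needs to reach test.txt (and the printer) in real time: when perl's output is redirected to a file rather than a terminal, it is block-buffered, so matching lines may only show up after a delay. A minimal variant that turns on autoflush (same regexp, just an added BEGIN block):
tail -f my.log | perl -nle 'BEGIN { $| = 1 } print "$1$2" if /(\d\d:\d\d):\d\d.*query\[A\]( \S+)/' > test.txt
The $| = 1 simply makes perl flush each printed line immediately; everything else is unchanged.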

Related

Zsh completion caching policy explained

I'm writing some zsh functions using the powerful completion feature. Computing my completions takes some time, and I want to make use of the completion caching policy. From the zsh manual (https://zsh.sourceforge.io/Doc/Release/Completion-System.html) I found this code snippet:
example_caching_policy () {
  # rebuild if cache is more than a week old
  local -a oldp
  oldp=( "$1"(Nm+7) )
  (( $#oldp ))
}
I couldn't find any explanation of the (Nm+7) syntax. What does Nm mean? By trial and error I found out that, for example, Nms+1 changes the cache policy to 1 second, while Nmh+1 changes it to 1 hour. But where can I find an explanation of the general (NmX+N) construct?
Similarly, what exactly does the line (( $#oldp )) mean?
I can explain the (Nm+7)
man zshexpn, search for Glob Qualifiers
a[Mwhms][-|+]n
    files accessed exactly n days ago. Files accessed within the last n days are selected using a
    negative value for n (-n). Files accessed more than n days ago are selected by a positive n
    value (+n). Optional unit specifiers `M', `w', `h', `m' or `s' (e.g. `ah5') cause the check to
    be performed with months (of 30 days), weeks, hours, minutes or seconds instead of days,
    respectively. An explicit `d' for days is also allowed.
    Any fractional part of the difference between the access time and the current part in the
    appropriate units is ignored in the comparison. For instance, `echo *(ah-5)' would echo files
    accessed within the last five hours, while `echo *(ah+5)' would echo files accessed at least
    six hours ago, as times strictly between five and six hours are treated as five hours.
m[Mwhms][-|+]n
    like the file access qualifier, except that it uses the file modification time.
N stands for NULL_GLOB: if zsh matches nothing, it removes the pattern. Without this N option, if the pattern matches nothing, zsh prints an error.
Example with 4 files
$ touch lal # = updates file modification date to now
$ ls
lal lil lol tut
$ ls l*(m+7)
lil lol
# files older than 7 days starting with l
$ ls l*(m-7)
lal
# files younger than 7 days starting with l
$ ls l*(m+200)
zsh: no match
# no files older than 200 days
$ ls l*(Nm+200)
lal lil lol tut
# N = NULL_GLOB removed the non-matching pattern, so the command is just ls
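For completeness, here is the same snippet with comments, plus a zstyle line tying it to a completion context via the cache-policy style (the context pattern below is only an example; adapt it to your own completion function):
# Tell the completion system which function decides when the cache is stale.
zstyle ':completion:*:*:example:*' cache-policy example_caching_policy

example_caching_policy () {
  # $1 is the cache file. (Nm+7) keeps it in the array only if it was
  # modified more than 7 days ago; N (NULL_GLOB) leaves the array empty
  # otherwise instead of raising an error.
  local -a oldp
  oldp=( "$1"(Nm+7) )
  # $#oldp is the number of elements in the array. (( ... )) succeeds
  # (returns status 0) when that number is non-zero, i.e. the cache file
  # is more than a week old and should be rebuilt.
  (( $#oldp ))
}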

Why are no output files written in the prinseq-lite Perl loop?

I am completely new to this type of coding/command-line work, so I am sorry if I am asking this question in the wrong way.
I want to loop over all files in a directory (I am quality trimming DNA sequencing files (.fastq format))
I have written this loop:
for i in *.fastq; do
perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq $i -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq -out_bad null; done
The code itself seems to work: I can see in my terminal that it is taking the right files and doing the trimming (it writes a summary log in the terminal as it goes), but no output files are generated, i.e. these ones:
-out_good /proj/forhot/qfiltered/looptest/$i_filtered.fastq
If I run the code in a non-loop way, just on one file, it works (= the output is generated), like this example:
prinseq-lite.pl -fastq 60782_merged_rRNA.fastq -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 -trim_tail_right 15 -trim_tail_left 15 -out_good 60782_merged_rRNA_filt_codeTEST.fastq -out_bad null
Is there a simple reason/answer to this?
This problem has nothing to do with Perl at all.
/proj/forhot/qfiltered/looptest/$i_filtered.fastq is read by the shell as interpolating the contents of i_filtered. There is no such shell variable, so this argument turns into /proj/forhot/qfiltered/looptest/.fastq ($i_filtered turns into nothing).
Therefore all of your prinseq-lite.pl executions place their output in the same file, which (because its name starts with a .) is "hidden": You need to use ls -a to see it, not just ls.
Fix
... -out_good /proj/forhot/qfiltered/looptest/${i}_filtered.fastq
Note that this would give you e.g. 60782_merged_rRNA.fastq_filtered.fastq for an input file of 60782_merged_rRNA.fastq. If you want to get rid of the duplicate .fastq part, you need something like:
... -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq
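Putting it together, the corrected loop might look like this (paths and options are copied from the question; the quotes around $i are just defensive in case of odd file names):
for i in *.fastq; do
  perl /apps/prinseqlite/0.20.4/prinseq-lite.pl -fastq "$i" \
    -min_len 220 -max_len 240 -min_qual_mean 30 -ns_max_n 5 \
    -trim_tail_right 15 -trim_tail_left 15 \
    -out_good /proj/forhot/qfiltered/looptest/"${i%.fastq}"_filtered.fastq \
    -out_bad null
done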

How to efficiently find and parse last line of text from a log file via PowerShell?

I have a very large log file. I need to find the last "WARN" line in that file efficiently (i.e. reading from the end), parse it, and return it as an object with a "Date" field (DateTime type), a "Level" field, and a "Description" field.
Any suggestions?
Here's what the file looks like
[Mon Dec 14 14:57:53 2015] [notice] Child 6180: Acquired the start mutex.
[Mon Dec 14 14:57:53 2015] [notice] Child 6180: Starting 150 worker threads.
[Mon Dec 14 15:04:43 2015] [warn] pid file C:/Program Files (x86)/Citrix/XTE/logs/xte.pid overwritten -- Unclean shutdown of previous Apache run?
[Mon Dec 14 15:04:43 2015] [notice] Server built: May 27 2011 16:04:42
[Mon Dec 14 15:04:43 2015] [notice] Parent: Created child process 5608
EDIT: This command must look inside the file, find the last matching line by search criteria, return that line, and "stop". The possible duplicate question is different in a number of ways: my script cannot simply sit there and wait for a line to appear; it needs to run, get the line as quickly as possible, and get out. Furthermore, it needs to search for it by substring, and lastly it needs to return a DateTime and the other fields broken up. Thanks for not voting to close this question.
Open the file as a raw Stream, seek a "decent" block size from the end (say 1 MB), then search the resulting bytes for the binary representation of "warn" until you've found the last instance (I'm assuming you know the encoding in advance). If you find it, scan for the line terminators. If you don't find it, seek back 1 + 1 MB and go again. Repeat until you seek to the beginning.
If there is no "warn" in the entire file, this will be slower than just reading the file sequentially, but if you're certain there's a line of the kind you want near the end, this can terminate pretty quickly. The essential thing to do is not read the file as text with a StreamReader, since you lose the ability to seek arbitrarily.
Actually getting the code for this idea right is more involved. The difficulty of this operation is not due to anything in PowerShell -- there is no simple way to do this in any language, because reading a file in reverse is not an efficient operation in any file system I know of.
I'd approach that this way:
get-content $file -ReadCount 3000 |
  ForEach-Object {
    if ($_ -like '*warn*')
      {$Lastfound = $_}
  }
($Lastfound -like '*warn*')[-1]
It's certainly not going to be efficient. Everything in PowerShell and C# (and everything else) is built around reading forwards, not backwards. Given that and the fact that you don't even know where the last line might be, I don't see any way to avoid processing the whole file unless you want to spend several hours writing your own ReverseStreamReader.
Assuming the file is bigger than RAM -- which makes Get-Content impractical, IMO -- I'd probably do something like:
$LineNumber = [uint64]0;
$StreamReader = New-Object System.IO.StreamReader -ArgumentList "C:\LogFile.log"
$SearchPattern = [Regex]::Escape('[warn]');
# Compare against $null so that a blank line does not end the loop early.
while ($null -ne ($Line = $StreamReader.ReadLine())) {
    $LineNumber++;
    if ($Line -match $SearchPattern) {
        $LastLineNumber = $LineNumber;
        $LastLineMatch = $Line;
    }
}
$StreamReader.Close()
$LastLineNumber
$LastLineMatch
Parsing the line is probably going to involve a lot of String.IndexOf() and String.Substring(). Turning the date into a DateTime should be done like so:
[datetime]::ParseExact('Mon Dec 14 15:04:43 2015','ddd MMM dd HH:mm:ss yyyy',[System.Globalization.CultureInfo]::InvariantCulture,[System.Globalization.DateTimeStyles]::None);
I chose -match over -like because as far as I can tell it actually performs better. That might be just my system, however.

use perl to extract specific output lines

I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it in a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb so, what would be the best way to extract the desirable component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by @mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields; see for example the printf command in awk. Or you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel, but a rough sketch follows.
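For instance, an awk sketch (untested against reVerb itself) that turns the tab-separated fields 16-18 into a fact using the underscore style mentioned above:
echo "Bananas are an excellent source of potassium." | ./reverb -q |
  awk -F'\t' '{ gsub(/ /, "_", $17); printf "%s(%s, %s).\n", $17, $16, $18 }'
# prints: be_source_of(bananas, potassium).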
sed -n 'N;N
:cycle
$!{N
D
b cycle
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
If the numbers are in the output and not just there for reference, change the last sed action to:
s/^ *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)\n *[0-9]\{1,\} \{1,\}\(.*\)/\2 (\1,\3)/p
This assumes the last 3 lines are the source of your "rules".
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
Depending on use cases and other definitions, it may even be an advantage to create this kind of fact (i.e., facts of the form be/1 instead of source_of/2). If this is the only kind of fact you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.

Using sed to copy data between two numerical patterns to a new file

I'm running a bunch (~320) of computational chemistry experiments, and I need to pull a small amount of the data out of each of the files so that I can do some work on it in MatLab.
I'm pretty sure I can use sed to make this work, but try as I might I don't seem to be able to do so.
I need all of the data starting at the line beginning with "1 1" and ending with the line starting with "33 33".
I J FI(I,J) k(I,J) K(I,J)
1 1 -337.13279 -0.06697 -0.00430
2 2 3804.89120 8.52972 0.54787
3 3 3195.69653 6.01702 0.38648
4 4 3189.18684 5.99253 0.38490
5 5 3183.73262 5.97205 0.38359
6 6 3174.47525 5.93737 0.38136
7 7 3167.88746 5.91275 0.37978
8 8 1628.80868 1.56311 0.10040
9 9 1623.56055 1.55306 0.09975
10 10 1518.21620 1.35806 0.08723
11 11 1476.93012 1.28520 0.08255
12 12 1341.24087 1.05990 0.06808
13 13 1312.30373 1.01466 0.06517
14 14 1264.73004 0.94242 0.06053
15 15 1185.62592 0.82822 0.05320
16 16 1175.54013 0.81419 0.05230
17 17 1170.41211 0.80710 0.05184
18 18 1090.20196 0.70027 0.04498
19 19 1039.29190 0.63639 0.04088
20 20 1015.00116 0.60699 0.03899
21 21 1005.05773 0.59516 0.03823
22 22 986.55965 0.57345 0.03683
23 23 917.65537 0.49615 0.03187
24 24 842.93089 0.41863 0.02689
25 25 819.00146 0.39520 0.02538
26 26 758.39720 0.33888 0.02177
27 27 697.11173 0.28632 0.01839
28 28 628.75684 0.23292 0.01496
29 29 534.75856 0.16849 0.01082
30 30 499.35579 0.14692 0.00944
31 31 422.01320 0.10493 0.00674
32 32 409.30255 0.09870 0.00634
33 33 227.12411 0.03039 0.00195
33 2nd derivatives larger than 0.371D-04 over 561
MatLab is not a fan of text, so I'd like to not use text delimiters (though there are some in the header of this data section) and keep the data contained to only the numeric lines.
The data files contain a lot of other numbers as well, so I need to match the occurrence of "1 1" at the start of the line and "33 33" as the end of the copy. These 'indices' exist only in this block of info.
I attempted to use
% sed -n /"1 1"/,/"33 33"/p input.file > output.file
But I get a WHOLE BUNCH of data in the output file as it copies everything that shows up between any "1" and "33"
Is there any way to do what I'm looking for?
Also, I'm using the tcsh as that is what my servers run.
How about using awk
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{t=0}' file
As recommended by @mklement0, if there is only one block, you can avoid processing the remainder of the file by updating the command to:
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{exit}' file
Your problem is twofold. First, there are two blanks between the ones, but your regex only allows for one (judging from the now indented code). Second, you are probably not precise enough; the /1 1/ pattern matches 11 11, for example, and 111 111 and so on.
So, you should consider:
sed -n -e '/^ *1 *1 /,/^33 *33 /p' -e '/^33 33 /q' input.file > output.file
The patterns are anchored to the start of line by the ^ (caret). The numbers are separated by one or more blanks (there are other, longer-winded ways of writing that in standard sed; the + option is not standard sed but is widely available). And the numbers are terminated by a blank. The chances are that the first expression alone will give you what you want. The second expression terminates the search early when it recognizes the 33 33 input line, which can save a significant amount of file I/O and hence processing time if the input file is big enough.
If the lines with ID numbers in the hundreds have some different format, then it should be fairly straight-forward to tweak the regexes to match what is used. If the data contains tabs instead of (or as well as) blanks, you can tweak the regexes to manage that, too.
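For example, a blank-or-tab-tolerant variant of the first expression (just a sketch; adjust it to what your files actually contain) could look like:
sed -n '/^[[:blank:]]*1[[:blank:]][[:blank:]]*1[[:blank:]]/,/^[[:blank:]]*33[[:blank:]][[:blank:]]*33[[:blank:]]/p' input.file > output.file
Here [[:blank:]] matches either a space or a tab, and the doubled class followed by * means "one or more".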
If your data is all formatted exactly the same as this file, then you can use sed to just read the 3rd through the 35th lines (rows 1 1 through 33 33). This is a lot easier than parsing the values, but it does require that the files have a standard format:
sed -n 3,35p data.txt
Another cheap way would be to grep for only numeric lines, and take only the first 33:
grep "^[0-9 ][0-9 .-]*$" data.txt | head -n 33