Limit sed print section of file between 2 regexps to first occurrence - sed

I am parsing text weather data from http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly
and want to grab only the data for my county/area.
The trick is that each text report includes previous reports from earlier in the day, and I'm only interested in the latest one, which appears towards the beginning of the file.
I attempted to use the "print section of file between two regular expressions (inclusive)"
recipe from the sed one-liners, but I couldn't figure out how to get it to stop after the first occurrence:
sed -n '/OHZ061/,/OHZ062/p' /tmp/weather.html
I found this: "Sed print between patterns the first match result", which works with the following:
sed -n '/OHZ061/,$p;/OHZ062/q' /tmp/weather.html
but I feel it isn't the most robust of solutions. I don't have anything to back that up, just a gut feeling that there might be a more robust approach.
So, are there any better solutions out there? Also, is it possible to get my first attempt to work? If you post a solution, please explain all the switches/backreferences/magic, as I'm still discovering the power of sed and command-line tools.
And to help start you off:
wget -q "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly" -O /tmp/weather.html
PS: I looked at this post: http://www.unix.com/shell-programming-scripting/167069-solved-sed-awk-print-between-patterns-first-occurrence.html but the sed there was completely Greek to me and I couldn't muddle through it to make it work for my problem.

sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk:
awk '/OHZ061/{found=1} found{print; if(/OHZ062/) exit}' /tmp/weather.html
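Since the question asks for the magic to be explained, here is the same awk program spread out with comments (functionally identical):
awk '
    /OHZ061/ { found = 1 }      # first zone code seen: start printing
    found {                     # once inside the section...
        print                   # ...print every line (inclusive)
        if (/OHZ062/) exit      # second zone code seen: quit, so any
    }                           # later occurrences are never reached
' /tmp/weather.html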

Not sed, because I don't like to parse HTML with that tool, but here is a solution using Perl with the help of an HTML parser, HTML::TreeBuilder. The code is commented step by step; I think it's easy to follow.
Content of script.pl:
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;
##
## Get content of the web page.
##
open my $fh, '-|', 'wget -q -O- "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly"' or die;
##
## Parse content into a tree structure.
##
my $tree = HTML::TreeBuilder->new;
$tree->parse_file( $fh ) || die;
##
## Content is inside <pre>...</pre>, so search it in scalar context to get only
## the first one (the newest).
##
my $weather_data = $tree->find_by_tag_name( 'pre' )->as_text or die;
##
## Split data on "$$" and discard all tables of weather info but the first one.
##
my $last_weather_data = (split /(?m)^\$\$/, $weather_data, 2)[0];
##
## Remove all data until the pattern "OHZ" followed by digits is found in the text.
##
$last_weather_data =~ s/\A.*(OHZ\d{3}.*)\z/$1/s;
##
## Print result.
##
printf qq|%s\n|, $last_weather_data;
Run it like:
perl script.pl
And at 23:00 on 14-March-2013 it yields:
OHZ001>008-015>018-024>027-034-035-043-044-142300-
NORTHWEST OHIO
CITY SKY/WX TMP DP RH WIND PRES REMARKS
DEFIANCE MOSUNNY 41 18 39 W7G17 30.17F
FINDLAY SUNNY 39 21 48 W13 30.17F
TOLEDO EXPRESS SUNNY 41 19 41 W14 30.16F
TOLEDO METCALF MOSUNNY 42 21 43 W9 30.17S
LIMA MOSUNNY 38 22 52 W12 30.18S

Related

Extract everything between first and last occurrence of the same pattern in a single iteration

This question is very much the same as this one, except that I am looking to do it as fast as possible, doing only a single pass of the (unfortunately gzip-compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, as regular expressions these days provide lookaheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;

my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
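For readability, the same regex can be written with the /x modifier, which permits whitespace and comments inside the pattern (functionally identical):
zcat file.gz | perl -0777ne '/(
    (^.*CAPTURE.*$)   # group 2: a whole line containing CAPTURE
    (?s:.*)           # anything at all, newlines included (greedy)
    (?2)              # one more line matching the group 2 pattern
    (?:\z|\n)         # up to end-of-input or a trailing newline
)/mx and print $1'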
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know what memory complications this would have for a large file; you might prefer a streaming solution.
You can use sed with the following approach:
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the hold area with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed hold space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think a regex will be faster than a double scan...
Here is an awk solution (double scan):
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively, if you have any a priori information on where the patterns appear in the file, you can use heuristic approaches that will be faster in those special cases; one such sketch follows.
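For example, if you knew that every match occurs within the first 50000 lines (a made-up bound, purely for illustration), you could decompress that region once and run the double scan on it alone:
zcat file.gz | head -n 50000 > /tmp/region &&
awk '/CAPTURE/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' /tmp/region /tmp/region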
Here is one more example with a regex (the downside is that, if files are large, it will consume a lot of memory):
#!/usr/bin/perl
my $string;
{
    local $/ = undef;
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or as a one-liner:
perl -0777ne 'print $1 if /([^\n]+(CAPTURE).*\2.*?)\n/s' file.tmp
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the previous two instructions. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the previous three instructions.
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
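To make that concrete: for the sample input above, the matches sit on lines 100, 150 and 200, and the context line after the last match is 201, so the pipeline effectively reduces to
csplit -s oldFile 100 201 && rm xx0{0,2} && mv xx01 newFile
leaving xx01 (the kept file) holding lines 100 through 200 inclusive.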
When run against the test files (as above) I get timings around 10 seconds.
When I posted this question, the problem I had at hand was that I had several huge gzip-compressed log files generated by a Java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to that event from these files. The problem couldn't be solved with a trivial grep for the EventId, simply because the exception lines can be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be an exception, in which case the answers posted here would not print the stack trace lines. However, it wasn't hard to modify haukex's solution to cover this case as well:
#!/usr/bin/env perl
use warnings;
use strict;

my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /EventId/ or ($first == 0 and $line !~ /\(AppName\)/) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
I am still wondering whether the faster solutions (mainly Walter's sed solution or haukex's in-memory Perl solution) could be modified to do the same.

Function name inside parentheses in Perl one-liner

I'm working on a Perl one-liner tutorial and there are one-liners like this:
ls -lAF | perl -e 'while (<>) {next if /^[dt]/; print +(split)[8] . " size: " . +(split)[4] . "\n"}'
As you can see, the function name split appears inside parentheses. Documentation about this use of functions is hard to find on Google, so I couldn't find any information on it. Could somebody explain it? Thank you.
It probably doesn't help that this use of split is defaulting everything - it's splitting $_ on whitespace and returning a list of values.
The (...)[8] is called a list slice, and it filters out all but the 9th value returned by split. The preceding plus is there to prevent Perl from misparsing the parentheses as part of a function call, which also means you don't need it on the second instance.
So print +(split)[8]; is basically a very succinct way of writing
my @results = split ' ', $_;
print $results[8];
The example you've included performs the split twice, so it might be more efficient to use the more verbose version, as you can get $results[4] from the above without any extra effort.
Or, because you can put a list of indices inside the [], you could do the split once and use printf to format the output, like this:
printf "%s size: %s\n", (split)[8,4];
In my opinion you should be avoiding this author's advice, both for the reasons laid out in my comments on your question, and because they don't appear to know their topic at all well.
The original "one-liner" was this
ls -lAF | perl -e 'while (<>) {next if /^[dt]/; print +(split)[8] . " size: " . +(split)[4] . "\n"}'
This could be written much more succinctly by using the -n and -a options, giving this
ls -lAF | perl -wane 'print "$F[8] size: $F[4]\n" unless /^[dt]/'
Even without the "luxury" of these options you could write
ls -lAF | perl -e '/^[dt]/ or printf "%s size: %s\n", (split)[8,4] while <>'
I recommend that you go and read the Camel Book several times over the next few years. That is the best way to learn the language that I have found.
Most installations of Perl include a full set of documentation, accessible using the perldoc command.
You need to read the Slices section of perldoc perldata, which explains this use of slicing very clearly.
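For example:
perldoc perldata        # data types, including the Slices section
perldoc -f split        # the documentation for split itself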

Convert GMT datetime strings to UTC epochs in 1000s of large CSV files

I have 1000s of CSV files consisting of millions of rows that contain integers, floats, nullable integers, and 2 types of GMT datetime string formats. Below is an example of such a row from one of the files:
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
I'm interested in the quickest way to convert (in place) both types of GMT datetime formatted strings into UTC epochs.
For example, the above row would be converted into:
1455938740,3,,87,340.3456,1368835200,1400457600,4,6
Suppose the files are isolated, so they can all be gathered by *.csv.
Is there a way I could do this with linux commands? If not, what would you suggest then?
Updated Answer
With thanks to @Borodin's insights, my best solution would now be like this:
perl -MTime::Local -plne '
s|(\d+)\/(\d+)\/(\d+) (\d+):(\d+)|timegm(0,$5,$4,$2,$1-1,$3)|ge ;
s|(\d+)\/(\d+)\/(\d+)|timegm(0,0,0,$2,$1-1,$3)|ge' file.csv
And if that can be debugged and found to work, I would incorporate it into GNU Parallel like this:
function doit(){
tmp=temp_$$
perl -MTime::Local -plne '
s|(\d+)\/(\d+)\/(\d+) (\d+):(\d+)|timegm(0,$5,$4,$2,$1-1,$3)|ge;
s|(\d+)\/(\d+)\/(\d+)|timegm(0,0,0,$2,$1-1,$3)|ge' "$1" >> $tmp && mv $tmp "$1"
}
export -f doit
find . -name \*.csv -print0 | parallel -0 doit {}
Original Answer
I'm afraid I am going to give you a very powerful fishing rod (more of a harpoon) rather than a ready-cooked fish supper, but I think you'll be able to work it out quite easily.
First, if you use the Time::Local module in Perl, you can pass it the seconds, minutes, hours, days, months and year and it will tell you the corresponding Epoch seconds:
# So, for 02:10:01 local time on 1st June 2016 (months are zero-based, so 5 means June), you can do:
perl -MTime::Local -e 'print timelocal(1,10,2,1,5,2016)'
1464743401
Second, if you start Perl with the -plne switches, it will effectively apply the code you supply to each and every line of the input file, print the result, and sort out all line endings for you - somewhat akin to how awk loops over input files. So, if your file is called file.csv and looks like this:
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
2/21/2013 3:25,3,,87,340.3456,4/20/2013,6/20/2015,4,6
and you run a null program, it will just echo the input file:
perl -MTime::Local -plne '' file.csv
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
2/21/2013 3:25,3,,87,340.3456,4/20/2013,6/20/2015,4,6
If we now do a substitution and replace all commas with elephants:
perl -MTime::Local -plne 's/,/elephant/g' file.csv
2/20/2016 3:25elephant3elephantelephant87elephant340.3456elephant5/18/2013elephant5/19/2014elephant4elephant6
2/21/2013 3:25elephant3elephantelephant87elephant340.3456elephant4/20/2013elephant6/20/2015elephant4elephant6
That seems to work - now you can also do what I call a "computed replacement" - I don't know what real Perl-folk call it. Anyway, you use an e modifier flag after the replacement to execute that code and calculate the replacement text:
perl -MTime::Local -plne 's|(\d+)\/(\d+)\/(\d+)|timelocal(0,0,0,$2,$1,$3)|ge' file.csv
1458432000 3:25,3,,87,340.3456,1371510000,1403132400,4,6
1363824000 3:25,3,,87,340.3456,1369004400,1437346800,4,6
And - in case you missed it - that is the answer. The (\d+) is a regex for "one or more digits" and the fact it is in parentheses means it is captured. The first such group is captured as $1, the second as $2 and so on. So, I am basically looking for one or more digits that I save as $1, followed by a slash then 1 or more digits that I capture as $2 followed by a slash and 1 or more digits that I capture as $3. Then, in the replacement part, I use the captured groups to formulate a date. The g modifier means I do ALL occurrences on each line.
I'll leave you to add further capture groups for the 24-hour time and put that into the timelocal() call.
The capture groups I have given are a little loose too - you may want
\d{1,2}\/\d{1,2}\/\d{4}
or something to mean 1 or 2 digits for the month, 1 or 2 digits for the day and exactly 4 digits for the year. You can look that up!
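For example, the tighter pattern might slot into the date-only substitution like this (a sketch using timegm and the zero-based month, as in the updated answer above; as there, the date+time substitution should run first, and you should test it against your own data):
perl -MTime::Local -plne 's|(\d{1,2})/(\d{1,2})/(\d{4})|timegm(0,0,0,$2,$1-1,$3)|ge' file.csv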
When you get that working, if you have thousands of files, I would suggest you use GNU Parallel to do the files in parallel. Try looking at my other answers on here, or Ole Tange's as he wrote it, and you will see something like:
function doit(){
perl -plne '...' $1 ...
}
export -f doit
find . -name \*.csv -print0 | parallel -0 doit {}
As regards doing it in place, I think you will need to use a technique like this inside the doit() function. Basically it writes a new file and then, only if the Perl part worked (&& does that bit), it overwrites the original file with the temporary one:
tmp=$(mktemp ...)
perl -plne '...' "$1" > $tmp && mv $tmp "$1"
I suggest you make a backup before you do anything else - there is a lot to go wrong here. Good luck!
P.S. If you edit the tags under your question and add perl, I guess some Perl guru will help you out and maybe put the finishing touches on my suggestions and enlighten me/us as to what the real name is for the e modifier that does a "computed replacement".
Update
As hinted by Mark Setchell, the timegm function from Time::Local is likely to be faster than the string parsing that Time::Piece provides.
Here's a rewrite of my original solution which uses that module. The output is identical to that of the original.
use strict;
use warnings 'all';
use Time::Local 'timegm';

while ( <DATA> ) {
    chomp;
    my @fields = split /,/;
    for ( @fields ) {
        next unless m{/};
        my ($mn, $dy, $yr, $h, $m, $s) = (/\d+/g, 0, 0, 0);
        $_ = timegm($s, $m, $h, $dy, $mn-1, $yr);
    }
    print join(',', @fields), "\n";
}
__DATA__
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
output
1455938700,3,,87,340.3456,1368835200,1400457600,4,6
Original post
The Time::Piece module is small and quite fast. Here's a sample program that transforms your sample data.
The algorithm is a simple one: any field that doesn't contain a slash / is left alone; otherwise it is assumed to be a date/time field if it also contains a colon :, or just a date field if not.
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece ();

while ( <DATA> ) {
    chomp;
    my @fields = split /,/;
    for ( @fields ) {
        next unless m{/};
        my $fmt = /:/ ? '%m/%d/%Y %H:%M' : '%m/%d/%Y';
        $_ = Time::Piece->strptime($_, $fmt)->epoch;
    }
    print join(',', @fields), "\n";
}
__DATA__
2/20/2016 3:25,3,,87,340.3456,5/18/2013,5/19/2014,4,6
output
1455938700,3,,87,340.3456,1368835200,1400457600,4,6
The first field, 1455938700, differs from your own expected output, 1455938740, by forty seconds. That's odd, as there is no seconds value in the original data, and 1455938700 is exactly divisible by 60 whereas 1455938740 is not. So I stand by my computation.

How to display lines in a file containing more than 5 commas per line using egrep or awk

I have lines in the following format:
(sample data was posted as an image; it appears to be CSV)
Help is required to display only the lines containing more than 5 commas, in a separate file.
perl has a tr (translate) operator that returns the number of translations that occurred. We can use this to count the occurrences of a character in a string.
cat file.txt | perl -ne 'print if tr/,// > 5'
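The cat is not needed, by the way, since perl reads any files named on its command line. And since the question asks for the matching lines in a separate file, you can simply redirect the output (the output filename here is just an example):
perl -ne 'print if tr/,// > 5' file.txt > more_than_5_commas.txt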
Using egrep:
egrep '([^,]*,){6,}'
Using awk:
awk -F, 'NF>5{print}'
Using a sed which has an "extended regular expression option" (I'll assume -r here, but it could be -E):
sed -n -r -e '/([^,]*,){6,}/p'
Of course you have to be careful what you ask for. For example, if you have a CSV file with commas embedded within "values", and if you only want lines with more than five "values", then things get a little trickier for tools that are not CSV-aware.
The text in the image looks like CSV with quoted fields.
In that case, use awk with the field separator set to the three-character string "," so that bare commas inside quoted values do not create extra fields:
awk -F'","' 'NF>5{print}'
like peak's answer above.
I think you already have answers to your raw question here. However, if what you're really asking is how to find rows with more than 5 CSV fields, then I think you need something like Perl's Text::CSV module.
An example of this is the following string:
one,two,three,four,five,"six,seven"
This has six commas but only five fields. Do you want to see this line, or do you want to skip it? If you want to see it (as an exception -- a line with more than five commas), then use one of the methods already suggested.
If you don't, then you really want a CSV parser, and Perl's is quite nice -- more lightweight and easier than most languages, in my opinion:
use strict;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } );
open my $IN, "<:encoding(utf8)", "file.csv" or die;
while (my $row = $csv->getline($IN)) {
    if (@$row > 5) {
        $csv->combine(@$row);
        print $csv->string(), "\n";
    }
}
close $IN;

How do I best pass arguments to a Perl one-liner?

I have a file, someFile, like this:
$cat someFile
hdisk1 active
hdisk2 active
I use this shell script to check:
$cat a.sh
#!/usr/bin/ksh
for d in 1 2
do
grep -q "hdisk$d" someFile && echo "$d : ok"
done
I am trying to convert it to Perl:
$cat b.sh
#!/usr/bin/ksh
export d
for d in 1 2
do
cat someFile | perl -lane 'BEGIN{$d=$ENV{'d'};} print "$d: OK" if /hdisk$d\s+/'
done
I export the variable d in the shell script and get the value using %ENV in Perl. Is there a better way of passing this value to the Perl one-liner?
You can enable rudimentary command-line argument handling with the -s switch. A variable gets defined for each argument starting with a dash. The -- tells Perl where your command-line arguments start.
for d in 1 2 ; do
cat someFile | perl -slane ' print "$someParameter: OK" if /hdisk$someParameter\s+/' -- -someParameter=$d;
done
See: perlrun
Sometimes breaking out of the Perl quoting is a good trick for these one-liners:
for d in 1 2 ; do cat kk2 | perl -lne ' print "'"${d}"': OK" if /hdisk'"${d}"'\s+/';done
Pass it on the command line, and it will be available in @ARGV:
for d in 1 2
do
perl -lne 'BEGIN {$d=shift} print "$d: OK" if /hdisk$d\s+/' $d someFile
done
Note that the shift operator in this context removes the first element of @ARGV, which is $d in this case.
Combining some of the earlier suggestions and adding my own sugar to it, I'd do it this way:
perl -se '/hdisk([$d])/ && print "$1: ok\n" for <>' -- -d='[value]' [file]
[value] can be a number (e.g. 2), a range (e.g. 2-4), a list of different numbers (e.g. 2|3|4), or almost anything else that's a valid pattern - or even a bash variable containing one of those. Example:
d='2-3'
perl -se '/hdisk([$d])/ && print "$1: ok\n" for <>' -- -d=$d someFile
and [file] is your filename (that is, someFile).
If you are having trouble writing a one-liner, maybe it is a bit hard for one line (just my opinion). I would agree with @FM's suggestion and do the whole thing in Perl. Read the whole file in and then test it:
use strict;
local $/; # Slurp mode: read in the whole file
my $file = <>;
for my $d ( 1 .. 2 )
{
    print "$d: OK\n" if $file =~ /hdisk$d\s+/;
}
You could do it looping, but that would be longer. Of course it somewhat depends on the size of the file.
Note that all the Perl examples so far will print a message for each match - can you be sure there are no duplicates?
My solution is a little different. I came to your question via a Google search for its title, but I'm trying to do something different. Here it is in case it helps someone:
FYI, I was using tcsh on Solaris.
I had the following one-liner:
perl -e 'use POSIX qw(strftime); print strftime("%Y-%m-%d", localtime(time()-3600*24*2));'
which outputs the value:
2013-05-06
I was trying to place this into a shell script so I could create a file with a date in the filename from X days in the past. I tried:
set dateVariable=`perl -e 'use POSIX qw(strftime); print strftime("%Y-%m-%d", localtime(time()-3600*24*$numberOfDaysPrior));'`
But this didn't work due to variable substitution. I had to mess around with the quoting to get it interpreted properly. I tried enclosing the whole lot in double quotes, but this made the Perl command syntactically incorrect, as it clashed with the double quotes around the date format. I finished up with:
set dateVariable=`perl -e "use POSIX qw(strftime); print strftime('%Y-%m-%d', localtime(time()-3600*24*$numberOfDaysPrior));"`
Which worked great for me, without having to resort to any fancy variable exporting.
I realise this doesn't exactly answer your specific question, but it answered the title and might help someone else!
That looks good, but I'd use:
for d in $(seq 1 2); do perl -nle 'print "hdisk$ENV{d} OK" if $_ =~ /hdisk$ENV{d}/' someFile; done
This is already covered above in one long paragraph, but I'm writing it again for lazy developers who don't read those lines.
Double quotes and single quotes have very different meanings in bash, so please take care:
Doesn't work: perl '$VAR' $FILEPATH
Works:        perl "$VAR" $FILEPATH
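A minimal demonstration of the difference, using -e so that the code passed to perl lives in a variable (the variable name and code are made up for illustration):
VAR='print "hello\n"'
perl -e "$VAR"    # works: bash expands $VAR, so perl sees and runs the code
perl -e '$VAR'    # prints nothing: perl receives the literal string $VAR, not the code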