How does adding to none work? - perl

While trying to reduce some default hash code, I came to discover that you could add to none to produce none or to produce what you're adding to it. Is there a particular reason for this? Will this change on different architectures or can I rely on this ability?
DB<1> print none + 1
DB<2> print 1 + none
1
And just for those who are curious, this is how I'm using it:
foreach (@someArray) {
    unless ($someHash{$_}++) {
        $someHash{$_} = 1;
    }
}
as a reduction for
foreach (@someArray) {
    if (exists $someHash{$_}) {
        $someHash{$_}++;
    } else {
        $someHash{$_} = 1;
    }
}

You are not doing what you think you are doing. These two statements:
print none + 1
print 1 + none
are not as straightforward as you might think. Because you have warnings turned off, you do not see what they actually do. Let's try them at the command prompt with warnings turned on (the -w switch):
$ perl -lwe'print none + 1'
Unquoted string "none" may clash with future reserved word at -e line 1.
Name "main::none" used only once: possible typo at -e line 1.
print() on unopened filehandle none at -e line 1.
$ perl -lwe'print 1 + none'
Unquoted string "none" may clash with future reserved word at -e line 1.
Argument "none" isn't numeric in addition (+) at -e line 1.
1
In the first case, none, which is a bareword, is interpreted as a filehandle, and the print statement fails because we never opened a filehandle with that name. In the second case, the bareword none is interpreted as a string, which the addition operator + converts to a number, and that number is 0.
You can further clarify this by supplying a specific file handle for the first case:
$ perl -lwe'print STDOUT none + 1'
Unquoted string "none" may clash with future reserved word at -e line 1.
Argument "none" isn't numeric in addition (+) at -e line 1.
1
This demonstrates that, once the filehandle ambiguity is removed, there is no real difference between none + 1 and 1 + none.


Find and print two blocks of lines from file with sed in one pass

I am trying to come up with an sed command to find and print two blocks of variable number of lines from a text file that look like this:
...
INFO first block to match
id: "value"
...
last line of the first block
INFO next irrelevant block
id: "different value"
...
INFO second block to match
id: "value"
...
last line of the second block
...
I only have prior knowledge of the id value and the fact that each block starts with a line that has "INFO". I want to match each block from that first line and not include the first line of the next block in the output:
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
Ideally I would prefer to do it in a single pass, not have the file scanned multiple times from top to bottom. Currently I have this (it only matches the first block, and I need both):
sed -n -e "/INFO/{"'$!'"{N;/INFO.*id: \"value\"/{:l;p;n;/^[^\\[]/bl;}}}" file.log
EDIT
Linebreak between blocks is certainly nice, but entirely optional.
EDIT 2
Please note that INFO and id: "value" do not have to be in the beginning of the line, and all other words in my example are arbitrary and not known in advance. There can be any number of blocks (including 0) between and around the ones I need to match.
sed is powerful, terse, and dumb. awk is smarter!
awk '/^INFO/{f = /match/? 1: 0} f'
edit: I see you want a linebreak between each "block"; will update if I find a tighter way:
awk '/^INFO/{f = /match/? 1: 0; if(i++) $0 = RS $0} f'
/^INFO/{action}: Execute {action} only on lines beginning with "INFO"
variable = if ? then : else: Conditional Expression (ternary operator)
if(i++): The first time this is evaluated, i will be zero, thus the expression will be false. This prevents an extra line break at the first block.
$0 = RS $0: Prepend a Record Separator (newline) to $0 (entire record)
f: If f is nonzero, the default action {print $0} is implied.
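To see the flag idiom in action, here is a small sketch run against input modelled on the question's sample. The temporary file is only for illustration. Note that this version keys on the word "match" in the INFO line, which exists only in the sample data, so it shows the mechanism rather than a complete solution:

```shell
# Hypothetical sample input modelled on the question's example.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO first block to match
id: "value"
last line of the first block
INFO next irrelevant block
id: "different value"
INFO second block to match
id: "value"
last line of the second block
EOF

# f is recomputed on every INFO line and stays in effect until the next one;
# the bare pattern f prints the current line whenever f is nonzero.
flagged=$(awk '/^INFO/ { f = /match/ ? 1 : 0 } f' "$log")
printf '%s\n' "$flagged"
rm -f "$log"
```

The irrelevant block is skipped because its INFO line resets f to 0.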
This might work for you (GNU sed):
sed -nE ':a;/^INFO/{N;/^id: "value"/M!D;:b;H;$!{n;/^INFO/!bb};x;s/^/x/;/^x{2}/{s/^x*.//p;q};x;ba}' file
This solution stores the required blocks in the hold space, prefixed by a counter. Once the required number of blocks are stored the counters are removed, the blocks printed and the process quit.
The solution (based only on the input provided) supposes that an id (if it exists) always follows the INFO line.
Here is an alternative solution using a combination of sed and awk. It allows you to parse the input blockwise or recordwise. This approach relies on setting awk record separator (RS) to the empty string which makes awk read a full block in at a time.
So there are 2 steps:
Make the input record-parsable.
Process each record.
For your example this could look something like this:
sed '1!s/^INFO/\n&/' infile | awk '/id: "value"/' RS= ORS='\n\n'
Output:
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
awk is good for this, and if you could set RS to a multi-character expression it would be ideal. (gnu awk allows this, but why bother with gnu awk when there is perl?)
perl -wnle 'BEGIN{$/="INFO"; undef $\} print "$/$_" if m/id: \"value\"/' input
Basically, this sets the record separator ($/) to the string "INFO" (so now each of your "records" is a "line" to perl). If the record matches the pattern id: "value", it is printed with "INFO" prepended to the start. (Without -l, perl would retain the record separator at the end of each record, which is not quite what you want.) By omitting the "undef $\", you can get an extra newline between records. Some code golf could probably cut the length of this in half, but my perl is a bit rusty. Waiting for the shorter version in comments.
This may or may not be what you want depending on what your real data looks like:
$ awk '/INFO/{info=$0; f=0} /id: "value"/{print info; f=1} f' file
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
or if you want to do more with each block than just print it as you go then some variation of this is better:
$ awk '
/INFO/ { prt() }
{ block = block $0 ORS }
END { prt() }
function prt() {
if (block ~ /id: "value"/) {
printf "%s", block
}
block=""
}
' file
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
The above will behave the same using any awk in any shell on any UNIX box.

split does not return empty elements

Why do these not all return bbb?
$ perl -e '$a="  "; print map { "b" } split / /, $a;'
<<nothing>>
$ perl -e '$a=",,"; print map { "b" } split /,/, $a;'
<<nothing>>
$ perl -e '$a="  a"; print map { "b" } split / /, $a;'
bbb
$ perl -e '$a=",,a"; print map { "b" } split /,/, $a;'
bbb
I would have expected split to return an array with 3 elements in all cases.
$ perl -V
Summary of my perl5 (revision 5 version 24 subversion 1) configuration:
split's third parameter says how many elements to produce:
split /PATTERN/,EXPR,LIMIT
...
If LIMIT is negative, it is treated as if it were instead arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved); if all fields are empty, then all fields are considered to be trailing (and are thus stripped in this case).
It defaults to 0, which means as many as possible but leaving off any trailing empty elements.
You can pass -1 as the third argument to split to suppress this behavior.

Append to line that is preceded AND followed by empty line

I need to append an asterisk to a line, but only if said line is preceded and followed by empty lines (FYI, said empty lines will NOT have any white space in them).
Suppose I have the following file:
foo
foo
foo
foo
foo
I want the output to look like this:
foo
foo
foo
foo*
foo
I tried modifying the following awk command (found here):
awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
to suit my uses, but got all tied up in knots.
Sed or Perl solutions are, of course, welcome also!
UPDATE:
It turned out that the question I asked was not quite correct. What I really needed was code that would append text to non-empty lines that do not start with whitespace AND are followed, two lines down, by non-empty lines that also do not start with whitespace.
For this revised problem, suppose I have the following file:
foo
third line foo
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foo
eleventh line foo
this line starts with a space foo
last line foo
I want the output to look like this:
foobar
third line foobar
fifth line foo
this line starts with a space foo
this line starts with a space foo
ninth line foobar
eleventh line foo
this line starts with a space foo
last line foo
For that, this sed one-liner does the trick:
sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\o\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile
Thanks to Benjamin W.'s clear and informative answer below, I was able to cobble this one-liner together!
A sed solution:
$ sed '1N;N;s/^\(\n.*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
foo
foo*
foo
N;P;D is the idiomatic way to look at two lines at the same time by appending the next one to the pattern space, then printing and deleting the first line.
1N;N;P;D extends that to always having three lines in the pattern space, which is what we want here.
The substitution matches if the first and last line are empty (^\n and \n$) and appends one * to the line between the empty lines.
Notice that this matches and appends a * also for the second line of three empty lines, which might not be what you want. To make sure this doesn't happen, the first capture group has to have at least one non-whitespace character:
sed '1N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
Question from comment
Can we not append the * if the line two above begins with abc?
Example input file:
foo
foo
abc
foo
foo
foo
foo
There are three foo between empty lines, but the first one should not get the * appended because the line two above starts with abc. This can be done as follows:
$ sed '1{N;N};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
foo
foo
abc
foo
foo*
foo*
foo
This keeps four lines at a time in the pattern space and only makes the substitution if the pattern space does not start with abc:
1 { # On the first line
N # Append next line to pattern space
N # ... again, so there are three lines in pattern space
}
N # Append fourth line
/^abc/! # If the pattern space does not start with abc...
s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/ # Append '*' to 3rd line in pattern space
P # Print first line of pattern space
D # Delete first line of pattern space, start next cycle
Two remarks:
BSD sed requires an extra semicolon: 1{N;N;} instead of 1{N;N}.
If the first and third line of the file are empty, the second line does not get an asterisk appended because we only start checking once there are four lines in the pattern space. This could be solved by adding an extra substitution into the 1{} block:
1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/}
(remember the extra ; for BSD sed), but trying to cover all edge cases makes sed even less readable, especially in one-liners:
sed '1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
One way to think about these problems is as a state machine.
start: state = 0
0: /* looking for a blank line */
    if (blank line) state = 1
1: /* leading blank line(s) */
    if (not blank line) {
        nonblank = line
        state = 2
    }
2: /* saw non-blank line */
    if (blank line) {
        output nonblank*
        state = 1
    } else {
        state = 0
    }
And we can translate this pretty directly to an awk program:
BEGIN {
    state = 0;                      # start in state 0
}
state == 0 {                        # looking for a (leading) blank line
    print;
    if (length($0) == 0) {          # found one
        state = 1;
        next;
    }
}
state == 1 {                        # have a leading blank line
    if (length($0) > 0) {           # found a non-blank line
        saved = $0;                 # save it
        state = 2;
        next;
    } else {
        print;                      # multiple leading blank lines (ok)
    }
}
state == 2 {                        # saw the non-blank line
    if (length($0) == 0) {          # followed by a blank line
        print saved "*";            # BINGO!
        state = 1;                  # to the saw-a-blank-line state
    } else {                        # nope, consecutive non-blank lines
        print saved;                # as-is
        state = 0;                  # to the looking-for-a-blank-line state
    }
    print;
    next;
}
END {                               # cleanup, might have something saved to show
    if (state == 2) print saved;
}
This is not the shortest way, nor likely the fastest, but it's probably the most straightforward and easy to understand.
EDIT
Here is a comparison of Ed's way and mine (see the comments under his answer for context). I replicated the OP's input a million-fold and then timed the runs:
# ls -l
total 22472
-rw-r--r--. 1 root root 111 Mar 13 18:16 ed.awk
-rw-r--r--. 1 root root 23000000 Mar 13 18:14 huge.in
-rw-r--r--. 1 root root 357 Mar 13 18:16 john.awk
# time awk -f john.awk < huge.in > /dev/null
2.934u 0.001s 0:02.95 99.3% 0+0k 112+0io 1pf+0w
# time awk -f ed.awk huge.in huge.in > /dev/null
14.217u 0.426s 0:14.65 99.8% 0+0k 272+0io 2pf+0w
His version took about 5 times as long, did twice as much I/O, and (not shown in this output) took 1400 times as much memory.
EDIT from Ed Morton:
For those of us unfamiliar with the output of whatever time command John used above, here's the 3rd-invocation results from the normal UNIX time program on cygwin/bash using GNU awk 4.1.3:
$ wc -l huge.in
1000000 huge.in
$ time awk -f john.awk huge.in > /dev/null
real 0m1.264s
user 0m1.232s
sys 0m0.030s
$ time awk -f ed.awk huge.in huge.in > /dev/null
real 0m1.638s
user 0m1.575s
sys 0m0.030s
so if you'd rather write 37 lines than 3 lines to save a third of a second on processing a million line file then John's answer is the right one for you.
EDIT#3
It's the standard "time" built-in from tcsh/csh. And even if you didn't recognize it, the output should be intuitively obvious. And yes, boys and girls, my solution can also be written as a short incomprehensible mess:
s == 0 { print; if (length($0) == 0) { s = 1; next; } }
s == 1 { if (length($0) > 0) { p = $0; s = 2; next; } else { print; } }
s == 2 { if (length($0) == 0) { print p "*"; s = 1; } else { print p; s = 0; } print; next; }
END { if (s == 2) print p; }
Here's a perl filter version, for the sake of illustration; hopefully it's clear how it works. It would be possible to write a version with a lower input-output delay (2 lines instead of 3), but I don't think that's important.
my @lines;
while (<>) {
    # Keep three lines in the buffer, print them as they fall out
    push @lines, $_;
    print shift @lines if @lines > 3;
    # If a non-empty line occurs between two empty lines...
    if (@lines == 3 && $lines[0] =~ /^$/ && $lines[2] =~ /^$/ && $lines[1] !~ /^$/) {
        # place an asterisk at the end
        $lines[1] =~ s/$/*/;
    }
}
# Flush the buffer at EOF
print @lines;
A perl one-liner
perl -0777 -lne's/(?<=\n\n)(.*?)(\n\n)/$1\*$2/g; print' ol.txt
The -0777 "slurps" in the whole file, assigned to $_, on which the (global) substitution is run and which is then printed.
The lookbehind (?<=text) is needed for repeating patterns, [empty][line][empty][line][empty]. It is a "zero-width assertion" that only checks that the pattern is there without consuming it. That way the pattern stays available for next matches.
Such consecutive repeating patterns trip up the s/(\n\n)(.*?)(\n\n)/$1$2\*$3/ substitution posted initially: the trailing \n\n has just been consumed by one match, so it is not available as the start of the very next one.
Update: My solution also fails after two consecutive matches, as described above, and needs the same lookbehind: s/(?<=\n\n)(\w+)\n\n/$1*\n\n/mg;
The easiest way is to use multi-line match:
local $/; ## slurp mode
$file = <DATA>;
$file =~ s/\n\n(\w+)\n\n/\n\n$1*\n\n/mg;
print $file;
__DATA__
foo
foo
foo
foo
foo
It's simplest and clearest to do this in 2 passes:
$ cat tst.awk
NR==FNR { nf[NR]=NF; nr=NR; next }
FNR>1 && FNR<nr && NF && !nf[FNR-1] && !nf[FNR+1] { $0 = $0 "*" }
{ print }
$ awk -f tst.awk file file
foo
foo
foo
foo*
foo
The above takes one pass to record the number of fields on each line (NF is zero for an empty line) and then the second pass just checks your requirements - the current line is not the first or last in the file, it is not empty and the lines before and after are empty.
alternative awk solution (single pass)
$ awk 'NR>2 && !pp && !NF {p=p"*"}
NR>1{print p}
{pp=length(p);p=$0}
END{print p}' foo
foo
foo
foo
foo*
foo
Explanation: printing is deferred by one line so that the decision can be made once the next line has been seen. p holds the previous line and pp holds the length of the line before that (zero length means empty). After the bookkeeping assignments, the END block prints the last line.

find the line number where a specific word appears with “sed” on tcl shell

I need to search for a specific word in a file starting from specific line and return the line numbers only for the matched lines.
Let's say I want to search a file called myfile for the word my_word and then store the returned line numbers.
In a shell script, the command
sed -n '10,$ { /$my_word /= }' $myfile
works fine, but how do I write that command in the Tcl shell?
% exec sed -n '10,$ { /$my_word/= }' $file
extra characters after close-brace.
I want to add that the following command works fine in the Tcl shell, but it starts from the beginning of the file:
% exec sed -n "/$my_word/=" $file
447431
447445
448434
448696
448711
448759
450979
451006
451119
451209
451245
452936
454408
I have solved the problem as follows:
set lineno 10
if { ![catch {exec sed -n "/$new_token/=" $file} lineFound] && [string length $lineFound] > 0 } {
    set lineNumbers [split $lineFound "\n"]
    foreach num $lineNumbers {
        if {$num >= $lineno} {
            lappend col $num
        }
    }
}
I still can't find a single command that solves the problem. Any suggestions?
One thing I don't understand: is the text you are looking for stored in the variable called my_word, or is it the literal value my_word?
In your line
% exec sed -n '10,$ { /$my_word/= }' $file
I'd say it's the first case. So you have before it something like
% set my_word wordtosearch
% set file filetosearchin
Your mistake is to use the single quote character ' to enclose the sed expression. That character is an enclosing operator in sh, but has no meaning in Tcl.
You use it in sh to group many words in a single argument that is passed to sed, so you have to do the same, but using Tcl syntax:
% set my_word wordtosearch
% set file filetosearchin
% exec sed -n "10,$ { /$my_word/= }" $file
Here, you use the "..." to group.
You don't escape the $ in $my_word because you want $my_word to be substituted with the string wordtosearch.
I hope this helps.
After some trial and error I came up with:
set output [exec sed -n "10,\$ \{ /$myword/= \}" $myfile]
# Do something with the output
puts $output
The key is to escape characters that are special to Tcl, such as the dollar sign and curly braces.
Update
Per Donal Fellows, we do not need to escape the dollar sign:
set output [exec sed -n "10,$ \{ /$myword/= \}" $myfile]
I have tried the new revision and found it works. Thank you, Donal.
Update 2
I finally gained access to a Windows 7 machine, installed Cygwin (which includes sed and tclsh). I tried out the above script and it works just fine. I don't know what your problem is. Interestingly, the same script failed on my Mac OS X system with the following error:
sed: 1: "10,$ { /ipsum/= }": extra characters at the end of = command
while executing
"exec sed -n "10,$ \{ /$myword/= \}" $myfile"
invoked from within
"set output [exec sed -n "10,$ \{ /$myword/= \}" $myfile]"
(file "sed.tcl" line 6)
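I guess this is a difference between GNU sed on Linux and BSD sed: judging by the error message, BSD sed wants a semicolon or newline after the = command before the closing brace.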
Update 3
I have tried the same script under Linux/Tcl 8.4 and it works. That might mean Tcl 8.4 has nothing to do with it. Here is something else that might help: Tcl comes with a package called fileutil, which is part of the tcllib. The fileutil package contains a useful tool for this case: fileutil::grep. Here is a sample on how to use it in your case:
package require fileutil
proc grep_demo {myword myfile} {
    foreach line [fileutil::grep $myword $myfile] {
        # Each line is in the format:
        # filename:linenumber:text
        set lineNumber [lindex [split $line :] 1]
        if {$lineNumber >= 10} { puts $lineNumber }
    }
}
# grep_demo prints the line numbers itself, so there is no need to wrap the call in puts
grep_demo $myword $myfile
Here is how to do it with awk
awk 'NR>=10 && $0~f {print NR}' f="$my_word" "$myfile"
This searches every line from line 10 onward (matching sed's 10,$ range) and prints the number of each line that contains the word stored in the variable my_word, in the file named by the variable myfile.
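Here is a self-contained sketch of the awk approach, assuming the search should include line 10 itself, as the sed range 10,$ in the question does. The 12-line test file and the word "needle" are invented for the demonstration:

```shell
demo=$(mktemp)
# Hypothetical 12-line file with the word "needle" on lines 3, 10 and 12.
i=1
while [ "$i" -le 12 ]; do
    case $i in
        3|10|12) echo "line $i has a needle" ;;
        *)       echo "line $i" ;;
    esac
    i=$((i + 1))
done > "$demo"

my_word=needle
# NR >= 10 starts the search at line 10; $0 ~ word treats the variable as a regex.
line_numbers=$(awk -v word="$my_word" 'NR >= 10 && $0 ~ word { print NR }' "$demo")
printf '%s\n' "$line_numbers"
rm -f "$demo"
```

The match on line 3 is ignored, and the result is one line number per line, which is easy to split into a list on the Tcl side.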

Why does defined sdf return true in this Perl example?

I tried this example in Perl. Can someone explain why it is true?
if (defined sdf) { print "true"; }
It prints true.
sdf could be any name.
In addition, if a function named sdf is defined and it returns 0, then nothing is printed.
print (sdf); does not print the string sdf, but
if (sdf eq "sdf")
{
print "true";
}
prints true.
A related question remains: if sdf is a string, why is it not printed by print?
sdf is a bareword.
perl -Mstrict -e "print qq{defined\n} if defined sdf"
Bareword "sdf" not allowed while "strict subs" in use at -e line 1.
Execution of -e aborted due to compilation errors.
For more fun, try
perl -Mstrict -e "print sdf => qq{\n}"
See Strictly speaking about use strict:
The subs aspect of use strict disables the interpretation of "bare words" as text strings. By default, a Perl identifier (a sequence of letters, digits, and underscores, not starting with a digit unless it is completely numeric) that is not otherwise a built-in keyword or previously seen subroutine definition is treated as a quoted text string:
@daynames = (sun, mon, tue, wed, thu, fri, sat);
However, this is considered to be a dangerous practice, because obscure bugs may result:
@monthnames = (jan, feb, mar, apr, may, jun,
               jul, aug, sep, oct, nov, dec);
Can you spot the bug? Yes, the 10th entry is not the string 'oct', but rather an invocation of the built-in oct() function, returning the numeric equivalent of the default $_ treated as an octal number.
Corrected: (thanks @ysth)
E:\Home> perl -we "print sdf"
Unquoted string "sdf" may clash with future reserved word at -e line 1.
Name "main::sdf" used only once: possible typo at -e line 1.
print() on unopened filehandle sdf at -e line 1.
If a bareword is supplied to print in the indirect object slot, it is taken as a filehandle to print to. Since no other arguments are supplied, print defaults to printing $_ to filehandle sdf. Since sdf has not been opened, it fails. If you run this without warnings, you do not see any output. Note also:
E:\Home> perl -MO=Deparse -e "print sdf"
print sdf $_;
as confirmation of this observation. Note also:
E:\Home> perl -e "print asdfg, sadjkfsh"
No comma allowed after filehandle at -e line 1.
E:\Home> perl -e "print asdfg => sadjkfsh"
asdfgsadjkfsh
The latter prints both strings because => automatically quotes strings on the LHS if they consist solely of 'word' characters, removing the filehandle interpretation of the first argument.
All of these examples show that using barewords leads to many surprises. You should use strict to avoid such cases.
This is a "bareword". If it is allowed, it has the value of "sdf", and is therefore not undefined.
The example isn't special:
telemachus ~ $ perl -e 'if (defined sdf) { print "True\n" };'
True
telemachus ~ $ perl -e 'if (defined abc) { print "True\n" };'
True
telemachus ~ $ perl -e 'if (defined ccc) { print "True\n" };'
True
telemachus ~ $ perl -e 'if (defined 8) { print "True\n" };'
True
None of those is equivalent to undef which is what defined checks for.
You might want to check out this article on truth in Perl: What is Truth?
defined returns true if the expression has a value other than the undefined value.
The defined function returns true unless the value passed in the argument is undefined. This is useful for distinguishing a variable containing 0 or "" from a variable that just winked into existence.