I have this one liner:
perl -Mversion -e 'our $VERSION = v1.02; print $VERSION'
The output is not visible; it consists of two characters, with ordinals 1 and 2:
Why is the module version not printable? I expected to see v1.02.
I found this DOC:
print v9786; # prints SMILEY, "\x{263a}"
print v102.111.111; # prints "foo"
print 102.111.111; # same
Answering my own question:
Although v1.02 is a v-string, it is not stored internally as the text "v1.02"; it is a string made of the characters with ordinals 1 and 2. To print it in readable form we need an extra step, for example the version module.
UPD
I found the following solution (DOC):
printf "%vd", $VERSION; # prints "1.2"
UPD
This is also worth reading:
There are two ways to enter v-strings: a bare number with two or more decimal points, or a bare number with one or more decimal points and a leading 'v' character (also bare). For example:
$vs1 = 1.2.3; # encoded as \1\2\3
$vs2 = v1.2; # encoded as \1\2
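A minimal sketch of the points above, assuming nothing beyond core Perl. The helper name vstring_to_text is made up for illustration; it only wraps the sprintf "%vd" call already quoted from the documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: render a v-string as its dotted decimal ordinals.
sub vstring_to_text {
    my ($vstring) = @_;
    return sprintf '%vd', $vstring;
}

my $version = v1.02;                        # two characters: chr(1) and chr(2)
print length($version), "\n";               # number of characters stored
print vstring_to_text($version), "\n";      # dotted form; the leading zero of "02" is gone
print vstring_to_text(v102.111.111), "\n";  # the ordinals of "foo"
```

Note that the original spelling is not recoverable this way: v1.02 and v1.2 produce the same two characters.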
I have a file in which some lines contain a number encoded as text -> binary -> octets, and I need to decode it to recover the number.
Every line that carries this encoded string begins with STRVID:
For example, one of the lines is:
STRVID: SarI3gXp
If I do this echo "SarI3gXp" | perl -lpe '$_=unpack"B*"' I get the number in binary
0101001101100001011100100100100100110011011001110101100001110000
Now, to decode from binary to octets, I do this (assign the output of the previous command to a variable and then convert binary to octets):
variable=$(echo "SarI3gXp" | perl -lpe '$_=unpack"B*"') ; printf '%x\n' "$((2#$variable))"
The result is the number, but its digits are not in the correct order:
5361724933675870
To get the correct order I have to swap the two digits of every pair: first the second digit, then the first. Something like this:
variable=$(echo "SarI3gXp" | perl -lpe '$_=unpack"B*"') ; printf '%x\n' "$((2#$variable))" | gawk 'BEGIN {FS = ""} {print $2 $1 $4 $3 $6 $5 $8 $7 $10 $9 $12 $11 $14 $13 $16 $15}'
And finally I have the number I'm looking for:
3516279433768507
I have no idea how to do this automatically for every line in my file that begins with STRVID:. In the end I need the whole file unchanged, except that each line beginning with STRVID: carries the decoded value.
When I find this:
STRVID: SarI3gXp
I will have in my file
STRVID: 3516279433768507
Can someone help with this?
First of all, all you need for the conversion is
unpack "h*", "SarI3gXp"
A perl one-liner using -p will execute the provided program for each line, and s///e allows us to modify a string with code as the replacement expression.
perl -pe's/^STRVID:\s*\K\S+/ unpack "h*", $& /e'
See Specifying file to process to Perl one-liner.
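The same substitution as a small script, in case a function is easier to test than a one-liner. This is only a sketch: the helper name decode_strvid_line is invented here, and it assumes the encoded value is the first run of non-whitespace after STRVID:.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one line; lines without STRVID: pass through unchanged.
sub decode_strvid_line {
    my ($line) = @_;
    # "h*" emits each byte as two hex digits, low nybble first,
    # which is exactly the digit swap done by hand in the question.
    $line =~ s/^STRVID:\s*\K(\S+)/unpack 'h*', $1/e;
    return $line;
}

print decode_strvid_line("STRVID: SarI3gXp\n");   # STRVID: 3516279433768507
print decode_strvid_line("some other line\n");    # printed as is
```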
The following demo snippet may fit your problem.
You do not need a double conversion when it can be done in one go.
Note: please read the pack documentation; unpack uses the same TEMPLATE.
use strict;
use warnings;
use feature 'say';
while( <DATA> ) {
    chomp;
    /^STRVID: (.+)/
        ? say 'STRVID: ' . unpack("h*",$1)
        : say;
}
__DATA__
It would be nice if you provide proper input data sample
STRVID: SarI3gXp
Perhaps the result of this script complies with your requirements.
To work with real input data file replace
while( <DATA> ) {
with
while( <> ) {
and pass filename as an argument to the script.
Output
It would be nice if you provide proper input data sample
STRVID: 3516279433768507
Perhaps the result of this script complies with your requirements.
To work with real input data file replace
while( <DATA> ) {
with
while( <> ) {
and pass filename as an argument to the script.
./script.pl input_file.dat
You can cross-flip the digits entirely via regex (and without back-references either):
variable=$(echo "SarI3gXp" | perl -lpe '$_=unpack"B*"') ;
printf '%x\n' "$((2#$variable))" |
mawk -F'^$' 'gsub("..", "_&=&_") + gsub(\
"(^|[0-9]_)(_[0-9]|$)", _)+gsub("=",_)^_'
1 3516279433768507
The idea is to make a duplicate copy of each pair on the other side of an equals sign, like this:
_53=53__61=61__72=72__49=49__33=33__67=67__58=58__70=70_
then scrub out the leftovers, since the digits you now want are anchored on the two sides of each equals sign ("=").
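For comparison, here is the pair swap as a plain Perl substitution. Unlike the mawk version it does use capture groups, so treat it as a sketch of the simpler alternative rather than of the trick above; the helper name swap_digit_pairs is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the two characters of every consecutive pair.
sub swap_digit_pairs {
    my ($hex) = @_;
    (my $swapped = $hex) =~ s/(.)(.)/$2$1/g;
    return $swapped;
}

print swap_digit_pairs('5361724933675870'), "\n";   # 3516279433768507
```

A trailing unpaired character, if any, is left where it is.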
I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';
my $string;
# Read submitted files line by line (or STDIN if @ARGV is empty)
while (<>) {
    chomp;
    $string = $_;
    last; # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz); # test
my $padding = 3; # for the given test sample
my @patterns = do {
    my $search_file = 'search.txt';
    open my $fh, '<', $search_file or die "Can't open $search_file: $!";
    <$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw); # test
foreach my $patt (@patterns) {
    if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
        say "$1\t$2\t$3";
        # or
        # printf "%-3s\t%3s%3s\n", $1, $2, $3;
    }
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.), up to $padding times (3 above). I used it in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, which I added to the example provided in the question). The same goes for the padding after the $patt.
For your problem, set $padding to 100. With the 100-wide "padding" before and after each pattern, a pattern found closer to the beginning than 100 characters could break the desired \t alignment, if the position is less than 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
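If the regex engine ever does become the bottleneck, one thing to try is fixed-string search with index and substr, since the patterns here are literal strings. A minimal sketch under that assumption; the helper name context_for is made up, and I have not benchmarked it against the regex version.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (before, match, after) for the first occurrence
# of $patt in $string, or an empty list when $patt does not occur.
sub context_for {
    my ($string, $patt, $padding) = @_;
    my $pos = index $string, $patt;
    return if $pos < 0;
    # Clamp at the start of the string, like .{0,$padding} does in the regex.
    my $start  = $pos < $padding ? 0 : $pos - $padding;
    my $before = substr $string, $start, $pos - $start;
    my $after  = substr $string, $pos + length($patt), $padding;
    return ($before, $patt, $after);
}

my $string = 'abcdefghijklmnopqrstuvwxyz';
for my $patt (qw(bcd fgh pqr uvw)) {
    my @fields = context_for($string, $patt, 3) or next;
    print join("\t", @fields), "\n";
}
```

Like the regex version, this reports only the first occurrence, which is fine as long as each pattern matches once.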
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest.
One in awk. -v b=3 is the length of the context before the match, -v a=3 is the length of the context after it, and -v n=3 is the match length, which is constant. It hashes every substring of seq.txt into memory, so memory use grows with the size of seq.txt and you might want to follow the consumption with top. For example: abcdefghij -> s["def"]="abcdefghi", s["efg"]="bcdefghij", etc.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
    e=length()-(n+a-1)
    for(i=1;i<=e;i++) {
        k=substr($0,(i+b),n)
        s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
    }
    next
}
($0 in s) {
    print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
You can try following approach to your problem:
load string input data
load into an array patterns
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh_s, '<', $fname_s
    or die "Couldn't open $fname_s: $!";
my $data = do { local $/; <$fh_s> };
close $fh_s;
open my $fh_p, '<', $fname_p
    or die "Couldn't open $fname_p: $!";
my @patterns = <$fh_p>;
close $fh_p;
chomp @patterns;
for ( @patterns ) {
    # a match (m//g), not a substitution, collects the hits with their context
    my @found = $data =~ /(.{100}$_.{100})/g;
    s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for @found;
}
Test code for the provided test data (added later)
use strict;
use warnings;
use feature 'say';
my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };
for( @pat ) {
    say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz
UPDATE
As pointed out in the answer, this question really has to do with Scalar versus List Context in Perl.
## ## ##
I am learning perl via self-taught crash course (primarily with the Llama book and the web). In attempting some byte swap code, I have found a one liner I do not understand completely. A comment in the script explains what I think is happening in the one-liner.
#!/usr/bin/perl --
#
# Script to print byte-swapped hex values
#
use 5.010;
use warnings;
use strict;
# NOTE: I realize I could use a single variable 'data', but the x- y- z- prefixes may help in
# identification (for clarity) in the code for this SO question.
my $xData;
my $yData;
my $zData;
for ( my $ijk = 998; $ijk < 1001; $ijk++ )
{
    printf ( "\n%4d is hex " . (sprintf "0x%04X", $ijk) . "\n", $ijk );
    # with byte swap
    say "These numbers (bytes swapped) should match...";
    # do sprintf, match pattern and store to ($1)($2), now reverse them into ($2)($1).
    # BindOp leaves $_ alone, match stuffs $_ & is then used as input for reverse, prints.
    say reverse ((sprintf "%04X", $ijk) =~ /(..)(..)/) ; # from perlmonks' webpage
    $yData = (reverse ((sprintf "%04X", $ijk) =~ /(..)(..)/) );
    say $yData; # does NOT match
    $xData = sprintf "%04X", $ijk;
    $xData =~ s/(..)(..)/$2$1/ ;
    say $xData; # does match
    $_ = sprintf "%04X", $ijk;
    /(..)(..)/;
    $zData = $2 . $1 ;
    say $zData; # does match
}
exit 0;
OUTPUT:
998 is hex 0x03E6 These numbers (bytes swapped) should match...
E603
6E30
E603
E603
999 is hex 0x03E7 These numbers (bytes swapped) should match...
E703
7E30
E703
E703
1000 is hex 0x03E8 These numbers (bytes swapped) should match...
E803
8E30
E803
E803
Why does the one liner work, and why doesn't the $yData perform the same way? I'm pretty sure I understand why $xData and $zData work, but I would expect $yData to be the closest equivalent non-one-liner. What is the closest equivalent non-one-liner and why? Where is the discrepancy?
The reverse in your print (say) statement is in list context, while when assigned to $yData the context is scalar. This function can behave considerably differently depending on the context.
From perldoc -f reverse
reverse LIST
In list context, returns a list value consisting of the elements of LIST in the opposite order. In scalar context, concatenates the elements of LIST and returns a string value with all characters in the opposite order.
In this case this produces different results.
When called in list context, it swaps the (two) input bytes, keeping each byte intact (represented by two hexadecimal digits matched in a group). When called in scalar context, it joins the input and returns a character string, running in the opposite order. Taken to represent a hex number this would have each byte changed.
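A small sketch of the two behaviours side by side, using the 998 (0x03E6) case from the question. The helper names are invented; join '' is just one way to force list context on reverse before assigning to a scalar.

```perl
use strict;
use warnings;

# Hypothetical helpers: both take a four-digit hex string such as "03E6".
sub swapped_in_list_context {
    my @pairs = $_[0] =~ /(..)(..)/;      # ('03', 'E6')
    return join '', reverse @pairs;       # reverse sees a list: elements swap
}

sub swapped_in_scalar_context {
    my @pairs = $_[0] =~ /(..)(..)/;
    return scalar reverse @pairs;         # concatenates, then reverses characters
}

my $hex = sprintf '%04X', 998;
print swapped_in_list_context($hex), "\n";     # E603
print swapped_in_scalar_context($hex), "\n";   # 6E30
```

So the closest non-one-liner for $yData is the join form, or a list assignment such as my @swapped = reverse(...).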
Is there a way to get number of bytes that "consumed" by an 'unpack' call?
I just want to parse (unpack) different structures from a long string in several steps, like the following:
my $record1 = unpack "TEMPLATE", substr($long_str, $pos);
# Advance position pointer
$pos += NUMBER_OF_BYTES_CONSUMED_BY_LAST_UNPACK();
# Other code that might determine what to read in the following steps
# ...
# Read again at the new position
my $record2 = unpack "TEMPLATE2", substr($long_str, $pos);
This does seem like a glaring omission in unpack, doesn't it? As a consolation prize, you could add an a* to the end of the unpack template to return the unused portion of the input string.
# The variable-length "w" format is to make the example slightly more interesting.
$x = pack "w*", 126..129;
while(length $x) {
    # unpack one number, keep the rest packed in $x
    ($n, $x) = unpack "wa*", $x;
    print $n;
}
If your packed string is really long, this is not a good idea since it has to make a copy of the "remainder" portion of the string every time you do an unpack.
You can add the character . to the end of the format string:
my (@ary) = unpack("a4v3a*.", "abcdefghijklmn");
say for @ary;
Output:
abcd
26213
26727
27241
klmn
14 # <-- 14 bytes consumed
This was cleverly hidden in the perl5100delta file. If it is documented in perlfunc somewhere, I cannot find it.
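Putting the "." trick into the loop from the question might look like the sketch below. It assumes, as in the output above, that "." with no group yields the offset from the start of the string handed to unpack; the helper name unpack_ber_stream is invented. The substr call still copies the tail on each pass, so this fixes the bookkeeping rather than the copying.

```perl
use strict;
use warnings;

# Hypothetical helper: pull BER-compressed integers ("w") out of $buf one at a
# time, advancing a position pointer by the bytes each unpack consumed.
sub unpack_ber_stream {
    my ($buf) = @_;
    my $pos = 0;
    my @numbers;
    while ($pos < length $buf) {
        # "." returns the offset reached within the substring passed in
        my ($n, $consumed) = unpack 'w.', substr($buf, $pos);
        push @numbers, $n;
        $pos += $consumed;
    }
    return @numbers;
}

print join(' ', unpack_ber_stream(pack 'w*', 126 .. 129)), "\n";
```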
I have a directory full of files containing records like:
FAKE ORGANIZATION
799 S FAKE AVE
Northern Blempglorff, RI 99xxx
01/26/2011
These items are being held for you at the location shown below each one.
IF YOU ASKED THAT MATERIAL BE MAILED TO YOU, PLEASE DISREGARD THIS NOTICE.
The Waltons. The complete DAXXXX12118198
Pickup at:CHUPACABRA LOCATION 02/02/2011
GRIMLY, WILFORD
29 FAKE LANE
S. BLEMPGLORFF RI 99XXX
I need to remove all entries with the expression Pickup at:CHUPACABRA LOCATION.
The "record separator" issue:
I can't touch the input file's formatting -- it must be retained as is. Each record
is separated by roughly 40+ new lines.
Here's some awk ( this works ):
BEGIN {
RS="\n\n\n\n\n\n\n\n\n+"
FS="\n"
}
!/CHUPACABRA/{print $0}
My stab with perl:
perl -a -F\n -ne '$/ = "\n\n\n\n\n\n\n\n\n+";$\ = "\n";chomp;$regex="CHUPACABRA";print $_ if $_ !~ m/$regex/i;' data/lib51.000
Nothing is returned. I'm not sure how to specify 'field separator' in perl except at the commandline. Tried the a2p utility -- no dice. For the curious, here's what it produces:
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z
# process any FOO=bar switches
#$FS = ' '; # set field separator
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$/ = "\n\n\n\n\n\n\n\n\n+";
$FS = "\n";
while (<>) {
chomp; # strip record separator
if (!/CHUPACABRA/) {
print $_;
}
}
This has to run under someone's Windows box otherwise I'd stick with awk.
Thanks!
Bubnoff
EDIT ( SOLVED )
Thanks mob!
Here's a ( working ) perl script version ( adjusted a2p output ):
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z
# process any FOO=bar switches
#$FS = ' '; # set field separator
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
$/ = "\n"x10;
$FS = "\n";
while (<>) {
chomp; # strip record separator
if (!/CHUPACABRA/) {
print $_;
}
}
Feel free to post improvements or CPAN goodies that make this more idiomatic and/or perl-ish. Thanks!
In Perl, the record separator is a literal string, not a regular expression. As the perlvar doc famously says:
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Still, it looks like you can get away with $/="\n" x 10 or something like that:
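perl -ne 'BEGIN { $/="\n"x10; $\="\n" } chomp; $regex="CHUPACABRA";
print if /\S/ && !m/$regex/i;' data/lib51.000
Note that $/ is set in a BEGIN block so that it is already in effect when the first record is read. Note also the extra /\S/ &&, which will skip the empty paragraphs produced by input that has 20 or more consecutive newlines.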
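If you want the awk behaviour exactly (a separator of nine or more newlines, whatever the count), one option is to slurp the file and split on a regex instead of relying on $/. A minimal sketch, assuming the files fit in memory; the helper name filter_records and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into records on runs of nine or more
# newlines (same as the awk RS in the question) and drop records that
# contain $needle, ignoring case.
sub filter_records {
    my ($text, $needle) = @_;
    my @records = split /\n{9,}/, $text;
    return grep { /\S/ && !/\Q$needle\E/i } @records;
}

my $text = "keep me" . ("\n" x 12)
         . "Pickup at:CHUPACABRA LOCATION" . ("\n" x 40)
         . "keep me too\n";
print "$_\n" for filter_records($text, 'CHUPACABRA');
```

This discards the original separators; if the exact number of blank lines must survive, split with a capturing group, split /(\n{9,})/, and keep each separator with its record.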
Also, have you considered just installing Cygwin and having awk available on your Windows machine?
There is no need for (much) conversion if you can download gawk for Windows.
Did you know that Perl comes with a program called a2p that does exactly what you described you want to do in your title?
And, if you have Perl on your machine, the documentation for this program is already there:
C> perldoc a2p
My own suggestion is to get the Llama book and learn Perl anyway. Despite what the Python people say, Perl is a great and flexible language. If you know shell, awk and grep, you'll understand many of the Perl constructs without any problems.