Can I use unpack to split a string into characters in Perl? - perl

A common 'Perlism' is generating a list as something to loop over in this form:
for($str=~/./g) { print "the next character from \"$str\"=$_\n"; }
In this case the global match regex returns a list that is one character in turn from the string $str, and assigns that value to $_
Instead of a regex, split can be used in the same way or 'a'..'z', map, etc.
I am investigating unpack to generate a field by field interpretation of a string. I have always found unpack to be less straightforward to the way my brain works, and I have never really dug that deeply into it.
As a simple case, I want to generate a list that is one character in each element from a string using unpack (yes -- I know I can do it with split(//,$str) and /./g but I really want to see if unpack can be used this way...)
Obviously, I can use a field list for unpack that is unpack("A1" x length($str), $str) but is there some other way that kinda looks like globbing? i.e., can I call unpack(some_format,$str) either in list context or in a loop such that unpack will return the next group of characters in the format group until $str is exhausted?
I have read the Perl 5.12 pack pod, the Perl 5.12 pack tutorial, and the PerlMonks tutorial.
Here is the sample code:
#!/usr/bin/perl
use warnings;
use strict;
my $str=join('',('a'..'z', 'A'..'Z')); #the alphabet...
$str=~s/(.{1,3})/$1 /g; #...in groups of three
print "str=$str\n\n";
for ($str=~/./g) {
print "regex: = $_\n";
}
for(split(//,$str)) {
print "split: \$_=$_\n";
}
for(unpack("A1" x length($str), $str)) {
print "unpack: \$_=$_\n";
}

pack and unpack templates can use parentheses to group things much like regexps can. The group can be followed by a repeat count. * as a repeat count means "repeat until you run out of things to pack/unpack".
for(unpack("(A1)*", $str)) {
print "unpack: \$_=$_\n";
}
You'd have to run a benchmark to find out which of these is the fastest.
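One detail worth checking before relying on "(A1)*": the A format strips trailing spaces, so a space character in the string comes back as an empty element, while the lowercase a format returns the data verbatim. A minimal sketch of the difference (the sample strings are made up for illustration):

```perl
use strict;
use warnings;

my $str = 'ab cd';

# 'a' returns the data verbatim, so the space survives as ' '.
my @verbatim = unpack('(a1)*', $str);

# 'A' strips trailing spaces, so the space comes back as ''.
my @stripped = unpack('(A1)*', $str);

# The group can be wider than one character; the last chunk may be short.
my @chunks = unpack('(a3)*', 'abcdefgh');

print join('|', @verbatim), "\n";    # a|b| |c|d
print join('|', @stripped), "\n";    # a|b||c|d
print join('|', @chunks),   "\n";    # abc|def|gh
```

If the string can contain spaces and you want them preserved, "(a1)*" is probably the closer match to split(//,$str).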

Related

Perl subroutines

Here I fixed most of my mistakes, and thank you all. Any other advice, please, on my hash at this point? How can I clean each word and put the word and its frequency in a hash, excluding the empty words? I think my code makes sense now.
So that you can focus on the key part of the algorithm, how about reading input from STDIN and writing output to STDOUT? That way there's no argument checking, etc. Just a simple:
$ prog < words.txt
All you really need is a very simple algorithm:
Read a line
Split it into words
Record a count of the word
When done, display the counts
Here's a sample program
#! /usr/bin/perl -w
use strict;
my (%data);
while (<STDIN>) {
chomp;
my (@words) = split(/\s+/);
foreach my $word (@words) {
if (!defined($data{$word})) {
$data{$word} = 0;
}
$data{$word}++;
}
}
foreach (sort(keys(%data))) {
print "$_: $data{$_}\n";
}
Once you understand this and have it working in your environment, you can extend it to meet your other requirements:
remove non-alphabetic characters from each word
print three results per line
use input and output files
put the algorithm into a subroutine
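As a rough sketch of two of those extensions (stripping non-alphabetic characters and moving the algorithm into a subroutine), something like the following could work. The name count_words and the in-memory sample text are made up for illustration, and treating only A-Z and a-z as alphabetic is an assumption:

```perl
use strict;
use warnings;

# count_words is a hypothetical helper: it reads lines from a filehandle
# and returns a hash of word => frequency.
sub count_words {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        # split ' ' ignores leading whitespace, so no empty first field
        for my $word ( split ' ', $line ) {
            $word =~ s/[^A-Za-z]//g;    # assumption: ASCII letters only
            next if $word eq '';        # skip words that became empty
            $count{$word}++;
        }
    }
    return %count;
}

# Sample input held in memory so the sketch runs without any files.
my $text = "the cat -- and the Hat\n\nthe end.\n";
open my $in, '<', \$text or die "cannot open in-memory text: $!";
my %count = count_words($in);
close $in;

print "$_: $count{$_}\n" for sort keys %count;
```

Note that $count{$word}++ works on a key that does not exist yet, so the explicit initialisation to 0 is optional.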
I agree that starting with dave's answer would be more productive, but if you are interested in your mistakes, here is what I see:
You assign the return value of checkArgs to a scalar variable $checkArgs, but return an array value. This means that $checkArgs will always contain 2 (the size of the array) after this call, because the program dies if the number of arguments is not 2. It does no real harm since you do not use the value later, but why do you need it at all in this case?
You open files and close them immediately without reading from them. Does not make sense.
Statement
while (<>)
reads either from standard input or from all files named in the command-line arguments. The latter is close to what you want, but your second argument is the output file, not an input file, and the diamond operator will try to read from it too. You have two options: a) use only one file name in the command-line arguments, read the file with <>, write to standard output, and redirect the output to a file in the shell; b) use
while(<$file1>)
instead (before closing the files, of course). Option a) is the traditional Unix and Perl style, but b) makes for clearer code for beginners.
Statements
return $word;
and
return $str, $hash{$str};
return their values on the first iteration of the loop, so all the remaining data goes unprocessed. In the first case, you should create a local array, store every $word in it, and return the array as a whole. In the second case, you already have the local %hash, so it is enough to return that hash. In both cases, you should assign the return values of the functions not to scalars but to an array and a hash, respectively. As written, you lose all your data.
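To make the difference concrete, here is a small sketch. The subroutine names are invented for illustration and are not taken from the original code:

```perl
use strict;
use warnings;

# Returns during the first pass through the loop; the rest is never seen.
sub first_word_only {
    for my $word (@_) {
        return $word;
    }
    return;
}

# Collects everything in a local array and returns it after the loop.
sub all_words {
    my @words;
    for my $word (@_) {
        push @words, $word;
    }
    return @words;
}

my $first = first_word_only(qw(alpha beta gamma));    # 'alpha'
my @all   = all_words(qw(alpha beta gamma));          # all three words
my $count = all_words(qw(alpha beta gamma));          # scalar context: 3

print "first=$first all=@all count=$count\n";
```

The last assignment shows the same effect as with checkArgs: an array returned into a scalar gives you the number of elements, not the elements.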

To match for a certain number

I have a file which has a lot of floating point numbers like this:
4.5268e-06 4.5268e-08 4.5678e-01 4.5689e-04...
I need to check if there is at least one number with an exponent of -1. So, I wrote this short snippet with the regex. The regex works, because I checked it. But all I am getting in the output is 1s. I know I am missing something very basic. Please help.
#!/usr/local/bin/perl
use strict;
use warnings;
my $i;
my @values;
open(WPR,"test.txt")||die "couldnt open $!";
while(<WPR>)
{
chomp();
push @values,(/\d\.\d\d\d\de+[+-][0][1]/);
}
foreach $i (@values){
print "$i\n";}
close(WPR);
The regular expression match operator m (which you have omitted) returns true if it matches. True in Perl is usually returned as 1. (Note that most stuff is true, though).
If you want to stick with the short syntax, do this:
push @values, $1 if /(\d\.\d\d\d\de+[+-][0][1])/;
If I move the parenthesis, it works fine:
push @values,/(\d\.\d\d\d\de+[+-][0][1])/;
If there's going to be more than one match on the line, I'd add a g at the end.
If you have capture groups, and a list context, then match returns a list of capture results.
If you want to take this to its insane conclusion then:
my @values = map { /(\d\.\d\d\d\de+[+-][0][1])/g } <WPR> ;
Yes, you can use <WPR> in a list context too.
BTW, while your regex works, it probably isn't exactly what you meant. For example e+ matches one or more es. A little simpler might be:
/\d\.\d{4}e[+-]01/ ;
Which is still going to have other issues like matching x.xxxxe+01 as well.
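To see why the original code printed 1s, it may help to compare the list-context result of a match with and without capture parentheses. A minimal sketch, with sample data loosely based on the question and a pattern narrowed to e-01 for illustration:

```perl
use strict;
use warnings;

my $line = '4.5268e-06 4.5268e-08 4.5678e-01 4.5689e-04';

# No capture group: in list context a successful match returns (1).
my @flag = $line =~ /\d\.\d{4}e-01/;

# With a capture group: the match returns the captured text.
my @match = $line =~ /(\d\.\d{4}e-01)/;

# With /g as well: every match on the line is returned.
my @all = '4.5678e-01 9.1234e-01' =~ /(\d\.\d{4}e-01)/g;

print "flag=@flag match=@match all=@all\n";
```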
You could try with this one:
/\d+\.\d+e-01/

perl pattern matching one by one and process it

I have a string
[something]text1[/something] blah blah [something]text2[/something]
I need to write a Perl script to read what is in the [something] tag, process it to "text-x", and put it back with an [otherthing] tag. So the above string should be
[otherthing]text-1[/otherthing] blah blah [otherthing]text-2[/otherthing]
Processing "textx" to "text-x" is not one step process.
So this is solution that I have till now:
m/[something](?<text>.*)[/something]/
This will get me the string in between and I can process that to "text-x" but how do I put it back in the same place with [otherthing]text-x[/otherthing]?
How do I use s/// in this case?
How to do it for the whole string one by one ?
You can use the /e switch on s/// to evaluate the right hand side before using the result as the substitution, and the /g flag to do this for every match.
Here is a simple example:
use 5.12.0;
my $str = ">1< >2< >34<";
$str =~ s/>(\d+)</">>".process("$1")."<<"/eg;
say $str;
sub process {
return "x" x $_[0];
}
This should come close. It uses the /e modifier to allow you to do processing in the replacement side of the regex and so it calls the fix_textx function where you can do multiple steps.
The normal way of iterating over matches is with the /g modifier.
#!/usr/bin/perl
use strict;
use warnings;
my $string = '[something]text1[/something] blah blah [something]text2[/something]';
$string =~ s{\[something\](text[^[]*)\[\/something\]}
{'[otherthing]' . fix_textx($1) . '[/otherthing]'}ge;
print $string;
sub fix_textx {
my ($testx) = @_;
$testx =~ s/text\K(.*)/-$1/;
return $testx;
}
EDIT: fixed the square bracket. Thanks @tadmc
In this particular case, you can accomplish what you're trying to do by splitting the string on "[something]" and then processing the beginning of each piece (except the first one), then joining the pieces back together when you're done.
I don't know if there is a general way to iterate over the regex matches in a string in Perl. I'm hoping someone else will answer this question and educate me on that.
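For what it's worth, one common way to walk over the matches one by one, without substituting anything, is a while loop around a match with /g in scalar context. A minimal sketch using the string from the question (the non-greedy .*? is my choice, not something from the answers above):

```perl
use strict;
use warnings;

my $string = '[something]text1[/something] blah blah [something]text2[/something]';

my @found;
# In scalar context, each /g match resumes where the previous one ended.
while ( $string =~ m{\[something\](.*?)\[/something\]}g ) {
    push @found, $1;
    print "found '$1' ending at offset ", pos($string), "\n";
}
```

This only reads the matches; to put processed text back in place, s///ge as shown above is still the simpler tool.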

Why does this base64 string comparison in Perl fail?

I am trying to compare an encode_base64('test') to the string variable containing the base64 string of 'test'. The problem is it never validates!
use MIME::Base64 qw(encode_base64);
if (encode_base64("test") eq "dGVzdA==")
{
print "true";
}
Am I forgetting anything?
Here's a link to a Perlmonks page which says "Beware of the newline at the end of the encode_base64() encoded strings".
So the simple 'eq' may fail.
To suppress the newline, say encode_base64("test", "") instead.
When you do a string comparison and it fails unexpectedly, print the strings to see what is actually in them. I put brackets around the value to see any extra whitespace:
use MIME::Base64;
$b64 = encode_base64("test");
print "b64 is [$b64]\n";
if ($b64 eq "dGVzdA==") {
print "true";
}
This is a basic debugging technique using the best debugger ever invented. Get used to using it a lot. :)
Also, sometimes you need to read the documentation for things a couple of times to catch the important parts. In this case, the MIME::Base64 documentation tells you that encode_base64 takes two arguments. The second argument is the line ending, and it defaults to a newline. If you don't want a newline on the end of the string, you need to give it another line ending, such as the empty string:
encode_base64("test", "")
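A short sketch that puts the two calls side by side, so the effect of the second argument is visible (variable names are just for illustration):

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $with_newline = encode_base64("test");        # ends with "\n"
my $bare         = encode_base64("test", "");    # no line ending

print "default matches\n" if $with_newline eq "dGVzdA==";      # not printed
print "bare matches\n"    if $bare eq "dGVzdA==";              # printed
print "newline version\n" if $with_newline eq "dGVzdA==\n";    # printed
```

Calling chomp on the default result would be another way to get a comparable string.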
Here's an interesting tip: use Perl's wonderful and well-loved testing modules for debugging. Not only will that give you a head start on testing, but sometimes they'll make your debugging output a lot faster. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More 0.88;
BEGIN { use_ok 'MIME::Base64' => qw(encode_base64) }
is( encode_base64("test"), "dGVzdA==", q{"test" encodes okay} );
done_testing;
Run that script, with perl or with prove, and it won't just tell you that it didn't match, it will say:
# Failed test '"test" encodes okay'
# at testbase64.pl line 6.
# got: 'dGVzdA==
# '
# expected: 'dGVzdA=='
and sharp-eyed readers will notice that the difference between the two is indeed the newline. :)

How can I filter out specific column from a CSV file in Perl?

I am just a beginner in Perl and need some help in filtering columns using a Perl script.
I have about 10 columns separated by comma in a file and I need to keep 5 columns in that file and get rid of every other columns from that file. How do we achieve this?
Thanks a lot for anybody's assistance.
cheers,
Neel
Have a look at Text::CSV (or Text::CSV_XS) to parse CSV files in Perl. It's available on CPAN or you can probably get it through your package manager if you're using Linux or another Unix-like OS. In Ubuntu the package is called libtext-csv-perl.
It can handle cases like fields that are quoted because they contain a comma, something that a simple split command can't handle.
CSV is an ill-defined, complex format (weird issues with quoting, commas, and spaces). Look for a library that can handle the nuances for you and also give you conveniences like indexing by column names.
Of course, if you're just looking to split a text file by commas, look no further than @Pax's solution.
Use split to pull the line apart, then output the fields you want (say, every second column). Create the following xx.pl file:
while(<STDIN>) {
chomp;
@fields = split (",",$_);
print "$fields[1],$fields[3],$fields[5],$fields[7],$fields[9]\n"
}
then execute:
$ echo 1,2,3,4,5,6,7,8,9,10 | perl xx.pl
2,4,6,8,10
If you are talking about CSV files on Windows (e.g., generated from Excel), you will need to take care of fields that contain commas themselves but are enclosed in quotation marks.
In this case, a simple split won't work.
Alternatively, you could use Text::ParseWords, which is in the standard library. Add
use Text::ParseWords;
to the top of Pax's example above, and then substitute
my @fields = parse_line(q{,}, 0, $_);
for the split.
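A small sketch of the difference on a line with a quoted comma (the sample line is made up for illustration):

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = '1,"two, with comma",3,4';

# A plain split cuts the quoted field in two.
my @naive = split /,/, $line;

# parse_line respects the quotes; a false second argument drops them.
my @fields = parse_line(q{,}, 0, $line);

print scalar(@naive),  " fields from split\n";         # 5
print scalar(@fields), " fields from parse_line\n";    # 4
print "second field: $fields[1]\n";                    # two, with comma
```

This handles simple quoting only; for the full range of CSV oddities, Text::CSV is still the safer choice.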
You can use some of Perl's built in runtime options to do this on the command line:
$ echo "1,2,3,4,5" | perl -a -F, -n -e 'print join(q{,}, $F[0], $F[3]).qq{\n}'
1,4
The above will -a(utosplit) using the -F(ield) separator of a comma. It will then join the fields you are interested in and print them back out (with a line separator). This assumes simple data without embedded commas. I was doing this with an unprintable field separator (\x1d), so this wasn't an issue for me.
See http://perldoc.perl.org/perlrun.html#Command-Switches for more details.
I went looking and didn't find a nice CSV-compliant filter program that's flexible enough to be useful for more than just a one-off, so I wrote one. Enjoy.
Basic usage is:
bash$ csvfilter [-r <columnTitle>]* [-quote] <csv.file>
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Text::CSV;

my $always_quote = 0;
my @remove;
if ( !GetOptions( 'remove:s'     => \@remove,
                  'quote-always' => sub { $always_quote = 1; } ) ) {
    die "$0:invalid option (use --remove [--quote-always])";
}

# Indexes of the columns to drop, filled in from the header row.
my @cols2remove;

# Return the fields that are left after dropping the unwanted columns.
# NOTE: the body of this sub was lost when the post was rendered;
# this version is a reconstruction of the apparent intent.
sub filter {
    my @fields = @_;
    my %drop = map { $_ => 1 } @cols2remove;
    return @fields[ grep { !$drop{$_} } 0 .. $#fields ];
}

# create just one of these
my $csvOut = Text::CSV->new( { always_quote => $always_quote } );

sub printLine {
    my @fields = @_;
    $csvOut->combine( filter(@fields) );
    my $str = $csvOut->string();
    if ( length($str) ) {
        print "$str\n";
    }
}

my $csv = Text::CSV->new();
while (<>) {
    chomp;
    $csv->parse($_);
    if ( $. == 1 ) {
        # Header row: look up the index of each title given with --remove.
        # NOTE: this loop was also damaged in the post and is reconstructed.
        my @cols = $csv->fields;
        for my $rm (@remove) {
            for ( my $c = 0 ; $c < @cols ; $c++ ) {
                push( @cols2remove, $c ) if $cols[$c] eq $rm;
            }
        }
    }
    printLine( $csv->fields );
}
exit(0);
In addition to what people here said about processing comma-separated files, I'd like to note that one can extract the even (or odd) array elements using an array slice and/or map:
@myarray[map { $_ * 2 } (0 .. 4)]
Hope it helps.
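For example, applied to a ten-field row, slices along these lines pick out alternate columns or an arbitrary set of indexes. This is a sketch with made-up data:

```perl
use strict;
use warnings;

my @fields = ( 1 .. 10 );

# Indexes 0, 2, 4, 6, 8: the 1st, 3rd, 5th, 7th and 9th columns.
my @first_of_each_pair = @fields[ map { $_ * 2 } 0 .. 4 ];

# Indexes 1, 3, 5, 7, 9: the same selection as the split example above.
my @second_of_each_pair = @fields[ map { $_ * 2 + 1 } 0 .. 4 ];

# A slice also takes an explicit list of indexes.
my @picked = @fields[ 0, 3 ];

print join( ',', @first_of_each_pair ),  "\n";    # 1,3,5,7,9
print join( ',', @second_of_each_pair ), "\n";    # 2,4,6,8,10
print join( ',', @picked ),              "\n";    # 1,4
```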
My personal favorite way to do CSV is using the AnyData module. It seems to make things pretty simple, and removing a named column can be done rather easily. Take a look on CPAN.
This answers a much larger question, but seems like a good relevant bit of information.
The unix cut command can do what you want (and a whole lot more). It has been reimplemented in Perl.