Best practice values for Perl split function limit parm

I am parsing a string in a subroutine that specifies a fixed number of parameters and two optional parameters. N.B. I also specify the parameter string being used.
This parameter string is of the form:
local_fs_name rem_fs_name timeout diff_limit hi hihi (rem_hi) (rem_hihi)
so definitely six parameters with two optional parameters for a max of eight.
Should the upper limit be set to the maximum number of parameters or one more than the maximum, i.e. eight or nine?

The only reasons to limit the number of fields split returns that I can think of are either for efficiency purposes (and your subroutine would have to be called a lot with very many more parameters than required for this to matter) or if you really want to keep the separators in the final field.
You shouldn't be using split to verify the number of parameters. Fetch all of them into an array and then verify the contents of the array. Something like this:
my $params = 'local_fs_name rem_fs_name timeout diff_limit hi hihi rem_hi rem_hihi';
my @params = split ' ', $params;
if (@params < 6 or @params > 8) {
die "Usage: mysub local_fs_name rem_fs_name timeout diff_limit hi hihi [rem_hi [rem_hihi]]\n";
}
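A minimal sketch of how that check could be wrapped up so the caller gets named values back. The sub name parse_fs_params and the idea of returning a hash are my own assumptions, not something from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: split the parameter string, verify the field count,
# and return the values keyed by the names used in the question.
sub parse_fs_params {
    my ($param_string) = @_;
    my @names  = qw(local_fs_name rem_fs_name timeout diff_limit hi hihi rem_hi rem_hihi);
    my @values = split ' ', $param_string;

    die "expected 6 to 8 parameters, got " . @values . "\n"
        if @values < 6 or @values > 8;

    my %param;
    # Hash slice: only as many names as there are values, so missing
    # optional parameters simply do not exist in the hash.
    @param{ @names[0 .. $#values] } = @values;
    return %param;
}

# Seven fields: rem_hi is present, rem_hihi is not.
my %p = parse_fs_params('/data /mnt/data 30 5 80 90 85');
print "timeout=$p{timeout} rem_hi=$p{rem_hi}\n";
```

Checking exists $p{rem_hihi} afterwards tells you whether the second optional parameter was supplied.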

It's not a style (best practice) question.
split ' ', $_
and
split ' ', $_, 6
produce different results when 7+ args are provided.
>perl -E"say +( split ' ', 'a b c d e f g' )[5]"
f
>perl -E"say +( split ' ', 'a b c d e f g', 6 )[5]"
f g
My best guess is that you don't want to limit.
Then there's the question of whether you want to keep trailing fields or not.
>perl -E"@a=split(' ', 'a b c d e ' ); say 0+@a;"
5
>perl -E"@a=split(' ', 'a b c d e ', -1); say 0+@a;"
6
My best guess is trailing whitespace isn't significant.
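To put the three behaviours side by side, here is a small sketch. The sample strings are made up for illustration; the limit of 9 is the "one more than the maximum" option from the question:

```perl
use strict;
use warnings;

# Ten whitespace-separated fields: two more than the eight the sub accepts.
my $params = 'lfs rfs 30 5 80 90 85 95 extra junk';

my @all    = split ' ', $params;       # no limit: every field is split out
my @capped = split ' ', $params, 9;    # limit 9: the ninth field holds the unsplit rest

printf "no limit: %d fields, limit 9: %d fields\n", scalar @all, scalar @capped;
print  "ninth field with limit 9: '$capped[8]'\n";

# A negative limit keeps trailing empty fields instead of dropping them.
my $padded   = 'a b c d e f g h i j ';
my @no_limit = split ' ', $padded;
my @limit_8  = split ' ', $padded, 8;
my @keep_all = split ' ', $padded, -1;

printf "%d %d %d\n", scalar @no_limit, scalar @limit_8, scalar @keep_all;
```

With a limit of 9, anything past the eighth parameter ends up glued together in one extra field, so counting fields still detects surplus input either way.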

Related

PERL: Sorting Letters from A to Z

I'm trying to sort letters in a file from A to Z
for example: a A B d r g
sorted: A a B d g r
@ARGV == 2 or die "Usage: $0 infile outfile\n";
open $old, '<', $ARGV[0] or die $!;
open $new, '>', $ARGV[1] or die $!;
@mass=<$old>;
@array=qw(@mass);
@sort=sort @array;
@mass1=sort {uc $a cmp uc $b} @sort;
print $new @mass1;
Where am I going wrong?

I don't think you understand that the standard text ordering is ASCII-based. Because all uppercase letters precede all lowercase letters in ASCII, the same holds for your sorted input. Therefore, the order for a straight sort would be ( 'A', 'B', 'a', 'd', 'g', 'r' ).
You want to compare the two strings twice: first ignoring case, then with case to break ties. For that you need to pass a comparison routine to sort.
@sort= sort { lc $a cmp lc $b or $a cmp $b } @array;
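Here is that comparator run against the sample letters from the question, next to a plain sort for contrast. The variable names are only for illustration:

```perl
use strict;
use warnings;

my @letters = qw(a A B d r g);

# Plain sort compares code points, so every uppercase letter comes first.
my @plain = sort @letters;

# Compare case-insensitively first; fall back to a plain comparison
# so that 'A' and 'a' still have a stable relative order.
my @folded = sort { lc $a cmp lc $b or $a cmp $b } @letters;

print "@plain\n";    # A B a d g r
print "@folded\n";   # A a B d g r
```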

I'm not sure what you intended to do with qw, but
suffice it to say that the contents of @mass will never be used.
@array = qw(hello world);
Will cause @array to be defined to contain 2 strings, hello and world. It is just shorthand for:
@array = ('hello', 'world');
Which is why
@array=qw(@mass);
Evaluates to ('@mass') - an array with the single literal string of 5 characters @mass.
Maybe that's what you're doing wrong. What if you try
@array = map { split /\s+/} @mass;
@mass is the list of lines. Each line has words or just letters, separated by space.
What that last line does is map each line through split /\s+/ - which will split each
line like 'ba ab a G' into a list like ('ba', 'ab', 'a', 'G') and @array will
become a single list of words/letters.
Then it's a matter of how you want to sort them. See the other answer as well.
Oh, and remember to put back the spaces when you write out your file:
print $new (join " ", @mass1);
If you want each line to be sorted independently of the others, that's easy too:
$mass1 = join "\n", map { join " ", sort (split /\s+/) } @mass;
That reads, 'for every line in @mass, split on space, sort and join back again with space', and with the resulting array, join with newline to produce the output of the file.
Note that you can drop in sort with a comparator like sort { $a cmp $b } etc.
If your file is too big to read into memory at once, looping over it line by line is more prudent:
while (my $mass = <$old>) {
my $sorted_line = join " ", sort (split /\s+/, $mass);
print $new "$sorted_line\n";
}
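A self-contained sketch of the line-by-line approach. It uses in-memory filehandles and made-up input instead of the real files so it can be run as is, and it borrows the case-insensitive comparator from the other answer:

```perl
use strict;
use warnings;

# Stand-ins for the input and output files (assumption: real code would
# open $ARGV[0] and $ARGV[1] instead).
my $input  = "d a B\nr A g\n";
my $output = '';
open my $old, '<', \$input  or die $!;
open my $new, '>', \$output or die $!;

while (my $line = <$old>) {
    my @words = split ' ', $line;    # also discards the trailing newline
    my @sorted = sort { lc $a cmp lc $b or $a cmp $b } @words;
    print {$new} join(' ', @sorted), "\n";
}
close $new;

print $output;
```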

You need to find the correct locale to use, so that sort and the other order-sensitive functions collate according to it.
See this page showing most of the variables defining locales, and look for LANG, LC_ALL and LC_COLLATE. (I have to admit I'm not exactly sure which is used when. LC_ALL is supposed to take precedence over the others, so it's the one you can change to have all LC_* values set... Please test, ymmv)
I believe you probably need to use one of the Unicode locales. ASCII won't do what you want, as capital letters come before lowercase letters in ASCII.
To find out which locales you can use: locale -a
To see which locales you are currently set to : locale (user and system-wide values are possible)
You probably need something containing "utf-8" to have the order you seek
Then : (if for example en_US.UTF-8 is available):
just before using it in the sort, define locales you want to sort with:
LC_ALL=en_US.UTF-8
(or whatever the value you need it to be set at, and is available as shown by "locale -a")
(save/restore their previous values around the invocation if you need to)
In shell, you probably want to add "export" to the variables you redefine, to ensure subshells use the new value too (like: something | sort : in bash, sort will be in a subshell, therefore using the default value of LC_*, or using the exported value if you exported it!)

Perl : Convert a function call to a lisp style function call

I am trying to convert a line a, f_1(b, c, f_2(d, e)) to a line with lisp style function call
a (f_1 b c (f_2 d e)) using Text::Balanced subroutines:
A function call is in the form f(arglist); arglist can contain one or more function calls, which may themselves be nested hierarchically.
The way I tried:
my $text = q|a, f_1(a, b, f_2(c, d))|;
my ($match, $remainder) = extract_bracketed($text); # defaults to '()'
# $match does not contain the text I want, which is: a, b, f_2(c,d), because "(" is preceded by a string;
my ($d_match, $d_remainder) = extract_delimited($remainder,",");
# $d_match doesn't contain the first string
# planning to use remainder texts from the bracketed and delimited operations in a loop to translate.
I even tried extract_tagged with the start tag /^[\w_0-9]+\(/ and the end tag /\)/, but that doesn't work either.
Parse::RecDescent is difficult to understand and put to use in a short time.

All that seems to be necessary to transform to the LISP style is to remove the commas and move each opening parenthesis to before the function name that precedes it.
This program works by tokenizing the string into identifiers /\w+/ or parentheses /[()]/ and storing the list in array @tokens. This array is then scanned, and wherever an identifier is followed by an opening parenthesis the two are switched over.
use strict;
use warnings;
my $str = 'a, f_1(b, c, f_2(d, e))';
my @tokens = $str =~ /\w+|[()]/g;
for my $i (0 .. $#tokens-1) {
@tokens[$i,$i+1] = @tokens[$i+1,$i] if "@tokens[$i,$i+1]" =~ /\w \(/;
}
print "@tokens\n";
output
a ( f_1 b c ( f_2 d e ) )
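The output above has spaces around every parenthesis, while the question asked for a (f_1 b c (f_2 d e)). A possible follow-up, sketched here with a hypothetical sub name to_lisp_call, is to tighten the spacing after joining the tokens:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the token-swapping idea, plus a cleanup pass.
sub to_lisp_call {
    my ($str) = @_;
    my @tokens = $str =~ /\w+|[()]/g;

    # Swap each identifier with the opening parenthesis that follows it.
    for my $i (0 .. $#tokens - 1) {
        @tokens[$i, $i + 1] = @tokens[$i + 1, $i]
            if $tokens[$i] =~ /^\w+$/ and $tokens[$i + 1] eq '(';
    }

    my $out = "@tokens";
    $out =~ s/\( /(/g;    # no space after an opening parenthesis
    $out =~ s/ \)/)/g;    # no space before a closing parenthesis
    return $out;
}

my $lisp = to_lisp_call('a, f_1(b, c, f_2(d, e))');
print "$lisp\n";          # a (f_1 b c (f_2 d e))
```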

Apply regexp replace only to quoted piece

I need to apply a regexp filtration to affect only pieces of text within quotes and I'm baffled.
$in = 'ab c "d e f" g h "i j" k l';
#...?
$inquotes =~ s/\s+/_/g; #arbitrary regexp working only on the pieces inside quote marks
#...?
$out = 'ab c "d_e_f" g h "i_j" k l';
(the final effect can strip/remove the quotes if that makes it easier, 'ab c d_e_f g...)

You could figure out some cute trick that looks like line noise.
Or you could keep it simple and readable, and just use split and join. Using the quote mark as a field separator, operate on every other field:
my @pieces = split /\"/, $in, -1;
foreach my $i (0 .. $#pieces) {
next unless $i % 2;
$pieces[$i] =~ s/\s+/_/g;
}
my $out = join '"', @pieces;

If you want to use just a regex, the following should work:
my $in = q(ab c "d e f" g h "i j" k l);
$in =~ s{"(.+?)"}{$1 =~ s/\s+/_/gr}eg;
print "$in\n";
(You said the "s may be dropped :) )
HTH,
Paul
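If the quotes should stay in the output, as in the $out shown in the question, a variant of the same idea could put them back in the replacement. This is only a sketch; it assumes Perl 5.14 or later for the /r modifier and quoted pieces that contain no embedded quote marks:

```perl
use v5.14;
use strict;
use warnings;

my $in = 'ab c "d e f" g h "i j" k l';

# Match each quoted piece, rewrite its contents with a non-destructive
# substitution (/r), and wrap the result in quotes again.
(my $out = $in) =~ s{"([^"]*)"}{ '"' . ($1 =~ s/\s+/_/gr) . '"' }eg;

print "$out\n";    # ab c "d_e_f" g h "i_j" k l
```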

Something like
s/\"([\s\w]*)\"/
should match the quoted chunks. My perl regex syntax is a little rusty, but shouldn't just placing quote literals around what you're capturing do the job? You've then got your quoted string d e f inside the first capture group, so you can do whatever you want to it... What kind of 'arbitrary operation' are you trying to do to the quoted strings?
Hmm.
You might be better off matching the quoted strings, then passing them to another regex, rather than doing it all in one.

What pseudo-operators exist in Perl 5?

I am currently documenting all of Perl 5's operators (see the perlopref GitHub project) and I have decided to include Perl 5's pseudo-operators as well. To me, a pseudo-operator in Perl is anything that looks like an operator, but is really more than one operator or some other piece of syntax. I have documented the four I am familiar with already:
()= the countof operator
=()= the goatse/countof operator
~~ the scalar context operator
}{ the Eskimo-kiss operator
What other names exist for these pseudo-operators, and do you know of any pseudo-operators I have missed?
=head1 Pseudo-operators
There are idioms in Perl 5 that appear to be operators, but are really a
combination of several operators or pieces of syntax. These pseudo-operators
have the precedence of the constituent parts.
=head2 ()= X
=head3 Description
This pseudo-operator is the list assignment operator (aka the countof
operator). It is made up of two items C<()>, and C<=>. In scalar context
it returns the number of items in the list X. In list context it returns an
empty list. It is useful when you have something that returns a list and
you want to know the number of items in that list and don't care about the
list's contents. It is needed because the comma operator returns the last
item in the sequence rather than the number of items in the sequence when it
is placed in scalar context.
It works because the assignment operator returns the number of items
available to be assigned when its left hand side has list context. In the
following example there are five values in the list being assigned to the
list C<($x, $y, $z)>, so C<$count> is assigned C<5>.
my $count = my ($x, $y, $z) = qw/a b c d e/;
The empty list (the C<()> part of the pseudo-operator) triggers this
behavior.
=head3 Example
sub f { return qw/a b c d e/ }
my $count = ()= f(); #$count is now 5
my $string = "cat cat dog cat";
my $cats = ()= $string =~ /cat/g; #$cats is now 3
print scalar( ()= f() ), "\n"; #prints "5\n"
=head3 See also
L</X = Y> and L</X =()= Y>
=head2 X =()= Y
This pseudo-operator is often called the goatse operator for reasons better
left unexamined; it is also called the list assignment or countof operator.
It is made up of three items C<=>, C<()>, and C<=>. When X is a scalar
variable, the number of items in the list Y is returned. If X is an array
or a hash it returns an empty list. It is useful when you have something
that returns a list and you want to know the number of items in that list
and don't care about the list's contents. It is needed because the comma
operator returns the last item in the sequence rather than the number of
items in the sequence when it is placed in scalar context.
It works because the assignment operator returns the number of items
available to be assigned when its left hand side has list context. In the
following example there are five values in the list being assigned to the
list C<($x, $y, $z)>, so C<$count> is assigned C<5>.
my $count = my ($x, $y, $z) = qw/a b c d e/;
The empty list (the C<()> part of the pseudo-operator) triggers this
behavior.
=head3 Example
sub f { return qw/a b c d e/ }
my $count =()= f(); #$count is now 5
my $string = "cat cat dog cat";
my $cats =()= $string =~ /cat/g; #$cats is now 3
=head3 See also
L</=> and L</()=>
=head2 ~~X
=head3 Description
This pseudo-operator is named the scalar context operator. It is made up of
two bitwise negation operators. It provides scalar context to the
expression X. It works because the first bitwise negation operator provides
scalar context to X and performs a bitwise negation of the result; since the
result of two bitwise negations is the original item, the value of the
original expression is preserved.
With the addition of the Smart match operator, this pseudo-operator is even
more confusing. The C<scalar> function is much easier to understand and you
are encouraged to use it instead.
=head3 Example
my @a = qw/a b c d/;
print ~~@a, "\n"; #prints 4
=head3 See also
L</~X>, L</X ~~ Y>, and L<perlfunc/scalar>
=head2 X }{ Y
=head3 Description
This pseudo-operator is called the Eskimo-kiss operator because it looks
like two faces touching noses. It is made up of a closing brace and an
opening brace. It is used when using C<perl> as a command-line program with
the C<-n> or C<-p> options. It has the effect of running X inside of the
loop created by C<-n> or C<-p> and running Y at the end of the program. It
works because the closing brace closes the loop created by C<-n> or C<-p>
and the opening brace creates a new bare block that is closed by the loop's
original ending. You can see this behavior by using the L<B::Deparse>
module. Here is the command C<perl -ne 'print $_;'> deparsed:
LINE: while (defined($_ = <ARGV>)) {
print $_;
}
Notice how the original code was wrapped with the C<while> loop. Here is
the deparsing of C<perl -ne '$count++ if /foo/; }{ print "$count\n"'>:
LINE: while (defined($_ = <ARGV>)) {
++$count if /foo/;
}
{
print "$count\n";
}
Notice how the C<while> loop is closed by the closing brace we added and the
opening brace starts a new bare block that is closed by the closing brace
that was originally intended to close the C<while> loop.
=head3 Example
# count unique lines in the file FOO
perl -nle '$seen{$_}++ }{ print "$_ => $seen{$_}" for keys %seen' FOO
# sum all of the lines until the user types control-d
perl -nle '$sum += $_ }{ print $sum'
=head3 See also
L<perlrun> and L<perlsyn>
=cut
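For anyone who wants to try these out, here is a short sketch that contrasts a plain scalar assignment with the countof idiom and checks the double-negation trick. The sub f is the one from the POD examples; the other variable names are just for illustration:

```perl
use strict;
use warnings;

sub f { return qw/a b c d e/ }

# Plain scalar assignment: the list in scalar context yields its last item.
my $last = f();

# Assigning through an empty list first gives the number of items instead.
my $count = () = f();

# Two bitwise negations force scalar context on the array and cancel out.
my @a = qw/a b c d/;
my $size = ~~@a;

print "last=$last count=$count size=$size\n";   # last=e count=5 size=4
```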

Nice project, here are a few:
scalar x!! $value # conditional scalar include operator
(list) x!! $value # conditional list include operator
'string' x/pattern/ # conditional include if pattern
"@{[ list ]}" # interpolate list expression operator
"${\scalar}" # interpolate scalar expression operator
!! $scalar # scalar -> boolean operator
+0 # cast to numeric operator
.'' # cast to string operator
{ ($value or next)->depends_on_value() } # early bail out operator
# aka using next/last/redo with bare blocks to avoid duplicate variable lookups
# might be a stretch to call this an operator though...
sub{\@_}->( list ) # list capture "operator", like [ list ] but with aliases

In Perl these are generally referred to as "secret operators".
A partial list of "secret operators" can be had here. The best and most complete list is probably in the possession of Philippe Bruhat aka BooK and his Secret Perl Operators talk, but I don't know where it's available. You might ask him. You can probably glean some more from Obfuscation, Golf and Secret Operators.

Don't forget the Flaming X-Wing =<>=~.
The Fun With Perl mailing list will prove useful for your research.

The "goes to" and "is approached by" operators:
$x = 10;
say $x while $x --> 4;
# prints 9 through 4
$x = 10;
say $x while 4 <-- $x;
# prints 9 through 5
They're not unique to Perl.

From this question, I discovered the %{{}} operator to cast a list as a hash. Useful in
contexts where a hash argument (and not a hash assignment) is required.
@list = (a,1,b,2);
print values @list; # arg 1 to values must be hash (not array dereference)
print values %{@list} # prints nothing
print values (%temp=@list) # arg 1 to values must be hash (not list assignment)
print values %{{@list}} # success: prints 12
If @list does not contain any duplicate keys (odd-elements), this operator also provides a way to access the odd or even elements of a list:
@even_elements = keys %{{@list}} # @list[0,2,4,...]
@odd_elements = values %{{@list}} # @list[1,3,5,...]
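A runnable version of the same idea under strict, as a sketch. One thing to keep in mind is that keys and values come back in hash order rather than list order, so the results are sorted here before printing:

```perl
use strict;
use warnings;

my @list = (a => 1, b => 2);

# The inner braces build an anonymous hash from the list;
# the outer %{ } dereferences it so keys/values can work on it.
my @keys   = sort keys   %{{@list}};
my @values = sort values %{{@list}};

print "@keys / @values\n";   # a b / 1 2
```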

The Perl secret operators now have some reference (almost official, but they are "secret") documentation on CPAN: perlsecret

You have two "countof" (pseudo-)operators, and I don't really see the difference between them.
From the examples of "the countof operator":
my $count = ()= f(); #$count is now 5
my $string = "cat cat dog cat";
my $cats = ()= $string =~ /cat/g; #$cats is now 3
From the examples of "the goatse/countof operator":
my $count =()= f(); #$count is now 5
my $string = "cat cat dog cat";
my $cats =()= $string =~ /cat/g; #$cats is now 3
Both sets of examples are identical, modulo whitespace. What is your reasoning for considering them to be two distinct pseudo-operators?

How about the "Boolean one-or-zero" operator: 1&!!
For example:
my %result_of = (
" 1&!! '0 but true' " => 1&!! '0 but true',
" 1&!! '0' " => 1&!! '0',
" 1&!! 'text' " => 1&!! 'text',
" 1&!! 0 " => 1&!! 0,
" 1&!! 1 " => 1&!! 1,
" 1&!! undef " => 1&!! undef,
);
for my $expression ( sort keys %result_of){
print "$expression = " . $result_of{$expression} . "\n";
}
gives the following output:
1&!! '0 but true' = 1
1&!! '0' = 0
1&!! 'text' = 1
1&!! 0 = 0
1&!! 1 = 1
1&!! undef = 0

The << >> operator, for multi-line comments:
<<q==q>>;
This is a
multiline
comment
q

Can I use Perl's unpack to break up a string into vars?

I have an image file name that consists of four parts:
$Directory (the directory where the image exists)
$Name (for an art site, this is the painting's name reference #)
$File (the image's file name minus extension)
$Extension (the image's extension)
$example 100020003000.png
Which I desire to be broken down accordingly:
$dir=1000 $name=2000 $file=3000 $ext=.png
I was wondering if substr was the best option in breaking up the incoming $example so I can do stuff with the 4 variables like validation/error checking, grabbing the verbose name from its $Name assignment or whatever. I found this post:
is unpack faster than substr?
So, in my beginners "stone tool" approach:
my $example = "100020003000.png";
my $dir = substr($example, 0,4);
my $name = substr($example, 4,4);
my $file = substr($example, 8,4);
my $ext = substr($example, 13,3); # will add the "." later #
So, can I use unpack, or maybe even another approach that would be more efficient?
I would also like to avoid loading any modules unless doing so would use less resources for some reason. Mods are great tools I luv'em but, I think not necessary here.
I realize I should probably push the vars into an array/hash but, I am really a beginner here and I would need further instruction on how to do that and how to pull them back out.
Thanks to everyone at stackoverflow.com!

Absolutely:
my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = unpack 'A4' x 4, $example;
print "$dir\t$name\t$file\t$ext\n";
Output:
1000 2000 3000 .png
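Since the question also asked about putting the pieces into a hash, here is one way that could look. It is a sketch: the hash name %part and its keys are my own choice, and the A* at the end of the template simply takes whatever is left, so a longer extension still fits:

```perl
use strict;
use warnings;

my $example = '100020003000.jpeg';

# Hash slice: assign the four unpacked fields to four named keys at once.
my %part;
@part{qw(dir name file ext)} = unpack 'A4 A4 A4 A*', $example;

# Pull the values back out by name.
print "$_ = $part{$_}\n" for qw(dir name file ext);
```

Each value is then available as $part{dir}, $part{name} and so on for validation or lookups.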

I'd just use a regex for that:
my ($dir, $name, $file, $ext) = $path =~ m:(.*)/(.*)/(.*)\.(.*):;
Or, to match your specific example:
my ($dir, $name, $file, $ext) = $example =~ m:^(\d{4})(\d{4})(\d{4})\.(.{3})$:;

Using unpack is good, but since the elements are all the same width, the regex is very simple as well:
my $example = "100020003000.png";
my ($dir, $name, $file, $ext) = $example =~ /(.{4})/g;

It isn't unpack, but since you have groups of 4 characters, you could use a limited split, with a capture:
my ($dir, $name, $file, $ext) = grep length, split /(....)/, $filename, 4;
This is pretty obfuscated, so I probably wouldn't use it, but the capture in a split is an often overlooked ability.
So, here's an explanation of what this code does:
Step 1. split with capturing parentheses adds the values captured by the pattern to its output stream. The stream contains a mix of fields and delimiters.
qw( a 1 b 2 c 3 ) == split /(\d)/, 'a1b2c3';
Step 2. split with 3 args limits how many times the string is split.
qw( a b2c3 ) == split /\d/, 'a1b2c3', 2;
Step 3. Now, when we use a delimiter pattern that matches pretty much anything /(....)/, we get a bunch of empty (0 length) strings. I've marked delimiters with D characters, and fields with F:
( '', 'a', '', '1', '', 'b', '', '2' ) == split /(.)/, 'a1b2';
F D F D F D F D
Step 4. So if we limit the number of fields to 3 we get:
( '', 'a', '', '1', 'b2' ) == split /(.)/, 'a1b2', 3;
F D F D F
Step 5. Putting it all together we can do this (I used a .jpeg extension so that the extension would be longer than 4 characters):
( '', 1000, '', 2000, '', 3000, '.jpeg' ) == split /(....)/, '100020003000.jpeg',4;
F D F D F D F
Step 6. Step 5 is almost perfect, all we need to do is strip out the null strings and we're good:
( 1000, 2000, 3000, '.jpeg' ) == grep length, split /(....)/, '100020003000.jpeg',4;
This code works, and it is interesting. But it's not any more compact than any of the other solutions. I haven't bench-marked, but I'd be very surprised if it wins any speed or memory efficiency prizes.
But the real issue is that it is too tricky to be good for real code. Using split to capture delimiters (and maybe one final field), while throwing out the field data is just too weird. It's also fragile: if one field changes length the code is broken and has to be rewritten.
So, don't actually do this.
At least it provided an opportunity to explore some lesser known features of split.
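The steps above use == as shorthand for "produces this list", which isn't runnable Perl. A sketch that runs steps 1, 2, 4 and 6 for real and prints the resulting fields separated by | (so empty fields stay visible) could look like this:

```perl
use strict;
use warnings;

# Step 1: capturing parentheses put the delimiters into the result.
my @step1 = split /(\d)/, 'a1b2c3';

# Step 2: a limit of 2 stops after the first split.
my @step2 = split /\d/, 'a1b2c3', 2;

# Step 4: a delimiter that matches any character leaves empty fields
# between the captured delimiters; the limit keeps the rest unsplit.
my @step4 = split /(.)/, 'a1b2', 3;

# Step 6: the same trick on the file name, with empty fields filtered out.
my @step6 = grep { length } split /(....)/, '100020003000.jpeg', 4;

print join('|', @$_), "\n" for \@step1, \@step2, \@step4, \@step6;
```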

Both substr and unpack bias your thinking toward fixed-layout, while regex solutions are more oriented toward flexible layouts with delimiters.
The example you gave appeared to be fixed layout, but directories are usually separated from file names by a delimiter (e.g. slash for POSIX-style file systems, backslash for MS-DOS, etc.) So you might actually have a case for both: a regex solution to split directory and file name apart (or even directory/name/extension) and then a fixed-length approach for the name part by itself.
