Perl module / subroutine to remove shared substring of strings - perl

Given a list/array of strings (in particular, UNIX paths), remove the shared part, eg:
./dir/fileA_header.txt
./dir/fileA_footer.txt
I will probably strip the directory before using the function, but strictly speaking this won't change much.
I'd like to know a method to either remove the shared parts (./dir/fileA_) or remove the not-shared part.
Thank you for your help!

This is a bit of a hack, but if you don't need to support Unicode strings (that is, if all characters have a value below 256), you can use xor to get the length of the longest common prefix of two strings:
my $n = do {
($str1 ^ $str2) =~ /^\0*/;
$+[0]
};
You can apply this operation in a loop to get the common prefix of a list of strings:
use v5.12.0;
use warnings;
sub common_prefix {
    my $prefix = shift;
    for my $str (@_) {
        ($prefix ^ $str) =~ /^\0*/;
        substr($prefix, $+[0]) = '';
    }
    return $prefix;
}
my @paths = qw(
    ./dir/fileA_header.txt
    ./dir/fileA_footer.txt
);
say common_prefix(@paths);
Output: ./dir/fileA_
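The question also asks how to remove the shared part from each string. A minimal sketch of that step, reusing the common_prefix routine from this answer; strip_common_prefix is a hypothetical helper name, and the same no-Unicode assumption applies:

```perl
use strict;
use warnings;

# Longest common prefix via string xor (same approach as the answer above).
sub common_prefix {
    my $prefix = shift;
    for my $str (@_) {
        ($prefix ^ $str) =~ /^\0*/;      # leading NUL bytes mark equal characters
        substr($prefix, $+[0]) = '';     # cut the prefix where the strings diverge
    }
    return $prefix;
}

# Hypothetical helper: return each string with the shared prefix removed.
sub strip_common_prefix {
    my @strings = @_;
    my $len = length( common_prefix(@strings) );
    return map { substr($_, $len) } @strings;
}

my @paths = ('./dir/fileA_header.txt', './dir/fileA_footer.txt');
print "$_\n" for strip_common_prefix(@paths);
```

For the two example paths this should leave header.txt and footer.txt.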


truncate string in perl into substring with trailing ellipsis

I'm trying to truncate a string in a select input option using Perl if it is longer than a set value, though I can't get it to work correctly.
my $value = defined $option->{value} ? $option->{value} : '';
my $maxValueLength = 50;
if ($value.length > $maxValueLength) {
$value = substr $value, 0, $maxValueLength + '...';
}
Another option is regex
$string =~ s/.{$maxLength}\K.*/.../;
It matches any character (.) a given number of times ({N}, where N is $maxLength here), that is, the first $maxLength characters of $string; then \K makes the regex "forget" everything matched so far, so that part won't be replaced. The rest of the string, matched by .*, is then replaced by ...
See Lookaround assertions in perlre for \K.
This does start the regex engine for a simple task but it doesn't need any conditionals -- if the string is shorter than the maximum length the regex won't match and nothing happens.
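A minimal sketch of the regex approach wrapped in a function. The name truncate_re is hypothetical, and two details differ from the one-liner above by my own choice: .+ instead of .* so that a string of exactly the maximum length is left alone, and the /s flag so that . also matches newlines. \K requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first $max characters and replace
# whatever follows (at least one character) with '...'.
sub truncate_re {
    my ($string, $max) = @_;
    $string =~ s/^.{$max}\K.+/.../s;
    return $string;
}

print truncate_re('abcdefghij', 5), "\n";   # longer than 5: truncated
print truncate_re('abcde', 5), "\n";        # exactly 5: unchanged
```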
Your code has several errors. Turn on use strict and use warnings if you haven't already, and then read the messages Perl reports. This is a bit tricky because of Perl's very complex syntax (see also Damian Conway's keynote from the 2020 Perl and Raku Conference), but it boils down to these:
Use of uninitialized value in concatenation (.) or string at line 7
Argument "..." isn't numeric in addition (+) at line 8
I've used the following adaptation of your code to produce these:
use strict;
use warnings;
my $value = '1234567890' x 10;
my $maxValueLength = 50;
if ( $value.length > $maxValueLength ) {
$value = substr $value, 0, $maxValueLength + '...';
}
print $value;
Now let's see what they mean.
The . operator in Perl is a concatenation. You cannot use it to call methods, and length is not a method on a string. Perl thinks you are using the built-in length (a function, not a method) without an argument, which makes it default to $_. Most built-ins do this, to make one-liners shorter. But $_ is not defined. Now the . tries to concatenate the length of undef to $value. And using undef in a string operation leads to this warning.
The correct way of doing this is length $value (or with parentheses if you prefer them, length($value)).
The + operator is not concatenation (we just learned that the . is). It's a numerical addition. Perl is pretty good at converting between strings and numbers as there aren't really any types, so saying 1 + "5" would give you 6 without problems, but it cannot do that for a couple of dots in a string. Hence it complains about a non-number value in an addition.
You want the substring with a given length, and then you want to attach the three dots. Because of operator precedence (or stickiness), you will need to use parentheses () for your substr call.
$value = substr($value, 0, $maxValueLength) . '...';
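Putting both fixes together, a minimal sketch of the whole operation as a function. The name truncate_with_ellipsis is hypothetical, and treating an undefined value as an empty string follows the defined check in the question:

```perl
use strict;
use warnings;

# Hypothetical helper combining the two corrections:
# length($text) instead of $text.length, and . instead of + for concatenation.
sub truncate_with_ellipsis {
    my ($text, $max) = @_;
    $text = '' unless defined $text;          # same default as in the question
    return $text if length($text) <= $max;    # short enough: leave untouched
    return substr($text, 0, $max) . '...';
}

my $value = truncate_with_ellipsis('1234567890' x 10, 50);
print "$value\n";
print length($value), "\n";   # 50 kept characters plus the three dots
```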
To find the length of a string, use length(STRING).
Here is a code snippet showing how you can modify the script.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
my $string = "abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz";
say "length of original string is:".length($string);
my $value = defined $string ? $string : '';
my $maxValueLength = 50;
if (length($value) > $maxValueLength) {
$value = substr $value, 0, $maxValueLength;
say "value:$value";
say "value's length:".length($value);
}
Output:
length of original string is:80
value:abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvw
value's length:50

Perl - How to create commands that users can input in console?

I'm just starting in Perl and I'm quite enjoying it. I'm writing some basic functions, but what I really want to be able to do is to use those functions intelligently using console commands. For example, say I have a function adding two numbers. I'd want to be able to type in console "add 2, 4" and read the first word, then pass the two numbers as parameters in an "add" function. Essentially, I'm asking for help in creating some basic scripting using Perl ^^'.
I have some vague ideas about how I might do this in VB, but Perl, I have no idea where I'd start, or what functions would be useful to me. Is there something like VB.net's "Split" function where you can break down the contents of a scalar into an array? Is there a simple way to analyse one word at a time in a scalar, or iterate through a scalar until you hit a separator, for example?
I hope you can help, any suggestions are appreciated! Bear in mind, I'm no expert, I started Perl all of a few weeks ago, and I've only been doing VB.net half a year.
Thank you!
Edit: If you're not sure what to suggest and you know any simple/intuitive resources that might be of help, that would also be appreciated.
It's rather easy to make a script which dispatches to a command by name. Here is a simple example:
#!/usr/bin/env perl
use strict;
use warnings;
# take the command name off the @ARGV stack
my $command_name = shift;
# get a reference to the subroutine by name
my $command = __PACKAGE__->can($command_name) || die "Unknown command: $command_name\n";
# execute the command, using the rest of @ARGV as arguments
# and print the return with a trailing newline
print $command->(@ARGV);
print "\n";
sub add {
    my ($x, $y) = @_;
    return $x + $y;
}
sub subtract {
    my ($x, $y) = @_;
    return $x - $y;
}
This script (say it's named myscript.pl) can be called like
$ ./myscript.pl add 2 3
or
$ ./myscript.pl subtract 2 3
Once you have played with that for a while, you might want to take it further and use a framework for this kind of thing. There are several available, like App::Cmd or you can take the logic shown above and modularize as you see fit.
You want to parse command line arguments. A space serves as the delimiter, so just run ./add.pl 2 3. Something like this:
$num1=$ARGV[0];
$num2=$ARGV[1];
print $num1 + $num2;
will print 5
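The question also asks about an equivalent of VB.net's Split. Perl's built-in split does this. Below is a minimal sketch that breaks a typed line such as "add 2, 4" into a command word and its arguments; the %commands table and the run_line name are hypothetical, and the example lines are hard-coded where a real script would read them from STDIN.

```perl
use strict;
use warnings;

# Hypothetical command table; the names mirror the subs in the answer above.
my %commands = (
    add      => sub { $_[0] + $_[1] },
    subtract => sub { $_[0] - $_[1] },
);

# Split one line of input into a command word and its arguments, then dispatch.
# Arguments may be separated by commas and/or whitespace.
sub run_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;                     # trim both ends
    my ($name, $rest) = split /\s+/, $line, 2;   # first word, then the remainder
    return "no command given" unless defined $name;
    my $code = $commands{$name}
        or return "unknown command: $name";
    my @args = split /\s*,\s*|\s+/, $rest // '';
    return $code->(@args);
}

print run_line($_), "\n" for 'add 2, 4', 'subtract 10 3', 'frobnicate 1';
```

Keeping the commands in a hash rather than looking subs up by name means user input can only reach the functions listed there.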
Here is a short implementation of a simple scripting language.
Each statement is exactly one line long, and has the following structure:
Statement = [<Var> =] <Command> [<Arg> ...]
# This is a regular grammar, so we don't need a complicated parser.
Tokens are separated by whitespace. A command may take any number of arguments. These can either be the contents of variables $var, a string "foo", or a number (int or float).
As these are Perl scalars, there is no visible difference between strings and numbers.
Here is the preamble of the script:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
strict and warnings are essential when learning Perl; without them, too much weird stuff would be possible. The use 5.010 sets a minimum version; it also enables the say builtin (like print, but it appends a newline).
Now we declare two global variables: The %env hash (table or dict) associates variable names with their values. %functions holds our builtin functions. The values are anonymous functions.
my %env;
my %functions = (
add => sub { $_[0] + $_[1] },
mul => sub { $_[0] * $_[1] },
say => sub { say $_[0] },
bye => sub { exit 0 },
);
Now comes our read-eval-loop (we don't print by default). The readline operator <> will read from the file specified as the first command line argument, or from STDIN if no filename is provided.
while (<>) {
    next if /^\s*\#/; # skip comment lines
    # parse the line. We get a destination $var, a $command, and any number of @args
    my ($var, $command, @args) = parse($_);
    # Execute the anonymous sub specified by $command with the @args
    my $value = $functions{ $command }->(@args);
    # Store the return value if a destination $var was specified
    $env{ $var } = $value if defined $var;
}
That was fairly trivial. Now comes some parsing code. Perl “binds” regexes to strings with the =~ operator. Regexes may look like /foo/ or m/foo/. The /x flag allows us to include whitespace in our regex that doesn't match actual whitespace. The /g flag matches globally. This also enables the \G assertion, which anchors a match at the position where the last successful match ended. The /c flag is important for this m//gc style of parsing: it lets us consume one match at a time and prevents the regex engine's position in our string from being reset when a match fails.
sub parse {
    my ($line) = @_; # get the $line, which is an argument
    my ($var, $command, @args); # declare variables to be filled
    # Test if this statement has a variable declaration
    if ($line =~ m/\G\s* \$(\w+) \s*=\s* /xgc) {
        $var = $1; # assign first capture if successful
    }
    # Parse the function of this statement.
    if ($line =~ m/\G\s* (\w+) \s*/xgc) {
        $command = $1;
        # Test if the specified function exists in our %functions
        if (not exists $functions{$command}) {
            die "The command $command is not known\n";
        }
    } else {
        die "Command required\n"; # Throw fatal exception on parse error.
    }
    # As long as our matches haven't consumed the whole string...
    while (pos($line) < length($line)) {
        # Try to match variables
        if ($line =~ m/\G \$(\w+) \s*/xgc) {
            die "The variable $1 does not exist\n" if not exists $env{$1};
            push @args, $env{$1};
        }
        # Try to match strings
        elsif ($line =~ m/\G "([^"]+)" \s*/xgc) {
            push @args, $1;
        }
        # Try to match ints or floats
        elsif ($line =~ m/\G (\d+ (?:\.\d+)? ) \s*/xgc) {
            push @args, 0+$1;
        }
        # Throw error if nothing matched
        else {
            die "Didn't understand that line\n";
        }
    }
    # return our -- now filled -- vars.
    return $var, $command, @args;
}
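The m//gc and \G technique used in parse can be seen in isolation in the following minimal sketch. The tokenize function and its NUM:/WORD: labels are hypothetical and not part of the little language above; the point is only that each successful match moves pos() forward, while a failed match leaves it where it was.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: walk through the string one token at a time.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    pos($line) = 0;    # start scanning at the beginning
    while (pos($line) < length($line)) {
        if    ($line =~ /\G \s+ /xgc)                { next }                      # skip whitespace
        elsif ($line =~ /\G ( \d+ (?:\.\d+)? ) /xgc) { push @tokens, "NUM:$1" }    # int or float
        elsif ($line =~ /\G ( \w+ ) /xgc)            { push @tokens, "WORD:$1" }   # bare word
        else  { die "Unexpected input at offset " . pos($line) . "\n" }
    }
    return @tokens;
}

print join(' ', tokenize('add 2 3.5')), "\n";
```

Without /c, the first failed alternative would reset pos() and the later branches would start matching from the beginning of the string again.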
Perl arrays can be handled like a linked list: shift removes and returns the first element (pop does the same to the last element). push adds an element to the end, unshift to the beginning.
Our little programming language can execute simple programs like:
#!my_little_language
$a = mul 2 20
$b = add 0 2
$answer = add $a $b
say $answer
bye
If (1) our perl script is saved in my_little_language, set to be executable, and is in the system PATH, and (2) the above file in our little language is saved as meaning_of_life.mll, and also set to be executable, then
$ ./meaning_of_life.mll
should be able to run it.
Output is obviously 42. Note that our language doesn't yet have string manipulation or simple assignment to variables. Also, it would be nice to be able to call functions with the return value of other functions directly. This requires some sort of parens, or precedence mechanism. Also, the language requires better error reporting for batch processing (which it already supports).

Perl: Greedy nature refuses to work

I am trying to replace a string with another string, but the greedy nature doesn't seem to be working for me. Below is my code where "PERFORM GET-APLCY" is identified and replaced properly, but string "PERFORM GET-APLCY-SOI-CVG-WVR" and many other such strings are being replaced by the replacement string for "PERFORM GET-APLCY".
s/PERFORM $func[$i]\.*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
where the full stop is optional during string match and replacement. I have also tried giving the pattern to be matched as $func[$i]\b
Please help me understand what the issue could be.
Thanks in advance,
Faez
Why should GET-APLCY- not match GET-APLCY., if the dot is optional?
Easy solution: sort your array by length in descending order.
@func = sort { length $b <=> length $a } @func;
Testing script:
#!/usr/bin/perl
use warnings;
use strict;
use feature 'say';
my %hash = ('GET-APLCY' => 'REP1',
'GET-APLCY-SOI-CVG-WVR' => 'REP2',
'GET-APLCY-SOI-MNG-CVRW' => 'REP3',
);
my @func = sort { length $b <=> length $a } keys %hash;
while (<DATA>) {
chomp;
print;
print "\t -> \t";
for my $i (0 .. $#func) {
s/$func[$i]/$hash{$func[$i]}/;
}
say;
}
__DATA__
GET-APLCY param
GET-APLCY- param
GET-APLCY. param
GET-APLCY-SOI. param
GET-APLCY-SOI-CVG-WVR param
GET-APLCY-SOI-MNG-CVRW param
You appear to be looping over function names, and calling s/// for each one. An alternative is to use the e option, and do them all in one go (without a loop):
my %hash = (
'GET-APLCY' => 'replacement 1',
'GET-APLCY-SOI-CVG-WVR' => 'replacement 2',
);
s{
PERFORM \s+ # 'PERFORM' keyword
([A-Z-]+) # the original function name
\.? # an optional period
}{
"# PERFORM $1.\n" . $hash{$1};
}xmsge;
The e causes the replacement part to be evaluated as an expression. Basically, the first part finds all PERFORM calls (I'm assuming that the function names are all upper case with '-' between them – adjust otherwise). The second part replaces that line with the text you want to appear.
I've also used the x, m, and s options, which is what allows the comments in the regular expression, among other things. You can find more about these under perldoc perlop.
A plain version of the s-line should be:
s/PERFORM ([A-Z-]+)\.?/"# PERFORM $1.\n" . $hash{$1}/eg;
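A variation on this idea, as a minimal sketch: build one alternation from the known names, longest first, so that only names present in the hash are replaced and $hash{$1} is never undefined. The rewrite function, the REP1/REP2 values, and the negative lookahead that rejects partial names are my own assumptions, not part of the answers here.

```perl
use strict;
use warnings;

my %hash = (
    'GET-APLCY'             => 'REP1',
    'GET-APLCY-SOI-CVG-WVR' => 'REP2',
);

# Longest names first, each escaped, joined into one alternation.
my $names = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } keys %hash;

# (?![\w-]) rejects a match that stops in the middle of a longer name.
my $re = qr/PERFORM \s+ ($names) (?![\w-]) \.?/x;

# Hypothetical helper: comment out the PERFORM call and append its replacement.
sub rewrite {
    my ($line) = @_;
    $line =~ s/$re/"# PERFORM $1.\n" . $hash{$1}/ge;
    return $line;
}

print rewrite('PERFORM GET-APLCY-SOI-CVG-WVR.'), "\n";
print rewrite('PERFORM GET-APLCY'), "\n";
print rewrite('PERFORM GET-APLCY-SOI.'), "\n";   # unknown name: left unchanged
```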
I guess that $func[$i] contains "GET-APLCY". If so, this is because the star only applies to the dot, an actual dot, not "any character". Try
s/PERFORM $func[$i].*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
I'm pretty sure you are trying to do some kind of loop over $i. In that case, GET-APLCY is most likely located in the @func array before GET-APLCY-SOI-CVG-WVR, so I recommend reverse sorting @func before entering the loop.

foreach loop with a condition perl?

Is it possible to have a foreach loop with a condition in Perl?
I'm having to do a lot of character by character processing - where a foreach loop is very convenient. Note that I cannot use some libraries since this is for school.
I could write a for loop using substr with a condition if necessary, but I'd like to avoid that!
You should show us some code, including the sort of thing you would like to do.
In general, character-by-character processing of a string would be done in Perl by writing
for my $ch (split //, $string) { ... }
or, if it is more convenient
my @chars = split //, $string;
for (@chars) { ... }
or
my $i = 0;
while ($i < @chars) { my $char = $chars[$i++]; ... }
and the latter form can support multiple expressions in the while condition. Perl is very rich in different ways to do similar things, and without knowing more about your problem it is impossible to say which is best for you.
Edit
It is important to note that none of these methods allow the original string to be modified. If that is the intention then you must use s///, tr/// or substr.
Note that substr has a fourth parameter that will replace the specified part of the original string. Note also that it can act as an lvalue and so take an assignment. In other words
substr $string, 0, 1, 'X';
can be written equivalently as
substr($string, 0, 1) = 'X';
If split is used to convert a string into a list of characters (actually one-character strings) then it can be modified in this state and recombined into a string using join. For instance
my @chars = split //, $string;
$chars[0] = 'X';
$string = join '', @chars;
does a similar thing to the above code using substr.
For example:
foreach my $l (@something) {
last if (condition);
# ...
}
will exit the loop if condition is true
You might investigate the next and last directives. More info in perldoc perlsyn.
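A minimal sketch that combines these pieces for character-by-character processing, using only core Perl. The chars_before function is a hypothetical example: it collects non-whitespace characters until it meets a stop character.

```perl
use strict;
use warnings;

# Hypothetical example: gather characters until $stop is seen.
sub chars_before {
    my ($string, $stop) = @_;
    my @seen;
    for my $ch (split //, $string) {
        last if $ch eq $stop;     # condition to leave the loop early
        next if $ch =~ /\s/;      # condition to skip this iteration
        push @seen, $ch;
    }
    return join '', @seen;
}

print chars_before('ab cd;ef', ';'), "\n";
```

Here last plays the role of the loop condition, and next skips characters that should not be processed.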

How do I remove a list of character sequences from the beginning of a string in Perl?

I have to read lines from a file and store them into a hash in Perl. Many of these lines have special character sequences at the beginning that I need to remove before storing. These character sequences are
| || ### ## ##||
For example, if it is ||https://ads, I need to get https://ads; if ###http, I need to get http.
I need to exclude these character sequences. I want to do this by having all the character sequences to exclude in a array and then check if the line starts with these character sequences and remove those. What is a good way to do this?
I've gone as far as:
our $ad_file = "C:/test/list.txt";
our %ads_list_hash = ();
my $lines = 0;
# List of lines to ignore
my @strip_characters = qw /| || ### ## ##||/;
# Create a list of substrings in the easylist.txt file
open my $ADS, '<', $ad_file or die "can't open $ad_file";
while(<$ADS>) {
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}
close $ADS;
I need to add the logic to remove the @strip_characters from the beginning of each line if any of them are present.
Probably a bit too complex and general for the task, but still..
my $strip = join "|", map {quotemeta} @strip_characters;
# avoid bare [] etc. in the RE
# ... later, in the while()
s/^(?:$strip)+//o;
# /o means "compile $strip into the regex once and for all"
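A minimal, self-contained sketch of this approach. The strip_prefix name and the sample lines are hypothetical; the sequences are written as a plain list because a # inside qw triggers a "possible attempt to put comments" warning; the alternatives are sorted longest first as a precaution; and qr// is used in place of /o to compile the pattern once.

```perl
use strict;
use warnings;

# The sequences from the question, as a plain list.
my @strip_characters = ('|', '||', '###', '##', '##||');

# Escape each sequence and try longer ones before shorter ones.
my $strip = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @strip_characters;
my $strip_re = qr/^(?:$strip)+/;

# Hypothetical helper: remove any run of the sequences from the start of a line.
sub strip_prefix {
    my ($line) = @_;
    $line =~ s/$strip_re//;
    return $line;
}

print strip_prefix($_), "\n" for '||https://ads', '###http', 'plain text';
```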
Why don't you do it with a regex? Something like
$line =~ s/^[## |]+//;
should work.
If you want to remove a list of characters (according to your title), then a very simple regular expression will work.
Within the loop, add the following regular expression
while( <$ADS> ) {
chomp;
s/^[## \|]+//;
$ads_list_hash{$lines++} = $_;
}
Note that the pipe character ('|') is escaped.
However, it appears that you want to remove a list of expressions. You can do the following
while( <$ADS> ) {
chomp;
s/^((\|)|(\|\|)|(###)|(##)|(##\|\|))+//;
$ads_list_hash{$lines++} = $_;
}
You said that the list of expressions is stored in an array of words. In your sample code, you create this array with 'qw'. If the list of expressions isn't known at compile time, you can build a regular expression in a variable, and use it. The expressions must be escaped with quotemeta, since characters such as | are special in a regex.
my @strip_expression = ...; # get an array of strip expressions
my $re = '^((' . join(')|(', map { quotemeta } @strip_expression) . '))+';
and then, use the following statement in the loop:
s/$re//;
Finally, one thing unrelated to the question can be said about the code: it would be much more appropriate to use an array instead of a hash to map consecutive integers to strings. Unless you have some other requirement, better have:
our @ads_list; # no need to initialize the array (or the hash) with an empty list
...
while( <$ADS> ) {
chomp;
s/.../;
push @ads_list, $_;
}
$ads_list_hash{$lines} = $_;
$lines ++;
Don't do that. If you want an array, use an array:
push @ads_lines, $_;
Shawn's Rule of Programming #7: When creating data structures: if preserving the order is important, use an array; otherwise use a hash.
Because substitutions return whether or not they did anything, you can use a substitution to search the string for your pattern and remove it if it's there.
while( <$ADS> ) {
next unless s/^\s*(?:[#]{2,3}|(?:##)?[|]{1,2})\s*//;
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}