What is the right regex to match a relative path to an image file? - perl

I have this path ../../Capture.jpg. So far I've figured out this incomplete regex: '[../]+'. I want to check if user puts in the right path like ../../image file name. The file extensions can be jpg, png, ..

your [../]+ is not sufficient or correct for the job at hand, if you REALLY want to match a bunch of ../ at the start of a filename.
It's not completely clear what you want to do exactly, but the following will match one or more ../ at the start of a string:
/^((?:\.\.\/)+)/
basically:
^ to anchor to the start of the string being tested - will not match any ../ INSIDE the string
( and the balancing ) at the end: capture the contents within. All your ../../ will be available in a variable called $1
then I'm using (?: ) to wrap the next content. This groups the bit inside, but does NOT save the value inside a $1, $2, etc. More information soon...
The REAL pattern of interest is
\.\.\/
Since . and / are magic characters, they need 'escaping' with backslash. This tells Perl that the . and / do NOT have a special meaning at this point.
I've used the (?: ) wrapper to group them together, so that the + operates on all 3 characters of interest. The + operator means "one or more repetitions".
So, my pattern will match one or more repetitions of ../ which are anchored to the start of the string. Furthermore, the exact contents matched will be available in $1 if you are interested in doing something with that (eg count how many ../ you have)
Please ask if you have further questions, or I have misunderstood your goals.
EDIT: to suit your new requirements, and add a bit of bonus:
m!^\.\./\.\./(([^/]+)\.([^.]+))$!
Note first that I've used m!pattern! instead of /pattern/. Firstly, if Perl sees /pattern/ it assumes it's m/pattern/ but you can use an alternative character to wrap the patterns. This is useful if you actually want to use / in your pattern without having to go nuts with backslashes.
so:
^ exactly match only from the start
followed by exactly ../../
next I've used ( ) wrappers to capture the bits following. Explanation after...
ignoring the ( and ) now:
[^/]+ one or more repetitions (+) of any character that isn't /
. literally a dot - the one before the extension
[^./]+ one or more repetitions of any character that isn't . or /
Notice how the [^/]+ allows for any character including . but prevents another directory part from sneaking in. Thus, the filename could be foo.bar.jpg and it will be collected properly.
Notice how [^./]+ allows for any character in the extension except a dot - and also excluding / to prevent another directory segment from sneaking in.
Finally, $ is used to ensure we've reached the end of the pattern.
as for the captures:
$1 will contain all of foo.bar.jpg
$2 will contain foo.bar
$3 will contain jpg (not .jpg) but I'll leave it up to you to figure out what to change if you wish to capture the dot as well.
FINALLY - in a typical script, you might do something like:
if($filename =~ m!^\.\./\.\./(([^/]+)\.([^./]+))$!) {
print "You correctly entered ../../$1 giving basename=$2 and extension=$3 - Bravo!\n";
}
else {
print "you've failed to read the instructions properly\n";
}
As a bonus, I even tested that, and found 2 spolling mistaiks you'll never have to see
cheers.

# convert relative file paths to md links ...
# file paths and names with letters , nums - and _ s supported
$str =~ s! (\.\.\/([a-zA-Z0-9_\-\/\\]*)[\/\\]([a-zA-Z0-9_\-]*)\.([a-zA-Z0-9]*)) ! [$3]($1) !gm

If you don't care the path prefix, use:
$path =~ /\.(jpg|png)$/
or
substr($path, -4) ~~ ['.jpg', '.png']
With exactly '../../', use:
$path =~ m!^\.\./\.\./[^/]*\.(jpg|png)$!
With any number of '../'s, use:
$path =~ m!^(\.\./)*[^/]*\.(jpg|png)$!

Related

match string pattern by certain characters but exclude combinations of those characters

I have the following sample string:
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"'
I am trying to use RegEx (PowerShell, .NET flavor) to extract the filenames hello-world.txt, bye1.txt, and foo_bar.txt.
The real use case could have any number of -D parameters, and the -f <filenames> argument could appear in any position between these other parameters. I can't easily use something like split to extract it as the delimiter positioning could change, so I thought RegEx might be a good proposition here.
My attempt is something like this in PowerShell (can be opened on any Windows system and copy pasted into it):
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"' -replace '^.* -f ([a-zA-Z0-9_.\s-]+).*$','$1'
Desired output:
hello-world.txt bye1.txt foo_bar.txt
My problem is that I either only take hello-world.txt, or I get hello-world.txt all the way to the end of the string or next = symbol (as in the example above).
I am having trouble expressing that \s is allowed, since I need to capture multiple space-delimited filenames, but that the combination of \s-[a-zA-Z] is not allowed, as that indicates the start of the next argument.

Search for a match, after the match is found take the number after the match and add 4 to it, it is posible in perl?

I am a beginer in perl and I need to modify a txt file by keeping all the previous data in it and only modify the file by adding 4 to every number related to a specific tag (< COMPRESSED-SIZE >). The file have many lines and tags and looks like below, I need to find all the < COMPRESSED-SIZE > tags and add 4 to the number specified near the tag:
< SOURCE-START-ADDRESS >01< /SOURCE-START-ADDRESS >
< COMPRESSED-SIZE >132219< /COMPRESSED-SIZE >
< UNCOMPRESSED-SIZE >229376< /UNCOMPRESSED-SIZE >
So I guess I need to do something like: search for the keyword(match) and store the number 132219 in a variable and add the second number (4) to it, replace the result 132219 with 132223, the rest of the file must remain unchanged, only the numbers related to this tag must change. I cannot search for the number instead of the tag because the number could change while the tag will remain always the same. I also need to find all the tags with this name and replace the numbers near them by adding 4 to them. I already have the code for finding something after a keyword, because I needed to search also for another tag, but this script does something else, adds a number in front of a keyword. I think I could use this code for what i need, but I do not know how to make the calculation and keep the rest of the file intact or if it is posible in perl.
while (my $row = <$inputFileHandler>)
{
if(index($row,$Data_Pattern) != -1){
my $extract = substr($row, index($row,$Data_Pattern) + length($Data_Pattern), length($row));
my $counter_insert = sprintf "%08d", $counter;
my $spaces = " " x index($row,$Data_Pattern);
$data_to_send ="what i need to add" . $extract;
print {$outs} $spaces . $Data_Pattern . $data_to_send;
$counter = $counter + 1;
}
else
{
print {$outs} $row;
next;
}
}
Maybe you could help me with a block of code for my needs, $Data_Pattern is the match. Thank you very much!
This is a classic one-liner Perl task. Basically you would do something like
$ perl -i.bak -pe's/^< COMPRESSED-SIZE >\K(\d+)/$1 + 4/e' yourfile.txt
Which will in essence copy and replace your file with a new, edited file. This can be very dangerous, especially if you are a Perl newbie. The -i switch is here used with the .bak extension which saves a backup in yourfile.txt.bak. This does not make this operation safe, however, as running the command twice will overwrite the backup.
It is advisable to make a separate backup of the target file before using this command.
-i.bak edit "in-place", the file is overwritten, a backup of the original is created with extension .bak.
-p argument is treated as a file name, which is read, and printed back.
s/ // the substitution operator, which is applied to all lines of the file.
^ inside the regex looks for beginning of line.
\K keep the match that is to the left.
(\d+) capture () 1 or more digits \d+ and store them in $1
/e treat the right hand side of the substitution operator as an expression and use the result as the replacement string. In this case it will increase your number and return the sum.
The long version of this command is
while (<>) {
s/^< COMPRESSED-SIZE >\K(\d+)/$1 + 4/e
}
Which can be placed in a file and run with the -i switch.

Why are ##, #!, #, etc. not interpolated in strings?

First, please note that I ask this question out of curiosity, and I'm aware that using variable names like ## is probably not a good idea.
When using doubles quotes (or qq operator), scalars and arrays are interpolated :
$v = 5;
say "$v"; # prints: 5
$# = 6;
say "$#"; # prints: 6
#a = (1,2);
say "#a"; # prints: 1 2
Yet, with array names of the form #+special char like ##, #!, #,, #%, #; etc, the array isn't interpolated :
#; = (1,2);
say "#;"; # prints nothing
say #; ; # prints: 1 2
So here is my question : does anyone knows why such arrays aren't interpolated? Is it documented anywhere?
I couldn't find any information or documentation about that. There are too many articles/posts on google (or SO) about the basics of interpolation, so maybe the answer was just hidden in one of them, or at the 10th page of results..
If you wonder why I could need variable names like those :
The -n (and -p for that matter) flag adds a semicolon ; at the end of the code (I'm not sure it works on every version of perl though). So I can make this program perl -nE 'push#a,1;say"#a"}{say#a' shorter by doing instead perl -nE 'push#;,1;say"#;"}{say#', because that last ; convert say# to say#;. Well, actually I can't do that because #; isn't interpolated in double quotes. It won't be useful every day of course, but in some golfing challenges, why not!
It can be useful to obfuscate some code. (whether obfuscation is useful or not is another debate!)
Unfortunately I can't tell you why, but this restriction comes from code in toke.c that goes back to perl 5.000 (1994!). My best guess is that it's because Perl doesn't use any built-in array punctuation variables (except for #- and #+, added in 5.6 (2000)).
The code in S_scan_const only interprets # as the start of an array if the following character is
a word character (e.g. #x, #_, #1), or
a : (e.g. #::foo), or
a ' (e.g. #'foo (this is the old syntax for ::)), or
a { (e.g. #{foo}), or
a $ (e.g. #$foo), or
a + or - (the arrays #+ and #-), but not in regexes.
As you can see, the only punctuation arrays that are supported are #- and #+, and even then not inside a regex. Initially no punctuation arrays were supported; #- and #+ were special-cased in 2000. (The exception in regex patterns was added to make /[\c#-\c_]/ work; it used to interpolate #- first.)
There is a workaround: Because #{ is treated as the start of an array variable, the syntax "#{;}" works (but that doesn't help your golf code because it makes the code longer).
Perl's documentation says that the result is "not strictly predictable".
The following, from perldoc perlop (Perl 5.22.1), refers to interpolation of scalars. I presume it applies equally to arrays.
Note also that the interpolation code needs to make a decision on
where the interpolated scalar ends. For instance, whether
"a $x -> {c}" really means:
"a " . $x . " -> {c}";
or:
"a " . $x -> {c};
Most of the time, the longest possible text that does not include
spaces between components and which contains matching braces or
brackets. because the outcome may be determined by voting based on
heuristic estimators, the result is not strictly predictable.
Fortunately, it's usually correct for ambiguous cases.
Some things are just because "Larry coded it that way". Or as I used to say in class, "It works the way you think, provided you think like Larry thinks", sometimes adding "and it's my job to teach you how Larry thinks."

Perl $1 giving uninitialized value error

I am trying to extract a part of a string and put it into a new variable. The string I am looking at is:
maker-scaffold_26653|ref0016423-snap-gene-0.1
(inside a $gene_name variable)
and the thing I want to match is:
scaffold_26653|ref0016423
I'm using the following piece of code:
my $gene_name;
my $scaffold_name;
if ($gene_name =~ m/scaffold_[0-9]+\|ref[0-9]+/) {
$scaffold_name = $1;
print "$scaffold_name\n";
}
I'm getting the following error when trying to execute:
Use of uninitialized value $scaffold_name in concatenation (.) or string
I know that the pattern is right, because if I use $' instead of $1 I get
-snap-gene-0.1
I'm at a bit of a loss: why will $1 not work here?
If you want to use a value from the matching you have to make () arround the character in regex
To expand on Jens' answer, () in a regex signifies an anonymous capture group. The content matched in a capture group is stored in $1-9+ from left to right, so for example,
/(..):(..):(..)/
on an HH:MM:SS time string will store hours, minutes, and seconds in $1, $2, $3 respectively. Naturally this begins to become unwieldy and is not self-documenting, so you can assign the results to a list instead:
my ($hours, $mins, $secs) = $time =~ m/(..):(..):(..)/;
So your example could bypass the use of $ variables by doing direct assignment:
my ($scaffold_name) = $gene_name =~ m/(scaffold_[0-9]+[|]ref[0-9]+)/;
# $scaffold_name now contains 'scaffold_26653|ref0016423'
You can even get rid of the ugly =~ binding by using for as a topicalizer:
my $scaffold_name;
for ($gene_name) {
($scaffold_name) = m/(scaffold_\d+[|]ref\d+)/;
print $scaffold_name;
}
If things start to get more complex, I prefer to use named capture groups (introduced in Perl v5.10.0):
$gene_name =~ m{
(?<scaffold_name> # ?<name> creates a named capture group
scaffold_\d+? # 'scaffold' and its trailing digits
[|] # Literal pipe symbol
ref\d+ # 'ref' and its trailing digits
)
}xms; # The x flag lets us write more readable regexes
print $+{scaffold_name}, "\n";
The results of named capture groups are stored in the magic hash %+. Access is done just like any other hash lookup, with the capture groups as the keys. %+ is locally scoped in the same way the $ are, so it can be used as a drop-in replacement for them in most situations.
It's overkill for this particular example, but as regexes start to get larger and more complicated, this saves you the trouble of either having to scroll all the way back up and count anonymous capture groups from left to right to find which of those darn $ variables is holding the capture you wanted, or scan across a long list assignment to find where to add a new variable to hold a capture that got inserted in the middle.
My personal rule of thumb is to assign the results of anonymous captured to descriptively named lexically scoped variables for 3 or less captures, then switch to using named captures, comments, and indentation in regexes when more are necessary.

How does this Perl one liner to check if a directory is empty work?

I got this strange line of code today, it tells me 'empty' or 'not empty' depending on whether the CWD has any items (other than . and ..) in it.
I want to know how it works because it makes no sense to me.
perl -le 'print+(q=not =)[2==(()=<.* *>)].empty'
The bit I am interested in is <.* *>. I don't understand how it gets the names of all the files in the directory.
It's a golfed one-liner. The -e flag means to execute the rest of the command line as the program. The -l enables automatic line-end processing.
The <.* *> portion is a glob containing two patterns to expand: .* and *.
This portion
(q=not =)
is a list containing a single value -- the string "not". The q=...= is an alternate string delimiter, apparently used because the single-quote is being used to quote the one-liner.
The [...] portion is the subscript into that list. The value of the subscript will be either 0 (the value "not ") or 1 (nothing, which prints as the empty string) depending on the result of this comparison:
2 == (()=<.* *>)
There's a lot happening here. The comparison tests whether or not the glob returned a list of exactly two items (assumed to be . and ..) but how it does that is tricky. The inner parentheses denote an empty list. Assigning to this list puts the glob in list context so that it returns all the files in the directory. (In scalar context it would behave like an iterator and return only one at a time.) The assignment itself is evaluated in scalar context (being on the right hand side of the comparison) and therefore returns the number of elements assigned.
The leading + is to prevent Perl from parsing the list as arguments to print. The trailing .empty concatenates the string "empty" to whatever came out of the list (i.e. either "not " or the empty string).
<.* *>
is a glob consisting of two patterns: .* are all file names that start with . and * corresponds to all files (this is different than the usual DOS/Windows conventions).
(()=<.* *>)
evaluates the glob in list context, returning all the file names that match.
Then, the comparison with 2 puts it into scalar context so 2 is compared to the number of files returned. If that number is 2, then the only directory entries are . and .., period. ;-)
<.* *> means (glob(".*"), glob("*")). glob expands file patterns the same way the shell does.
I find that the B::Deparse module helps quite a bit in deciphering some stuff that throws off most programmers' eyes, such as the q=...= construct:
$ perl -MO=Deparse,-p,-q,-sC 2>/dev/null << EOF
> print+(q=not =)[2==(()=<.* *>)].empty
> EOF
use File::Glob ();
print((('not ')[(2 == (() = glob('.* *')))] . 'empty'));
Of course, this doesn't instantly produce "readable" code, but it surely converts some of the stumbling blocks.
The documentation for that feature is here. (Scroll near the end of the section)